SeekrFlow supports reinforcement tuning using group relative policy optimization (GRPO). Reinforcement tuning trains a model to generate higher-quality outputs by scoring candidates against reference answers using one or more graders.

To train a model with reinforcement tuning, follow the same process as standard fine-tuning with a few modifications.

First, ensure your dataset has a reference_answer field containing the correct answer for each problem. Include a system prompt instructing the model to use the reasoning format:
{
  "messages": [
    {
      "role": "system",
      "content": "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>"
    },
    {
      "role": "user",
      "content": "Find the smallest positive $a$ such that $a$ is a multiple of $4$ and $a$ is a multiple of $14.$"
    }
  ],
  "reference_answer": "28"
}
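Before uploading, it can help to check each record locally. The following is a minimal sketch in plain Python, not part of the SeekrFlow SDK: the helper name validate_record and the specific checks are assumptions based only on the requirements described above (a reference_answer field and a system prompt that describes the <think>/<answer> format).

```python
from typing import List

# Tags the system prompt is expected to mention (taken from the example above).
REQUIRED_TAGS = ("<think>", "</think>", "<answer>", "</answer>")


def validate_record(record: dict) -> List[str]:
    """Return a list of problems found in one training record (empty if none).

    Hypothetical helper for local checks; it does not call SeekrFlow.
    """
    problems = []

    # Every record needs a non-empty reference answer for the graders.
    answer = record.get("reference_answer")
    if answer is None or str(answer).strip() == "":
        problems.append("missing or empty reference_answer")

    messages = record.get("messages") or []

    # The system prompt should describe the reasoning format.
    system = [m for m in messages if m.get("role") == "system"]
    if not system:
        problems.append("no system message")
    elif not all(tag in system[0].get("content", "") for tag in REQUIRED_TAGS):
        problems.append("system prompt does not describe the <think>/<answer> format")

    # There must be a question for the model to answer.
    if not any(m.get("role") == "user" for m in messages):
        problems.append("no user message")

    return problems


record = {
    "messages": [
        {"role": "system", "content": "Use <think> </think> and <answer> </answer> tags."},
        {"role": "user", "content": "Find the least common multiple of 4 and 14."},
    ],
    "reference_answer": "28",
}
print(validate_record(record))
```

An empty list means the record passed these local checks. The server-side schema may enforce more than this sketch does.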
Upload this dataset with the purpose reinforcement-fine-tune. See Upload file for the full schema reference.

Set fine_tune_type and define a reward function using reward_components in your TrainingConfig.
A reward function defines how model outputs are scored during training. In SeekrFlow, reward functions are built from one or more graders — individual scoring operations that each evaluate a specific quality of the output.
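To make the idea of a grader concrete, here is a minimal sketch of one possible scoring operation: an exact-match check of the text inside the <answer> tags against the reference answer. It is written in plain Python for illustration only. The function names are hypothetical, and the sketch is not a description of how SeekrFlow's built-in graders are implemented.

```python
import re
from typing import Optional

# Captures whatever sits between the first <answer> ... </answer> pair.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(output: str) -> Optional[str]:
    """Return the stripped text inside <answer> tags, or None if absent."""
    match = ANSWER_RE.search(output)
    return match.group(1).strip() if match else None


def exact_match_grader(output: str, reference_answer: str) -> float:
    """Hypothetical grader: 1.0 if the extracted answer equals the reference, else 0.0."""
    answer = extract_answer(output)
    if answer is None:
        return 0.0
    return 1.0 if answer == reference_answer.strip() else 0.0


candidate = "<think>lcm(4, 14) = 28</think><answer>28</answer>"
print(exact_match_grader(candidate, "28"))
```

A grader in this sense is a function from a model output and a reference answer to a score. Other graders could measure different qualities of the same output.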
Assign a weight to each grader to combine multiple graders into a single reward function. Weights must sum to 1.0. If no weights are provided, graders are weighted equally.
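The weighting rule can be sketched as a weighted sum of per-grader scores. The code below is an illustration in plain Python under that assumption: combine_scores and the grader names are hypothetical, and only the two rules stated above (weights sum to 1.0, equal weighting by default) come from the documentation.

```python
import math
from typing import Dict, Optional


def combine_scores(
    scores: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Combine per-grader scores into one reward as a weighted sum.

    Hypothetical helper; the grader names used as keys are placeholders.
    """
    if not scores:
        raise ValueError("at least one grader score is required")

    # No weights given: weight every grader equally.
    if weights is None:
        weights = {name: 1.0 / len(scores) for name in scores}

    if set(weights) != set(scores):
        raise ValueError("weights and scores must name the same graders")

    # Weights must sum to 1.0 (allowing for floating-point rounding).
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1.0")

    return sum(weights[name] * scores[name] for name in scores)


scores = {"grader_a": 1.0, "grader_b": 0.5}
print(combine_scores(scores))                                       # equal weights
print(combine_scores(scores, {"grader_a": 0.75, "grader_b": 0.25}))  # explicit weights
```

Validating the sum up front surfaces a misconfigured weight set before any training time is spent.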
By default, 10% of the reward score is based on whether the model uses the correct output format (<think> and <answer> tags). You can adjust this with format_reward_weight.
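One way to picture the format share is as a linear blend between a format check and the grader-based reward. This is an assumption made for illustration: the text above states only that 10% of the score depends on format by default, not the exact formula. The regular expression and the function names below are likewise hypothetical.

```python
import re

# Assumed format: a <think> block followed by an <answer> block, nothing else.
FORMAT_RE = re.compile(
    r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*",
    re.DOTALL,
)


def format_score(output: str) -> float:
    """Return 1.0 if the whole output follows the expected tag layout, else 0.0."""
    return 1.0 if FORMAT_RE.fullmatch(output) else 0.0


def total_reward(
    output: str,
    grader_reward: float,
    format_reward_weight: float = 0.1,
) -> float:
    """Blend the format score with the grader reward (assumed linear blend)."""
    if not 0.0 <= format_reward_weight <= 1.0:
        raise ValueError("format_reward_weight must be between 0.0 and 1.0")
    return (
        format_reward_weight * format_score(output)
        + (1.0 - format_reward_weight) * grader_reward
    )


well_formed = "<think>lcm(4, 14) = 28</think><answer>28</answer>"
print(total_reward(well_formed, grader_reward=1.0))
print(total_reward("28", grader_reward=1.0))
```

Under this assumed blend, a correct answer without the tags loses only the format share of the reward, and a larger format_reward_weight makes formatting mistakes more costly.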