GRPO Fine-Tuning
Fine-tune a reasoning model with GRPO
Seekr also supports training reasoning models with reinforcement fine-tuning using Group Relative Policy Optimization (GRPO). With GRPO, we train a model to reason through a problem and output the correct answer. This is effective for domains that have definite answers that can be verified as correct.
To train a model with GRPO, we follow much the same process as standard fine-tuning, with just a few modifications.
First, we need to make sure our dataset has a reference_answer field containing the correct answer for each problem we give the model. We also include a system prompt instructing the model to use our reasoning format. So our data points will look like this:
{
    "messages": [
        {
            "role": "system",
            "content": "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>"
        },
        {
            "role": "user",
            "content": "Find the smallest positive $a$ such that $a$ is a multiple of $4$ and $a$ is a multiple of $14.$"
        }
    ],
    "reference_answer": "28"
}
We upload this dataset with the purpose reinforcement-fine-tune.
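For example, the upload might look like the following. This is a minimal sketch: the file name grpo_math_train.jsonl is a placeholder, and the exact files.upload signature may differ in your version of the seekrai SDK.

from seekrai import SeekrFlow

client = SeekrFlow()  # assumes your API key is configured, e.g. via an environment variable

# Upload the GRPO dataset; the purpose string tells the platform how the file will be used
uploaded_file = client.files.upload(
    file="grpo_math_train.jsonl",       # placeholder path to your dataset
    purpose="reinforcement-fine-tune",
)
print(uploaded_file.id)  # use this file ID in training_files below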
When instantiating TrainingConfig, add the fine_tune_type parameter.
from seekrai.types import TrainingConfig  # adjust this import path if your SDK version differs
from seekrai.types.finetune import FineTuneType

training_config = TrainingConfig(
    training_files=['<your-reinforcement-fine-tuning-file-id>'],
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    n_epochs=1,
    n_checkpoints=1,
    batch_size=4,
    learning_rate=1e-6,  # we typically use a lower learning rate for reinforcement fine-tuning
    experiment_name="helperbot_grpo_v1",
    fine_tune_type=FineTuneType.GRPO
)
We can now create the fine-tuning job using the client.
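A minimal sketch of creating the job, assuming the client exposes a fine_tuning.create(...) method that accepts the training config; consult the SDK reference for the exact parameters your version expects.

# Kick off the GRPO fine-tuning job with the config defined above
fine_tune = client.fine_tuning.create(training_config=training_config)
print(fine_tune.id)  # keep the job ID to poll status or retrieve checkpoints later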
Reinforcement learning trains the model with a reward signal rather than a supervised loss.
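To make the idea of a verifiable reward concrete, here is an illustrative reward function for the format above, not the platform's actual implementation: it extracts the text inside the <answer> tags and compares it to reference_answer.

import re

def answer_reward(completion: str, reference_answer: str) -> float:
    """Illustrative reward: 1.0 if the text inside <answer>...</answer> matches the reference, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0  # no answer tag found; no reward
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

# Example: a completion for the sample problem above
completion = "<think>lcm(4, 14) = 28</think><answer>28</answer>"
print(answer_reward(completion, "28"))  # 1.0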