Reinforcement tuning

SeekrFlow supports reinforcement tuning using group relative policy optimization (GRPO). Reinforcement tuning trains a model to generate higher-quality outputs by scoring candidates against reference answers using one or more graders.

To train a model with reinforcement tuning, follow the same process as standard fine-tuning with a few modifications.

First, ensure your dataset has a reference_answer field containing the correct answer for each problem. Include a system prompt instructing the model to use the reasoning format:

{
    "messages":[
        {
            "role": "system",
            "content": "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>"
        },
        {
            "role": "user",
            "content": "Find the smallest positive $a$ such that $a$ is a multiple of $4$ and $a$ is a multiple of $14.$"
        }
    ],
    "reference_answer": "28"
}
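Training records like the one above are typically stored one JSON object per line (JSONL). A minimal sketch using only the standard library — the filename is illustrative, and the system prompt is abbreviated here; see Prepare and ingest files for the formats SeekrFlow accepts:

```python
import json

# One training record: the conversation plus the ground-truth reference answer.
record = {
    "messages": [
        {
            "role": "system",
            "content": "A conversation between User and Assistant. ...",  # full reasoning-format prompt as shown above
        },
        {
            "role": "user",
            "content": "Find the smallest positive $a$ such that $a$ is a multiple of $4$ and $a$ is a multiple of $14.$",
        },
    ],
    "reference_answer": "28",
}

# JSONL: one JSON object per line.
with open("grpo_train.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")
```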

Upload this dataset with the purpose reinforcement-fine-tune. See Prepare and ingest files for details.

Set fine_tune_type to FineTuneType.GRPO and define a reward function using reward_components in your TrainingConfig:

from seekrai import SeekrFlow
from seekrai.types import TrainingConfig, InfrastructureConfig
from seekrai.types.finetune import (
    FineTuneType, Grader, GraderType, RewardComponents,
    StringOperation, TextSimilarityOperation
)

client = SeekrFlow()

training_config = TrainingConfig(
    training_files=['<your-reinforcement-fine-tuning-file-id>'],
    model="meta-llama/Llama-3.2-1B",
    n_epochs=1,
    n_checkpoints=1,
    batch_size=4,
    learning_rate=1e-6, # lower learning rate is typical for reinforcement tuning
    experiment_name="helperbot_grpo_v1",
    fine_tune_type=FineTuneType.GRPO,
    reward_components=RewardComponents(
        graders=[Grader(type=GraderType.MATH_ACCURACY)]
    )
)

Create the fine-tuning job using the standard workflow. See Create a fine-tuning job for the full process.

ℹ️ Note: LoRA can be used with reinforcement tuning to reduce memory requirements. See LoRA for configuration details.

Reward functions

A reward function defines how model outputs are scored during training. In SeekrFlow, reward functions are built from one or more graders — individual scoring operations that each evaluate a specific quality of the output.

Grader types

| Type | Enum value | Description | Operations |
|---|---|---|---|
| Numerical accuracy | GraderType.MATH_ACCURACY | Returns 1 if the model output is numerically equal to the reference answer, else 0. | None |
| String check | GraderType.STRING_CHECK | Returns 1 if the model output matches the reference based on the selected operation. | equals, not_equals, contains, case_insensitive_contains |
| Text similarity | GraderType.TEXT_SIMILARITY | Returns a similarity score between the model output and the reference answer. | bleu, rouge |
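The scoring behavior described above can be illustrated with a small standalone sketch. This is an illustration of the semantics, not SeekrFlow's internal implementation, and the function names are invented; the text-similarity graders are omitted because BLEU/ROUGE require additional libraries:

```python
def math_accuracy(output: str, reference: str) -> float:
    """Return 1.0 if output and reference are numerically equal, else 0.0."""
    try:
        return 1.0 if float(output) == float(reference) else 0.0
    except ValueError:
        return 0.0


def string_check(output: str, reference: str, operation: str) -> float:
    """Return 1.0 if output matches reference under the given operation."""
    checks = {
        "equals": output == reference,
        "not_equals": output != reference,
        "contains": reference in output,
        "case_insensitive_contains": reference.lower() in output.lower(),
    }
    return 1.0 if checks[operation] else 0.0
```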

Create graders

# Numerical accuracy — no operation needed
math_grader = Grader(type=GraderType.MATH_ACCURACY)

# String check — requires an operation
keyword_grader = Grader(
    type=GraderType.STRING_CHECK,
    operation=StringOperation.CONTAINS
)

# Text similarity — requires an operation
similarity_grader = Grader(
    type=GraderType.TEXT_SIMILARITY,
    operation=TextSimilarityOperation.BLEU
)

Combine graders with weights

Assign a weight to each grader to combine multiple graders into a single reward function. Weights must sum to 1.0. If no weights are provided, graders are weighted equally.

reward = RewardComponents(
    graders=[
        Grader(type=GraderType.MATH_ACCURACY, weight=0.4),
        Grader(type=GraderType.STRING_CHECK, weight=0.3, operation=StringOperation.EQUALS),
        Grader(type=GraderType.TEXT_SIMILARITY, weight=0.3, operation=TextSimilarityOperation.ROUGE)
    ]
)
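The combined reward is a weighted sum of the individual grader scores. A minimal sketch of that arithmetic, using made-up grader scores for one hypothetical completion:

```python
# Hypothetical per-grader scores for one sampled completion.
scores = {"math_accuracy": 1.0, "string_check": 0.0, "text_similarity": 0.6}
# Weights from the RewardComponents example above (sum to 1.0).
weights = {"math_accuracy": 0.4, "string_check": 0.3, "text_similarity": 0.3}

# Each grader contributes score * weight to the final reward.
reward = sum(weights[name] * scores[name] for name in scores)  # ≈ 0.58
```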

Format reward weight

By default, 10% of the reward score is based on whether the model uses the correct output format (<think> and <answer> tags). You can adjust this with format_reward_weight:

reward = RewardComponents(
    format_reward_weight=0.2,
    graders=[
        Grader(type=GraderType.MATH_ACCURACY, weight=0.3),
        Grader(type=GraderType.TEXT_SIMILARITY, weight=0.5, operation=TextSimilarityOperation.BLEU)
    ]
)
ℹ️ Note: When format_reward_weight is set explicitly, the sum of all weights (format + graders) must equal 1.0.
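This constraint is easy to check before submitting a job. A minimal sketch, using the weights from the example above:

```python
import math

format_reward_weight = 0.2
grader_weights = [0.3, 0.5]  # weights from the RewardComponents example above

total = format_reward_weight + sum(grader_weights)
# Compare floats with a tolerance rather than exact equality.
assert math.isclose(total, 1.0), f"weights sum to {total}, expected 1.0"
```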