Reinforcement tuning
SeekrFlow supports reinforcement tuning using group relative policy optimization (GRPO). Reinforcement tuning trains a model to generate higher-quality outputs by scoring candidates against reference answers using one or more graders.
To train a model with reinforcement tuning, follow the same process as standard fine-tuning with a few modifications.
First, ensure your dataset has a reference_answer field containing the correct answer for each problem. Include a system prompt instructing the model to use the reasoning format:
```json
{
  "messages": [
    {
      "role": "system",
      "content": "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>"
    },
    {
      "role": "user",
      "content": "Find the smallest positive $a$ such that $a$ is a multiple of $4$ and $a$ is a multiple of $14.$"
    }
  ],
  "reference_answer": "28"
}
```
Upload this dataset with the purpose reinforcement-fine-tune. See Prepare and ingest files for details.
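Before uploading, it can help to sanity-check each JSONL record locally. The helper below is an illustrative sketch, not part of the seekrai SDK: it verifies that a record carries the reference_answer field and includes both a system and a user message.

```python
import json

REQUIRED_ROLES = {"system", "user"}

def validate_record(line: str) -> bool:
    """Check one JSONL record for the fields reinforcement tuning expects.

    Illustrative pre-upload check only; the SeekrFlow service performs
    its own validation at ingestion time.
    """
    record = json.loads(line)
    # Each problem needs a reference answer to grade against.
    if "reference_answer" not in record:
        return False
    # Each record needs the reasoning-format system prompt and a user question.
    roles = {m.get("role") for m in record.get("messages", [])}
    return REQUIRED_ROLES.issubset(roles)

record = json.dumps({
    "messages": [
        {"role": "system", "content": "... reasoning-format instructions ..."},
        {"role": "user", "content": "Find the smallest positive $a$ ..."},
    ],
    "reference_answer": "28",
})
print(validate_record(record))  # True
```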
Set fine_tune_type and define a reward function using reward_components in your TrainingConfig:
```python
from seekrai import SeekrFlow
from seekrai.types import TrainingConfig, InfrastructureConfig
from seekrai.types.finetune import (
    FineTuneType, Grader, GraderType, RewardComponents,
    StringOperation, TextSimilarityOperation
)

client = SeekrFlow()

training_config = TrainingConfig(
    training_files=['<your-reinforcement-fine-tuning-file-id>'],
    model="meta-llama/Llama-3.2-1B",
    n_epochs=1,
    n_checkpoints=1,
    batch_size=4,
    learning_rate=1e-6,  # a lower learning rate is typical for reinforcement tuning
    experiment_name="helperbot_grpo_v1",
    fine_tune_type=FineTuneType.GRPO,
    reward_components=RewardComponents(
        graders=[Grader(type=GraderType.MATH_ACCURACY)]
    )
)
```
Create the fine-tuning job using the standard workflow. See Create a fine-tuning job for the full process.
Note: LoRA can be used with reinforcement tuning to reduce memory requirements. See LoRA for configuration details.
Reward functions
A reward function defines how model outputs are scored during training. In SeekrFlow, reward functions are built from one or more graders — individual scoring operations that each evaluate a specific quality of the output.
Grader types
| Type | Enum value | Description | Operations |
|---|---|---|---|
| Numerical accuracy | GraderType.MATH_ACCURACY | Returns 1 if the model output is numerically equal to the reference answer, else 0. | None |
| String check | GraderType.STRING_CHECK | Returns 1 if the model output matches the reference based on the selected operation. | equals, not_equals, contains, case_insensitive_contains |
| Text similarity | GraderType.TEXT_SIMILARITY | Returns a similarity score between the model output and the reference answer. | bleu, rouge |
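To make the string-check row concrete, here is an illustrative plain-Python re-implementation of its four operations. This is a sketch of the documented semantics only; the actual grading runs inside SeekrFlow during training.

```python
def string_check(output: str, reference: str, operation: str) -> int:
    """Return 1 if the output matches the reference under the given operation, else 0."""
    ops = {
        "equals": lambda o, r: o == r,
        "not_equals": lambda o, r: o != r,
        "contains": lambda o, r: r in o,
        "case_insensitive_contains": lambda o, r: r.lower() in o.lower(),
    }
    return int(ops[operation](output, reference))

print(string_check("The answer is 28.", "28", "contains"))  # 1
print(string_check("The ANSWER", "answer", "case_insensitive_contains"))  # 1
print(string_check("27", "28", "equals"))  # 0
```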
Create graders
```python
# Numerical accuracy — no operation needed
math_grader = Grader(type=GraderType.MATH_ACCURACY)

# String check — requires an operation
keyword_grader = Grader(
    type=GraderType.STRING_CHECK,
    operation=StringOperation.CONTAINS
)

# Text similarity — requires an operation
similarity_grader = Grader(
    type=GraderType.TEXT_SIMILARITY,
    operation=TextSimilarityOperation.BLEU
)
```
Combine graders with weights
Assign weight percentages to combine multiple graders into a single reward function. Weights must sum to 1.0. If no weights are provided, graders are weighted equally.
```python
reward = RewardComponents(
    graders=[
        Grader(type=GraderType.MATH_ACCURACY, weight=0.4),
        Grader(type=GraderType.STRING_CHECK, weight=0.3, operation=StringOperation.EQUALS),
        Grader(type=GraderType.TEXT_SIMILARITY, weight=0.3, operation=TextSimilarityOperation.ROUGE)
    ]
)
```
Format reward weight
By default, 10% of the reward score is based on whether the model uses the correct output format (<think> and <answer> tags). You can adjust this with format_reward_weight:
```python
reward = RewardComponents(
    format_reward_weight=0.2,
    graders=[
        Grader(type=GraderType.MATH_ACCURACY, weight=0.3),
        Grader(type=GraderType.TEXT_SIMILARITY, weight=0.5, operation=TextSimilarityOperation.BLEU)
    ]
)
```
Note: When format_reward_weight is set explicitly, the sum of all weights (format + graders) must equal 1.0.
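As a sketch, assuming each grader returns a score in [0, 1], the total reward for a configuration like the one above could be computed as a weighted sum of the format score and the grader scores. The real computation happens inside SeekrFlow; this only illustrates the weighting arithmetic.

```python
def combined_reward(format_score, grader_scores, format_weight, grader_weights):
    """Weighted sum of format score and grader scores; weights must sum to 1.0."""
    total_weight = format_weight + sum(grader_weights)
    assert abs(total_weight - 1.0) < 1e-9, "format + grader weights must sum to 1.0"
    return format_weight * format_score + sum(
        w * s for w, s in zip(grader_weights, grader_scores)
    )

# format_reward_weight=0.2 with grader weights 0.3 and 0.5, as above:
# a correctly formatted answer (1.0) that is numerically right (1.0)
# but only partially similar to the reference (0.6)
print(combined_reward(1.0, [1.0, 0.6], 0.2, [0.3, 0.5]))  # 0.8
```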