# Reinforcement tuning

SeekrFlow supports reinforcement tuning using group relative policy optimization (GRPO). Reinforcement tuning trains a model to generate higher-quality outputs by scoring candidates against reference answers using one or more graders.
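
The "group relative" part of GRPO can be illustrated with a small sketch: several completions are sampled for the same prompt, each is scored, and each score is compared with the mean and spread of its own group. The function name and sample rewards below are illustrative, and the population standard deviation is an assumption; this is not SeekrFlow's internal implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    # eps avoids division by zero when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions sampled for one prompt, scored by a grader (made-up values)
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that score above their group's mean receive a positive advantage and are reinforced; those below the mean receive a negative one.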

To train a model with reinforcement tuning, follow the same process as standard fine-tuning with a few modifications.

First, make sure each example in your dataset has a `reference_answer` field containing the correct answer. Also include a system prompt that instructs the model to wrap its reasoning in `<think>` tags and its final answer in `<answer>` tags:

<CodeGroup>
  ```python Python  theme={null}
  {
      "messages":[
          {
              "role": "system",
              "content": "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>"
          },
          {
              "role": "user",
              "content": "Find the smallest positive $a$ such that $a$ is a multiple of $4$ and $a$ is a multiple of $14.$"
          }
      ],
      "reference_answer": "28"
  }
  ```
</CodeGroup>
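
As a rough sketch, records in this shape can be built and sanity-checked with the standard library before upload. The helper names `make_record` and `validate_record` are hypothetical, the short system prompt is a stand-in for the full prompt above, and serializing one JSON object per line (JSONL) is an assumption about the upload format.

```python
import json


def make_record(system_prompt, question, reference_answer):
    """Build one training example in the shape shown above."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        # Stored as a string, matching the example record
        "reference_answer": str(reference_answer),
    }


def validate_record(record):
    """Return a list of problems; an empty list means the record looks usable."""
    problems = []
    if not str(record.get("reference_answer", "")).strip():
        problems.append("missing reference_answer")
    roles = [m.get("role") for m in record.get("messages", [])]
    if "system" not in roles:
        problems.append("missing system prompt")
    if "user" not in roles:
        problems.append("missing user message")
    return problems


record = make_record(
    "Wrap reasoning in <think> </think> and the final answer in <answer> </answer>.",
    "What is 6 * 7?",
    42,
)
line = json.dumps(record)  # one JSON object per line, assuming a JSONL file
print(validate_record(record), line)
```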

Upload this dataset with the purpose `reinforcement-fine-tune`. See [Upload file](/flow/reference/file_upload_v1_flow_files_put) for the full schema reference.

Next, set `fine_tune_type` to `FineTuneType.REINFORCEMENT` and define a reward function using `reward_components` in your `TrainingConfig`:

<CodeGroup>
  ```python Python expandable theme={null}
  from seekrai import SeekrFlow
  from seekrai.types import TrainingConfig, InfrastructureConfig
  from seekrai.types.finetune import (
      FineTuneType, Grader, GraderType, RewardComponents,
      StringOperation, TextSimilarityOperation
  )

  client = SeekrFlow()

  training_config = TrainingConfig(
      training_files=['<your-reinforcement-fine-tuning-file-id>'],
      model="meta-llama/Llama-3.2-1B",
      n_epochs=1,
      n_checkpoints=1,
      batch_size=4,
      learning_rate=1e-6,  # a lower learning rate is typical for reinforcement tuning
      experiment_name="helperbot_grpo_v1",
      fine_tune_type=FineTuneType.REINFORCEMENT,
      reward_components=RewardComponents(
          graders=[Grader(type=GraderType.MATH_ACCURACY)]
      )
  )
  ```
</CodeGroup>

Create the fine-tuning job using the standard workflow. See [Create a fine-tuning job](/flow/sdk/fine-tuning/create-fine-tuning-job) for the full process.

<Tip>
  LoRA can be used with reinforcement tuning to reduce memory requirements. See [LoRA](/flow/sdk/fine-tuning/lora) for configuration details.
</Tip>

## Reward functions

A reward function defines how model outputs are scored during training. In SeekrFlow, reward functions are built from one or more **graders** — individual scoring operations that each evaluate a specific quality of the output.

### Grader types

| Type               | Enum value                   | Description                                                                             | Operations                                                      |
| ------------------ | ---------------------------- | --------------------------------------------------------------------------------------- | --------------------------------------------------------------- |
| Numerical accuracy | `GraderType.MATH_ACCURACY`   | Returns `1` if the model output is numerically equal to the reference answer, else `0`. | None                                                            |
| String check       | `GraderType.STRING_CHECK`    | Returns `1` if the model output matches the reference based on the selected operation.  | `equals`, `not_equals`, `contains`, `case_insensitive_contains` |
| Text similarity    | `GraderType.TEXT_SIMILARITY` | Returns a similarity score between the model output and the reference answer.           | `bleu`, `rouge`                                                 |
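
To make the table concrete, here is a plain-Python approximation of what the numerical accuracy and string check graders compute. These functions are hypothetical stand-ins, not the SeekrFlow implementation. In particular, the direction of `contains` (output contains reference), the `0` returned on a failed string check, and the numeric tolerance are assumptions.

```python
def string_check(output, reference, operation):
    """Return 1 if output matches reference under the operation, else 0 (assumed)."""
    checks = {
        "equals": output == reference,
        "not_equals": output != reference,
        "contains": reference in output,
        "case_insensitive_contains": reference.lower() in output.lower(),
    }
    return int(checks[operation])


def math_accuracy(output, reference, tol=1e-9):
    """Return 1 if both values parse as numbers and are equal within tol, else 0."""
    try:
        return int(abs(float(output) - float(reference)) <= tol)
    except ValueError:
        # Non-numeric text cannot be numerically equal to the reference
        return 0


print(string_check("The answer is 28", "28", "contains"))
print(math_accuracy("28.0", "28"))
```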

### Create graders

<CodeGroup>
  ```python Python  theme={null}
  # Numerical accuracy — no operation needed
  math_grader = Grader(type=GraderType.MATH_ACCURACY)

  # String check — requires an operation
  keyword_grader = Grader(
      type=GraderType.STRING_CHECK,
      operation=StringOperation.CONTAINS
  )

  # Text similarity — requires an operation
  similarity_grader = Grader(
      type=GraderType.TEXT_SIMILARITY,
      operation=TextSimilarityOperation.BLEU
  )
  ```
</CodeGroup>

### Combine graders with weights

Assign each grader a `weight`, expressed as a fraction (for example, `0.4` for 40%), to combine multiple graders into a single reward function. The grader weights must sum to `1.0`. If no weights are provided, graders are weighted equally.

<CodeGroup>
  ```python Python  theme={null}
  reward = RewardComponents(
      graders=[
          Grader(type=GraderType.MATH_ACCURACY, weight=0.4),
          Grader(type=GraderType.STRING_CHECK, weight=0.3, operation=StringOperation.EQUALS),
          Grader(type=GraderType.TEXT_SIMILARITY, weight=0.3, operation=TextSimilarityOperation.ROUGE)
      ]
  )
  ```
</CodeGroup>
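
The arithmetic behind a weighted reward can be sketched as a weighted sum of grader scores. The `combine` helper and the sample scores are hypothetical, and the sketch leaves out the format reward covered in the next section; it only shows the sum-to-`1.0` rule and the equal-weight default.

```python
import math


def combine(scores, weights=None):
    """Weighted sum of grader scores; equal weights when none are given."""
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("one weight is required per grader score")
    if not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
        raise ValueError("grader weights must sum to 1.0")
    return sum(w * s for w, s in zip(weights, scores))


# Made-up scores from three graders: math accuracy, string check, text similarity
scores = [1.0, 0.0, 0.5]
reward = combine(scores, [0.4, 0.3, 0.3])  # 0.4*1.0 + 0.3*0.0 + 0.3*0.5
print(reward)
```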

### Format reward weight

By default, 10% of the reward score is based on whether the model uses the correct output format (`<think>` and `<answer>` tags). You can adjust this with `format_reward_weight`:

<CodeGroup>
  ```python Python  theme={null}
  reward = RewardComponents(
      format_reward_weight=0.2,
      graders=[
          Grader(type=GraderType.MATH_ACCURACY, weight=0.3),
          Grader(type=GraderType.TEXT_SIMILARITY, weight=0.5, operation=TextSimilarityOperation.BLEU)
      ]
  )
  ```
</CodeGroup>

<Note>
  When `format_reward_weight` is set explicitly, the sum of all weights (format + graders) must equal `1.0`.
</Note>
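
As a worked example of the weights above, the sketch below checks an output for the `<think>` and `<answer>` tags and folds that check into the total reward. The regular expression, the helper names, and the linear combination are all assumptions made for illustration; the source does not specify how SeekrFlow validates the format or combines the scores internally.

```python
import math
import re

# Assumed format: a <think> block followed by an <answer> block, nothing else
FORMAT_PATTERN = re.compile(
    r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", re.DOTALL
)


def format_score(output):
    """Return 1.0 if the output follows the tag format, else 0.0."""
    return 1.0 if FORMAT_PATTERN.fullmatch(output) else 0.0


def total_reward(output, grader_scores, grader_weights, format_reward_weight):
    """Combine the format check and grader scores; all weights must sum to 1.0."""
    total_weight = format_reward_weight + sum(grader_weights)
    if not math.isclose(total_weight, 1.0, abs_tol=1e-9):
        raise ValueError("format and grader weights must sum to 1.0")
    graded = sum(w * s for w, s in zip(grader_weights, grader_scores))
    return format_reward_weight * format_score(output) + graded


tagged = "<think>lcm(4, 14) = 28</think><answer>28</answer>"
# Made-up grader scores: math accuracy 1.0, BLEU 0.6
with_tags = total_reward(tagged, [1.0, 0.6], [0.3, 0.5], 0.2)
without_tags = total_reward("28", [1.0, 0.6], [0.3, 0.5], 0.2)
print(with_tags, without_tags)
```

With these weights, a correctly tagged output earns the full 0.2 format share on top of the graded 0.6, while an untagged output with the same grader scores earns only the graded portion.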
