Reinforcement tuning (GRPO)
Uses reward functions to reinforce models toward higher-quality responses.
Reinforcement tuning trains models to generate higher-quality responses by scoring candidate outputs and reinforcing preferred behaviors. SeekrFlow implements reinforcement tuning using group relative policy optimization (GRPO).
For implementation details including dataset format and code examples, see the Reinforcement tuning SDK guide.
UI support: Reinforcement tuning is available in the UI with mathematical accuracy as a reward function. User-defined reward functions are not currently supported in the UI but are available through the API and SDK.
How it works
Reinforcement tuning operates through a reinforcement learning process:
- Generation: The model generates multiple candidate responses to each prompt.
- Evaluation: A reward function scores each candidate based on quality criteria.
- Optimization: Training gradients push the model toward generating higher-scoring responses.
Reward functions use one or more graders to score outputs on specific qualities like numerical accuracy, keyword matching, or text similarity. The model learns to maximize expected reward across diverse prompts.
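The "group relative" part of GRPO refers to how candidate scores are turned into a learning signal: each candidate's reward is normalized against the mean and spread of the other candidates generated for the same prompt. The sketch below illustrates that normalization step in plain Python; it is a conceptual illustration, not SeekrFlow's implementation.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each candidate's reward against
    the mean and standard deviation of its own candidate group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-spread group
    return [(r - mean) / std for r in rewards]

# One prompt, four candidate responses scored by a reward function:
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
# Candidates above the group mean get positive advantages and are
# reinforced; below-mean candidates are pushed down.
```

Because advantages are computed relative to the group rather than an absolute scale, the reward function only needs to rank candidates consistently, not produce calibrated scores.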
When to use reinforcement tuning
Reinforcement tuning provides value when:
- Improving overall response quality beyond what demonstration data captures
- Aligning model outputs with subjective preferences or style guidelines
- Reducing unwanted behaviors (verbosity, hedging, unsafe content)
- Optimizing for measurable quality metrics (accuracy, user satisfaction)
- Teaching models to balance multiple competing objectives
Training requirements
Effective reinforcement tuning requires:
- Reward function: Clear criteria for evaluating response quality through one or more graders
- Diverse prompts: Training examples covering the range of scenarios where quality preferences apply
- Reference answers: Ground-truth answers against which the reward function scores the model's generated outputs
- Base model capability: Strong starting model that can generate reasonable candidates before preference optimization
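To make these requirements concrete, a training example pairs a prompt with a reference answer the reward function can score against. The record shape below is illustrative only; the field names are hypothetical, not the official SeekrFlow dataset schema (see the Reinforcement tuning SDK guide for the real format).

```python
# Hypothetical record shape -- field names are illustrative, not the
# exact SeekrFlow dataset schema (see the SDK guide for that):
training_examples = [
    {"prompt": "What is 15% of 240?", "reference_answer": "36"},
    {"prompt": "A bond pays 5% annually on $1,000 principal. "
               "What is the yearly interest?", "reference_answer": "50"},
]
```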
Reward functions
A reward function scores model outputs during training to reinforce desired behaviors. In SeekrFlow, reward functions are built from one or more graders — individual scoring operations that each evaluate a specific quality of the output.
Available graders
| Grader | Description |
|---|---|
| Numerical accuracy | Evaluates whether the model output is numerically correct. Useful for financial auditing, math, and other tasks with definite numerical answers. |
| String check | Evaluates whether the model output contains or matches a specific phrase. Useful for enforcing keyword usage or required terminology. |
| Text similarity | Evaluates how lexically similar the model output is to the reference answer. Useful for enforcing tone, style, or compliant language. |
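The three graders can be pictured as simple scoring functions that map a model output (and a reference) to a number. The sketches below are minimal stand-ins written for illustration, assuming a score in [0, 1]; they are not SeekrFlow's built-in implementations.

```python
import difflib

def numerical_accuracy(output: str, reference: str) -> float:
    """1.0 if the output parses to the same number as the reference."""
    try:
        return float(float(output.strip()) == float(reference.strip()))
    except ValueError:
        return 0.0

def string_check(output: str, phrase: str) -> float:
    """1.0 if the required phrase appears in the output (case-insensitive)."""
    return float(phrase.lower() in output.lower())

def text_similarity(output: str, reference: str) -> float:
    """Lexical similarity in [0, 1] via a character sequence ratio."""
    return difflib.SequenceMatcher(None, output, reference).ratio()
```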
Combining graders
You can combine multiple graders into a single reward function by assigning weight percentages to each. This creates a hybrid reward signal tailored to your use case. For example, you might weight numerical accuracy at 80% and text similarity at 20% to reward both the mathematical correctness and formatting of the model's answers.
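The 80/20 weighting described above amounts to a weighted sum of grader scores. A minimal sketch, using toy stand-in graders rather than SeekrFlow's built-ins:

```python
def combined_reward(output, reference, weighted_graders):
    """Weighted sum of individual grader scores (weights should sum to 1.0)."""
    return sum(weight * grader(output, reference)
               for grader, weight in weighted_graders)

# Toy stand-ins for the built-in graders, for illustration only:
numeric = lambda out, ref: float(out.strip() == ref.strip())   # exact-match check
similarity = lambda out, ref: 1.0 if out == ref else 0.5       # crude lexical score

score = combined_reward("36", "36", [(numeric, 0.8), (similarity, 0.2)])
```

A response that is numerically wrong but well-formatted still earns partial credit from the similarity term, which is the point of blending graders rather than using a single pass/fail check.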
For implementation details and code examples, see the Reinforcement tuning SDK guide.
Comparison with other methods
Reinforcement tuning learns quality patterns through reinforcement of high-scoring outputs. Unlike instruction fine-tuning, it doesn't require explicit demonstrations of correct responses. Reinforcement tuning typically follows instruction fine-tuning in training pipelines — the instruction phase teaches domain knowledge, while reinforcement tuning refines response quality and alignment.
Reinforcement tuning can be combined with LoRA for efficient training. The adapter learns preference patterns while the base model provides general capabilities, reducing compute requirements without sacrificing quality improvements.
Model deployment
Reinforcement-tuned models are deployed as standard model endpoints. The reinforced model behaviors are embedded in model parameters, so no special inference infrastructure is required. The model generates responses that reflect the learned quality preferences.
