Reinforcement tuning (GRPO)

Reinforcement tuning trains models to generate higher-quality responses by scoring candidate outputs and reinforcing preferred behaviors. SeekrFlow implements reinforcement tuning using group relative policy optimization (GRPO). For implementation details including dataset format and code examples, see the Reinforcement tuning SDK guide.

UI support: Reinforcement tuning is available in the UI with mathematical accuracy as a reward function. User-defined reward functions are not currently supported in the UI but are available through the API and SDK.

How it works

Reinforcement tuning repeats a three-stage reinforcement learning loop:

Generation: The model generates multiple candidate responses to each prompt.

Evaluation: A reward function scores each candidate based on quality criteria.

Optimization: Training gradients make responses that score above the average of their candidate group more likely, and responses that score below it less likely.

Reward functions use one or more graders to score outputs on specific qualities like numerical accuracy, keyword matching, or text similarity. The model learns to maximize expected reward across diverse prompts.
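The "group relative" part of GRPO can be illustrated with a short sketch. It assumes the commonly described formulation in which each candidate's reward is normalized against the mean and standard deviation of the rewards in its own group. The function name, the `eps` parameter, and the sample rewards are hypothetical, this is not SeekrFlow's internal implementation, and real implementations may differ in details such as which standard deviation they use.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group of candidates.

    Candidates that beat the group average get a positive advantage,
    candidates below it get a negative one. `eps` (an assumption of this
    sketch) avoids division by zero when every candidate scores the same.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical candidates for a single prompt, already scored
# by a reward function (1.0 = correct, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # roughly [1, -1, -1, 1]
```

When every candidate in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason diverse prompts and a capable base model matter.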

When to use reinforcement tuning

Reinforcement tuning is a good fit for:

Improving overall response quality beyond what demonstration data captures
Aligning model outputs with subjective preferences or style guidelines
Reducing unwanted behaviors (verbosity, hedging, unsafe content)
Optimizing for measurable quality metrics (accuracy, user satisfaction)
Teaching models to balance multiple competing objectives

Training requirements

Effective reinforcement tuning requires:

Reward function: Clear criteria for evaluating response quality through one or more graders
Diverse prompts: Training examples covering the range of scenarios where quality preferences apply
Reference answers: Expected answers that graders compare the model's generated outputs against when scoring them
Base model capability: Strong starting model that can generate reasonable candidates before reinforcement tuning begins

Reward functions

A reward function scores model outputs during training to reinforce desired behaviors. In SeekrFlow, reward functions are built from one or more graders — individual scoring operations that each evaluate a specific quality of the output.

Available graders

Grader	Description
Numerical accuracy	Evaluates whether the model output is numerically correct. Useful for financial auditing, math, and other tasks with definite numerical answers.
String check	Evaluates whether the model output contains or matches a specific phrase. Useful for enforcing keyword usage or required terminology.
Text similarity	Evaluates how lexically similar the model output is to the reference answer. Useful for enforcing tone, style, or compliant language.
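To make the three grader types concrete, here is a minimal sketch of what each one might compute. These functions are hypothetical stand-ins written with the Python standard library, not the SeekrFlow graders themselves. The function names, the 0.0 to 1.0 score range, the rule of checking the last number in the output, and the use of `difflib` for lexical similarity are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def numerical_accuracy(output: str, reference: str, tol: float = 1e-6) -> float:
    """Return 1.0 if the last number in the output equals the reference value.

    Assumes `reference` is a plain numeric string such as "1250.5".
    """
    # Drop thousands separators, then collect integers and decimals.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", output.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if abs(float(numbers[-1]) - float(reference)) <= tol else 0.0


def string_check(output: str, phrase: str) -> float:
    """Return 1.0 if the output contains the phrase (case-insensitive)."""
    return 1.0 if phrase.lower() in output.lower() else 0.0


def text_similarity(output: str, reference: str) -> float:
    """Return a lexical similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, output, reference).ratio()


print(numerical_accuracy("The total is 1,250.50.", "1250.5"))
print(string_check("Please consult a licensed advisor.", "licensed advisor"))
print(text_similarity("Payment is due in 30 days.", "Payment is due within 30 days."))
```

The first two graders are binary, while text similarity returns a graded score, so a partially matching answer still earns some reward.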

Combining graders

You can combine multiple graders into a single reward function by assigning weight percentages to each. This creates a hybrid reward signal tailored to your use case. For example, you might weight numerical accuracy at 80% and text similarity at 20% to reward mathematically correct answers first while still favoring wording close to the reference answer. For implementation details and code examples, see the Reinforcement tuning SDK guide.
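The 80/20 example can be sketched as a weighted sum of grader scores. This is an illustration only: `combine_graders`, `is_correct`, and `similarity` are hypothetical names, the graders are simplified stand-ins, and the sketch assumes that weights are fractions summing to 1.0 and that each grader returns a score between 0.0 and 1.0. The SDK's actual interface is described in the Reinforcement tuning SDK guide.

```python
from difflib import SequenceMatcher
from typing import Callable, Sequence, Tuple

# A grader takes (model output, reference answer) and returns a score.
Grader = Callable[[str, str], float]


def combine_graders(weighted: Sequence[Tuple[Grader, float]]) -> Grader:
    """Build one reward function from (grader, weight) pairs."""
    total = sum(weight for _, weight in weighted)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("grader weights must sum to 1.0")

    def reward(output: str, reference: str) -> float:
        # Weighted sum of the individual grader scores.
        return sum(weight * grader(output, reference) for grader, weight in weighted)

    return reward


def is_correct(output: str, reference: str) -> float:
    """Simplified accuracy grader: exact match after trimming whitespace."""
    return 1.0 if output.strip() == reference.strip() else 0.0


def similarity(output: str, reference: str) -> float:
    """Simplified text similarity grader based on difflib."""
    return SequenceMatcher(None, output, reference).ratio()


# 80% accuracy, 20% text similarity, as in the example above.
reward_fn = combine_graders([(is_correct, 0.8), (similarity, 0.2)])
score = reward_fn("42", "42")
print(score)
```

With this weighting, a wrong answer can earn at most 0.2 from similarity alone, so correctness dominates the reward signal.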

Comparison with other methods

Reinforcement tuning improves a model by reinforcing outputs that score well under a reward function. Unlike instruction fine-tuning, it doesn’t require explicit demonstrations of correct responses. It typically follows instruction fine-tuning in training pipelines: the instruction phase teaches domain knowledge, while reinforcement tuning refines response quality and alignment.

Reinforcement tuning can be combined with LoRA for efficient training. The adapter learns the rewarded behaviors while the base model provides general capabilities, reducing compute requirements without sacrificing quality improvements.

Model deployment

Reinforcement-tuned models are deployed as standard model endpoints. The reinforced model behaviors are embedded in model parameters, so no special inference infrastructure is required. The model generates responses that reflect the learned quality preferences.
