UI support
Reinforcement tuning is available in the UI with mathematical accuracy as a reward function. User-defined reward functions are not currently supported in the UI but are available through the API and SDK.
How it works
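Reinforcement tuning operates through a reinforcement learning process: the model generates outputs for training prompts, a reward function scores each output, and training reinforces the behaviors that score well.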
Reward functions use one or more graders to score outputs on specific qualities like numerical accuracy, keyword matching, or text similarity. The model learns to maximize expected reward across diverse prompts.
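One way to write down "maximize expected reward" is sketched below. The notation is an assumption introduced for illustration, not taken from SeekrFlow: $\pi_{\theta}$ is the model with parameters $\theta$, $\mathcal{D}$ is the set of training prompts, $y^{*}_{x}$ is the reference answer for prompt $x$, $g_k$ are the graders, and $w_k$ are non-negative weights. The weighted sum is only one plausible way to combine several graders into a single reward.

$$
\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)} \big[ R(y, y^{*}_{x}) \big],
\qquad
R(y, y^{*}_{x}) = \sum_{k=1}^{K} w_k \, g_k(y, y^{*}_{x})
$$

With a single grader ($K = 1$, $w_1 = 1$), the reward reduces to that grader's score.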
When to use reinforcement tuning
Reinforcement tuning provides value when:
- Improving overall response quality beyond what demonstration data captures
- Aligning model outputs with subjective preferences or style guidelines
- Reducing unwanted behaviors (verbosity, hedging, unsafe content)
- Optimizing for measurable quality metrics (accuracy, user satisfaction)
- Teaching models to balance multiple competing objectives
Training requirements
Effective reinforcement tuning requires:
- Reward function: Clear criteria for evaluating response quality through one or more graders
- Diverse prompts: Training examples covering the range of scenarios where quality preferences apply
- Reference answers: Answers against which the graders score the model's generated outputs
- Base model capability: Strong starting model that can generate reasonable candidates before preference optimization
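
The prompt and reference-answer requirements above can be checked with a quick pre-flight pass over the training data. The sketch below is illustrative only: `TrainingExample` and `find_problems` are hypothetical names, and the record layout is an assumption, not the SeekrFlow dataset format. It flags empty prompts, missing reference answers, and prompts that repeat after case and whitespace normalization (a rough proxy for low diversity).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TrainingExample:
    """Hypothetical record: one prompt paired with its reference answer."""
    prompt: str
    reference_answer: str


def find_problems(examples: List[TrainingExample]) -> List[str]:
    """Return human-readable issues found in a list of training examples."""
    problems: List[str] = []
    seen = set()
    for i, ex in enumerate(examples):
        if not ex.prompt.strip():
            problems.append(f"example {i}: empty prompt")
        if not ex.reference_answer.strip():
            problems.append(f"example {i}: missing reference answer")
        # Normalize case and whitespace so near-identical prompts collide.
        key = " ".join(ex.prompt.lower().split())
        if key in seen:
            problems.append(f"example {i}: duplicate prompt")
        seen.add(key)
    return problems


examples = [
    TrainingExample("What is 2 + 2?", "4"),
    TrainingExample("what is  2 + 2?", "4"),
    TrainingExample("Sum the integers from 1 to 3.", ""),
]
for problem in find_problems(examples):
    print(problem)
```

A check like this catches only surface-level gaps. Whether the prompts cover the scenarios that matter still needs human review.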
Reward functions
A reward function scores model outputs during training to reinforce desired behaviors. In SeekrFlow, reward functions are built from one or more graders, individual scoring operations that each evaluate a specific quality of the output.
Available graders
| Grader | Description |
|---|---|
| Numerical accuracy | Evaluates whether the model output is numerically correct. Useful for financial auditing, math, and other tasks with definite numerical answers. |
| String check | Evaluates whether the model output contains or matches a specific phrase. Useful for enforcing keyword usage or required terminology. |
| Text similarity | Evaluates how lexically similar the model output is to the reference answer. Useful for enforcing tone, style, or compliant language. |
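
To make the three grader types concrete, here is a minimal standalone sketch in plain Python. It does not use the SeekrFlow SDK, and every name in it (`numerical_accuracy`, `make_string_check`, `text_similarity`, `make_reward`) is hypothetical. The scoring rules are assumptions too: the numeric grader compares the last number in the output to the reference, the similarity grader uses `difflib.SequenceMatcher` as a stand-in for lexical similarity, and graders are combined by a normalized weighted sum.

```python
import re
from difflib import SequenceMatcher
from typing import Callable, Sequence, Tuple

# Assumed convention: a grader maps (model output, reference answer) to [0, 1].
Grader = Callable[[str, str], float]

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def numerical_accuracy(output: str, reference: str, tol: float = 1e-6) -> float:
    """1.0 if the last number in the output equals the reference, else 0.0."""
    found = _NUMBER.findall(output.replace(",", ""))  # drop thousands separators
    if not found:
        return 0.0
    try:
        target = float(reference.replace(",", ""))
    except ValueError:
        return 0.0
    return 1.0 if abs(float(found[-1]) - target) <= tol else 0.0


def make_string_check(phrase: str) -> Grader:
    """Build a grader that returns 1.0 when `phrase` appears in the output."""
    def grader(output: str, reference: str) -> float:
        return 1.0 if phrase.lower() in output.lower() else 0.0
    return grader


def text_similarity(output: str, reference: str) -> float:
    """Character-level similarity ratio between output and reference."""
    return SequenceMatcher(None, output.lower(), reference.lower()).ratio()


def make_reward(graders: Sequence[Tuple[Grader, float]]) -> Grader:
    """Combine (grader, weight) pairs into one normalized weighted reward."""
    total = sum(weight for _, weight in graders)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    def reward(output: str, reference: str) -> float:
        return sum(w * g(output, reference) for g, w in graders) / total
    return reward


# Example: weight numeric correctness above closeness to the reference wording.
reward = make_reward([(numerical_accuracy, 0.7), (text_similarity, 0.3)])
score = reward("The total is 1,250.50 dollars.", "1250.5")
```

Keeping every grader on the same 0 to 1 scale makes the weights easy to interpret: each weight is the share of the final reward that grader controls.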