Group relative policy optimization (GRPO)
Uses reward functions to reinforce models toward higher-quality responses.
Rather than learning from explicit demonstrations, GRPO generates groups of candidate responses to each prompt and reinforces the candidates that the reward function scores above their group's average.
UI support
GRPO is available in the UI with mathematical accuracy as a reward function. User-defined reward functions are not currently supported in the UI but are available through the API and SDK.
How it works
GRPO operates through a reinforcement learning process:
- Generation: The model generates multiple candidate responses to each prompt.
- Evaluation: A reward function scores each candidate based on quality criteria.
- Optimization: Training gradients push the model toward generating higher-scoring responses.
The reward function encodes preferences about response quality. It can evaluate factors like accuracy, coherence, style, safety, or task-specific criteria. The model learns to maximize expected reward across diverse prompts.
Unlike supervised fine-tuning that teaches specific responses, GRPO teaches the model to generate responses that score well according to the reward criteria. This allows the model to generalize preference patterns to new situations.
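The optimization signal comes from a group-relative comparison: candidates for the same prompt are scored, and each candidate's reward is judged against the group's average. The sketch below is a minimal numerical illustration of that scoring step, assuming mean/standard-deviation normalization within each group; generation and the gradient update itself are omitted.

```python
# Minimal sketch of group-relative scoring, assuming mean/std normalization
# within each prompt's group of candidates. Generation and gradient steps
# are omitted.
import statistics

def group_relative_advantages(rewards):
    """Normalize rewards within one prompt's group of candidates.

    Candidates that beat the group average get a positive advantage
    (reinforced); below-average candidates get a negative one."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Four candidate responses to the same prompt, scored by a reward function.
rewards = [0.2, 0.9, 0.4, 0.9]
print(group_relative_advantages(rewards))
# -> roughly [-1.30, 0.97, -0.65, 0.97]: the two high-scoring candidates
#    receive positive advantages and are pushed up during optimization.
```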
When to use GRPO
GRPO provides value when:
- Improving overall response quality beyond what demonstration data captures
- Aligning model outputs with subjective preferences or style guidelines
- Reducing unwanted behaviors (verbosity, hedging, unsafe content)
- Optimizing for measurable quality metrics (accuracy, user satisfaction)
- Teaching models to balance multiple competing objectives
Training requirements
Effective GRPO training requires:
- Reward function: Clear criteria for evaluating response quality through automated metrics or model-based judges
- Diverse prompts: Training examples covering the range of scenarios where quality preferences apply
- Reference answers: Ground-truth answers that the reward function can compare generated outputs against (see the sketch after this list)
- Base model capability: Strong starting model that can generate reasonable candidates before preference optimization
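As a concrete illustration, training data for an accuracy-style reward pairs each prompt with a reference answer. The record layout below is hypothetical; the field names are illustrative, not a required schema.

```python
# Hypothetical training records for an accuracy-style reward.
# Field names ("prompt", "reference_answer") are illustrative only.
train_examples = [
    {"prompt": "Solve: 12 * 7", "reference_answer": "84"},
    {"prompt": "Solve: 15% of 200", "reference_answer": "30"},
    {"prompt": "Solve: the next prime after 13", "reference_answer": "17"},
]
```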
Reward functions
GRPO supports multiple reward function approaches:
- Model judges: Using capable models to evaluate response quality based on rubrics
- Automated metrics: Computed measures like accuracy, format compliance, or safety scores
- Hybrid approaches: Combining multiple reward signals with appropriate weighting
The choice of reward function significantly impacts what behaviors the model learns. Clear, consistent reward criteria lead to more reliable alignment.
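The sketch below illustrates two of these approaches with hypothetical helper functions: an automated accuracy check against a reference answer, and a hybrid reward that adds a lightweight format signal. The weights and function names are illustrative, not a prescribed interface.

```python
# Hedged sketch of two reward styles: an automated accuracy check and a
# weighted hybrid. Weights, signatures, and names are illustrative only.
def accuracy_reward(completion: str, reference_answer: str) -> float:
    """1.0 if the reference answer appears in the completion, else 0.0."""
    return 1.0 if reference_answer.strip() in completion else 0.0

def format_reward(completion: str) -> float:
    """Small bonus for completions that include a final 'Answer:' marker."""
    return 1.0 if "Answer:" in completion else 0.0

def hybrid_reward(completion: str, reference_answer: str,
                  w_accuracy: float = 0.8, w_format: float = 0.2) -> float:
    """Combine multiple reward signals with explicit weights."""
    return (w_accuracy * accuracy_reward(completion, reference_answer)
            + w_format * format_reward(completion))

print(hybrid_reward("Reasoning... Answer: 84", reference_answer="84"))  # 1.0
```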
Comparison with other methods
GRPO learns quality patterns through reinforcement of high-scoring outputs. Unlike instruction fine-tuning, it doesn't require explicit demonstrations of correct responses. GRPO typically follows instruction fine-tuning in training pipelines—the instruction phase teaches domain knowledge, while GRPO refines response quality and alignment.
GRPO can be combined with LoRA for efficient training. The adapter learns preference patterns while the base model provides general capabilities, reducing compute requirements without sacrificing quality improvements.
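As one way to experiment with this combination outside the platform, the open-source TRL and PEFT libraries provide a GRPO trainer that accepts a LoRA configuration. The sketch below is illustrative only and is not this platform's SDK: the base model, reward function, dataset, and hyperparameters are placeholders.

```python
# Illustrative only: uses the open-source TRL and PEFT libraries, not this
# platform's API or SDK. Model name, reward, and hyperparameters are placeholders.
from datasets import Dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def length_penalty_reward(completions, **kwargs):
    """Toy reward: prefer concise completions (stand-in for a real reward)."""
    return [-len(c) / 1000.0 for c in completions]

train_dataset = Dataset.from_list([
    {"prompt": "What is 2 + 2?"},
    {"prompt": "Name a prime number greater than 10."},
])

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",              # placeholder base model
    reward_funcs=length_penalty_reward,
    args=GRPOConfig(output_dir="grpo-lora-demo", num_generations=4),
    train_dataset=train_dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32),   # LoRA adapter holds the update
)
trainer.train()
```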
Model deployment
GRPO-tuned models are deployed as standard model endpoints. The preference optimization is embedded in model parameters, so no special inference infrastructure is required. The model generates responses that reflect the learned quality preferences.