Reinforcement tuning (GRPO)
Uses reward functions to reinforce models toward higher-quality responses.
Reinforcement tuning trains models to generate higher-quality responses by scoring candidate outputs and reinforcing preferred behaviors. SeekrFlow implements reinforcement tuning using group relative policy optimization (GRPO).
For implementation details including dataset format and code examples, see the Reinforcement tuning SDK guide.
UI support: Reinforcement tuning is available in the UI with mathematical accuracy as a reward function. User-defined reward functions are not currently supported in the UI but are available through the API and SDK.
How it works
Reinforcement tuning operates through a reinforcement learning process:
- Generation: The model generates multiple candidate responses to each prompt.
- Evaluation: A reward function scores each candidate based on quality criteria.
- Optimization: Training gradients push the model toward generating higher-scoring responses.
Reward functions use one or more graders to score outputs on specific qualities like numerical accuracy, keyword matching, or text similarity. The model learns to maximize expected reward across diverse prompts.
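The "group relative" part of GRPO refers to how candidate scores are turned into a learning signal: each candidate's reward is normalized against the mean and spread of the other candidates generated for the same prompt. The sketch below illustrates that normalization step in plain Python; it is a conceptual illustration, not SeekrFlow's implementation.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each candidate's reward against
    the mean and standard deviation of its own candidate group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-spread group
    return [(r - mean) / std for r in rewards]

# One prompt, four candidate responses scored by a reward function:
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
# Candidates above the group mean get positive advantages and are
# reinforced; below-mean candidates are pushed down.
```

Because advantages are computed relative to the group rather than an absolute scale, the reward function only needs to rank candidates consistently, not produce calibrated scores.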
When to use reinforcement tuning
Reinforcement tuning provides value when:
- Improving overall response quality beyond what demonstration data captures
- Aligning model outputs with subjective preferences or style guidelines
- Reducing unwanted behaviors (verbosity, hedging, unsafe content)
- Optimizing for measurable quality metrics (accuracy, user satisfaction)
- Teaching models to balance multiple competing objectives
Training requirements
Effective reinforcement tuning requires:
- Reward function: Clear criteria for evaluating response quality through one or more graders
- Diverse prompts: Training examples covering the range of scenarios where quality preferences apply
- Reference answers: Ground-truth answers against which the reward function scores the model's generated outputs
- Base model capability: Strong starting model that can generate reasonable candidates before preference optimization
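To make these requirements concrete, a training example pairs a prompt with a reference answer the reward function can score against. The record shape below is illustrative only; the field names are hypothetical, not the official SeekrFlow dataset schema (see the Reinforcement tuning SDK guide for the real format).

```python
# Hypothetical record shape -- field names are illustrative, not the
# exact SeekrFlow dataset schema (see the SDK guide for that):
training_examples = [
    {"prompt": "What is 15% of 240?", "reference_answer": "36"},
    {"prompt": "A bond pays 5% annually on $1,000 principal. "
               "What is the yearly interest?", "reference_answer": "50"},
]
```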
Reward functions
A reward function scores model outputs during training to reinforce desired behaviors. In SeekrFlow, reward functions are built from one or more graders — individual scoring operations that each evaluate a specific quality of the output.
Available graders
| Grader | Description |
|---|---|
| Numerical accuracy | Evaluates whether the model output is numerically correct. Useful for financial auditing, math, and other tasks with definite numerical answers. |
| String check | Evaluates whether the model output contains or matches a specific phrase. Useful for enforcing keyword usage or required terminology. |
| Text similarity | Evaluates how lexically similar the model output is to the reference answer. Useful for enforcing tone, style, or compliant language. |
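The three graders can be pictured as simple scoring functions that map a model output (and a reference) to a number. The sketches below are minimal stand-ins written for illustration, assuming a score in [0, 1]; they are not SeekrFlow's built-in implementations.

```python
import difflib

def numerical_accuracy(output: str, reference: str) -> float:
    """1.0 if the output parses to the same number as the reference."""
    try:
        return float(float(output.strip()) == float(reference.strip()))
    except ValueError:
        return 0.0

def string_check(output: str, phrase: str) -> float:
    """1.0 if the required phrase appears in the output (case-insensitive)."""
    return float(phrase.lower() in output.lower())

def text_similarity(output: str, reference: str) -> float:
    """Lexical similarity in [0, 1] via a character sequence ratio."""
    return difflib.SequenceMatcher(None, output, reference).ratio()
```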
Combining graders
You can combine multiple graders into a single reward function by assigning weight percentages to each. This creates a hybrid reward signal tailored to your use case. For example, you might weight numerical accuracy at 80% and text similarity at 20% to reward both the mathematical correctness and formatting of the model's answers.
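The 80/20 weighting described above amounts to a weighted sum of grader scores. A minimal sketch, using toy stand-in graders rather than SeekrFlow's built-ins:

```python
def combined_reward(output, reference, weighted_graders):
    """Weighted sum of individual grader scores (weights should sum to 1.0)."""
    return sum(weight * grader(output, reference)
               for grader, weight in weighted_graders)

# Toy stand-ins for the built-in graders, for illustration only:
numeric = lambda out, ref: float(out.strip() == ref.strip())   # exact-match check
similarity = lambda out, ref: 1.0 if out == ref else 0.5       # crude lexical score

score = combined_reward("36", "36", [(numeric, 0.8), (similarity, 0.2)])
```

A response that is numerically wrong but well-formatted still earns partial credit from the similarity term, which is the point of blending graders rather than using a single pass/fail check.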
For implementation details and code examples, see the Reinforcement tuning SDK guide.
Comparison with other methods
Reinforcement tuning learns quality patterns through reinforcement of high-scoring outputs. Unlike instruction fine-tuning, it doesn't require explicit demonstrations of correct responses. Reinforcement tuning typically follows instruction fine-tuning in training pipelines — the instruction phase teaches domain knowledge, while reinforcement tuning refines response quality and alignment.
Reinforcement tuning can be combined with LoRA for efficient training. The adapter learns preference patterns while the base model provides general capabilities, reducing compute requirements without sacrificing quality improvements.
Model deployment
Reinforcement-tuned models are deployed as standard model endpoints. The reinforced model behaviors are embedded in model parameters, so no special inference infrastructure is required. The model generates responses that reflect the learned quality preferences.
