Preference tuning (DPO)

Align model outputs with human preferences by learning directly from response comparisons.

Supported on: UI, API, SDK

Preference tuning aligns model behavior by learning from comparisons between responses rather than from explicit correct answers. SeekrFlow implements preference tuning using direct preference optimization (DPO), which trains models to increase the likelihood of generating preferred responses over dispreferred ones.

For implementation details including dataset schema and code examples, see the Preference tuning SDK guide.

How it works

Preference tuning operates on paired response data. Each training example contains a prompt, a preferred (chosen) response, and a dispreferred (rejected) response. The model learns to distinguish what makes one response better than another and adjusts its generation patterns accordingly.

DPO optimizes directly against preference pairs without requiring a separate reward model. This makes it simpler and more stable than traditional RLHF approaches while achieving comparable alignment quality.

A key parameter, beta, controls the strength of the KL-divergence penalty — how far the tuned model is allowed to deviate from the original base model. Lower beta values allow more deviation, while higher values keep the model closer to its original behavior.
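
For intuition, here is a minimal sketch of the standard DPO objective as it is commonly written in code. It is illustrative only, not SeekrFlow's internal implementation; the function name, the PyTorch dependency, the default beta value, and the log-probability inputs are all assumptions made for the example.

```python
# Illustrative sketch of the standard DPO objective (not SeekrFlow's internal code).
# Each *_logps tensor holds one response's summed token log-probabilities per example.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Log-ratios of the tuned policy against the frozen reference (base) model.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # beta scales the chosen-vs-rejected margin: higher beta corresponds to a
    # stronger pull toward the reference model, lower beta permits more deviation.
    margin = beta * (chosen_logratio - rejected_logratio)
    # Maximize the probability that the chosen response outranks the rejected one.
    return -F.logsigmoid(margin).mean()
```

Note that no separate reward model appears anywhere in the loss; the reference model's log-probabilities play that role implicitly.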

When to use preference tuning

Preference tuning provides value when:

  • Aligning model outputs with organizational standards such as tone, compliance requirements, or customer experience guidelines
  • Improving response quality in scenarios where preferences are clear but a single "correct" answer does not exist
  • Leveraging existing human feedback or editorial judgment as training signal
  • Refining model behavior without needing explicit reference answers or reward functions

How it differs from other methods

Preference tuning fills a distinct role in the fine-tuning workflow:

  • Instruction fine-tuning teaches models to replicate demonstrated responses. It requires explicit examples of correct outputs.
  • Reinforcement tuning (GRPO) optimizes against a reward function that scores generated responses against reference answers. It requires tasks with definitive, verifiable answers.
  • Preference tuning (DPO) learns from comparative judgments — which response is better — without requiring gold answers or reward functions. This makes it effective for subjective quality criteria that are difficult to express as rules or metrics.

Training requirements

Effective preference tuning requires:

  • Preference dataset: Paired examples with a prompt, a chosen response, and a rejected response. SeekrFlow's data engine does not currently generate preference datasets, so datasets must be prepared externally.
  • Consistent preference signal: Clear, consistent criteria for what makes one response preferred over another. Noisy or contradictory preferences reduce training effectiveness; a quick pre-upload scan, like the sketch after this list, can surface some of these issues.
  • Base model capability: A model that already generates reasonable responses. Preference tuning refines quality rather than teaching foundational knowledge.
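
Some of these requirements can be checked before upload with a quick scan of the dataset file. The sketch below is a minimal, hypothetical example: it assumes the prompt, chosen, and rejected fields described under Dataset format and a JSONL file named preferences.jsonl, and it is not a SeekrFlow validation tool.

```python
# Minimal pre-upload sanity check for a JSONL preference dataset (illustrative only,
# not a SeekrFlow tool). Flags identical chosen/rejected pairs and contradictory
# preferences for the same prompt.
import json
from collections import defaultdict

def scan_preferences(path: str) -> None:
    seen = defaultdict(set)  # canonicalized prompt -> set of (chosen, rejected) pairs
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            record = json.loads(line)
            prompt = json.dumps(record["prompt"], sort_keys=True)
            chosen = json.dumps(record["chosen"], sort_keys=True)
            rejected = json.dumps(record["rejected"], sort_keys=True)
            if chosen == rejected:
                print(f"line {i}: chosen and rejected responses are identical")
            if (rejected, chosen) in seen[prompt]:
                print(f"line {i}: contradicts an earlier preference for the same prompt")
            seen[prompt].add((chosen, rejected))

scan_preferences("preferences.jsonl")  # hypothetical file name
```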

Dataset format

Each training example contains three components:

  • Prompt: System and user messages that define the context.
  • Chosen: The preferred assistant response.
  • Rejected: The dispreferred assistant response.

Datasets must be in JSONL or Parquet format and uploaded with the preference-fine-tune file purpose. SeekrFlow validates the schema on upload and rejects datasets that do not conform to the expected structure.
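
As a rough sketch, a single record could be assembled and appended to a JSONL file as shown below. The chat-style message objects and the exact field layout are assumptions for illustration; consult the Preference tuning SDK guide for the authoritative schema.

```python
# Illustrative preference record written as one JSONL line (field layout is an
# assumption; see the Preference tuning SDK guide for the exact schema).
import json

record = {
    "prompt": [
        {"role": "system", "content": "You are a concise support assistant."},
        {"role": "user", "content": "How do I reset my password?"},
    ],
    "chosen": {"role": "assistant",
               "content": "Go to Settings > Security and choose Reset password."},
    "rejected": {"role": "assistant",
                 "content": "Passwords cannot be changed once set."},
}

with open("preferences.jsonl", "a") as f:  # hypothetical file name
    f.write(json.dumps(record) + "\n")
```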

Model compatibility

Preference tuning works with all base models available in SeekrFlow. No model-specific restrictions apply beyond the standard fine-tuning requirements.

Model deployment

Preference-tuned models are deployed as standard model endpoints. The learned preferences are embedded in model parameters, so no special inference infrastructure is required.
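
Because the learned preferences live entirely in the model weights, calling a preference-tuned model looks the same as calling any other deployed model. The snippet below is a generic HTTP sketch only; the endpoint URL, headers, payload shape, and model ID are hypothetical placeholders rather than SeekrFlow's actual inference API, which is documented in the API and SDK references.

```python
# Generic sketch of querying a deployed, preference-tuned model over HTTP.
# The URL, headers, and payload here are hypothetical placeholders, not
# SeekrFlow's actual inference API; see the API/SDK references for real calls.
import os
import requests

response = requests.post(
    "https://example.invalid/v1/chat/completions",  # hypothetical endpoint
    headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},  # hypothetical auth
    json={
        "model": "my-preference-tuned-model",  # hypothetical model ID
        "messages": [{"role": "user", "content": "Summarize our refund policy."}],
    },
    timeout=30,
)
print(response.json())
```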