Low-rank adaptation (LoRA)

Parameter-efficient fine-tuning method that trains small adapter modules instead of all model weights.

Supported on: UI, API, SDK

Low-rank adaptation (LoRA) is a parameter-efficient fine-tuning method that trains small adapter modules rather than updating all model weights. This approach enables faster training with significantly lower compute costs while preserving base model knowledge.

How it works

LoRA inserts small trainable adapter matrices into model layers while keeping the base model weights frozen. During training, only these adapter parameters are updated. The adapters learn task-specific patterns as low-rank decompositions of weight updates.

At inference time, the adapter's low-rank update is applied on top of the frozen base weights to compute predictions. Multiple LoRA adapters can be trained from the same base model and swapped dynamically based on the task.
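
As a concrete sketch of these mechanics (in PyTorch, with illustrative layer sizes and task names), the base weight below stays frozen while each adapter contributes a low-rank update B @ A, scaled by the alpha and rank parameters described next:

    import torch

    d_in, d_out = 1024, 1024                  # layer dimensions (illustrative)
    W = torch.randn(d_out, d_in)              # frozen base weight; never updated

    def make_adapter(rank, alpha):
        # Only A and B are trained. B starts at zero so the adapter
        # initially leaves the base model's behavior unchanged.
        A = (0.01 * torch.randn(rank, d_in)).requires_grad_(True)
        B = torch.zeros(d_out, rank, requires_grad=True)
        return {"A": A, "B": B, "scale": alpha / rank}

    # Multiple adapters trained from the same frozen base, swapped per task.
    adapters = {
        "summarize": make_adapter(rank=16, alpha=32),
        "classify":  make_adapter(rank=8, alpha=16),
    }

    def forward(x, task):
        ad = adapters[task]
        base = x @ W.T                            # frozen base computation
        update = (x @ ad["A"].T) @ ad["B"].T      # low-rank adapter path
        return base + ad["scale"] * update

    y = forward(torch.randn(2, d_in), "summarize")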

The method is controlled by two key parameters:

  • Rank: Determines adapter capacity and the complexity of patterns it can learn. Higher rank supports more complex adaptations but increases training cost.
  • Alpha: Controls how strongly learned patterns influence base model behavior. Higher alpha makes the adapter's influence more pronounced.
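
To make the trade-off concrete, this small calculation shows how many trainable parameters an adapter adds to a single weight matrix at a few rank settings, along with the alpha / rank factor that scales the adapter's output. The 4096 x 4096 layer size is an arbitrary example.

    d_in, d_out = 4096, 4096                      # one adapted weight matrix (example)
    full_matrix = d_in * d_out                    # 16,777,216 weights, all frozen

    for rank, alpha in [(8, 16), (16, 32), (64, 128)]:
        adapter_params = rank * (d_in + d_out)    # A is rank x d_in, B is d_out x rank
        scale = alpha / rank                      # strength of the adapter's influence
        print(f"rank={rank:<3} alpha={alpha:<4} "
              f"trainable={adapter_params:,} "
              f"({100 * adapter_params / full_matrix:.1f}% of the layer) "
              f"scale={scale}")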

When to use LoRA

LoRA provides value when you need to:

  • Add new behaviors to a model while preserving its underlying knowledge
  • Iterate faster than full fine-tuning allows
  • Train under compute or cost constraints that rule out full fine-tuning
  • Maintain multiple task variants from a single base model
  • Experiment with different adaptations before committing to full fine-tuning
  • Keep model artifacts small for easier storage and deployment

Training characteristics

LoRA offers distinct training advantages:

  • Speed: Trains significantly faster than full fine-tuning by updating fewer parameters
  • Memory efficiency: Enables larger batch sizes on the same hardware by keeping the base model frozen
  • Artifact size: Produces small adapter files (typically megabytes) rather than full model weights (gigabytes)
  • Flexibility: Supports creating multiple adapters from one base model for different tasks
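
A back-of-the-envelope calculation, using invented but plausible sizes, shows where the megabytes-versus-gigabytes gap in the list above comes from:

    # Hypothetical 7B-parameter base model with rank-16 adapters on two
    # projection matrices (4096 x 4096) in each of 32 transformer layers.
    bytes_per_param = 2                           # fp16 / bf16 storage

    full_weights = 7_000_000_000 * bytes_per_param
    adapter = 32 * 2 * 16 * (4096 + 4096) * bytes_per_param

    print(f"full model weights: {full_weights / 1e9:.1f} GB")   # ~14 GB
    print(f"LoRA adapter file:  {adapter / 1e6:.1f} MB")        # ~17 MB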

The method works best with consistent, high-quality datasets. LoRA adapters can be more sensitive to noisy or contradictory training data than full fine-tuning.

Configuration guidance

Choosing rank and alpha depends on the task:

For simple adaptations (style, formatting, tone):

  • Start with rank 8–16, alpha 16–32
  • Lower ranks are often sufficient for surface-level changes

For domain adaptation (specialized terminology, structured outputs):

  • Start with rank 16–32, alpha 32–64
  • Medium ranks handle domain-specific patterns

For complex multi-constraint tasks (correctness, format, policy, reasoning):

  • Start with rank 32–64, alpha 64–128
  • Higher ranks support multiple simultaneous requirements

Most tasks work well with rank 16 and alpha 32 as starting defaults. Adjust based on whether the model learns the desired behavior too weakly (increase alpha) or applies it too rigidly (decrease alpha or rank).
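
If you are configuring LoRA in code, the snippet below expresses those starting defaults with the open-source Hugging Face PEFT library. This is only an illustration; the parameter names exposed by this platform's UI, API, or SDK may differ, and the model name and target modules are placeholders.

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder

    config = LoraConfig(
        r=16,                                  # rank: adapter capacity
        lora_alpha=32,                         # alpha: strength of the adapter's influence
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],   # which layers get adapters (model-dependent)
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(base, config)
    model.print_trainable_parameters()         # verify only adapter weights are trainable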

Comparison with other methods

Unlike full instruction fine-tuning, LoRA updates only adapter parameters rather than all weights. This provides faster training and smaller artifacts but may have less capacity for deep knowledge rewrites.

LoRA can combine with any fine-tuning method (instruction, context-grounded, GRPO). The choice of method determines what the model learns; LoRA determines how efficiently those weight updates are trained and stored.

For most use cases, LoRA offers the best balance of training speed, cost, and model quality. Consider full fine-tuning only when LoRA cannot achieve the desired behavior after parameter tuning.

Model deployment

LoRA models are deployed as a base model plus an adapter. The deployment system loads the frozen base model and applies the appropriate adapter for each request. This architecture enables efficient serving of multiple specialized models from a shared base.
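
As an illustration of that base-plus-adapter pattern, the sketch below loads one frozen base model and switches adapters per request using the Hugging Face PEFT library. The model and adapter paths are placeholders, and a production serving stack may implement this differently.

    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    # Load the shared base model once.
    base = AutoModelForCausalLM.from_pretrained("your-base-model")

    # Attach task-specific adapters trained from the same base (placeholder paths).
    model = PeftModel.from_pretrained(base, "adapters/support", adapter_name="support")
    model.load_adapter("adapters/summarize", adapter_name="summarize")

    def generate_for(task, inputs):
        model.set_adapter(task)                # activate the adapter matching the request
        return model.generate(**inputs)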