Vision language tuning
Fine-tune vision-language models on image-text datasets using the SeekrFlow Python SDK.
For conceptual background on vision language tuning, including supported models and when to use it, see Vision language tuning.
Prepare a vision language dataset
Upload a dataset that follows the vision-language message schema. Each training example is a single-turn conversation where user messages contain both image and text content.
Example dataset schema
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        {"type": "text", "text": "What product is this?"}
      ]
    },
    {
      "role": "assistant",
      "content": [
        {"type": "text", "text": "This is the ACME Widget Pro X7, a second-generation industrial sensor unit. It features the distinctive blue housing and triple-port connector array."}
      ]
    }
  ]
}
Dataset validation
SeekrFlow validates the dataset on upload and rejects datasets with:
- Malformed message content or missing required fields
- Unsupported image formats
- Schema violations against the expected multimodal structure
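To assemble training examples in this format programmatically, you can base64-encode each image into a data URL and write one JSON object per line. The sketch below uses only the Python standard library; the image path, prompt text, and output filename are placeholders, and the one-example-per-line (JSONL) layout assumes the same file format accepted for text-only instruction fine-tuning.

import base64
import json

# Encode a local image as a base64 data URL (path is a placeholder).
with open("widget_photo.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

example = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": "What product is this?"},
            ],
        },
        {
            "role": "assistant",
            "content": [
                {"type": "text", "text": "This is the ACME Widget Pro X7, a second-generation industrial sensor unit."}
            ],
        },
    ]
}

# Append one example per line; repeat for every image-text pair in the dataset.
with open("vlm_dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")

Upload the resulting file through the Files API as in the text-only workflow; the returned file ID is what you pass to training_files below.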
Create a vision language fine-tuning job
from seekrai.types import TrainingConfig, InfrastructureConfig
from seekrai import SeekrFlow

client = SeekrFlow()

training_config = TrainingConfig(
    training_files=[
        "file-830e9be3-25tt-13y1-0298-3a035e73o90"  # Vision-language dataset file ID
    ],
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    n_epochs=1,
    n_checkpoints=1,
    batch_size=4,
    learning_rate=1e-5,
    experiment_name="vlm_helperbot_v1",
)

infrastructure_config = InfrastructureConfig(
    n_accel=8,
    accel_type="MI300X",
)

fine_tune = client.fine_tuning.create(
    training_config=training_config,
    infrastructure_config=infrastructure_config,
    project_id=123,
)

print(fine_tune.id)

All other TrainingConfig parameters behave the same as in text-only instruction fine-tuning. See Create a fine-tuning job for the full workflow, including project setup, file retrieval, and monitoring.
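To monitor the job from the SDK, you can poll it the same way as a text-only run. The loop below is a minimal sketch that assumes the fine_tuning.retrieve call and status field described in Create a fine-tuning job; the terminal status names are illustrative and may differ in your SDK version.

import time

# Poll once a minute until the job reaches a terminal state (sketch; the
# retrieve call, status field, and status names are assumptions based on the
# text-only fine-tuning workflow).
terminal = {"completed", "failed", "cancelled"}
while True:
    job = client.fine_tuning.retrieve(fine_tune.id)
    print(job.status)
    if str(job.status).lower().split(".")[-1] in terminal:
        break
    time.sleep(60)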
