Vision language tuning
Fine-tune vision-language models on image-text datasets using the SeekrFlow Python SDK.
For conceptual background on vision language tuning, including supported models and when to use it, see Vision language tuning.
Prepare a vision language dataset
Upload a dataset that follows the vision-language message schema. Each training example is a single-turn conversation where user messages contain both image and text content.
Example dataset schema
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        {"type": "text", "text": "What product is this?"}
      ]
    },
    {
      "role": "assistant",
      "content": [
        {"type": "text", "text": "This is the ACME Widget Pro X7, a second-generation industrial sensor unit. It features the distinctive blue housing and triple-port connector array."}
      ]
    }
  ]
}
Dataset validation
SeekrFlow validates the dataset on upload and rejects datasets with:
- Malformed message content or missing required fields
- Unsupported image formats
- Schema violations against the expected multimodal structure
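To assemble training examples in this format programmatically, you can base64-encode each image into a data URL and write one JSON object per line. The sketch below uses only the Python standard library; the image path, prompt text, and output filename are placeholders, and the one-example-per-line (JSONL) layout assumes the same file format accepted for text-only instruction fine-tuning.

import base64
import json

# Encode a local image as a base64 data URL (path is a placeholder).
with open("widget_photo.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

example = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": "What product is this?"},
            ],
        },
        {
            "role": "assistant",
            "content": [
                {"type": "text", "text": "This is the ACME Widget Pro X7, a second-generation industrial sensor unit."}
            ],
        },
    ]
}

# Append one example per line; repeat for every image-text pair in the dataset.
with open("vlm_dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")

Upload the resulting file through the Files API as in the text-only workflow; the returned file ID is what you pass to training_files below.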
Create a vision language fine-tuning job
from seekrai.types import TrainingConfig, InfrastructureConfig
from seekrai import SeekrFlow

client = SeekrFlow()

training_config = TrainingConfig(
    training_files=[
        "file-830e9be3-25tt-13y1-0298-3a035e73o90"  # Vision-language dataset file ID
    ],
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    n_epochs=1,
    n_checkpoints=1,
    batch_size=4,
    learning_rate=1e-5,
    experiment_name="vlm_helperbot_v1",
)

infrastructure_config = InfrastructureConfig(
    n_accel=8,
    accel_type="MI300X",
)

fine_tune = client.fine_tuning.create(
    training_config=training_config,
    infrastructure_config=infrastructure_config,
    project_id=123,
)

print(fine_tune.id)

All other TrainingConfig parameters behave the same as in text-only instruction fine-tuning. See Create a fine-tuning job for the full workflow, including project setup, file retrieval, and monitoring.
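To monitor the job from the SDK, you can poll it the same way as a text-only run. The loop below is a minimal sketch that assumes the fine_tuning.retrieve call and status field described in Create a fine-tuning job; the terminal status names are illustrative and may differ in your SDK version.

import time

# Poll once a minute until the job reaches a terminal state (sketch; the
# retrieve call, status field, and status names are assumptions based on the
# text-only fine-tuning workflow).
terminal = {"completed", "failed", "cancelled"}
while True:
    job = client.fine_tuning.retrieve(fine_tune.id)
    print(job.status)
    if str(job.status).lower().split(".")[-1] in terminal:
        break
    time.sleep(60)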
