Instruction Fine-Tuning Data

This guide walks you through how to prepare and upload files to generate a standard instruction fine-tuning dataset that you can use to adapt a generalist model to a task of your choosing.

Fully automated data preparation and creation with the AI-Ready Data Engine

Traditional fine-tuning dataset creation involves a lot of manual labor: data collection, cleaning, labeling, format conversion, and so forth. SeekrFlow's AI-Ready Data Engine automates this work. It processes files in several formats (PDF, Markdown, JSON, DOCX) and converts them into a file ready for fine-tuning in a matter of minutes. Read on to find out how to generate your own dataset for standard instruction fine-tuning.

Step 1: Generate instruction fine-tuning data

Endpoint: POST /v1/flow/alignment/generate (Generate IFT pairs)

Using your properly formatted file, you can start the data generation process to create document chunks.

Note: Unless you already have a Parquet file prepared, see Data Preparation for complete instructions on how to process and convert your files before starting this step.

Collect your Markdown file ID
Use the ID of the final Markdown file from either Step 2 of Data Preparation or the ingestion output in Step 3.

from seekrai import SeekrFlow

# Initialize the SeekrFlow client
client = SeekrFlow(api_key="your api key")

# Submit the data generation job for the Markdown file ID collected above
response = client.alignment.generate(
    files=["file-1234567890"],
    instructions="I want to train a chatbot to answer questions about horror movie tropes.",
    type="principle"
)

print("Response:", response.model_dump_json())

Sample response:

Response: {"id":"aj-1234567890","created_at":"2025-04-24T21:08:53.400417Z","status":"queued"}
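The job ID in this response is what Step 2 needs. As a minimal sketch, assuming you only have the printed JSON string to work with, the standard library can pull the ID and status out of it. The payload below is copied from the sample response above; the variable names are illustrative.

```python
import json

# Sample payload copied from the response shown above
raw = '{"id":"aj-1234567890","created_at":"2025-04-24T21:08:53.400417Z","status":"queued"}'

# Parse the JSON text into a plain dictionary
job_info = json.loads(raw)

# Keep the job ID for the status check in Step 2
job_id = job_info["id"]
print("Job ID:", job_id)
print("Initial status:", job_info["status"])
```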

Step 2: Check job status (optional)

Endpoint: GET /v1/flow/alignment/{job_id} (Fetch Job Details)

While the alignment job is running, you can poll its status using the alignment job ID from the previous step to see when it's complete.

# Use the alignment job ID returned in Step 1
alignment_job_id = "aj-1234567890"

# Retrieve its current status
job = client.alignment.retrieve(alignment_job_id)
print("Job ID:", job.id)
print("Status:", job.status)
print("Status (str):", job.status.value)

Sample response:

Job ID: aj-1234567890
Status: AlignmentJobStatus.STATUS_COMPLETED
Status (str): completed
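A single status check can be wrapped in a polling loop so that a script waits until the job finishes. The helper below is a sketch, not part of the SeekrFlow SDK: `wait_for_status` and its parameters are hypothetical names, and the `"failed"` terminal state is an assumption, since only `queued` and `completed` appear in this guide. It accepts any zero-argument function that returns the current status as a string.

```python
import time
from typing import Callable, Iterable


def wait_for_status(
    fetch_status: Callable[[], str],
    done_states: Iterable[str] = ("completed", "failed"),  # "failed" is assumed
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_status() until it returns a terminal state or the timeout elapses."""
    terminal = set(done_states)
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        # Give up once the deadline has passed instead of looping forever
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_s} seconds")
        sleep(interval_s)
```

With the client from Step 1, the status function could be `lambda: client.alignment.retrieve(alignment_job_id).status.value`, which reuses the calls shown above. Passing `sleep` as a parameter keeps the helper easy to test without real delays.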

Once the status shows completed, the Parquet file will appear in your files list (e.g., "filename": "file-d7166390-962f-4d22-93a2-265d93c114e6-raft-qa-pairs.parquet"). You can also find and download the file in the UI by navigating to the Datasets Directory.
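If your files list is long, a small filter can pick out the generated dataset. This is a sketch under two assumptions: each entry in the listing is a dictionary with a "filename" key, as the example above suggests, and generated datasets end in "-raft-qa-pairs.parquet", which is inferred from that single example. The function name and the second sample entry are hypothetical.

```python
from typing import Iterable, List, Mapping


def find_generated_parquet(
    files: Iterable[Mapping[str, str]],
    suffix: str = "-raft-qa-pairs.parquet",  # assumed naming pattern
) -> List[str]:
    """Return the filenames that end with the generated-dataset suffix."""
    return [
        entry["filename"]
        for entry in files
        if entry.get("filename", "").endswith(suffix)
    ]


# Illustrative listing: the first entry is the example from this guide,
# the second is a made-up source file
sample_files = [
    {"filename": "file-d7166390-962f-4d22-93a2-265d93c114e6-raft-qa-pairs.parquet"},
    {"filename": "horror-tropes.md"},
]
matches = find_generated_parquet(sample_files)
print(matches)
```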