Create instruction fine-tuning data
Generate a QA pair dataset for instruction fine-tuning using the principle_files data job workflow.
This guide walks you through creating a principle_files data job, which generates a QA pair Parquet file you can use for instruction fine-tuning. The workflow bundles file upload, ingestion, prompt configuration, and alignment into a single managed job.
Before you start: Upload your source files using the Files API and have your file IDs ready before Step 2. Not sure which approach fits your use case? See Fine-tuning.
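If your documents aren't uploaded yet, a small helper like the one below can collect the file IDs you'll need in Step 2. This is a sketch: the paths are placeholders, and it assumes `client.files.upload(path, purpose="alignment")` as shown in the re-upload example in Step 4.

```python
def upload_sources(client, paths):
    """Upload local documents for alignment and return their new file IDs."""
    return [client.files.upload(path, purpose="alignment").id for path in paths]
```

Usage with a `SeekrFlow` client: `file_ids = upload_sources(client, ["handbook.pdf", "faq.md"])`.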
Step 1: Create a data job
Endpoint: POST /v1/flow/data-jobs Submit data job
Create a job shell with job_type set to principle_files.
from seekrai import SeekrFlow

client = SeekrFlow()

data_job = client.data_jobs.create(
    name="Customer support refresh",
    description="Prep PDFs and Markdown for Q4 fine-tuning",
    job_type="principle_files",
)
data_job_id = data_job.id
print("Data job ID:", data_job_id)
Sample response:
{
  "id": "dj-1b75f4d5-5c9e-4d33-b164-a2393bc5ab6d",
  "name": "Customer support refresh",
  "job_type": "principle_files",
  "status": "needs_setup"
}
Step 2: Add files to the job
Endpoint: POST /v1/flow/data-jobs/{id}/add-files Add files to data job
Attach uploaded file IDs to the job. Non-Markdown files (PDF, DOCX, PPT) trigger ingestion automatically. Markdown files are marked as alignment-ready immediately.
data_job = client.data_jobs.add_files(
    data_job_id,
    file_ids=[
        "file-25e34f96-2130-11f0-9236-3e11346bffff",
        "file-efd0b334-2130-11f0-9236-3e11346bffff",
    ],
    method="accuracy-optimized",
)
print("Data job status:", data_job.status)
Ingestion mode and the UI
When ingesting files through the SeekrFlow UI, speed-optimized mode is always used. The SDK lets you choose between speed-optimized and accuracy-optimized.
Choose an ingestion method
| Method | Approx. time (100+ pages) | Description |
|---|---|---|
| accuracy-optimized (default) | ~30 min | Combines OCR and direct extraction, runs LLM hierarchy cleanup, and uses advanced table detection. |
| speed-optimized | ~3 min | Faster extraction heuristics for large documents; smaller documents still use the richer pipeline. |
Step 3: Monitor ingestion
Poll GET /v1/flow/data-jobs/{id} until the job status is no longer file_processing.
detail = client.data_jobs.retrieve(data_job_id)
print("Status:", detail.status)
for ingestion_job in detail.ingestion_jobs:
    for record in ingestion_job.records:
        print(f"  {record.filename}: {record.status}")
        if record.status == "failed":
            print(f"    Fix: {record.suggested_fix}")
| Status | Meaning |
|---|---|
| file_processing | Ingestion still running — wait |
| needs_review | One or more files failed — resolve suggested_fix in records |
| ready_to_start | Ingestion complete and prerequisites met |
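A simple polling loop ties this together. The interval and timeout below are arbitrary choices for illustration, not SDK defaults:

```python
import time

def wait_for_ingestion(client, data_job_id, poll_seconds=30, timeout_seconds=3600):
    """Poll the data job until it leaves file_processing, then return the detail."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        detail = client.data_jobs.retrieve(data_job_id)
        if detail.status != "file_processing":
            return detail
        time.sleep(poll_seconds)
    raise TimeoutError(f"data job {data_job_id} still processing after {timeout_seconds}s")
```

Call it as `detail = wait_for_ingestion(client, data_job_id)` and then branch on `detail.status` per the table above.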
If a file fails, fix the source, re-upload it, and attach it to the job again. To skip the file instead, remove it via POST /v1/flow/data-jobs/{id}/remove-files. See Monitor ingestion for a full reference on job states and error codes.
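To see at a glance which files need rework, you can collect failed records from the job detail. This sketch uses the same field names (`ingestion_jobs`, `records`, `filename`, `status`, `suggested_fix`) as the Step 3 example:

```python
def failed_records(detail):
    """Return (filename, suggested_fix) pairs for records that failed ingestion."""
    failures = []
    for ingestion_job in detail.ingestion_jobs:
        for record in ingestion_job.records:
            if record.status == "failed":
                failures.append((record.filename, record.suggested_fix))
    return failures
```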
Step 4: Review and edit ingested Markdown
Download the generated Markdown files from the files array in the job detail and review for accuracy. Re-upload any corrected versions and remove the originals before starting alignment.
retrieve_resp = client.files.retrieve_content("file-dc6f19d1-b55a-43a0-a38a-56f0e7bd8d8d")
print(f"Retrieved file: {retrieve_resp.id}")
Once you've reviewed and edited, re-upload and attach the corrected file:
upload_resp = client.files.upload("edited_example.md", purpose="alignment")
print(f"Uploaded file with ID: {upload_resp.id}")
See Prepare and ingest files for Markdown editing guidelines and in-context learning considerations.
Step 5: Set a system prompt
A system_prompt is required for principle_files jobs before alignment can start. You can generate one from high-level instructions or write your own.
Generate a prompt
Endpoint: POST /v1/flow/data-jobs/gen_system_prompt Generate system prompt
Note
gen_system_prompt returns a suggested prompt but does not save it to the job. You must call PATCH /v1/flow/data-jobs/{id} to set it.
result = client.data_jobs.generate_system_prompt(
    instructions="I want to train a chatbot to answer questions about horror movie tropes.",
)
print(result.system_prompt)
Set the prompt on the job
Endpoint: PATCH /v1/flow/data-jobs/{id} Update data job
data_job = client.data_jobs.update(
    data_job_id,
    system_prompt=result.system_prompt,
)
print("Prompt set:", data_job.system_prompt)
You can also write your own prompt directly via the PATCH endpoint without calling gen_system_prompt first.
Note
Once alignment starts, the system prompt is locked. Make any changes before calling /start.
Step 6: Start alignment
Endpoint: POST /v1/flow/data-jobs/{id}/start Start data job alignment
The /start endpoint enforces pre-flight validation before launching. Requirements for principle_files:
- status must be ready_to_start
- system_prompt must be set
- At least one processed Markdown file must be attached
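You can mirror these checks client-side before calling /start to avoid a failed round-trip. This is a sketch: it assumes the job detail exposes `system_prompt` as an attribute, and it omits the attached-Markdown-file check for brevity:

```python
def preflight_errors(detail):
    """Return human-readable problems that would make /start fail (partial check)."""
    errors = []
    if detail.status != "ready_to_start":
        errors.append(f"status is {detail.status}, not ready_to_start")
    if not getattr(detail, "system_prompt", None):
        errors.append("system_prompt is not set")
    return errors
```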
detail = client.data_jobs.start(data_job_id)
print("Alignment job:", detail.alignment_job.id, detail.alignment_job.status)
If prerequisites are not met, the endpoint returns 422 Unprocessable Entity with a descriptive message.
Step 7: Monitor alignment
Poll GET /v1/flow/data-jobs/{id} to track progress.
detail = client.data_jobs.retrieve(data_job_id)
print("Status:", detail.status)
if detail.alignment_job:
    print("Alignment job:", detail.alignment_job.id)
print("Fine-tuning job IDs:", detail.fine_tuning_job_ids)
Step 8: Retrieve output files
Endpoint: GET /v1/flow/alignment/{job_id}/outputs
Once status shows completed, use this endpoint to retrieve metadata for all input and output files associated with the alignment job.
import os

import requests

# When calling the REST API directly, build the auth headers yourself;
# the environment variable name here is just an example.
headers = {"Authorization": f"Bearer {os.environ['SEEKR_API_KEY']}"}

alignment_job_id = detail.alignment_job.id
response = requests.get(
    f"https://flow.seekr.com/v1/flow/alignment/{alignment_job_id}/outputs",
    headers=headers,
)
outputs = response.json()
Sample response:
[
  {
    "id": "file-94ab4920-55e6-11f0-a791-96200332ab12",
    "filename": "freeway_facts_part_1.md",
    "type": "input",
    "purpose": "alignment",
    "bytes": 56826
  },
  {
    "id": "file-94d1a85e-55e6-11f0-a791-96200332ab12",
    "filename": "freeway_facts_part_2.md",
    "type": "input",
    "purpose": "alignment",
    "bytes": 1871
  },
  {
    "id": "file-94e397f8-55e6-11f0-a791-96200332ab12",
    "filename": "freeway_facts_part_3.md",
    "type": "input",
    "purpose": "alignment",
    "bytes": 5387
  },
  {
    "id": "file-f883ecfb-1353-4693-928f-0467b268b07b",
    "filename": "freeway_facts_part_1-20250630191733-raft-qa-pairs-messages.jsonl",
    "type": "output",
    "purpose": "alignment",
    "bytes": 0
  },
  {
    "id": "file-2d497c5c-55fa-4bbe-80a3-0b4eccda4d6f",
    "filename": "freeway_facts_part_1-20250630191733-raft-qa-pairs.parquet",
    "type": "output",
    "purpose": "fine-tune",
    "bytes": 324
  }
]
The response includes both input and output files. The output file with "purpose": "fine-tune" (the .parquet) is the file ID to use when creating a fine-tuning job.
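A small filter over that JSON pulls out the Parquet file ID directly:

```python
def parquet_output_id(outputs):
    """Return the ID of the fine-tune Parquet file from the outputs listing."""
    for entry in outputs:
        if entry["type"] == "output" and entry["purpose"] == "fine-tune":
            return entry["id"]
    raise ValueError("no fine-tune output in listing")
```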