Create instruction fine-tuning data

Generate a QA pair dataset for instruction fine-tuning using the principle_files data job workflow.

This guide walks you through creating a principle_files data job, which generates a QA pair Parquet file you can use for instruction fine-tuning. The workflow bundles file upload, ingestion, prompt configuration, and alignment into a single managed job.

Before you start: Upload your source files using the Files API and have your file IDs ready before Step 2. Not sure which approach fits your use case? See Fine-tuning.
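
If you haven't uploaded yet, a minimal example using the SDK client from Step 1 (the filename is illustrative; the upload call and purpose value match the one shown in Step 4):

from seekrai import SeekrFlow

client = SeekrFlow()

# Upload a source document; save the returned ID for Step 2.
upload_resp = client.files.upload("customer_support_faq.pdf", purpose="alignment")
print("File ID:", upload_resp.id)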


Step 1: Create a data job

Endpoint: POST /v1/flow/data-jobs (Submit data job)

Create a job shell with job_type set to principle_files.

from seekrai import SeekrFlow

client = SeekrFlow()

data_job = client.data_jobs.create(
    name="Customer support refresh",
    description="Prep PDFs and Markdown for Q4 fine-tuning",
    job_type="principle_files",
)
data_job_id = data_job.id
print("Data job ID:", data_job_id)

Sample response:

{
    "id": "dj-1b75f4d5-5c9e-4d33-b164-a2393bc5ab6d",
    "name": "Customer support refresh",
    "job_type": "principle_files",
    "status": "needs_setup"
}

Step 2: Add files to the job

Endpoint: POST /v1/flow/data-jobs/{id}/add-files (Add files to data job)

Attach uploaded file IDs to the job. Non-Markdown files (PDF, DOCX, PPT) trigger ingestion automatically. Markdown files are marked as alignment-ready immediately.

data_job = client.data_jobs.add_files(
    data_job_id,
    file_ids=[
        "file-25e34f96-2130-11f0-9236-3e11346bffff",
        "file-efd0b334-2130-11f0-9236-3e11346bffff",
    ],
    method="accuracy-optimized",
)
print("Data job status:", data_job.status)

ℹ️ Ingestion mode and the UI: When ingesting files through the SeekrFlow UI, speed-optimized mode is always used. The SDK lets you choose between speed-optimized and accuracy-optimized.

Choose an ingestion method

| Method | Approx. time (100+ pages) | Description |
| --- | --- | --- |
| accuracy-optimized (default) | ~30 min | Combines OCR and direct extraction, runs LLM hierarchy cleanup, and uses advanced table detection. |
| speed-optimized | ~3 min | Faster extraction heuristics for large documents; smaller documents still use the richer pipeline. |

Step 3: Monitor ingestion

Poll GET /v1/flow/data-jobs/{id} until the job status is no longer file_processing.

detail = client.data_jobs.retrieve(data_job_id)
print("Status:", detail.status)

for ingestion_job in detail.ingestion_jobs:
    for record in ingestion_job.records:
        print(f"  {record.filename}: {record.status}")
        if record.status == "failed":
            print(f"    Fix: {record.suggested_fix}")

| Status | Meaning |
| --- | --- |
| file_processing | Ingestion still running; wait |
| needs_review | One or more files failed; resolve suggested_fix in records |
| ready_to_start | Ingestion complete and prerequisites met |
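
A minimal polling loop, as a sketch (the sleep interval is illustrative):

import time

# Poll until ingestion leaves the in-progress state.
while True:
    detail = client.data_jobs.retrieve(data_job_id)
    if detail.status != "file_processing":
        break
    time.sleep(30)  # illustrative interval
print("Ingestion finished with status:", detail.status)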

If a file fails, fix the source, re-upload it, and attach it to the job again. To skip the file instead, remove it via POST /v1/flow/data-jobs/{id}/remove-files. See Monitor ingestion for a full reference on job states and error codes.
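
To skip a file from the SDK, a sketch; the remove_files method name is an assumption that mirrors add_files and the REST endpoint above:

data_job = client.data_jobs.remove_files(  # assumed SDK method name
    data_job_id,
    file_ids=["file-25e34f96-2130-11f0-9236-3e11346bffff"],  # illustrative failed file
)
print("Data job status:", data_job.status)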


Step 4: Review and edit ingested Markdown

Download the generated Markdown files from the files array in the job detail and review for accuracy. Re-upload any corrected versions and remove the originals before starting alignment.
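
To locate the generated files, you can iterate the job's files array; a sketch, with field names assumed to match the Step 8 listing:

detail = client.data_jobs.retrieve(data_job_id)

# List attached files to find the generated Markdown (field names assumed).
for f in detail.files:
    print(f.id, f.filename)

Then download an individual file's content for review: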

# Download an ingested Markdown file (ID taken from the files array)
retrieve_resp = client.files.retrieve_content("file-dc6f19d1-b55a-43a0-a38a-56f0e7bd8d8d")
print(f"Retrieved file: {retrieve_resp.id}")

Once you've reviewed and edited, re-upload and attach the corrected file:

upload_resp = client.files.upload("edited_example.md", purpose="alignment")
print(f"Uploaded file with ID: {upload_resp.id}")

See Prepare and ingest files for Markdown editing guidelines and in-context learning considerations.


Step 5: Set a system prompt

A system_prompt is required for principle_files jobs before alignment can start. You can generate one from high-level instructions or write your own.

Generate a prompt

Endpoint: POST /v1/flow/data-jobs/gen_system_prompt (Generate system prompt)

ℹ️ Note: gen_system_prompt returns a suggested prompt but does not save it to the job. You must call PATCH /v1/flow/data-jobs/{id} to set it.

result = client.data_jobs.generate_system_prompt(
    instructions="I want to train a chatbot to answer questions about horror movie tropes.",
)
print(result.system_prompt)

Set the prompt on the job

Endpoint: PATCH /v1/flow/data-jobs/{id} (Update data job)

data_job = client.data_jobs.update(
    data_job_id,
    system_prompt=result.system_prompt,
)
print("Prompt set:", data_job.system_prompt)

You can also write your own prompt directly via the PATCH endpoint without calling gen_system_prompt first.
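
For example, setting a hand-written prompt (the prompt text is illustrative):

data_job = client.data_jobs.update(
    data_job_id,
    system_prompt=(
        "You are a film-history assistant. Answer questions about horror "
        "movie tropes concisely, citing the films you reference."
    ),
)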

ℹ️ Note: Once alignment starts, the system prompt is locked. Make any changes before calling /start.


Step 6: Start alignment

Endpoint: POST /v1/flow/data-jobs/{id}/start (Start data job alignment)

The /start endpoint enforces pre-flight validation before launching. Requirements for principle_files:

  • status must be ready_to_start
  • system_prompt must be set
  • At least one processed Markdown file must be attached

detail = client.data_jobs.start(data_job_id)
print("Alignment job:", detail.alignment_job.id, detail.alignment_job.status)

If prerequisites are not met, the endpoint returns 422 Unprocessable Entity with a descriptive message.
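
To surface the validation message in code, a sketch; the SDK's specific exception class isn't documented here, so this catches broadly:

try:
    detail = client.data_jobs.start(data_job_id)
except Exception as err:  # replace with the SDK's specific error type if known
    print("Pre-flight validation failed:", err)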


Step 7: Monitor alignment

Poll GET /v1/flow/data-jobs/{id} to track progress.

detail = client.data_jobs.retrieve(data_job_id)
print("Status:", detail.status)
if detail.alignment_job:
    print("Alignment job:", detail.alignment_job.id)
print("Fine-tuning job IDs:", detail.fine_tuning_job_ids)

Step 8: Retrieve output files

Endpoint: GET /v1/flow/alignment/{job_id}/outputs

Once status shows completed, use this endpoint to retrieve metadata for all input and output files associated with the alignment job.

import os
import requests

# Authorization header; the environment variable name here is illustrative.
headers = {"Authorization": f"Bearer {os.environ['SEEKR_API_KEY']}"}

alignment_job_id = detail.alignment_job.id
response = requests.get(
    f"https://flow.seekr.com/v1/flow/alignment/{alignment_job_id}/outputs",
    headers=headers,
)
outputs = response.json()

Sample response:
[
  {
    "id": "file-94ab4920-55e6-11f0-a791-96200332ab12",
    "filename": "freeway_facts_part_1.md",
    "type": "input",
    "purpose": "alignment",
    "bytes": 56826
  },
  {
    "id": "file-94d1a85e-55e6-11f0-a791-96200332ab12",
    "filename": "freeway_facts_part_2.md",
    "type": "input",
    "purpose": "alignment",
    "bytes": 1871
  },
  {
    "id": "file-94e397f8-55e6-11f0-a791-96200332ab12",
    "filename": "freeway_facts_part_3.md",
    "type": "input",
    "purpose": "alignment",
    "bytes": 5387
  },
  {
    "id": "file-f883ecfb-1353-4693-928f-0467b268b07b",
    "filename": "freeway_facts_part_1-20250630191733-raft-qa-pairs-messages.jsonl",
    "type": "output",
    "purpose": "alignment",
    "bytes": 0
  },
  {
    "id": "file-2d497c5c-55fa-4bbe-80a3-0b4eccda4d6f",
    "filename": "freeway_facts_part_1-20250630191733-raft-qa-pairs.parquet",
    "type": "output",
    "purpose": "fine-tune",
    "bytes": 324
  }
]

The response includes both input and output files. Use the ID of the output file with "purpose": "fine-tune" (the .parquet) as the training file when creating a fine-tuning job.
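
To pull that ID out programmatically from the response above:

# Select the fine-tune output (the .parquet training file).
parquet_ids = [f["id"] for f in outputs if f["purpose"] == "fine-tune"]
print("Training file ID:", parquet_ids[0])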