
# Create instruction fine-tuning data

> Generate a QA pair dataset for instruction fine-tuning using the principle_files data job workflow.

This guide walks you through creating a `principle_files` data job, which generates a Parquet file of question-answer (QA) pairs that you can use for instruction fine-tuning. The workflow combines file attachment, ingestion, prompt configuration, and alignment in a single managed job.

**Before you start:** Upload your source files using the [Files API](/flow/sdk/data-engine/file-ingestion) and have your file IDs ready before Step 2. Not sure which approach fits your use case? See [Fine-tuning](/flow/components/fine-tuning).

## Step 1: Create a data job

**Endpoint:** `POST /v1/flow/data-jobs` [Submit data job](/flow/reference/submit_data_job_v1_flow_data_jobs_post)

Create a job shell with `job_type` set to `principle_files`.

<CodeGroup>
  ```python Python theme={null}
  from seekrai import SeekrFlow

  client = SeekrFlow()

  data_job = client.data_jobs.create(
      name="Customer support refresh",
      description="Prep PDFs and Markdown for Q4 fine-tuning",
      job_type="principle_files",
  )
  data_job_id = data_job.id
  print("Data job ID:", data_job_id)
  ```
</CodeGroup>

**Sample response:**

<CodeGroup>
  ```json JSON theme={null}
  {
      "id": "dj-1b75f4d5-5c9e-4d33-b164-a2393bc5ab6d",
      "name": "Customer support refresh",
      "job_type": "principle_files",
      "status": "needs_setup"
  }
  ```
</CodeGroup>

## Step 2: Add files to the job

**Endpoint:** `POST /v1/flow/data-jobs/{id}/add-files` [Add files to data job](/flow/reference/add_files_to_data_job_v1_flow_data_jobs__data_job_id__add_files_post)

Attach uploaded file IDs to the job. Non-Markdown files (PDF, DOCX, PPT) trigger ingestion automatically. Markdown files are marked as alignment-ready immediately.

<CodeGroup>
  ```python Python theme={null}
  data_job = client.data_jobs.add_files(
      data_job_id,
      file_ids=[
          "file-25e34f96-2130-11f0-9236-3e11346bffff",
          "file-efd0b334-2130-11f0-9236-3e11346bffff",
      ],
      method="accuracy-optimized",
  )
  print("Data job status:", data_job.status)
  ```
</CodeGroup>
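
Because Markdown and non-Markdown files take different paths once attached, it can help to check locally which files will go through ingestion. The following is a minimal sketch, not part of the SDK: `needs_ingestion` and `split_by_path` are hypothetical helpers, and treating only the `.md` extension as Markdown is an assumption.

```python
from pathlib import PurePath

# Assumption: only the .md extension counts as Markdown here. Adjust this set
# to match what the service actually recognizes.
MARKDOWN_SUFFIXES = {".md"}


def needs_ingestion(filename: str) -> bool:
    """Return True when a file is not Markdown and would go through ingestion."""
    return PurePath(filename).suffix.lower() not in MARKDOWN_SUFFIXES


def split_by_path(filenames: list[str]) -> tuple[list[str], list[str]]:
    """Split filenames into (to_ingest, alignment_ready) lists, keeping order."""
    to_ingest = [name for name in filenames if needs_ingestion(name)]
    ready = [name for name in filenames if not needs_ingestion(name)]
    return to_ingest, ready


to_ingest, ready = split_by_path(["handbook.pdf", "faq.md", "slides.ppt", "NOTES.MD"])
print("Will be ingested:", to_ingest)
print("Alignment-ready:", ready)
```

The split is only a local preview; the job status returned by the API remains the source of truth.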

<Info>
  **Ingestion mode and the UI**

  When ingesting files through the SeekrFlow UI, speed-optimized mode is always used. The SDK lets you choose between speed-optimized and accuracy-optimized.
</Info>

### Choose an ingestion method

| Method                         | Approx. time (100+ pages) | Description                                                                                        |
| ------------------------------ | ------------------------- | -------------------------------------------------------------------------------------------------- |
| `accuracy-optimized` (default) | \~30 min                  | Combines OCR and direct extraction, runs LLM hierarchy cleanup, and uses advanced table detection. |
| `speed-optimized`              | \~3 min                   | Faster extraction heuristics for large documents; smaller documents still use the richer pipeline. |

## Step 3: Monitor ingestion

Poll `GET /v1/flow/data-jobs/{id}` until the job status is no longer `file_processing`.

<CodeGroup>
  ```python Python theme={null}
  detail = client.data_jobs.retrieve(data_job_id)
  print("Status:", detail.status)

  for ingestion_job in detail.ingestion_jobs:
      for record in ingestion_job.records:
          print(f"  {record.filename}: {record.status}")
          if record.status == "failed":
              print(f"    Fix: {record.suggested_fix}")
  ```
</CodeGroup>

| Status            | Meaning                                                         |
| ----------------- | --------------------------------------------------------------- |
| `file_processing` | Ingestion still running — wait                                  |
| `needs_review`    | One or more files failed — resolve `suggested_fix` in `records` |
| `ready_to_start`  | Ingestion complete and prerequisites met                        |

If a file fails, fix the source, re-upload it, and attach it to the job again. To skip the file instead, remove it via `POST /v1/flow/data-jobs/{id}/remove-files`. See [Monitor ingestion](/flow/sdk/data-engine/monitor-ingestion) for a full reference on job states and error codes.
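
The polling described in this step can be wrapped in a small loop. This is a sketch under stated assumptions rather than an SDK feature: `wait_while` is a hypothetical helper, and the 15-second interval and one-hour timeout are arbitrary defaults. The status source is passed in as a function so the example runs without network access.

```python
import time
from typing import Callable


def wait_while(
    get_status: Callable[[], str],
    busy_status: str = "file_processing",
    interval_s: float = 15.0,   # assumed polling interval
    timeout_s: float = 3600.0,  # assumed upper bound on waiting
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_status until it stops returning busy_status, then return the new status."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status != busy_status:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {busy_status!r} after {timeout_s} seconds")
        sleep(interval_s)


# Demo with a stand-in status source. With the SDK you would pass something like:
#   lambda: client.data_jobs.retrieve(data_job_id).status
fake_statuses = iter(["file_processing", "file_processing", "ready_to_start"])
final_status = wait_while(lambda: next(fake_statuses), sleep=lambda _s: None)
print("Ingestion finished with status:", final_status)
```

After the loop returns, branch on the result: `needs_review` means inspecting the failed records, while `ready_to_start` means you can move on.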

## Step 4: Review and edit ingested Markdown

Download the generated Markdown files from the `files` array in the job detail and review for accuracy. Re-upload any corrected versions and remove the originals before starting alignment.

<CodeGroup>
  ```python Python theme={null}
  retrieve_resp = client.files.retrieve_content("file-dc6f19d1-b55a-43a0-a38a-56f0e7bd8d8d")
  print(f"Retrieved file: {retrieve_resp.id}")
  ```
</CodeGroup>

Once you've reviewed and edited, re-upload and attach the corrected file:

<CodeGroup>
  ```python Python theme={null}
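  upload_resp = client.files.upload("edited_example.md", purpose="alignment")
  print(f"Uploaded file with ID: {upload_resp.id}")

  # Attach the corrected file to the job, as in Step 2.
  data_job = client.data_jobs.add_files(data_job_id, file_ids=[upload_resp.id])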
  ```
</CodeGroup>

See [Prepare and ingest files](/flow/sdk/data-engine/file-ingestion) for Markdown editing guidelines and in-context learning considerations.

## Step 5: Set a system prompt

A `system_prompt` is required for `principle_files` jobs before alignment can start. You can generate one from high-level instructions or write your own.

### Generate a prompt

**Endpoint:** `POST /v1/flow/data-jobs/gen_system_prompt` [Generate system prompt](/flow/reference/generate_system_prompt_v1_flow_data_jobs_gen_system_prompt_post)

<Note>
  `gen_system_prompt` returns a suggested prompt but does not save it to the job. You must call `PATCH /v1/flow/data-jobs/{id}` to set it.
</Note>

<CodeGroup>
  ```python Python theme={null}
  result = client.data_jobs.generate_system_prompt(
      instructions="I want to train a chatbot to answer questions about horror movie tropes.",
  )
  print(result.system_prompt)
  ```
</CodeGroup>

### Set the prompt on the job

**Endpoint:** `PATCH /v1/flow/data-jobs/{id}` [Update data job](/flow/reference/patch_data_job_v1_flow_data_jobs__data_job_id__patch)

<CodeGroup>
  ```python Python theme={null}
  data_job = client.data_jobs.update(
      data_job_id,
      system_prompt=result.system_prompt,
  )
  print("Prompt set:", data_job.system_prompt)
  ```
</CodeGroup>

You can also write your own prompt directly via the PATCH endpoint without calling `gen_system_prompt` first.

<Note>
  Once alignment starts, the system prompt is locked. Make any changes before calling `/start`.
</Note>

## Step 6: Start alignment

**Endpoint:** `POST /v1/flow/data-jobs/{id}/start` [Start data job alignment](/flow/reference/start_data_job_alignment_v1_flow_data_jobs__data_job_id__start_post)

The `/start` endpoint enforces pre-flight validation before launching. Requirements for `principle_files`:

* `status` must be `ready_to_start`
* `system_prompt` must be set
* At least one processed Markdown file must be attached

<CodeGroup>
  ```python Python theme={null}
  detail = client.data_jobs.start(data_job_id)
  print("Alignment job:", detail.alignment_job.id, detail.alignment_job.status)
  ```
</CodeGroup>

If prerequisites are not met, the endpoint returns `422 Unprocessable Entity` with a descriptive message.
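
To catch a likely `422` before calling `/start`, the three requirements above can be checked locally against the job detail. This is a minimal sketch: `preflight_issues` is a hypothetical helper, and it assumes the job detail has been converted to a dict whose `files` entries each carry a `filename` key. A local check cannot confirm that a file has finished processing, so the server-side validation remains authoritative.

```python
def preflight_issues(job: dict) -> list[str]:
    """Return human-readable reasons the job is not ready for /start (empty if ready)."""
    issues = []
    if job.get("status") != "ready_to_start":
        issues.append(f"status is {job.get('status')!r}, expected 'ready_to_start'")
    if not (job.get("system_prompt") or "").strip():
        issues.append("system_prompt is not set")
    # Assumption: each entry in job["files"] is a dict with a "filename" key.
    markdown_files = [
        f for f in job.get("files", [])
        if str(f.get("filename", "")).lower().endswith(".md")
    ]
    if not markdown_files:
        issues.append("no Markdown file is attached")
    return issues


ready_job = {
    "status": "ready_to_start",
    "system_prompt": "You answer customer support questions.",
    "files": [{"filename": "faq.md"}],
}
blocked_job = {"status": "needs_review", "system_prompt": "", "files": []}

print(preflight_issues(ready_job))
for issue in preflight_issues(blocked_job):
    print("Blocked:", issue)
```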

## Step 7: Monitor alignment

Poll `GET /v1/flow/data-jobs/{id}` to track progress.

<CodeGroup>
  ```python Python theme={null}
  detail = client.data_jobs.retrieve(data_job_id)
  print("Status:", detail.status)
  if detail.alignment_job:
      print("Alignment job:", detail.alignment_job.id)
  print("Fine-tuning job IDs:", detail.fine_tuning_job_ids)
  ```
</CodeGroup>
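
Alignment can take a while, so repeated polling may benefit from a growing delay between checks. The sketch below shows one way to do that with capped exponential backoff. `backoff_delays` and `wait_until` are hypothetical helpers, the delay values are arbitrary, and `in_progress` in the demo is a placeholder rather than a documented status; only `completed` comes from this guide.

```python
import itertools
import time
from typing import Callable, Iterator


def backoff_delays(first_s: float = 5.0, factor: float = 2.0, cap_s: float = 120.0) -> Iterator[float]:
    """Yield delays that grow geometrically from first_s and level off at cap_s."""
    delay = first_s
    while True:
        yield min(delay, cap_s)
        delay = min(delay * factor, cap_s)


def wait_until(
    get_status: Callable[[], str],
    done: frozenset = frozenset({"completed"}),
    max_checks: int = 50,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Check the status up to max_checks times, sleeping with backoff between checks."""
    delays = backoff_delays()
    for attempt in range(max_checks):
        status = get_status()
        if status in done:
            return status
        if attempt < max_checks - 1:
            sleep(next(delays))
    raise TimeoutError(f"status not in {sorted(done)} after {max_checks} checks")


print(list(itertools.islice(backoff_delays(), 6)))  # [5.0, 10.0, 20.0, 40.0, 80.0, 120.0]

# Demo with a stand-in status source; "in_progress" is a placeholder status name.
# With the SDK you would pass: lambda: client.data_jobs.retrieve(data_job_id).status
fake_statuses = iter(["in_progress", "in_progress", "completed"])
print(wait_until(lambda: next(fake_statuses), sleep=lambda _s: None))
```

If the API reports other terminal statuses, add them to `done` so the loop stops on failure as well as on success.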

## Step 8: Retrieve output files

**Endpoint:** `GET /v1/flow/alignment/{job_id}/outputs`

Once `status` shows `completed`, use this endpoint to retrieve metadata for all input and output files associated with the alignment job. Here `{job_id}` is the alignment job ID (`detail.alignment_job.id`), not the data job ID.

<CodeGroup>
  ```python Python theme={null}
  import os
  import requests

  alignment_job_id = detail.alignment_job.id
  headers = {"Authorization": os.environ["SEEKR_API_KEY"]}
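  response = requests.get(
      f"https://flow.seekr.com/v1/flow/alignment/{alignment_job_id}/outputs",
      headers=headers,
      timeout=30,  # Avoid hanging indefinitely on a stalled connection
  )
  response.raise_for_status()  # Surface HTTP errors instead of parsing an error body
  outputs = response.json()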
  ```
</CodeGroup>

<CodeGroup>
  ```json JSON expandable theme={null}
  [
    {
      "id": "file-94ab4920-55e6-11f0-a791-96200332ab12",
      "filename": "freeway_facts_part_1.md",
      "type": "input",
      "purpose": "alignment",
      "bytes": 56826
    },
    {
      "id": "file-94d1a85e-55e6-11f0-a791-96200332ab12",
      "filename": "freeway_facts_part_2.md",
      "type": "input",
      "purpose": "alignment",
      "bytes": 1871
    },
    {
      "id": "file-94e397f8-55e6-11f0-a791-96200332ab12",
      "filename": "freeway_facts_part_3.md",
      "type": "input",
      "purpose": "alignment",
      "bytes": 5387
    },
    {
      "id": "file-f883ecfb-1353-4693-928f-0467b268b07b",
      "filename": "freeway_facts_part_1-20250630191733-raft-qa-pairs-messages.jsonl",
      "type": "output",
      "purpose": "alignment",
      "bytes": 0
    },
    {
      "id": "file-2d497c5c-55fa-4bbe-80a3-0b4eccda4d6f",
      "filename": "freeway_facts_part_1-20250630191733-raft-qa-pairs.parquet",
      "type": "output",
      "purpose": "fine-tune",
      "bytes": 324
    }
  ]
  ```
</CodeGroup>

The response includes both input and output files. When you create a fine-tuning job, use the ID of the output file with `"purpose": "fine-tune"` (the `.parquet` file).
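
Picking the right file ID out of the response can be done with a short filter. This is a sketch that assumes the response has the shape of the sample above; `fine_tune_file_ids` is a hypothetical helper, not an SDK function.

```python
def fine_tune_file_ids(outputs: list[dict]) -> list[str]:
    """Return IDs of output files whose purpose is "fine-tune", in response order."""
    return [
        item["id"]
        for item in outputs
        if item.get("type") == "output" and item.get("purpose") == "fine-tune"
    ]


# Trimmed copy of the sample response above.
outputs = [
    {
        "id": "file-94ab4920-55e6-11f0-a791-96200332ab12",
        "filename": "freeway_facts_part_1.md",
        "type": "input",
        "purpose": "alignment",
        "bytes": 56826,
    },
    {
        "id": "file-f883ecfb-1353-4693-928f-0467b268b07b",
        "filename": "freeway_facts_part_1-20250630191733-raft-qa-pairs-messages.jsonl",
        "type": "output",
        "purpose": "alignment",
        "bytes": 0,
    },
    {
        "id": "file-2d497c5c-55fa-4bbe-80a3-0b4eccda4d6f",
        "filename": "freeway_facts_part_1-20250630191733-raft-qa-pairs.parquet",
        "type": "output",
        "purpose": "fine-tune",
        "bytes": 324,
    },
]

print(fine_tune_file_ids(outputs))  # ['file-2d497c5c-55fa-4bbe-80a3-0b4eccda4d6f']
```

An empty result means no Parquet output was produced, which is worth checking before you create the fine-tuning job.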
