Generate a QA pair dataset for instruction fine-tuning using the principle_files data job workflow.
This guide walks you through creating a principle_files data job, which generates a QA pair Parquet file you can use for instruction fine-tuning. The workflow bundles file upload, ingestion, prompt configuration, and alignment into a single managed job.
Before you start: Upload your source files using the Files API and have your file IDs ready before Step 2. Not sure which approach fits your use case? See Fine-tuning.
Endpoint: POST /v1/flow/data-jobs/{id}/add-files
Add files to data job
Attach uploaded file IDs to the job. Non-Markdown files (PDF, DOCX, PPT) trigger ingestion automatically. Markdown files are marked as alignment-ready immediately.
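To anticipate which attached files will go through ingestion, the rule above can be sketched locally. This is only an illustration: the helper name split_by_ingestion_need is hypothetical, and it assumes Markdown files are recognized by their .md or .markdown extension, which the service may or may not do.

```python
from pathlib import Path

# Assumption: Markdown is identified by file extension. The service may use
# other signals (for example, content type), so treat this as a local preview.
MARKDOWN_EXTENSIONS = {".md", ".markdown"}


def split_by_ingestion_need(filenames):
    """Partition filenames into (needs_ingestion, alignment_ready)."""
    needs_ingestion, alignment_ready = [], []
    for name in filenames:
        if Path(name).suffix.lower() in MARKDOWN_EXTENSIONS:
            alignment_ready.append(name)  # Markdown: alignment-ready immediately
        else:
            needs_ingestion.append(name)  # PDF, DOCX, PPT, ...: ingested first
    return needs_ingestion, alignment_ready


to_ingest, ready = split_by_ingestion_need(
    ["guide.pdf", "notes.md", "deck.ppt", "README.MD"]
)
print("Will be ingested:", to_ingest)
print("Alignment-ready:", ready)
```

Running a check like this before attaching files gives a rough idea of how many ingestion records to expect when you poll the job.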
Ingestion mode and the UI
When ingesting files through the SeekrFlow UI, speed-optimized mode is always used. The SDK lets you choose between speed-optimized and accuracy-optimized.
Poll GET /v1/flow/data-jobs/{id} until the job status is no longer file_processing.
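# `client` and `data_job_id` come from the earlier steps of this guide.
detail = client.data_jobs.retrieve(data_job_id)
print("Status:", detail.status)

# Print the per-file ingestion result, including the suggested fix for failures.
for ingestion_job in detail.ingestion_jobs:
    for record in ingestion_job.records:
        print(f"  {record.filename}: {record.status}")
        if record.status == "failed":
            print(f"    Fix: {record.suggested_fix}")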
Status | Meaning
file_processing | Ingestion still running; wait
needs_review | One or more files failed; resolve suggested_fix in records
ready_to_start | Ingestion complete and prerequisites met
If a file fails, fix the source, re-upload it, and attach it to the job again. To skip the file instead, remove it via POST /v1/flow/data-jobs/{id}/remove-files. See Monitor ingestion for a full reference on job states and error codes.
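The polling step can be wrapped in a small loop. The sketch below is not part of the SDK: wait_for_ingestion is a hypothetical helper, and the interval and timeout defaults are arbitrary. It takes any callable that returns the current job status, so it can be tried without network access.

```python
import time


def wait_for_ingestion(fetch_status, interval_s=5.0, timeout_s=600.0, sleep=time.sleep):
    """Call fetch_status() until it returns something other than 'file_processing'.

    Returns the first non-processing status (for example 'needs_review' or
    'ready_to_start'). Raises TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status != "file_processing":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("ingestion did not finish before the timeout")
        sleep(interval_s)


# With the SDK call shown earlier, usage would look roughly like:
#   status = wait_for_ingestion(lambda: client.data_jobs.retrieve(data_job_id).status)

# Self-contained demo with canned statuses instead of a real job.
canned = iter(["file_processing", "file_processing", "ready_to_start"])
final = wait_for_ingestion(lambda: next(canned), interval_s=0, sleep=lambda s: None)
print(final)
```

If the returned status is needs_review, inspect the ingestion records for suggested_fix before retrying.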
Download the generated Markdown files from the files array in the job detail and review for accuracy. Re-upload any corrected versions and remove the originals before starting alignment.
gen_system_prompt returns a suggested prompt but does not save it to the job. You must call PATCH /v1/flow/data-jobs/{id} to set it.
result = client.data_jobs.generate_system_prompt(
    instructions="I want to train a chatbot to answer questions about horror movie tropes.",
)
print(result.system_prompt)
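Because the suggested prompt is not saved automatically, a follow-up PATCH is needed. The sketch below only builds the request and does not send it. Several details are assumptions rather than documented facts: the base URL (https://api.example.com is a placeholder), the Bearer authorization scheme, and a JSON body with a system_prompt field. Check the API reference for the actual request shape.

```python
import json
import urllib.request


def build_set_prompt_request(base_url, data_job_id, system_prompt, api_key):
    """Build (but do not send) a PATCH request that sets the job's system prompt."""
    url = f"{base_url}/v1/flow/data-jobs/{data_job_id}"
    # Assumption: the endpoint accepts a JSON body with a `system_prompt` field.
    body = json.dumps({"system_prompt": system_prompt}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumption: Bearer-style auth; the real scheme may differ.
        "Authorization": f"Bearer {api_key}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="PATCH")


req = build_set_prompt_request(
    "https://api.example.com",  # placeholder base URL
    "job-123",                  # hypothetical data job ID
    "You answer questions about horror movie tropes.",
    "YOUR_API_KEY",
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```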
Endpoint: POST /v1/flow/data-jobs/{id}/start
Start data job alignment
The /start endpoint enforces pre-flight validation before launching. Requirements for principle_files:
status must be ready_to_start
system_prompt must be set
At least one processed Markdown file must be attached
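The three requirements can be mirrored in a local check to catch problems before calling /start. This is a sketch, not the server's validation logic: preflight_problems is a hypothetical helper, and it assumes the job detail is available as a dict with status, system_prompt, and a files list whose entries carry a filename.

```python
def preflight_problems(job):
    """Return a list of unmet /start requirements for a principle_files job dict."""
    problems = []
    if job.get("status") != "ready_to_start":
        problems.append("status must be ready_to_start")
    if not job.get("system_prompt"):
        problems.append("system_prompt must be set")
    # Assumption: processed Markdown files appear in `files` with a .md filename.
    files = job.get("files", [])
    if not any(f.get("filename", "").lower().endswith(".md") for f in files):
        problems.append("at least one processed Markdown file must be attached")
    return problems


job = {
    "status": "ready_to_start",
    "system_prompt": "",  # not yet saved via PATCH
    "files": [{"filename": "guide.md"}],
}
for problem in preflight_problems(job):
    print("Blocked:", problem)
```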
Endpoint: GET /v1/flow/alignment/{job_id}/outputs
Once the job status shows completed, use this endpoint to retrieve metadata for all input and output files associated with the alignment job.
The response includes both input and output files. The output file with "purpose": "fine-tune" (the .parquet) is the file ID to use when creating a fine-tuning job.
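Picking the right file ID out of the response might look like the following. The response shape here is assumed for illustration: a list of dicts with id, purpose, and filename keys. Only the "purpose": "fine-tune" marker comes from this guide; the sample IDs and the "alignment" purpose value are made up.

```python
def find_fine_tune_file_id(files):
    """Return the ID of the first file whose purpose is 'fine-tune', else None."""
    for f in files:
        if f.get("purpose") == "fine-tune":
            return f.get("id")
    return None


# Hypothetical response contents; real IDs and field names may differ.
sample_files = [
    {"id": "file-input-1", "purpose": "alignment", "filename": "notes.md"},
    {"id": "file-output-1", "purpose": "fine-tune", "filename": "qa_pairs.parquet"},
]

parquet_file_id = find_fine_tune_file_id(sample_files)
print("Use this file ID for fine-tuning:", parquet_file_id)
```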