Data Preparation

Upload multiple files at once in their original formats. The AI-Ready Data Engine automatically extracts the relevant data and structures it into a dataset that can be used for fine-tuning.

Begin the data preparation process by gathering source documents for upload to the Data Engine. Aim to collect documents that are relevant to the domain, and that are structured in a clear and logical way, with headings and subheadings for readability.

Step 1: Preparing your file for upload

Supported file types

  • Markdown (.md)
  • PDF (.pdf)
  • Word Documents (.docx)
  • JSON (.json)

Length requirements

  • Your file should have at least 4 cohesive sections, each detailed enough to form a meaningful “chunk.”
  • Exact token counts are not required, but each section should contain enough substantive text to stand on its own as a chunk.
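
The length guideline above can be checked roughly before upload. The following is a minimal sketch, not part of SeekrFlow: it counts the sections in a Markdown file and the words directly under each heading. The helper names and the 50-word threshold are assumptions chosen for illustration, since this guide does not set an exact minimum.

```python
import re
from typing import List, Tuple

# ATX-style Markdown heading: one to six '#' characters, whitespace, a title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(\S.*)$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def section_word_counts(markdown_text: str) -> List[Tuple[str, int]]:
    """Return (heading title, word count of the text directly under it)."""
    sections: List[list] = []
    in_fence = False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # '#' inside a code fence is not a heading
            continue
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            sections.append([match.group(2).strip(), 0])
        elif sections:
            sections[-1][1] += len(line.split())
    return [(title, words) for title, words in sections]


def meets_length_guideline(markdown_text: str, min_sections: int = 4,
                           min_words: int = 50) -> bool:
    """True if at least `min_sections` sections hold `min_words` or more words.

    `min_words` is a hypothetical threshold, not a documented requirement.
    """
    counts = section_word_counts(markdown_text)
    substantive = [title for title, words in counts if words >= min_words]
    return len(substantive) >= min_sections
```

A file that fails this check is not necessarily rejected by the Data Engine; the check only flags documents that are likely too thin to produce useful chunks.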

File formatting guidelines

Before uploading, make sure your files are properly formatted by following these steps:

PDF and DOCX

  1. Use clear headings and avoid images without context.
  2. Ensure text content is structured logically for conversion.

Markdown

  1. Use correct header hierarchy (#, ##, ###, etc.).
  2. Avoid missing or empty headers and skipped levels.
  3. Limit headers to six levels (######).
  4. Ensure all sections have meaningful content under them.

JSON

  1. Validate the JSON structure using a tool like JSONLint.
  2. Ensure all key-value pairs are well-formed and relevant.
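
The Markdown and JSON guidelines above lend themselves to an automated local check. The sketch below is an illustration only, using the Python standard library rather than any SeekrFlow feature; the function names are hypothetical. It flags empty headers, skipped levels, headers deeper than six levels, and sections without content, and it uses the built-in json module as a local alternative to an online validator such as JSONLint.

```python
import json
import re
from typing import List, Optional

# A run of '#' followed by optional whitespace and a title.
HEADER_RE = re.compile(r"^(#+)(?:\s+(.*))?$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def check_markdown_headers(markdown_text: str) -> List[str]:
    """Return human-readable problems found in the header structure."""
    problems: List[str] = []
    prev_level = 0       # level of the most recent header (0 = none seen yet)
    prev_line = 0        # line number of the most recent header
    has_content = True   # whether the most recent header has body text
    in_fence = False
    for line_no, raw in enumerate(markdown_text.splitlines(), start=1):
        line = raw.strip()
        if line.startswith(FENCE):
            in_fence = not in_fence
            has_content = True  # a code block counts as section content
            continue
        if in_fence:
            continue
        match = HEADER_RE.match(line)
        if match is None:
            if line:
                has_content = True
            continue
        level = len(match.group(1))
        title = (match.group(2) or "").strip()
        if not title:
            problems.append(f"line {line_no}: empty header")
        if level > 6:
            problems.append(f"line {line_no}: header deeper than six levels")
        if prev_level and level > prev_level + 1:
            problems.append(
                f"line {line_no}: skipped from level {prev_level} to {level}")
        # A header followed directly by a same-level or higher-level header
        # has neither body text nor subsections.
        if prev_level and not has_content and level <= prev_level:
            problems.append(f"line {prev_line}: section has no content")
        prev_level, prev_line, has_content = level, line_no, False
    if prev_level and not has_content:
        problems.append(f"line {prev_line}: section has no content")
    return problems


def json_error(text: str) -> Optional[str]:
    """Return a description of the first JSON syntax error, or None if valid."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return f"line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return None
```

An empty list from check_markdown_headers and None from json_error mean the file passed these particular checks; they do not confirm that the content is relevant or well organized.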

Note: Google Docs users can export files as Markdown, PDF, or DOCX.

Once your files look good, proceed to the next step.


Step 2: Upload training files

Use SeekrFlow’s API to upload key documents and generate an AI-ready dataset. Documents can be in the following formats:

  • PDF
  • DOCX
  • Markdown
  • JSON

This dataset, a storage-efficient QA Pair Parquet file, can then be used to run a supervised instruction fine-tuning job.

Note: If the generated dataset is used together with user-provided training data, the two must be combined before they are submitted to the fine-tuning service.

The following example code shows you how to upload either a single file, or multiple files of varying types:

Upload a single file

Upload a single file, up to 4GB.

Endpoint: PUT /v1/flow/files Upload Training File

from seekrai import SeekrFlow
client = SeekrFlow(api_key="your-api-key")

# Upload a file
upload_resp = client.files.upload("example.pdf", purpose="alignment")
print(upload_resp.id)

Sample response

Uploading file example.pdf: 100%|█| 21.6M/21.6M [00:21<00:00, 1.02MB
file-25e34f96-2130-11f0-9236-3e11346bffff

If you uploaded PDF or DOCX files, use the returned file IDs to start an ingestion job, which produces a Markdown file you can use for alignment (see Step 3). If you uploaded a Markdown file, you can skip ingestion.

Upload multiple files

Upload multiple files up to 4GB each.

Note: File size limits differ between the API/SDK and the UI. The UI supports uploads of up to 10 files at 100 MB each.

The following example uploads two files, one PDF and one DOCX, simultaneously:

Endpoint: PUT /v1/flow/bulk_files Upload Multiple Training Files

from seekrai import SeekrFlow
client = SeekrFlow(api_key="your api key")

bulk_resp = client.files.bulk_upload(
    ["example1.pdf", "example2.docx"], purpose="alignment"
)

# Access and print the ID of each uploaded file
for resp in bulk_resp:
    print(resp.id)
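
Before calling the upload methods, it can help to check a batch against the file types and size limits stated in this guide. The following is a rough sketch with hypothetical helper names; it does not call SeekrFlow. It assumes decimal units (1 GB = 10**9 bytes), which is the more conservative reading because the guide does not say whether the limits are decimal or binary.

```python
from pathlib import Path
from typing import Iterable, List, Tuple

SUPPORTED_EXTENSIONS = {".md", ".pdf", ".docx", ".json"}
# Limits quoted in this guide; decimal units are an assumption.
API_MAX_BYTES = 4 * 10**9     # 4 GB per file through the API/SDK
UI_MAX_BYTES = 100 * 10**6    # 100 MB per file through the UI
UI_MAX_FILES = 10             # at most 10 files per UI upload


def check_batch(files: Iterable[Tuple[str, int]], via: str = "api") -> List[str]:
    """Return problems for a batch of (filename, size_in_bytes) pairs."""
    if via not in ("api", "ui"):
        raise ValueError("via must be 'api' or 'ui'")
    batch = list(files)
    max_bytes = API_MAX_BYTES if via == "api" else UI_MAX_BYTES
    problems: List[str] = []
    if via == "ui" and len(batch) > UI_MAX_FILES:
        problems.append(
            f"{len(batch)} files exceeds the UI limit of {UI_MAX_FILES}")
    for name, size in batch:
        if Path(name).suffix.lower() not in SUPPORTED_EXTENSIONS:
            problems.append(f"{name}: unsupported file type")
        if size > max_bytes:
            problems.append(f"{name}: {size} bytes exceeds the {via} limit")
    return problems


def sizes_on_disk(paths: Iterable[str]) -> List[Tuple[str, int]]:
    """Pair each local path with its size in bytes."""
    return [(p, Path(p).stat().st_size) for p in paths]
```

For local files, check_batch(sizes_on_disk(["example1.pdf", "example2.docx"])) returns an empty list when the batch is within the limits above.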

Step 3: Start an ingestion job (for PDFs and DOCX only)

Endpoint: POST /v1/flow/alignment/ingestion Multi-file Ingestion

If your original source was a PDF or DOCX, you must first convert it to Markdown.

If you already have a Markdown file, skip this step. Do not attempt to ingest Markdown files; doing so returns an error.

If your file contains tables, make sure they have no empty cells. An empty cell can cause the remaining content in the table to shift over to fill the gap.
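
When a batch mixes file types, one way to honor the PDF/DOCX-only rule above is to filter the uploaded file IDs by the original file extension. This is a small sketch with a hypothetical helper name; it assumes you kept a mapping from each local filename to the ID returned at upload time.

```python
from pathlib import Path
from typing import Dict, List

# Per this guide, only PDF and DOCX uploads go through ingestion.
INGESTIBLE_EXTENSIONS = {".pdf", ".docx"}


def ids_needing_ingestion(uploaded: Dict[str, str]) -> List[str]:
    """Given {local filename: uploaded file ID}, return the IDs to ingest."""
    return [
        file_id
        for name, file_id in uploaded.items()
        if Path(name).suffix.lower() in INGESTIBLE_EXTENSIONS
    ]
```

The returned list can be passed as the files argument to ingestion.ingest, as in the example that follows.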

from seekrai import SeekrFlow

client = SeekrFlow(api_key="your api key")
ingestion = client.ingestion

# Start ingestion on one or more uploaded file IDs
response = ingestion.ingest(files=["file-25e34f96-2130-11f0-9236-3e11346bffff", "file-efd0b334-2130-11f0-9236-3e11346bffff"])
print("Ingestion Job ID:", response.id)

Once ingestion is complete, you’ll receive a Markdown file that you can use for fine-tuning.


Step 4: Check ingestion status

Endpoint: GET /v1/flow/alignment/ingestion/{job_id} List all ingestion jobs

After starting ingestion, check the job status to confirm that it completed successfully.

List all jobs:

from seekrai import SeekrFlow

client = SeekrFlow(api_key="your api key")
ingestion = client.ingestion

# List all ingestion jobs
job_list = ingestion.list()
for job in job_list.data:
    print(job.id, job.status)

Retrieve an individual ingestion job:

# Retrieve your job's status (reuses the client created above)
job = client.ingestion.retrieve("ij-1234567890")
print("Job ID:", job.id)
print("Status:", job.status)

# Grab the new file IDs
output_ids = job.output_files
print("Ingestion created these files:", output_ids)

Once the status shows completed, the output files are available for the next step: generating data for standard instruction fine-tuning.

Sample response:

Job ID: ij-1234567890
Status: IngestionJobStatus.COMPLETED
Ingestion created these files: ['file-1234567890', 'file-0987654321']
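
Rather than checking the status by hand, the retrieve call shown above can be wrapped in a polling loop. The following is a minimal sketch under stated assumptions, not an official SeekrFlow utility: the function names, the polling interval, the timeout, and the "FAILED" status name are all hypothetical. Only the COMPLETED status and the retrieve method come from the examples in this guide. The retrieve function is passed in as an argument so the loop can be exercised without network access.

```python
import time
from typing import Any, Callable, Iterable


def status_name(status: Any) -> str:
    """Normalize an enum member or string to a bare name such as 'COMPLETED'."""
    name = getattr(status, "name", None) or str(status)
    return name.rsplit(".", 1)[-1].upper()


def wait_for_ingestion(
    retrieve: Callable[[str], Any],
    job_id: str,
    poll_seconds: float = 10.0,
    timeout_seconds: float = 1800.0,
    failure_names: Iterable[str] = ("FAILED",),  # assumed name, not documented
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Poll `retrieve(job_id)` until the job completes, fails, or times out."""
    failures = set(failure_names)
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = retrieve(job_id)
        name = status_name(job.status)
        if name == "COMPLETED":
            return job
        if name in failures:
            raise RuntimeError(f"ingestion job {job_id} ended with status {name}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"ingestion job {job_id} still {name} at timeout")
        sleep(poll_seconds)


# Usage with the client from the examples above (requires a valid API key):
#   job = wait_for_ingestion(client.ingestion.retrieve, "ij-1234567890")
#   print("Ingestion created these files:", job.output_files)
```

Check the statuses your account actually returns before relying on the failure detection, and adjust failure_names accordingly.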