Prepare and ingest files
Upload multiple files simultaneously in their original formats, and the AI-Ready Data Engine will automatically extract relevant data and structure it into a dataset that can be used for fine-tuning.
Begin the data preparation process by gathering source documents for upload to the Data Engine. Aim to collect documents that are relevant to the domain, and that are structured in a clear and logical way, with headings and subheadings for readability.
Step 1: Prepare your file for upload
Supported file types
- Markdown (.md)
- PDF (.pdf)
- Word documents (.docx, .doc)
- PowerPoint (.ppt, .pptx)
Length requirements
- Your file should have at least 4 cohesive sections, each detailed enough to form a meaningful “chunk.”
- Exact token counts are not mandatory, but each section should contain enough substantive text to stand on its own as a chunk.
File formatting guidelines
Before uploading, make sure your files are properly formatted by following these steps:
PDF and DOCX:
- Use clear headings and avoid images without context.
- Ensure text content is structured logically for conversion.
Markdown:
- Use correct header hierarchy (`#` H1, `##` H2, `###` H3, etc.).
- Limit headers to six levels (`######`).
- Maintain a clear, logical flow of information throughout.
- Avoid missing or empty headers and skipped levels.
- Ensure all sections have meaningful content under them.
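For reference, a minimal well-formed Markdown outline following these guidelines might look like this (the headings and content are illustrative):

```markdown
# Product Overview

Introductory paragraph with substantive content.

## Installation

Step-by-step installation details.

### Prerequisites

Required tools and versions.

## Configuration

Configuration options and examples.
```

Each header has meaningful content beneath it, and no heading level is skipped.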
Note: Google Docs users can export files as Markdown, PDF, or DOCX.
Once your files are in order, proceed to the next step.
Step 2: Upload files
Use SeekrFlow’s Files API to upload source documents. Documents can be in the following formats:
- PDF (.pdf)
- Word documents (.docx, .doc)
- Markdown (.md)
- PowerPoint (.ppt, .pptx)
The following examples show how to upload a single file or multiple files of varying types:
Upload a single file
Upload a single file, up to 4GB.
Endpoint: `PUT /v1/flow/files`

Upload Training File:
from seekrai import SeekrFlow
client = SeekrFlow(api_key="your-api-key")  # If your API key is stored as an environment variable, you can instantiate with empty parentheses instead: SeekrFlow()
# Upload a file
upload_resp = client.files.upload("example.pdf", purpose="alignment")
print(upload_resp.id)

Sample response:

Uploading file example.pdf: 100%|█| 21.6M/21.6M [00:21<00:00, 1.02MB
file-25e34f96-2130-11f0-9236-3e11346bffff

Upload multiple files
Upload multiple files up to 4GB each. The endpoint accepts an array of files as input.
Note: File size limits differ between the API/SDK and the UI; the UI supports uploads of up to 10 files at 100 MB each.
The following example uploads two files, one PDF and one DOCX, simultaneously:
Endpoint: `PUT /v1/flow/bulk_files`

Bulk upload files:
from seekrai import SeekrFlow
client = SeekrFlow()
bulk_resp = client.files.bulk_upload(
["example1.pdf", "example2.pdf"], purpose="alignment"
)
# Access and print the ID of each uploaded file
for resp in bulk_resp:
    print(resp.id)

List and delete files
Keep track of uploaded files and remove duplicates or erroneous uploads as needed.
List all uploaded files:
Endpoint: `GET /v1/flow/files`

List files:
# List all files
files_response = client.files.list()
for file in files_response.data:
    print(f"ID: {file.id}, Filename: {file.filename}")

Delete a file from the system:
Endpoint: `DELETE /v1/flow/files/{file_id}`

Delete file:
# Delete a file
client.files.delete(file_id)
print(f"Successfully deleted file {file_id}")

Step 3: Start an ingestion job (for PDFs, DOCX, DOC, and PPT/PPTX only)
Endpoint: POST /v1/flow/alignment/ingestion Multi-file Ingestion
Note: When using data jobs, ingestion is triggered automatically when you attach files via `POST /v1/flow/data-jobs/{id}/add-files`; you do not need to call this endpoint directly. See Create instruction fine-tuning data for the complete data job workflow.
If your original source is a PDF, DOCX, DOC, PPT, or PPTX file, run an ingestion job to convert it to Markdown. If you already have a Markdown file, skip this step.
Important: Do not ingest Markdown files; doing so will return an error.
If your file contains tables, make sure there are no empty cells. An empty cell can cause the remaining content in that row to shift over to fill the gap.
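For example, a table with every cell filled converts cleanly, whereas a missing cell risks shifting the row's remaining values left (the routes and figures below are illustrative):

```markdown
| Route   | Aircraft | Seats |
|---------|----------|-------|
| JFK–LAX | A321     | 196   |
| DFW–ORD | B737     | 172   |
```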
from seekrai import SeekrFlow
client = SeekrFlow(api_key="your-api-key")
ingestion = client.ingestion
# Start ingestion on one or more uploaded file IDs
response = ingestion.ingest(
files=["file-25e34f96-2130-11f0-9236-3e11346bffff", "file-efd0b334-2130-11f0-9236-3e11346bffff"],
method="accuracy-optimized"
)
print("Ingestion Job ID:", response.id)

Once ingestion is complete, you’ll receive a Markdown file that you can use for fine-tuning.
Ingestion mode and the UI: When ingesting files through the SeekrFlow UI, speed-optimized mode is always used. The SDK lets you choose between speed-optimized and accuracy-optimized.
Choose an ingestion method
Accuracy-optimized (default)
When you use method="accuracy-optimized" or omit the method parameter, the system prioritizes accuracy. Depending on what data is available in your PDF document (bookmarks, tables, text layers), the system combines multiple extraction techniques for best results.
Key features:
- Uses both OCR and direct text extraction, then blends them together
- Employs LLM agents to correct and enhance document hierarchy
- Applies advanced table detection algorithms for accurate table formatting
Note: Documents over 100 pages can take up to 30 minutes to process.
Speed-optimized
When you use method="speed-optimized", the system balances quality with processing speed. It automatically selects faster methods based on document size while maintaining reasonable accuracy for smaller documents.
Key features:
- Small documents still use high-accuracy methods
- Larger documents use speed-optimized algorithms to meet time constraints
Note: Optimized to complete in approximately 3 minutes regardless of document size.
Step 4: Check ingestion status
After starting ingestion, you can track job progress, view per-file statuses, and diagnose any failures. See Monitor ingestion for details on checking job states, interpreting file_records, and resolving errors.
Once status shows completed, your output file(s) will be available for the next step.
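As a sketch, a simple polling loop might look like the following. The `client.ingestion.retrieve` call and the `status` field are assumptions about the SDK, not confirmed API; consult the SeekrFlow reference (and the Monitor ingestion guide) for the exact method name and response shape.

```python
import time

# Terminal states an ingestion job can end in (assumed names; verify
# against the SeekrFlow API reference).
TERMINAL_STATES = {"completed", "failed", "cancelled"}

def is_terminal(status: str) -> bool:
    """Return True once an ingestion job has finished, successfully or not."""
    return status.lower() in TERMINAL_STATES

# Example polling loop (requires a configured SeekrFlow client and a job ID):
#
# while True:
#     job = client.ingestion.retrieve(ingestion_job_id)  # assumed method name
#     if is_terminal(job.status):
#         break
#     time.sleep(10)  # poll every 10 seconds
```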
Step 5: Review and edit ingested file
Before moving on to the next step, download the Markdown files created during ingestion to make sure all the information was transferred successfully. This is your opportunity to make any necessary edits in a Markdown editor, then re-upload your file for best results.
from seekrai import SeekrFlow
client = SeekrFlow()
retrieve_resp = client.files.retrieve_content("file-dc6f19d1-b55a-43a0-a38a-56f0e7bd8d8d")
file_id = retrieve_resp.id
print(f"Retrieved file with ID: {file_id}")

Markdown editing guidelines
Check to make sure standard Markdown formatting conventions are being followed:
- Clear hierarchy with appropriate heading levels (`#` H1, `##` H2, `###` H3, etc.)
- Logical flow of information from beginning to end
- Consistent formatting throughout the document
- Clear separation between sections using headings and whitespace
- Meaningful content below each header
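A short script can help sanity-check these conventions before re-uploading. This is a minimal sketch, not part of the SeekrFlow SDK: it flags skipped heading levels and empty headers, and skips over fenced code blocks.

```python
import re

def check_markdown_headers(md_text: str) -> list[str]:
    """Flag skipped heading levels and empty headers in a Markdown string."""
    issues = []
    prev_level = 0
    in_fence = False
    for lineno, line in enumerate(md_text.splitlines(), start=1):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # ignore headers inside code blocks
            continue
        if in_fence:
            continue
        m = re.match(r"^(#{1,6})\s*(.*)$", line)
        if not m:
            continue
        level, title = len(m.group(1)), m.group(2).strip()
        if not title:
            issues.append(f"line {lineno}: header has no text")
        if prev_level and level > prev_level + 1:
            issues.append(f"line {lineno}: skipped from H{prev_level} to H{level}")
        prev_level = level
    return issues
```

Run it on the downloaded Markdown and fix anything it reports before re-uploading.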
In addition, do a sweep for incorrect text or characters that may have been picked up during ingestion. For example:
# California State Univ University, Monter , Monterey Bay [Digital Commons @ CSUMB](https://digitalcommons.csumb.edu/)
## Movie Music: Film Soundtracks or Film Soundtracks Thr acks Throughout Hist oughout History

This file has picked up some erroneous text during ingestion. It's best to remove these errors to ensure the highest-quality QA pairs for fine-tuning (though a few minor artifacts, like extra line breaks, likely won't affect quality).
Once you're satisfied, save your file with a .md extension and upload it:
upload_resp = client.files.upload("edited_example.md", purpose="alignment")
file_id = upload_resp.id
print(f"Uploaded file with ID: {file_id}")

Sample response:

Uploading file edited_example.md: 100%|███████████████████████████████████| 41.8k/41.8k [00:01<00:00, 35.3kB/s]
Uploaded file with ID: file-9b9bf862-26cd-11f0-958d-52c2a8425a49

Keep the new file ID for the next step.
Step 6: In-context learning with Markdown (optional)
Once you’ve reviewed and edited your ingested Markdown (Step 5), you can use it directly in your prompts without fine-tuning. This works well for simpler, single documents that fit within a context window — for example, an airline’s customer service plan.
Example
The following Markdown file was generated by the AI-Ready Data Engine from a PDF source. Its clearly structured sections make it a good candidate for in-context learning.
# American Airlines Customer Service Plan
## Our Commitment
American Airlines and American Eagle are in business to provide safe, dependable and friendly air transportation.
## Accommodations for Unaccompanied Minors
Children 5–14 years old may travel under our unaccompanied minor (UMNR) service on nonstop or same-plane flights.
- Children 8–14 may travel on connecting flights via select airports (CLT, DFW, LAX, etc.).
- Children 15–17 may optionally use UMNR service.
- UMNR service is not available for codeshare or partner flights.
## Customers with Disabilities
American maintains Special Assistance Coordinators (SACs) to arrange accommodations such as:
- Pre-reserved seating
- Boarding assistance
- Wheelchair support
- In-cabin storage of assistive devices
## Flight Delays and Cancellations
For delays or cancellations:
- Rebooking on the next available flight is provided at no cost.
- Hotel and meal vouchers are issued if delays are caused by American and result in overnight stays.
- In cases of diversions, we provide transportation and accommodations as needed.
## AAdvantage® Program
Members earn miles for flights, credit card use, hotel stays, car rentals, and more. Miles can be redeemed for travel, upgrades, or donated.

Pass the file contents as context in your prompt:
You are a helpful customer support agent. Use only the following context to answer the question.
--- Context (american_airlines_customer_service.md) ---
<contents of american_airlines_customer_service.md>
Question: "What assistance does American Airlines provide if a flight is diverted to another city?"

The model uses the Markdown structure to locate the relevant section and return an answer, without any model retraining.
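The prompt above can also be assembled programmatically. This sketch only builds the prompt string; the chat call in the trailing comment assumes an OpenAI-style interface and an illustrative model ID, so verify both against the SeekrFlow SDK reference before use.

```python
def build_context_prompt(context_md: str, question: str, source_name: str) -> str:
    """Assemble an in-context prompt from a Markdown document and a question."""
    return (
        "You are a helpful customer support agent. "
        "Use only the following context to answer the question.\n\n"
        f"--- Context ({source_name}) ---\n"
        f"{context_md}\n\n"
        f'Question: "{question}"'
    )

# Usage with an OpenAI-style chat interface (illustrative, not verified):
#
# prompt = build_context_prompt(
#     open("american_airlines_customer_service.md").read(),
#     "What assistance does American Airlines provide if a flight is diverted?",
#     "american_airlines_customer_service.md",
# )
# response = client.chat.completions.create(
#     model="your-model-id",
#     messages=[{"role": "user", "content": prompt}],
# )
```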
When to use in-context learning
- Best for simple, single documents that fit within a context window (typically 4k–16k tokens)
- Reduces initial setup — no model training required
- Becomes impractical for large document sets or policies with complex conditional logic
- When you need to scale beyond a single document or require consistent precision, consider RAG or fine-tuning instead
