Data Preparation

Upload multiple files simultaneously in their original formats and the AI-Ready Data Engine will automatically extract relevant data and structure them into a dataset that can be used for fine-tuning.

Begin the data preparation process by gathering source documents for upload to the Data Engine. Aim to collect documents that are relevant to the domain, and that are structured in a clear and logical way, with headings and subheadings for readability.

Step 1: Prepare your file for upload

Supported file types

  • Markdown (.md)
  • PDF (.pdf)
  • Word Documents (.docx)

Length requirements

  • Your file should have at least 4 cohesive sections, each detailed enough to form a meaningful “chunk.”
  • Exact token counts are not mandatory, but each section should contain enough substantive text to stand on its own as a chunk. (For Markdown sources, a quick automated check is sketched below.)
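For Markdown sources, you can sanity-check the section count before uploading with a short script like the following. This is an illustrative sketch, not part of SeekrFlow; the threshold mirrors the guideline above, and the filename is a placeholder.

import re

# Count H1/H2 headings as a rough proxy for cohesive sections
with open("example.md", encoding="utf-8") as f:
    text = f.read()

sections = re.findall(r"^#{1,2} .+", text, flags=re.MULTILINE)
print(f"Found {len(sections)} sections")
if len(sections) < 4:
    print("Warning: fewer than 4 sections; consider restructuring before upload.")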

File formatting guidelines

Before uploading, make sure your files are properly formatted by following these steps:

PDF and DOCX:

  1. Use clear headings and avoid images without context.
  2. Ensure text content is structured logically for conversion.


Markdown:

  1. Use correct header hierarchy (# H1, ## H2, ### H3, etc.).
  2. Limit headers to six levels (######).
  3. Maintain a clear, logical flow of information throughout.
  4. Avoid missing or empty headers and skipped levels.
  5. Ensure all sections have meaningful content under them.

Note: Google Docs users can export files as Markdown, PDF, or DOCX.

Once your files are in order, proceed to the next step.


Step 2: Upload files

Use SeekrFlow’s API to upload key documents and generate an AI-ready dataset. Documents can be in the following formats:

  • PDF
  • DOCX
  • Markdown

These files will eventually be ingested and processed into a storage-efficient QA Pair Parquet file, which can then be used to run a supervised instruction fine-tuning job.
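To give a sense of the end product: once QA generation completes (covered in the next section), the Parquet file can be inspected with standard tooling. This sketch assumes pandas with a Parquet engine installed; the filename is a placeholder.

import pandas as pd  # requires pyarrow or fastparquet for Parquet support

# Inspect the generated QA-pair dataset (filename is a placeholder)
df = pd.read_parquet("qa_pairs.parquet")
print(df.shape)
print(df.head())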

The following example code shows you how to upload either a single file, or multiple files of varying types:

Upload a single file

Upload a single file, up to 4GB.

Endpoint: PUT /v1/flow/files (Upload Training File)

from seekrai import SeekrFlow
# Note: If your API key is stored as an environment variable,
# you can instantiate SeekrFlow() with no arguments, as shown in later examples.
client = SeekrFlow(api_key="your-api-key")

# Upload a file
upload_resp = client.files.upload("example.pdf", purpose="alignment")
print(upload_resp.id)

Sample response

Uploading file example.pdf: 100%|█| 21.6M/21.6M [00:21<00:00, 1.02MB
file-25e34f96-2130-11f0-9236-3e11346bffff

If you're uploading PDF or DOCX files, use the generated file IDs to start an ingestion job that will produce a Markdown file you can use for alignment (see Step 3). If you uploaded a Markdown file, you can skip ingestion.

Upload multiple files

Upload multiple files, up to 4GB each. The endpoint accepts an array of files as input.

Note: File size limits differ between the API/SDK and the UI, which supports uploads of up to 10 files at 100 MB each.
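Since oversized files will fail to upload, a quick local check before a bulk upload can save time. This is an illustrative sketch, not part of the SDK:

import os

MAX_BYTES = 4 * 1024**3  # 4 GB per-file limit for API uploads

for path in ["example1.pdf", "example2.docx"]:
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        print(f"{path} is {size / 1024**3:.2f} GB; exceeds the 4 GB limit")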

The following example uploads two files, one PDF and one DOCX, simultaneously:

Endpoint: PUT /v1/flow/bulk_files (Upload Multiple Training Files)

from seekrai import SeekrFlow
client = SeekrFlow()

bulk_resp = client.files.bulk_upload(
    ["example1.pdf", "example2.pdf"], purpose="alignment"
)

# Access and print the ID of each uploaded file
for resp in bulk_resp:
    print(resp.id)

List and delete files

Sometimes in life, we all need to do a little housekeeping. Keep track of the files you've uploaded, and remove duplicates or erroneous uploads, with the following:

List all uploaded files:

Endpoint: GET /v1/flow/files (List all files)

# List all files
files_response = client.files.list()

for file in files_response.data:
    print(f"ID: {file.id}, Filename: {file.filename}")

Delete a file from the system:

Endpoint: DELETE /v1/flow/files/{file_id} (Delete training file)

# Delete a file by ID (use the ID returned when the file was uploaded)
file_id = "file-25e34f96-2130-11f0-9236-3e11346bffff"
client.files.delete(file_id)
print(f"Successfully deleted file {file_id}")

Step 3: Start an ingestion job (for PDFs and DOCX only)

Endpoint: POST /v1/flow/alignment/ingestion (Multi-file Ingestion)

If your original source was a PDF or DOCX, you must first convert it to Markdown.

If you already have a Markdown file, skip this step; do not attempt to ingest Markdown files, or you'll get an error.

If your file has embedded tables, make sure there are no empty cells. An empty cell can shift the rest of the row's content over to fill the gap, corrupting the table.
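If you want to verify this after ingestion, a short script can flag empty cells in the generated Markdown's pipe tables. This is an illustrative check, not part of the SDK; the filename is a placeholder.

# Flag empty cells in Markdown pipe tables
with open("ingested_example.md", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        stripped = line.strip()
        # Skip separator rows like |---|---| and non-table lines
        if stripped.startswith("|") and not set(stripped) <= set("|-: "):
            cells = [c.strip() for c in stripped.strip("|").split("|")]
            if any(cell == "" for cell in cells):
                print(f"Line {lineno}: empty table cell: {stripped}")

The ingestion call itself: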

from seekrai import SeekrFlow

client = SeekrFlow(api_key="your-api-key")
ingestion = client.ingestion

# Start ingestion on one or more uploaded file IDs
response = ingestion.ingest(
    files=[
        "file-25e34f96-2130-11f0-9236-3e11346bffff",
        "file-efd0b334-2130-11f0-9236-3e11346bffff",
    ]
)
print("Ingestion Job ID:", response.id)

Once ingestion is complete, you’ll receive a Markdown file that you can use for fine-tuning.


Step 4: Check ingestion status

After starting ingestion, check the job status to confirm that it completed successfully.

List all ingestion jobs:

Endpoint: GET /v1/flow/alignment/ingestion (List all ingestion jobs)

from seekrai import SeekrFlow

client = SeekrFlow(api_key="your-api-key")
ingestion = client.ingestion

# List all ingestion jobs
job_list = ingestion.list()
for job in job_list.data:
    print(job.id, job.status)

Retrieve an individual ingestion job:

Endpoint: GET /v1/flow/alignment/ingestion/{ingestion_job_id} (List specific ingestion job)

# Retrieve your job's status
job = client.ingestion.retrieve("ij-1234567890")
print("Job ID:", job.id)
print("Status:", job.status) 

# Grab the new file IDs
output_ids = job.output_files
print("Ingestion created these files:", output_ids)

Once the status shows COMPLETED, your output file(s) will be available for the next step: generating data for standard instruction fine-tuning.

Sample response:

Job ID: ij-1234567890
Status: IngestionJobStatus.COMPLETED
Ingestion created these files: ['file-1234567890', 'file-0987654321']
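Rather than re-running the status check by hand, a small polling loop can wait for the job to finish. This is a sketch: the terminal status names are assumptions based on the COMPLETED value shown above and may differ in your SDK version.

import time
from seekrai import SeekrFlow

client = SeekrFlow()

job_id = "ij-1234567890"
job = client.ingestion.retrieve(job_id)
# Poll until the job reaches a terminal state
while str(job.status).split(".")[-1] not in ("COMPLETED", "FAILED"):
    time.sleep(10)
    job = client.ingestion.retrieve(job_id)
print("Final status:", job.status)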

Step 5: Review and edit ingested file

Before moving on to the next step, download the Markdown files created during ingestion to make sure all the information was transferred successfully. This is your opportunity to make any necessary edits in a Markdown editor, then re-upload your file for best results.

from seekrai import SeekrFlow
client = SeekrFlow()

retrieve_resp = client.files.retrieve_content("file-dc6f19d1-b55a-43a0-a38a-56f0e7bd8d8d")

file_id = retrieve_resp.id
print(f"Retrieve file with ID: {file_id}")

Markdown editing guidelines

Check to make sure standard Markdown formatting conventions are being followed (an automated check is sketched after this list):

  • Clear hierarchy with appropriate heading levels (# H1, ## H2, ### H3, etc.)
  • Logical flow of information from beginning to end
  • Consistent formatting throughout the document
  • Clear separation between sections using headings and whitespace
  • Meaningful content below each header
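An illustrative script for catching skipped heading levels and empty headers before re-upload (not part of SeekrFlow; the filename is a placeholder):

import re

# Check heading hierarchy in a Markdown file
with open("edited_example.md", encoding="utf-8") as f:
    lines = f.readlines()

prev_level = 0
for lineno, line in enumerate(lines, start=1):
    match = re.match(r"^(#{1,6})\s+\S", line)
    if match:
        level = len(match.group(1))
        if level > prev_level + 1:
            print(f"Line {lineno}: skipped heading level (H{prev_level} -> H{level})")
        prev_level = level
    elif re.match(r"^#{1,6}\s*$", line):
        print(f"Line {lineno}: empty header")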

In addition, do a sweep for incorrect text or characters that may have been picked up during ingestion. For example:

# California State Univ University, Monter , Monterey Bay [Digital Commons @ CSUMB](https://digitalcommons.csumb.edu/)

## Movie Music: Film Soundtracks or Film Soundtracks Thr acks Throughout Hist oughout History

This file picked up some erroneous text during ingestion. It's best to remove these artifacts to ensure the highest-quality QA pairs for fine-tuning (though a few minor issues, like extra line breaks, likely won't affect quality).

Once you're satisfied, save your file with a .md extension and upload it:

upload_resp = client.files.upload("edited_example.md", purpose="alignment")

file_id = upload_resp.id
print(f"Uploaded file with ID: {file_id}")
Uploading file edited_example.md: 100%|███████████████████████████████████| 41.8k/41.8k [00:01<00:00, 35.3kB/s]
Uploaded file with ID: file-9b9bf862-26cd-11f0-958d-52c2a8425a49

Keep that new file ID for the next step: generating data for fine-tuning.


Step 6: In-context learning with Markdown (optional)

Once you’ve reviewed and edited your ingested Markdown (Step 5), you can use it directly in your prompts without fine-tuning. This can help in some cases; for example, when working with simpler documents—like an airline’s customer service plan.

Example

This is an example of a Markdown file generated by the AI-Ready Data Engine from a PDF or DOCX source. It contains clearly structured sections that support strong in-context learning.

# American Airlines Customer Service Plan

## Our Commitment
American Airlines and American Eagle are in business to provide safe, dependable and friendly air transportation.

## Accommodations for Unaccompanied Minors
Children 5–14 years old may travel under our unaccompanied minor (UMNR) service on nonstop or same-plane flights.

- Children 8–14 may travel on connecting flights via select airports (CLT, DFW, LAX, etc.).
- Children 15–17 may optionally use UMNR service.
- UMNR service is not available for codeshare or partner flights.

## Customers with Disabilities
American maintains Special Assistance Coordinators (SACs) to arrange accommodations such as:

- Pre-reserved seating
- Boarding assistance
- Wheelchair support
- In-cabin storage of assistive devices

## Flight Delays and Cancellations
For delays or cancellations:

- Rebooking on the next available flight is provided at no cost.
- Hotel and meal vouchers are issued if delays are caused by American and result in overnight stays.
- In cases of diversions, we provide transportation and accommodations as needed.

## AAdvantage® Program
Members earn miles for flights, credit card use, hotel stays, car rentals, and more. Miles can be redeemed for travel, upgrades, or donated.

This Markdown file can now be used as prompt context to answer questions without model retraining. The model uses the Markdown structure to directly locate the relevant policy section and return an answer.

You are a helpful customer support agent. Use only the following context to answer the question.

--- Context (american_airlines_customer_service.md) ---
<contents of american_airlines_customer_service.md>

Question: "What assistance does American Airlines provide if a flight is diverted to another city?"

When and why to use in-context learning

In-context learning can be beneficial, but has important limitations to consider:

  • It works for modest-sized documents that fit within context windows (typically 4k-16k tokens), but becomes ineffective for comprehensive policies or multiple documents
  • It can provide reasonably accurate answers when the information is clearly structured and straightforward, though precision decreases with complexity
  • It reduces initial development time compared to fine-tuning, though at the expense of consistency and scalability
  • It's best for early prototypes or limited-scope applications where perfect accuracy isn't critical

Realistic benefits

  • Reduced initial setup effort: You can implement a basic system without the engineering overhead of fine-tuning or complex retrieval systems
  • Improved structure utilization: Markdown formatting helps models navigate information better than raw text, though inconsistently
  • Stepping stone to more robust solutions: The same structured documents can be repurposed for RAG systems when you inevitably need to scale beyond context window limitations

Key considerations

Using structured Markdown files for in-context learning with customer policy documents is effective, but there are several important considerations to ensure optimal implementation.

  • Response quality varies significantly based on prompt design and how well document structure aligns with query patterns
  • Large documents must be heavily summarized or truncated to fit context windows
  • Complex policies with exceptions or conditional rules often lead to oversimplified or incorrect interpretations
  • Updates to source documents require manual intervention and complete prompt reconstruction

Implementation recommendations

  • Establish a validation process to ensure the Markdown accurately represents the original document
  • Thoroughly test for responses that deviate from policy, and track them
  • Include metadata sections with last-updated dates and policy version numbers (see the example after this list)
  • Consider hybrid approaches combining in-context learning with lightweight RAG for large documents
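For example, a lightweight metadata block at the top of the source document might look like this (the format and values are illustrative):

# American Airlines Customer Service Plan

Policy version: 4.2 (illustrative)
Last updated: 2025-01-15 (illustrative)

## Our Commitment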



Next

Generate a Parquet file with QA pairs that you can use to fine-tune a model.