Prepare and ingest files
Upload multiple files simultaneously in their original formats, and the AI-Ready Data Engine will automatically extract relevant data and structure it into a dataset that can be used for fine-tuning.
Begin the data preparation process by gathering source documents for upload to the Data Engine. Aim to collect documents that are relevant to the domain, and that are structured in a clear and logical way, with headings and subheadings for readability.
Step 1: Prepare your file for upload
Supported file types
- Markdown (.md)
- PDF (.pdf)
- Word documents (.docx, .doc)
- PowerPoint (.ppt, .pptx)
Length requirements
- Your file should have at least 4 cohesive sections, each detailed enough to form a meaningful “chunk.”
- Exact token counts are not mandatory, but each section should contain enough substantive text to stand on its own as a chunk.
File formatting guidelines
Before uploading, make sure your files are properly formatted by following these steps:
PDF and DOCX:
- Use clear headings and avoid images without context.
- Ensure text content is structured logically for conversion.
Markdown:
- Use correct header hierarchy (`#` H1, `##` H2, `###` H3, etc.).
- Limit headers to six levels (`######`).
- Maintain a clear, logical flow of information throughout.
- Avoid missing or empty headers and skipped levels.
- Ensure all sections have meaningful content under them.
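For reference, a minimal well-formed Markdown outline following these guidelines might look like this (the headings and content are illustrative):

```markdown
# Product Overview

Introductory paragraph with substantive content.

## Installation

Step-by-step installation details.

### Prerequisites

Required tools and versions.

## Configuration

Configuration options and examples.
```

Each header has meaningful content beneath it, and no heading level is skipped.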
Note: Google Docs users can export files as Markdown, PDF, or DOCX.
Once your files are in order, proceed to the next step.
Step 2: Upload files
Use SeekrFlow’s Files API to upload source documents. Documents can be in the following formats:
- PDF (.pdf)
- Word documents (.docx, .doc)
- Markdown (.md)
- PowerPoint (.ppt, .pptx)
The following examples show how to upload a single file or multiple files of varying types:
Upload a single file
Upload a single file, up to 4GB.
Endpoint: `PUT /v1/flow/files`

Upload Training File:
from seekrai import SeekrFlow
client = SeekrFlow(api_key="your-api-key")  # If your API key is stored as an environment variable, you can instantiate with empty parentheses instead: SeekrFlow()
# Upload a file
upload_resp = client.files.upload("example.pdf", purpose="alignment")
print(upload_resp.id)

Sample response:

Uploading file example.pdf: 100%|█| 21.6M/21.6M [00:21<00:00, 1.02MB
file-25e34f96-2130-11f0-9236-3e11346bffff

Upload multiple files
Upload multiple files up to 4GB each. The endpoint accepts an array of files as input.
Note: File size limits differ between the API/SDK and the UI; the UI supports uploads of up to 10 files at 100 MB each.
The following example uploads two files, one PDF and one DOCX, simultaneously:
Endpoint: `PUT /v1/flow/bulk_files`

Bulk upload files:
from seekrai import SeekrFlow
client = SeekrFlow()
bulk_resp = client.files.bulk_upload(
["example1.pdf", "example2.pdf"], purpose="alignment"
)
# Access and print the ID of each uploaded file
for resp in bulk_resp:
    print(resp.id)

List and delete files
Keep track of uploaded files and remove duplicates or erroneous uploads as needed.
List all uploaded files:
Endpoint: `GET /v1/flow/files`

List files:
# List all files
files_response = client.files.list()
for file in files_response.data:
    print(f"ID: {file.id}, Filename: {file.filename}")

Delete a file from the system:
Endpoint: `DELETE /v1/flow/files/{file_id}`

Delete file:
# Delete a file
client.files.delete(file_id)
print(f"Successfully deleted file {file_id}")

Step 3: Start an ingestion job (for PDFs, DOCX, DOC, and PPT/PPTX only)
Endpoint: POST /v1/flow/alignment/ingestion Multi-file Ingestion
Note: When using data jobs, ingestion is triggered automatically when you attach files via `POST /v1/flow/data-jobs/{id}/add-files`; you do not need to call this endpoint directly. See Create instruction fine-tuning data for the complete data job workflow.
If your original source is a PDF, DOCX, DOC, PPT, or PPTX file, run an ingestion job to convert it to Markdown. If you already have a Markdown file, skip this step.
Important: Do not ingest Markdown files; doing so will return an error.
If your file contains tables, make sure there are no empty cells. An empty cell can cause the remaining content in that row to shift over to fill the gap.
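For example, a table with every cell filled converts cleanly, whereas a missing cell risks shifting the row's remaining values left (the routes and figures below are illustrative):

```markdown
| Route   | Aircraft | Seats |
|---------|----------|-------|
| JFK–LAX | A321     | 196   |
| DFW–ORD | B737     | 172   |
```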
from seekrai import SeekrFlow
client = SeekrFlow(api_key="your-api-key")
ingestion = client.ingestion
# Start ingestion on one or more uploaded file IDs
response = ingestion.ingest(
files=["file-25e34f96-2130-11f0-9236-3e11346bffff", "file-efd0b334-2130-11f0-9236-3e11346bffff"],
method="accuracy-optimized"
)
print("Ingestion Job ID:", response.id)

Once ingestion is complete, you’ll receive a Markdown file that you can use for fine-tuning.
Ingestion mode and the UI: When ingesting files through the SeekrFlow UI, speed-optimized mode is always used. The SDK lets you choose between speed-optimized and accuracy-optimized.
Choose an ingestion method
Accuracy-optimized (default)
When you use method="accuracy-optimized" or omit the method parameter, the system prioritizes accuracy. Depending on what data is available in your PDF document (bookmarks, tables, text layers), the system combines multiple extraction techniques for best results.
Key features:
- Uses both OCR and direct text extraction, then blends them together
- Employs LLM agents to correct and enhance document hierarchy
- Applies advanced table detection algorithms for accurate table formatting
Note: Documents over 100 pages can take up to 30 minutes to process.
Speed-optimized
When you use method="speed-optimized", the system balances quality with processing speed. It automatically selects faster methods based on document size while maintaining reasonable accuracy for smaller documents.
Key features:
- Small documents still use high-accuracy methods
- Larger documents use speed-optimized algorithms to meet time constraints
Note: Optimized to complete in approximately 3 minutes regardless of document size.
Step 4: Check ingestion status
After starting ingestion, you can track job progress, view per-file statuses, and diagnose any failures. See Monitor ingestion for details on checking job states, interpreting file_records, and resolving errors.
Once status shows completed, your output file(s) will be available for the next step.
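As a sketch, a simple polling loop might look like the following. The `client.ingestion.retrieve` call and the `status` field are assumptions about the SDK, not confirmed API; consult the SeekrFlow reference (and the Monitor ingestion guide) for the exact method name and response shape.

```python
import time

# Terminal states an ingestion job can end in (assumed names; verify
# against the SeekrFlow API reference).
TERMINAL_STATES = {"completed", "failed", "cancelled"}

def is_terminal(status: str) -> bool:
    """Return True once an ingestion job has finished, successfully or not."""
    return status.lower() in TERMINAL_STATES

# Example polling loop (requires a configured SeekrFlow client and a job ID):
#
# while True:
#     job = client.ingestion.retrieve(ingestion_job_id)  # assumed method name
#     if is_terminal(job.status):
#         break
#     time.sleep(10)  # poll every 10 seconds
```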
Step 5: Review and edit ingested file
Before moving on to the next step, download the Markdown files created during ingestion to make sure all the information was transferred successfully. This is your opportunity to make any necessary edits in a Markdown editor, then re-upload your file for best results.
from seekrai import SeekrFlow
client = SeekrFlow()
retrieve_resp = client.files.retrieve_content("file-dc6f19d1-b55a-43a0-a38a-56f0e7bd8d8d")
file_id = retrieve_resp.id
print(f"Retrieved file with ID: {file_id}")

Markdown editing guidelines
Check to make sure standard Markdown formatting conventions are being followed:
- Clear hierarchy with appropriate heading levels (`#` H1, `##` H2, `###` H3, etc.)
- Logical flow of information from beginning to end
- Consistent formatting throughout the document
- Clear separation between sections using headings and whitespace
- Meaningful content below each header
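A short script can help sanity-check these conventions before re-uploading. This is a minimal sketch, not part of the SeekrFlow SDK: it flags skipped heading levels and empty headers, and skips over fenced code blocks.

```python
import re

def check_markdown_headers(md_text: str) -> list[str]:
    """Flag skipped heading levels and empty headers in a Markdown string."""
    issues = []
    prev_level = 0
    in_fence = False
    for lineno, line in enumerate(md_text.splitlines(), start=1):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # ignore headers inside code blocks
            continue
        if in_fence:
            continue
        m = re.match(r"^(#{1,6})\s*(.*)$", line)
        if not m:
            continue
        level, title = len(m.group(1)), m.group(2).strip()
        if not title:
            issues.append(f"line {lineno}: header has no text")
        if prev_level and level > prev_level + 1:
            issues.append(f"line {lineno}: skipped from H{prev_level} to H{level}")
        prev_level = level
    return issues
```

Run it on the downloaded Markdown and fix anything it reports before re-uploading.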
In addition, do a sweep for incorrect text or characters that may have been picked up during ingestion. For example:
# California State Univ University, Monter , Monterey Bay [Digital Commons @ CSUMB](https://digitalcommons.csumb.edu/)
## Movie Music: Film Soundtracks or Film Soundtracks Thr acks Throughout Hist oughout History

This file has picked up some erroneous text during ingestion. It's best to remove these errors to ensure the highest-quality QA pairs for fine-tuning (though a few minor artifacts, like extra line breaks, likely won't affect quality).
Once you're satisfied, save your file with a .md extension and upload it:
upload_resp = client.files.upload("edited_example.md", purpose="alignment")
file_id = upload_resp.id
print(f"Uploaded file with ID: {file_id}")

Sample response:

Uploading file edited_example.md: 100%|███████████████████████████████████| 41.8k/41.8k [00:01<00:00, 35.3kB/s]
Uploaded file with ID: file-9b9bf862-26cd-11f0-958d-52c2a8425a49

Keep the new file ID for the next step.
Step 6: In-context learning with Markdown (optional)
Once you’ve reviewed and edited your ingested Markdown (Step 5), you can use it directly in your prompts without fine-tuning. This works well for simpler, single documents that fit within a context window — for example, an airline’s customer service plan.
Example
The following Markdown file was generated by the AI-Ready Data Engine from a PDF source. Its clearly structured sections make it a good candidate for in-context learning.
# American Airlines Customer Service Plan
## Our Commitment
American Airlines and American Eagle are in business to provide safe, dependable and friendly air transportation.
## Accommodations for Unaccompanied Minors
Children 5–14 years old may travel under our unaccompanied minor (UMNR) service on nonstop or same-plane flights.
- Children 8–14 may travel on connecting flights via select airports (CLT, DFW, LAX, etc.).
- Children 15–17 may optionally use UMNR service.
- UMNR service is not available for codeshare or partner flights.
## Customers with Disabilities
American maintains Special Assistance Coordinators (SACs) to arrange accommodations such as:
- Pre-reserved seating
- Boarding assistance
- Wheelchair support
- In-cabin storage of assistive devices
## Flight Delays and Cancellations
For delays or cancellations:
- Rebooking on the next available flight is provided at no cost.
- Hotel and meal vouchers are issued if delays are caused by American and result in overnight stays.
- In cases of diversions, we provide transportation and accommodations as needed.
## AAdvantage® Program
Members earn miles for flights, credit card use, hotel stays, car rentals, and more. Miles can be redeemed for travel, upgrades, or donated.

Pass the file contents as context in your prompt:
You are a helpful customer support agent. Use only the following context to answer the question.
--- Context (american_airlines_customer_service.md) ---
<contents of american_airlines_customer_service.md>
Question: "What assistance does American Airlines provide if a flight is diverted to another city?"

The model uses the Markdown structure to locate the relevant section and return an answer, without any model retraining.
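The prompt above can also be assembled programmatically. This sketch only builds the prompt string; the chat call in the trailing comment assumes an OpenAI-style interface and an illustrative model ID, so verify both against the SeekrFlow SDK reference before use.

```python
def build_context_prompt(context_md: str, question: str, source_name: str) -> str:
    """Assemble an in-context prompt from a Markdown document and a question."""
    return (
        "You are a helpful customer support agent. "
        "Use only the following context to answer the question.\n\n"
        f"--- Context ({source_name}) ---\n"
        f"{context_md}\n\n"
        f'Question: "{question}"'
    )

# Usage with an OpenAI-style chat interface (illustrative, not verified):
#
# prompt = build_context_prompt(
#     open("american_airlines_customer_service.md").read(),
#     "What assistance does American Airlines provide if a flight is diverted?",
#     "american_airlines_customer_service.md",
# )
# response = client.chat.completions.create(
#     model="your-model-id",
#     messages=[{"role": "user", "content": prompt}],
# )
```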
When to use in-context learning
- Best for simple, single documents that fit within a context window (typically 4k–16k tokens)
- Reduces initial setup — no model training required
- Becomes impractical for large document sets or policies with complex conditional logic
- When you need to scale beyond a single document or require consistent precision, consider RAG or fine-tuning instead
