Step 1: Prepare your file for upload
Supported file types
- Markdown (.md)
- PDF (.pdf)
- Word Documents (.docx, .doc)
- PowerPoint (.ppt, .pptx)
Length requirements
- Your file should have at least 4 cohesive sections, each detailed enough to form a meaningful “chunk.”
- No exact token count is required, but each section should contain enough substantive text to stand on its own as a chunk.
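The length requirements can be checked before upload with a short script. The following is a minimal sketch for Markdown files, not part of SeekrFlow: the function names and the 50-word threshold are assumptions made for illustration, since the guide sets no exact count. It also treats any line starting with one to six `#` characters as a heading, so `#` comments inside fenced code would be miscounted.

```python
import re

MIN_SECTIONS = 4             # from the length requirement above
MIN_WORDS_PER_SECTION = 50   # hypothetical threshold; the guide gives no number

HEADING_RE = re.compile(r"^#{1,6}\s+\S")


def split_sections(markdown_text):
    """Split Markdown into (heading, body) pairs at each heading line."""
    sections = []
    heading, body = None, []
    for line in markdown_text.splitlines():
        if HEADING_RE.match(line):
            if heading is not None:
                sections.append((heading, "\n".join(body)))
            heading, body = line.strip(), []
        elif heading is not None:
            body.append(line)
    if heading is not None:
        sections.append((heading, "\n".join(body)))
    return sections


def meets_length_requirements(markdown_text, min_words=MIN_WORDS_PER_SECTION):
    """Return True when at least MIN_SECTIONS sections have min_words words each."""
    substantive = [
        heading
        for heading, body in split_sections(markdown_text)
        if len(body.split()) >= min_words
    ]
    return len(substantive) >= MIN_SECTIONS
```

Adjust `min_words` to whatever counts as a meaningful chunk for your content.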
File formatting guidelines
Before uploading, make sure your files are properly formatted by following these guidelines.
PDF and DOCX:
- Use clear headings and avoid images without context.
- Ensure text content is structured logically for conversion.
Example PDF
Markdown:
- Use correct header hierarchy (# H1, ## H2, ### H3, etc.).
- Limit headers to six levels (######).
- Keep a clear, logical flow of information throughout.
- Avoid missing or empty headers and skipped levels.
- Ensure all sections have meaningful content under them.
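The header rules above lend themselves to an automated check. Below is a minimal sketch, assuming `#`-style headings and triple-backtick code fences; `check_headings` is a hypothetical helper, not a SeekrFlow function. One design choice to note: a heading counts as an empty section only when it is followed by a heading of the same or a shallower level (or the end of the file) with no text in between, so a parent heading followed directly by its first subsection is not flagged.

```python
import re

HEADING = re.compile(r"^(#+)(?:[ \t]+(.*))?$")


def check_headings(markdown_text):
    """Return (line_number, message) pairs for heading problems in Markdown text."""
    problems = []
    in_fence = False
    last = None  # (line_number, level, has_body) of the most recent heading
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match is None:
            # Any non-blank, non-heading line counts as content for the section.
            if last is not None and line.strip():
                last = (last[0], last[1], True)
            continue
        level = len(match.group(1))
        title = (match.group(2) or "").strip()
        if level > 6:
            problems.append((number, "more than six heading levels"))
        if not title:
            problems.append((number, "empty heading"))
        if last is not None:
            if level > last[1] + 1:
                problems.append((number, f"skipped level: H{last[1]} to H{level}"))
            if not last[2] and level <= last[1]:
                problems.append((last[0], "section has no content"))
        last = (number, level, False)
    if last is not None and not last[2]:
        problems.append((last[0], "section has no content"))
    return problems
```

An empty list means none of these problems were found; it does not judge whether the content itself flows logically.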
Google Docs users can export files as Markdown, PDF, or DOCX.
Step 2: Upload files
Use SeekrFlow’s Files API to upload source documents. Documents can be in the following formats:
- PDF
- DOCX
- DOC
- Markdown
- PowerPoint (.ppt, .pptx)
Upload a single file
Upload a single file, up to 4GB.
Endpoint: PUT /v1/flow/files (Upload Training File)
Sample response
Upload multiple files
Upload multiple files, up to 4GB each. The endpoint accepts an array of files as input.
File size limits differ between the API/SDK and the UI: the UI supports uploads of up to 10 files, each up to 100MB.
Endpoint: PUT /v1/flow/bulk_files (Bulk upload files)
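Because the limits differ by upload path, it can help to check a batch locally before sending anything. The sketch below is an illustration only: `preflight` is a hypothetical helper, and reading "4GB" and "100MB" as binary units (GiB and MiB) is an assumption, since the guide does not say which is meant.

```python
from pathlib import PurePath

SUPPORTED_EXTENSIONS = {".md", ".pdf", ".docx", ".doc", ".ppt", ".pptx"}

# Assumption: sizes are interpreted as binary units.
API_MAX_BYTES = 4 * 1024**3    # 4GB per file via the API/SDK
UI_MAX_BYTES = 100 * 1024**2   # 100MB per file via the UI
UI_MAX_FILES = 10              # files per upload via the UI


def preflight(files, via_ui=False):
    """Return a list of problems for (filename, size_in_bytes) pairs."""
    problems = []
    limit = UI_MAX_BYTES if via_ui else API_MAX_BYTES
    if via_ui and len(files) > UI_MAX_FILES:
        problems.append(f"{len(files)} files exceeds the UI limit of {UI_MAX_FILES}")
    for name, size in files:
        ext = PurePath(name).suffix.lower()
        if ext not in SUPPORTED_EXTENSIONS:
            problems.append(f"{name}: unsupported file type {ext or '(none)'}")
        if size > limit:
            problems.append(f"{name}: {size} bytes exceeds the {limit}-byte limit")
    return problems
```

For example, `preflight([("plan.pdf", 200 * 1024**2)], via_ui=True)` reports a size problem, while the same file passes the API/SDK check.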
List and delete files
Keep track of uploaded files and remove duplicates or erroneous uploads as needed.
List all uploaded files:
Endpoint: GET /v1/flow/files (List files)
Delete a file:
Endpoint: DELETE /v1/flow/files/{file_id} (Delete file)
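One way to find duplicates is to group the listed files by name and keep only the newest copy. The sketch below makes no network calls and assumes a record shape that this guide does not document: each record is taken to be a dict with hypothetical `id`, `filename`, and `created_at` (ISO 8601 string) keys. Check the actual List files response before adapting it.

```python
from collections import defaultdict


def find_duplicate_ids(records):
    """Return the IDs of every copy of a filename except the newest one.

    Assumption: records are dicts with "id", "filename", and "created_at"
    keys, where "created_at" is an ISO 8601 string in a single format.
    """
    by_name = defaultdict(list)
    for record in records:
        by_name[record["filename"]].append(record)
    duplicates = []
    for copies in by_name.values():
        copies.sort(key=lambda r: r["created_at"])
        duplicates.extend(r["id"] for r in copies[:-1])  # keep the newest copy
    return duplicates


def delete_paths(records):
    """Build the DELETE path for each duplicate file ID."""
    return [f"/v1/flow/files/{file_id}" for file_id in find_duplicate_ids(records)]
```

Review the resulting paths before issuing any DELETE requests, since identical filenames do not always mean identical content.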
Step 3: Start an ingestion job (for PDFs, DOCX, DOC, and PPT/PPTX only)
Endpoint: POST /v1/flow/alignment/ingestion (Multi-file Ingestion)
When using data jobs, ingestion is triggered automatically when you attach files via POST /v1/flow/data-jobs/{id}/add-files, so you do not need to call the ingestion endpoint directly. See Create instruction fine-tuning data for the complete data job workflow.
Ingestion mode and the UI
When ingesting files through the SeekrFlow UI, speed-optimized mode is always used. The SDK lets you choose between speed-optimized and accuracy-optimized.
Choose an ingestion method
Accuracy-optimized (default)
When you use method="accuracy-optimized" or omit the method parameter, the system prioritizes accuracy. Depending on what data is available in your PDF document (bookmarks, tables, text layers), the system combines multiple extraction techniques for best results.
Key features:
- Uses both OCR and direct text extraction, then blends them together
- Employs LLM agents to correct and enhance document hierarchy
- Applies advanced table detection algorithms for accurate table formatting
Documents over 100 pages can take up to 30 minutes to process.
Speed-optimized
When you use method="speed-optimized", the system balances quality with processing speed. It automatically selects faster methods based on document size while maintaining reasonable accuracy for smaller documents.
Key features:
- Small documents still use high-accuracy methods
- Larger documents use speed-optimized algorithms to meet time constraints
This mode is optimized to complete in approximately 3 minutes, regardless of document size.
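The trade-off between the two methods can be written down as a simple rule of thumb. The sketch below is one possible heuristic, not SeekrFlow's own logic: `choose_method` is a hypothetical helper. It falls back to speed-optimized only when a document is over 100 pages and the time budget is under the 30-minute worst case quoted above. It assumes shorter documents finish acceptably fast in accuracy-optimized mode, which the guide does not state.

```python
ACCURACY = "accuracy-optimized"
SPEED = "speed-optimized"

LARGE_DOC_PAGES = 100             # page threshold mentioned for accuracy-optimized
ACCURACY_WORST_CASE_MINUTES = 30  # "up to 30 minutes" for large documents


def choose_method(page_count, time_budget_minutes):
    """Pick an ingestion method value from document size and available time."""
    is_large = page_count > LARGE_DOC_PAGES
    is_rushed = time_budget_minutes < ACCURACY_WORST_CASE_MINUTES
    if is_large and is_rushed:
        return SPEED
    return ACCURACY
```

The returned string is the value you would pass as the method parameter when starting ingestion through the SDK.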
Step 4: Check ingestion status
After starting ingestion, you can track job progress, view per-file statuses, and diagnose any failures. See Monitor ingestion for details on checking job states, interpreting file_records, and resolving errors.
Once status shows completed, your output file(s) will be available for the next step.
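Waiting for the completed status is usually done with a polling loop. The sketch below is generic and makes no SeekrFlow calls itself: `get_status` is a callable you supply that returns the job's current status string. Only "completed" comes from this guide; the "failed" state name is an assumption, so check Monitor ingestion for the actual job states. The default timeout mirrors the 30-minute upper bound quoted for accuracy-optimized ingestion.

```python
import time

FAILURE_STATES = {"failed"}  # assumption: the real failure state names may differ


def wait_for_ingestion(get_status, poll_seconds=10.0, timeout_seconds=1800.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns "completed", fails, or times out."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status == "completed":
            return status
        if status in FAILURE_STATES:
            raise RuntimeError(f"ingestion ended with status {status!r}")
        if clock() >= deadline:
            raise TimeoutError(f"ingestion still {status!r} after {timeout_seconds}s")
        sleep(poll_seconds)
```

The `sleep` and `clock` parameters exist so the loop can be tested without real delays.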
Step 5: Review and edit ingested file
Before moving on to the next step, download the Markdown files created during ingestion to make sure all the information was transferred successfully. This is your opportunity to make any necessary edits in a Markdown editor, then re-upload your file for best results.
Markdown editing guidelines
Check to make sure standard Markdown formatting conventions are being followed:
- Clear hierarchy with appropriate heading levels (# H1, ## H2, ### H3, etc.)
- Logical flow of information from beginning to end
- Consistent formatting throughout the document
- Clear separation between sections using headings and whitespace
- Meaningful content below each header
Minor issues such as extra line breaks likely won’t affect quality.
Once you’re satisfied, save your file with a .md extension and upload it using the Files API (see Step 2).
Step 6: In-context learning with Markdown (optional)
Once you’ve reviewed and edited your ingested Markdown (Step 5), you can use it directly in your prompts without fine-tuning. This works well for simpler, single documents that fit within a context window, for example an airline’s customer service plan.
Example
The following Markdown file was generated by the AI-Ready Data Engine from a PDF source. Its clearly structured sections make it a good candidate for in-context learning.
When to use in-context learning
- Best for simple, single documents that fit within a context window (typically 4k–16k tokens)
- Reduces initial setup — no model training required
- Becomes impractical for large document sets or policies with complex conditional logic
- When you need to scale beyond a single document or require consistent precision, consider RAG or fine-tuning instead
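Whether a document fits the context window can be estimated before you build a prompt. The sketch below is illustrative only: the prompt wording, the function names, and the four-characters-per-token estimate are all assumptions, and a real tokenizer for your model will give different counts. The default window of 4096 tokens corresponds to the low end of the 4k–16k range above.

```python
CHARS_PER_TOKEN = 4  # rough heuristic, not a real tokenizer


def estimate_tokens(text):
    """Estimate the token count from character length (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def build_prompt(markdown_doc, question, context_window=4096, reserved_for_answer=512):
    """Embed the Markdown document in a prompt, or raise if it likely won't fit."""
    prompt = (
        "Answer the question using only the document below.\n\n"
        f"<document>\n{markdown_doc}\n</document>\n\n"
        f"Question: {question}"
    )
    budget = context_window - reserved_for_answer
    needed = estimate_tokens(prompt)
    if needed > budget:
        raise ValueError(
            f"prompt needs about {needed} tokens but only {budget} are available; "
            "consider RAG or fine-tuning instead"
        )
    return prompt
```

A `ValueError` here is a signal that the document has outgrown in-context learning for the chosen window size.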