# Prepare and ingest files

> Upload multiple files at once in their original formats. The AI-Ready Data Engine automatically extracts the relevant data and structures it into a dataset that you can use for fine-tuning.

Begin the data preparation process by gathering source documents for upload to the Data Engine. Aim to collect documents that are relevant to the domain, and that are structured in a clear and logical way, with headings and subheadings for readability.

## Step 1: Prepare your file for upload

### Supported file types

* Markdown (`.md`)
* PDF (`.pdf`)
* Word Documents (`.docx`, `.doc`)
* PowerPoint (`.ppt`, `.pptx`)

### Length requirements

* Your file should have at least 4 cohesive sections, each detailed enough to form a meaningful “chunk.”
* There is no required token count, but each section should contain enough substantive text to stand on its own as a chunk.
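
For Markdown text (either a Markdown source or the output of ingestion), the length guideline can be approximated with a small script. This is a rough sketch under stated assumptions: the 50-word threshold is a hypothetical stand-in for "enough substantive text", and both function names are made up for illustration.

```python
import re

# A heading is one to six "#" characters followed by whitespace and some text.
HEADING_RE = re.compile(r"^#{1,6}\s+\S")


def section_word_counts(markdown_text):
    """Return (heading, word_count) pairs, one per heading in the text."""
    sections = []
    in_fence = False
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("```"):
            in_fence = not in_fence  # fence markers themselves are not counted
            continue
        if not in_fence and HEADING_RE.match(stripped):
            sections.append([stripped.lstrip("#").strip(), 0])
        elif sections:
            # Body text (including code inside fences) counts toward the section.
            sections[-1][1] += len(stripped.split())
    return [(heading, count) for heading, count in sections]


def meets_length_guideline(markdown_text, min_sections=4, min_words=50):
    """True if at least min_sections sections have min_words words or more.

    min_words is an assumed threshold, not a documented requirement.
    """
    counts = section_word_counts(markdown_text)
    return sum(1 for _, count in counts if count >= min_words) >= min_sections
```

Adjust `min_words` to whatever "substantive" means for your material; the Data Engine does not publish a specific number.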

### File formatting guidelines

Before uploading, make sure your files are properly formatted by following these steps:

**PDF and DOCX**:

1. Use clear headings and avoid images without context.
2. Ensure text content is structured logically for conversion.

<CardGroup>
  <Card title="Example PDF" icon="file-pdf" href="https://drive.google.com/file/d/1wwwh2YeZyD24he-GbS4PvMDv5ZnFe2qQ/view?usp=sharing" horizontal />
</CardGroup>

**Markdown**:

1. Use a correct header hierarchy (`# H1`, `## H2`, `### H3`, etc.).
2. Limit headers to six levels (`######`).
3. Maintain a clear, logical flow of information throughout.
4. Avoid missing or empty headers and skipped levels.
5. Ensure all sections have meaningful content under them.
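
The Markdown rules above can be checked mechanically before upload. The following is a minimal sketch, not an official validator: `lint_headings` is a hypothetical helper, and it makes two judgment calls. It only reports a skipped level between two headings (a document that starts at `##` is accepted), and it treats a heading as "empty of content" only when the next heading is at the same or a higher level, so a parent heading followed directly by a subheading is allowed.

```python
import re

# One or more "#" characters, optionally followed by whitespace and a title.
HEADING_RE = re.compile(r"^(#+)(?:\s+(.*))?$")


def lint_headings(markdown_text):
    """Return sorted (line_number, message) problems found in the headings."""
    problems = []
    previous_level = 0
    pending = None  # (line_number, level) of the latest heading with no body yet
    in_fence = False
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("```"):
            in_fence = not in_fence
            pending = None  # a code block counts as section content
            continue
        if in_fence or not stripped:
            continue
        match = HEADING_RE.match(stripped)
        if match is None:
            pending = None  # ordinary body text counts as section content
            continue
        level = len(match.group(1))
        title = (match.group(2) or "").strip()
        if level > 6:
            problems.append((number, "more than six heading levels"))
        if not title:
            problems.append((number, "empty heading"))
        if previous_level and level > previous_level + 1:
            problems.append((number, "skipped heading level"))
        if pending is not None and level <= pending[1]:
            problems.append((pending[0], "section has no content"))
        pending = (number, level)
        previous_level = level
    if pending is not None:
        problems.append((pending[0], "section has no content"))
    return sorted(problems)
```

An empty result means none of these particular checks fired; it does not guarantee that the document flows logically, which still needs a human read.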

<Info>
  Google Docs users can export files as Markdown, PDF, or DOCX.
</Info>

Once your files are in order, proceed to the next step.

## Step 2: Upload files

Use SeekrFlow’s Files API to upload source documents. Documents can be in the following formats:

* PDF
* DOCX
* DOC
* Markdown
* PowerPoint (.ppt, .pptx)

The following examples show how to upload a single file or multiple files of varying types:

### Upload a single file

Upload a single file of up to 4 GB.

**Endpoint:** `PUT /v1/flow/files` [Upload Training File](/flow/reference/file_upload_v1_flow_files_put)

<CodeGroup>
  ```python Python theme={null}
  from seekrai import SeekrFlow

  # If your API key is stored as an environment variable, you can leave the
  # parentheses empty: SeekrFlow()
  client = SeekrFlow(api_key="your-api-key")

  # Upload a file
  upload_resp = client.files.upload("example.pdf", purpose="alignment")
  print(upload_resp.id)
  ```
</CodeGroup>

#### Sample response

<CodeGroup>
  ```text Output theme={null}
  Uploading file example.pdf: 100%|█| 21.6M/21.6M [00:21<00:00, 1.02MB
  file-25e34f96-2130-11f0-9236-3e11346bffff
  ```
</CodeGroup>

### Upload multiple files

Upload multiple files of up to 4 GB each. The endpoint accepts an array of files as input.

<Info>
  File size limits differ between the API/SDK and the [UI](https://apps.seekr.com/flow), which supports uploads of up to 10 files at a time, each up to 100 MB.
</Info>

The following example uploads two files, one PDF and one DOCX, simultaneously:

**Endpoint:** `PUT /v1/flow/bulk_files` [Bulk upload files](/flow/reference/bulk_file_upload_v1_flow_bulk_files_put)

<CodeGroup>
  ```python Python theme={null}
  from seekrai import SeekrFlow
  client = SeekrFlow()

  # Upload one PDF and one DOCX in a single request
  bulk_resp = client.files.bulk_upload(
      ["example1.pdf", "example2.docx"], purpose="alignment"
  )

  # Access and print the ID of each uploaded file
  for resp in bulk_resp:
      print(resp.id)
  ```
</CodeGroup>

### List and delete files

Keep track of uploaded files and remove duplicates or erroneous uploads as needed.

List all uploaded files:

**Endpoint:** `GET /v1/flow/files` [List files](/flow/reference/list_files_v1_flow_files_get)

<CodeGroup>
  ```python Python theme={null}
  # List all files
  files_response = client.files.list()

  for file in files_response.data:
      print(f"ID: {file.id}, Filename: {file.filename}")
  ```
</CodeGroup>

Delete a file from the system:

**Endpoint:** `DELETE /v1/flow/files/{file_id}` [Delete file](/flow/reference/delete_file_v1_flow_files__file_id__delete)

<CodeGroup>
  ```python Python theme={null}
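  # Delete a file by ID (replace with the ID of the file you want to remove)
  file_id = "file-25e34f96-2130-11f0-9236-3e11346bffff"
  client.files.delete(file_id)
  print(f"Successfully deleted file {file_id}")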
  ```
</CodeGroup>
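
To spot duplicate uploads, one option is to group the listed files by filename and collect the IDs of the repeats. The sketch below does not call the API. `FileRecord` is a stand-in for the entries of `files_response.data` (only `id` and `filename` are used), and `duplicate_ids` is a hypothetical helper. It keeps the first record it sees for each filename; whether the list is returned in upload order is an assumption you should verify before deleting anything.

```python
from dataclasses import dataclass


@dataclass
class FileRecord:
    """Stand-in for one entry of files_response.data."""
    id: str
    filename: str


def duplicate_ids(records):
    """Return the IDs of records whose filename already appeared earlier in the list."""
    seen = set()
    duplicates = []
    for record in records:
        if record.filename in seen:
            duplicates.append(record.id)
        else:
            seen.add(record.filename)
    return duplicates


records = [
    FileRecord("file-1", "example.pdf"),
    FileRecord("file-2", "handbook.docx"),
    FileRecord("file-3", "example.pdf"),
]
print(duplicate_ids(records))
```

In practice you would pass `files_response.data` instead of the sample records, review the returned IDs, and then call `client.files.delete(file_id)` for each one you want to remove.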

## Step 3: Start an ingestion job (for PDFs, DOCX, DOC, and PPT/PPTX only)

**Endpoint:** `POST /v1/flow/alignment/ingestion` [Multi-file Ingestion](/flow/reference/ingest_files_v1_flow_alignment_ingestion_post)

<Info>
  When using data jobs, ingestion is triggered automatically when you attach files via `POST /v1/flow/data-jobs/{id}/add-files` — you do not need to call this endpoint directly. See [Create instruction fine-tuning data](/flow/sdk/data-engine/standard-instruction-finetuning) for the complete data job workflow.
</Info>

If your source file is a PDF, DOCX, DOC, PPT, or PPTX, you must first convert it to Markdown by running an ingestion job.

If you already have a Markdown file, skip this step.

<Warning>
  Do not ingest Markdown files. This will return an error.
</Warning>
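
When a batch of uploads mixes Markdown with other formats, you can filter out the Markdown files before starting ingestion. This is a small sketch under assumptions: `ids_to_ingest` is a hypothetical helper, and it expects a mapping of file ID to original filename that you have built yourself (for example, from the upload or list responses).

```python
from pathlib import Path

# Formats that must be converted to Markdown; ".md" is deliberately excluded.
NEEDS_INGESTION = {".pdf", ".docx", ".doc", ".ppt", ".pptx"}


def ids_to_ingest(uploaded):
    """Given {file_id: filename}, return the IDs whose files need ingestion."""
    return [
        file_id
        for file_id, filename in uploaded.items()
        if Path(filename).suffix.lower() in NEEDS_INGESTION
    ]


uploaded = {"file-a": "report.pdf", "file-b": "notes.md", "file-c": "Deck.PPTX"}
print(ids_to_ingest(uploaded))
```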

If your file contains tables, make sure they have no empty cells. An empty cell can cause the table's remaining content to shift over and fill the gap, which misaligns the data.

<CodeGroup>
  ```python Python theme={null}
  from seekrai import SeekrFlow

  client = SeekrFlow(api_key="your-api-key")
  ingestion = client.ingestion

  # Start ingestion on one or more uploaded file IDs
  response = ingestion.ingest(
      files=[
          "file-25e34f96-2130-11f0-9236-3e11346bffff",
          "file-efd0b334-2130-11f0-9236-3e11346bffff",
      ],
      method="accuracy-optimized",
  )
  print("Ingestion Job ID:", response.id)
  ```
</CodeGroup>

Once ingestion is complete, you’ll receive a Markdown file that you can use for fine-tuning.

<Info>
  **Ingestion mode and the UI**

  When ingesting files through the SeekrFlow UI, speed-optimized mode is always used. The SDK lets you choose between speed-optimized and accuracy-optimized.
</Info>

### Choose an ingestion method

#### Accuracy-optimized (default)

When you use `method="accuracy-optimized"` or omit the method parameter, the system prioritizes accuracy. Depending on what data is available in your PDF document (bookmarks, tables, text layers), the system combines multiple extraction techniques for best results.

**Key features:**

* Uses both OCR and direct text extraction, then blends them together
* Employs LLM agents to correct and enhance document hierarchy
* Applies advanced table detection algorithms for accurate table formatting

<Info>
  Documents over 100 pages can take up to 30 minutes to process.
</Info>

#### Speed-optimized

When you use `method="speed-optimized"`, the system balances quality with processing speed. It automatically selects faster methods based on document size while maintaining reasonable accuracy for smaller documents.

**Key features:**

* Small documents still use high-accuracy methods
* Larger documents use speed-optimized algorithms to meet time constraints

<Info>
  Speed-optimized ingestion is designed to complete in approximately 3 minutes, regardless of document size.
</Info>
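
If you want to encode the trade-off between the two methods in a script, one possible rule of thumb is sketched below. It is an illustration built only from the timings quoted on this page (up to 30 minutes for accuracy-optimized on documents over 100 pages, about 3 minutes for speed-optimized). `choose_method` is a hypothetical helper, and because no timing is given for shorter documents, the sketch simply keeps the default for them.

```python
def choose_method(page_count, max_wait_minutes):
    """Pick an ingestion method name from a page count and a time budget."""
    # Large document and a budget shorter than the quoted worst case:
    # fall back to the faster method.
    if page_count > 100 and max_wait_minutes < 30:
        return "speed-optimized"
    # Otherwise keep the default, which prioritizes accuracy.
    return "accuracy-optimized"


print(choose_method(page_count=250, max_wait_minutes=10))
```

The returned string can be passed as the `method` argument of `ingestion.ingest(...)`.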

## Step 4: Check ingestion status

After starting ingestion, you can track job progress, view per-file statuses, and diagnose any failures. See [Monitor ingestion](/flow/sdk/data-engine/monitor-ingestion) for details on checking job states, interpreting `file_records`, and resolving errors.

Once `status` shows `completed`, your output file(s) will be available for the next step.
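
Waiting for a job usually comes down to a polling loop. The sketch below is generic and makes no SDK calls of its own: `wait_for_job` is a hypothetical helper, and you supply `get_status`, a function that returns the job's current status string using whichever call the Monitor ingestion page describes. Only `completed` comes from this page; treating `failed` as a terminal state is an assumption.

```python
import time


def wait_for_job(get_status, poll_seconds=10.0, timeout_seconds=1800.0, sleep=time.sleep):
    """Call get_status() until it returns a terminal state or the timeout elapses."""
    terminal = {"completed", "failed"}  # "failed" is an assumed state name
    waited = 0.0
    while True:
        status = get_status()
        if status in terminal:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still '{status}' after {waited:.0f} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
```

The default timeout of 1800 seconds mirrors the 30-minute upper bound quoted for large accuracy-optimized jobs; shorten it for speed-optimized runs.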

## Step 5: Review and edit ingested file

Before moving on to the next step, download the Markdown files created during ingestion to make sure all the information was transferred successfully. This is your opportunity to make any necessary edits in a Markdown editor, then re-upload your file for best results.

<CodeGroup>
  ```python Python theme={null}
  from seekrai import SeekrFlow
  client = SeekrFlow()

  retrieve_resp = client.files.retrieve_content("file-dc6f19d1-b55a-43a0-a38a-56f0e7bd8d8d")

  file_id = retrieve_resp.id
  print(f"Retrieved file with ID: {file_id}")
  ```
</CodeGroup>

### Markdown editing guidelines

Check to make sure standard Markdown formatting conventions are being followed:

* Clear hierarchy with appropriate heading levels (`# H1`, `## H2`, `### H3`, etc.)
* Logical flow of information from beginning to end
* Consistent formatting throughout the document
* Clear separation between sections using headings and whitespace
* Meaningful content below each header

In addition, do a sweep for incorrect text or characters that may have been picked up during ingestion. For example:

<CodeGroup>
  ```text Markdown theme={null}
  # California State Univ University, Monter , Monterey Bay [Digital Commons @ CSUMB](https://digitalcommons.csumb.edu/)

  ## Movie Music: Film Soundtracks or Film Soundtracks Thr acks Throughout Hist oughout History
  ```
</CodeGroup>

This file picked up some erroneous text during ingestion. Remove such text to ensure the highest-quality QA pairs for fine-tuning (a few minor additions, such as a stray `/` or extra line breaks, likely won't affect quality).
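
A few simple heuristics can point you at lines worth a closer look. The sketch below is a review aid, not a cleaner: `suspicious_lines` is a hypothetical helper that flags a space before punctuation, a word fragment immediately followed by the full word (such as `Univ University`), and, in headings only, a repeated two-word phrase. Expect both false positives and misses.

```python
import re


def suspicious_lines(markdown_text):
    """Return (line_number, reason) pairs for lines that look like extraction debris."""
    findings = []
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        words = [word.lower() for word in re.findall(r"[A-Za-z]+", line)]
        pairs = list(zip(words, words[1:]))
        # Example: "Monter , Monterey" has whitespace before the comma.
        if re.search(r"\w\s+[,.;:](?:\s|$)", line):
            findings.append((number, "space before punctuation"))
        # Repeated two-word phrases are only checked in headings, because
        # ordinary paragraphs repeat short phrases all the time.
        is_heading = line.lstrip().startswith("#")
        if is_heading and len(pairs) != len(set(pairs)):
            findings.append((number, "repeated phrase"))
        # Example: "Univ University" has a fragment right before the full word.
        if any(len(a) >= 3 and a != b and b.startswith(a) for a, b in pairs):
            findings.append((number, "word fragment before full word"))
    return findings
```

Run against the two heading lines shown above, this sketch flags both of them; clean headings such as `## Flight Delays and Cancellations` pass without findings.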

Once you're satisfied, save your file with a `.md` extension and upload it:

<CodeGroup>
  ```python Python theme={null}
  upload_resp = client.files.upload("edited_example.md", purpose="alignment")

  file_id = upload_resp.id
  print(f"Uploaded file with ID: {file_id}")
  ```
</CodeGroup>

<CodeGroup>
  ```text Output theme={null}
  Uploading file edited_example.md: 100%|███████████████████████████████████| 41.8k/41.8k [00:01<00:00, 35.3kB/s]
  Uploaded file with ID: file-9b9bf862-26cd-11f0-958d-52c2a8425a49
  ```
</CodeGroup>

Keep the new file ID for the next step.

## Step 6: In-context learning with Markdown (optional)

Once you’ve reviewed and edited your ingested Markdown (Step 5), you can use it directly in your prompts without fine-tuning. This works well for simpler, single documents that fit within a context window — for example, an airline’s customer service plan.

### Example

The following Markdown file was generated by the AI-Ready Data Engine from a PDF source. Its clearly structured sections make it a good candidate for in-context learning.

<CodeGroup>
  ```text Markdown expandable theme={null}
  # American Airlines Customer Service Plan

  ## Our Commitment
  American Airlines and American Eagle are in business to provide safe, dependable and friendly air transportation.

  ## Accommodations for Unaccompanied Minors
  Children 5–14 years old may travel under our unaccompanied minor (UMNR) service on nonstop or same-plane flights.

  - Children 8–14 may travel on connecting flights via select airports (CLT, DFW, LAX, etc.).
  - Children 15–17 may optionally use UMNR service.
  - UMNR service is not available for codeshare or partner flights.

  ## Customers with Disabilities
  American maintains Special Assistance Coordinators (SACs) to arrange accommodations such as:

  - Pre-reserved seating
  - Boarding assistance
  - Wheelchair support
  - In-cabin storage of assistive devices

  ## Flight Delays and Cancellations
  For delays or cancellations:

  - Rebooking on the next available flight is provided at no cost.
  - Hotel and meal vouchers are issued if delays are caused by American and result in overnight stays.
  - In cases of diversions, we provide transportation and accommodations as needed.

  ## AAdvantage® Program
  Members earn miles for flights, credit card use, hotel stays, car rentals, and more. Miles can be redeemed for travel, upgrades, or donated.
  ```
</CodeGroup>

Pass the file contents as context in your prompt:

<CodeGroup>
  ```text Markdown theme={null}
  You are a helpful customer support agent. Use only the following context to answer the question.

  --- Context (american_airlines_customer_service.md) ---
  <contents of american_airlines_customer_service.md>

  Question: "What assistance does American Airlines provide if a flight is diverted to another city?"
  ```
</CodeGroup>

The model uses the Markdown structure to locate the relevant section and return an answer without any model retraining.
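
The prompt template above can be assembled programmatically. This is a minimal sketch: `build_prompt` is a hypothetical helper that only formats a string, and sending the result to a model is left to whichever inference call you already use.

```python
def build_prompt(context_name, context_text, question):
    """Fill the in-context learning template with a document and a question."""
    return (
        "You are a helpful customer support agent. "
        "Use only the following context to answer the question.\n\n"
        f"--- Context ({context_name}) ---\n"
        f"{context_text.strip()}\n\n"
        f'Question: "{question}"'
    )


prompt = build_prompt(
    "american_airlines_customer_service.md",
    "## Flight Delays and Cancellations\nIn cases of diversions, we provide transportation and accommodations as needed.",
    "What assistance does American Airlines provide if a flight is diverted to another city?",
)
print(prompt)
```

To use a real file, read it first, for example with `Path("american_airlines_customer_service.md").read_text(encoding="utf-8")` from the standard `pathlib` module, and pass the text as `context_text`.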

### When to use in-context learning

* Best for simple, single documents that fit within a context window (typically 4k–16k tokens)
* Reduces initial setup — no model training required
* Becomes impractical for large document sets or policies with complex conditional logic
* When you need to scale beyond a single document or require consistent precision, consider RAG or fine-tuning instead
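
To get a rough sense of whether a document fits, you can estimate its token count from its length. The sketch below relies on a common rule of thumb of about four characters per token for English text, which is an assumption and not the behavior of any specific tokenizer. The function names, the 4,000-token window, and the 500-token reserve for instructions and the answer are all hypothetical defaults.

```python
import math


def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate based on character count (assumed ratio)."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(text, context_tokens=4000, reserve_tokens=500):
    """True if the estimated tokens plus a reserve fit within the context window."""
    return estimate_tokens(text) + reserve_tokens <= context_tokens


document = "Rebooking on the next available flight is provided at no cost. " * 40
print(estimate_tokens(document), fits_in_context(document))
```

For a decision that matters, count tokens with the tokenizer of the model you plan to use instead of this estimate.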
