Create Context-Grounded Fine-Tuning Data

Generate structured document chunks that improve how AI retrieves, ranks, and synthesizes information in a Retrieval-Augmented Generation (RAG) pipeline.

This guide walks you through preparing and uploading files via the API to generate structured documents (context data) that improve how AI processes and synthesizes retrieved information in a RAG pipeline.

✏️

SDK support for context-grounded fine-tuning (CoG) is coming soon!


Step 1: Prepare your file for upload

Your file must be in Markdown

  • JSON, PDF, or DOCX are not accepted. If you have a PDF or DOCX, convert it first (see Step 3 below).

Check length requirements

  • Your file should have at least 4 cohesive sections, each detailed enough to form a meaningful “chunk.”
  • Exact token counts are not enforced, but each section should contain enough substantive text to stand on its own as a chunk.
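As a quick pre-upload sanity check, you could count the headings in your file to confirm it has at least four sections. This is a sketch of our own (the helper name and heading levels are illustrative, not part of the API):

```python
import re

def count_sections(markdown_text, max_level=2):
    """Count Markdown headings up to max_level (# and ## by default)."""
    pattern = re.compile(r"^#{1,%d}\s+\S" % max_level, re.MULTILINE)
    return len(pattern.findall(markdown_text))

with_headings = "# Title\n## Section A\ntext\n## Section B\nmore text\n"
print(count_sections(with_headings))  # 3
```

If the count (minus the title) is below four, consider splitting or expanding sections before uploading.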

Structure your Markdown file

  • Use headings (#, ##, ### in Markdown) in logical order.
  • Keep each topic thematically consistent per section.
  • Avoid empty headers or bullet points with no explanation.
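As an illustration, a file on the US interstate system (the topic used in the Step 5 example below) might be structured like this; the headings and content are invented for this sketch:

```markdown
# The US Interstate System

## History
The Federal-Aid Highway Act of 1956 authorized construction of the system...

## Numbering Conventions
Even-numbered routes run east-west; odd-numbered routes run north-south...

## Funding
How construction and maintenance are financed...

## Signage
The design and placement of route markers...
```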

Step 2: Upload your Markdown file

Endpoint: PUT /v1/flow/files (Upload Training File)

Use the same file upload endpoint as with standard instruction fine-tuning. The difference is that you'll set "purpose": "context_data" instead of "alignment".

import requests

# Define API endpoint
url = "https://flow.seekr.com/v1/flow/files"

# Set headers with API key
headers = {
    "Authorization": "YOUR_API_KEY",
    "accept": "application/json"
}

# Open and upload the file (must be Markdown for Enhanced Retrieval);
# the context manager ensures the file handle is closed after the request
with open("/path/to/your/example.md", "rb") as f:
    files = {
        "files": ("example.md", f),
        "purpose": (None, "context_data")
    }
    response = requests.put(url, files=files, headers=headers)

print("Status Code:", response.status_code)
print("Response:", response.json())



Sample response:

{
    "id": "file-123456789",
    "object": "file",
    "created_at": "example_timestamp",
    "type": "md",
    "purpose": "context_data",
    "filename": "example.md",
    "bytes": 64105,
    "created_by": "user"
}

Step 3: Start an ingestion job (for PDF and DOCX only)

Endpoint: POST /v1/flow/alignment/ingestion (Multi-file Ingestion)

If your original source was a PDF or DOCX, you must first convert it to Markdown.
If you already have a Markdown file, skip this step.

import requests

url = "https://flow.seekr.com/v1/flow/alignment/ingestion"

payload = {
    "file_ids": ["file-123456789"]  # The file ID returned from upload
}

# Reuses the headers defined in Step 2
response = requests.post(url, json=payload, headers=headers)

print("Status Code:", response.status_code)
print("Response:", response.json())



Sample response:

{
    "id": "ij-123456789",
    "created_at": "example_timestamp",
    "status": "running",
    "output_files": []
}

Once ingestion is complete, you’ll receive a Markdown file that you can use for context-grounded fine-tuning.


Step 4: Check ingestion status (optional)

Endpoint: GET /v1/flow/alignment/ingestion/{job_id} (Get Detailed Job Status)

After starting ingestion, you can check the status to confirm it completed successfully.

import requests

job_id = "ij-123456789"
url = f"https://flow.seekr.com/v1/flow/alignment/ingestion/{job_id}"

response = requests.get(url, headers=headers)
print("Status Code:", response.status_code)
print("Response:", response.json())

Sample response:

{
  "id": "ij-123456789",
  "created_at": "example_timestamp",
  "status": "completed",
  "output_files": [
    "file-123456789"
  ]
}

Check the status field. Once it shows completed, your output file(s) will be available for the next step.
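Rather than checking manually, you can poll in a loop. The sketch below is our own (the helper name and parameters are illustrative, not part of the API); it takes any callable that fetches the job JSON, so it works for both the ingestion job here and the alignment job in Step 6:

```python
import time

def poll_until_done(fetch_job, interval=10, timeout=600,
                    done_states=("completed", "failed")):
    """Call fetch_job() repeatedly until the job reaches a terminal status."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = fetch_job()
        if job.get("status") in done_states:
            return job
        time.sleep(interval)
    raise TimeoutError(f"job still running after {timeout}s")
```

You could call it with a closure over the request, e.g. `poll_until_done(lambda: requests.get(url, headers=headers).json())`.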


Step 5: Generate context data

Endpoint: POST /v1/flow/alignment/generate (Generate IFT pairs)

Now that you have a properly formatted Markdown file, you can start the data generation process to create document chunks.

Collect your Markdown file ID
Use the ID of the final Markdown file: either the upload response from Step 2, or the ingestion output from Step 3.

After collecting your file ID, send the request:

import requests

url = "https://flow.seekr.com/v1/flow/alignment/generate"

payload = {
    "files": [
        "file-123456789"
    ],
    "instructions": "I want to train a chatbot to answer questions about the US interstate system.",
    "type": "context_data"
}

response = requests.post(url, json=payload, headers=headers)

print("Status Code:", response.status_code)
print("Response:", response.json())

Sample response:

{
    "id": "aj-123456789",
    "created_at": "example_timestamp",
    "status": "queued"
}

Step 6: Check job status (optional)

Endpoint: GET /v1/flow/alignment/{job_id} (Fetch Job Details)

While the alignment job is running, you can poll its status to see when it is complete.

import requests

job_id = "aj-123456789"
url = f"https://flow.seekr.com/v1/flow/alignment/{job_id}"

response = requests.get(url, headers=headers)
print("Status Code:", response.status_code)
print("Response:", response.json())

Sample response:

{
  "id": "aj-123456789",
  "created_at": "example_timestamp",
  "status": "completed"
}

Once the status shows completed, the Parquet file will appear in your files list (e.g., "filename": "file-d7166390-962f-4d22-93a2-265d93c114e6-raft-qa-pairs.parquet").
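Once you retrieve your files list, you can pick out the generated output by its extension. This helper is a sketch of our own and assumes file records shaped like the upload response in Step 2:

```python
def find_parquet_files(files):
    """Return file records whose filename ends in .parquet."""
    return [f for f in files if f.get("filename", "").endswith(".parquet")]

# Example with records shaped like the Step 2 upload response
example = [
    {"id": "file-123456789", "filename": "example.md"},
    {"id": "file-987654321",
     "filename": "file-d7166390-962f-4d22-93a2-265d93c114e6-raft-qa-pairs.parquet"},
]
print([f["id"] for f in find_parquet_files(example)])  # ['file-987654321']
```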