Context-Grounded Fine-Tuning Data

Generate structured document chunks that improve how AI retrieves, ranks, and synthesizes information in a Retrieval-Augmented Generation (RAG) pipeline.

This guide walks you through preparing and uploading files to generate structured documents (context data) that improve how AI processes and synthesizes retrieved information in a RAG pipeline.

How it Works

Enhanced Retrieval is distinct from principle alignment, which focuses on steering a model to adhere to specific domain policy documents or guidelines. Our new method improves a model’s ability to reason by teaching it which documents from the RAG database it can safely ignore, and having it generate a chain-of-thought style response to explain the answer it chose. The resulting context data are used to fine-tune the model.

What is Context Data?

Context data differs from traditional fine-tuning data, which is designed to adapt a model to specific domain or task knowledge. Instead of simply refining what a model knows, context data improves how it processes retrieved information, helping it distinguish between relevant and irrelevant documents in a RAG database.

Each piece of data is organized as follows:

  • A question
  • A set of reference documents, including relevant (“oracle”) and irrelevant (“distractor”) documents for answering the question. Distractors are randomly selected pieces of text generated from your source document and used to test the strength of the model’s reasoning. This encourages the model to rely on its memory when answering and makes it more robust across a range of real-world scenarios.
  • A Chain-of-Thought (CoT) explanation – with citations – generated from an instruction prompt. This is crucial to help the model understand how the question, the relevant documents, and the right answer fit together.
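For illustration, a single piece of context data might look like the sketch below. The structure and field names are hypothetical, chosen for readability rather than taken from the actual output schema:

import json

# A hypothetical context-data record (structure and field names are illustrative only)
record = {
    "question": "Which highway act established the US interstate system?",
    "documents": [
        {"role": "oracle", "text": "The Federal-Aid Highway Act of 1956 authorized ..."},
        {"role": "distractor", "text": "Speed limits on rural interstates vary by state ..."},
        {"role": "distractor", "text": "Interstate route numbers follow a grid convention ..."},
    ],
    "cot_answer": (
        "The question asks which act created the interstate system. "
        "Document 1 states that the Federal-Aid Highway Act of 1956 authorized it [doc 1]. "
        "The other documents cover speed limits and numbering, which are not relevant here. "
        "Answer: the Federal-Aid Highway Act of 1956."
    ),
}

print(json.dumps(record, indent=2))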

When this data is used for fine-tuning, the model itself is updated at the weight level to internalize this capability—allowing it to incorporate relevant information from retrieved documents in a way the base model could not.

Why it Matters

This method achieves better results than traditional fine-tuning while using fewer resources, and it boosts the accuracy and usability of RAG applications through contextualized outputs.

In-Domain Performance Improvement: Compared with traditional fine-tuning, this method modifies the base model used in a RAG pipeline more strategically. Instead of model-wide parameter updates, it focuses on where and how to update for domain-specific tasks, guided by retrieved examples. Tasks involving specialized vocabulary or domain-specific reasoning patterns (e.g., technical documentation search and solution finding) see significant improvement with this kind of strategy.

Edge Case Handling: This method provides more robust edge case handling relevant to specific domains (e.g., industry-specific EDD reporting).

When to Use it

Use Enhanced Retrieval when your goal is to maximize the quality and relevance of information retrieved from a larger corpus, especially one stored in a vector database. Because these documents are typically stored in a vector DB for quick, chunk-level retrieval, ensure they are coherent and thematically consistent and include the domain knowledge your LLM must reference.

Best suited for: Knowledge bases, technical manuals, research articles, product documentation, or any domain content you want your LLM to retrieve with precision.


Step 1: Preparing Your File for Upload

Ensure Your File is in Markdown

  • Enhanced Retrieval does not accept JSON, PDF, or DOCX. If you have a PDF or DOCX, convert it first (see Step 3 below).

Check Length Requirements

  • Your file should have at least 4 cohesive sections, each detailed enough to form a meaningful “chunk.”
  • Exact token counts are not mandatory, but each section should contain enough substantive text to stand on its own as a meaningful chunk.

Structure Your Markdown File

  • Use headings (#, ##, ### in Markdown) in logical order.
  • Keep each topic thematically consistent per section.
  • Avoid empty headers or bullet points with no explanation.
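For example, a well-structured file might follow a skeleton like the one below, where each of the four sections would become a candidate chunk (the headings and topics are hypothetical; your content will differ):

# US Interstate System

## History and Origins
Several sentences describing how the system came to be ...

## Numbering Conventions
Several sentences explaining how routes are numbered ...

## Speed Limits
Several sentences covering speed regulation ...

## Funding and Maintenance
Several sentences on how the system is funded ...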

Step 2: Uploading Your Markdown File

Endpoint: PUT /v1/flow/files (Upload Training File)

Use the same file upload endpoint as with principle alignment. The difference is that you will set "purpose": "context_data" to indicate that this file is meant for context data generation.

import requests

# Define the API endpoint
url = "https://flow.seekr.com/v1/flow/files"

# Set headers with your API key
headers = {
    "Authorization": "YOUR_API_KEY",
    "accept": "application/json"
}

# Open and upload the file (must be Markdown for Enhanced Retrieval)
files = {
    "files": ("example.md", open("/path/to/your/example.md", "rb")),
    "purpose": (None, "context_data")
}

response = requests.put(url, files=files, headers=headers)

print("Status Code:", response.status_code)
print("Response:", response.json())

Sample Response:

{ "id": "file-123456789", "object": "file", "created_at": "example_timestamp", "type": "md", "purpose": "content_data", "filename": "example.md", "bytes": 64105, "created_by": "user" }

Step 3: Starting an Ingestion Job (If Needed)

Endpoint: POST /v1/flow/alignment/ingestion (Multi-file Ingestion)

If your original source was a PDF or DOCX, you must first convert it to Markdown.
If you already have a Markdown file, skip this step.

url = "https://flow.seekr.com/v1/flow/alignment/ingestion" payload = { "file_ids": ["file-123456789"] # The file ID returned from upload } response = requests.post(url, json=payload, headers=headers) print("Status Code:", response.status_code) print("Response:", response.json())

Sample Response:

{ "id": "ij-123456789", "created_at": "example_timestamp", "status": "running", "output_files": [] }

Once ingestion is complete, you’ll receive a Markdown file that you can use for Enhanced Retrieval.


Step 4: Checking Ingestion Status (Optional)

Endpoint: GET /v1/flow/alignment/ingestion/{job_id} (Get Detailed Job Status)

After you initiate ingestion, you may want to check the status to confirm it has completed successfully.

import requests job_id = "ij-123456789" url = "https://flow.seekr.com/v1/flow/alignment/ingestion/{job_id}" response = requests.get(url, headers=headers) print("Status Code:", response.status_code) print("Response:", response.json())

Sample Response:

{ "id": "ij-123456789", "created_at": "example_timestamp", "status": "completed", "output_files": [ "file-123456789" ] }

Check the status field. Once it shows completed, your output file(s) will be available for the next step.
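Rather than checking by hand, you can poll in a short loop until the job finishes. A minimal sketch, assuming the job reports queued or running while in progress (the 30-second interval is arbitrary); the same pattern works for the alignment job in Step 6:

import time

import requests

# Reuses the headers defined in Step 2
job_id = "ij-123456789"
url = f"https://flow.seekr.com/v1/flow/alignment/ingestion/{job_id}"

# Poll every 30 seconds while the job is still in progress
job = requests.get(url, headers=headers).json()
while job["status"] in ("queued", "running"):
    time.sleep(30)
    job = requests.get(url, headers=headers).json()

if job["status"] == "completed":
    # The converted Markdown file IDs are listed in output_files
    print("Ingestion complete:", job["output_files"])
else:
    print("Ingestion ended with status:", job["status"])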


Step 5: Generating Context Data

Endpoint: POST /v1/flow/alignment/generate (Generate IFT pairs)

Now that you have a properly formatted Markdown file, you can start the data generation process to create document chunks.

Collect Your Markdown File ID
Use the ID of the final Markdown file from either Step 2 or the ingestion output in Step 3.

Send the Request

import requests url = "https://flow.seekr.com/v1/flow/alignment/generate" payload = { "files": [ "file-123456789" ], "instructions": "I want to train a chatbot to answer questions about the US interstate system.", "type": "context_data" } response = requests.post(url, json=payload, headers=headers) print("Status Code:", response.status_code) print("Response:", response.json())

Sample Response:

{ "id": "aj-123456789", "created_at": "example_timestamp", "status": "queued" }

Step 6: Checking Job Status (Optional)

Endpoint: GET /v1/flow/alignment/{job_id} (Fetch Job Details)

While the alignment job is running, you can poll its status to see when it is complete.

import requests job_id = "aj-123456789" url = "https://flow.seekr.com/v1/flow/alignment/{job_id}" response = requests.get(url, headers=headers) print("Status Code:", response.status_code) print("Response:", response.json())

Sample Response:

{ "id": "aj-123456789", "created_at": "example_timestamp", "status": "completed" }

Once the status shows completed, the Parquet file will appear in your files list (e.g., "filename": "file-d7166390-962f-4d22-93a2-265d93c114e6-raft-qa-pairs.parquet").
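After downloading the Parquet file, you can inspect the generated records locally before fine-tuning. A minimal sketch, assuming pandas (with a Parquet engine such as pyarrow) is installed and the file has been saved to the working directory:

import pandas as pd

# Load the generated context data (the filename is illustrative)
df = pd.read_parquet("file-d7166390-962f-4d22-93a2-265d93c114e6-raft-qa-pairs.parquet")

# Review the schema and a few sample records
print(df.columns.tolist())
print(df.head())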

