Standard Fine-Tuning Data

Standard supervised fine-tuning trains a model on curated question-answer (QA) pairs, producing more accurate responses within a specific domain or task.

This guide walks you through how to prepare and upload files to generate a standard fine-tuning dataset that you can use to adapt a generalist model to a task of your choosing.

How it Works

Fine-tuning simply means adapting a pre-trained, generalist model to be more effective at a given task; for example, developing a training resource for CBP officers that has deep knowledge of the operational strategies and organizational structures of international drug cartels. Generalist models are powerful but lack access to highly sensitive information (for good reason). They're also not directly optimized for factual accuracy. However, they can be fine-tuned, or taught to specialize in a certain task or domain, while retaining most of their original capabilities.

Preparing and creating these datasets manually is time-consuming and labor-intensive. Another crucial aspect of building specialized use cases is making sure the model adheres to your organization's specific guidelines, which can include regulatory compliance (e.g., HIPAA, GDPR), company policies, or ethical standards. SeekrFlow automates all of these key steps to create trustworthy, accurate data for fine-tuning, including:

  1. File Formatting and Uploading
  2. QA Pair Generation
  3. Parquet QA Pair File Creation for Fine-Tuning

What are QA Pairs?

A Parquet QA Pair File is a structured dataset containing question-and-answer (Q&A) pairs, stored in .parquet format for optimized data storage.

The goal of QA pair generation is to provide structured examples, each pairing a sample input with a desired output, that are used to update model parameters during training. Here's an example from a QA pair file generated from American Airlines policy data:

{
  "finetuning_data_element_id": "qa_1234",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful chatbot trained to assist American Airlines customers with questions about airline policies."
    },
    {
      "role": "user",
      "content": "What is your baggage policy?"
    },
    {
      "role": "assistant",
      "content": "For checked bags, American Airlines allows one free bag for most passengers, with size and weight limits, and charges fees for additional bags and overweight/oversized items. For carry-ons, you can bring one carry-on and one personal item, with size restrictions."
    }
  ]
}

Each pair is run through the model and serves as a training signal for adjusting model weights. This structured, lightweight approach leads to better model understanding and, in turn, better results.

Combined with the system prompt, the model can now respond in detail:

{
  "user": "What is your baggage policy?",
  "response": "Our baggage policy allows for a single bag or personal item to be brought on board for free. You can also purchase additional baggage allowances in advance for $4-$20 per bag, depending on the airport and the time of year. However, it's worth noting that some airports have specific restrictions on the number of bags that can be brought on board, and there may be additional fees for excess bags. It's always a good idea to check the airline's policies and the restrictions at the airport you're flying out of to avoid any issues. Additionally, if you're flying with a partner airline, you should contact them directly for more information on their baggage policies."
}

Compare the output against a sample response from a base model to see the value of this approach:

{
  "user": "What is your baggage policy?",
  "response": "As a responsible and ethical AI language model, I don't have a baggage policy as I am not a physical entity that can carry luggage or cargo. Additionally, I don't have the ability to interact with the physical world or transport objects. My purpose is to assist and provide helpful responses to users' questions and tasks, so please feel free to ask me anything!"
}


When to Use Fine-tuning

While fine-tuning your model on SeekrFlow is quick and painless compared to traditional methods, we still recommend experimenting with fast-iteration methods like prompting, prompt chaining, and function calling first. These methods are fast, can yield good results on their own, and can later be combined with fine-tuning.

  • Prototype quickly with prompts in our Sandbox.
    • (If you haven't signed up yet, see Quickstart for instructions)
  • Build prompt chains and more complex logic on SeekrFlow using our LangChain integration package.
  • Create function calls to interface with your own code and external sources.

Once you've gotten as far as you can with prompting and are ready to fine-tune, save your prompts for reuse as fine-tuning goals, or as fine-tuning data. Good prompts and (corrected) failed prompts can both become useful data for fine-tuning a model to a complex, nuanced task; see the sketch after this list. Other ways fine-tuning can improve your results include:

  • Improving output reliability
  • Setting the style, tone, or format
  • Performing a new task that’s more easily demonstrated than described
  • Reducing cost and latency over a more expensive model
  • Handling edge cases
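
For instance, a saved prompt and its corrected response can be recast into the messages format shown earlier. Here's a minimal sketch; the helper function and sample strings are our own illustration:

import json

def to_training_element(element_id, system, user, assistant):
    """Recast a saved prompt and corrected response as a QA training element."""
    return {
        "finetuning_data_element_id": element_id,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ],
    }

element = to_training_element(
    "qa_0001",
    "You are a helpful chatbot trained to assist American Airlines customers.",
    "What is your baggage policy?",
    "For checked bags, American Airlines allows one free bag for most passengers...",
)
print(json.dumps(element, indent=2))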

When you're ready to begin fine-tuning, you'll need to gather source documents for upload to SeekrFlow. Aim to collect documents that are relevant to the domain, and that are structured in a clear and logical way, with headings and subheadings for readability.

Step 1: Preparing Your File for Upload

Supported File Types

  • Markdown (.md)
  • PDF (.pdf)
  • Word Documents (.docx)
  • JSON (.json)

Length Requirements

  • Your file should have at least 4 cohesive sections, each detailed enough to form a meaningful “chunk.”
  • Exact token counts aren't mandatory, but each section should contain enough substantive text to stand on its own as a meaningful chunk.

File Formatting Guidelines

Before uploading, make sure your files are properly formatted by following these steps:

PDF and DOCX

  1. Use clear headings and avoid images without context.
  2. Ensure text content is structured logically for conversion.

Markdown

  1. Use correct header hierarchy (#, ##, ###, etc.).
  2. Avoid missing or empty headers and skipped levels.
  3. Limit headers to six levels (######).
  4. Ensure all sections have meaningful content under them.

JSON

  1. Validate the JSON structure using a tool like JSONLint.
  2. Ensure all key-value pairs are well-formed and relevant.

Note: Google Docs users can export files as Markdown, PDF, or DOCX.
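
If you want to sanity-check a Markdown file before uploading, a quick script can catch most issues. Below is a minimal sketch of our own (not a SeekrFlow tool) that checks header hierarchy, flags empty headers, and counts sections:

import re

def check_markdown(path):
    """Lightweight pre-upload check: header hierarchy and section count."""
    headers = []
    with open(path) as f:
        for line in f:
            m = re.match(r"^(#{1,6})\s+\S", line)
            if m:
                headers.append(len(m.group(1)))
            elif re.match(r"^#{1,6}\s*$", line):
                print("Warning: empty header found")
    # Flag skipped levels (e.g., a ### directly under a #)
    for prev, curr in zip(headers, headers[1:]):
        if curr > prev + 1:
            print(f"Warning: header level jumps from {prev} to {curr}")
    # This guide recommends at least 4 cohesive sections
    top_level = sum(1 for h in headers if h <= 2)
    print(f"Found {len(headers)} headers, {top_level} top-level sections")

check_markdown("example.md")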

Once your files look good, proceed to the next step.


Step 2: Upload Training Files

Endpoint: PUT /v1/flow/files Upload Training File

import requests

# Define API endpoint
url = "https://flow.seekr.com/v1/flow/files"

# Set headers with API key
headers = {
    "Authorization": "YOUR_API_KEY",
    "accept": "application/json"
}

# Open and upload the file; the context manager closes the handle when done
with open("/path/to/your/example.md", "rb") as f:
    files = {
        "files": ("example.md", f),
        # Send "purpose" as a plain form field alongside the file
        "purpose": (None, "alignment")
    }
    response = requests.put(url, files=files, headers=headers)

print("Status Code:", response.status_code)
print("Response:", response.json())



Sample Response:

{
    "id": "file-123456789",
    "object": "file",
    "created_at": "example_timestamp",
    "type": "md",
    "purpose": "alignment",
    "filename": "example.md",
    "bytes": 64105,
    "created_by": "user"
}
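
The id field is what later steps refer to, so it's worth capturing right away (a small convenience, not an API requirement):

# Keep the file ID for the ingestion and generation steps below
file_id = response.json()["id"]  # e.g., "file-123456789"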



Step 3: Starting an Ingestion Job (for PDFs and DOCX)

Endpoint: POST /v1/flow/alignment/ingestion Multi-file Ingestion

If your original source was a PDF or DOCX, you must first convert it to Markdown.

If you already have a Markdown file, skip this step.

If your file has tables embedded, make sure there are no empty cells. An empty cell can cause the rest of the row's content to shift over to fill it.
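
If you'd like to catch empty cells before ingestion, a rough heuristic like the one below can flag suspect rows in Markdown sources; this is our own sketch, not a SeekrFlow feature:

def find_empty_table_cells(path):
    """Flag Markdown table rows that contain empty cells."""
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            stripped = line.strip()
            if not stripped.startswith("|"):
                continue
            cells = [c.strip() for c in stripped.strip("|").split("|")]
            # Skip separator rows like |---|:---:|
            if all(c and set(c) <= set("-:") for c in cells):
                continue
            if any(cell == "" for cell in cells):
                print(f"Line {lineno}: empty cell in row: {stripped}")

find_empty_table_cells("example.md")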

url = "https://flow.seekr.com/v1/flow/alignment/ingestion"

payload = {
    "file_ids": ["file-123456789"]  # The file ID returned from upload
}

response = requests.post(url, json=payload, headers=headers)

print("Status Code:", response.status_code)
print("Response:", response.json())



Sample Response:

{
    "id": "ij-123456789",
    "created_at": "example_timestamp",
    "status": "running",
    "output_files": []
}

Once ingestion is complete, you’ll receive a Markdown file that you can use for fine-tuning.


Step 4: Checking Ingestion Status (Optional)

Endpoint: GET /v1/flow/alignment/ingestion/{job_id} Get Detailed Job Status

After you initiate ingestion, you may want to check the status to confirm it has completed successfully.

import requests

job_id = "ij-123456789"
url = "https://flow.seekr.com/v1/flow/alignment/ingestion/{job_id}"

response = requests.get(url, headers=headers)
print("Status Code:", response.status_code)
print("Response:", response.json())

Sample Response:

{
  "id": "ij-123456789",
  "created_at": "example_timestamp",
  "status": "completed",
  "output_files": [
    "file-123456789"
  ]
}

Check the status field. Once it shows completed, your output file(s) will be available for the next step.
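
Rather than re-running the request by hand, you can poll until the job finishes. A minimal sketch; the 10-second interval is an arbitrary choice, and headers is reused from Step 2:

import time

import requests

job_id = "ij-123456789"
url = f"https://flow.seekr.com/v1/flow/alignment/ingestion/{job_id}"

# Poll every 10 seconds until the job leaves the queued/running states
while True:
    job = requests.get(url, headers=headers).json()
    if job["status"] not in ("queued", "running"):
        break
    time.sleep(10)

print("Final status:", job["status"])
print("Output files:", job.get("output_files", []))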


Step 5: Generating Fine-Tuning Data

Endpoint: POST /v1/flow/alignment/generate Generate IFT pairs

Now that you have a properly formatted file, you can start the data generation process, which chunks your documents and generates QA pairs from them.

Collect Your Markdown File ID
Use the ID of the final Markdown file from either Step 2 or the ingestion output in Step 3.

import requests

url = "https://flow.seekr.com/v1/flow/alignment/generate"

payload = {
    "files": [
        "file-123456789"
    ],
    "instructions": "I want to train a chatbot to answer questions about international drug cartels.",
    "purpose": "alignment"
}

response = requests.post(url, json=payload, headers=headers)

print("Status Code:", response.status_code)
print("Response:", response.json())

Sample Response:

{
    "id": "aj-123456789",
    "created_at": "example_timestamp",
    "status": "queued"
}
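
As in Step 2, it helps to capture the job ID from the response for the status check that follows:

# Save the alignment job ID for the status check in Step 6
job_id = response.json()["id"]  # e.g., "aj-123456789"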

Step 6: Checking Job Status (Optional)

Endpoint: GET /v1/flow/alignment/{job_id} Fetch Job Details

While the alignment job is running, you can poll its status to see when it is complete.

import requests

job_id = "aj-123456789"
url = "https://flow.seekr.com/v1/flow/alignment/{job_id}"

response = requests.get(url, headers=headers)
print("Status Code:", response.status_code)
print("Response:", response.json())

Sample Response:

{
  "id": "aj-123456789",
  "created_at": "example_timestamp",
  "status": "completed"
}

Once the status shows completed, the Parquet file will appear in your files list (e.g., "filename": "file-d7166390-962f-4d22-93a2-265d93c114e6-raft-qa-pairs.parquet").
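
Once you've downloaded the Parquet file, you can spot-check it locally. A minimal sketch, assuming pandas with pyarrow installed and the messages schema shown earlier; the filename is a placeholder:

import pandas as pd

# Load the generated QA pair file; the filename here is a placeholder
df = pd.read_parquet("file-d7166390-962f-4d22-93a2-265d93c114e6-raft-qa-pairs.parquet")

# Each row is one training element: an ID plus a list of
# system/user/assistant messages, as in the example near the top of this guide
print(df.columns.tolist())
print(df.iloc[0]["messages"])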


The stacked approach: Fine-Tuning + Context-Grounded Fine-Tuning

Currently, SeekrFlow offers Supervised Fine-Tuning (SFT) as the default method. Performing SFT before running a context-grounded fine-tuning job can significantly boost model performance: the first run teaches the model the core patterns of the task, setting up the subsequent context-grounded job for success.

Recommended Workflow:

  1. Fine-tune the base model using the QA pair file generated from your source documents. Focus on data quality and task relevance.
  2. With the fine-tuned model as the starting point, generate context-grounded data and run a second job using context-grounded data to adjust the model for best results.