Generating Fine-Tuning Data with the AI-Ready Data Engine

Seekr's AI-Ready Data Engine creates complete, reliable AI-ready datasets from your data for use with a range of fine-tuning techniques.

Automatic Data Generation with the AI-Ready Data Engine

The AI-Ready Data Engine is a multi-stage, agentic system that autonomously transforms diverse data formats into high-quality, AI-ready datasets that integrate seamlessly with AI applications—delivering superior results faster and at dramatically lower costs than traditional data preparation methods.

How Does the AI-Ready Data Engine Work?

Our engine processes and integrates diverse data types—including files, databases, web content, audio, and video—to build a comprehensive knowledge base from both structured and unstructured sources.

Benefits of Using the Data Engine

High-Quality Automatic Data Creation for Fine-Tuning: Eliminates manual data preparation, saving time and resources while providing a premium-quality dataset that can be used for a range of fine-tuning techniques.

Robust against Preference Leakage: Designed with known base model contamination issues in mind.

Agentic Routing and Tool Use: Intelligently routes to appropriate models, and uses tools such as web APIs and code interpreters for data enhancement.

Beyond Data Generation and Augmentation

Our engine can create other integrated components for end-to-end AI applications, including:

  • Tools for model use
  • Guardrails applied at inference
  • Domain-specific reward functions for reinforcement learning in reasoning models

Upload Training Files

Use SeekrFlow's API to upload key documents and generate a fine-tuning dataset (see Upload Training File). Documents can be in the following formats:

  • PDF
  • DOCX
  • Markdown
  • JSON
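Before uploading, it can help to screen local files against the formats listed above. The following is a minimal sketch, not part of SeekrFlow's API: the helper name `split_by_support` is hypothetical, and the mapping from formats to file extensions (`.pdf`, `.docx`, `.md`, `.markdown`, `.json`) is an assumption. It checks extensions only and does not inspect file contents.

```python
from pathlib import Path

# Assumed mapping from the formats listed above to file extensions.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".md", ".markdown", ".json"}


def split_by_support(file_paths):
    """Return (supported, unsupported) lists based on file extension only."""
    supported, unsupported = [], []
    for file_path in file_paths:
        # Compare case-insensitively so "report.PDF" is treated like "report.pdf".
        suffix = Path(file_path).suffix.lower()
        if suffix in SUPPORTED_EXTENSIONS:
            supported.append(file_path)
        else:
            unsupported.append(file_path)
    return supported, unsupported


supported, unsupported = split_by_support(["file_1.docx", "file_2.PDF", "notes.txt"])
print("Will upload:", supported)
print("Skipping:", unsupported)
```

The `supported` list can then serve as the list of paths to upload, so that unsupported files are skipped before any request is made.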

The generated dataset can then be used to initiate a fine-tuning job. The output will be transformed into a Parquet file, a format consumable by SeekrFlow's fine-tuning service. If the Data Engine is used in combination with user-provided training data, the dataset it generates and the user-provided data must be combined before they are passed to the fine-tuning service. The following example code shows how to upload either a single file or multiple files of varying types:

Upload a single file

import requests

# Define API Endpoint
url = "https://flow.seekr.com/v1/flow/files"

# Set Headers with API Key
headers = {
    "Authorization": "YOUR_API_KEY",
    "accept": "application/json"
}

# Open the file in a context manager so the handle is closed after the upload
with open("/path/to/your/example.pdf", "rb") as f:
    files = {
        "files": ("example.pdf", f),
        "purpose": (None, "alignment")
    }

    # Make API Request
    response = requests.put(url, files=files, headers=headers)

# Print Response
print("Status Code:", response.status_code)
print("Response:", response.json())

Example response

Status Code: 200

Response:
{
  "id": "file_1234567890",
  "object": "file",
  "created_at": "example_timestamp",
  "type": "pdf",
  "purpose": "alignment",
  "filename": "example.pdf",
  "bytes": 102400,
  "created_by": "user"
}

Upload multiple files

import requests
import os

# Define API Endpoint
url = "https://flow.seekr.com/v1/flow/files"

# Set Headers with API Key
headers = {
    "Authorization": "YOUR_API_KEY",
    "accept": "application/json"
}

# Define file_paths
file_paths = ['file_1.docx', 'file_2.pdf']

# Create a multipart/form-data request to open and upload files
with requests.Session() as session:
    for file_path in file_paths:
        with open(file_path, 'rb') as f:
            response = session.put(
                url,
                files={'files': (os.path.basename(file_path), f)},
                headers=headers,
                data={'purpose': 'alignment'}
            )
            print(f"Status Code for {os.path.basename(file_path)}:", response.status_code)
            print(f"Response for {os.path.basename(file_path)}:", response.json())

Example response

Status Code for file_1.docx: 200
Response for file_1.docx:
{
  'id': 'file-12345',
  'object': 'file',
  'created_at': 'example_timestamp',
  'type': 'docx',
  'purpose': 'alignment',
  'filename': 'file_1.docx',
  'bytes': 348111,
  'created_by': 'user'
}

Status Code for file_2.pdf: 200
Response for file_2.pdf:
{
  'id': 'file-67890', 
  'object': 'file', 
  'created_at': 'example_timestamp', 
  'type': 'pdf', 
  'purpose': 'alignment', 
  'filename': 'file_2.pdf', 
  'bytes': 1711963, 
  'created_by': 'user'
}
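
Each successful upload response carries an `id` field. If a later step needs to refer to the uploaded files, collecting those IDs as you go may be convenient. The sketch below assumes responses shaped like the examples above; the helper name `collect_file_ids` and the error payload are hypothetical, and whether downstream steps take these IDs as input is an assumption rather than something this page states.

```python
def collect_file_ids(responses):
    """Return the 'id' of every response dict that contains one."""
    return [r["id"] for r in responses if isinstance(r, dict) and "id" in r]


# Parsed JSON bodies, e.g. gathered from response.json() in the upload loop.
example_responses = [
    {"id": "file-12345", "object": "file", "filename": "file_1.docx"},
    {"id": "file-67890", "object": "file", "filename": "file_2.pdf"},
    {"detail": "hypothetical error payload without an id"},
]

file_ids = collect_file_ids(example_responses)
print(file_ids)
```

Entries without an `id`, such as a failed upload, are skipped rather than raising an error, so the caller can compare the number of IDs against the number of files submitted.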