AI-ready data

AI-ready data jobs generate and transform datasets, converting raw content into training-ready formats. Their structured outputs are formatted for model fine-tuning and alignment.

How it works

The AI-ready data pipeline transforms uploaded files into structured training datasets:

File selection – Choose source files from storage to use as training material

Dataset type selection – Specify the type of dataset to generate based on fine-tuning method

Generation – The system processes files and creates structured question-and-answer pairs or training examples

Output – Generated datasets are saved in formats compatible with fine-tuning workflows
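
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the product's implementation: `DatasetType`, `generate_examples`, and `run_pipeline` are hypothetical names, the JSONL output format is an assumption, and the generation step is a stub that emits one placeholder example per paragraph where a real system would call a model.

```python
import json
from enum import Enum


class DatasetType(Enum):
    # Hypothetical identifiers for the two dataset types described on this page.
    INSTRUCTION = "instruction"
    CONTEXT_GROUNDED = "context_grounded"


def generate_examples(text: str, dataset_type: DatasetType) -> list[dict]:
    """Step 3 (stub): emit one training example per non-empty paragraph."""
    examples = []
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        # Placeholder question; a real generator would write a meaningful one.
        question = f"Placeholder question about: {paragraph[:30]}"
        if dataset_type is DatasetType.INSTRUCTION:
            examples.append({"input": question, "output": paragraph})
        else:
            examples.append(
                {"query": question, "context": paragraph, "response": paragraph}
            )
    return examples


def run_pipeline(files: dict[str, str], dataset_type: DatasetType) -> str:
    """Steps 1, 2, and 4: take selected files and a dataset type, return JSONL text."""
    lines = []
    for name, text in files.items():
        for example in generate_examples(text, dataset_type):
            lines.append(json.dumps({"source": name, **example}))
    return "\n".join(lines)


selected = {"guide.txt": "First paragraph.\n\nSecond paragraph."}
print(run_pipeline(selected, DatasetType.INSTRUCTION))
```

Keeping the dataset type as an explicit argument mirrors the selection step: the same source files can yield either record shape.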

Integration with fine-tuning

AI-ready datasets feed directly into fine-tuning workflows:

Instruction fine-tuning – Standard instruction datasets train models on task-specific examples
Context-grounded fine-tuning – Context-grounded datasets train models to use retrieval effectively
Dataset quality – Higher-quality source content and generation produce better fine-tuned models

The data engine automates the transition from raw files to training-ready datasets, reducing manual dataset preparation effort.

Datasets for fine-tuning

AI-ready data supports multiple dataset formats aligned to fine-tuning methods:

Instruction fine-tuning

Standard instruction datasets consist of traditional question-and-answer pairs aligned to task-specific instructions. Each example demonstrates how the model should respond to particular queries or prompts.

Structure:

Input – The question, prompt, or instruction
Output – The expected response or completion

Use cases:

Teaching domain-specific knowledge
Customizing response style and tone
Training task-specific behaviors

These datasets are used with instruction fine-tuning to embed knowledge directly into model parameters.
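
As an illustration of the Input/Output structure, the sketch below builds one instruction record and serializes it as a JSON line. The key names `input` and `output`, the helper `make_instruction_example`, and the JSONL serialization are assumptions made for this example; the actual field names depend on the fine-tuning workflow.

```python
import json


def make_instruction_example(input_text: str, output_text: str) -> dict:
    """Build one instruction record, rejecting empty fields."""
    if not input_text.strip() or not output_text.strip():
        raise ValueError("input and output must both be non-empty")
    return {"input": input_text, "output": output_text}


record = make_instruction_example(
    "What does an instruction dataset contain?",
    "Question-and-answer pairs aligned to task-specific instructions.",
)
# One JSON object per line is the usual layout of a JSONL file.
print(json.dumps(record))
```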

Context-grounded fine-tuning

Context-grounded datasets consist of question-and-answer pairs that reference source documents. Each example includes the query, relevant context from source files, and the correct response grounded in that context.

Structure:

Query – The question or prompt
Context – Relevant excerpts from source documents
Response – Answer derived from the provided context

Use cases:

Training models to use external knowledge bases effectively
Teaching retrieval-aware response generation
Building models that cite sources and stay grounded in provided information

These datasets are used with context-grounded fine-tuning to train models for retrieval-augmented generation workflows.
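
A context-grounded record can be modeled the same way, with the context kept as a separate field. The sketch below is illustrative only: `GroundedExample`, `to_prompt`, and the prompt layout are hypothetical, chosen to show how the three fields might be combined at training time.

```python
from dataclasses import dataclass


@dataclass
class GroundedExample:
    query: str
    context: list[str]  # excerpts taken from source documents
    response: str


def to_prompt(example: GroundedExample) -> str:
    """Render numbered context excerpts followed by the question."""
    excerpts = "\n".join(
        f"[{i}] {text}" for i, text in enumerate(example.context, start=1)
    )
    return f"Context:\n{excerpts}\n\nQuestion: {example.query}"


example = GroundedExample(
    query="Which fields does a context-grounded example include?",
    context=["Each example includes the query, relevant context, and the response."],
    response="The query, the relevant context, and the grounded response [1].",
)
print(to_prompt(example))
```

Numbering the excerpts gives the response something concrete to cite, which is useful for the source-citing use case listed above.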

Generation parameters

Dataset generation can be configured with parameters that control output characteristics:

Number of examples – How many training pairs to generate from source content
Diversity settings – Controls for question variety and coverage across source material
Quality filters – Criteria for ensuring generated examples meet minimum standards
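
One way to picture these parameters is as a small validated configuration object. The field names, defaults, and value ranges below (`num_examples`, `diversity`, `min_answer_chars`) are hypothetical and stand in for whatever settings the product actually exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationParams:
    # All names and defaults are illustrative assumptions.
    num_examples: int = 100      # number of training pairs to generate
    diversity: float = 0.5       # 0.0 = narrow questions, 1.0 = maximum variety
    min_answer_chars: int = 20   # a simple quality filter on answer length

    def __post_init__(self) -> None:
        if self.num_examples <= 0:
            raise ValueError("num_examples must be positive")
        if not 0.0 <= self.diversity <= 1.0:
            raise ValueError("diversity must be between 0.0 and 1.0")
        if self.min_answer_chars < 0:
            raise ValueError("min_answer_chars must not be negative")


params = GenerationParams(num_examples=50, diversity=0.8)
print(params)
```

Validating at construction time means a misconfigured job fails before any generation work starts.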

Dataset quality

Generated datasets are optimized for training effectiveness:

Relevance – Questions and answers are derived from actual source content
Consistency – Output format matches fine-tuning requirements
Coverage – Examples span the breadth of source material
Validation – Generated datasets can be reviewed before use in training
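
Some of these properties can be checked mechanically during review. The sketch below shows a consistency check (required fields present and non-empty) and a coverage measure (share of source chunks referenced by at least one example). The function names and the `chunk_id` field are assumptions for this example, not part of the product.

```python
def find_problems(examples: list[dict], required_keys: set[str]) -> list[tuple[int, str]]:
    """Consistency: report examples with missing or empty required fields."""
    problems = []
    for index, example in enumerate(examples):
        missing = required_keys - set(example)
        if missing:
            problems.append((index, f"missing fields: {sorted(missing)}"))
            continue
        empty = sorted(k for k in required_keys if not str(example[k]).strip())
        if empty:
            problems.append((index, f"empty fields: {empty}"))
    return problems


def coverage(examples: list[dict], total_chunks: int) -> float:
    """Coverage: fraction of source chunks referenced by at least one example."""
    if total_chunks <= 0:
        raise ValueError("total_chunks must be positive")
    covered = {ex["chunk_id"] for ex in examples if "chunk_id" in ex}
    return len(covered) / total_chunks


examples = [
    {"input": "Q1", "output": "A1", "chunk_id": 0},
    {"input": "Q2", "output": " ", "chunk_id": 2},
    {"input": "Q3"},
]
print(find_problems(examples, {"input", "output"}))
print(coverage(examples, total_chunks=4))
```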

Dataset management

AI-ready datasets are managed alongside other data engine outputs:

Status tracking – Monitor generation job progress
Review – Inspect generated examples before fine-tuning
Versioning – Maintain multiple dataset versions from the same sources
Export – Download datasets in standard formats
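
Export and versioning can be illustrated together. The sketch below writes examples as JSONL and derives a short version tag from the content hash, so that two exports with identical content receive the same tag. Both `export_jsonl` and `version_tag` are hypothetical helpers, and the content-hash scheme is one possible approach rather than the product's documented behavior.

```python
import hashlib
import json


def export_jsonl(examples: list[dict]) -> str:
    """Serialize examples as one JSON object per line, with stable key order."""
    return "".join(json.dumps(ex, sort_keys=True) + "\n" for ex in examples)


def version_tag(jsonl_text: str) -> str:
    """Derive a short, deterministic tag from the exported content."""
    return hashlib.sha256(jsonl_text.encode("utf-8")).hexdigest()[:12]


v1 = export_jsonl([{"input": "Q1", "output": "A1"}])
v2 = export_jsonl([{"input": "Q1", "output": "A1 (revised)"}])
print(version_tag(v1), version_tag(v2))
```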
