AI-ready data

Generate structured training datasets from raw content for model fine-tuning.

Supported on: UI, API, SDK

AI-ready data jobs convert raw content into training-ready datasets. Their structured outputs are intended for model fine-tuning and alignment.

How it works

The AI-ready data pipeline transforms uploaded files into structured training datasets:

  1. File selection – Choose source files from storage to use as training material
  2. Dataset type selection – Specify the type of dataset to generate, based on the intended fine-tuning method
  3. Generation – The system processes files and creates structured question-and-answer pairs or training examples
  4. Output – Generated datasets are saved in formats compatible with fine-tuning workflows
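
The four steps above can be sketched in Python as follows. This is a minimal illustration, not the product's API: the function names (`select_files`, `generate_examples`, `run_pipeline`), the dataset type strings, the record keys, and the JSON Lines output format are all assumptions. The generation step is a stub that emits one example per paragraph, where a real system would use a model to write questions and answers.

```python
import json
from pathlib import Path


def select_files(storage_dir, suffix=".txt"):
    """Step 1: choose source files from a storage directory."""
    return sorted(Path(storage_dir).glob(f"*{suffix}"))


def generate_examples(text, dataset_type):
    """Step 3 (stub): emit one example per non-empty paragraph.

    A real generator would write questions and answers with a model;
    this placeholder only shows how data flows through the step.
    """
    examples = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        question = f"What does the source say in: '{paragraph[:30]}...'?"
        if dataset_type == "context_grounded":
            examples.append({"query": question, "context": paragraph, "response": paragraph})
        elif dataset_type == "standard_instruction":
            examples.append({"input": question, "output": paragraph})
        else:
            raise ValueError(f"unknown dataset type: {dataset_type}")
    return examples


def run_pipeline(storage_dir, dataset_type, out_path):
    """Steps 1-4: select files, generate examples, save them as JSON Lines."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in select_files(storage_dir):
            text = path.read_text(encoding="utf-8")
            # Step 2: the caller's dataset_type decides the record shape.
            for example in generate_examples(text, dataset_type):
                out.write(json.dumps(example, ensure_ascii=False) + "\n")
                count += 1
    return count
```

Calling `run_pipeline("sources/", "standard_instruction", "dataset.jsonl")` would write one record per paragraph found in the `.txt` files under the hypothetical `sources/` directory and return the number of records written.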

Dataset types

AI-ready data supports two dataset types, each aligned to a fine-tuning method:

Standard instruction datasets

Standard instruction datasets consist of traditional question-and-answer pairs aligned to task-specific instructions. Each example demonstrates how the model should respond to particular queries or prompts.

Structure:

  • Input – The question, prompt, or instruction
  • Output – The expected response or completion

Use cases:

  • Teaching domain-specific knowledge
  • Customizing response style and tone
  • Training task-specific behaviors

These datasets are used with instruction fine-tuning to embed knowledge directly into model parameters.
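
As an illustration, a standard instruction example can be modeled as a two-field record and serialized one record per line (JSON Lines). The field names `input` and `output` mirror the structure described above, but the exact keys and file format the product emits are assumptions here, and the sample content is invented.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class InstructionExample:
    input: str   # the question, prompt, or instruction
    output: str  # the expected response or completion


def to_jsonl(examples):
    """Serialize examples as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(e), ensure_ascii=False) for e in examples)


examples = [
    InstructionExample(
        input="What is the standard warranty period?",
        output="The standard warranty period is 24 months from purchase.",
    ),
]
jsonl_text = to_jsonl(examples)
print(jsonl_text)
```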

Context-grounded datasets

Context-grounded datasets consist of question-and-answer pairs that reference source documents. Each example includes the query, relevant context from source files, and the correct response grounded in that context.

Structure:

  • Query – The question or prompt
  • Context – Relevant excerpts from source documents
  • Response – Answer derived from the provided context

Use cases:

  • Training models to use external knowledge bases effectively
  • Teaching retrieval-aware response generation
  • Building models that cite sources and stay grounded in provided information

These datasets are used with context-grounded fine-tuning to train models for retrieval-augmented generation workflows.
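
A context-grounded example adds the supporting excerpts to the record. The sketch below shows one plausible shape for such a record and a way to render it into a single training prompt. The class name, field names, and prompt template are hypothetical and not taken from the product.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GroundedExample:
    query: str          # the question or prompt
    context: List[str]  # relevant excerpts from source documents
    response: str       # answer derived from the provided context


def build_prompt(example):
    """Render context and query into one prompt string (assumed template)."""
    numbered = "\n".join(
        f"[{i}] {excerpt}" for i, excerpt in enumerate(example.context, start=1)
    )
    return f"Context:\n{numbered}\n\nQuestion: {example.query}\nAnswer:"


example = GroundedExample(
    query="How long is the warranty?",
    context=["The standard warranty period is 24 months from purchase."],
    response="The warranty lasts 24 months from the purchase date [1].",
)
prompt = build_prompt(example)
```

Numbering the excerpts lets the target response refer to them by index, which is one simple way to encourage answers that cite their sources.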

Generation parameters

Dataset generation can be configured with parameters that control output characteristics:

  • Number of examples – How many training pairs to generate from source content
  • Diversity settings – Controls for question variety and coverage across source material
  • Quality filters – Criteria for ensuring generated examples meet minimum standards
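
The parameters above could be grouped into a single validated configuration object. This is a sketch under stated assumptions: the parameter names, defaults, and the 0.0 to 1.0 diversity scale are invented for illustration, and a minimum answer length stands in for whatever quality filters the product actually applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationParams:
    num_examples: int = 100     # how many training pairs to generate
    diversity: float = 0.5      # assumed scale: 0.0 = narrow, 1.0 = most varied
    min_answer_chars: int = 20  # simple stand-in for a quality filter

    def __post_init__(self):
        # Reject out-of-range values early, before a job is started.
        if self.num_examples <= 0:
            raise ValueError("num_examples must be positive")
        if not 0.0 <= self.diversity <= 1.0:
            raise ValueError("diversity must be between 0.0 and 1.0")
        if self.min_answer_chars < 0:
            raise ValueError("min_answer_chars must be non-negative")


params = GenerationParams(num_examples=50, diversity=0.8)
```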

Dataset quality

Dataset generation targets the following quality properties:

  • Relevance – Questions and answers are derived from actual source content
  • Consistency – Output format matches fine-tuning requirements
  • Coverage – Examples span the breadth of source material
  • Validation – Generated datasets can be reviewed before use in training
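
A review pass over a generated dataset might combine a format check with a rough relevance check. The sketch below is illustrative only: the required keys assume the standard instruction schema, and word overlap with the source text is a crude proxy for relevance, not the method the product uses.

```python
REQUIRED_KEYS = {"input", "output"}  # assumed schema for a standard instruction record


def has_valid_format(record):
    """Consistency: every required field is present and a non-empty string."""
    return all(
        isinstance(record.get(key), str) and record[key].strip()
        for key in REQUIRED_KEYS
    )


def word_overlap(answer, source_text):
    """Relevance (crude proxy): share of answer words that also occur in the source."""
    answer_words = set(answer.lower().split())
    source_words = set(source_text.lower().split())
    if not answer_words:
        return 0.0
    return len(answer_words & source_words) / len(answer_words)


def review(records, source_text, min_overlap=0.5):
    """Split records into accepted and rejected lists for manual review."""
    accepted, rejected = [], []
    for record in records:
        ok = (
            has_valid_format(record)
            and word_overlap(record["output"], source_text) >= min_overlap
        )
        (accepted if ok else rejected).append(record)
    return accepted, rejected
```

Rejected records are returned rather than dropped, so a reviewer can inspect them before deciding whether to regenerate or discard them.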

Dataset management

AI-ready datasets are managed alongside other data engine outputs:

  • Status tracking – Monitor generation job progress
  • Review – Inspect generated examples before fine-tuning
  • Versioning – Maintain multiple dataset versions from the same sources
  • Export – Download datasets in standard formats
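
One way to tell dataset versions apart is to derive an identifier from the dataset's content, so that regenerating from the same sources with different results yields a different version. The sketch below is an assumption about how this could be done, not a description of the product's versioning scheme, and the function name `dataset_version` is hypothetical.

```python
import hashlib
import json


def dataset_version(records):
    """Return a short content hash that changes whenever any record changes."""
    # sort_keys makes the serialization independent of key order within a record.
    canonical = "\n".join(
        json.dumps(record, sort_keys=True, ensure_ascii=False) for record in records
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


v1 = dataset_version([{"input": "q1", "output": "a1"}])
v2 = dataset_version([{"input": "q1", "output": "a1 (revised)"}])
```

Here `v1` and `v2` differ because one record changed, while two datasets with identical records in the same order share an identifier.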

Integration with fine-tuning

AI-ready datasets feed directly into fine-tuning workflows:

  • Instruction fine-tuning – Standard instruction datasets train models on task-specific examples
  • Context-grounded fine-tuning – Context-grounded datasets train models to use retrieval effectively
  • Dataset quality – Higher-quality source content and generation produce better fine-tuned models

The data engine automates the transition from raw files to training-ready datasets, reducing manual dataset preparation effort.