How it works
The AI-ready data pipeline transforms uploaded files into structured training datasets:
- Generation – The system processes files and creates structured question-and-answer pairs or training examples
Integration with fine-tuning
AI-ready datasets feed directly into fine-tuning workflows:
- Instruction fine-tuning – Standard instruction datasets train models on task-specific examples
- Context-grounded fine-tuning – Context-grounded datasets train models to use retrieval effectively
- Dataset quality – Higher-quality source content and generation produce better fine-tuned models
Datasets for fine-tuning
AI-ready data supports multiple dataset formats aligned to fine-tuning methods:
Instruction fine-tuning
Standard instruction datasets consist of traditional question-and-answer pairs aligned to task-specific instructions. Each example demonstrates how the model should respond to particular queries or prompts.
Structure:
- Input – The question, prompt, or instruction
- Output – The expected response or completion
Use cases:
- Teaching domain-specific knowledge
- Customizing response style and tone
- Training task-specific behaviors
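To make the Input/Output structure of a standard instruction example concrete, here is a minimal sketch of one record serialized as JSON Lines. The field names `input` and `output` and the `InstructionExample` class are assumptions that mirror the structure described in this section; they are not a confirmed schema of the product.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class InstructionExample:
    # Field names are hypothetical; they mirror the Input/Output structure.
    input: str   # the question, prompt, or instruction
    output: str  # the expected response or completion


examples = [
    InstructionExample(
        input="Summarize the refund policy in one sentence.",
        output="Customers can request a full refund within 30 days of purchase.",
    ),
]

# One JSON object per line (JSONL), a layout many fine-tuning tools accept.
jsonl = "\n".join(json.dumps(asdict(e)) for e in examples)
print(jsonl)
```

Each line of the resulting text is an independent record, so examples can be appended or filtered without re-parsing the whole file.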
Context-grounded fine-tuning
Context-grounded datasets consist of question-and-answer pairs that reference source documents. Each example includes the query, relevant context from source files, and the correct response grounded in that context.
Structure:
- Query – The question or prompt
- Context – Relevant excerpts from source documents
- Response – Answer derived from the provided context
Use cases:
- Training models to use external knowledge bases effectively
- Teaching retrieval-aware response generation
- Building models that cite sources and stay grounded in provided information
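As a rough illustration of the Query/Context/Response structure, the sketch below holds one context-grounded example and renders its context and query into a single prompt string. The class `GroundedExample`, the function `render_prompt`, and the prompt layout are all hypothetical choices for illustration; the actual format used for training may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GroundedExample:
    # Hypothetical field names mirroring Query / Context / Response.
    query: str          # the question or prompt
    context: List[str]  # relevant excerpts from source documents
    response: str       # answer derived from the provided context


def render_prompt(example: GroundedExample) -> str:
    """Join numbered context excerpts and the query into one prompt string."""
    numbered = "\n".join(
        f"[{i}] {text}" for i, text in enumerate(example.context, start=1)
    )
    return f"Context:\n{numbered}\n\nQuestion: {example.query}\nAnswer:"


example = GroundedExample(
    query="How long is the warranty?",
    context=["All devices include a two-year limited warranty."],
    response="The warranty lasts two years [1].",
)
prompt = render_prompt(example)
print(prompt)
```

Numbering the excerpts gives the response something to cite, which fits the goal of training models that stay grounded in the provided information.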
Generation parameters
Dataset generation can be configured with parameters that control output characteristics:
- Number of examples – How many training pairs to generate from source content
- Diversity settings – Controls for question variety and coverage across source material
- Quality filters – Criteria for ensuring generated examples meet minimum standards
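The three parameter groups above could be modeled roughly as follows. This is only a sketch: the names `GenerationParams`, `num_examples`, `diversity`, and `min_answer_chars`, the 0.0 to 1.0 diversity scale, and the length-based filter are assumptions chosen for illustration, not the product's actual settings.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GenerationParams:
    # All names and defaults here are hypothetical.
    num_examples: int = 100     # how many training pairs to keep
    diversity: float = 0.5      # assumed scale: 0.0 (narrow) to 1.0 (varied)
    min_answer_chars: int = 20  # a simple stand-in for a quality filter

    def __post_init__(self) -> None:
        if self.num_examples <= 0:
            raise ValueError("num_examples must be positive")
        if not 0.0 <= self.diversity <= 1.0:
            raise ValueError("diversity must be between 0.0 and 1.0")


def apply_quality_filter(
    pairs: List[Dict[str, str]], params: GenerationParams
) -> List[Dict[str, str]]:
    """Drop pairs whose answer is too short, then cap the result size."""
    kept = [
        p for p in pairs
        if len(p["output"].strip()) >= params.min_answer_chars
    ]
    return kept[: params.num_examples]
```

Validating the parameters at construction time means a bad configuration fails before any generation work starts.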
Dataset quality
Generated datasets are optimized for training effectiveness:
- Relevance – Questions and answers are derived from actual source content
- Consistency – Output format matches fine-tuning requirements
- Coverage – Examples span the breadth of source material
- Validation – Generated datasets can be reviewed before use in training
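When reviewing a generated dataset, consistency and coverage can be checked with a few lines of code. The sketch below assumes each example is a dictionary with `input` and `output` fields plus a `source` field naming the originating document; those field names and both helper functions are hypothetical.

```python
from collections import Counter
from typing import Dict, List

REQUIRED_FIELDS = ("input", "output")  # assumed schema


def check_consistency(examples: List[Dict[str, str]]) -> List[int]:
    """Return indices of examples with a missing or empty required field."""
    bad = []
    for i, ex in enumerate(examples):
        if any(
            not isinstance(ex.get(f), str) or not ex[f].strip()
            for f in REQUIRED_FIELDS
        ):
            bad.append(i)
    return bad


def coverage(examples: List[Dict[str, str]], source_ids: List[str]) -> float:
    """Fraction of source documents referenced by at least one example."""
    if not source_ids:
        return 0.0
    seen = Counter(ex.get("source") for ex in examples)
    covered = [s for s in source_ids if seen[s] > 0]
    return len(covered) / len(source_ids)
```

A low coverage value suggests that some source files produced no examples and may be worth regenerating or inspecting by hand.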
Dataset management
AI-ready datasets are managed alongside other data engine outputs:
- Status tracking – Monitor generation job progress
- Review – Inspect generated examples before fine-tuning
- Versioning – Maintain multiple dataset versions from the same sources
- Export – Download datasets in standard formats
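For versioning and export, one possible approach is to derive a version identifier from the dataset contents and write the examples as JSON Lines. This is a sketch under assumptions: the helpers `dataset_version` and `export_jsonl`, the file-naming scheme, and the choice of JSONL as the export format are illustrative and are not taken from the product.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def dataset_version(examples: List[Dict[str, str]]) -> str:
    """Short content hash: identical examples yield the same version id."""
    canonical = json.dumps(examples, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def export_jsonl(examples: List[Dict[str, str]], out_dir: str) -> Path:
    """Write one JSON object per line to a file named after the version."""
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"dataset-{dataset_version(examples)}.jsonl"
    with path.open("w", encoding="utf-8") as fh:
        for ex in examples:
            fh.write(json.dumps(ex, ensure_ascii=False) + "\n")
    return path


# Usage: export a one-example dataset into a temporary directory.
sample = [{"input": "What is the return window?", "output": "30 days."}]
with tempfile.TemporaryDirectory() as tmp:
    exported = export_jsonl(sample, tmp)
    line_count = len(exported.read_text(encoding="utf-8").splitlines())
```

Because the identifier depends only on content, regenerating from the same sources with different results produces a new file instead of overwriting the previous version.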