Data engine
SDK workflows for uploading files, running ingestion, and creating data jobs for fine-tuning and semantic search.
The data engine transforms raw content into structured, AI-ready data. It manages the complete data lifecycle from file ingestion through preparation for training and retrieval workflows.
Fine-tuning data workflows are built around data jobs — managed operations that bundle file ingestion, Markdown review, prompt configuration, and alignment generation into a single tracked unit.
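As a rough mental model, a data job can be pictured as a small state machine that moves through the stages bundled into it. The state names below are illustrative only, not the SDK's actual status values:

```python
from enum import Enum

class JobState(Enum):
    # Illustrative states; the real SDK's status values may differ.
    CREATED = "created"
    INGESTING = "ingesting"        # file ingestion
    REVIEWING = "reviewing"        # Markdown review
    CONFIGURING = "configuring"    # prompt configuration
    GENERATING = "generating"      # alignment generation
    COMPLETED = "completed"
    CANCELLED = "cancelled"

# Allowed forward transitions, mirroring the stages tracked by one job.
TRANSITIONS = {
    JobState.CREATED: {JobState.INGESTING, JobState.CANCELLED},
    JobState.INGESTING: {JobState.REVIEWING, JobState.CANCELLED},
    JobState.REVIEWING: {JobState.CONFIGURING, JobState.CANCELLED},
    JobState.CONFIGURING: {JobState.GENERATING, JobState.CANCELLED},
    JobState.GENERATING: {JobState.COMPLETED, JobState.CANCELLED},
}

def advance(state: JobState, target: JobState) -> JobState:
    """Move a job to the next stage, rejecting invalid jumps."""
    if target not in TRANSITIONS.get(state, set()):
        raise ValueError(f"cannot move from {state.value} to {target.value}")
    return target
```

The point of the single tracked unit is exactly this: each stage has a defined place in the lifecycle, and cancellation is valid from any non-terminal stage.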
For a conceptual overview of the data engine and its capabilities, see Data engine.
Data engine workflows
Prepare and ingest files
Upload source documents (PDF, DOCX, PPT, Markdown) and convert them to Markdown via the ingestion API.
Monitor ingestion
Track ingestion progress through data job status, per-file records, and timeline events.
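A common client-side pattern for tracking progress is to poll the job's status with capped exponential backoff until it reaches a terminal state. This sketch stubs the status call with a plain callable; fetch_status is a hypothetical stand-in for whatever status lookup the SDK provides:

```python
import time
from itertools import count

# Terminal statuses are assumed here; the SDK's actual set may differ.
TERMINAL = {"completed", "failed", "cancelled"}

def poll_until_done(fetch_status, initial_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fetch_status() until it returns a terminal status, backing off between calls."""
    delay = initial_delay
    for _ in count():
        status = fetch_status()
        if status in TERMINAL:
            return status
        sleep(delay)
        delay = min(delay * 2, max_delay)  # cap the backoff

# Example with a stubbed status sequence standing in for the real API.
statuses = iter(["queued", "ingesting", "ingesting", "completed"])
result = poll_until_done(lambda: next(statuses), sleep=lambda _: None)
```

Injecting the sleep function keeps the loop testable; production code would keep the default time.sleep.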
Create and populate a vector database
Set up a vector database and run document ingestion to generate embeddings for semantic search and retrieval.
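Conceptually, semantic search over the vector database ranks document chunks by the similarity between their embeddings and the query embedding. A toy version with hand-made three-dimensional vectors (real embeddings come from the ingestion step and are much higher-dimensional):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search(query_vec, index, top_k=2):
    """Return the top_k (doc_id, score) pairs, highest similarity first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:top_k]

# Tiny hand-built index; a real one holds embeddings generated during ingestion.
index = {
    "pricing.md": [0.9, 0.1, 0.0],
    "setup.md": [0.1, 0.9, 0.1],
    "faq.md": [0.5, 0.5, 0.2],
}
hits = search([1.0, 0.0, 0.0], index)
```

Here the query vector points along the same axis as the pricing chunk's embedding, so that chunk ranks first.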
Create instruction fine-tuning data
Use a principle_files data job to generate a QA pair dataset for instruction fine-tuning.
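The output of such a job is a set of question-answer records. The field names below are illustrative rather than the SDK's exact schema, but many fine-tuning pipelines consume JSONL along these lines:

```python
import json

# Illustrative QA pairs; the fields a data job actually emits may differ.
qa_pairs = [
    {"question": "What formats can be ingested?",
     "answer": "PDF, DOCX, PPT, and Markdown."},
    {"question": "What does a data job bundle?",
     "answer": "Ingestion, Markdown review, prompt configuration, and alignment generation."},
]

# Serialize to JSONL (one JSON object per line) and read it back.
jsonl = "\n".join(json.dumps(pair) for pair in qa_pairs)
records = [json.loads(line) for line in jsonl.splitlines()]
```

One record per line keeps large datasets streamable: a trainer can read and shuffle pairs without loading the whole file as a single JSON document.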
Create context-grounded fine-tuning data
Use context_grounded_files or context_grounded_vector_db data jobs to generate training data grounded in an existing knowledge source.
Manage data jobs
List, filter, update metadata, and cancel data jobs.
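Once a page of jobs has been fetched, simple filtering can also happen client-side. This sketch assumes each job is a dict with id, status, and created_at fields, which is a hypothetical shape, not the SDK's actual response model:

```python
from datetime import datetime

# Hypothetical job records; the real SDK returns richer objects.
jobs = [
    {"id": "dj_1", "status": "completed", "created_at": datetime(2024, 5, 1)},
    {"id": "dj_2", "status": "running",   "created_at": datetime(2024, 5, 3)},
    {"id": "dj_3", "status": "cancelled", "created_at": datetime(2024, 5, 2)},
]

def filter_jobs(jobs, status=None, since=None):
    """Keep jobs matching an optional status and minimum creation time, newest first."""
    out = jobs
    if status is not None:
        out = [j for j in out if j["status"] == status]
    if since is not None:
        out = [j for j in out if j["created_at"] >= since]
    return sorted(out, key=lambda j: j["created_at"], reverse=True)

recent = filter_jobs(jobs, since=datetime(2024, 5, 2))
```

Server-side filters are preferable for large job lists; a local pass like this is mainly useful for combining criteria the list endpoint does not support.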