Data engine
SDK workflows for uploading files, running ingestion, and creating data jobs for fine-tuning and semantic search.
The data engine transforms raw content into structured, AI-ready data. It manages the complete data lifecycle from file ingestion through preparation for training and retrieval workflows.
Fine-tuning data workflows are built around data jobs — managed operations that bundle file ingestion, Markdown review, prompt configuration, and alignment generation into a single tracked unit.
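As a rough mental model, a data job can be pictured as a small state machine that moves through the stages bundled into it. The state names below are illustrative only, not the SDK's actual status values:

```python
from enum import Enum

class JobState(Enum):
    # Illustrative states; the real SDK's status values may differ.
    CREATED = "created"
    INGESTING = "ingesting"        # file ingestion
    REVIEWING = "reviewing"        # Markdown review
    CONFIGURING = "configuring"    # prompt configuration
    GENERATING = "generating"      # alignment generation
    COMPLETED = "completed"
    CANCELLED = "cancelled"

# Allowed forward transitions, mirroring the stages tracked by one job.
TRANSITIONS = {
    JobState.CREATED: {JobState.INGESTING, JobState.CANCELLED},
    JobState.INGESTING: {JobState.REVIEWING, JobState.CANCELLED},
    JobState.REVIEWING: {JobState.CONFIGURING, JobState.CANCELLED},
    JobState.CONFIGURING: {JobState.GENERATING, JobState.CANCELLED},
    JobState.GENERATING: {JobState.COMPLETED, JobState.CANCELLED},
}

def advance(state: JobState, target: JobState) -> JobState:
    """Move a job to the next stage, rejecting invalid jumps."""
    if target not in TRANSITIONS.get(state, set()):
        raise ValueError(f"cannot move from {state.value} to {target.value}")
    return target
```

The point of the single tracked unit is exactly this: each stage has a defined place in the lifecycle, and cancellation is valid from any non-terminal stage.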
For a conceptual overview of the data engine and its capabilities, see Data engine.
Data engine workflows
Prepare and ingest files
Upload source documents (PDF, DOCX, PPT, Markdown) and convert them to Markdown via the ingestion API.
Monitor ingestion
Track ingestion progress through data job status, per-file records, and timeline events.
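A common client-side pattern for tracking progress is to poll the job's status with capped exponential backoff until it reaches a terminal state. This sketch stubs the status call with a plain callable; fetch_status is a hypothetical stand-in for whatever status lookup the SDK provides:

```python
import time
from itertools import count

# Terminal statuses are assumed here; the SDK's actual set may differ.
TERMINAL = {"completed", "failed", "cancelled"}

def poll_until_done(fetch_status, initial_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fetch_status() until it returns a terminal status, backing off between calls."""
    delay = initial_delay
    for _ in count():
        status = fetch_status()
        if status in TERMINAL:
            return status
        sleep(delay)
        delay = min(delay * 2, max_delay)  # cap the backoff

# Example with a stubbed status sequence standing in for the real API.
statuses = iter(["queued", "ingesting", "ingesting", "completed"])
result = poll_until_done(lambda: next(statuses), sleep=lambda _: None)
```

Injecting the sleep function keeps the loop testable; production code would keep the default time.sleep.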
Create and populate a vector database
Set up a vector database and run document ingestion to generate embeddings for semantic search and retrieval.
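Conceptually, semantic search over the vector database ranks document chunks by the similarity between their embeddings and the query embedding. A toy version with hand-made three-dimensional vectors (real embeddings come from the ingestion step and are much higher-dimensional):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search(query_vec, index, top_k=2):
    """Return the top_k (doc_id, score) pairs, highest similarity first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:top_k]

# Tiny hand-built index; a real one holds embeddings generated during ingestion.
index = {
    "pricing.md": [0.9, 0.1, 0.0],
    "setup.md": [0.1, 0.9, 0.1],
    "faq.md": [0.5, 0.5, 0.2],
}
hits = search([1.0, 0.0, 0.0], index)
```

Here the query vector points along the same axis as the pricing chunk's embedding, so that chunk ranks first.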
Create instruction fine-tuning data
Use a principle_files data job to generate a QA pair dataset for instruction fine-tuning.
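The output of such a job is a set of question-answer records. The field names below are illustrative rather than the SDK's exact schema, but many fine-tuning pipelines consume JSONL along these lines:

```python
import json

# Illustrative QA pairs; the fields a data job actually emits may differ.
qa_pairs = [
    {"question": "What formats can be ingested?",
     "answer": "PDF, DOCX, PPT, and Markdown."},
    {"question": "What does a data job bundle?",
     "answer": "Ingestion, Markdown review, prompt configuration, and alignment generation."},
]

# Serialize to JSONL (one JSON object per line) and read it back.
jsonl = "\n".join(json.dumps(pair) for pair in qa_pairs)
records = [json.loads(line) for line in jsonl.splitlines()]
```

One record per line keeps large datasets streamable: a trainer can read and shuffle pairs without loading the whole file as a single JSON document.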
Create context-grounded fine-tuning data
Use context_grounded_files or context_grounded_vector_db data jobs to generate training data grounded in an existing knowledge source.
Manage data jobs
List, filter, update metadata, and cancel data jobs.
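Once a page of jobs has been fetched, simple filtering can also happen client-side. This sketch assumes each job is a dict with id, status, and created_at fields, which is a hypothetical shape, not the SDK's actual response model:

```python
from datetime import datetime

# Hypothetical job records; the real SDK returns richer objects.
jobs = [
    {"id": "dj_1", "status": "completed", "created_at": datetime(2024, 5, 1)},
    {"id": "dj_2", "status": "running",   "created_at": datetime(2024, 5, 3)},
    {"id": "dj_3", "status": "cancelled", "created_at": datetime(2024, 5, 2)},
]

def filter_jobs(jobs, status=None, since=None):
    """Keep jobs matching an optional status and minimum creation time, newest first."""
    out = jobs
    if status is not None:
        out = [j for j in out if j["status"] == status]
    if since is not None:
        out = [j for j in out if j["created_at"] >= since]
    return sorted(out, key=lambda j: j["created_at"], reverse=True)

recent = filter_jobs(jobs, since=datetime(2024, 5, 2))
```

Server-side filters are preferable for large job lists; a local pass like this is mainly useful for combining criteria the list endpoint does not support.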