Create and populate a vector database

This guide covers setting up the vector database, processing your documents, creating embeddings, and monitoring ingestion jobs. A complete example is available following the step-by-step guide, along with guidance for managing your vector databases.

Step 1: Set up a vector database

Seekr’s Vector Database SDK transforms text into vector embeddings, enabling semantic search that matches on meaning and context rather than exact keywords. This retrieves documents more intuitively than traditional keyword-based methods. Start by creating a new vector database and specifying one of the supported embedding models.

Supported embedding models

Model Name	Dimensions	Best Used For
`intfloat/e5-mistral-7b-instruct`	4096	This model has some multilingual capability. However, since it was mainly trained on English data, we recommend using this model for English only.
`bedrock:amazon.titan-embed-text-v2:0`	256, 512, or 1024	Recommended Bedrock model. Best for RAG, document search, and reranking. Supports 100+ languages. Self-hosted AWS/EKS only — see Use AWS Bedrock for ingestion and inference.
`bedrock:amazon.titan-embed-text-v1`	1536	Legacy Bedrock model. Text retrieval and semantic similarity. Supports 25+ languages. Self-hosted AWS/EKS only — see Use AWS Bedrock for ingestion and inference.
`bedrock:amazon.titan-embed-g1-text-02`	1536	Legacy Bedrock G1 model. Text retrieval and semantic similarity. Supports 25+ languages. Self-hosted AWS/EKS only — see Use AWS Bedrock for ingestion and inference.
`bedrock:amazon.titan-embed-image-v1`	256, 384, or 1024	Multimodal (text + image) embeddings. Self-hosted AWS/EKS only — see Use AWS Bedrock for ingestion and inference.
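The table above can be captured as a small lookup so that a model name and an output dimension are checked before a database is created. This is only a sketch: the dictionary copies the table, while `SUPPORTED_DIMENSIONS` and `check_model` are hypothetical helper names, not part of the Seekr SDK.

```python
from typing import Optional, Tuple

# Output dimensions per embedding model, copied from the table above.
SUPPORTED_DIMENSIONS = {
    "intfloat/e5-mistral-7b-instruct": (4096,),
    "bedrock:amazon.titan-embed-text-v2:0": (256, 512, 1024),
    "bedrock:amazon.titan-embed-text-v1": (1536,),
    "bedrock:amazon.titan-embed-g1-text-02": (1536,),
    "bedrock:amazon.titan-embed-image-v1": (256, 384, 1024),
}


def check_model(model: str, dimensions: Optional[int] = None) -> Tuple[int, ...]:
    """Return the dimensions listed for `model`.

    Raises ValueError if the model is not in the table, or if a requested
    dimension is not one of the sizes listed for that model.
    """
    if model not in SUPPORTED_DIMENSIONS:
        raise ValueError(f"Unknown embedding model: {model}")
    allowed = SUPPORTED_DIMENSIONS[model]
    if dimensions is not None and dimensions not in allowed:
        raise ValueError(f"{model} supports {allowed}, not {dimensions}")
    return allowed
```

Calling `check_model("intfloat/e5-mistral-7b-instruct")` before creating the database catches a mistyped model name locally rather than in the API response.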

Create an empty vector database

Create the database by passing the embedding model, a name, and a description.

from seekrai import SeekrFlow

# Initialize client

client = SeekrFlow(api_key="YOUR KEY HERE") # you can leave this empty if your key is stored as an environment variable

# Create vector database

vector_db = client.vector_database.create(
    model="intfloat/e5-mistral-7b-instruct",
    name="QuickStart_DB",
    description="Quick start example database"
)

database_id = vector_db.id
print(f"Created database: {vector_db.name} (ID: {database_id})")

Sample response:

Created database: QuickStart_DB (ID: b7123456789-09876-4567)

Step 2: Upload files

Supported file types

PDF (.pdf)
Word documents (.docx)
Markdown (.md)

File size guidelines

Upload multiple files, up to 4GB each.

Markdown formatting guidelines

Before uploading, check that your Markdown files are properly formatted to avoid rejection:

All files must have correctly ordered headers (# followed by ##, and so on) with titles and meaningful content. For example:

# Customer service plan

Some content, separated from the header by a line.

## Unaccompanied minor service

Some more content

### Etc.

Avoid headers with more than six # characters (e.g., ####### Pointlessly small md header).
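The header guidelines above can be checked locally before uploading. The following is a minimal sketch, assuming that "correctly ordered" means a header never skips a level on the way down (for example, # directly followed by ###) and that every header needs a title. `check_markdown_headers` is a hypothetical helper, and the service's own validation may apply different or additional rules.

```python
import re
from typing import List

# One or more '#' characters, then whitespace and an optional title.
HEADER_RE = re.compile(r"^(#+)(?:[ \t]+(.*))?$")


def check_markdown_headers(text: str) -> List[str]:
    """Return a list of problems found in the header structure of `text`."""
    problems = []
    previous_level = 0
    in_code_fence = False
    for line_number, line in enumerate(text.splitlines(), start=1):
        # Skip fenced code, where '#' usually starts a comment.
        if line.lstrip().startswith("```"):
            in_code_fence = not in_code_fence
            continue
        if in_code_fence:
            continue
        match = HEADER_RE.match(line)
        if not match:
            continue
        level = len(match.group(1))
        title = (match.group(2) or "").strip()
        if level > 6:
            problems.append(f"line {line_number}: more than 6 '#' characters")
            continue
        if not title:
            problems.append(f"line {line_number}: header has no title")
        if level > previous_level + 1:
            problems.append(
                f"line {line_number}: level {level} header follows level {previous_level}"
            )
        previous_level = level
    return problems
```

An empty list means the file passed these particular checks; it does not guarantee that the file will be accepted.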

Upload a file for ingestion

Next, upload your files for processing.

If you already have file_ids from a separate ingestion job, you can skip this step and use the same file_ids.

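# Upload a file
# Set file_path for your operating system and delete the assignment you don't need;
# if both are kept, the second one overrides the first.

# Windows (use a raw string to avoid backslash issues):
file_path = r"C:\Users\username\Downloads\document.pdf" # Replace with your file path

# Mac/Linux:
file_path = "/Users/username/Downloads/document.pdf" # Replace with your file path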

upload_response = client.files.upload(file_path, purpose="alignment")
file_id = upload_response.id
print(f"Uploaded file with ID: {file_id}")

Once uploaded, each file is given a unique file_id to use for ingestion.
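Because unsupported or oversized files are a common cause of failed uploads, it can help to check each path locally first. The sketch below applies the supported file types and the 4GB guideline from this step. `check_before_upload` is a hypothetical helper, and reading "4GB" as 4 GiB is an assumption.

```python
from pathlib import Path

# File types listed under "Supported file types".
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".md"}
# Assumption: the 4GB guideline is interpreted as 4 GiB.
MAX_FILE_BYTES = 4 * 1024**3


def check_before_upload(file_path: str) -> Path:
    """Return the path if it looks uploadable, otherwise raise an error."""
    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {path.suffix or '(none)'}")
    if path.stat().st_size > MAX_FILE_BYTES:
        raise ValueError(f"{path.name} is larger than 4GB")
    return path
```

Running this before `client.files.upload(...)` only checks the extension and size; it does not confirm that the file contents match the extension.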

Upload a batch of files for ingestion (optional)

The ingestion endpoint accepts an array of file_ids as input, so you can upload several files in one call and ingest them together in a single job.

from seekrai import SeekrFlow
client = SeekrFlow()

bulk_resp = client.files.bulk_upload(
["documentation.pdf", "policies.md", "guidelines.docx"], purpose="alignment"
)
print("Upload complete")

# Access and print the ID of each uploaded file

print("\nFile IDs:")
for resp in bulk_resp:
    print(resp.id)

Sample response:

Uploading file documentation.pdf: 100%|██████████████████████████████████████████████████████████████████████| 7.16M/7.16M [00:08<00:00, 857kB/s]
Uploading file policies.md: 100%|█████████████████████████████████████████████████████████████████| 1.26M/1.26M [00:08<00:00, 151kB/s]
Uploading file guidelines.docx: 100%|███████████████████████████████████████████████████████████| 21.9k/21.9k [00:08<00:00, 2.62kB/s]
Upload complete

File IDs:
file-457989bc-2cf5-11f0-8b3b-56f95a5e9ef4
file-45f909da-2cf5-11f0-8b3b-56f95a5e9ef4
file-46226d20-2cf5-11f0-8b3b-56f95a5e9ef4

Step 3: Start a vector database ingestion job

Next, create a job to ingest documents into your vector database. This step converts the files and creates embeddings from them. The token_count parameter specifies the target size of each chunk, so that each chunk is neither too large (risking truncation by model limits) nor too small (losing semantic coherence).

Best practices:

Common ranges: For embedding and retrieval, 200–500 tokens per chunk is a widely used range, balancing context and efficiency. The example here uses a token count of 512.
Adjust for document type: If your documents are dense or have complex structure (e.g., legal, technical), consider slightly larger chunks; for conversational or highly variable content, smaller chunks may work better.

The overlap_tokens parameter creates overlapping regions between adjacent chunks at chunk boundaries, reducing the risk of missing relevant information that spans two chunks. Adjust chunking parameters based on document characteristics:

Document Type	Recommended `token_count`	Recommended `overlap_tokens`
Technical documentation	384-512	50-75
Legal documents	512-768	75-100
Conversational content	256-384	25-50
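To see how token_count and overlap_tokens interact, the sketch below applies a plain sliding window to a list of tokens. It is an illustration only: `sliding_window_chunks` is a hypothetical function, and the service's chunkers also take document structure into account, so real chunk boundaries will differ.

```python
from typing import List


def sliding_window_chunks(
    tokens: List[str], token_count: int, overlap_tokens: int
) -> List[List[str]]:
    """Split `tokens` into windows of `token_count` that share `overlap_tokens`."""
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    if not 0 <= overlap_tokens < token_count:
        raise ValueError("overlap_tokens must be >= 0 and smaller than token_count")
    step = token_count - overlap_tokens  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + token_count])
        if start + token_count >= len(tokens):
            break  # the last window already reaches the end
    return chunks
```

With ten tokens, token_count=4, and overlap_tokens=1, each window starts three tokens after the previous one, so the last token of one chunk is repeated as the first token of the next.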

# Create ingestion job
ingestion_job = client.vector_database.create_ingestion_job(
    database_id=database_id,
    files=[file_id],
    method="accuracy-optimized",
    chunking_method="markdown",
    token_count=512,
    overlap_tokens=50
)

job_id = ingestion_job.id
print(f"Created ingestion job: {job_id}")

Sample response:

Created ingestion job: ij-d80bd45a-4bb5-4bac-bbf3-7e3345409bc8

Attach metadata at ingestion

To attach user-defined metadata to the chunks created by an ingestion job, include an optional metadata object in the request. The metadata is job-level: it is copied onto every chunk produced from every file in the job. You can later filter or edit it with the chunk metadata methods (see Manage chunk metadata).

ingestion_job = client.vector_database.create_ingestion_job(
    database_id=database_id,
    files=[file_id],
    method="accuracy-optimized",
    chunking_method="markdown",
    token_count=512,
    overlap_tokens=50,
    metadata={
        "year": 2024,
        "doc_type": "annual_report",
        "department": "finance",
        "is_confidential": True,
    },
)

The metadata object must follow a few constraints (flat object, typed values, 20 keys maximum); see Metadata rules for the full list. To set different metadata on different chunks within one job, use the per-chunk metadata blocks described under Add per-chunk metadata.
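The constraints named above (flat object, typed values, 20 keys maximum) can be checked before a job is submitted. This sketch is hedged: `metadata_problems` is a hypothetical helper, and the set of allowed value types (string, integer, float, boolean) is an assumption based on the example request. The Metadata rules page remains the authoritative list.

```python
from typing import Any, Dict, List

MAX_METADATA_KEYS = 20
# Assumption: scalar types matching the example request in this guide.
ALLOWED_VALUE_TYPES = (str, int, float, bool)


def metadata_problems(metadata: Dict[str, Any]) -> List[str]:
    """Return a list of rule violations found in a job-level metadata object."""
    problems = []
    if len(metadata) > MAX_METADATA_KEYS:
        problems.append(f"{len(metadata)} keys, maximum is {MAX_METADATA_KEYS}")
    for key, value in metadata.items():
        if not isinstance(key, str):
            problems.append(f"{key!r}: keys must be strings")
        if isinstance(value, dict):
            problems.append(f"{key!r}: nested objects are not allowed in a flat object")
        elif not isinstance(value, ALLOWED_VALUE_TYPES):
            problems.append(f"{key!r}: unsupported value type {type(value).__name__}")
    return problems
```

An empty result for the example metadata above means it satisfies these three checks.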

Ingestion mode and the UI

When ingesting files through the SeekrFlow UI, speed-optimized mode is always used. The SDK lets you choose between speed-optimized and accuracy-optimized.

Choose an ingestion method

Accuracy-optimized (default)

When you use method="accuracy-optimized" or omit the method parameter, the system prioritizes accuracy. Depending on what data is available in your PDF document (bookmarks, tables, text layers), the system combines multiple extraction techniques for best results.

Key features:

Uses both OCR and direct text extraction, then blends them together
Employs LLM agents to correct and enhance document hierarchy
Applies advanced table detection algorithms for accurate table formatting

Documents over 100 pages can take up to 30 minutes to process.

Speed-optimized

When you use method="speed-optimized", the system balances quality with processing speed. It automatically selects faster methods based on document size while maintaining reasonable accuracy for smaller documents.

Key features:

Small documents still use high-accuracy methods
Larger documents use speed-optimized algorithms to meet time constraints

Optimized to complete in approximately 3 minutes regardless of document size.

Once ingestion is complete, you’ll receive a Markdown file that you can use for fine-tuning.

Choose a chunking method

After selecting your files, configure how your content will be segmented:

Markdown chunking (default)

Intelligent structure-aware chunking that automatically detects logical content breaks:

Respects document structure and header hierarchies
Keeps related content together (headers with their content)
Preserves tables with their headers
Groups small sections to optimize chunk sizes

Semantic chunking

Call with chunking_method="semantic" to enable meaning-aware segmentation, which relies on LLM similarity scoring to decide where chunks begin and end:

Searches for topic shifts instead of raw heading boundaries to keep tightly related sentences together
Automatically merges short paragraphs or bullets when they express the same idea
Honors document structure when it provides strong signals but can span across headings if the semantics match
Applies Chunk Size and Chunk Overlap as safety caps, splitting only when the semantic chunk would exceed those limits

When to use it:

Long narrative content (wikis, blogs, requirements) where sections do not follow strict Markdown hierarchy
Mixed-format documents where context spans multiple small headers or callouts

To enable Semantic Chunking, select it from the chunking strategy dropdown when configuring your vector store. No additional markup is required; tune Chunk Size and Chunk Overlap to reflect the granularity you need.

Manual window chunking

User-controlled document segmentation using custom break markers:

Insert ---DOCUMENT_BREAK--- markers to define exactly where chunks should be separated
Chunk Size (token count) - Maximum tokens per chunk (used as fallback when content exceeds limit)
Chunk Overlap (token count) - Overlap between chunks when sliding window is applied

How to use manual document breaks

Use ---DOCUMENT_BREAK--- to specify where chunks should be separated. Document break markers only work in Manual Window Chunking mode. With Markdown Chunking (the default), the markers are ignored because the system uses document structure for segmentation instead.

To use custom segmentation in your documents:

Select Manual Window Chunking when configuring your vector store

Insert break markers in your Markdown files where you want to force chunk boundaries:

# Section 1

This content will be in one chunk...

---DOCUMENT_BREAK---

# Section 2

This content will be in a separate chunk...

---DOCUMENT_BREAK---

# Section 3

This starts another chunk...

Configure chunk settings:

Set your desired maximum chunk size (e.g., 1000 tokens)
Set overlap (e.g., 100 tokens)

If the content between two document break markers exceeds the maximum chunk size, the sliding window chunker will automatically split it into multiple chunks.

This ensures precise control over how your data is broken down before embedding.
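The behavior described above (split on the break marker, then fall back to a sliding window for oversized sections) can be sketched as follows. This is not the service's implementation: `manual_window_chunks` is a hypothetical function, whitespace-separated words stand in for model tokens, and line breaks inside a chunk are collapsed for simplicity.

```python
from typing import List

BREAK_MARKER = "---DOCUMENT_BREAK---"


def manual_window_chunks(text: str, max_tokens: int, overlap_tokens: int) -> List[str]:
    """Split on break markers; window any section longer than `max_tokens`."""
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be >= 0 and smaller than max_tokens")
    step = max_tokens - overlap_tokens
    chunks = []
    for section in text.split(BREAK_MARKER):
        words = section.split()  # stand-in for real tokenization
        if not words:
            continue  # ignore empty sections between markers
        if len(words) <= max_tokens:
            chunks.append(" ".join(words))
            continue
        # Fallback: slide a window over the oversized section.
        for start in range(0, len(words), step):
            chunks.append(" ".join(words[start:start + max_tokens]))
            if start + max_tokens >= len(words):
                break
    return chunks
```

A section that fits within the limit always stays in one chunk, so the markers alone decide its boundaries; only sections over the limit are split further.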

Add per-chunk metadata

With manual window chunking, you can attach different metadata to different chunks by embedding a metadata block inside a section. Place the block between ---CHUNK_META_START--- and ---CHUNK_META_END---, with a single line of JSON in between. The block is scoped to the section it appears in and is removed from the text before indexing, so it never becomes part of the chunk content.

# Introduction

---CHUNK_META_START---
{"author": "Jane Smith", "source": "chapter_1", "classification": "public", "year": 2024}
---CHUNK_META_END---

This is the first chunk of the document.

---DOCUMENT_BREAK---

# Conclusion

This chunk has no metadata block and falls back to the job-level metadata from the ingestion request.

Per-chunk metadata follows the same rules as job-level metadata and behaves as follows:

It is supported only with manual window chunking. Including these markers with any other chunking method is an error.
Only one metadata block is allowed per section. A second block in the same section is an error.
A section’s block replaces the job-level metadata for that chunk. The two are not merged.
A section with no block inherits the job-level metadata from the ingestion request, or no metadata if the request supplied none.
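The rules above can be illustrated with a small parser that separates each section's text from its metadata block. This is a sketch of the documented behavior rather than the service's code; `split_with_metadata` is a hypothetical function name.

```python
import json
from typing import Any, Dict, List, Optional, Tuple

BREAK_MARKER = "---DOCUMENT_BREAK---"
META_START = "---CHUNK_META_START---"
META_END = "---CHUNK_META_END---"


def split_with_metadata(
    text: str, job_metadata: Optional[Dict[str, Any]] = None
) -> List[Tuple[str, Optional[Dict[str, Any]]]]:
    """Return (content, metadata) pairs, one per non-empty section."""
    results = []
    for section in text.split(BREAK_MARKER):
        if section.count(META_START) > 1:
            raise ValueError("Only one metadata block is allowed per section")
        metadata = job_metadata  # inherited unless the section has its own block
        if META_START in section:
            before, rest = section.split(META_START, 1)
            if META_END not in rest:
                raise ValueError("Metadata block is missing its end marker")
            raw_json, after = rest.split(META_END, 1)
            # The block replaces the job-level metadata; the two are not merged.
            metadata = json.loads(raw_json.strip())
            section = before + after  # strip the block from the chunk text
        content = section.strip()
        if content:
            results.append((content, metadata))
    return results
```

Applied to the example document above with job-level metadata supplied, the first section carries only the values from its own block and the second section carries the job-level values.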

Embedding model configuration

Supported embedding models

The system uses the following embedding models for vector generation:

Model	Max Token Length	Estimated Max Words
E5-Mistral-7B-Instruct	4096	~3,040
Titan Text Embeddings V2 (Bedrock)	8192	~6,080
Titan Text Embeddings V1 (Bedrock)	8192	~6,080
Titan Text Embeddings G1 (Bedrock)	8192	~6,080
Titan Multimodal Embeddings (Bedrock)	128	~96

Bedrock embedding models are available for self-hosted AWS/EKS deployments only. See Use AWS Bedrock for ingestion and inference for setup instructions.

Each model has a maximum sequence length as shown in the table above
Using inputs longer than the model’s max token length is not recommended
Check model specifications for language support and optimal use cases
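A quick way to apply the limits above is to compare a planned token_count with the model's maximum before starting a job. The lookup below copies the table; `MAX_TOKEN_LENGTH` and `fits_model` are hypothetical names used only for this sketch.

```python
# Maximum token length per embedding model, copied from the table above.
MAX_TOKEN_LENGTH = {
    "E5-Mistral-7B-Instruct": 4096,
    "Titan Text Embeddings V2 (Bedrock)": 8192,
    "Titan Text Embeddings V1 (Bedrock)": 8192,
    "Titan Text Embeddings G1 (Bedrock)": 8192,
    "Titan Multimodal Embeddings (Bedrock)": 128,
}


def fits_model(model: str, token_count: int) -> bool:
    """Return True if a chunk of `token_count` tokens fits within the model's limit."""
    if model not in MAX_TOKEN_LENGTH:
        raise ValueError(f"Unknown embedding model: {model}")
    return token_count <= MAX_TOKEN_LENGTH[model]
```

For example, the token_count of 512 used in this guide fits E5-Mistral-7B-Instruct but exceeds the 128-token limit listed for Titan Multimodal Embeddings.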

Step 4: Monitor ingestion status (optional)

After starting an ingestion job, you can track job progress, view per-file statuses, and diagnose any failures. See Monitor ingestion for details on checking job states, interpreting file_records, and resolving errors. Once status shows completed, your vector database is ready to query.
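A polling loop for this step is easier to reuse and to stop safely when it has a timeout. The sketch below takes any zero-argument callable that returns the current status string. `wait_for_ingestion` and its parameters are hypothetical, and the 30-minute default is an assumption based on the processing time quoted for large documents.

```python
import time
from typing import Callable


def wait_for_ingestion(
    fetch_status: Callable[[], str],
    interval: float = 5.0,
    timeout: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job is "completed" or "failed", or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Ingestion still '{status}' after {timeout} seconds")
        sleep(interval)
```

With the client from this guide, `fetch_status` could be `lambda: client.vector_database.retrieve_ingestion_job(database_id, job_id).status`, which is the call used in the complete example.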

Source tracing

When a file is ingested into a vector database, the pipeline captures provenance metadata for every chunk — no additional configuration required. Each chunk stores:

line_number_start / line_number_end — line range within the ingested Markdown
char_start / char_end — character offsets within the section
hierarchy — heading path from the document root to the chunk (e.g. ["Chapter 3", "3.2 Pricing", "Cancellation"])
page_number — source document page number, 1-indexed (null for native Markdown or JSON uploads)

Provenance metadata is captured for PDF, DOCX, PPTX, Markdown, and JSON files. To retrieve and use this metadata after a run, see Source tracing.

Line numbers refer to lines in the ingested Markdown, not the original file. Use page_number to navigate back to a source PDF.
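The provenance fields listed above are enough to build a human-readable citation for a retrieved chunk. The sketch below assumes the chunk is available as a plain dictionary keyed by those field names; the actual shape of the retrieval response is covered in Source tracing, and `format_source` is a hypothetical helper.

```python
from typing import Any, Dict


def format_source(chunk: Dict[str, Any]) -> str:
    """Build a short citation string from a chunk's provenance fields."""
    parts = []
    hierarchy = chunk.get("hierarchy") or []
    if hierarchy:
        parts.append(" > ".join(hierarchy))
    page = chunk.get("page_number")
    if page is not None:  # null for native Markdown or JSON uploads
        parts.append(f"page {page}")
    start = chunk.get("line_number_start")
    end = chunk.get("line_number_end")
    if start is not None and end is not None:
        parts.append(f"Markdown lines {start}-{end}")
    return ", ".join(parts)
```

For a chunk from a PDF, the page number points back to the source document, while the line range refers to the ingested Markdown.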

Complete example

This example demonstrates the entire workflow for creating a vector database, adding files, and kicking off an ingestion job:

from seekrai import SeekrFlow
import time

client = SeekrFlow()

# Step 1: Create vector database
print("Creating vector database...")
db_name = f"QuickStart_DB_{int(time.time())}"
vector_db = client.vector_database.create(
    model="intfloat/e5-mistral-7b-instruct",
    name=db_name,
    description="Quick start example database"
)
database_id = vector_db.id
print(f"Created database: {vector_db.name} (ID: {database_id})")

# Step 2: Upload file
print("Uploading file...")
file_path = "document.pdf"  # Replace with your file path
upload_response = client.files.upload(file_path, purpose="alignment")
file_id = upload_response.id
print(f"Uploaded file with ID: {file_id}")

# Step 3: Begin vector database ingestion
print("Creating ingestion job...")
ingestion_job = client.vector_database.create_ingestion_job(
    database_id=database_id,
    files=[file_id],
    method="accuracy-optimized",
    token_count=512,
    overlap_tokens=50
)
job_id = ingestion_job.id
print(f"Created ingestion job with ID: {job_id}")

# Step 4: Monitor ingestion status
# For per-file tracking and error diagnostics, see Monitor ingestion.
print("Waiting for ingestion job to complete...")
interval = 5    # Check every 5 seconds

while True:
    job_status = client.vector_database.retrieve_ingestion_job(database_id, job_id)
    status = job_status.status
    print(f"Ingestion job status: {status}")

    if status == "completed":
        print(f"Vector database ready with ID: {database_id}")
        break
    elif status == "failed":
        error = getattr(job_status, "error_message", "Unknown error")
        print(f"Ingestion job failed: {error}")
        break

    time.sleep(interval)

print("Setup complete!")

Manage your vector databases

List all vector databases

databases = client.vector_database.list()

for db in databases.data:
    print(f"ID: {db.id}, Name: {db.name}")

Get a specific vector database

# Get vector database details
db_details = client.vector_database.retrieve(database_id)

print(f"Name: {db_details.name}")
print(f"Last updated: {db_details.updated_at}")

Delete a vector database

client.vector_database.delete(database_id)
print(f"Successfully deleted database {database_id}")

List all files in a vector database

# List files in vector database
db_files = client.vector_database.list_files(database_id)

for file in db_files.data:
    print(f"ID: {file.id}, Filename: {file.filename}")

Delete a file from a vector database

# Delete a file from vector database
client.vector_database.delete_file(database_id, file_id)
print(f"Successfully deleted file {file_id} from {database_id}")

Troubleshoot common issues

For ingestion-specific error codes with plain-language messages and suggested fixes, see Monitor ingestion.

Document processing issues

Issue	Possible cause	Solution
Files fail to upload	File exceeds size limit	Split large files or compress them
	Invalid file format	Ensure file extension matches actual format
	Network timeout	Implement retry logic with exponential backoff
Markdown parsing errors	Improper header hierarchy	Fix header structure (ensure proper nesting)
	Unsupported Markdown syntax	Use standard Markdown formatting
PDF extraction issues	Protected PDF	Remove password protection before uploading
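The retry suggestion in the table above can be sketched with a generic wrapper. This is an illustration under stated assumptions: `retry_with_backoff` is a hypothetical helper, and it catches every exception because the error type the SDK raises on a network timeout is not stated here. In practice, narrow the `except` clause to the errors you want to retry.

```python
import time
from typing import Any, Callable


def retry_with_backoff(
    operation: Callable[[], Any],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call `operation`, doubling the wait after each failed attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return operation()
        except Exception:
            if attempt >= max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
            attempt += 1
```

An upload could then be wrapped as `retry_with_backoff(lambda: client.files.upload(file_path, purpose="alignment"))`.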

File ingestion issues

Issue	Possible cause	Solution
Slow ingestion	Complex document structure	Adjust chunking parameters
	Resource constraints	Monitor system resources during ingestion
	Large batch size	Break into smaller batches
Failed ingestion job	Malformed content	Check files for compatibility issues
	Service timeout	Increase timeout settings
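When a large batch slows ingestion, the list of file IDs can be split into smaller groups and submitted as separate jobs. The helper below is a minimal sketch; `batches` is a hypothetical name, and a suitable batch size depends on your documents.

```python
from typing import Iterator, List, Sequence


def batches(file_ids: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `batch_size` file IDs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(file_ids), batch_size):
        yield list(file_ids[start:start + batch_size])
```

Each group can then be passed as the `files` argument of `client.vector_database.create_ingestion_job(...)`.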
