Create and Populate a Vector Database

Set up a vector database and ingest documents to generate embeddings for semantic search and retrieval.

This guide covers setting up the vector database, processing your documents, creating embeddings, and monitoring ingestion jobs. A complete example is available following the step-by-step guide, along with guidance for managing your vector databases.

Step 1: Set up a vector database

Seekr's Vector Database SDK provides advanced semantic search capabilities by transforming text into vector embeddings, making it possible to perform semantic searches that focus on meaning and context. This approach provides a smarter and more intuitive way to retrieve documents compared to traditional keyword-based methods.

Start by creating a new vector database, specifying the embedding model:

Supported embedding models

| Model Name | Dimensions | Best Used For |
|---|---|---|
| intfloat/e5-mistral-7b-instruct | 4096 | This model has some multilingual capability. However, since it was mainly trained on English data, we recommend using it for English only. |

Create an empty vector database

Create a new, empty vector database with the embedding model specified above:

from seekrai import SeekrFlow

# Initialize client
client = SeekrFlow(api_key="YOUR KEY HERE")  # You can omit api_key if your key is stored as an environment variable

# Create vector database
vector_db = client.vector_database.create(
    model="intfloat/e5-mistral-7b-instruct",
    name="QuickStart_DB",
    description="Quick start example database"
)

database_id = vector_db.id
print(f"Created database: {vector_db.name} (ID: {database_id})")

Sample response:

Created database: QuickStart_DB (ID: b7123456789-09876-4567)

Step 2: Upload files

Supported file types

  • PDF (.pdf)
  • Word documents (.docx)
  • Markdown (.md)

File size guidelines

Upload multiple files, up to 4GB each.
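
If you want to catch oversized files before uploading, a quick local check is enough. A minimal sketch, assuming the 4GB per-file guideline above (the paths are placeholders):

import os

MAX_BYTES = 4 * 1024**3  # 4GB per-file guideline from above

for path in ["document.pdf", "policies.md"]:  # placeholder paths
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        print(f"{path} is {size / 1024**3:.1f} GB and exceeds the 4GB limit")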

Markdown formatting guidelines

Before uploading, check that your Markdown files are properly formatted to avoid rejection:

  • All files must have correctly ordered headers (# followed by ##, and so on) with titles and meaningful content. For example:
# Customer service plan

Some content, separated from the header by a line. 

## Unaccompanied minor service

Some more content

### Etc.
  • Avoid using headers with more than 6 hashtags (e.g., ####### Pointlessly small md header)

Find some example Markdown files here.
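
To catch ordering problems before upload, you can scan header levels locally. A minimal sketch, assuming standard # headers (the filename is a placeholder, and fenced code blocks inside the file are not handled):

# Check that header levels never jump by more than one
# (e.g., a ### may not directly follow a #) and never exceed 6 hashtags.
def check_headers(markdown_text):
    previous_level = 0
    for line in markdown_text.splitlines():
        if line.startswith("#"):
            level = len(line) - len(line.lstrip("#"))
            if level > 6:
                print(f"Header too deep: {line}")
            elif level > previous_level + 1:
                print(f"Header level jumps from {previous_level} to {level}: {line}")
            previous_level = level

with open("customer_service_plan.md") as f:  # placeholder filename
    check_headers(f.read())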


Upload a file for ingestion

Next, upload your files for processing.

Note: If you already have file_ids from a separate ingestion job, you can skip this step and use the same file_ids.

# Upload a file 

# Windows (use raw string to avoid backslash issues):
file_path = r"C:\Users\username\Downloads\document.pdf" # Replace with your file path

# Mac/Linux
file_path = "/Users/username/Downloads/document.pdf" # Replace with your file path

upload_response = client.files.upload(file_path, purpose="alignment")
file_id = upload_response.id
print(f"Uploaded file with ID: {file_id}")

Once uploaded, each file is given a unique file_id to use for ingestion.

Upload a batch of files for ingestion (optional)

The bulk upload method accepts an array of file paths and returns a file ID for each uploaded file.

from seekrai import SeekrFlow
client = SeekrFlow()

bulk_resp = client.files.bulk_upload(
    ["documentation.pdf", "policies.md", "guidelines.docx"], purpose="alignment"
)
print("Upload complete")

# Access and print the ID of each uploaded file
print("\nFile IDs:")
for resp in bulk_resp:
    print(resp.id)

Sample response:

Uploading file documentation.pdf: 100%|██████████████████████████████████████████████████████████████████████| 7.16M/7.16M [00:08<00:00, 857kB/s]
Uploading file policies.md: 100%|█████████████████████████████████████████████████████████████████| 1.26M/1.26M [00:08<00:00, 151kB/s]
Uploading file guidelines.docx: 100%|███████████████████████████████████████████████████████████| 21.9k/21.9k [00:08<00:00, 2.62kB/s]
Upload complete

File IDs:
file-457989bc-2cf5-11f0-8b3b-56f95a5e9ef4
file-45f909da-2cf5-11f0-8b3b-56f95a5e9ef4
file-46226d20-2cf5-11f0-8b3b-56f95a5e9ef4
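
For the ingestion step you typically want those IDs in a list. A one-liner building on the bulk_upload response above:

# Collect the ID of each uploaded file for use in the ingestion job
file_ids = [resp.id for resp in bulk_resp]
print(f"Collected {len(file_ids)} file IDs")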

Step 3: Initiate vector database ingestion

Next, create a job to ingest documents into your vector database. This step converts the files and creates embeddings from them.

The token_count parameter specifies the target size of each chunk, ensuring each chunk is neither too large (risking truncation by model limits) nor too small (losing semantic coherence).

Best practices:

  1. Common ranges: For embedding and retrieval, 200–500 tokens per chunk is a widely used range, balancing context and efficiency. The example here uses a token count of 512.
  2. Adjust for document type: If your documents are dense or have complex structure (e.g., legal, technical), consider slightly larger chunks; for conversational or highly variable content, smaller chunks may work better.

The overlap_tokens parameter creates overlapping regions between adjacent chunks at chunk boundaries, reducing the risk of missing relevant information that spans two chunks.

Adjust chunking parameters based on document characteristics:

| Document Type | Recommended token_count | Recommended overlap_tokens |
|---|---|---|
| Technical documentation | 384-512 | 50-75 |
| Legal documents | 512-768 | 75-100 |
| Conversational content | 256-384 | 25-50 |
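
If you ingest mixed content, it can help to centralize these choices. A purely illustrative helper that maps a document type to parameters from the table above (the preset values are midpoints of the recommended ranges, not SDK defaults):

# Illustrative presets derived from the table above; adjust to your corpus.
CHUNKING_PRESETS = {
    "technical": {"token_count": 448, "overlap_tokens": 64},
    "legal": {"token_count": 640, "overlap_tokens": 88},
    "conversational": {"token_count": 320, "overlap_tokens": 40},
}

def chunking_params(doc_type):
    # Fall back to the general-purpose 512/50 used in this guide
    return CHUNKING_PRESETS.get(doc_type, {"token_count": 512, "overlap_tokens": 50})

The example below uses the general-purpose values (512 tokens per chunk with an overlap of 50):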

from seekrai import SeekrFlow

client = SeekrFlow(api_key="your api key")

# Create an ingestion job on one or more uploaded file IDs.
# The conversion method is optional: "accuracy-optimized" (default) or "speed-optimized".
ingestion_job = client.vector_database.create_ingestion_job(
    database_id=database_id,
    files=[file_id],  # pass a list of file IDs to ingest several files at once
    method="accuracy-optimized",
    chunking_method="markdown",
    token_count=512,
    overlap_tokens=50
)

job_id = ingestion_job.id
print(f"Created ingestion job: {job_id}")

Sample response:

Created ingestion job: ij-d80bd45a-4bb5-4bac-bbf3-7e3345409bc8

The Method Parameter

Accuracy-Optimized (Default)

When you use method="accuracy-optimized" or omit the method parameter, the system prioritizes accuracy. Depending on what data is available in your PDF document (bookmarks, tables, text layers), the system combines multiple extraction techniques for best results.

Key features:

  • Uses both OCR and direct text extraction, then blends them together
  • Employs LLM agents to correct and enhance document hierarchy
  • Applies advanced table detection algorithms for accurate table formatting

Processing note: Documents over 100 pages can take up to 30 minutes to process.

Speed-Optimized

When you use method="speed-optimized", the system balances quality with processing speed. It automatically selects faster methods based on document size while maintaining reasonable accuracy for smaller documents.

Key features:

  • Small documents still use high-accuracy methods
  • Larger documents use speed-optimized algorithms to meet time constraints

Processing note: Optimized to complete in approximately 3 minutes regardless of document size.

Once ingestion is complete, you'll receive a Markdown file that you can use for fine-tuning.
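
When scripting ingestion, you may want your polling timeout to match the chosen method. A small sketch based on the processing notes above (the timeout values are illustrative, not SDK limits):

# Pick a method and a matching polling timeout, based on the processing notes above.
def choose_method(need_fast_turnaround):
    if need_fast_turnaround:
        return "speed-optimized", 5 * 60     # ~3-minute target, poll up to 5 minutes
    return "accuracy-optimized", 35 * 60     # large PDFs can take up to ~30 minutes

method, timeout = choose_method(need_fast_turnaround=False)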

The Chunking Method Parameter

After selecting your files, configure how your content will be segmented:

Markdown Chunking (Default)

Intelligent structure-aware chunking that automatically detects logical content breaks:

  • Respects document structure and header hierarchies
  • Keeps related content together (headers with their content)
  • Preserves tables with their headers
  • Groups small sections to optimize chunk sizes

Manual Window Chunking

User-controlled document segmentation using custom break markers:

  • Insert ---DOCUMENT_BREAK--- markers to define exactly where chunks should be separated
  • Chunk Size (token count) - Maximum tokens per chunk (used as fallback when content exceeds limit)
  • Chunk Overlap (token count) - Overlap between chunks when sliding window is applied

How to Use Manual Document Breaks

Use ---DOCUMENT_BREAK--- to specify where chunks should be separated. Document break markers only work in Manual Window Chunking mode; with Markdown Chunking (the default), these markers are ignored because the system segments by document structure instead.

To use custom segmentation in your documents:

  1. Select Manual Window Chunking when configuring your vector store
  2. Insert break markers in your Markdown files where you want to force chunk boundaries:
# Section 1
This content will be in one chunk...

---DOCUMENT_BREAK---

# Section 2
This content will be in a separate chunk...

---DOCUMENT_BREAK---

# Section 3
This starts another chunk...
  3. Configure chunk settings:
    • Set your desired maximum chunk size (e.g., 1000 tokens)
    • Set overlap (e.g., 100 tokens)

⚠️ Warning: If the content between two document break markers exceeds the maximum chunk size, the sliding window chunker will automatically split it into multiple chunks.

This ensures precise control over how your data is broken down before embedding.
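
If you prefer to add the markers programmatically rather than by hand, here is a minimal sketch that inserts a break before every top-level header (filenames are placeholders; this is only relevant when Manual Window Chunking is selected):

# Insert ---DOCUMENT_BREAK--- before each top-level header so that Manual
# Window Chunking splits the document on major sections.
def add_document_breaks(markdown_text):
    out = []
    for line in markdown_text.splitlines():
        if line.startswith("# ") and out:  # skip the very first header
            out.extend(["", "---DOCUMENT_BREAK---", ""])
        out.append(line)
    return "\n".join(out)

with open("document.md") as f:               # placeholder input file
    marked = add_document_breaks(f.read())

with open("document_with_breaks.md", "w") as f:
    f.write(marked)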


📊 Embedding Model Configuration

Supported Embedding Model

The system uses the following embedding model for vector generation:

| Model | Max Token Length | Estimated Max Words |
|---|---|---|
| E5-Mistral-7B-Instruct | 4096 | ~3,040 |

Important Notes:

  • Each model has a maximum sequence length as shown in the table above
  • Using inputs longer than the model's max token length is not recommended
  • Check model specifications for language support and optimal use cases
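
As a rough pre-check, you can estimate token counts from word counts using the ratio implied by the table above (about 4,096 tokens to ~3,040 words). This is only a heuristic, not a tokenizer:

MAX_TOKENS = 4096
TOKENS_PER_WORD = 4096 / 3040  # ~1.35, a rough heuristic implied by the table above

def estimated_tokens(text):
    return int(len(text.split()) * TOKENS_PER_WORD)

sample = "Some chunk of text..."  # placeholder
if estimated_tokens(sample) > MAX_TOKENS:
    print("This text likely exceeds the embedding model's max sequence length")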

Step 4: Monitor ingestion status (optional)

After starting an ingestion job, you can check its status until completion:

import time

timeout = 300  # 5 minutes timeout
interval = 5   # Check every 5 seconds
start_time = time.time() # Track start time

while True:
    job_status = client.vector_database.retrieve_ingestion_job(database_id, job_id)
    status = job_status.status
    print(f"Ingestion job status: {status}")

    if status == "completed":
        print("Vector database ready!")
        break
    elif status == "failed":
        error = getattr(job_status, "error_message", "Unknown error")
        print(f"Ingestion job failed: {error}")
        break

    # Check if timeout was exceeded
    elapsed_time = time.time() - start_time
    if elapsed_time >= timeout:
        print(f"Timeout reached after {timeout} seconds. Job status: {status}")
        break
        
    time.sleep(interval)

Sample response:

Ingestion job status: completed
Vector database ready!
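
If you run ingestion from scripts, it can be convenient to wrap this loop in a helper. A sketch using only the calls shown above:

import time

def wait_for_ingestion(client, database_id, job_id, timeout=300, interval=5):
    # Poll an ingestion job until it completes, fails, or the timeout expires
    start = time.time()
    while time.time() - start < timeout:
        job = client.vector_database.retrieve_ingestion_job(database_id, job_id)
        if job.status in ("completed", "failed"):
            return job.status
        time.sleep(interval)
    return "timeout"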

Complete example: Database creation > document ingestion

This example demonstrates the entire workflow for creating a vector database, adding files, and kicking off an ingestion job:

from seekrai import SeekrFlow
import time

client = SeekrFlow()

# Step 1: Create vector database
print("Creating vector database...")
db_name = f"QuickStart_DB_{int(time.time())}"
vector_db = client.vector_database.create(
    model="intfloat/e5-mistral-7b-instruct",
    name=db_name,
    description="Quick start example database"
)
database_id = vector_db.id
print(f"Created database: {vector_db.name} (ID: {database_id})")

# Step 2: Upload file
print("Uploading file...")
file_path = "document.pdf"  # Replace with your file path
upload_response = client.files.upload(file_path, purpose="alignment")
file_id = upload_response.id
print(f"Uploaded file with ID: {file_id}")

# Step 3: Begin vector database ingestion
print("Creating ingestion job...")
ingestion_job = client.vector_database.create_ingestion_job(
    database_id=database_id,
    files=[file_id],
    method="best",
    token_count=512,
    overlap_tokens=50
)
job_id = ingestion_job.id
print(f"Created ingestion job with ID: {job_id}")

# Step 4: Wait for ingestion job to complete
print("Waiting for ingestion job to complete...")
timeout = 300   # 5 minutes timeout
interval = 5    # Check every 5 seconds
start_time = time.time()  # Track start time

# Step 5: Monitor job status until completion
while True:
    job_status = client.vector_database.retrieve_ingestion_job(database_id, job_id)
    status = job_status.status
    print(f"Ingestion job status: {status}")
    
    if status == "completed":
        print(f"Vector database ready with ID: {database_id}")
        break
    elif status == "failed":
        error = getattr(job_status, "error_message", "Unknown error")
        print(f"Ingestion job failed: {error}")
        break

    # Stop polling if the timeout is exceeded
    if time.time() - start_time >= timeout:
        print(f"Timeout reached after {timeout} seconds. Job status: {status}")
        break

    time.sleep(interval)

print("Setup complete!")

Vector database management

List all vector databases

databases = client.vector_database.list()

for db in databases.data:
    print(f"ID: {db.id}, Name: {db.name}")

Get a specific vector database

# Get vector database details
db_details = client.vector_database.retrieve(database_id)

print(f"Name: {db_details.name}")
print(f"Last updated: {db_details.updated_at}")

Delete a vector database

# Delete a vector database
client.vector_database.delete(database_id)
print(f"Successfully deleted database {database_id}")

List all files in a vector database

# List files in vector database
db_files = client.vector_database.list_files(database_id)

for file in db_files.data:
    print(f"ID: {file.id}, Filename: {file.filename}")

Delete a file from a vector database

# Delete a file from vector database
client.vector_database.delete_file(database_id, file_id)
print(f"Successfully deleted file {file_id} from database {database_id}")

Troubleshooting

Document processing issues

| Issue | Possible cause | Solution |
|---|---|---|
| Files fail to upload | File exceeds size limit | Split large files or compress them |
| | Invalid file format | Ensure file extension matches actual format |
| | Network timeout | Implement retry logic with exponential backoff |
| Markdown parsing errors | Improper header hierarchy | Fix header structure (ensure proper nesting) |
| | Unsupported Markdown syntax | Use standard Markdown formatting |
| PDF extraction issues | Protected PDF | Remove password protection before uploading |
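
For the network-timeout case, a simple retry wrapper with exponential backoff around the upload call may help. A sketch, assuming the SeekrFlow client shown earlier (in practice, narrow the exception type to the SDK's own error classes):

import time

def upload_with_retry(client, file_path, purpose="alignment", max_attempts=4):
    delay = 2  # seconds before the first retry; doubles on each attempt
    for attempt in range(1, max_attempts + 1):
        try:
            return client.files.upload(file_path, purpose=purpose)
        except Exception as exc:  # replace with the SDK's specific error types
            if attempt == max_attempts:
                raise
            print(f"Upload failed ({exc}); retrying in {delay}s...")
            time.sleep(delay)
            delay *= 2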

File ingestion issues

| Issue | Possible cause | Solution |
|---|---|---|
| Slow ingestion | Complex document structure | Adjust chunking parameters |
| | Resource constraints | Monitor system resources during ingestion |
| | Large batch size | Break into smaller batches |
| Failed ingestion job | Malformed content | Check files for compatibility issues |
| | Service timeout | Increase timeout settings |
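
For the large-batch case, splitting the file list before calling bulk_upload keeps each request small. A minimal sketch (the batch size and paths are placeholders):

# Split a large list of files into smaller batches before bulk upload
def batched(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

all_paths = ["doc1.pdf", "doc2.md", "doc3.docx"]  # placeholder paths
for batch in batched(all_paths, size=2):
    resp = client.files.bulk_upload(batch, purpose="alignment")
    print([r.id for r in resp])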