Create and Populate a Vector Database
Set up a vector database and ingest documents to generate embeddings for semantic search and retrieval.
This guide covers setting up the vector database, processing your documents, creating embeddings, and monitoring ingestion jobs. A complete example follows the step-by-step guide, along with guidance for managing your vector databases.
Step 1: Set up a vector database
Seekr's Vector Database SDK provides advanced semantic search capabilities by transforming text into vector embeddings, making it possible to perform semantic searches that focus on meaning and context. This approach provides a smarter and more intuitive way to retrieve documents compared to traditional keyword-based methods.
Start by creating a new vector database, specifying the embedding model:
Supported embedding models
Model Name | Dimensions | Best Used For |
---|---|---|
intfloat/e5-mistral-7b-instruct | 4096 | This model has some multilingual capability. However, since it was mainly trained on English data, we recommend using this model for English only. |
Create an empty vector database
Create a new, empty vector database with your chosen embedding model:
from seekrai import SeekrFlow
# Initialize client
client = SeekrFlow(api_key="YOUR KEY HERE") # you can leave this empty if your key is stored as an environment variable
# Create vector database
vector_db = client.vector_database.create(
model="intfloat/e5-mistral-7b-instruct",
name="QuickStart_DB",
description="Quick start example database"
)
database_id = vector_db.id
print(f"Created database: {vector_db.name} (ID: {database_id})")
Sample response:
Created database: QuickStart_DB (ID: b7123456789-09876-4567)
Step 2: Upload files
Supported file types
- PDF (`.pdf`)
- Word documents (`.docx`)
- Markdown (`.md`)
File size guidelines
Upload multiple files, up to 4GB each.
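If you want to confirm files are within the limit before uploading, a quick local check is one option. The helper below is illustrative and not part of the SDK; the 4 GB figure comes from the guideline above:
import os

MAX_BYTES = 4 * 1024**3  # 4 GB per-file limit noted above

def files_within_limit(paths):
    """Return only the paths whose size is under the upload limit."""
    ok = []
    for path in paths:
        size = os.path.getsize(path)
        if size < MAX_BYTES:
            ok.append(path)
        else:
            print(f"Skipping {path}: {size / 1024**3:.2f} GB exceeds the 4 GB limit")
    return ok

print(files_within_limit(["documentation.pdf", "policies.md"]))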
Markdown formatting guidelines
Before uploading, check that your Markdown files are properly formatted to avoid rejection:
- All files must have correctly ordered headers (`#` followed by `##`, and so on) with titles and meaningful content. For example:
# Customer service plan
Some content, separated from the header by a line.
## Unaccompanied minor service
Some more content
### Etc.
- Avoid using headers with more than 6 hashtags (e.g., `####### Pointlessly small md header`).
Find some example Markdown files here.
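To catch header problems before uploading, a minimal script along these lines can flag skipped levels and overly deep headers. The check is illustrative (for example, it does not skip headers inside fenced code blocks) and is not an SDK feature:
import re

def check_markdown_headers(path):
    """Flag headers that skip levels or use more than 6 hashtags."""
    problems = []
    prev_level = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            match = re.match(r"^(#+)\s", line)
            if not match:
                continue
            level = len(match.group(1))
            if level > 6:
                problems.append(f"Line {lineno}: header uses more than 6 hashtags")
            elif level > prev_level + 1:
                problems.append(f"Line {lineno}: header level jumps from {prev_level} to {level}")
            prev_level = level
    return problems

for problem in check_markdown_headers("policies.md"):
    print(problem)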
Upload a file for ingestion
Next, upload your files for processing.
Note: If you already have file_ids from a separate ingestion job, you can skip this step and use the same file_ids.
# Upload a file (choose the path style for your OS)
# Windows (use a raw string to avoid backslash issues):
# file_path = r"C:\Users\username\Downloads\document.pdf"
# Mac/Linux:
file_path = "/Users/username/Downloads/document.pdf"  # Replace with your file path
upload_response = client.files.upload(file_path, purpose="alignment")
file_id = upload_response.id
print(f"Uploaded file with ID: {file_id}")
Once uploaded, each file is given a unique `file_id` to use for ingestion.
Upload a batch of files for ingestion (optional)
The endpoint accepts an array of `file_ids` as input.
from seekrai import SeekrFlow
client = SeekrFlow()
bulk_resp = client.files.bulk_upload(
["documentation.pdf", "policies.md", "guidelines.docx"], purpose="alignment"
)
print("Upload complete")
# Access and print the ID of each uploaded file
print("\nFile IDs:")
for resp in bulk_resp:
print(resp.id)
Sample response:
Uploading file documentation.pdf: 100%|██████████████████████████████████████████████████████████████████████| 7.16M/7.16M [00:08<00:00, 857kB/s]
Uploading file policies.md: 100%|█████████████████████████████████████████████████████████████████| 1.26M/1.26M [00:08<00:00, 151kB/s]
Uploading file guidelines.docx: 100%|███████████████████████████████████████████████████████████| 21.9k/21.9k [00:08<00:00, 2.62kB/s]
Upload complete
File IDs:
file-457989bc-2cf5-11f0-8b3b-56f95a5e9ef4
file-45f909da-2cf5-11f0-8b3b-56f95a5e9ef4
file-46226d20-2cf5-11f0-8b3b-56f95a5e9ef4
Step 3: Initiate vector database ingestion
Next, create a job to ingest documents into your vector database. This step converts the files and creates embeddings from them.
The `token_count` parameter specifies the target size of each chunk, ensuring each chunk is neither too large (risking truncation by model limits) nor too small (losing semantic coherence).
Best practices:
- Common ranges: For embedding and retrieval, 200–500 tokens per chunk is a widely used range, balancing context and efficiency. The example here uses a token count of 512.
- Adjust for document type: If your documents are dense or have complex structure (e.g., legal, technical), consider slightly larger chunks; for conversational or highly variable content, smaller chunks may work better.
The `overlap_tokens` parameter creates overlapping regions between adjacent chunks at chunk boundaries, reducing the risk of missing relevant information that spans two chunks.
Adjust chunking parameters based on document characteristics:
Document Type | Recommended token_count | Recommended overlap_tokens |
---|---|---|
Technical documentation | 384-512 | 50-75 |
Legal documents | 512-768 | 75-100 |
Conversational content | 256-384 | 25-50 |
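To apply these recommendations in code, one option is a small lookup keyed by document type. The preset values below are rough midpoints of the ranges in the table, and the helper itself is illustrative, not part of the SDK:
# Suggested chunking presets (approximate midpoints of the ranges above)
CHUNKING_PRESETS = {
    "technical": {"token_count": 448, "overlap_tokens": 64},
    "legal": {"token_count": 640, "overlap_tokens": 88},
    "conversational": {"token_count": 320, "overlap_tokens": 40},
}

def chunking_params(doc_type: str) -> dict:
    """Return chunking parameters for a document type, defaulting to the technical preset."""
    return CHUNKING_PRESETS.get(doc_type, CHUNKING_PRESETS["technical"])

# Pass the chosen preset straight through to create_ingestion_job, e.g.:
# client.vector_database.create_ingestion_job(database_id=database_id, files=[file_id],
#                                             method="best", **chunking_params("legal"))
print(chunking_params("legal"))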
The `method` parameter determines how files are converted to text before chunking. The quality of this conversion directly affects chunking accuracy and downstream tasks. Setting it to `best` selects the best available method for document conversion.
# Create ingestion job
ingestion_job = client.vector_database.create_ingestion_job(
database_id=database_id,
files=[file_id],
method="best",
token_count=512,
overlap_tokens=50
)
job_id = ingestion_job.id
print(f"Created ingestion job: {job_id}")
Sample response:
Created ingestion job: ij-d80bd45a-4bb5-4bac-bbf3-7e3345409bc8
Step 4: Monitor ingestion status (optional)
After starting an ingestion job, you can check its status until completion:
import time

timeout = 300  # 5 minutes timeout
interval = 5  # Check every 5 seconds
start_time = time.time()  # Track start time

while True:
    job_status = client.vector_database.retrieve_ingestion_job(database_id, job_id)
    status = job_status.status
    print(f"Ingestion job status: {status}")

    if status == "completed":
        print("Vector database ready!")
        break
    elif status == "failed":
        error = getattr(job_status, "error_message", "Unknown error")
        print(f"Ingestion job failed: {error}")
        break

    # Check if timeout was exceeded
    elapsed_time = time.time() - start_time
    if elapsed_time >= timeout:
        print(f"Timeout reached after {timeout} seconds. Job status: {status}")
        break

    time.sleep(interval)
Sample response:
Ingestion job status: completed
Vector database ready!
Complete example: database creation and document ingestion
This example demonstrates the entire workflow for creating a vector database, adding files, and kicking off an ingestion job:
from seekrai import SeekrFlow
import time

client = SeekrFlow()
# Step 1: Create vector database
print("Creating vector database...")
db_name = f"QuickStart_DB_{int(time.time())}"
vector_db = client.vector_database.create(
model="intfloat/e5-mistral-7b-instruct",
name=db_name,
description="Quick start example database"
)
database_id = vector_db.id
print(f"Created database: {vector_db.name} (ID: {database_id})")
# Step 2: Upload file
print("Uploading file...")
file_path = "document.pdf" # Replace with your file path
upload_response = client.files.upload(file_path, purpose="alignment")
file_id = upload_response.id
print(f"Uploaded file with ID: {file_id}")
# Step 3: Begin vector database ingestion
print("Creating ingestion job...")
ingestion_job = client.vector_database.create_ingestion_job(
database_id=database_id,
files=[file_id],
method="best",
token_count=512,
overlap_tokens=50
)
job_id = ingestion_job.id
print(f"Created ingestion job with ID: {job_id}")
# Step 4: Monitor ingestion job status until completion
print("Waiting for ingestion job to complete...")
timeout = 300  # 5 minutes timeout
interval = 5  # Check every 5 seconds
start_time = time.time()

while True:
    job_status = client.vector_database.retrieve_ingestion_job(database_id, job_id)
    status = job_status.status
    print(f"Ingestion job status: {status}")

    if status == "completed":
        print(f"Vector database ready with ID: {database_id}")
        break
    elif status == "failed":
        error = getattr(job_status, "error_message", "Unknown error")
        print(f"Ingestion job failed: {error}")
        break

    if time.time() - start_time >= timeout:
        print(f"Timeout reached after {timeout} seconds. Job status: {status}")
        break

    time.sleep(interval)

print("Setup complete!")
Vector database management
List all vector databases
databases = client.vector_database.list()
for db in databases.data:
print(f"ID: {db.id}, Name: {db.name}")
Get a specific vector database
# Get vector database details
db_details = client.vector_database.retrieve(database_id)
print(f"Name: {db_details.name}")
print(f"Last updated: {db_details.updated_at}")
Delete a vector database
# Delete a vector database
client.vector_database.delete(database_id)
print(f"Successfully deleted database {database_id}")
List all files in a vector database
# List files in vector database
db_files = client.vector_database.list_files(database_id)
for file in db_files.data:
print(f"ID: {file.id}, Filename: {file.filename}")
Delete a file from a vector database
# Delete a file from vector database
client.vector_database.delete_file(database_id, file_id)
print(f"Successfully deleted file {file_id} from database {database_id}")
Troubleshooting
Document processing issues
Issue | Possible cause | Solution |
---|---|---|
Files fail to upload | File exceeds size limit | Split large files or compress them |
 | Invalid file format | Ensure the file extension matches the actual format |
 | Network timeout | Implement retry logic with exponential backoff (see the sketch below) |
Markdown parsing errors | Improper header hierarchy | Fix the header structure (ensure proper nesting) |
 | Unsupported Markdown syntax | Use standard Markdown formatting |
PDF extraction issues | Protected PDF | Remove password protection before uploading |
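For the network-timeout case, one approach is to wrap the upload in a retry loop with exponential backoff. This is a minimal sketch that reuses the client from the earlier steps and assumes transient failures surface as exceptions; narrow the exception type to the SDK's actual error classes if you know them:
import time

def upload_with_retry(client, file_path, purpose="alignment", max_attempts=4):
    """Retry an upload with exponential backoff (1s, 2s, 4s, ...) between attempts."""
    for attempt in range(max_attempts):
        try:
            return client.files.upload(file_path, purpose=purpose)
        except Exception as exc:  # replace with the SDK's timeout/network errors if known
            if attempt == max_attempts - 1:
                raise
            wait = 2 ** attempt
            print(f"Upload failed ({exc}); retrying in {wait}s...")
            time.sleep(wait)

upload_response = upload_with_retry(client, "document.pdf")
print(f"Uploaded file with ID: {upload_response.id}")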
File ingestion issues
Issue | Possible cause | Solution |
---|---|---|
Slow ingestion | Complex document structure | Adjust chunking parameters |
 | Resource constraints | Monitor system resources during ingestion |
 | Large batch size | Break the job into smaller batches (see the sketch below) |
Failed ingestion job | Malformed content | Check files for compatibility issues |
 | Service timeout | Increase timeout settings |
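If a large batch is the bottleneck, you can split the uploaded file IDs into smaller groups and create one ingestion job per group. This rough sketch reuses the client, database_id, and bulk_resp from the earlier steps; the batch size of 10 is an arbitrary starting point:
def batched(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

all_file_ids = [resp.id for resp in bulk_resp]  # from the bulk upload step
job_ids = []
for group in batched(all_file_ids, 10):
    job = client.vector_database.create_ingestion_job(
        database_id=database_id,
        files=group,
        method="best",
        token_count=512,
        overlap_tokens=50,
    )
    job_ids.append(job.id)
    print(f"Created ingestion job {job.id} for {len(group)} files")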