Storage
Upload files and create vector databases for training and retrieval workflows.
Storage manages raw content through file ingestion and vector database creation. Files are uploaded, processed, and organized for downstream use. Vector databases transform these files into searchable knowledge bases through document chunking and embedding generation.
File storage
File storage manages the upload, processing, and organization of raw content. The system ingests documents in multiple formats, extracts text and structure, and maintains files throughout their lifecycle as source material for downstream workflows.
How it works
The file storage system handles the complete ingestion pipeline:
- Upload – Files are uploaded through the UI, API, or SDK
- Processing – The system extracts text content and document structure
- Storage – Processed files are stored with metadata and status tracking
- Access – Files remain available for vector database creation and dataset generation
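As a concrete illustration, a minimal upload over HTTP might look like the sketch below. The base URL, endpoint path, and response fields are assumptions for illustration only, not the documented SeekrFlow API; consult the API reference for the actual paths and parameters.

```python
import requests

# Assumed base URL and auth scheme -- illustrative, not the real API.
BASE_URL = "https://api.example.com/v1"
headers = {"Authorization": "Bearer YOUR_API_KEY"}

# Upload a local markdown file as multipart/form-data.
with open("handbook.md", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/files",
        headers=headers,
        files={"file": ("handbook.md", f, "text/markdown")},
    )
resp.raise_for_status()

file_record = resp.json()
print(file_record["id"], file_record["status"])  # e.g. "file_abc123", "processing"
```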
Supported formats
File storage accepts multiple document formats:
- DOCX
- JSON
- Markdown
- TXT
The system automatically detects file types and applies appropriate processing for content extraction.
File metadata
Each uploaded file maintains metadata throughout its lifecycle:
- File ID – Unique identifier for the file
- Filename – Original file name
- Status – Processing state (uploading, processing, ready, failed)
- Size – File size in bytes
- Upload timestamp – When the file was added to storage
- Content type – Detected or specified file type
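Put together, a file record might look like the following illustrative example. The field names are assumptions modeled on the list above, not the exact API schema.

```python
# Illustrative file record; field names are assumptions, not the exact schema.
file_record = {
    "id": "file_abc123",                    # File ID
    "filename": "handbook.md",              # Original file name
    "status": "ready",                      # uploading | processing | ready | failed
    "size_bytes": 48213,                    # File size in bytes
    "created_at": "2025-01-15T10:32:00Z",   # Upload timestamp
    "content_type": "text/markdown",        # Detected or specified type
}
```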
File organization
Files can be organized and managed through:
- Listing – Retrieve all files or filter by status and metadata
- Retrieval – Access individual files by ID
- Deletion – Remove files no longer needed
- Status monitoring – Track processing progress for uploaded files
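A minimal sketch of these operations over HTTP follows, again with assumed endpoint paths and response shapes rather than the documented API.

```python
import requests

BASE_URL = "https://api.example.com/v1"   # assumed base URL
headers = {"Authorization": "Bearer YOUR_API_KEY"}

# Listing: retrieve all files, then filter by status client-side.
# The "data" envelope is an assumed response shape.
files = requests.get(f"{BASE_URL}/files", headers=headers).json()["data"]
ready = [f for f in files if f["status"] == "ready"]

# Retrieval: access an individual file by ID.
record = requests.get(f"{BASE_URL}/files/{ready[0]['id']}", headers=headers).json()

# Deletion: remove a file that is no longer needed.
requests.delete(f"{BASE_URL}/files/{record['id']}", headers=headers)
```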
Vector databases
Vector databases (also called vector stores) store document embeddings for semantic search and retrieval. They transform raw files into searchable knowledge bases by chunking documents and generating embeddings for each segment, allowing agents to find relevant information based on meaning rather than keyword matching.
How it works
The vector database creation process:
- File selection – Choose which files to include in the vector database.
- Chunking – Documents are split into segments based on configurable parameters.
- Embedding generation – Each chunk is converted into a vector representation using specialized embedding models.
- Storage – Embeddings and their associated text are stored for retrieval.
- Search – Queries are embedded and compared against stored vectors to find semantically similar content.
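A minimal sketch of the creation step is shown below; the endpoint, payload shape, and chunking parameter names are illustrative assumptions, not the documented request format.

```python
import requests

BASE_URL = "https://api.example.com/v1"   # assumed base URL
headers = {"Authorization": "Bearer YOUR_API_KEY"}

# Create a vector store from previously uploaded files; the payload
# shape and chunking parameter names are illustrative assumptions.
resp = requests.post(
    f"{BASE_URL}/vector_stores",
    headers=headers,
    json={
        "name": "product-docs",
        "file_ids": ["file_abc123", "file_def456"],
        "chunking": {"method": "sliding_window", "chunk_size": 512, "overlap": 50},
    },
)
store = resp.json()
print(store["id"], store["status"])   # chunking and embedding run asynchronously
```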
Chunking strategy
Document chunking determines how files are segmented before embedding. Proper chunking ensures that retrieved segments contain complete, coherent information relevant to queries. SeekrFlow supports multiple chunking methods to optimize retrieval for different document types and use cases:
Sliding window chunking
Creates fixed-size segments with configurable token counts and overlap between adjacent chunks. This method provides consistent chunk sizes and ensures context continuity across boundaries.
- Chunk size – Number of tokens per segment (default: 512)
- Overlap – Tokens shared between adjacent chunks (default: 50)
Best for: General-purpose retrieval when documents lack clear structural boundaries.
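The windowing mechanics can be sketched in a few lines of Python. This is a generic illustration using whitespace-split "tokens" rather than SeekrFlow's actual tokenizer, with the default parameters listed above.

```python
def sliding_window_chunks(tokens, chunk_size=512, overlap=50):
    """Split a token sequence into fixed-size chunks with shared overlap."""
    assert 0 <= overlap < chunk_size
    step = chunk_size - overlap          # how far the window advances each time
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

# Whitespace splitting stands in for a real tokenizer in this sketch.
tokens = ("word " * 1200).split()
chunks = sliding_window_chunks(tokens)
print([len(c) for c in chunks])          # [512, 512, 276]; adjacent chunks share 50 tokens
```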
Markdown chunking
Splits documents at markdown structural elements (headers, sections, lists). This method preserves logical document organization and keeps related content together.
Best for: Technical documentation, structured reports, and content with clear hierarchical organization.
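A simplified header-based splitter illustrates the idea; the actual implementation also accounts for sections and lists, which this sketch omits.

```python
import re

def markdown_chunks(text):
    """Split a markdown document before each header so sections stay intact."""
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    return [s.strip() for s in sections if s.strip()]

doc = "# Intro\nWelcome.\n\n## Setup\nInstall the SDK.\n\n## Usage\nCall the API."
for chunk in markdown_chunks(doc):
    print(chunk.splitlines()[0])   # '# Intro', '## Setup', '## Usage'
```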
Semantic chunking
Analyzes content meaning to identify natural topic boundaries and groups semantically related information into chunks. This method creates variable-size segments based on conceptual coherence rather than fixed token counts.
Best for: Long-form content, narrative documents, and materials where maintaining topical coherence is critical for retrieval quality.
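One common way to implement this, sketched below with stand-in embedding vectors (not necessarily SeekrFlow's exact algorithm), is to start a new chunk wherever the similarity between adjacent sentence embeddings drops below a threshold, signaling a topic shift.

```python
import numpy as np

def semantic_chunks(sentences, vectors, threshold=0.6):
    """Group sentences into variable-size chunks, breaking where the
    embedding similarity between adjacent sentences drops (a topic shift)."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        a, b = vectors[i - 1], vectors[i]
        sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        if sim < threshold:              # natural topic boundary
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

# Toy demo: two sentences about setup, then a topic shift to billing.
# In practice the vectors come from a sentence-embedding model.
sents = ["Install the SDK.", "Configure your API key.", "Invoices are sent monthly."]
vecs = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]])   # stand-in embeddings
print(semantic_chunks(sents, vecs))
# ['Install the SDK. Configure your API key.', 'Invoices are sent monthly.']
```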
Embedding models
Vector databases use specialized embedding models to convert text into vector representations. These models are trained to encode semantic meaning, allowing similarity comparisons between queries and document chunks.
The embedding dimension and model choice affect:
- Retrieval accuracy – How well the system identifies relevant content
- Storage requirements – Vector database size based on embedding dimensions
- Search speed – Query performance relative to database size and embedding complexity
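The storage effect is straightforward to estimate: at float32 precision, each vector occupies 4 bytes per dimension. Using assumed figures of one million chunks and 768-dimensional embeddings:

```python
num_chunks = 1_000_000   # assumed corpus size
dimensions = 768         # assumed embedding dimension
bytes_per_value = 4      # float32

total_gb = num_chunks * dimensions * bytes_per_value / 1e9
print(f"{total_gb:.1f} GB of raw vector storage")   # ~3.1 GB, before indexes and metadata
```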
Semantic search
Vector databases power semantic search capabilities:
- Meaning-based retrieval – Find content based on conceptual similarity rather than exact keyword matches
- Ranked results – Return chunks ordered by relevance to the query
- Context preservation – Retrieve coherent segments that maintain document structure
- Multi-document search – Query across all files attached to the vector store
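At its core, ranking is a similarity comparison between the query embedding and every stored vector. The brute-force sketch below shows the idea; production vector stores typically use approximate nearest-neighbor indexes rather than a full scan.

```python
import numpy as np

def top_k(query_vec, stored_vecs, chunks, k=3):
    """Rank stored chunks by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    m = stored_vecs / np.linalg.norm(stored_vecs, axis=1, keepdims=True)
    scores = m @ q                          # one cosine score per chunk
    best = np.argsort(scores)[::-1][:k]     # highest similarity first
    return [(chunks[i], float(scores[i])) for i in best]
```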
Vector store management
Vector databases can be managed through:
- Creation – Build new vector stores from selected files
- File attachment – Add additional files to existing vector stores
- Status monitoring – Track processing progress for vector database creation
- Integration – Connect vector stores to agent FileSearch tools
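For example, attaching a file to an existing store and polling for completion might look like the following sketch, using the same assumed endpoints as the earlier examples.

```python
import requests, time

BASE_URL = "https://api.example.com/v1"   # assumed base URL
headers = {"Authorization": "Bearer YOUR_API_KEY"}
store_id = "vs_xyz789"                    # an existing vector store (illustrative ID)

# File attachment: add another uploaded file to the store.
requests.post(
    f"{BASE_URL}/vector_stores/{store_id}/files",
    headers=headers,
    json={"file_id": "file_ghi789"},
)

# Status monitoring: poll until embedding generation completes.
status = "processing"
while status == "processing":
    time.sleep(5)
    status = requests.get(
        f"{BASE_URL}/vector_stores/{store_id}", headers=headers
    ).json()["status"]
print("vector store is", status)          # e.g. "ready"
```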
Use cases
Vector databases support several key workflows:
Retrieval-augmented generation (RAG)
Agents use vector databases to find relevant context before generating responses. The FileSearch tool queries vector stores to retrieve document segments that inform the agent's output.
Knowledge base search
Internal documentation, policies, and reference materials become searchable by meaning. Users can ask questions in natural language and receive relevant information from document collections.
Semantic discovery
Find related content across large document sets without knowing exact keywords. Vector databases surface conceptually similar information that traditional search might miss.
Integration with other components
Storage integrates across SeekrFlow:
- Agents – FileSearch tool queries vector databases for knowledge retrieval
- Fine-tuning – Files serve as source content for training dataset generation
- Context-grounded fine-tuning – Vector databases provide retrieval infrastructure for training models to use external knowledge
- Evaluations – Test agent performance on knowledge retrieval tasks using vector stores