# Data Preparation
Upload multiple files simultaneously in their original formats, and the AI-Ready Data Engine will automatically extract the relevant data and structure it into a dataset that can be used for fine-tuning.

Begin the data preparation process by gathering source documents to upload to the Data Engine. Aim to collect documents that are relevant to your domain and structured in a clear, logical way, with headings and subheadings for readability.
## Step 1: Preparing Your Files for Upload
### Supported File Types

- Markdown (`.md`)
- PDF (`.pdf`)
- Word documents (`.docx`)
- JSON (`.json`)
### Length Requirements

- Your file should have at least four cohesive sections, each detailed enough to form a meaningful “chunk.”
- Exact token counts are not mandatory, but each section should contain enough substantive text to stand on its own as a document chunk (a quick check is sketched below).
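If you want a rough pre-upload sanity check, the following minimal sketch (not part of SeekrFlow; the file name and threshold are assumptions) counts Markdown headings as a proxy for sections:

```python
import re

def count_sections(path: str) -> int:
    """Count Markdown heading lines (#, ##, ...) as a rough proxy for sections."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    # Match heading lines such as "# Title" or "## Subtitle"
    return len(re.findall(r"^#{1,6}\s+\S", text, flags=re.MULTILINE))

sections = count_sections("example.md")  # hypothetical file name
print(f"Found {sections} sections")
if sections < 4:
    print("Warning: fewer than 4 sections; the resulting chunks may be too sparse.")
```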
### File Formatting Guidelines

Before uploading, make sure your files are properly formatted by following these steps:

| PDF and DOCX | Markdown | JSON |
|---|---|---|
| 1. Use clear headings and avoid images without context. 2. Ensure text content is structured logically for conversion. | 1. Use correct header hierarchy (`#`, `##`, `###`, etc.). 2. Avoid missing or empty headers and skipped levels. 3. Limit headers to six levels (`######`). 4. Ensure all sections have meaningful content under them. | 1. Validate the JSON structure using a tool like JSONLint. 2. Ensure all key-value pairs are well-formed and relevant. |
Note: Google Docs users can export files as Markdown, PDF, or DOCX.
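As a quick local alternative to JSONLint, this minimal sketch (the file name is an assumption) uses Python's built-in `json` module to confirm a file parses cleanly before upload:

```python
import json

path = "example.json"  # hypothetical file name

try:
    with open(path, "r", encoding="utf-8") as f:
        json.load(f)  # Raises json.JSONDecodeError on malformed JSON
    print(f"{path} is valid JSON")
except json.JSONDecodeError as e:
    print(f"{path} is invalid JSON: {e}")
```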
Once your files look good, proceed to the next step.
## Step 2: Upload Training Files
Use SeekrFlow’s API to upload key documents and generate an AI-ready dataset. Documents can be in the following formats:

- PDF
- DOCX
- Markdown
- JSON
This dataset, a storage-efficient QA Pair Parquet file, can then be used to run a supervised instruction fine-tuning job.
Note: If you plan to use the generated dataset together with your own user-provided training data, combine the two into a single dataset before submitting it to the fine-tuning service.
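For example, one way to merge the two datasets (a minimal sketch using pandas with a Parquet engine such as pyarrow; the file names, and the assumption that both files share the same QA-pair schema, are hypothetical):

```python
import pandas as pd

# Hypothetical file names; both files are assumed to share the same QA-pair schema
generated = pd.read_parquet("generated_qa_pairs.parquet")
user_data = pd.read_parquet("my_training_data.parquet")

# Stack the rows and write out a single combined dataset
combined = pd.concat([generated, user_data], ignore_index=True)
combined.to_parquet("combined_training_data.parquet")
print(f"Combined dataset: {len(combined)} rows")
```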
The following example code shows how to upload a single file or multiple files of varying types.

### Upload a single file

Upload a single file, up to 100 MB.

Endpoint: `PUT /v1/flow/files`
Upload Training File:

```python
import requests

# Define API endpoint
url = "https://flow.seekr.com/v1/flow/files"

# Set headers with API key
headers = {
    "Authorization": "YOUR_API_KEY",
    "accept": "application/json"
}

# Open the file and mark its purpose as a plain form field
files = {
    "files": ("example.pdf", open("/path/to/your/example.pdf", "rb")),
    "purpose": (None, "alignment")
}

# Make the API request
response = requests.put(url, files=files, headers=headers)

# Print the response
print("Status Code:", response.status_code)
print("Response:", response.json())
```
Sample response:

```
Status Code for example.pdf: 200
Response for example.pdf:
{
    "id": "file_1234567890",
    "object": "file",
    "created_at": "example_timestamp",
    "type": "pdf",
    "purpose": "alignment",
    "filename": "example.pdf",
    "bytes": 102400,
    "created_by": "user"
}
```
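Save the `id` value from the response: if your file is a PDF or DOCX, you’ll pass it as a file ID when starting an ingestion job in Step 3.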
### Upload multiple files

Upload up to 10 documents at a time, up to 100 MB total size. The following example uploads two files, one PDF and one DOCX, simultaneously.

Endpoint: `PUT /v1/flow/bulk_files`

Upload Multiple Training Files:
```python
import requests
import os

url = "https://flow.seekr.com/v1/flow/bulk_files"

headers = {
    "Authorization": "YOUR_API_KEY",
    "accept": "application/json"
}

# Build the multipart payload: one "files" part per document
file_paths = ["file_1.docx", "file_2.pdf"]
files_to_upload = [
    ("files", (os.path.basename(file_path), open(file_path, "rb")))
    for file_path in file_paths
]
files_to_upload.append(("purpose", (None, "alignment")))

response = requests.put(url, files=files_to_upload, headers=headers)

print("Status Code:", response.status_code)
print("Response:", response.json())
```
Sample response:

```
Status Code for file_1.docx: 200
Response for file_1.docx:
{
    'id': 'file-12345',
    'object': 'file',
    'created_at': 'example_timestamp',
    'type': 'docx',
    'purpose': 'alignment',
    'filename': 'file_1.docx',
    'bytes': 348111,
    'created_by': 'user'
}

Status Code for file_2.pdf: 200
Response for file_2.pdf:
{
    'id': 'file-67890',
    'object': 'file',
    'created_at': 'example_timestamp',
    'type': 'pdf',
    'purpose': 'alignment',
    'filename': 'file_2.pdf',
    'bytes': 1711963,
    'created_by': 'user'
}
```
## Step 3: Starting an Ingestion Job (PDFs and DOCX only)

If your original source was a PDF or DOCX, you must first convert it to Markdown; the ingestion job handles this conversion. If you already have a Markdown file, skip this step.

If your file has embedded tables, make sure there are no empty cells. An empty cell risks shifting the rest of the row's content over to fill it, corrupting the table's structure; one way to check is sketched below.
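As a rough pre-flight check, this sketch (not part of SeekrFlow; it assumes pipe-delimited Markdown tables and a hypothetical file name) flags table rows that contain empty cells:

```python
import re

def find_empty_table_cells(path: str) -> list[int]:
    """Return line numbers of Markdown table rows that contain empty cells."""
    bad_lines = []
    with open(path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            stripped = line.strip()
            # Inspect pipe-delimited rows only, skipping separator rows like |---|---|
            if stripped.startswith("|") and not re.match(r"^\|[\s:\-|]+\|$", stripped):
                cells = [c.strip() for c in stripped.strip("|").split("|")]
                if any(cell == "" for cell in cells):
                    bad_lines.append(lineno)
    return bad_lines

print(find_empty_table_cells("example.md"))  # hypothetical file name
```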
url = "https://flow.seekr.com/v1/flow/alignment/ingestion"
payload = {
"file_ids": ["file-123456789"] # The file ID returned from upload
}
response = requests.post(url, json=payload, headers=headers)
print("Status Code:", response.status_code)
print("Response:", response.json())
Sample response:

```
{
    "id": "ij-123456789",
    "created_at": "example_timestamp",
    "status": "running",
    "output_files": []
}
```
Once ingestion is complete, you’ll receive a Markdown file that you can use for fine-tuning.
## Step 4: Checking Ingestion Status (Optional)

Endpoint: `GET /v1/flow/alignment/ingestion/{job_id}`

After you initiate ingestion, you may want to check the status to confirm it has completed successfully.

Get Detailed Job Status:

```python
import requests

headers = {"Authorization": "YOUR_API_KEY", "accept": "application/json"}

job_id = "ij-123456789"
url = f"https://flow.seekr.com/v1/flow/alignment/ingestion/{job_id}"  # f-string fills in the job ID

response = requests.get(url, headers=headers)
print("Status Code:", response.status_code)
print("Response:", response.json())
```
Sample response:

```
{
    "id": "ij-123456789",
    "created_at": "example_timestamp",
    "status": "completed",
    "output_files": [
        "file-123456789"
    ]
}
```
Check the `status` field. Once it shows `completed`, your output file(s) will be available for the next step: generating data for standard instruction fine-tuning.
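If you'd rather not check by hand, a minimal polling sketch (the interval and timeout are arbitrary assumptions, not SeekrFlow recommendations) waits until the job leaves the `running` state:

```python
import time
import requests

headers = {"Authorization": "YOUR_API_KEY", "accept": "application/json"}
job_id = "ij-123456789"
url = f"https://flow.seekr.com/v1/flow/alignment/ingestion/{job_id}"

# Poll every 10 seconds, for up to 30 minutes
for _ in range(180):
    job = requests.get(url, headers=headers).json()
    if job["status"] != "running":
        break
    time.sleep(10)

print("Final status:", job["status"])
print("Output files:", job["output_files"])
```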