Data API
Introduction
This section shows how to manage data with SeekrFlow using its data APIs.
Data format
SeekrFlow accepts Parquet and JSON Lines (.jsonl) files for model fine-tuning, and .json files for principle alignment (a precursor step to fine-tuning in which synthetic data is generated). For fine-tuning, we strongly recommend the Parquet format because it speeds up training.
Let's download an example dataset from Hugging Face. First, install the datasets library from Hugging Face.
pip install datasets
We are going to use the ropes subset of the ChatQA training data from NVIDIA.
import os
from seekrai import SeekrFlow
from datasets import load_dataset, Dataset
client = SeekrFlow(api_key=os.environ.get("SEEKR_API_KEY"))
hf_dataset = load_dataset("nvidia/ChatQA-Training-Data", "ropes")
The format that SeekrFlow expects is the standard Completions API protocol: a list of messages, where each message has a role and content.
[
  {
    "messages": [
      {
        "role": "system",
        "content": "...."
      },
      {
        "role": "user",
        "content": "...."
      },
      {
        "role": "assistant",
        "content": "...."
      }
    ]
  }
]
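Before converting a dataset, it can help to check that each record matches the structure above. The following is a minimal sketch, not part of the SeekrFlow SDK: the helper name validate_record is hypothetical, and treating system, user, and assistant as the only valid roles is an assumption based on the example.

```python
from typing import Any, List

# Roles that appear in the example above. Treating these as the complete
# set of allowed roles is an assumption, not something SeekrFlow documents here.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(record: Any) -> List[str]:
    """Return a list of problems found in one training record (empty if none)."""
    if not isinstance(record, dict) or "messages" not in record:
        return ["record must be an object with a 'messages' key"]
    messages = record["messages"]
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {i} is not an object")
            continue
        role = message.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {role!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {i} has missing or non-string content")
    return problems


example = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Parquet?"},
        {"role": "assistant", "content": "A columnar file format."},
    ]
}
print(validate_record(example))  # prints []
```

Running a check like this over every record before writing the file surfaces malformed rows locally, rather than after an upload.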
Let's format our dataset in the way that SeekrFlow expects it.
seekrflow_dataset = []
for row in hf_dataset['train']:
    seekrflow_dataset.append({
        "messages": [
            {
                "role": "system",
                "content": "SeekrBot is a news question and answering system"
            },
            {
                "role": "user",
                "content": f"Based on the following document: {row['document']} \n\n"
                           f"Answer the following question: {row['messages'][0]['content']}"
            },
            {
                "role": "assistant",
                "content": row['answers'][0]
            }
        ]})

# convert to parquet
seekrflow_dataset = Dataset.from_list(seekrflow_dataset)
seekrflow_dataset.to_parquet("seekrflow_dataset.parquet")
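Since JSON Lines is also accepted for fine-tuning, the same records can be written as a .jsonl file instead. The sketch below uses only the Python standard library; the helper names write_jsonl and read_jsonl are hypothetical, and the assumption that SeekrFlow expects one "messages" object per line is inferred from the format described above rather than confirmed here.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line, which is the JSON Lines layout."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps escapes embedded newlines, so each record stays on one line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Small stand-in records in the same shape as the dataset built above.
records = [
    {
        "messages": [
            {"role": "system", "content": "SeekrBot is a news question and answering system"},
            {"role": "user", "content": "Based on the following document: ... \n\nAnswer the following question: ..."},
            {"role": "assistant", "content": "..."},
        ]
    }
]

jsonl_path = os.path.join(tempfile.gettempdir(), "seekrflow_dataset.jsonl")
write_jsonl(records, jsonl_path)
print(read_jsonl(jsonl_path) == records)  # prints True
```

To export the real dataset this way, pass the list of records built in the loop to write_jsonl before it is converted with Dataset.from_list.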
Uploading data for fine-tuning
Now that we have converted our data into the format that SeekrFlow expects, we can upload it and create a SeekrFlow file.
file = client.files.upload(file="seekrflow_dataset.parquet", purpose="fine-tune") # uploads a file
# to get the id of the uploaded file
print(file.id)
SeekrFlow will generate a unique file identifier that can be used later on, e.g. for fine-tuning a model. We can also retrieve file information and content with the following commands:
# get file info
client.files.retrieve(file.id)
# get file content
client.files.retrieve_content(file.id, output="filename.parquet")
Uploading data for alignment
An important precursor step to model fine-tuning is principle alignment, the process by which SeekrFlow ingests a source document, transforms it into a tree representation, and generates synthetic data for later model training.
To start an alignment job, we first upload a file, here a JSON tree representation of the source document:
file = client.files.upload("seekrflow_alignment_file.json", purpose="alignment") # uploads a file
# to get the id of the uploaded file
print(file.id)
As with uploads for model fine-tuning, SeekrFlow generates a unique file identifier, which can be used to start an alignment job.
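Because the purpose value depends on the file type, a small helper can pick it from the file extension. This is a sketch under stated assumptions: purpose_for is a hypothetical helper, not part of the SeekrFlow SDK, and the mapping only encodes the formats and purpose strings used in this section.

```python
from pathlib import Path

# Mapping taken from this section: Parquet and JSON Lines files are used for
# fine-tuning, and .json files for alignment. Any other rule is out of scope.
PURPOSE_BY_SUFFIX = {
    ".parquet": "fine-tune",
    ".jsonl": "fine-tune",
    ".json": "alignment",
}


def purpose_for(path: str) -> str:
    """Return the upload purpose for a file, based on its extension."""
    suffix = Path(path).suffix.lower()
    try:
        return PURPOSE_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type {suffix!r} for {path}") from None


print(purpose_for("seekrflow_dataset.parquet"))      # prints fine-tune
print(purpose_for("seekrflow_alignment_file.json"))  # prints alignment
```

With the client created earlier, the result could be passed straight to the upload call, for example client.files.upload(file=path, purpose=purpose_for(path)). Rejecting unknown extensions locally avoids uploading a file that no job can use.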
Deleting data
We can delete a file that was previously created:
resp = client.files.delete(id=file.id) # deletes a file
Listing data
We can also list all files that were previously created:
client.files.list()