Data API

Introduction

This section shows how to manage data in SeekrFlow™ with the data APIs: formatting a dataset, then uploading, retrieving, deleting, and listing files.

Data format

SeekrFlow accepts Parquet and JSON Lines (.jsonl) files for model fine-tuning, and .json files for principle alignment (a precursor step to model fine-tuning in which synthetic data is generated). For model fine-tuning, we highly recommend the Parquet format, as it helps speed up training.

Let's download an example dataset from Hugging Face. First, install the Hugging Face datasets library.

pip install datasets

We are going to use the "ropes" subset of the ChatQA training data from NVIDIA.

import os
from seekrai import SeekrFlow
from datasets import load_dataset, Dataset

client = SeekrFlow(api_key=os.environ.get("SEEKR_API_KEY"))

hf_dataset = load_dataset("nvidia/ChatQA-Training-Data", "ropes")

SeekrFlow expects data in the standard chat completions message format: each record holds a list of messages, and each message has a role and content.

[{
  "messages": [
    {
      "role": "system",
      "content": "...."
    },
    {
      "role": "user",
      "content": "...."
    },
    {
      "role": "assistant",
      "content": "...."
    }
  ]
}
]
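Before converting a large dataset, it can help to check that each record matches the structure above. The following is a minimal sketch, not part of the SeekrFlow SDK: the helper name validate_record and the set of allowed roles are assumptions based only on the roles shown in the example above.

```python
# Hypothetical helper (not a SeekrFlow API): checks one record against the
# messages structure shown above.
ALLOWED_ROLES = {"system", "user", "assistant"}  # assumption: roles from the example


def validate_record(record):
    """Return a list of problems found in one record (empty if none)."""
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["record must be a dict with a non-empty 'messages' list"]
    problems = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {i} is not a dict")
            continue
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has an unexpected role")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {i} content is not a string")
    return problems


example = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Parquet?"},
        {"role": "assistant", "content": "A columnar file format."},
    ]
}
print(validate_record(example))
```

Running a check like this over every record before writing the file surfaces malformed rows early, rather than after an upload.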

Let's format our dataset in the way that SeekrFlow expects it.

seekrflow_dataset = []
for row in hf_dataset['train']:
    # Build one chat-style record per source row.
    seekrflow_dataset.append({
        "messages": [
            {
                "role": "system",
                "content": "SeekrBot is a news question and answering system"
            },
            {
                "role": "user",
                "content": f"Based on the following document: {row['document']} \n\n"
                           f"Answer the following question: {row['messages'][0]['content']}"
            },
            {
                "role": "assistant",
                "content": row['answers'][0]
            }
        ]
    })

# convert to parquet
seekrflow_dataset = Dataset.from_list(seekrflow_dataset)
seekrflow_dataset.to_parquet("seekrflow_dataset.parquet")
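Since JSON Lines is the other accepted fine-tuning format, the same records can also be written as a .jsonl file with one JSON object per line. This is a sketch using only the Python standard library; the helper names write_jsonl and read_jsonl and the sample records are illustrative assumptions, and the file is written to a temporary directory only to keep the example self-contained.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line, which is the layout of a .jsonl file."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a .jsonl file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Illustrative records in the messages format; real data would come from
# the formatted dataset built above.
sample_records = [
    {"messages": [
        {"role": "user", "content": "First question"},
        {"role": "assistant", "content": "First answer"},
    ]},
    {"messages": [
        {"role": "user", "content": "Second question\nwith a newline"},
        {"role": "assistant", "content": "Second answer"},
    ]},
]

path = os.path.join(tempfile.gettempdir(), "seekrflow_dataset.jsonl")
write_jsonl(sample_records, path)
print(len(read_jsonl(path)))
```

Newlines inside message content are escaped by json.dumps, so each record still occupies exactly one line of the file.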

Uploading data for fine-tuning

Now that we have converted our data into the format that SeekrFlow expects, we can upload it and create a SeekrFlow file.

file = client.files.upload(file="seekrflow_dataset.parquet", purpose="fine-tune")  # uploads a file
# to get the id of the uploaded file
print(file.id)

SeekrFlow will generate a unique file identifier that can be used later on, e.g. for fine-tuning a model. We can also retrieve file information and content with the following commands:

# get file info
client.files.retrieve(file.id)
# get file content
client.files.retrieve_content(file.id, output="filename.parquet")

Uploading data for alignment

An important precursor step to model fine-tuning is principle alignment, the process by which SeekrFlow ingests a source document, transforms it into a graphical tree representation, and generates synthetic data for later model training.

To kick off an alignment job, we upload a file, in this case a JSON tree representation of the source document:

file = client.files.upload(file="seekrflow_alignment_file.json", purpose="alignment")  # uploads a file
# to get the id of the uploaded file
print(file.id)

As with a fine-tuning upload, SeekrFlow generates a unique file identifier, which can be used to start an alignment job.
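Because the upload purpose depends on the file format, a small local check can catch a mismatched file before it is uploaded. The sketch below is an assumption-laden illustration, not a SeekrFlow API: the helper purpose_for and the mapping are derived only from the formats and purpose values used in this section.

```python
from pathlib import Path

# Assumed mapping, based on the formats and purposes described in this section.
PURPOSE_BY_EXTENSION = {
    ".parquet": "fine-tune",
    ".jsonl": "fine-tune",
    ".json": "alignment",
}


def purpose_for(path):
    """Return the upload purpose implied by a file's extension (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix not in PURPOSE_BY_EXTENSION:
        raise ValueError(f"unsupported file type {suffix!r} for {path}")
    return PURPOSE_BY_EXTENSION[suffix]


print(purpose_for("seekrflow_dataset.parquet"))      # fine-tune
print(purpose_for("seekrflow_alignment_file.json"))  # alignment
```

The returned value could then be passed as the purpose argument of client.files.upload.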

Deleting data

We can delete a file that was previously created:

resp = client.files.delete(id=file.id)  # deletes a file

Listing data

We can also list all files that were previously created:

client.files.list()