Fine-tuning
Why fine-tuning?
Generalist foundation models are pre-trained on massive datasets, but as the prefix "pre" suggests, there is still considerable room for improvement.
In most enterprise applications, prompt engineering and Retrieval-Augmented Generation (RAG) over generalist foundation models will not be sufficient for production-grade accuracy. Fine-tuning solves this problem by training a base model on domain-specific data, making it suitable for applications that require domain-specific knowledge.
In addition to outperforming generalist foundation models on domain tasks, fine-tuned models cost much less to serve and respond much faster than prompt engineering or RAG over generalist models.
Fine-tuning methods
There are two general classes of fine-tuning methods: full fine-tuning (FFT), in which all of the model's parameters are updated, and parameter-efficient fine-tuning (PEFT), in which adapter modules with a small number of new trainable parameters are added alongside the existing base model parameters.
The table below summarizes popular fine-tuning methods. SeekrFlow currently supports FFT. Future releases will support additional methods.
| Method | Category | Description | Status |
|---|---|---|---|
| Full Fine-Tuning | FFT | Fine-tunes the entire model on the target task. This is computationally expensive and requires a large amount of labeled data. | Current |
| LoRA (Low-Rank Adaptation) | PEFT | Decomposes weight updates into low-rank matrices and trains only these low-rank components. | Upcoming |
| Prefix Tuning | PEFT | Prepends trainable prefix tokens to the input sequence. | Upcoming |
| MoRA (High-Rank Adaptation) | PEFT | Uses a square matrix to achieve high-rank updates, addressing the low-rank limitation of LoRA. | Upcoming |
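To make the FFT-versus-PEFT trade-off concrete, here is a minimal sketch of the LoRA idea. It is illustrative NumPy only, not SeekrFlow code, and the dimensions and variable names are assumptions chosen for readability.
# Illustrative LoRA sketch (NumPy only; not SeekrFlow code).
# A frozen weight matrix W is augmented with a low-rank update B @ A;
# only A and B are trained, so the trainable parameter count stays small.
import numpy as np

d_out, d_in, rank = 1024, 1024, 8          # illustrative dimensions

W = np.random.randn(d_out, d_in)           # frozen base weight
A = np.random.randn(rank, d_in) * 0.01     # trainable low-rank factor
B = np.zeros((d_out, rank))                # trainable low-rank factor, initialized to zero

x = np.random.randn(d_in)
y = W @ x + B @ (A @ x)                    # forward pass: base output + low-rank correction

print(f"full fine-tuning would train {W.size:,} parameters")
print(f"LoRA trains only {A.size + B.size:,} parameters")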
Cost of fine-tuning vs RAG
The cost associated with a request to an LLM depends on:
1. the number of input tokens and the number of output tokens, and
2. the size of the LLM and the amount of computing resources it requires (large generalist models need more compute than smaller specialist models).
With RAG, the cost for (1) can be high because the number of input tokens is driven by how many retrieved documents the LLM has to "read" before it starts generating an answer: since the LLM does not know the answer itself, it must first process all of the retrieved context. It is not uncommon for an LLM application to serve hundreds or thousands of requests per minute, or even per second, so the total cost is driven up by the LLM having to "read" all of the retrieved documents for every single request.
As an example, let's take a proprietary model like gpt-4-turbo and an LLM application that, on average,
- retrieves 100K tokens per request, and
- responds to 100 requests per second.
At current gpt-4-turbo pricing ($10 per 1 million input tokens), the input-token cost for a single request is about $1, so the total cost for our application's 100 requests is $100 per second!
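The arithmetic behind that estimate is shown below as a quick sanity check; the prices and volumes are the illustrative figures from this example, not live rates.
# Back-of-the-envelope input-token cost for the RAG example above.
# The figures are the illustrative assumptions from the text, not live pricing.
price_per_million_input_tokens = 10.00   # USD, example gpt-4-turbo rate
tokens_per_request = 100_000             # retrieved context per request
requests_per_second = 100

cost_per_request = tokens_per_request / 1_000_000 * price_per_million_input_tokens
cost_per_second = cost_per_request * requests_per_second
print(f"${cost_per_request:.2f} per request, ${cost_per_second:.2f} per second")
# -> $1.00 per request, $100.00 per second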
In addition, RAG works well only with a large generalist model, which in turn requires a large amount of computing resources, so the cost for (2) is also high.
There are other costs associated with RAG as well, such as maintaining a database to host the documents, developing and maintaining a RAG evaluation framework, and increasing the number of serving instances to offset the latency that comes with larger inputs.
SeekrFlow removes these costs by enabling its users to build specialist models that do not need to "read" or look up any information before providing a response.
SeekrFlow fine-tuned models and RAG can also work in conjunction. In the case where lookups to live data are needed, a SeekrFlow fine-tuned model can be used in a RAG flow, just like any other model.
Fine-tuning a base model
Let's walk through the steps of fine-tuning a model for a specific task. Here, we will demonstrate boosting the conversational question-answering and RAG capabilities of llama-3-8b on news documents.
(We will use the file that we previously formatted and uploaded to SeekrFlow.)
To create a fine-tuning job, we must first create a project, to which we will associate our fine-tuning run. We can also retrieve project information and get a list of all of our projects:
import os
from seekrai import SeekrFlow
client = SeekrFlow(api_key=os.environ.get("SEEKR_API_KEY"))
proj = client.projects.create(name="project-name", description="project-description")
# get project info
client.projects.retrieve(proj.id)
# list all projects
client.projects.list()
Next, users need to specify a TrainingConfig object and an InfrastructureConfig object.
The TrainingConfig defines all parameters that affect the actual training code, such as the base model to be fine-tuned, the number of epochs, quantization, etc.
The InfrastructureConfig defines the infrastructure that the fine-tuning job will run on. Because SeekrFlow is agnostic to the accelerator hardware it runs on, it accepts multiple choices for the accelerator type, including Gaudi2, A100, and H100. SeekrFlow takes care of provisioning all of the compute resources and configuration required for the chosen accelerator.
In this example, we will use 8 Gaudi2 accelerators, which triggers SeekrFlow to run in multi-card training mode:
from seekrai.types import TrainingConfig, InfrastructureConfig
training_config = TrainingConfig(
training_files=[file.id], # NOTE: We are passing in the ID of a previously uploaded fine-tuning file
model='meta-llama/Meta-Llama-3-8B', # Base model choice
n_epochs=1,
n_checkpoints=3,
batch_size=4,
learning_rate=1e-5,
experiment_name="experiment-name",
)
infrastructure_config = InfrastructureConfig(
n_accel=8,
accel_type="GAUDI2"
)
Now that we have created our configuration objects, we are ready to fine-tune a model!
fine_tune = client.fine_tuning.create(
training_config=training_config,
infrastructure_config=infrastructure_config,
project_id=proj.id, # NOTE: To associate this fine-tune with a project, we are passing in the ID of the project created above
)
Monitoring a fine-tuning job
SeekrFlow job runs are tracked using SeekrFlow's event monitoring and tracking system.
To retrieve the status and progress of a run, you can use:
print(client.fine_tuning.retrieve(fine_tune.id).status)
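If you prefer to block until the run finishes, you can poll that same call in a loop. Note that the terminal status strings below are assumptions and may differ in your SeekrFlow version.
import time

# Poll the fine-tuning job until it reaches a terminal state.
# NOTE: the exact status values ("completed", "failed", "cancelled") are assumptions;
# check the status values returned by your SeekrFlow deployment.
while True:
    status = client.fine_tuning.retrieve(fine_tune.id).status
    print(f"status: {status}")
    if str(status).lower() in ("completed", "failed", "cancelled"):
        break
    time.sleep(60)  # check once per minute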
The following snippet can be used to plot the loss of the fine-tuning run:
import matplotlib.pyplot as plt

ft_id = fine_tune.id

# Retrieve the run's events and sort them by epoch
events = client.fine_tuning.retrieve(ft_id).events
ft_response_events_sorted = sorted(events, key=lambda x: x.epoch)

epochs = [event.epoch for event in ft_response_events_sorted]
losses = [event.loss for event in ft_response_events_sorted]

# Plot training loss, thinning x-axis labels to at most 10 ticks
plt.figure(figsize=(8, 4))
plt.plot(epochs, losses, marker="o", linestyle="-", color="b")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.title("training loss over epochs")
plt.grid(True)
max_labels = 10
step = max(1, len(epochs) // max_labels)
plt.xticks(epochs[::step], rotation=45)
plt.tight_layout()
plt.show()
Inference
In order to run inference against a trained model, we have to promote it to the inference API:
dep = client.deployments.create(
name="model-deployment",
description="model-deployment-description",
model_type="Fine-tuned Run",
model_id=fine_tune.id,
n_instances=1,
)
ft_deploy = client.deployments.promote(dep.id)
# to demote your model (when you are finished with it)
client.deployments.demote(ft_deploy.id)
This may take a few minutes. Once a model is promoted, you can run inference to obtain chat completions:
stream = client.chat.completions.create(
model=ft_deploy.id,
messages=[
{"role": "system", "content": "You are SeekrBot, a helpful AI assistant"},
{"role": "user", "content": "who are you?"}
],
stream=True,
max_tokens=1024,
)
for chunk in stream:
print(chunk.choices[0].delta.content or "", end="")
The model parameter can be any of our supported base models, or any model that has been promoted for inference.
Returning token log probabilities during inference
We can also return the token log probabilities, or "logprobs". We follow the OpenAI convention for formatting the request:
- To return the logprobs of the generated tokens, set logprobs=True.
- To additionally return the top n most likely tokens and their associated logprobs, set top_logprobs=n, where n > 0.
client = SeekrFlow(api_key=os.environ.get("SEEKR_API_KEY"))
stream = client.chat.completions.create(
model="meta-llama/Meta-Llama-3-8B-Instruct",
messages=[{"role": "user", "content": "Tell me about New York."}],
stream=True,
logprobs=True,
top_logprobs=5, # NOTE: Max number, m, depends on model deployment spec; n > m may throw validation error
)
for chunk in stream:
print(chunk.choices[0].delta.content or "", end="", flush=True)
print(chunk.choices[0].logprobs)
Listing model fine-tunes
We can also list all model fine-tunes that were previously created, or get information about a particular model fine-tune:
client.fine_tuning.list()
client.fine_tuning.retrieve(ft_id)