Understanding Fine-Tuning Data
A combination of a sample dataset and a set of your key documents can be used as part of the fine-tuning process with SeekrFlow™
What is Fine-Tuning and Why Use It?
Generalist foundation models are pre-trained on massive datasets, but as the “pre” implies, pre-training is only the first stage in a two-stage training paradigm: it's where the model learns general patterns for general-purpose use.
Fine-tuning is the second stage, where the generalist model is trained on a smaller, targeted dataset to adapt it to a specialized domain.
Beyond outperforming generalist models on domain-specific tasks, fine-tuning costs less and yields much faster response times than prompt engineering or RAG alone.
Cost Comparison: Fine-Tuning vs. RAG
The cost associated with a request to an LLM depends on two main factors:
- The number of input tokens and output tokens
- The size of the LLM and the amount of compute it consumes (large generalist models require more compute than smaller specialist models)
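Together, these two factors can be expressed as a rough per-request cost formula. The sketch below is illustrative only; the prices are parameters you would fill in from your provider's published rates:

```python
def request_token_cost(input_tokens: int, output_tokens: int,
                       input_price_per_million: float,
                       output_price_per_million: float) -> float:
    """Approximate per-request token cost in dollars, given per-million-token prices.

    The second factor (model size / compute) shows up indirectly here:
    larger models generally carry higher per-token prices.
    """
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000
```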
With RAG, the cost of the first factor can be high because the number of input tokens is driven by how many retrieved documents the LLM has to "read" before it starts generating an answer. In other words, because the LLM does not already know the answer, it must first "read" every retrieved document. It's not uncommon for an LLM application to receive hundreds or thousands of requests per minute, or even per second, so the total cost is driven up by the model having to "read" all the retrieved documents for every single request!
As an example, let's take a proprietary model like `gpt-4-turbo` and an LLM application that:
- Retrieves 100K tokens per request
- Responds to 100 requests per second
At current `gpt-4-turbo` pricing ($10 per 1 million input tokens), the input-token cost for a single request is about $1. The total cost for 100 requests in our example application is $100 per second!
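Plugging the example numbers into a couple of lines of Python makes that arithmetic explicit (input-token cost only; output tokens would add to this):

```python
# gpt-4-turbo input pricing: $10 per 1 million input tokens
input_price_per_million = 10.0
tokens_per_request = 100_000
requests_per_second = 100

cost_per_request = tokens_per_request * input_price_per_million / 1_000_000
cost_per_second = cost_per_request * requests_per_second

print(f"${cost_per_request:.2f} per request")  # $1.00 per request
print(f"${cost_per_second:.2f} per second")    # $100.00 per second
```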
In addition, for RAG to work well, a large generalist model is required, which in turn requires a large amount of compute. That drives up the cost of our second factor as well.
There are other costs associated with RAG, such as maintaining a database to host documents, developing and maintaining a RAG evaluation framework, and adding serving instances to offset the latency that comes with larger inputs.
SeekrFlow removes these costs by enabling its users to build specialist models that are not required to "read" or look up any information before providing a response.
SeekrFlow fine-tuned models and RAG can also work together. When lookups against live data are needed, a SeekrFlow fine-tuned model can be used in a RAG flow, just like any other model.
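As a minimal sketch of that pattern, assuming the fine-tuned model is served behind an OpenAI-compatible chat endpoint (the base URL, model ID, and `retrieve_documents` helper below are placeholders, not SeekrFlow-specific APIs):

```python
from openai import OpenAI

# Placeholder endpoint and credentials: point the client at wherever your
# fine-tuned model is deployed.
client = OpenAI(base_url="https://your-inference-endpoint/v1", api_key="YOUR_API_KEY")

def retrieve_documents(question: str) -> list[str]:
    """Placeholder retrieval step: swap in your application's existing retriever."""
    return ["...live data relevant to the question..."]

def answer_with_live_data(question: str) -> str:
    # Retrieve only the live data the specialist model can't already answer
    # from its fine-tuned knowledge, then pass it in as context.
    context = "\n\n".join(retrieve_documents(question))
    response = client.chat.completions.create(
        model="your-fine-tuned-model-id",  # placeholder model ID
        messages=[
            {"role": "system", "content": "Use the provided context when it is relevant."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```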