Deploy a Fine-tuned Model for Inference
Launch and manage model fine-tuning jobs in SeekrFlow, with custom settings, event history, and inference.
The FineTuning resource provides a synchronous interface for launching and managing model fine-tuning jobs in SeekrFlow. It supports custom training and infrastructure settings, job cancellation, event history, and inference with chat completions.
Promote a model for inference
After promoting your model to production with SeekrFlow, the next step is to use it for inference. This involves sending input data to the model and receiving predictions. Here’s a detailed guide on setting up and performing inference with your promoted model.
List all fine-tuning jobs
Endpoint: GET v1/flow/fine-tunes
List Fine-Tuning Jobs
To get the job ID, list all model fine-tunes that were previously created.
print(client.fine_tuning.list())  # List all jobs
Find information including training files, training params, project ID, and description.
FinetuneResponse(id='ft-1234567890', training_files=['file-0987654321'], model='meta-llama/Meta-Llama-3-8B-Instruct', accel_type=<AcceleratorType.GAUDI2: 'GAUDI2'>, n_accel=8, n_epochs=1, batch_size=1, learning_rate=1e-05, created_at=datetime.datetime(2025, 4, 25, 14, 29, 41, 817017, tzinfo=TzInfo(UTC)), experiment_name='12345-examplev1', status=<FinetuneJobStatus.STATUS_COMPLETED: 'completed'>, events=None, inference_available=False, project_id=679, completed_at=datetime.datetime(2025, 4, 25, 14, 52, 34, 785756, tzinfo=TzInfo(UTC)), description=None),
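To pull the job ID out of the response programmatically, you can iterate over the returned jobs and print the fields shown above. A minimal sketch, assuming the response is either a plain list of FinetuneResponse objects or a wrapper exposing them under a data attribute:

jobs = client.fine_tuning.list()

# The jobs may be returned directly or wrapped in a .data attribute (assumption)
for job in getattr(jobs, "data", jobs):
    print(job.id, job.status, job.inference_available)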
Retrieve specific fine-tuning job
Endpoint: GET v1/flow/fine-tunes/{fine_tune_id}
Retrieve Fine-Tuning Job
This provides the status and detailed information of a specific fine-tuning job, including logging events.
print(client.fine_tuning.retrieve('ft-1234567890'))  # Retrieve a specific job
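If you want to wait for a job to finish before promoting it, you can poll this endpoint and check the status field from the response shown earlier. A minimal sketch; the 'completed' value matches the sample response above, while the other terminal statuses are assumptions:

import time

job_id = "ft-1234567890"  # Fine-tune job ID from the list call above

while True:
    job = client.fine_tuning.retrieve(job_id)
    status = getattr(job.status, "value", job.status)  # Enum in the sample response; may also arrive as a plain string
    print(f"{job.id}: {status}")
    if status in ("completed", "failed", "cancelled"):  # Terminal statuses other than 'completed' are assumptions
        break
    time.sleep(30)  # Poll interval is arbitrary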
Promote a model
Endpoint: POST v1/flow/fine-tunes/{fine_tune_id}/promote-model
Promote Model
When you've found your job ID, promote it for inference:
from seekrai import SeekrFlow
from seekrai.types.deployments import DeploymentType
# Initialize the Seekr client with your API key
client = SeekrFlow()
deployments = client.deployments
deployment = client.deployments.create(
    name="customer-support-model-deployment",  # Free-form, must be 5–100 chars
    description="Serve LLM deployment for chat support",  # 5–1000 chars
    model_type=DeploymentType.FINE_TUNED_RUN,  # Use the "Fine-tuned Run" enum
    model_id="ft-1234567890",  # Your fine-tune job ID goes here
    n_instances=1,  # Number of dedicated replicas
)
print("Deployment ID:", deployment.id)
# Promote to production
deployments.promote(deployment.id)
# List deployments
for d in deployments.list().data:
    print(d.id, d.status)
# Retrieve a specific deployment
details = deployments.retrieve(deployment.id)
print(details.name, details.status)
Sample response:
Deployment ID: deployment-1234567890
Use this deployment ID to run inference with chat completions.
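If you don't keep the deployment ID around, you can recover it later by listing deployments and matching on the name you supplied at creation time. A minimal sketch using the list call shown above; it assumes list items expose the same name field as the retrieve response:

target_name = "customer-support-model-deployment"  # Name used when the deployment was created

for d in client.deployments.list().data:
    if d.name == target_name:  # Assumes list items carry a name attribute, as retrieve() does
        print("Found deployment:", d.id, d.status)
        break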
Run inference on streaming chat completions
Chat completions are a great way to test your model's task- or domain-specific performance, as well as gauge end-user experience.
Endpoint: POST v1/inference/chat/completions
Create Chat Completion Request
stream = client.chat.completions.create(
    model="deployment-1234567890",  # Deployment ID from the previous step
    messages=[
        {"role": "system", "content": "You are SeekrBot, a helpful AI assistant trained to answer questions about financial products and services."},
        {"role": "user", "content": "Who are you?"}
    ],
    stream=True,
    max_tokens=1024,
)

for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
A variant of the same request with basic error handling:

stream = client.chat.completions.create(
    model="deployment-1234567890",  # Deployment ID provided in the previous step
    messages=[
        {"role": "system", "content": "You are SeekrBot, a helpful AI assistant trained to answer questions about financial products and services."},
        {"role": "user", "content": "Discuss what goes into a good horror movie soundtrack."},
    ],
    stream=True,
    max_tokens=1024,
)

try:
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
except Exception as e:
    print(f"Error: {e}")
Sample response:
I am SeekrBot, a knowledgeable guide to the realm of financial products and services.
Note: For the model parameter, you can choose any of our base supported models or models that have been promoted for inference.
Configure parameters in the request body
This guide will help you understand what each parameter does and how to tweak them in the request body for best results.
Parameter | Description |
---|---|
model | The ID of the model to use for inference. |
messages | The input messages that form the context for the model. See the example below. |
stream | When set to true, tokens are streamed back incrementally as they are generated. |
max_tokens | The maximum number of tokens to generate in the response. |
temperature | Controls the randomness of the model's output. |
frequency_penalty | Adjusts the likelihood of repeating tokens that have already been used. |
n | The number of completions to generate for each input prompt. |
presence_penalty | Adjusts the likelihood of introducing new topics or elements not present in the context. |
stop | Specifies a sequence where the model will stop generating further tokens. |
top_k | Limits the next token prediction to the top K tokens with the highest probabilities. |
top_p | Nucleus sampling, where the model considers the smallest set of tokens whose cumulative probability is >= to the top_p value. |
user | Identifier for the end-user making the request. |
Example: Input messages
"messages": [
{"role": "user", "content": "How do I reset my password?"},
{"role": "assistant", "content": "To reset your password, go to the account settings and click on 'Reset Password'."}
]
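Putting several of these parameters together, a request might look like the sketch below. The values are illustrative only, and it assumes the Python client passes these request-body fields through as keyword arguments:

stream = client.chat.completions.create(
    model="deployment-1234567890",  # Deployment ID or a supported base model
    messages=[
        {"role": "system", "content": "You are SeekrBot, a helpful AI assistant."},
        {"role": "user", "content": "How do I reset my password?"},
    ],
    stream=True,
    max_tokens=512,     # Cap the length of the generated response
    temperature=0.3,    # Lower values make output more deterministic (illustrative value)
    top_p=0.9,          # Nucleus sampling threshold (illustrative value)
    stop=["\n\n"],      # Illustrative stop sequence
)

for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")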
Return token log probabilities during inference
You can also return the token log probabilities, or "logprobs". Logprobs reveal the model’s certainty for each generated token.
Low-confidence predictions highlight gaps in training data. During staging, you can flag outputs with low confidence (e.g., logprobs ≪ 0) for manual review or retraining.
Unusually high logprobs for irrelevant tokens can signal hallucinations. During staging, this can help refine prompts or adjust temperature settings.
The code below follows the OpenAI convention for request formatting:
- To return the logprobs of the generated tokens, set logprobs=True.
- To additionally return the top n most likely tokens and their associated logprobs, set top_logprobs=n, where n > 0.
client = SeekrFlow()

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Tell me about New York."}],
    stream=True,
    logprobs=True,
    top_logprobs=5,  # NOTE: Max number, m, depends on model deployment spec; n > m may throw validation error
)

for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
    print(chunk.choices[0].logprobs)
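As a rough staging check, you can aggregate the returned logprobs and flag responses whose average token log probability falls below a threshold. The sketch below assumes an OpenAI-style response shape, where the logprobs object exposes a content list whose items carry a logprob field; the threshold is arbitrary:

LOW_CONFIDENCE = -2.0  # Arbitrary threshold; tune for your model and task

token_logprobs = []
stream = client.chat.completions.create(
    model="deployment-1234567890",
    messages=[{"role": "user", "content": "How do I return a shirt I bought last month?"}],
    stream=True,
    logprobs=True,
)

for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
    lp = chunk.choices[0].logprobs
    if lp and getattr(lp, "content", None):  # Assumes OpenAI-style logprobs.content items with a .logprob field
        token_logprobs.extend(item.logprob for item in lp.content)

if token_logprobs and sum(token_logprobs) / len(token_logprobs) < LOW_CONFIDENCE:
    print("\n[flagged for manual review: low average token confidence]")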
Demote a model
Endpoint: GET v1/flow/fine-tunes/{fine_tune_id}/demote-model
Demote a Model
When you're done with your model, demote (i.e., unstage) it:
from seekrai import SeekrFlow
client = SeekrFlow()
deployments = client.deployments
# Specify the deployment ID to demote
DEPLOYMENT_ID = "deployment-1234567890"
# Demote the deployment
updated_deployment = deployments.demote(DEPLOYMENT_ID)
# Verify demotion status
print(f"Deployment {DEPLOYMENT_ID} status: {updated_deployment.status}")
print(f"Production status: {'Active' if updated_deployment.is_active else 'Demoted'}")
With SeekrFlow, deploying to a staging environment means promoting a model for inference. However, before using the inference model in your production environment, focus on validation and verification of your model’s performance to ensure a smooth transition.
Run validation checks
Make sure your model is ready for production deployment by running comprehensive validation checks.
Prepare representative validation data
Start by curating diverse validation datasets that mirror real-world inputs, including edge cases and difficult examples your model will encounter in production.
Example: For a customer service chatbot handling clothing returns, include:
- Simple queries ("How do I return this shirt?")
- Complex scenarios ("I received the wrong size in a different color than ordered")
- Edge cases ("I started a return but the tracking shows it's still at my house")
- Multi-intent queries ("I want to exchange this and add something to my order")
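One lightweight way to organize such a set is as a list of labeled cases that you can replay against the deployed model. A minimal sketch; the category labels are illustrative, not a SeekrFlow format:

validation_cases = [
    {"category": "simple", "prompt": "How do I return this shirt?"},
    {"category": "complex", "prompt": "I received the wrong size in a different color than ordered"},
    {"category": "edge_case", "prompt": "I started a return but the tracking shows it's still at my house"},
    {"category": "multi_intent", "prompt": "I want to exchange this and add something to my order"},
]

for case in validation_cases:
    stream = client.chat.completions.create(
        model="deployment-1234567890",
        messages=[{"role": "user", "content": case["prompt"]}],
        stream=True,
        max_tokens=256,
    )
    reply = "".join(chunk.choices[0].delta.content or "" for chunk in stream)
    print(f"[{case['category']}] {reply[:80]}")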
Run comprehensive checks
Next, evaluate prediction quality and system performance to ensure all production requirements are satisfied.
Track critical metrics
Statistics are a critical tool for making sure your AI is trustworthy. The following are some commonly used statistical metrics for evaluating a model's performance:
Prediction quality metrics
Metric | Definition | When to Prioritize |
---|---|---|
Accuracy | Correct predictions ÷ total predictions | Clear-cut, factual tasks (e.g., classification); in contexts where incorrect outputs could lead to significant consequences (medical, legal, financial) |
Precision | True positives ÷ predicted positives | When false positives are the most significant concern (e.g., content moderation) |
Recall | True positives ÷ actual positives | When false negatives are the most significant concern (e.g., compliance monitoring) |
F1 Score | Harmonic mean of precision and recall | When balance between precision and recall is needed |
In practice, there's often a trade-off between minimizing false positives and false negatives. The relative cost of each error type helps determine whether to prioritize precision or recall when optimizing a model. The F1 score is specifically designed to balance the concerns of both, by combining precision and recall into a single metric.
Going back to the clothing returns chatbot, you might prioritize F1 score when the costs of incorrectly rejecting valid returns (customer dissatisfaction) and incorrectly accepting invalid returns (financial loss) are both significant concerns that need to be balanced.
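The definitions in the table translate directly into code. A minimal sketch over binary decisions (1 = accept the return, 0 = reject it), assuming you have already collected the model's decisions and ground-truth labels for a validation set:

def classification_metrics(y_true, y_pred):
    # Count the four outcomes against ground truth
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Illustrative ground truth vs. model decisions on ten validation cases
print(classification_metrics(
    y_true=[1, 1, 0, 1, 0, 0, 1, 1, 0, 1],
    y_pred=[1, 0, 0, 1, 0, 1, 1, 1, 0, 1],
))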
System performance metrics
Latency: Response time per prediction
Throughput: Prediction volume capacity (e.g., 1000 requests/second)
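A simple first read on latency is to time a single streaming request, recording both time to first token and total response time; throughput testing would then run many such requests concurrently. A minimal sketch:

import time

start = time.perf_counter()
first_token_at = None

stream = client.chat.completions.create(
    model="deployment-1234567890",
    messages=[{"role": "user", "content": "How do I return this shirt?"}],
    stream=True,
    max_tokens=256,
)

for chunk in stream:
    if first_token_at is None and chunk.choices and chunk.choices[0].delta.content:
        first_token_at = time.perf_counter()  # Latency to the first generated token

total = time.perf_counter() - start
if first_token_at is not None:
    print(f"Time to first token: {first_token_at - start:.2f}s")
print(f"Total response time: {total:.2f}s")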
Identify areas for improvement and iterate
Conduct an error analysis
Categorize and investigate patterns in incorrect predictions to identify underlying causes.
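A simple starting point is to tag each incorrect prediction with a failure category during review and count how often each category occurs, so the most common failure modes surface first. A minimal sketch with illustrative categories:

from collections import Counter

# Each entry: (validation prompt, failure category assigned during manual review)
errors = [
    ("I started a return but tracking shows it's still at my house", "ambiguous_order_state"),
    ("I want to exchange this and add something to my order", "multi_intent"),
    ("Can I return a gift without a receipt?", "missing_policy_knowledge"),
    ("I want to exchange this for a larger size", "multi_intent"),
]

for category, count in Counter(cat for _, cat in errors).most_common():
    print(f"{category}: {count}")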
Implement targeted improvements
Apply insights from error analysis to refine the model through iterative improvements:
- Hyperparameter tuning
- Additional training data
- Model architecture modifications