Deploy a Fine-tuned Model for Inference
Launch and manage model fine-tuning jobs in SeekrFlow, with custom settings, event history, and inference.
The FineTuning resource provides a synchronous interface for launching and managing model fine-tuning jobs in SeekrFlow. It supports custom training and infrastructure settings, job cancellation, event history, and inference with chat completions.
Promote a model for inference
After promoting your model to production with SeekrFlow, the next step is to use it for inference. This involves sending input data to the model and receiving predictions. Here’s a detailed guide on setting up and performing inference with your promoted model.
List all fine-tuning jobs
Endpoint: GET v1/flow/fine-tunes
List Fine-Tuning Jobs
To get the job ID, list all model fine-tunes that were previously created.
print(client.fine_tuning.list())  # List all jobs
Find information including training files, training params, project ID, and description.
FinetuneResponse(id='ft-1234567890', training_files=['file-0987654321'], model='meta-llama/Meta-Llama-3-8B-Instruct', accel_type=<AcceleratorType.GAUDI2: 'GAUDI2'>, n_accel=8, n_epochs=1, batch_size=1, learning_rate=1e-05, created_at=datetime.datetime(2025, 4, 25, 14, 29, 41, 817017, tzinfo=TzInfo(UTC)), experiment_name='12345-examplev1', status=<FinetuneJobStatus.STATUS_COMPLETED: 'completed'>, events=None, inference_available=False, project_id=679, completed_at=datetime.datetime(2025, 4, 25, 14, 52, 34, 785756, tzinfo=TzInfo(UTC)), description=None),
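To pull the job ID out of the response programmatically, you can iterate over the returned jobs and print the fields shown above. A minimal sketch, assuming the response is either a plain list of FinetuneResponse objects or a wrapper exposing them under a data attribute:

jobs = client.fine_tuning.list()

# The jobs may be returned directly or wrapped in a .data attribute (assumption)
for job in getattr(jobs, "data", jobs):
    print(job.id, job.status, job.inference_available)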
Retrieve specific fine-tuning job
Endpoint: GET v1/flow/fine-tunes/{fine_tune_id}
Retrieve Fine-Tuning Job
This provides the status and detailed information of a specific fine-tuning job, including logging events.
print(client.fine_tuning.retrieve('ft-1234567890'))  # Retrieve a specific job
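If you want to wait for a job to finish before promoting it, you can poll this endpoint and check the status field from the response shown earlier. A minimal sketch; the 'completed' value matches the sample response above, while the other terminal statuses are assumptions:

import time

job_id = "ft-1234567890"  # Fine-tune job ID from the list call above

while True:
    job = client.fine_tuning.retrieve(job_id)
    status = getattr(job.status, "value", job.status)  # Enum in the sample response; may also arrive as a plain string
    print(f"{job.id}: {status}")
    if status in ("completed", "failed", "cancelled"):  # Terminal statuses other than 'completed' are assumptions
        break
    time.sleep(30)  # Poll interval is arbitrary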
Promote a model
Endpoint: POST v1/flow/fine-tunes/{fine_tune_id}/promote-model
Promote Model
When you've found your job ID, promote it for inference:
from seekrai import SeekrFlow
from seekrai.types.deployments import DeploymentType
# Initialize the Seekr client with your API key
client = SeekrFlow()
deployments = client.deployments
deployment = client.deployments.create(
    name="customer-support-model-deployment",  # Free-form, must be 5–100 chars
    description="Serve LLM deployment for chat support",  # 5–1000 chars
    model_type=DeploymentType.FINE_TUNED_RUN,  # Use the "Fine-tuned Run" enum
    model_id="ft-1234567890",  # Your fine-tune job ID goes here
    n_instances=1,  # Number of dedicated replicas
)
print("Deployment ID:", deployment.id)
# Promote to production
deployments.promote(deployment.id)
# List deployments
for d in deployments.list().data:
    print(d.id, d.status)
# Retrieve a specific deployment
details = deployments.retrieve(deployment.id)
print(details.name, details.status)
Sample response:
Deployment ID: deployment-1234567890
Use this deployment ID to run inference with chat completions.
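If you don't keep the deployment ID around, you can recover it later by listing deployments and matching on the name you supplied at creation time. A minimal sketch using the list call shown above; it assumes list items expose the same name field as the retrieve response:

target_name = "customer-support-model-deployment"  # Name used when the deployment was created

for d in client.deployments.list().data:
    if d.name == target_name:  # Assumes list items carry a name attribute, as retrieve() does
        print("Found deployment:", d.id, d.status)
        break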
Run inference on streaming chat completions
Chat completions are a great way to test your model's task- or domain-specific performance, as well as gauge end-user experience.
Endpoint: POST v1/inference/chat/completions
Create Chat Completion Request
stream = client.chat.completions.create(
    model="deployment-1234567890",  # Deployment ID from the previous step
    messages=[
        {"role": "system", "content": "You are SeekrBot, a helpful AI assistant trained to answer questions about financial products and services."},
        {"role": "user", "content": "Who are you?"}
    ],
    stream=True,
    max_tokens=1024,
)

for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
A variant of the same request with basic error handling:

stream = client.chat.completions.create(
    model="deployment-1234567890",  # Deployment ID provided in the previous step
    messages=[
        {"role": "system", "content": "You are SeekrBot, a helpful AI assistant trained to answer questions about financial products and services."},
        {"role": "user", "content": "Discuss what goes into a good horror movie soundtrack."},
    ],
    stream=True,
    max_tokens=1024,
)

try:
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
except Exception as e:
    print(f"Error: {e}")
Sample response:
I am SeekrBot, a knowledgeable guide to the realm of financial products and services.
Note: For the model parameter, you can choose any of our base supported models or models that have been promoted for inference.
Configure parameters in the request body
This guide will help you understand what each parameter does and how to tweak them in the request body for best results.
Parameter | Description |
---|---|
model | The ID of the model to use for inference. |
messages | The input messages that form the context for the model. See the example below. |
stream | When set to true, tokens are streamed back incrementally as they are generated. |
max_tokens | The maximum number of tokens to generate in the response. |
temperature | Controls the randomness of the model's output. |
frequency_penalty | Adjusts the likelihood of repeating tokens that have already been used. |
n | The number of completions to generate for each input prompt. |
presence_penalty | Adjusts the likelihood of introducing new topics or elements not present in the context. |
stop | Specifies a sequence where the model will stop generating further tokens. |
top_k | Limits the next token prediction to the top K tokens with the highest probabilities. |
top_p | Nucleus sampling, where the model considers the smallest set of tokens whose cumulative probability is >= to the top_p value. |
user | Identifier for the end-user making the request. |
Example: Input messages
"messages": [
{"role": "user", "content": "How do I reset my password?"},
{"role": "assistant", "content": "To reset your password, go to the account settings and click on 'Reset Password'."}
]
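Putting several of these parameters together, a request might look like the sketch below. The values are illustrative only, and it assumes the Python client passes these request-body fields through as keyword arguments:

stream = client.chat.completions.create(
    model="deployment-1234567890",  # Deployment ID or a supported base model
    messages=[
        {"role": "system", "content": "You are SeekrBot, a helpful AI assistant."},
        {"role": "user", "content": "How do I reset my password?"},
    ],
    stream=True,
    max_tokens=512,     # Cap the length of the generated response
    temperature=0.3,    # Lower values make output more deterministic (illustrative value)
    top_p=0.9,          # Nucleus sampling threshold (illustrative value)
    stop=["\n\n"],      # Illustrative stop sequence
)

for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")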
Return token log probabilities during inference
You can also return the token log probabilities, or "logprobs". Logprobs reveal the model’s certainty for each generated token.
Low-confidence predictions highlight gaps in training data. During staging, you can flag outputs with low confidence (e.g., logprobs ≪ 0) for manual review or retraining.
Unusually high logprobs for irrelevant tokens can signal hallucinations. During staging, this can help refine prompts or adjust temperature settings.
The code below follows the OpenAI convention for request formatting:
- To return the logprobs of the generated tokens, set logprobs=True.
- To additionally return the top n most likely tokens and their associated logprobs, set top_logprobs=n, where n > 0.
client = SeekrFlow()

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Tell me about New York."}],
    stream=True,
    logprobs=True,
    top_logprobs=5,  # NOTE: Max number, m, depends on model deployment spec; n > m may throw validation error
)

for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
    print(chunk.choices[0].logprobs)
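As a rough staging check, you can aggregate the returned logprobs and flag responses whose average token log probability falls below a threshold. The sketch below assumes an OpenAI-style response shape, where the logprobs object exposes a content list whose items carry a logprob field; the threshold is arbitrary:

LOW_CONFIDENCE = -2.0  # Arbitrary threshold; tune for your model and task

token_logprobs = []
stream = client.chat.completions.create(
    model="deployment-1234567890",
    messages=[{"role": "user", "content": "How do I return a shirt I bought last month?"}],
    stream=True,
    logprobs=True,
)

for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
    lp = chunk.choices[0].logprobs
    if lp and getattr(lp, "content", None):  # Assumes OpenAI-style logprobs.content items with a .logprob field
        token_logprobs.extend(item.logprob for item in lp.content)

if token_logprobs and sum(token_logprobs) / len(token_logprobs) < LOW_CONFIDENCE:
    print("\n[flagged for manual review: low average token confidence]")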
Demote a model
Endpoint: GET v1/flow/fine-tunes/{fine_tune_id}/demote-model
Demote a Model
When you're done with your model, demote (i.e., unstage) it:
from seekrai import SeekrFlow
client = SeekrFlow()
deployments = client.deployments
# Specify the deployment ID to demote
DEPLOYMENT_ID = "deployment-1234567890"
# Demote the deployment
updated_deployment = deployments.demote(DEPLOYMENT_ID)
# Verify demotion status
print(f"Deployment {DEPLOYMENT_ID} status: {updated_deployment.status}")
print(f"Production status: {'Active' if updated_deployment.is_active else 'Demoted'}")
With SeekrFlow, deploying to a staging environment means promoting a model for inference. However, before using the inference model in your production environment, focus on validation and verification of your model’s performance to ensure a smooth transition.
Run validation checks
Make sure your model is ready for production deployment by running comprehensive validation checks.
Prepare representative validation data
Start by curating diverse validation datasets that mirror real-world inputs, including edge cases and difficult examples your model will encounter in production.
Example: For a customer service chatbot handling clothing returns, include:
- Simple queries ("How do I return this shirt?")
- Complex scenarios ("I received the wrong size in a different color than ordered")
- Edge cases ("I started a return but the tracking shows it's still at my house")
- Multi-intent queries ("I want to exchange this and add something to my order")
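One lightweight way to organize such a set is as a list of labeled cases that you can replay against the deployed model. A minimal sketch; the category labels are illustrative, not a SeekrFlow format:

validation_cases = [
    {"category": "simple", "prompt": "How do I return this shirt?"},
    {"category": "complex", "prompt": "I received the wrong size in a different color than ordered"},
    {"category": "edge_case", "prompt": "I started a return but the tracking shows it's still at my house"},
    {"category": "multi_intent", "prompt": "I want to exchange this and add something to my order"},
]

for case in validation_cases:
    stream = client.chat.completions.create(
        model="deployment-1234567890",
        messages=[{"role": "user", "content": case["prompt"]}],
        stream=True,
        max_tokens=256,
    )
    reply = "".join(chunk.choices[0].delta.content or "" for chunk in stream)
    print(f"[{case['category']}] {reply[:80]}")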
Run comprehensive checks
Next, evaluate prediction quality and system performance to ensure all production requirements are satisfied.
Track critical metrics
Statistics are a critical tool for making sure your AI is trustworthy. The following are some commonly used statistical metrics for evaluating a model's performance:
Prediction quality metrics
Metric | Definition | When to Prioritize |
---|---|---|
Accuracy | Correct predictions ÷ total predictions | Clear-cut, factual tasks (e.g., classification); in contexts where incorrect outputs could lead to significant consequences (medical, legal, financial) |
Precision | True positives ÷ predicted positives | When false positives are the most significant concern (e.g., content moderation) |
Recall | True positives ÷ actual positives | When false negatives are the most significant concern (e.g., compliance monitoring) |
F1 Score | Harmonic mean of precision and recall | When balance between precision and recall is needed |
In practice, there's often a trade-off between minimizing false positives and false negatives. The relative cost of each error type helps determine whether to prioritize precision or recall when optimizing a model. The F1 score is specifically designed to balance the concerns of both, by combining precision and recall into a single metric.
Going back to the clothing returns chatbot, you might prioritize F1 score when the costs of incorrectly rejecting valid returns (customer dissatisfaction) and incorrectly accepting invalid returns (financial loss) are both significant concerns that need to be balanced.
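The definitions in the table translate directly into code. A minimal sketch over binary decisions (1 = accept the return, 0 = reject it), assuming you have already collected the model's decisions and ground-truth labels for a validation set:

def classification_metrics(y_true, y_pred):
    # Count the four outcomes against ground truth
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Illustrative ground truth vs. model decisions on ten validation cases
print(classification_metrics(
    y_true=[1, 1, 0, 1, 0, 0, 1, 1, 0, 1],
    y_pred=[1, 0, 0, 1, 0, 1, 1, 1, 0, 1],
))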
System performance metrics
Latency: Response time per prediction
Throughput: Prediction volume capacity (e.g., 1000 requests/second)
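A simple first read on latency is to time a single streaming request, recording both time to first token and total response time; throughput testing would then run many such requests concurrently. A minimal sketch:

import time

start = time.perf_counter()
first_token_at = None

stream = client.chat.completions.create(
    model="deployment-1234567890",
    messages=[{"role": "user", "content": "How do I return this shirt?"}],
    stream=True,
    max_tokens=256,
)

for chunk in stream:
    if first_token_at is None and chunk.choices and chunk.choices[0].delta.content:
        first_token_at = time.perf_counter()  # Latency to the first generated token

total = time.perf_counter() - start
if first_token_at is not None:
    print(f"Time to first token: {first_token_at - start:.2f}s")
print(f"Total response time: {total:.2f}s")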
Identify areas for improvement and iterate
Conduct an error analysis
Categorize and investigate patterns in incorrect predictions to identify underlying causes.
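A simple starting point is to tag each incorrect prediction with a failure category during review and count how often each category occurs, so the most common failure modes surface first. A minimal sketch with illustrative categories:

from collections import Counter

# Each entry: (validation prompt, failure category assigned during manual review)
errors = [
    ("I started a return but tracking shows it's still at my house", "ambiguous_order_state"),
    ("I want to exchange this and add something to my order", "multi_intent"),
    ("Can I return a gift without a receipt?", "missing_policy_knowledge"),
    ("I want to exchange this for a larger size", "multi_intent"),
]

for category, count in Counter(cat for _, cat in errors).most_common():
    print(f"{category}: {count}")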
Implement targeted improvements
Apply insights from error analysis to refine the model through iterative improvements:
- Hyperparameter tuning
- Additional training data
- Model architecture modifications