Deploy to a Staging Environment

Validate your fine-tuned model, monitor key performance metrics, and promote it for inference

With SeekrFlow™, deploying to a staging environment means promoting a model for inference. Before using that model in your production environment, validate and verify its performance to ensure a smooth transition.

Run Validation Checks

Make sure your model is ready for production deployment by running comprehensive validation checks.

Prepare Representative Validation Data

Start by curating diverse validation datasets that mirror real-world inputs, including edge cases and difficult examples your model will encounter in production.

Example: For a customer service chatbot handling clothing returns, include:

  • Simple queries ("How do I return this shirt?")
  • Complex scenarios ("I received the wrong size in a different color than ordered")
  • Edge cases ("I started a return but the tracking shows it's still at my house")
  • Multi-intent queries ("I want to exchange this and add something to my order")
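
You can capture queries like these as a small, labeled validation set and replay them against the model during staging. A minimal sketch; the structure and the category/intent labels are illustrative, not a SeekrFlow format:

# Illustrative validation set for the clothing-returns chatbot.
# The "category" and "expected_intent" fields are hypothetical labels,
# not a SeekrFlow schema.
validation_cases = [
    {"category": "simple", "prompt": "How do I return this shirt?", "expected_intent": "start_return"},
    {"category": "complex", "prompt": "I received the wrong size in a different color than ordered.", "expected_intent": "replace_item"},
    {"category": "edge_case", "prompt": "I started a return but the tracking shows it's still at my house.", "expected_intent": "return_status"},
    {"category": "multi_intent", "prompt": "I want to exchange this and add something to my order.", "expected_intent": "exchange_and_add_item"},
]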

Run Comprehensive Checks

Next, evaluate prediction quality and system performance to ensure all production requirements are satisfied.

Track Critical Metrics

Prediction Quality Metrics

  • Accuracy (correct predictions ÷ total predictions): prioritize for clear-cut, factual tasks such as classification, and in contexts where incorrect outputs could lead to significant consequences (medical, legal, financial).
  • Precision (true positives ÷ predicted positives): prioritize when false positives are the most significant concern (e.g., content moderation).
  • Recall (true positives ÷ actual positives): prioritize when false negatives are the most significant concern (e.g., compliance monitoring).
  • F1 Score (harmonic mean of precision and recall): prioritize when you need a balance between precision and recall.

In practice, there's often a trade-off between minimizing false positives and false negatives. The relative cost of each error type helps determine whether to prioritize precision or recall when optimizing a model. The F1 score is designed to balance both concerns by combining precision and recall into a single metric.

Going back to the clothing returns chatbot, you might prioritize F1 score when the costs of incorrectly rejecting valid returns (customer dissatisfaction) and incorrectly accepting invalid returns (financial loss) are both significant concerns that need to be balanced.
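
As a quick way to compute these metrics over validation results, here is a minimal sketch using scikit-learn (any metrics library works) on hypothetical approve/reject decisions for the returns chatbot:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical validation labels: 1 = return approved, 0 = return rejected.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # ground-truth decisions
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]  # model decisions

print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1 score: ", f1_score(y_true, y_pred))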

System Performance Metrics

  • Latency: Response time per prediction
  • Throughput: Prediction volume capacity (e.g., 1000 requests/second)
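
A simple way to spot-check latency during staging is to time a batch of representative requests against the model. A minimal sketch, assuming a configured SeekrFlow client (see "Promoting a model for inference" below) and a few validation prompts:

import time

# Illustrative latency spot-check against a deployed model.
prompts = ["How do I return this shirt?", "Where is my refund?"]

latencies = []
for prompt in prompts:
    start = time.perf_counter()
    client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # or the ID of your promoted model
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    latencies.append(time.perf_counter() - start)

print(f"average latency: {sum(latencies) / len(latencies):.2f}s")
print(f"worst latency:   {max(latencies):.2f}s")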

Identify Areas for Improvement and Iterate

Conduct an Error Analysis

Categorize and investigate patterns in incorrect predictions to identify underlying causes.
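
One lightweight approach is to tag each failed validation case during review and count the failure categories, so the most common error patterns surface first. A minimal sketch with illustrative categories:

from collections import Counter

# Hypothetical failed validation cases tagged during manual review.
failures = [
    {"prompt": "I want to exchange this and add something to my order.", "error": "missed_second_intent"},
    {"prompt": "I started a return but the tracking shows it's still at my house.", "error": "wrong_policy_cited"},
    {"prompt": "Can I return a final-sale item?", "error": "wrong_policy_cited"},
]

# Rank error categories by frequency to prioritize the next fine-tuning iteration.
for error, count in Counter(case["error"] for case in failures).most_common():
    print(f"{error}: {count}")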

Implement Targeted Improvements

Apply insights from the error analysis to refine the model through iterative improvements, then re-run the validation checks above until results meet your production requirements.


Promoting a model for inference

Once you've promoted a fine-tuned model with SeekrFlow, you can use it for inference: send input data to the model and receive predictions. The steps below walk through setting up and performing inference with a promoted model.
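
The snippets in this guide assume an authenticated SeekrFlow client. A minimal setup sketch; the import path shown here is an assumption, so check the SDK documentation for your installation:

import os

# Assumption: the SeekrFlow Python SDK exposes a SeekrFlow client class;
# adjust the import to match your installed package.
from seekrai import SeekrFlow

# Read the API key from the SEEKR_API_KEY environment variable.
client = SeekrFlow(api_key=os.environ.get("SEEKR_API_KEY"))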

List fine-tuning jobs

Endpoint: GET v1/flow/fine-tunes List Fine-Tuning Jobs

To get the job ID, list all model fine-tunes that were previously created.

client.fine_tuning.list()           # List all fine-tuning jobs
client.fine_tuning.retrieve(ft_id)  # Retrieve a specific job by its ID

Promote a model

Endpoint: POST v1/flow/fine-tunes/{fine_tune_id}/promote-model Promote Model

When you've found your job ID, promote it for inference:

client.fine_tuning.promote(id="ft-1234567890")
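
In practice, capture the return value of the promote call so you can reference the promoted model in later requests. A minimal sketch, assuming the returned object exposes an id field (the streaming example below uses it as ft_deploy):

# Assumption: the promote call returns an object with an `id` attribute,
# which the streaming example below passes as the model to query.
ft_deploy = client.fine_tuning.promote(id="ft-1234567890")
print(ft_deploy.id)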

Run inference on streaming chat completions

Chat completions are a great way to test your model's task- or domain-specific performance, as well as gauge end-user experience.

Endpoint: POST v1/inference/chat/completions Create Chat Completion Request

stream = client.chat.completions.create(
    model=ft_deploy.id,
    messages=[
        {"role": "system", "content": "You are SeekrBot, a helpful AI assistant"},
        {"role": "user", "content": "who are you?"}
    ],
    stream=True,
    max_tokens=1024,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")

Note: For model, you can choose any of our supported base models or any model that has been promoted for inference.

Configure parameters in the request body

The reference below explains how to tune each parameter in the request body for the best results.

  • model: The ID of the model to use for inference.
    How to use: Choose a model trained or fine-tuned for the specific task or domain, and make sure the specified model ID is valid and available.
    Example: meta-llama/Meta-Llama-3-8B

  • messages: The input messages that form the context for the model.
    How to use: Provide a list of message objects organized by "role" and "content," with enough context for the model to generate relevant and coherent responses.
    Example: see "Input messages" below.

  • stream: When set to True, the response is streamed back incrementally.
    How to use: Use for applications that need real-time response updates; streaming may introduce slight latency but gives more immediate feedback.
    Examples: False returns the response in one complete message; True streams the response incrementally.

  • max_tokens: The maximum number of tokens to generate in the response.
    How to use: Set based on the desired length of the response. A value of 0 indicates no limit, but this can lead to very long responses (which may not be desired).
    Examples: 50 limits the response to 50 tokens; 0 sets no limit, allowing the model to generate a response of any length.

  • temperature: Controls the randomness of the model's output.
    How to use: Typically ranges between 0 and 1. Higher values (closer to 1) produce more random and creative responses, while lower values (closer to 0) make the output more focused and deterministic.
    Examples: 0.7 balances creativity and coherence; 0.2 gives more deterministic, focused responses.

  • frequency_penalty: Adjusts the likelihood of repeating tokens that have already been used.
    How to use: Ranges from 0 to 1. Higher values penalize repeated words, encouraging more diverse outputs.
    Examples: 0 applies no penalty; 0.5 applies a moderate penalty on repetition, useful for generating more varied responses.

  • n: The number of completions to generate for each input prompt.
    How to use: Typically set to 1 unless multiple responses are needed for comparison. Multiple completions are useful for generating diverse responses but increase computational cost.
    Examples: 1 generates a single response; 3 generates three different completions.

  • presence_penalty: Adjusts the likelihood of introducing new topics or elements not present in the context.
    How to use: Ranges from 0 to 1. Higher values encourage the model to bring in new topics or ideas, while lower values keep it focused on the existing context.
    Examples: 0 applies no penalty, producing more conservative responses; 0.6 encourages introducing new topics, useful for creative or exploratory conversations.

  • stop: Specifies a sequence at which the model stops generating further tokens.
    How to use: Define custom stop sequences to control where the response should end; multiple stop sequences can be specified.
    Example: "\n\n" stops at a double newline, useful for ending paragraphs or sentences.

  • top_k: Limits the next-token prediction to the K tokens with the highest probabilities.
    How to use: Reduces the sampling space to make the model's output more predictable; often used in combination with top_p to balance randomness and determinism.
    Examples: 5 considers the top 5 tokens at each prediction step; 50 considers the top 50 tokens, allowing for more variability.

  • top_p: Nucleus sampling; the model considers the smallest set of tokens whose cumulative probability is greater than or equal to the top_p value.
    How to use: Ranges from 0 to 1. Provides a dynamic cutoff based on cumulative probability, allowing more flexibility than top_k.
    Examples: 1 considers all tokens (equivalent to no nucleus sampling); 0.9 considers tokens until their cumulative probability reaches 0.9.

  • user: An identifier for the end user making the request.
    How to use: Useful for tracking and personalization, and for security and privacy.
    Example: "user123" identifies a specific user session or account.

Example: Input messages

"messages": [
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "To reset your password, go to the account settings and click on 'Reset Password'."}
]
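
Putting several of these parameters together, here is a hedged sketch of a non-streaming request; the parameter values are illustrative starting points, and the response access assumes an OpenAI-style choices/message shape:

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are SeekrBot, a helpful AI assistant"},
        {"role": "user", "content": "How do I reset my password?"},
    ],
    stream=False,           # return one complete message instead of streaming
    max_tokens=256,         # cap the response length
    temperature=0.2,        # favor focused, deterministic answers
    top_p=0.9,              # nucleus sampling cutoff
    frequency_penalty=0.5,  # discourage repetition
    stop=["\n\n"],          # end the response at a double newline
    user="user123",         # identifier for tracking and personalization
)
# Assumption: non-streaming responses follow the OpenAI-style shape.
print(response.choices[0].message.content)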

Return token log probabilities during inference

You can also return the token log probabilities, or "logprobs". Logprobs reveal the model’s certainty for each generated token.

Low-confidence predictions highlight gaps in training data. During staging, you can flag outputs with low confidence (e.g., logprobs ≪ 0) for manual review or retraining.

Unusually high logprobs for irrelevant tokens can signal hallucinations. Spotting these during staging can help you refine prompts or adjust temperature settings.

The code below follows the OpenAI convention for request formatting:

  • To return the logprobs of the generated tokens, set logprobs=True.
  • To additionally return the top n most likely tokens and their associated logprobs, set top_logprobs=n, where n > 0.

client = SeekrFlow(api_key=os.environ.get("SEEKR_API_KEY"))
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Tell me about New York."}],
    stream=True,
    logprobs=True,
    top_logprobs=5,  # NOTE: Max number, m, depends on model deployment spec; n > m may throw validation error
)

for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
    print(chunk.choices[0].logprobs)
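
To put the logprobs to use during staging, you can flag tokens that fall below a confidence threshold for manual review. A minimal sketch, assuming the streamed logprobs follow the OpenAI-style shape (a content list of entries exposing token and logprob); adjust the attribute access to the actual response schema:

LOW_CONFIDENCE = -2.0  # illustrative threshold; tune against your own validation data
flagged = []

# Re-create `stream` with the same request as above before running this loop;
# a streamed response can only be consumed once.
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
    logprobs = chunk.choices[0].logprobs
    # Assumption: OpenAI-style logprobs with a `content` list of token entries.
    for entry in getattr(logprobs, "content", None) or []:
        if entry.logprob < LOW_CONFIDENCE:
            flagged.append((entry.token, entry.logprob))

print(f"\n{len(flagged)} low-confidence tokens flagged for review")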

Demote a model

Endpoint: GET v1/flow/fine-tunes/{fine_tune_id}/demote-model Demote a Model

When you're done with your model, unstage it:

client.fine_tuning.demote(id="ft-e9c8928d-ef90-44b5-a837-76d089924639")