Deploy to a Staging Environment
Validate your fine-tuned model, monitor key performance metrics, and promote it for inference
With SeekrFlow™, deploying to a staging environment means promoting a model for inference. Before using the promoted model in your production environment, validate and verify its performance to ensure a smooth transition.
Run Validation Checks
Make sure your model is ready for production deployment by running comprehensive validation checks.
Prepare Representative Validation Data
Start by curating diverse validation datasets that mirror real-world inputs, including edge cases and difficult examples your model will encounter in production.
Example: For a customer service chatbot handling clothing returns, include:
- Simple queries ("How do I return this shirt?")
- Complex scenarios ("I received the wrong size in a different color than ordered")
- Edge cases ("I started a return but the tracking shows it's still at my house")
- Multi-intent queries ("I want to exchange this and add something to my order")
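Such a validation set can be captured in a simple structure that pairs each query with the category it exercises, so coverage and per-category accuracy are easy to track later. Below is a minimal sketch; the field names and category labels are illustrative, not part of SeekrFlow.
Example Validation Set
# Illustrative structure for a returns-chatbot validation set
validation_set = [
    {"query": "How do I return this shirt?", "category": "simple"},
    {"query": "I received the wrong size in a different color than ordered", "category": "complex"},
    {"query": "I started a return but the tracking shows it's still at my house", "category": "edge_case"},
    {"query": "I want to exchange this and add something to my order", "category": "multi_intent"},
]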
Run Comprehensive Checks
Next, evaluate prediction quality and system performance to ensure all production requirements are satisfied.
Track Critical Metrics
Prediction Quality Metrics
Metric | Definition | When to Prioritize |
---|---|---|
Accuracy | Correct predictions ÷ total predictions | Clear-cut, factual tasks (e.g., classification); in contexts where incorrect outputs could lead to significant consequences (medical, legal, financial) |
Precision | True positives ÷ predicted positives | When false positives are the most significant concern (e.g., content moderation) |
Recall | True positives ÷ actual positives | When false negatives are the most significant concern (e.g., compliance monitoring) |
F1 Score | Harmonic mean of precision and recall | When balance between precision and recall is needed |
In practice, there's often a trade-off between minimizing false positives and false negatives. The relative cost of each error type helps determine whether to prioritize precision or recall when optimizing a model. The F1 score is specifically designed to balance the concerns of both, by combining precision and recall into a single metric.
Going back to the clothing returns chatbot, you might prioritize F1 score when the costs of incorrectly rejecting valid returns (customer dissatisfaction) and incorrectly accepting invalid returns (financial loss) are both significant concerns that need to be balanced.
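These quality metrics can be computed directly from your validation results. The sketch below uses scikit-learn (an assumption; it is not a SeekrFlow dependency) on illustrative labels for a binary "accept return" decision.
Compute Prediction Quality Metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Ground-truth labels and model predictions for a binary "accept return" task (illustrative)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy: ", accuracy_score(y_true, y_pred))   # correct predictions ÷ total predictions
print("Precision:", precision_score(y_true, y_pred))  # true positives ÷ predicted positives
print("Recall:   ", recall_score(y_true, y_pred))     # true positives ÷ actual positives
print("F1 score: ", f1_score(y_true, y_pred))         # harmonic mean of precision and recall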
System Performance Metrics
- Latency: Response time per prediction
- Throughput: Prediction volume capacity (e.g., 1000 requests/second)
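A quick way to spot-check latency and single-threaded throughput is to time a handful of requests against your promoted model. This sketch reuses the client and promoted model object (ft_deploy) created in the promotion and inference sections below; the sample queries and token limit are illustrative.
Measure Latency and Throughput
import time

sample_queries = ["How do I return this shirt?", "Where is my refund?"]  # illustrative validation queries

latencies = []
for query in sample_queries:
    start = time.perf_counter()
    client.chat.completions.create(
        model=ft_deploy.id,
        messages=[{"role": "user", "content": query}],
        max_tokens=256,
    )
    latencies.append(time.perf_counter() - start)

avg_latency = sum(latencies) / len(latencies)
print(f"Average latency: {avg_latency:.2f}s")
print(f"Approximate single-threaded throughput: {1 / avg_latency:.2f} requests/second")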
Identify Areas for Improvement and Iterate
Conduct an Error Analysis
Categorize and investigate patterns in incorrect predictions to identify underlying causes.
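One simple starting point is to tally incorrect predictions by the validation categories defined earlier, so the weakest areas stand out. The record format below is illustrative.
Tally Errors by Category
from collections import Counter

# Each record pairs a validation example's category with whether the model answered it correctly (illustrative)
results = [
    {"category": "simple", "correct": True},
    {"category": "edge_case", "correct": False},
    {"category": "multi_intent", "correct": False},
    {"category": "multi_intent", "correct": True},
]

error_counts = Counter(r["category"] for r in results if not r["correct"])
print(error_counts.most_common())  # categories with the most errors are candidates for more training data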
Implement Targeted Improvements
Apply insights from error analysis to refine the model through iterative improvements:
- Hyperparameter tuning
- Additional training data
- Model architecture modifications
Promoting a model for inference
Once your model passes validation, promote it with SeekrFlow and use it for inference: send input data to the model and receive predictions. The steps below walk through finding your fine-tuning job, promoting the model, and running inference against it.
List fine-tuning jobs
Endpoint: GET v1/flow/fine-tunes
List Fine-Tuning Jobs
To get the job ID, list all previously created fine-tuning jobs.
client.fine_tuning.list()           # List all fine-tuning jobs
client.fine_tuning.retrieve(ft_id)  # Retrieve a specific job by its ID
Promote a model
Endpoint: POST v1/flow/fine-tunes/{fine_tune_id}/promote-model
Promote Model
When you've found your job ID, promote it for inference:
ft_deploy = client.fine_tuning.promote(id="ft-1234567890")  # the returned object's .id is used as the model for inference below
Run inference on streaming chat completions
Chat completions are a great way to test your model's task- or domain-specific performance, as well as gauge end-user experience.
Endpoint: POST v1/inference/chat/completions
Create Chat Completion Request
stream = client.chat.completions.create(
model=ft_deploy.id,
messages=[
{"role": "system", "content": "You are SeekrBot, a helpful AI assistant"},
{"role": "user", "content": "who are you?"}
],
stream=True,
max_tokens=1024,
)
for chunk in stream:
print(chunk.choices[0].delta.content or "", end="")
Note: For model, you can choose any of our supported base models or any model that has been promoted for inference.
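If you don't need incremental output, set stream=False to receive the reply as one complete message. The sketch below assumes the response follows the OpenAI convention, with the text available on the message object rather than a delta.
Create Non-Streaming Chat Completion
response = client.chat.completions.create(
    model=ft_deploy.id,
    messages=[
        {"role": "system", "content": "You are SeekrBot, a helpful AI assistant"},
        {"role": "user", "content": "who are you?"}
    ],
    stream=False,
    max_tokens=1024,
)
print(response.choices[0].message.content)  # full reply in a single message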
Configure parameters in the request body
This guide will help you understand how to tweak the parameters in the request body for best results.
Parameter | Description | How to Use | Example |
---|---|---|---|
model | The ID of the model to use for inference. | Suitability: Choose a model trained or fine-tuned for the specific task or domain. Check Availability: Make sure the specified model ID is valid and available. | meta-llama/Meta-Llama-3-8B |
messages | The input messages that form the context for the model. | Format: A list of message objects organized by "role" and "content." Context: Provide proper context to help the model generate relevant and coherent responses. | See below. |
stream | When set to True, the response is streamed back incrementally. | Use for: Applications needing real-time response updates. Latency: May introduce slight latency, but with more immediate feedback. | False: Returns the response in one complete message. True: Streams the response incrementally. |
max_tokens | The maximum number of tokens to generate in the response. | Value: Set appropriately based on the desired length of the response. A zero value indicates no limit, but this can lead to very long responses (which may not be desired). | 50: Limits the response to 50 tokens. 0: No limit, allowing the model to generate a response of any length. |
temperature | Controls the randomness of the model's output. | Range: Typically between 0 and 1. Higher values (closer to 1) produce more random and creative responses, while lower values (closer to 0) make the output more focused and deterministic. | 0.7: Balanced creativity and coherence. 0.2: More deterministic and focused responses. |
frequency_penalty | Adjusts the likelihood of repeating tokens that have already been used. | Range: 0-1. Higher values penalize repeated words, encouraging more diverse outputs. | 0: No penalty. 0.5: Moderate penalty on repetition, useful for generating more varied responses. |
n | The number of completions to generate for each input prompt. | Default value: Typically set to 1 unless multiple responses are needed for comparison. Multiple completions: Good for generating diverse responses, but increase computational cost. | 1: Generates a single response. 3: Generates three different completions. |
presence_penalty | Adjusts the likelihood of introducing new topics or elements not present in the context. | Range: 0-1. Higher values encourage the model to bring in new topics or ideas, while lower values focus on the existing context. | 0: No penalty, more conservative responses. 0.6: Encourages introducing new topics, useful for creative or exploratory conversations. |
stop | Specifies a sequence where the model will stop generating further tokens. | Custom Stop Sequence: Define custom sequences to control where the response should end. Multiple Stops: Multiple stop sequences can be specified. | "\n\n": Stops at a double newline, useful for ending paragraphs or sentences. |
top_k | Limits the next token prediction to the top K tokens with the highest probabilities. | Use for: Reducing the sampling space to make the model's output more predictable. Often used in combination with top_p to balance randomness and determinism. | 5: Considers the top 5 tokens for each prediction step. 50: Considers the top 50 tokens, allowing for more variability. |
top_p | Nucleus sampling, where the model considers the smallest set of tokens whose cumulative probability is greater than or equal to the top_p value. | Range: 0-1. Use for: Achieving a dynamic cutoff based on cumulative probability, allowing for more flexibility compared to top_k. | 1: Considers all tokens (equivalent to no nucleus sampling). 0.9: Considers tokens until their cumulative probability reaches 0.9. |
user | Identifier for the end-user making the request. | Use for: Tracking and personalization; security and privacy. | "user123": Identifier for a specific user session or account. |
Example: Input messages
"messages": [
{"role": "user", "content": "How do I reset my password?"},
{"role": "assistant", "content": "To reset your password, go to the account settings and click on 'Reset Password'."}
]
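Putting several parameters together, a request might look like the sketch below. The values are illustrative starting points, and the parameters are passed as keyword arguments in the same OpenAI-style form used throughout this guide.
Example Request with Tuned Parameters
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "How do I reset my password?"}],
    stream=False,
    max_tokens=256,          # cap the response length
    temperature=0.2,         # more deterministic output for support answers
    frequency_penalty=0.5,   # moderate penalty on repetition
    presence_penalty=0.0,    # stay on the existing topic
    top_p=0.9,               # nucleus sampling cutoff
    stop=["\n\n"],           # end at a double newline
)
print(response.choices[0].message.content)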
Return token log probabilities during inference
You can also return the token log probabilities, or "logprobs". Logprobs reveal the model’s certainty for each generated token.
Low-confidence predictions highlight gaps in training data. During staging, you can flag outputs with low confidence (e.g., logprobs ≪ 0) for manual review or retraining.
Unusually high logprobs for irrelevant tokens can signal hallucinations. During staging, this can help refine prompts or adjust temperature settings.
The code below follows the OpenAI convention for request formatting:
- To return the logprobs of the generated tokens, set logprobs=True.
- To additionally return the top n most likely tokens and their associated logprobs, set top_logprobs=n, where n > 0.
import os

from seekrai import SeekrFlow  # SeekrFlow Python SDK

client = SeekrFlow(api_key=os.environ.get("SEEKR_API_KEY"))
stream = client.chat.completions.create(
model="meta-llama/Meta-Llama-3-8B-Instruct",
messages=[{"role": "user", "content": "Tell me about New York."}],
stream=True,
logprobs=True,
top_logprobs=5, # NOTE: Max number, m, depends on model deployment spec; n > m may throw validation error
)
for chunk in stream:
print(chunk.choices[0].delta.content or "", end="", flush=True)
print(chunk.choices[0].logprobs)
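To act on low-confidence outputs during staging, the streaming loop above can be extended to also aggregate the per-token logprobs and flag responses whose average falls below a threshold. The structure of the logprobs object is assumed to follow the OpenAI convention, and the threshold is an illustrative value to tune for your task.
Flag Low-Confidence Responses
token_logprobs = []
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
    lp = chunk.choices[0].logprobs
    if lp and lp.content:  # OpenAI-style: a list of per-token entries, each with a .logprob field
        token_logprobs.extend(t.logprob for t in lp.content)

avg_logprob = sum(token_logprobs) / len(token_logprobs) if token_logprobs else float("-inf")
if avg_logprob < -1.0:  # illustrative threshold
    print("\nLow average logprob: flag this output for manual review or additional training data")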
Demote a model
Endpoint: GET v1/flow/fine-tunes/{fine_tune_id}/demote-model
Demote a Model
When you're done with your model, unstage it:
client.fine_tuning.demote(id="ft-e9c8928d-ef90-44b5-a837-76d089924639")