Create a Fine-Tuning Job

Here’s a comprehensive example of how to create a fine-tuning job for a Llama 2 model, with specific infrastructure requirements (86 CPUs, 8 GPUs) and training parameters.

To create a fine-tuning job, you'll first create a project, to which you can associate a fine-tuning run. You can also retrieve project information and get a list of all of your projects:

Create a project

import os
from seekrai import SeekrFlow

client = SeekrFlow(api_key=os.environ.get("SEEKR_API_KEY"))

# Create a new project
proj = client.projects.create(name="project-name", description="project-description")

You can also get a list of all projects, or locate a specific one:

# List all projects
client.projects.list()

# Get project info
client.projects.retrieve(proj.id) 

Note: Project ID is an integer and can be found by listing all projects; e.g., project_id=588
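
If you've forgotten a project's ID, you can look it up by name from the list. This is a small sketch that assumes client.projects.list() returns an iterable of project objects with name and id attributes; adjust the attribute access if your SDK version wraps the results differently:

# Sketch: find a project's ID by name.
# Assumes the list call yields objects with .name and .id attributes.
projects = client.projects.list()
project_id = next((p.id for p in projects if p.name == "project-name"), None)
print(f"project_id={project_id}")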

Next, specify a TrainingConfig object and an InfrastructureConfig object.

The TrainingConfig defines all parameters that govern the training run itself, such as the base model to be fine-tuned, the number of epochs, quantization, and so on.

The InfrastructureConfig defines the infrastructure for the fine-tuning job. Gaudi2 is available on SeekrFlow, with other compute options available for on-prem installations and AI appliances.

This example uses 8 Gaudi2 cards, which triggers SeekrFlow to run in multi-card training mode:

Training configuration

Endpoint: POST v1/flow/fine-tunes Create a Fine-Tuning Job

import os
import requests

url = "https://build.seekr.com/v1/flow/fine-tunes"

payload = {
    "infrastructure_config": {
        "n_cpu": 86,
        "n_gpu": 8,
        "memory": 2400
    },
    "training_config": {
        "training_files": ["<training-file-id>"],
        "n_epochs": 10,
        "batch_size": 32,
        "learning_rate": 0.0001,
        "model": "meta-llama/Llama-2-7b-hf",
        "hf_token": "<huggingface-token>",
        "experiment_name": "<experiment-name>",
        "max_length": 512,
        "bf16": True,
        "gradient_checkpointing": True
    }
}

headers = {
    "accept": "application/json",
    "content-type": "application/json",
    # Bearer authentication is assumed here; use the same API key as the SDK client.
    "authorization": f"Bearer {os.environ.get('SEEKR_API_KEY')}"
}

response = requests.post(url, json=payload, headers=headers)
print(response.text)
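
If you're working with the Python SDK instead of calling the REST endpoint directly, the same settings can be expressed as configuration objects. The snippet below is a minimal sketch: the seekrai.types import path and the TrainingConfig/InfrastructureConfig class names mirror the payload fields above, but should be confirmed against the SDK reference for your version.

# Sketch: construct SDK-side config objects mirroring the REST payload above.
# The import path and class names are assumptions; check the seekrai reference.
from seekrai.types import TrainingConfig, InfrastructureConfig

infrastructure_config = InfrastructureConfig(
    n_cpu=86,
    n_gpu=8,      # 8 Gaudi2 cards -> multi-card training mode
    memory=2400,
)

training_config = TrainingConfig(
    training_files=["<training-file-id>"],  # ID(s) of your uploaded training file(s)
    model="meta-llama/Llama-2-7b-hf",
    n_epochs=10,
    batch_size=32,
    learning_rate=0.0001,
    max_length=512,
    bf16=True,
    gradient_checkpointing=True,
)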

Now that you've defined your training and infrastructure configurations, you're ready to fine-tune a model!

Fine-tune a base model

fine_tune = client.fine_tuning.create(
    training_config=training_config,
    infrastructure_config=infrastructure_config,
    project_id=proj.id,  # Associate this fine-tune with the project created above by passing its integer ID.
)

Monitoring your fine-tuning run

All job runs are tracked using SeekrFlow's event monitoring and tracking system.

Endpoints:

GET v1/flow/fine-tunes List Fine-Tuning Jobs

GET v1/flow/fine-tunes/{fine_tune_id} Retrieve Fine-Tuning Job

To retrieve the status and progress of a run, use the following:

print(client.fine_tuning.retrieve(fine_tune.id).status)

Sample response:

Status: Running
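
For longer runs, you can poll the job until it reaches a terminal state. This is a hedged sketch: the exact status strings returned by the API (for example, "completed" or "failed") are assumptions, so compare them against the values you observe in your own responses.

import time

# Poll until the run leaves its active states. The terminal status strings
# below are assumptions; adjust them to match the values your API returns.
while True:
    status = str(client.fine_tuning.retrieve(fine_tune.id).status).lower()
    print(f"Status: {status}")
    if any(s in status for s in ("completed", "failed", "cancelled", "error")):
        break
    time.sleep(60)  # check once a minute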

Plot training loss

import matplotlib.pyplot as plt

# Retrieve the run's events and sort them by epoch before plotting
ft_id = fine_tune.id
events = client.fine_tuning.retrieve(ft_id).events
ft_response_events_sorted = sorted(events, key=lambda x: x.epoch)

# Extract the epoch and loss values from each event
epochs = [event.epoch for event in ft_response_events_sorted]
losses = [event.loss for event in ft_response_events_sorted]

plt.figure(figsize=(8, 4))
plt.plot(epochs, losses, marker="o", linestyle="-", color="b")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Training Loss Over Epochs")
plt.grid(True)

# Show at most 10 epoch labels on the x-axis to keep them readable
max_labels = 10
step = max(1, len(epochs) // max_labels)
plt.xticks(epochs[::step], rotation=45)

plt.tight_layout()
plt.show()

Interpreting a Training Loss Chart

The training loss measures how closely the model's predictions match the actual target values. A lower value and a downward curve that forms an elbow shape signal progress; a flat or rising curve can indicate learning issues.

The training loss chart gives a visual snapshot of how well your model is learning during fine-tuning by tracking its loss over time.

Loss: The Y-axis represents training loss, which quantifies the difference between the model's predictions and the actual target values. A lower loss indicates better performance.
Epochs: The upper X-axis shows epochs, where each epoch corresponds to one complete pass through the entire training dataset.
Steps: The lower X-axis represents training steps, calculated as:
Total Steps = (Total Number of Samples ÷ (Number of Instances × Batch Size)) × Number of Epochs
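
For example, with the configuration above (batch size 32 on 8 cards) and a hypothetical dataset of 64,000 samples trained for 10 epochs:

Total Steps = (64,000 ÷ (8 × 32)) × 10 = 250 × 10 = 2,500 steps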

Decreasing Loss Curve: Indicates that the model is learning and improving its predictions.
Plateauing Loss Curve: The model may have reached its learning capacity with the current configuration.

  • Try adjusting hyperparameters (e.g., learning rate, batch size) and retrain to see if the model improves.

Increasing Loss Curve: May indicate overfitting or issues with the training process.

  • Review data quality to ensure the training data is clean and representative of the problem space, and retrain with a higher-quality dataset.

Tuning Hyperparameters

Hyperparameter tuning allows you to tweak model performance for optimal results. Hyperparameters play a crucial role in the training process, impacting both performance and training efficiency. Here’s a guide to essential hyperparameters and how to set their values:

Learning Rate

Consider Impact on Convergence: The learning rate controls how much the model’s weights are updated with respect to the loss gradient. A high learning rate can lead to rapid convergence but risks overshooting the optimal solution, while a low learning rate ensures stable convergence but may require more training epochs.

Start with a Small Value: A common practice is to start with a small learning rate (e.g., 0.001) and adjust based on the training performance.

Batch Size

Memory Constraints: Larger batch sizes require more memory, but can lead to faster and more stable training due to more accurate gradient estimates.

Training Speed: Smaller batch sizes can lead to noisier updates, but may converge faster due to more frequent weight updates.

Experimentation: Start with a moderate batch size (e.g., 32 or 64) and adjust based on memory availability and training speed.

Number of Epochs

Overfitting: More epochs allow the model to learn more from the data, but also increase the risk of overfitting (where the model learns the task too well, leading to poor generalization on unseen data).

Training Time: The number of epochs impacts the total training time. Ensure that the chosen number of epochs balances training time with model performance.

Max Length

Sequence Length: The maximum length of input sequences the model will handle. Longer sequences can capture more context but require more memory and computation.

Balance Length: Choose a length that balances capturing sufficient context with computational efficiency.

Task Requirements: Set this based on the typical length of the input data for your task.

Bf16 (Bfloat16 Precision)

Memory Efficiency: Bfloat16 reduces memory usage, allowing for larger models or batch sizes.

Training Stability: Maintains training stability while offering computational efficiency.

Hardware Support: Ensure your selected hardware supports bf16 precision.

Gradient Checkpointing

Memory Usage: Gradient checkpointing trades increased computation for reduced memory usage, allowing for training larger models.

Complex Models: Particularly useful for training very large models where memory is a constraint.

Training Time: May increase training time due to additional computations during backpropagation.
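
To put these knobs together, here is a hedged sketch of a second training configuration with adjusted hyperparameters, reusing the assumed TrainingConfig class from the SDK sketch above; the specific values are illustrative starting points, not recommendations.

# Illustrative hyperparameter adjustments for a retraining run.
# TrainingConfig and its field names are assumptions carried over from the earlier sketch.
tuned_training_config = TrainingConfig(
    training_files=["<training-file-id>"],
    model="meta-llama/Llama-2-7b-hf",
    n_epochs=5,                   # fewer epochs to reduce overfitting risk
    batch_size=64,                # larger batches for more stable gradient estimates (needs more memory)
    learning_rate=0.00005,        # lower the learning rate if the loss curve is noisy or rising
    max_length=1024,              # longer sequences capture more context but cost memory and compute
    bf16=True,                    # bfloat16 reduces memory usage on supported hardware (e.g., Gaudi2)
    gradient_checkpointing=True,  # trade extra compute for lower memory usage
)

tuned_run = client.fine_tuning.create(
    training_config=tuned_training_config,
    infrastructure_config=infrastructure_config,
    project_id=proj.id,
)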


Next

Prepare to deploy your fine-tuned model to a staging environment.