Meta Llama Guard model

This page describes how to use the Llama Guard model in SeekrFlow.

Full Model Guide: meta-llama/Llama-Guard-3-8B

Summary

Llama Guard 3 (8B) is an open-source large language model from Meta designed for content moderation and safety classification. It evaluates user-generated text across categories such as hate speech, harassment, violence, and misinformation. Now available in the SeekrFlow model library, this model can be deployed to flag unsafe content in chat, social platforms, RAG pipelines, and agent interactions.

Key Features

  • 22 Risk Categories
    Based on the MLCommons Responsible AI taxonomy (hate, violence, self-harm, sexual content, and more)
  • Supports 1st or 2nd Person Framing
    Detects unsafe content from either user or assistant messages
  • Multi-language Capable
    While optimized for English, it generalizes moderately well to Spanish, French, German, and other major languages
  • Simple Prompt Format
    Uses a clean, JSON-like input for user and assistant messages
  • Open-source, Lightweight Model
    Uses the 8B Llama 3 base model for relatively fast, deployable inference

Target Input Type

  • Text-based user or assistant messages (chat-based systems, generative agents, Q&A apps)
  • Supports both single message and multi-turn conversation review
  • Not built for audio or video moderation — text only

Taxonomy Categories (MLCommons Hazard Set)

The model classifies text using the MLCommons Responsible AI hazard taxonomy, including:

  • Hate
  • Harassment
  • Sexual content
  • Self-harm
  • Violence
  • Criminal planning
  • Weapons
  • Drugs
  • Alcohol
  • Misinformation
  • Health misinfo
  • Legal misinfo
  • Spam
  • Profanity
  • Insults
  • Graphic content
  • NSFW
  • Solicitation
  • Extremism
  • Privacy violation
  • Malicious code
  • Financial harm

Each category returns a "safe" or "unsafe" label, as sketched below.
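
For reference, each per-message result can be thought of as a mapping from category name to an unsafe flag. The sketch below is illustrative only: the key names mirror the taxonomy list above, and the exact labels returned by a deployment may differ.

# Illustrative only: a per-message moderation result as category -> flag.
# Key names are assumptions based on the taxonomy above; check your
# deployment's actual response for the exact labels it returns.
CategoryFlags = dict[str, bool]

example_result: CategoryFlags = {
    "hate": True,
    "harassment": True,
    "violence": False,
    "misinformation": False,
    # ...remaining categories omitted; absent or False means "safe"
}

def flagged_categories(flags: CategoryFlags) -> list[str]:
    """Return the names of all categories marked unsafe."""
    return [name for name, unsafe in flags.items() if unsafe]

print(flagged_categories(example_result))  # ['hate', 'harassment']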

Languages Supported

  • Primary: English
  • Partial Generalization: Spanish, French, German, Portuguese, Italian, Dutch, etc.
  • The model has not been fine-tuned for multilingual safety, so results may vary across languages.

How to Use in SeekrFlow (Step-by-Step)

  1. Deploy the Model
  2. Format Your Input
    • Use the structured message format:
{
  "messages": [
    {"role": "user", "content": "Hey, you suck and I hate you."}
  ]
}
  3. Run Inference via SDK or API
    • Pass the messages array with your deployed model ID (see the sketch after these steps)
    • Results return one or more unsafe category flags
  4. Handle Unsafe Output
    • Optional: block, redact, route to human review, or rephrase based on the categories flagged
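
The sketch below illustrates steps 3 and 4 over plain HTTP. The endpoint URL, header name, and response shape are assumptions made for illustration; substitute the base URL and model ID shown for your deployment and consult the SeekrFlow API reference for the exact request contract.

import os
import requests

# Hypothetical endpoint and model ID -- replace with the values shown for
# your deployment in SeekrFlow.
SEEKRFLOW_URL = "https://api.seekrflow.example/v1/chat/completions"
MODEL_ID = "meta-llama/Llama-Guard-3-8B"

def moderate(message: str) -> dict:
    """Send a single user message to the deployed Llama Guard model."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": message}],
    }
    resp = requests.post(
        SEEKRFLOW_URL,
        json=payload,
        headers={"Authorization": f"Bearer {os.environ['SEEKR_API_KEY']}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

result = moderate("Hey, you suck and I hate you.")
print(result)  # inspect the unsafe flag and flagged categories before acting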

Model Prompt Format & Best Practices

Input Format

{
  "messages": [
    { "role": "user", "content": "..." },
    { "role": "assistant", "content": "..." }
  ]
}
  • Can include just the user input or both user and assistant responses
  • Order matters; the model assumes a back-and-forth conversation (see the multi-turn helper below)
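
For multi-turn review, the same structure extends to alternating user and assistant turns. A small helper like the one below (illustrative, not part of the SeekrFlow SDK) keeps the ordering explicit:

def build_messages(turns: list[tuple[str, str]]) -> dict:
    """Build the request body from (role, content) pairs, oldest turn first.

    Only "user" and "assistant" roles are expected; system prompts should be
    kept out of the moderation payload.
    """
    allowed = {"user", "assistant"}
    messages = []
    for role, content in turns:
        if role not in allowed:
            raise ValueError(f"unsupported role: {role}")
        messages.append({"role": role, "content": content})
    return {"messages": messages}

payload = build_messages([
    ("user", "How do I pick a lock?"),
    ("assistant", "I can't help with that."),
])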

Output Format

{
  "unsafe": true,
  "categories": {
    "hate": true,
    "violence": false,
    "harassment": true,
    ...
  }
}
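
Given a response shaped like the example above, handling can be as simple as checking the top-level unsafe flag and collecting the flagged categories. The field and category names below follow the example output and are assumptions; the routing policy (block, redact, escalate) is entirely up to you.

def handle_result(result: dict) -> str:
    """Map a moderation result to an action based on the flagged categories."""
    if not result.get("unsafe", False):
        return "allow"

    flagged = [name for name, hit in result.get("categories", {}).items() if hit]

    # Example policy: block the most severe categories, otherwise route the
    # message to human review. Adjust to your own moderation policy.
    block_on = {"hate", "violence", "self-harm"}
    if block_on.intersection(flagged):
        return "block"
    return "human_review"

print(handle_result({"unsafe": True, "categories": {"hate": True, "violence": False}}))
# -> "block"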

Best Practices

  • Keep messages short (single utterance or <300 tokens)
  • Structure your input for clarity (don't mix system prompts with user messages)
  • Use consistent formatting if auditing multi-turn chats
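
One simple way to follow the short-message guidance is to check each message against a rough token budget before sending it. The whitespace split below is a crude stand-in for real tokenization; an actual tokenizer would be more accurate.

MAX_TOKENS = 300  # rough per-message budget from the guidance above

def within_budget(text: str, limit: int = MAX_TOKENS) -> bool:
    """Crude token estimate: whitespace-separated words as a proxy for tokens."""
    return len(text.split()) <= limit

if not within_budget("word " * 500):
    print("Message too long; split it before sending it for moderation.")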

What SeekrFlow Provides

  • Fully hosted version of Llama-Guard-3-8B
  • Fast, scalable API and SDK integration
  • Outputs content moderation flags per message
  • Access control, observability, and logging

What You Can Build Separately

  • Pipeline to generate or retrieve user messages
  • Model inference logic (loop through messages and score each; see the sketch after this list)
  • Optional: dashboard, moderation UI, policy logic, score aggregation, redaction tools
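
As a sketch of the inference loop mentioned above, the helper below scores each message independently and keeps the original text next to its flags. The moderate callable is whatever inference function you wire up against your deployed model, for example the HTTP helper sketched in the step-by-step section.

from typing import Callable

def moderate_batch(messages: list[str], moderate: Callable[[str], dict]) -> list[dict]:
    """Score each message independently and collect the moderation results."""
    results = []
    for text in messages:
        result = moderate(text)   # one inference call per message
        result["input"] = text    # keep the original text alongside its flags
        results.append(result)
    return results

# Example: surface only the messages that were flagged as unsafe.
# flagged = [r for r in moderate_batch(texts, moderate) if r.get("unsafe")]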

Summary Table

| Feature | Provided by SeekrFlow | Built by You |
| --- | --- | --- |
| Hosted model deployment | ✅ | |
| SDK & API access | ✅ | |
| Moderation scores (per message) | ✅ | |
| Input structuring | | ✅ |
| UI or dashboard | | ✅ |
| Logging and analytics | ✅ (optional) | |
| Moderation actions | | ✅ (block, flag, etc.) |

Limitations

  • No confidence scores — only binary (safe/unsafe) output
  • English-centric — lower accuracy in non-English text
  • Context window is limited to ~8k tokens (short conversations recommended)
  • Not fine-tuned for extremely short, isolated phrases
  • No audio, image, or video input support

Use Cases

  • Chat moderation (real-time or post-hoc)
  • Filtering unsafe input/output from LLMs
  • RAG system safety layer before retrieval/inference
  • Agent communication review or agent guardrails
  • Comment section moderation for news or community platforms

FAQs

Q: Can I use Llama Guard 3 for non-English content?
A: It may work on major Western languages, but it's optimized for English. Accuracy may drop otherwise.

Q: Can I use this on a full conversation or only individual messages?
A: You can submit a full message history in messages[], but it's best to keep it short for clarity and performance.

Q: Does it return a confidence score?
A: No, only binary true/false flags per category.

Q: Can I fine-tune or add my own taxonomy?
A: Not currently via SeekrFlow. You could host a modified version on your own infra if needed.

Q: How does this compare to Seekr ContentGuard?
A: Llama Guard is general-purpose and covers a wide range of content categories. Seekr ContentGuard is specifically tuned for podcast episodes and uses different scoring systems (GARM + Civility).