Meta Llama Guard model

This page describes how to use the Llama Guard model in SeekrFlow.

Full Model Guide: meta-llama/Llama-Guard-3-8B

Summary

Llama Guard 3 (8B) is an open-source large language model from Meta designed for content moderation and safety classification. It evaluates user-generated text across categories such as hate speech, harassment, violence, and misinformation. Now available in the SeekrFlow model library, this model can be deployed to flag unsafe content in chat, social platforms, RAG pipelines, and agent interactions.

Key Features

  • 22 Risk Categories
    Based on the MLCommons Responsible AI taxonomy (hate, violence, self-harm, sexual content, and more)
  • Supports 1st or 2nd Person Framing
    Detects unsafe content from either user or assistant messages
  • Multi-language Capable
    While optimized for English, it generalizes moderately well to Spanish, French, German, and other major languages
  • Simple Prompt Format
    Uses a clean, JSON-like input for user and assistant messages
  • Open-source, Lightweight Model
    Uses the 8B Llama 3 base model for relatively fast, deployable inference

Target Input Type

  • Text-based user or assistant messages (chat-based systems, generative agents, Q&A apps)
  • Supports both single message and multi-turn conversation review
  • Not built for audio or video moderation — text only

Taxonomy Categories (MLCommons Hazard Set)

The model classifies text using the MLCommons Responsible AI hazard taxonomy, including:

  • Hate
  • Harassment
  • Sexual content
  • Self-harm
  • Violence
  • Criminal planning
  • Weapons
  • Drugs
  • Alcohol
  • Misinformation
  • Health misinfo
  • Legal misinfo
  • Spam
  • Profanity
  • Insults
  • Graphic content
  • NSFW
  • Solicitation
  • Extremism
  • Privacy violation
  • Malicious code
  • Financial harm

Each category returns a "safe" or "unsafe" label, as sketched below.
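
For reference, each per-message result can be thought of as a mapping from category name to an unsafe flag. The sketch below is illustrative only: the key names mirror the taxonomy list above, and the exact labels returned by a deployment may differ.

# Illustrative only: a per-message moderation result as category -> flag.
# Key names are assumptions based on the taxonomy above; check your
# deployment's actual response for the exact labels it returns.
CategoryFlags = dict[str, bool]

example_result: CategoryFlags = {
    "hate": True,
    "harassment": True,
    "violence": False,
    "misinformation": False,
    # ...remaining categories omitted; absent or False means "safe"
}

def flagged_categories(flags: CategoryFlags) -> list[str]:
    """Return the names of all categories marked unsafe."""
    return [name for name, unsafe in flags.items() if unsafe]

print(flagged_categories(example_result))  # ['hate', 'harassment']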

Languages Supported

  • Primary: English
  • Partial Generalization: Spanish, French, German, Portuguese, Italian, Dutch, etc.
  • The model has not been fine-tuned for multilingual safety, so results may vary across languages.

How to Use in SeekrFlow (Step-by-Step)

  1. Deploy the Model
  2. Format Your Input
    • Use the structured message format:
{
  "messages": [
    {"role": "user", "content": "Hey, you suck and I hate you."}
  ]
}
  3. Run Inference via SDK or API
    • Pass the messages array with your deployed model ID (see the sketch after these steps)
    • Results return one or more unsafe category flags
  4. Handle Unsafe Output
    • Optional: block, redact, route to human review, or rephrase based on the categories flagged
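
The sketch below illustrates steps 3 and 4 over plain HTTP. The endpoint URL, header name, and response shape are assumptions made for illustration; substitute the base URL and model ID shown for your deployment and consult the SeekrFlow API reference for the exact request contract.

import os
import requests

# Hypothetical endpoint and model ID -- replace with the values shown for
# your deployment in SeekrFlow.
SEEKRFLOW_URL = "https://api.seekrflow.example/v1/chat/completions"
MODEL_ID = "meta-llama/Llama-Guard-3-8B"

def moderate(message: str) -> dict:
    """Send a single user message to the deployed Llama Guard model."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": message}],
    }
    resp = requests.post(
        SEEKRFLOW_URL,
        json=payload,
        headers={"Authorization": f"Bearer {os.environ['SEEKR_API_KEY']}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

result = moderate("Hey, you suck and I hate you.")
print(result)  # inspect the unsafe flag and flagged categories before acting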

Model Prompt Format & Best Practices

Input Format

{
  "messages": [
    { "role": "user", "content": "..." },
    { "role": "assistant", "content": "..." }
  ]
}
  • Can include just the user input or both user and assistant responses
  • Order matters; the model assumes a back-and-forth conversation (see the multi-turn helper below)
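
For multi-turn review, the same structure extends to alternating user and assistant turns. A small helper like the one below (illustrative, not part of the SeekrFlow SDK) keeps the ordering explicit:

def build_messages(turns: list[tuple[str, str]]) -> dict:
    """Build the request body from (role, content) pairs, oldest turn first.

    Only "user" and "assistant" roles are expected; system prompts should be
    kept out of the moderation payload.
    """
    allowed = {"user", "assistant"}
    messages = []
    for role, content in turns:
        if role not in allowed:
            raise ValueError(f"unsupported role: {role}")
        messages.append({"role": role, "content": content})
    return {"messages": messages}

payload = build_messages([
    ("user", "How do I pick a lock?"),
    ("assistant", "I can't help with that."),
])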

Output Format

{
  "unsafe": true,
  "categories": {
    "hate": true,
    "violence": false,
    "harassment": true,
    ...
  }
}
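
Given a response shaped like the example above, handling can be as simple as checking the top-level unsafe flag and collecting the flagged categories. The field and category names below follow the example output and are assumptions; the routing policy (block, redact, escalate) is entirely up to you.

def handle_result(result: dict) -> str:
    """Map a moderation result to an action based on the flagged categories."""
    if not result.get("unsafe", False):
        return "allow"

    flagged = [name for name, hit in result.get("categories", {}).items() if hit]

    # Example policy: block the most severe categories, otherwise route the
    # message to human review. Adjust to your own moderation policy.
    block_on = {"hate", "violence", "self-harm"}
    if block_on.intersection(flagged):
        return "block"
    return "human_review"

print(handle_result({"unsafe": True, "categories": {"hate": True, "violence": False}}))
# -> "block"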

Best Practices

  • Keep messages short (single utterance or <300 tokens)
  • Structure your input for clarity (don't mix system prompts with user messages)
  • Use consistent formatting if auditing multi-turn chats
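
One simple way to follow the short-message guidance is to check each message against a rough token budget before sending it. The whitespace split below is a crude stand-in for real tokenization; an actual tokenizer would be more accurate.

MAX_TOKENS = 300  # rough per-message budget from the guidance above

def within_budget(text: str, limit: int = MAX_TOKENS) -> bool:
    """Crude token estimate: whitespace-separated words as a proxy for tokens."""
    return len(text.split()) <= limit

if not within_budget("word " * 500):
    print("Message too long; split it before sending it for moderation.")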

What SeekrFlow Provides

  • Fully hosted version of Llama-Guard-3-8B
  • Fast, scalable API and SDK integration
  • Outputs content moderation flags per message
  • Access control, observability, and logging

What You Can Build Separately

  • Pipeline to generate or retrieve user messages
  • Model inference logic (loop through messages and score each; see the sketch after this list)
  • Optional: dashboard, moderation UI, policy logic, score aggregation, redaction tools
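
As a sketch of the inference loop mentioned above, the helper below scores each message independently and keeps the original text next to its flags. The moderate callable is whatever inference function you wire up against your deployed model, for example the HTTP helper sketched in the step-by-step section.

from typing import Callable

def moderate_batch(messages: list[str], moderate: Callable[[str], dict]) -> list[dict]:
    """Score each message independently and collect the moderation results."""
    results = []
    for text in messages:
        result = moderate(text)   # one inference call per message
        result["input"] = text    # keep the original text alongside its flags
        results.append(result)
    return results

# Example: surface only the messages that were flagged as unsafe.
# flagged = [r for r in moderate_batch(texts, moderate) if r.get("unsafe")]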

Summary Table

| Feature | Provided by SeekrFlow | Built by You |
| --- | --- | --- |
| Hosted model deployment | ✅ | |
| SDK & API access | ✅ | |
| Moderation scores (per message) | ✅ | |
| Input structuring | | ✅ |
| UI or dashboard | | ✅ |
| Logging and analytics | ✅ (optional) | |
| Moderation actions | | ✅ (block, flag, etc.) |

Limitations

  • No confidence scores — only binary (safe/unsafe) output
  • English-centric — lower accuracy in non-English text
  • Context window is limited to ~8k tokens (short conversations recommended)
  • Not fine-tuned for extremely short, isolated phrases
  • No audio, image, or video input support

Use Cases

  • Chat moderation (real-time or post-hoc)
  • Filtering unsafe input/output from LLMs
  • RAG system safety layer before retrieval/inference
  • Agent communication review or agent guardrails
  • Comment section moderation for news or community platforms

FAQs

Q: Can I use Llama Guard 3 for non-English content?
A: It may work on major Western languages, but it's optimized for English. Accuracy may drop otherwise.

Q: Can I use this on a full conversation or only individual messages?
A: You can submit a full message history in messages[], but it's best to keep it short for clarity and performance.

Q: Does it return a confidence score?
A: No, only binary true/false flags per category.

Q: Can I fine-tune or add my own taxonomy?
A: Not currently via SeekrFlow. You could host a modified version on your own infra if needed.

Q: How does this compare to Seekr ContentGuard?
A: Llama Guard is general-purpose and covers a wide range of content categories. Seekr ContentGuard is specifically tuned for podcast episodes and uses different scoring systems (GARM + Civility).