Meta Llama Guard model
This page describes how to use the Llama Guard model in SeekrFlow.
Full Model Guide: meta-llama/Llama-Guard-3-8B
Summary
Llama Guard 3 (8B) is an open-source large language model from Meta designed for content moderation and safety classification. It evaluates user-generated text across categories such as hate speech, harassment, violence, and misinformation. Now available in the SeekrFlow model library, this model can be deployed to flag unsafe content in chat, social platforms, RAG pipelines, and agent interactions.
- Moderation Type: Text-based classification
- Taxonomy: MLCommons Responsible AI (22-category hazard set)
- Usage Context: Real-time inference or post-hoc content safety auditing
Key Features
- 22 Risk Categories: based on the MLCommons Responsible AI taxonomy (e.g., hate, violence, self-harm, sexual content)
- Supports 1st or 2nd Person Framing: detects unsafe content in either user or assistant messages
- Multi-language Capable: optimized for English, but generalizes moderately to Spanish, French, German, etc.
- Simple Prompt Format: uses a clean, JSON-like input for user and assistant messages
- Open-source, Lightweight Model: uses the 8B Llama 3 base model for relatively fast, deployable inference
Target Input Type
- Text-based user or assistant messages (chat-based systems, generative agents, Q&A apps)
- Supports both single message and multi-turn conversation review
- Not built for audio or video moderation — text only
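For illustration, a single-message check and a multi-turn review use the same messages payload shape. The snippet below is a sketch only; the multi-turn message contents are invented examples.

```python
# Illustrative payload shapes only; the multi-turn contents are invented examples.
single_message_payload = {
    "messages": [
        {"role": "user", "content": "Hey, you suck and I hate you."}
    ]
}

multi_turn_payload = {
    "messages": [
        {"role": "user", "content": "How do I get back at my coworker?"},
        {"role": "assistant", "content": "I can't help with harming anyone."},
    ]
}
```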
Taxonomy Categories (MLCommons Hazard Set)
The model classifies text using the MLCommons Responsible AI hazard taxonomy, including:
- Hate
- Harassment
- Sexual content
- Self-harm
- Violence
- Criminal planning
- Weapons
- Drugs
- Alcohol
- Misinformation
- Health misinfo
- Legal misinfo
- Spam
- Profanity
- Insults
- Graphic content
- NSFW
- Solicitation
- Extremism
- Privacy violation
- Malicious code
- Financial harm
Each category returns a "safe" or "unsafe" label.
Languages Supported
- Primary: English
- Partial Generalization: Spanish, French, German, Portuguese, Italian, Dutch, etc.
- Model has not been fine-tuned for multi-language safety, so results may vary across languages.
How to Use in SeekrFlow (Step-by-Step)
1. Deploy the Model
   - Go to the SeekrFlow Model Library
   - Deploy meta-llama/Llama-Guard-3-8B
   - Copy the model ID
2. Format Your Input
   - Use the structured message format:

     {
       "messages": [
         {"role": "user", "content": "Hey, you suck and I hate you."}
       ]
     }

3. Run Inference via SDK or API
   - Pass the messages array with your deployed model ID (see the sketch after this list)
   - Results will return one or more unsafe category flags
4. Handle Unsafe Output
   - Optional: block, redact, route to human review, or rephrase based on the categories flagged
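As a rough sketch of step 3, the call below sends the messages array to the deployed model over plain HTTP. The endpoint URL, header name, and environment variables are assumptions for illustration; substitute the values from your SeekrFlow deployment and the API reference.

```python
import os

import requests

# Assumptions for illustration: the endpoint path, auth header, and exact
# response shape should be taken from your SeekrFlow deployment and API docs.
SEEKR_API_URL = os.environ.get(
    "SEEKR_API_URL", "https://YOUR-SEEKRFLOW-HOST/v1/chat/completions"
)
SEEKR_API_KEY = os.environ["SEEKR_API_KEY"]
MODEL_ID = "meta-llama/Llama-Guard-3-8B"  # replace with your deployed model ID


def moderate(messages):
    """Send a messages array to the deployed Llama Guard model and return the parsed JSON response."""
    response = requests.post(
        SEEKR_API_URL,
        headers={"Authorization": f"Bearer {SEEKR_API_KEY}"},
        json={"model": MODEL_ID, "messages": messages},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


print(moderate([{"role": "user", "content": "Hey, you suck and I hate you."}]))
```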
Model Prompt Format & Best Practices
Input Format
{
  "messages": [
    { "role": "user", "content": "..." },
    { "role": "assistant", "content": "..." }
  ]
}
- Can include just user input or both user + assistant responses
- Order matters; system assumes a back-and-forth conversation
Output Format
{
  "unsafe": true,
  "categories": {
    "hate": true,
    "violence": false,
    "harassment": true,
    ...
  }
}
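Assuming the output shape shown above, a small post-processing helper can collect the flagged categories and decide what to do with the message. The escalation rules here are placeholders, not a recommended policy.

```python
# Assumes the output shape above: a top-level "unsafe" flag plus a boolean map
# of categories. Adjust the parsing to match the exact response you receive.
def flagged_categories(result: dict) -> list:
    """Return the names of categories flagged as unsafe."""
    return [name for name, hit in result.get("categories", {}).items() if hit]


def handle_message(result: dict, message: str) -> str:
    """Pass safe content through; apply a placeholder action otherwise."""
    if not result.get("unsafe", False):
        return message
    flags = flagged_categories(result)
    if "self-harm" in flags:
        return "[routed to human review]"          # placeholder escalation rule
    return "[removed: " + ", ".join(flags) + "]"   # placeholder block/redact rule
```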
Best Practices
- Keep messages short (single utterance or <300 tokens)
- Structure your input for clarity (don't mix system prompts with user messages)
- Use consistent formatting if auditing multi-turn chats
What SeekrFlow Provides
- Fully hosted version of Llama-Guard-3-8B
- Fast, scalable API and SDK integration
- Outputs content moderation flags per message
- Access control, observability, and logging
What You Can Build Separately
- Pipeline to generate or retrieve user messages
- Model inference logic (loop through messages, score each)
- Optional: dashboard, moderation UI, policy logic, score aggregation, redaction tools
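For the inference-logic piece, a minimal audit loop might score each message on its own and tally the flags, reusing the hypothetical moderate() helper sketched in the step-by-step section. How you aggregate or act on the results is up to you.

```python
# Sketch of a per-message audit loop; moderate() is the hypothetical helper
# from the step-by-step section, and the aggregation policy is only an example.
def audit_conversation(conversation: list) -> dict:
    report = {"unsafe_messages": 0, "category_counts": {}}
    for message in conversation:
        result = moderate([message])  # score each message independently
        if result.get("unsafe"):
            report["unsafe_messages"] += 1
            for name, hit in result.get("categories", {}).items():
                if hit:
                    report["category_counts"][name] = (
                        report["category_counts"].get(name, 0) + 1
                    )
    return report
```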
Summary Table
Feature | Provided by SeekrFlow | Built by You |
---|---|---|
Hosted model deployment | ✅ | |
SDK & API access | ✅ | |
Moderation scores (per message) | ✅ | |
Input structuring | | ✅ |
UI or dashboard | | ✅ |
Logging and analytics | ✅ (optional) | |
Moderation actions (block, flag, etc.) | | ✅ |
Limitations
- No confidence scores — only binary (safe/unsafe) output
- English-centric — lower accuracy in non-English text
- Context window is limited to ~8k tokens (short convos recommended)
- Not fine-tuned for extremely short, isolated phrases
- No audio, image, or video input support
Use Cases
- Chat moderation (real-time or post-hoc)
- Filtering unsafe input/output from LLMs
- RAG system safety layer before retrieval/inference
- Agent communication review or agent guardrails
- Comment section moderation for news or community platforms
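As one example of a safety layer, a RAG pipeline could run the user query through the deployed model before retrieval or generation. retrieve() and generate() below are placeholder stubs for your own pipeline, and moderate() is the hypothetical helper from the step-by-step section.

```python
# Hypothetical guardrail in front of a RAG pipeline. retrieve() and generate()
# are stand-ins for your own retrieval and generation steps; moderate() is the
# hypothetical helper sketched earlier on this page.
def retrieve(query: str) -> list:
    """Placeholder retrieval step; swap in your own vector search."""
    return []


def generate(query: str, docs: list) -> str:
    """Placeholder generation step; swap in your own LLM call."""
    return f"(answer to: {query})"


def guarded_answer(query: str) -> str:
    # Check the raw user query before any retrieval or generation runs.
    verdict = moderate([{"role": "user", "content": query}])
    if verdict.get("unsafe"):
        return "Sorry, I can't help with that request."
    return generate(query, retrieve(query))
```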
FAQs
Q: Can I use Llama Guard 3 for non-English content?
A: It may work on major Western languages, but it's optimized for English. Accuracy may drop otherwise.
Q: Can I use this on a full conversation or only individual messages?
A: You can submit a full message history in messages[], but it's best to keep it short for clarity and performance.
Q: Does it return a confidence score?
A: No, only binary true/false flags per category.
Q: Can I fine-tune or add my own taxonomy?
A: Not currently via SeekrFlow. You could host a modified version on your own infra if needed.
Q: How does this compare to Seekr ContentGuard?
A: Llama Guard is a general-purpose text moderation model covering a broad set of content categories. Seekr ContentGuard is specifically tuned for podcast episodes and uses different scoring systems (GARM + Civility).