Implementing AI Content Moderation for User-Generated Content
User-generated content — comments, reviews, images, chat messages — requires moderation. Manual review doesn't scale: at 10,000 posts per day, a team of five moderators would have to clear 2,000 items each. AI moderation handles the bulk automatically, leaving only borderline cases to humans.
What Can Be Automatically Moderated
Text content: spam, profanity, hate speech, threats, exposed personal data. Images: explicit material, violence, copyright infringement (via perceptual hashing). Links: phishing, malicious domains. Tone: toxic comments without explicit banned words.
Each category requires its own model or API endpoint — there's no universal solution.
Moderation System Architecture
Synchronous pre-publication check — user submits content, server checks before saving. Latency 200–800ms. Suitable for critical scenarios: paid reviews, legally significant posts.
Asynchronous queue — content is saved with status pending, background worker checks via queue (RabbitMQ, SQS, Redis Streams). Publication happens after approval or after N minutes if no violations. Suitable for high-load forums and chats.
Hybrid scheme — fast synchronous check by simple rules (banned words, length, patterns) + asynchronous ML-check for content that passed the initial filter.
```
POST /api/comment
  → sync:  banned-words check (< 5 ms)
  → sync:  OpenAI Moderation API (< 300 ms)
  → save with status=published/flagged
  → async: image scan if attachments present
```
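The asynchronous path can be sketched as a worker loop. The in-memory queue, the `store` dict, and the status values below are illustrative assumptions, not tied to any specific broker:

```python
import queue

STATUS_PENDING, STATUS_PUBLISHED, STATUS_FLAGGED = "pending", "published", "flagged"

def run_worker(tasks: "queue.Queue", store: dict, check) -> None:
    # store maps content_id -> {"text": ..., "status": ...};
    # check is the moderation call, returning {"allowed": bool, ...}
    while True:
        content_id = tasks.get()
        if content_id is None:  # sentinel: stop the worker
            break
        item = store[content_id]
        verdict = check(item["text"])
        item["status"] = STATUS_PUBLISHED if verdict["allowed"] else STATUS_FLAGGED
        tasks.task_done()
```

In production the same loop consumes from RabbitMQ, SQS, or Redis Streams instead of a local `queue.Queue`, and `store` is the database row holding the content status.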
Tools and APIs
OpenAI Moderation API — free `/v1/moderations` endpoint. Returns violation categories: hate, hate/threatening, self-harm, sexual, violence, harassment. Text only. No prompt engineering needed: it is a dedicated classification model, not a chat model.
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def moderate_text(content: str) -> dict:
    response = client.moderations.create(input=content)
    result = response.results[0]
    if result.flagged:
        # keep only the categories that actually triggered
        categories = {k: v for k, v in result.categories.__dict__.items() if v}
        return {"allowed": False, "categories": categories}
    return {"allowed": True}
```
Google Perspective API — toxicity analysis with score 0 to 1. Attributes: TOXICITY, SEVERE_TOXICITY, IDENTITY_ATTACK, INSULT, PROFANITY, THREAT. Supports multilingual content. Free quota: 1 QPS, paid from $0.25 per 1000 requests.
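A Perspective request is a single JSON POST to the public Comment Analyzer endpoint. A minimal sketch using only the standard library; the `build_request` and `toxicity_score` helper names are illustrative, the URL and field names follow the public API:

```python
import json
import urllib.request

# public Comment Analyzer endpoint; the API key goes in a query parameter
PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def build_request(text: str) -> dict:
    # request body: the comment text plus the attributes to score
    return {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }

def toxicity_score(response_json: dict) -> float:
    # summaryScore.value is the overall 0..1 toxicity score
    return response_json["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

def analyze(text: str, api_key: str) -> float:
    req = urllib.request.Request(
        f"{PERSPECTIVE_URL}?key={api_key}",
        data=json.dumps(build_request(text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return toxicity_score(json.load(resp))
```

Scores above a chosen threshold (say, TOXICITY > 0.8) can be auto-flagged, with the 0.4–0.7 band routed to manual review.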
AWS Rekognition — image moderation. API DetectModerationLabels returns label hierarchy with confidence score. Categories: Explicit Nudity, Violence, Visually Disturbing, Hate Symbols.
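With boto3, a DetectModerationLabels call might look like the sketch below. The `flatten_labels` helper and the 60% default threshold are illustrative choices, not part of the AWS API:

```python
def flatten_labels(response: dict, min_confidence: float = 60.0) -> list:
    # keep (name, parent, confidence) for labels at or above the threshold;
    # ParentName is absent for top-level categories
    return [
        (label["Name"], label.get("ParentName", ""), label["Confidence"])
        for label in response["ModerationLabels"]
        if label["Confidence"] >= min_confidence
    ]

def moderate_image(image_bytes: bytes, min_confidence: float = 60.0) -> list:
    import boto3  # imported lazily so flatten_labels works without boto3 installed

    client = boto3.client("rekognition")
    response = client.detect_moderation_labels(
        Image={"Bytes": image_bytes},
        MinConfidence=min_confidence,
    )
    return flatten_labels(response, min_confidence)
```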
Azure Content Safety — text and images in one API. Categories: hate, sexual, violence, self-harm. Each scored 0–6. Includes Groundedness Detection for response verification.
Custom Fine-Tuned Model
For domain-specific content (technical forum with specialized terminology, medical platform), third-party APIs produce many false positives. Solution — fine-tune on your own data.
Process: gather dataset of 2000–5000 labeled examples (approved/rejected), fine-tune distilbert-base-multilingual-cased via Hugging Face Transformers, deploy as separate service.
```python
from transformers import pipeline

# load the fine-tuned checkpoint once at process startup
classifier = pipeline(
    "text-classification",
    model="./moderation-model",
    device=0,  # GPU; use device=-1 for CPU inference
)

def classify_content(text: str) -> tuple[str, float]:
    # truncate to the model's 512-token context window
    result = classifier(text, truncation=True, max_length=512)[0]
    return result["label"], result["score"]
```
Inference on CPU — ~50ms for text up to 512 tokens. On GPU (T4) — ~5ms.
Image Processing
Before sending to an API, preprocess: resize to 2048px on the longest side, convert to JPEG at quality 85, strip EXIF metadata. This reduces cost and speeds up responses.
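A preprocessing pass with Pillow might look like this sketch. The 2048px and quality-85 values come from the text above; the function name is arbitrary. Re-encoding via `save()` without an `exif` argument writes no EXIF metadata:

```python
import io

from PIL import Image

MAX_SIDE = 2048

def preprocess(image_bytes: bytes) -> bytes:
    img = Image.open(io.BytesIO(image_bytes))
    # downscale so the longest side is at most MAX_SIDE, keeping aspect ratio
    img.thumbnail((MAX_SIDE, MAX_SIDE))
    # JPEG has no alpha channel, so convert to RGB before re-encoding;
    # save() without an exif argument drops the original metadata
    out = io.BytesIO()
    img.convert("RGB").save(out, format="JPEG", quality=85)
    return out.getvalue()
```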
To protect against re-uploads of known-banned content, use PhotoDNA (Microsoft) or pHash comparison against a hash database. PhotoDNA integrates via Azure; pHash can be implemented in-house:
```python
import imagehash
from PIL import Image

def compute_phash(image_path: str) -> str:
    img = Image.open(image_path)
    return str(imagehash.phash(img))

def is_known_violation(phash: str, banned_hashes: set, threshold: int = 10) -> bool:
    candidate = imagehash.hex_to_hash(phash)  # parse once, outside the loop
    for banned in banned_hashes:
        # "-" between ImageHash objects is the Hamming distance
        if candidate - imagehash.hex_to_hash(banned) < threshold:
            return True
    return False
```
Manual Moderation Dashboard
Automation doesn't decide borderline cases; those are routed to a human moderator. The manual queue contains:
- content with confidence 0.4–0.7 (uncertain result)
- content reported by users
- content from new accounts without history
UI: list with filters, hotkeys for quick decisions (approve/reject/escalate), decision history tied to operator, accuracy metrics per operator.
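The routing rules above reduce to a single predicate. The thresholds and the field names (`score`, `reported`, `account_age_days`) are illustrative assumptions:

```python
def needs_manual_review(
    score: float,
    reported: bool,
    account_age_days: int,
    low: float = 0.4,
    high: float = 0.7,
) -> bool:
    # uncertain model confidence
    if low <= score <= high:
        return True
    # reported by other users
    if reported:
        return True
    # new account without posting history (7 days is an assumed cutoff)
    if account_age_days < 7:
        return True
    return False
```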
Feedback and Retraining
Model degrades as content patterns change. Improvement cycle:
- Save all decisions (automatic and manual) with labels
- Weekly analyze discrepancies: where automation failed, moderator corrected
- Monthly retrain model on accumulated corrections
- A/B test new version on 10% traffic before full rollout
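Step two of the cycle, finding cases where automation and the moderator disagreed, is a filter over the decision log. The record shape below is an assumption; the disagreements become retraining examples with the manual label as ground truth:

```python
def find_discrepancies(decisions: list[dict]) -> list[dict]:
    # a decision record is assumed to look like:
    #   {"content_id": 1, "auto": "approve", "manual": "reject"}
    # where "manual" is None if the item never reached a human
    return [
        d for d in decisions
        if d["manual"] is not None and d["manual"] != d["auto"]
    ]
```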
Monitoring
Metrics for Grafana/Datadog:
- `moderation.requests.total` — total request volume
- `moderation.latency.p99` — 99th-percentile latency
- `moderation.flagged.rate` — share of blocked content
- `moderation.false_positive.rate` — share of incorrectly blocked content (measured via appeals)
- `moderation.queue.depth` — manual moderation queue depth

Alert: if `moderation.false_positive.rate` exceeds 5% over 24 hours, the model needs review.
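The false-positive rate behind that alert can be computed from appeal outcomes. A sketch with an assumed input shape (total auto-blocked items and appeals upheld in the window):

```python
ALERT_THRESHOLD = 0.05  # the 5% threshold from the alert rule

def false_positive_rate(blocked_total: int, appeals_upheld: int) -> float:
    # share of auto-blocked items later overturned on appeal
    if blocked_total == 0:
        return 0.0
    return appeals_upheld / blocked_total

def should_alert(blocked_total: int, appeals_upheld: int) -> bool:
    return false_positive_rate(blocked_total, appeals_upheld) > ALERT_THRESHOLD
```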
Timeline
| Stage | Duration |
|---|---|
| OpenAI Moderation API + basic rules integration | 3–5 days |
| Asynchronous queue + content statuses | 3–4 days |
| Manual moderation dashboard | 5–7 days |
| Image moderation (AWS Rekognition) | 2–3 days |
| Fine-tune custom model | 10–15 days |
| Retraining cycle + monitoring | 3–5 days |
Basic integration with the OpenAI Moderation API plus a manual review queue takes about 2 weeks. A full system with a custom model, monitoring, and a dashboard takes 5–6 weeks.