Content Aggregator Development
A content aggregator collects material from multiple external sources (RSS, APIs, web scraping), organizes and deduplicates it, and publishes the result as a unified feed. Technically, this is a pipeline: parsing → normalization → deduplication → storage → display → personalization.
Data Collection Architecture
Content sources:
- RSS/Atom—standard format for news and blogs
- REST API—official platform APIs (Reddit, Twitter/X, YouTube Data API)
- Web scraping—for sites without RSS/API
- Email newsletters—parsing incoming messages
Pipeline components:
Scheduler (cron every N min)
↓
Fetcher Queue (Bull/BullMQ per source)
↓
Parser (RSS: rss-parser, HTML: Cheerio/Playwright)
↓
Normalizer (field unification: title, url, body, published_at, source_id, image_url)
↓
Deduplicator
↓
PostgreSQL storage
↓
Indexer (Elasticsearch/Meilisearch)
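The normalizer step above can be sketched as a small function that maps heterogeneous parsed items onto the unified schema (the `Item` dataclass and the RSS field fallbacks are illustrative assumptions, not a fixed interface):

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Item:
    # Unified schema from the pipeline: title, url, body, published_at, source_id, image_url
    title: str
    url: str
    body: str
    published_at: datetime
    source_id: str
    image_url: Optional[str] = None

def normalize_rss_entry(entry: dict, source_id: str) -> Item:
    """Map a parsed RSS entry (field names vary by feed) onto the unified schema."""
    return Item(
        title=(entry.get("title") or "").strip(),
        url=entry.get("link") or entry.get("url", ""),
        body=entry.get("summary") or entry.get("description", ""),
        # Fall back to fetch time when the feed omits a publication date
        published_at=entry.get("published_at") or datetime.now(timezone.utc),
        source_id=source_id,
        image_url=entry.get("image"),
    )
```

In practice each source type (RSS, API, scraper) gets its own adapter producing the same `Item`, so everything downstream of the normalizer is source-agnostic.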
Deduplication
Duplicates occur when a single story is published by multiple sources. Methods:
Exact URL match: if url already in database → skip. Works only for identical URLs.
Title hash: hash(normalize(title))—normalization removes punctuation and extra whitespace, then MD5/SHA1. Effective for identical headlines.
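A minimal sketch of the title-hash method, using MD5 as in the text (the exact normalization rules are an assumption):

```python
import hashlib
import re

def title_hash(title: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace, then MD5."""
    normalized = re.sub(r"[^\w\s]", "", title.lower())
    normalized = " ".join(normalized.split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()
```

Storing this hash in a unique-indexed column makes the duplicate check a single lookup at insert time.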
SimHash / MinHash: algorithms for approximate near-duplicate detection. Documents whose SimHash Hamming distance is below a threshold N are considered duplicates. Python implementations include the simhash and simhash-py packages.
from simhash import Simhash

def is_duplicate(text1: str, text2: str, threshold: int = 5) -> bool:
    # Near-duplicate check: Hamming distance between 64-bit SimHash fingerprints
    h1, h2 = Simhash(text1.split()), Simhash(text2.split())
    return h1.distance(h2) < threshold
Content Parsing and Normalization
RSS parsing is straightforward; extracting an article's main text from arbitrary HTML is harder. Tools:
- Readability (Mozilla algorithm)—@mozilla/readability (Node.js) extracts the main text, stripping navigation and ads
- Trafilatura (Python)—text extraction with built-in language detection
- Playwright—for JavaScript-heavy sites requiring full rendering
Categorization and Tags
Automatic article classification by topic:
- Keyword matching—rules: if headline contains "dollar", "stock market", "central bank" → "Finance" category
- ML classification—fastText or simple BERT-based classifier for multi-label classification
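The keyword-matching approach can be sketched as a rule table (the categories and keyword sets here are illustrative, not a recommended taxonomy):

```python
# Rule table: category -> trigger phrases; entries are illustrative examples.
RULES = {
    "Finance": {"dollar", "stock market", "central bank"},
    "Tech": {"startup", "software", "semiconductor"},
}

def classify(headline: str) -> list[str]:
    """Return every category whose trigger phrases appear in the headline."""
    text = headline.lower()
    return [cat for cat, keywords in RULES.items()
            if any(kw in text for kw in keywords)]
```

Note that plain substring matching over-triggers on short keywords (e.g. "art" inside "startup"); word-boundary regexes or tokenization are a common refinement before moving to an ML classifier.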
Language detection: langdetect (Python) or franc (Node.js) identify the language of each item.
Feed Personalization
Users select sources and categories. Filtering:
- Enabled/disabled sources
- Enabled/disabled categories
- Keywords (track topic)
- Negative keywords (hide unwanted topics)
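The four filters above combine into a single visibility check per feed item; a minimal sketch, assuming a hypothetical `Prefs` record and dict-shaped items:

```python
from dataclasses import dataclass, field

@dataclass
class Prefs:
    # Hypothetical per-user preference record
    sources: set[str] = field(default_factory=set)     # enabled source ids
    categories: set[str] = field(default_factory=set)  # enabled categories
    keywords: set[str] = field(default_factory=set)    # tracked topics (optional)
    blocked: set[str] = field(default_factory=set)     # negative keywords

def visible(item: dict, prefs: Prefs) -> bool:
    """Apply source, category, negative-keyword, and keyword filters to one item."""
    text = (item["title"] + " " + item.get("body", "")).lower()
    if item["source_id"] not in prefs.sources:
        return False
    if prefs.categories and item.get("category") not in prefs.categories:
        return False
    if any(kw in text for kw in prefs.blocked):
        return False
    if prefs.keywords and not any(kw in text for kw in prefs.keywords):
        return False
    return True
```

Negative keywords are checked before tracked keywords so a blocked topic always wins.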
Algorithmic ranking (optional): content similar to what the user read before ranks higher. Collaborative filtering: "users with similar interests read this".
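One simple form of "similar to what the user read before" is ranking by tag overlap with the reading history — a sketch using Jaccard similarity (the tag-based item shape is an assumption; real systems typically use embeddings or collaborative filtering):

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Similarity of two tag sets: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def rank(items: list[dict], read_tags: set[str]) -> list[dict]:
    """Order feed items by tag overlap with the user's reading history."""
    return sorted(items,
                  key=lambda it: jaccard(set(it["tags"]), read_tags),
                  reverse=True)
```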
Copyright Compliance
Aggregators must:
- Show only brief previews (lead + source link), not full text
- Respect robots.txt when scraping
- Follow source rate limits
- Credit source and author
Fair use allows snippets, but not full reprinting.
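The "brief preview" rule translates into a truncation helper at display time — a sketch, where the 200-character limit is an arbitrary assumption:

```python
def preview(text: str, limit: int = 200) -> str:
    """Truncate the lead at a word boundary; full text stays at the source."""
    if len(text) <= limit:
        return text
    # Cut at the last space within the limit to avoid splitting a word
    cut = text[:limit].rsplit(" ", 1)[0]
    return cut + "…"
```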
Monetization
- Premium filter subscription (more sources, notifications)
- Native ads in feed
- API access to normalized content
Timeline
MVP (10–20 RSS sources, feed, basic search, categories): 4–6 weeks. Full-featured aggregator with ML classification, web scraping, personalization, and API: 3–5 months.