Content Aggregator Development

Our company develops, supports, and maintains websites of any complexity, from simple one-page sites to large-scale clustered systems built on microservices. Our developers' expertise is confirmed by vendor certificates.
Development and maintenance of all types of websites:
  • Informational websites and web applications: business card websites, landing pages, corporate websites, online catalogs, quizzes, promo websites, blogs, news resources, informational portals, forums, aggregators
  • E-commerce websites and web applications: online stores, B2B portals, marketplaces, online exchanges, cashback websites, dropshipping platforms, product parsers
  • Business process management web applications: CRM systems, ERP systems, corporate portals, production management systems, information parsers
  • Electronic service websites and web applications: classified ads platforms, online schools, online cinemas, website builders, portals for electronic services, video hosting platforms, thematic portals

These are just some of the technical types of websites we work with, and each of them can have its own specific features and functionality, as well as be customized to meet the specific needs and goals of the client.

Latest works:
  • Development of a web application for FEEDME
  • Development of an online store for the company FURNORO
  • Development of a web application for Enviok
  • CRM development for Chasseurs
  • Website development for SBH Partners
  • Website development for Red Pear

A content aggregator collects materials from multiple external sources (RSS, API, web scraping), organizes and deduplicates them, and publishes them as a unified feed. Technically, this is a pipeline: parsing → normalization → deduplication → storage → display → personalization.

Data Collection Architecture

Content sources:

  • RSS/Atom—standard format for news and blogs
  • REST API—official platform APIs (Reddit, Twitter/X, YouTube Data API)
  • Web scraping—for sites without RSS/API
  • Email newsletters—parsing incoming messages
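The RSS/Atom path can be sketched with the standard library alone. A minimal sketch for RSS 2.0: production fetchers also need Atom support, XML namespaces, conditional GET, and tolerance for broken markup, which dedicated libraries such as rss-parser or Python's feedparser handle.

```python
import xml.etree.ElementTree as ET

def parse_rss(xml_text: str) -> list[dict]:
    """Extract items from an RSS 2.0 feed (minimal sketch)."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default="").strip(),
            "url": item.findtext("link", default="").strip(),
            "published_at": item.findtext("pubDate", default="").strip(),
        })
    return items

# Illustrative feed; real fetchers download this per source on a schedule
feed = """<rss version="2.0"><channel><title>News</title>
<item><title>Rates held</title><link>https://example.com/a</link>
<pubDate>Mon, 06 May 2024 10:00:00 GMT</pubDate></item>
</channel></rss>"""
```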

Pipeline components:

Scheduler (cron every N min)
    ↓
Fetcher Queue (Bull/BullMQ per source)
    ↓
Parser (RSS: rss-parser, HTML: Cheerio/Playwright)
    ↓
Normalizer (field unification: title, url, body, published_at, source_id, image_url)
    ↓
Deduplicator
    ↓
PostgreSQL storage
    ↓
Indexer (Elasticsearch/Meilisearch)

Deduplication

Duplicates occur when a single story is published by multiple sources. Methods:

Exact URL match: if the URL is already in the database → skip. Works only for identical URLs, so it misses the same story republished under different links.

Title hash: hash(normalize(title))—normalization removes punctuation and extra whitespace, then MD5/SHA1. Effective for identical headlines.
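The title-hash approach can be sketched with the standard library; the exact normalization rules (lowercasing, accent stripping) are an assumption beyond what the text specifies:

```python
import hashlib
import re
import unicodedata

def title_fingerprint(title: str) -> str:
    """Normalize a headline, then hash it: lowercase, strip accents,
    remove punctuation, collapse whitespace, MD5."""
    t = unicodedata.normalize("NFKD", title).lower()
    t = re.sub(r"[^\w\s]", "", t)       # remove punctuation
    t = re.sub(r"\s+", " ", t).strip()  # collapse whitespace
    return hashlib.md5(t.encode("utf-8")).hexdigest()
```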

SimHash / MinHash: algorithms for approximate near-duplicate detection. Documents whose SimHash Hamming distance is below a threshold N are considered duplicates. Implementations exist as Python packages (e.g., simhash, simhash-py).

from simhash import Simhash

def is_duplicate(text1: str, text2: str, threshold: int = 5) -> bool:
    # Build 64-bit fingerprints from whitespace-separated tokens
    h1, h2 = Simhash(text1.split()), Simhash(text2.split())
    # Small Hamming distance between fingerprints → near-duplicates
    return h1.distance(h2) < threshold

Content Parsing and Normalization

RSS parsing is straightforward; extracting the main text of an article from arbitrary HTML is harder. Tools:

  • Readability (Mozilla algorithm)—@mozilla/readability (Node.js) extracts main text, removing navigation and ads
  • Trafilatura (Python)—text extraction with language detection
  • Playwright—for JavaScript-heavy sites requiring full rendering
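To illustrate the idea behind these extractors, here is a deliberately naive stdlib sketch that keeps paragraph text and skips common boilerplate containers. Real extractors such as Readability and Trafilatura instead score every block by text density, link density, and tag heuristics; this is not their algorithm.

```python
from html.parser import HTMLParser

class MainTextExtractor(HTMLParser):
    """Naive main-text extraction: collect <p> text outside
    boilerplate containers (nav, aside, footer, etc.)."""
    SKIP = {"nav", "aside", "footer", "header", "script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # nesting level of boilerplate containers
        self.in_p = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_p = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p and data.strip():
            self.chunks.append(data.strip())

def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```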

Categorization and Tags

Automatic article classification by topic:

  • Keyword matching—rules: if headline contains "dollar", "stock market", "central bank" → "Finance" category
  • ML classification—fastText or simple BERT-based classifier for multi-label classification
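The keyword-matching rule from the list above can be sketched in a few lines. The rule table is illustrative; real keyword lists are much larger, per-language, and should match on word boundaries rather than raw substrings:

```python
CATEGORY_RULES = {
    # Illustrative rules; substring matching shown for brevity
    "Finance": ["dollar", "stock market", "central bank"],
    "Tech": ["startup", "smartphone", "chip"],
}

def categorize(headline: str) -> list[str]:
    """Multi-label keyword matching over a lowercased headline."""
    text = headline.lower()
    return [cat for cat, kws in CATEGORY_RULES.items()
            if any(kw in text for kw in kws)]
```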

Language detection: langdetect (Python) or franc (Node.js) determines the language of the material.

Feed Personalization

Users select sources and categories. Filtering:

  • Enabled/disabled sources
  • Enabled/disabled categories
  • Keywords (track topic)
  • Negative keywords (hide unwanted topics)
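The four filters above combine into one pass over the feed. A sketch with illustrative field and preference names; here tracked keywords narrow the feed when set, which is one possible interpretation of "track topic":

```python
def filter_feed(items: list[dict], prefs: dict) -> list[dict]:
    """Apply user preferences: enabled sources/categories,
    tracked keywords, negative keywords."""
    out = []
    for it in items:
        if it["source_id"] not in prefs["sources"]:
            continue                     # source disabled
        if it["category"] not in prefs["categories"]:
            continue                     # category disabled
        title = it["title"].lower()
        if any(kw in title for kw in prefs.get("negative_keywords", [])):
            continue                     # unwanted topic
        tracked = prefs.get("keywords", [])
        if tracked and not any(kw in title for kw in tracked):
            continue                     # not a tracked topic
        out.append(it)
    return out
```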

Algorithmic ranking (optional): content similar to what the user read before ranks higher. Collaborative filtering: "users with similar interests read this".
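A minimal content-based version of this ranking: score each item by how often the user has read its category before. This is an illustrative sketch only; production systems use embeddings or collaborative filtering over user-item matrices.

```python
from collections import Counter

def rank_feed(items: list[dict], read_history: list[dict]) -> list[dict]:
    """Rank items by the user's past interest in each category
    (content-based sketch; field names are assumptions)."""
    category_reads = Counter(it["category"] for it in read_history)
    return sorted(items,
                  key=lambda it: category_reads[it["category"]],
                  reverse=True)
```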

Copyright Compliance

Aggregators must:

  • Show only brief previews (lead + source link), not full text
  • Respect robots.txt when scraping
  • Follow source rate limits
  • Credit source and author

Fair use allows snippets, but not full reprinting.
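The "brief preview" requirement above can be enforced at publish time by truncating the lead at a word boundary and always appending the source link. The 200-character limit is an assumption; the acceptable snippet length depends on each source's terms.

```python
def make_preview(body: str, url: str, limit: int = 200) -> str:
    """Truncate the lead to `limit` characters at a word boundary
    and credit the source with a link."""
    if len(body) <= limit:
        return f"{body} Source: {url}"
    cut = body[:limit].rsplit(" ", 1)[0]  # avoid cutting mid-word
    return f"{cut}… Source: {url}"
```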

Monetization

  • Premium filter subscription (more sources, notifications)
  • Native ads in feed
  • API access to normalized content

Timeline

MVP (10–20 RSS sources, feed, basic search, categories): 4–6 weeks. Full-featured aggregator with ML classification, web scraping, personalization, and API: 3–5 months.