Price Aggregator Development
A price aggregator collects prices for identical products from different stores and shows them side by side. The user sees the cheapest offer and clicks through. Technically this means parsing, data normalization, and product matching. Each stage is non-trivial at scale.
Data Sources
Data arrives three ways:
Price Lists and Feeds — the store provides a YML, XML, or CSV file with its current assortment. The most reliable option: structured data, an official partnership, no ban risk. Yandex.Market YML is the de facto standard for the Russian market.
Partner APIs — some stores provide a REST API. Documentation is often weak and request limits are strict.
Web Parsing — for stores without feeds. High risk: captchas, rate limiting, markup changes, IP blocking. Requires constant maintenance.
Start with feeds and APIs — they are more stable. Use parsing selectively, for key sources only.
Data Collector Architecture
Scheduler (Celery Beat / Laravel Scheduler)
↓ every N hours
FeedFetcher workers (one per source)
↓
RawData storage (S3 or local FS)
↓
Parser workers (XML/CSV/JSON → normalized)
↓
Normalizer (unit conversion, text cleanup)
↓
Matcher (map to DB products)
↓
PriceHistory (timeseries write)
↓
ElasticsearchIndexer (update index)
Queue: Celery + Redis for Python, Laravel Horizon for PHP. Each feed is processed independently, so an error in one source doesn't block the others.
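The per-source isolation can be sketched without a queue at all — a minimal loop where each fetcher is tried independently and a failure is logged rather than propagated (in production each entry would be a separate Celery task; the registry shape here is a hypothetical simplification):

```python
import logging
from typing import Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("collector")

def run_all_feeds(fetchers: dict[str, Callable[[], list]]) -> dict[str, list]:
    """Run every feed fetcher; one broken source must not stop the rest."""
    results: dict[str, list] = {}
    for name, fetch in fetchers.items():
        try:
            results[name] = fetch()
        except Exception as exc:
            # Log and continue — the other sources still get collected
            log.error("feed %s failed: %s", name, exc)
            results[name] = []
    return results
```

A real queue gives the same guarantee with retries and concurrency on top; the point is that failure handling lives at the per-source boundary.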
Product Matching
The hardest part. The task: determine that "Samsung Galaxy A55 128GB Blue" from shop A and "Smartphone Samsung Galaxy A55 (SM-A556B) 128 GB blue" from shop B are the same product.
Deterministic:
- GTIN/EAN: if both offers have a barcode — exact match
- MPN: manufacturer part number, unique within a brand
- URL canonicalization: some stores include the GTIN in the URL
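Barcode matching is an exact lookup, but the codes must be normalized first — EAN-13 and UPC-A differ only in leading zeros, so padding to GTIN-14 makes them comparable. A minimal sketch (the catalog dict standing in for a real index table is an assumption):

```python
from typing import Optional

def normalize_gtin(code: str) -> str:
    """Strip non-digit noise and left-pad to GTIN-14 so EAN-13/UPC-A compare equal."""
    digits = "".join(ch for ch in code if ch.isdigit())
    return digits.zfill(14)

def match_by_gtin(offer_gtin: str, catalog: dict[str, int]) -> Optional[int]:
    """catalog maps normalized GTIN -> internal product id (hypothetical shape)."""
    return catalog.get(normalize_gtin(offer_gtin))
```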
Fuzzy:
from rapidfuzz import fuzz

def match_score(title_a: str, title_b: str, brand_a: str, brand_b: str) -> float:
    # Different brands never match, however similar the titles look
    if brand_a.lower() != brand_b.lower():
        return 0.0
    # token_sort_ratio is order-insensitive: "Galaxy A55 Samsung" ~ "Samsung Galaxy A55"
    title_similarity = fuzz.token_sort_ratio(title_a, title_b)
    return title_similarity / 100
Thresholds: 0.85+ — auto-match, 0.65–0.85 — manual review, below that — create a new product.
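The three-way decision above fits in a few lines; keeping the thresholds as named constants makes them easy to tune against review-queue volume (the function and constant names are illustrative):

```python
AUTO_MATCH_THRESHOLD = 0.85
REVIEW_THRESHOLD = 0.65

def route_match(score: float) -> str:
    """Map a match score onto the auto / review / new-product decision."""
    if score >= AUTO_MATCH_THRESHOLD:
        return "auto_match"
    if score >= REVIEW_THRESHOLD:
        return "manual_review"
    return "new_product"
```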
ML approach: product name embeddings (sentence-transformers, ruBERT) + cosine similarity. Much more accurate, especially for differently worded titles. The model is trained on confirmed matches.
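The comparison step reduces to cosine similarity between two vectors. A sketch of just that step — in practice the vectors would come from a sentence-transformers model (`model.encode(title)`); the placeholder arrays here are not real embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity in [-1, 1]; ~1.0 means the titles embed almost identically."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

The same 0.85 / 0.65 style thresholds apply, though they must be re-calibrated for embedding space — cosine scores are distributed differently from token-sort ratios.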
Price History
The main value is not the current price but the history of its changes. Each price change is recorded, never overwritten.
price_history (
id BIGSERIAL,
source_offer_id BIGINT,
price NUMERIC(12,2),
in_stock BOOLEAN,
recorded_at TIMESTAMPTZ DEFAULT NOW()
)
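The insert-only discipline means a new row is written only when price or availability actually changed, otherwise the table fills with duplicates on every collection run. A minimal in-memory sketch of that check (a list stands in for the `price_history` table; the row shape mirrors the schema above):

```python
from datetime import datetime, timezone

def record_price(history: list[dict], source_offer_id: int,
                 price: float, in_stock: bool) -> bool:
    """Append a row only if price or availability changed; returns True on insert."""
    last = next((row for row in reversed(history)
                 if row["source_offer_id"] == source_offer_id), None)
    if last and last["price"] == price and last["in_stock"] == in_stock:
        return False  # nothing changed -> no new row
    history.append({
        "source_offer_id": source_offer_id,
        "price": price,
        "in_stock": in_stock,
        "recorded_at": datetime.now(timezone.utc),
    })
    return True
```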
For PostgreSQL time series use TimescaleDB — the extension auto-partitions by time and speeds up queries. Alternatives: InfluxDB, or ClickHouse for high loads.
The graph is a standard product page component. Chart.js or Recharts, aggregated by day: SELECT date_trunc('day', recorded_at), min(price) FROM price_history GROUP BY 1.
SEO Strategy
Aggregators generate organic traffic on product pages. Key queries: "[product name] buy", "[product name] price", "[product name] cheap".
- Each canonical product page: unique title with price range
- Structured data: Product + AggregateOffer with lowPrice, highPrice, offerCount
- Static category pages with aggregated stats
- Blog reviews and curations — long-term SEO traffic
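The structured-data bullet translates into schema.org JSON-LD markup generated from the collected offers. A sketch of building that payload (the function name and currency default are illustrative; the `@type` and property names are standard schema.org vocabulary):

```python
import json

def aggregate_offer_jsonld(name: str, prices: list[float],
                           currency: str = "RUB") -> str:
    """Build schema.org Product + AggregateOffer JSON-LD from collected prices."""
    data = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "offers": {
            "@type": "AggregateOffer",
            "lowPrice": min(prices),
            "highPrice": max(prices),
            "offerCount": len(prices),
            "priceCurrency": currency,
        },
    }
    return json.dumps(data, ensure_ascii=False)
```

The resulting string goes into a `<script type="application/ld+json">` tag on the product page, which is what makes price ranges eligible for rich results.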
Timeline
- MVP: feeds from 3–5 sources, manual matching, product pages, basic search — 8–12 weeks
- Full aggregator: automatic matching (fuzzy + ML), price graphs, store cabinet, partner tracking — 20–30 weeks
- Each new source (parsing): 3–7 working days, depending on complexity
An aggregator requires ongoing operational support: sources change their structure, products need re-matching, new stores get connected. It is not a one-off project but a platform with a support team.







