Development of Competitor Catalog Parser Bot
A competitor catalog parser is a competitive intelligence tool. The task is narrow: regularly fetch an up-to-date product list with prices, specifications, and stock levels. It is not a general scraping system but a specialized collector for one specific source. The result is a current copy of the competitor's catalog in your database.
Site Analysis Before Development
Before coding, analyze the target site:
- Catalog URL structure: pagination via ?page=N, infinite scroll, or tree navigation by categories
- Rendering: static HTML (fast and simple) or data loaded via XHR/fetch (requires interception or a headless browser)
- Protection: Cloudflare, rate limiting, authorization
- Data update frequency: how quickly new products appear and prices change
Typical minimum fields: SKU / article, title, price (regular + sale), availability, category, product page URL, collection date. For some niches, rating, review count, weight/dimensions, and brand also matter.
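A typed record keeps the collector and the storage layer in sync. A minimal sketch; the field names are illustrative, following the list above:

```python
from typing import Optional, TypedDict

class ProductRecord(TypedDict, total=False):
    # Minimum fields for any niche
    sku: str                    # SKU / article number from the source site
    title: str
    price: Optional[int]        # regular price; None if not shown
    price_sale: Optional[int]   # sale price, if any
    in_stock: bool
    category: str
    url: str                    # product page URL
    scraped_at: str             # ISO-8601 collection timestamp
    # Niche-specific extras
    rating: Optional[float]
    review_count: Optional[int]
    brand: str
```

`total=False` lets early pipeline stages emit partial records that are filled in later.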
Technical Implementation
For static sites, use httpx + parsel (or Cheerio for Node.js): async requests, a connection pool of 10–20 workers, and a 1–3 second delay between requests to the same domain.
import httpx
import asyncio
import random
from urllib.parse import urljoin
from parsel import Selector

UA_POOL = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
]

async def fetch_page(session: httpx.AsyncClient, url: str) -> str:
    headers = {
        'User-Agent': random.choice(UA_POOL),
        'Accept-Language': 'en-US,en;q=0.9',
    }
    resp = await session.get(url, headers=headers, timeout=15)
    resp.raise_for_status()
    return resp.text

async def parse_catalog_page(html: str, base_url: str) -> list[dict]:
    sel = Selector(html)
    products = []
    for item in sel.css('.product-card'):
        price_raw = item.css('.price::text').get('').strip()
        # Strip currency symbols and separators; None if no price is shown
        price = int(''.join(c for c in price_raw if c.isdigit())) if price_raw else None
        products.append({
            'title': item.css('.product-title::text').get('').strip(),
            'price': price,
            'sku': item.attrib.get('data-sku'),
            # urljoin handles both relative and absolute hrefs
            'url': urljoin(base_url, item.css('a::attr(href)').get('')),
            'in_stock': bool(item.css('.in-stock')),
            'image_url': item.css('img::attr(src)').get(),
        })
    return products
For SPAs, intercept the XHR API requests via Playwright. Many modern e-commerce sites fetch catalog data from their own API on page load, returning JSON with product info:
from playwright.async_api import async_playwright

async def intercept_catalog_api(catalog_url: str) -> list[dict]:
    products = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        async def handle_response(response):
            # Capture only the catalog API, ignore assets and trackers
            if '/api/catalog' in response.url and response.status == 200:
                try:
                    data = await response.json()
                    if 'products' in data:
                        products.extend(data['products'])
                except Exception:
                    pass  # non-JSON or truncated body

        page.on('response', handle_response)
        await page.goto(catalog_url, wait_until='networkidle')
        await browser.close()
    return products
If the API returns JSON directly, call it without a browser at all: this is 10–20x faster. To find the endpoint, watch the DevTools Network tab while browsing the catalog manually.
Pagination and Full Traversal
For ?page=N pagination, traverse sequentially until an empty page:
async def scrape_full_catalog(base_url: str) -> list[dict]:
    all_products = []
    page_num = 1
    async with httpx.AsyncClient() as session:
        while True:
            url = f'{base_url}?page={page_num}'
            html = await fetch_page(session, url)
            products = await parse_catalog_page(html, base_url)
            if not products:
                break  # empty page means the catalog is exhausted
            all_products.extend(products)
            page_num += 1
            await asyncio.sleep(random.uniform(1.5, 3.0))  # polite delay
    return all_products
For a category tree, recursively gather all category URLs first, then traverse each with pagination.
Storage and Incremental Updates
CREATE TABLE competitor_products (
    id SERIAL PRIMARY KEY,
    source VARCHAR(100) NOT NULL,  -- 'competitor_a', 'competitor_b'
    external_id VARCHAR(255) NOT NULL,
    title TEXT NOT NULL,
    price DECIMAL(10,2),
    price_sale DECIMAL(10,2),
    in_stock BOOLEAN DEFAULT TRUE,
    category VARCHAR(500),
    url TEXT NOT NULL,
    image_url TEXT,
    attributes JSONB DEFAULT '{}',
    first_seen TIMESTAMPTZ DEFAULT NOW(),
    last_seen TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(source, external_id)
);

CREATE TABLE competitor_price_history (
    id BIGSERIAL PRIMARY KEY,
    product_id INT REFERENCES competitor_products(id),
    price DECIMAL(10,2),
    price_sale DECIMAL(10,2),
    in_stock BOOLEAN,
    scraped_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX ON competitor_price_history(product_id, scraped_at DESC);
On each re-traversal, run INSERT ... ON CONFLICT (source, external_id) DO UPDATE SET last_seen = NOW(), price = EXCLUDED.price, .... Write a history entry only when the price or stock changed (compare with the previous entry via LAG(), or keep the last price in the main table).
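Against the tables above, the pair of statements can be sketched as follows (the literal values stand in for bound parameters from one scraped row; the change check compares against the latest history entry):

```sql
-- Upsert one scraped row into the main table
INSERT INTO competitor_products (source, external_id, title, price, price_sale, in_stock, url)
VALUES ('competitor_a', 'SKU-123', 'Widget', 19.90, NULL, TRUE, 'https://example.com/p/123')
ON CONFLICT (source, external_id) DO UPDATE
SET last_seen  = NOW(),
    title      = EXCLUDED.title,
    price      = EXCLUDED.price,
    price_sale = EXCLUDED.price_sale,
    in_stock   = EXCLUDED.in_stock;

-- Append to history only when price or stock differs from the latest entry
INSERT INTO competitor_price_history (product_id, price, price_sale, in_stock)
SELECT cp.id, cp.price, cp.price_sale, cp.in_stock
FROM competitor_products cp
WHERE cp.source = 'competitor_a' AND cp.external_id = 'SKU-123'
  AND NOT EXISTS (
      SELECT 1
      FROM competitor_price_history h
      WHERE h.product_id = cp.id
        AND h.scraped_at = (SELECT MAX(h2.scraped_at)
                            FROM competitor_price_history h2
                            WHERE h2.product_id = cp.id)
        AND h.price IS NOT DISTINCT FROM cp.price
        AND h.in_stock IS NOT DISTINCT FROM cp.in_stock
  );
```

IS NOT DISTINCT FROM treats two NULLs as equal, so a product with no listed price does not spam the history table on every run.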
Schedule and Notifications
Scheduling: Celery Beat or Node.js cron. The recommended frequency for a competitor catalog is every 4–12 hours, depending on price dynamics in the niche. For marketplaces with rapidly changing prices, poll hourly for top products.
To alert when a competitor's price drops below yours, use an SQL query or a PostgreSQL trigger with a webhook notification to Slack/Telegram. Example query:
SELECT cp.title, cp.price AS competitor_price, mp.price AS my_price
FROM competitor_products cp
JOIN my_products mp ON mp.sku = cp.external_id
WHERE cp.source = 'competitor_a'
AND cp.price < mp.price
AND cp.in_stock = TRUE
ORDER BY (mp.price - cp.price) DESC;
Handling Site Structure Changes
Competitor sites change, and the parser breaks. Breakage signs: zero results on a traversal, a sharp drop in the number of found products, empty fields in 80%+ of records. Monitoring: alert if the last run collected less than 50% of the average product count.
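The run-size check itself is a few lines; a sketch, assuming you log the product count of each run (the 50% threshold is the one suggested above):

```python
def run_looks_broken(current_count: int, recent_counts: list[int],
                     threshold: float = 0.5) -> bool:
    # Compare the latest run against the average of recent runs.
    # With no history yet, only a completely empty run is suspicious.
    if not recent_counts:
        return current_count == 0
    avg = sum(recent_counts) / len(recent_counts)
    return current_count < threshold * avg
```

Wire the True branch into the same Slack/Telegram notification channel as the price alerts.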
Timeline
A static catalog parser (1 site, up to 50k products): 3–5 days. With XHR interception and Playwright: 5–8 days. Price history, alerts, and a dashboard: another 3–5 days. Support: when the competitor site's structure changes, a parser update usually takes 2–4 hours.