Competitor Product Catalog Parser Development

A competitor catalog parser is a competitive intelligence tool with a narrow task: regularly pull an up-to-date product list with prices, specifications, and stock levels. It is not a general-purpose scraping system but a specialized collector for one specific source. The result is a current copy of the competitor's catalog in your database.

Site Analysis Before Development

Before writing any code, analyze the target site:

  • Catalog URL structure: pagination via ?page=N, infinite scroll, or tree navigation by categories
  • Rendering: static HTML (fast and simple) or data loaded via XHR/fetch (need interception or headless)
  • Protection: Cloudflare, rate limiting, authorization
  • Data update frequency: how quickly new products appear and prices change
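The rendering question from the checklist above can often be answered programmatically: fetch the catalog page with a plain HTTP client and check whether product markup is already present in the raw HTML. A minimal sketch; the `product-card` marker is an assumption, substitute the target site's real class name:

```python
import re

def looks_server_rendered(html: str, marker: str = 'product-card', min_hits: int = 3) -> bool:
    """Heuristic: if the raw HTML already contains several product-card
    markers, the catalog is server-rendered and a plain HTTP client is
    enough; otherwise the data is probably loaded via XHR/fetch."""
    return len(re.findall(re.escape(marker), html)) >= min_hits

# Usage sketch: html = httpx.get(catalog_url).text; looks_server_rendered(html)
```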

Typical minimum fields: SKU / article number, title, price (regular + sale), availability, category, product page URL, and collection date. In some niches rating, review count, weight/dimensions, and brand also matter.

Technical Implementation

For static sites, use httpx + parsel (or Cheerio for Node.js): async requests, a connection pool of 10–20 workers, and a 1–3 second delay between requests to the same domain.

import httpx
import asyncio
import random
from urllib.parse import urljoin
from parsel import Selector

UA_POOL = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
]

async def fetch_page(session: httpx.AsyncClient, url: str) -> str:
    headers = {
        'User-Agent': random.choice(UA_POOL),
        'Accept-Language': 'en-US,en;q=0.9',
    }
    resp = await session.get(url, headers=headers, timeout=15)
    resp.raise_for_status()
    return resp.text

async def parse_catalog_page(html: str, base_url: str) -> list[dict]:
    sel = Selector(html)
    products = []

    for item in sel.css('.product-card'):
        price_raw = item.css('.price::text').get('').strip()
        digits = ''.join(c for c in price_raw if c.isdigit())
        price = int(digits) if digits else None  # '19.99' -> 1999 (minor units); 'N/A' -> None

        products.append({
            'title': item.css('.product-title::text').get('').strip(),
            'price': price,
            'sku': item.attrib.get('data-sku'),
            # urljoin handles both relative and absolute hrefs
            'url': urljoin(base_url, item.css('a::attr(href)').get('')),
            'in_stock': bool(item.css('.in-stock')),
            'image_url': item.css('img::attr(src)').get(),
        })

    return products

For SPAs with XHR, intercept API requests via Playwright. Many modern e-commerce sites fetch data from their own API on page load, returning JSON with product info:

from playwright.async_api import async_playwright
import json

async def intercept_catalog_api(catalog_url: str) -> list[dict]:
    products = []

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        async def handle_response(response):
            if '/api/catalog' in response.url and response.status == 200:
                try:
                    data = await response.json()
                    if 'products' in data:
                        products.extend(data['products'])
                except Exception:
                    pass

        page.on('response', handle_response)
        await page.goto(catalog_url, wait_until='networkidle')
        await browser.close()

    return products

If the API returns JSON directly, call it without the browser: this is 10–20x faster. To find the endpoint, watch the DevTools Network tab while browsing the catalog manually.
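Once the endpoint is known, the direct call is simple. A sketch, where the response keys and the `page` parameter are assumptions to be checked against the real API in DevTools:

```python
def extract_products(payload: dict) -> list[dict]:
    """Pull the product list out of a catalog API response.
    The 'products'/'items'/'data' keys are guesses; inspect the
    real response shape in the DevTools Network tab."""
    for key in ('products', 'items', 'data'):
        if isinstance(payload.get(key), list):
            return payload[key]
    return []

async def fetch_catalog_api(api_url: str, page: int) -> list[dict]:
    """Call the discovered endpoint directly, no browser involved."""
    import httpx  # imported here so the pure helper above has no dependency
    async with httpx.AsyncClient(timeout=15) as client:
        resp = await client.get(api_url, params={'page': page})
        resp.raise_for_status()
        return extract_products(resp.json())
```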

Pagination and Full Traversal

For ?page=N pagination, traverse pages sequentially until an empty one:

async def scrape_full_catalog(base_url: str) -> list[dict]:
    all_products = []
    page_num = 1

    async with httpx.AsyncClient() as session:
        while True:
            url = f'{base_url}?page={page_num}'
            html = await fetch_page(session, url)
            products = await parse_catalog_page(html, base_url)

            if not products:
                break

            all_products.extend(products)
            page_num += 1
            await asyncio.sleep(random.uniform(1.5, 3.0))  # polite delay

    return all_products

For a category tree, recursively gather all category URLs first, then traverse each one with pagination.
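The tree traversal can be sketched independently of the HTTP layer by injecting a function that lists subcategory URLs; in practice it would download each page and extract links with parsel, here it is a stub:

```python
from typing import Callable

def collect_category_urls(root: str, get_children: Callable[[str], list[str]]) -> list[str]:
    """Depth-first walk of the category tree. Returns leaf-category URLs,
    each of which is then traversed with the usual ?page=N pagination."""
    children = get_children(root)
    if not children:
        return [root]
    leaves = []
    for child in children:
        leaves.extend(collect_category_urls(child, get_children))
    return leaves
```

Usage with a stubbed tree: `collect_category_urls('/catalog', lambda u: tree.get(u, []))` where `tree` maps a category URL to its subcategory URLs.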

Storage and Incremental Updates

CREATE TABLE competitor_products (
  id           SERIAL PRIMARY KEY,
  source       VARCHAR(100) NOT NULL,      -- 'competitor_a', 'competitor_b'
  external_id  VARCHAR(255) NOT NULL,
  title        TEXT NOT NULL,
  price        DECIMAL(10,2),
  price_sale   DECIMAL(10,2),
  in_stock     BOOLEAN DEFAULT TRUE,
  category     VARCHAR(500),
  url          TEXT NOT NULL,
  image_url    TEXT,
  attributes   JSONB DEFAULT '{}',
  first_seen   TIMESTAMPTZ DEFAULT NOW(),
  last_seen    TIMESTAMPTZ DEFAULT NOW(),
  UNIQUE(source, external_id)
);

CREATE TABLE competitor_price_history (
  id         BIGSERIAL PRIMARY KEY,
  product_id INT REFERENCES competitor_products(id),
  price      DECIMAL(10,2),
  price_sale DECIMAL(10,2),
  in_stock   BOOLEAN,
  scraped_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX ON competitor_price_history(product_id, scraped_at DESC);

On re-traversal, upsert: INSERT ... ON CONFLICT (source, external_id) DO UPDATE SET last_seen = NOW(), price = EXCLUDED.price, .... Write a history entry only when the price or stock has changed: compare against the previous entry via LAG(), or keep the latest price in the main table and compare before inserting.
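The upsert plus change-only history can be sketched with Python's sqlite3, whose INSERT ... ON CONFLICT syntax mirrors PostgreSQL's; the tables here are simplified versions of the schema above:

```python
import sqlite3

def upsert_product(conn, source, external_id, title, price, in_stock):
    """Upsert the product row; add a history entry only when price
    or stock actually changed since the previous run."""
    prev = conn.execute(
        'SELECT price, in_stock FROM competitor_products WHERE source=? AND external_id=?',
        (source, external_id)).fetchone()
    conn.execute('''
        INSERT INTO competitor_products (source, external_id, title, price, in_stock)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT(source, external_id) DO UPDATE SET
            title=excluded.title, price=excluded.price,
            in_stock=excluded.in_stock, last_seen=CURRENT_TIMESTAMP''',
        (source, external_id, title, price, in_stock))
    if prev is None or (prev[0], prev[1]) != (price, in_stock):
        pid = conn.execute(
            'SELECT id FROM competitor_products WHERE source=? AND external_id=?',
            (source, external_id)).fetchone()[0]
        conn.execute(
            'INSERT INTO competitor_price_history (product_id, price, in_stock) VALUES (?, ?, ?)',
            (pid, price, in_stock))
```

With a real PostgreSQL deployment the same statement runs through psycopg or asyncpg; only the parameter placeholders change.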

Schedule and Notifications

Use Celery Beat or Node.js cron. Recommended frequency for a competitor catalog: every 4–12 hours, depending on how dynamic prices are in the niche. For marketplaces with rapidly changing prices, scrape top products hourly.

Alert when a competitor's price drops below yours: an SQL query, or a PostgreSQL trigger with a webhook notification to Slack/Telegram. Example query:

SELECT cp.title, cp.price AS competitor_price, mp.price AS my_price
FROM competitor_products cp
JOIN my_products mp ON mp.sku = cp.external_id
WHERE cp.source = 'competitor_a'
  AND cp.price < mp.price
  AND cp.in_stock = TRUE
ORDER BY (mp.price - cp.price) DESC;
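Rows from this query can be formatted into a message before the webhook call. A small helper (the message wording is illustrative):

```python
def format_price_alert(rows: list[tuple]) -> str:
    """rows: (title, competitor_price, my_price) tuples from the alert query."""
    lines = ['Competitor undercut alert:']
    for title, competitor_price, my_price in rows:
        diff = my_price - competitor_price
        lines.append(f'- {title}: theirs {competitor_price}, ours {my_price} (gap {diff})')
    return '\n'.join(lines)
```

The resulting string can then be posted to a Slack incoming webhook or the Telegram Bot API sendMessage endpoint with a single HTTP request.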

Handling Site Structure Changes

Competitor sites change, and the parser breaks. Signs of breakage: zero results on traversal, a sharp drop in the number of products found, or empty fields in 80%+ of records. Monitoring: alert if the last run collected less than 50% of the average product count.
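The 50%-of-average breakage alert reduces to a comparison over recent run sizes:

```python
def run_looks_broken(current_count: int, recent_counts: list[int], threshold: float = 0.5) -> bool:
    """True if the latest run collected less than `threshold` of the
    average product count over recent runs, or nothing at all."""
    if current_count == 0:
        return True
    if not recent_counts:
        return False  # no baseline yet
    avg = sum(recent_counts) / len(recent_counts)
    return current_count < threshold * avg
```

Feed it the product counts of, say, the last 10 runs from the database and fire the webhook when it returns True.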

Timeline

A static catalog parser (1 site, up to 50k products): 3–5 days. With XHR interception and Playwright: 5–8 days. Price history, alerts, and a dashboard: another 3–5 days. Support: when the competitor site's structure changes, a parser update usually takes 2–4 hours.