Distributed Scraping (Multiple Workers) for Scaling

Our company is engaged in the development, support and maintenance of sites of any complexity. From simple one-page sites to large-scale cluster systems built on micro services. Experience of developers is confirmed by certificates from vendors.
Development and maintenance of all types of websites:
Informational websites or web applications
Business card websites, landing pages, corporate websites, online catalogs, quizzes, promo websites, blogs, news resources, informational portals, forums, aggregators
E-commerce websites or web applications
Online stores, B2B portals, marketplaces, online exchanges, cashback websites, exchanges, dropshipping platforms, product parsers
Business process management web applications
CRM systems, ERP systems, corporate portals, production management systems, information parsers
Electronic service websites or web applications
Classified ads platforms, online schools, online cinemas, website builders, portals for electronic services, video hosting platforms, thematic portals

These are just some of the technical types of websites we work with, and each of them can have its own specific features and functionality, as well as be customized to meet the specific needs and goals of the client.

Showing 1 of 1 servicesAll 2065 services
Distributed Scraping (Multiple Workers) for Scaling
Complex
~5 business days
FAQ
Our competencies:
Development stages
Latest works
  • image_web-applications_feedme_466_0.webp
    Development of a web application for FEEDME
    1171
  • image_ecommerce_furnoro_435_0.webp
    Development of an online store for the company FURNORO
    1094
  • image_crm_enviok_479_0.webp
    Development of a web application for Enviok
    831
  • image_crm_chasseurs_493_0.webp
    CRM development for Chasseurs
    879
  • image_website-sbh_0.png
    Website development for SBH Partners
    999
  • image_website-_0.png
    Website development for Red Pear
    453

Implementing Distributed Scraping (Multiple Workers) for Scaling

One worker is limited by network speed, target site restrictions, and proxy capacity. Distributed system allows horizontal scaling: 5 workers give not just ×5 speed — they work with different proxies, different IPs, parallel scrape different sections, and don't interfere with each other when writing results.

Overall Architecture

Coordinator (Scheduler)
        │
        ▼
  Task Queue (Redis + BullMQ / Celery / Sidekiq)
   ┌────┴────┐
   │         │
Worker-1  Worker-2  ...  Worker-N
(proxy-1) (proxy-2)     (proxy-N)
   │         │
   └────┬────┘
        ▼
  Shared Storage (PostgreSQL + S3)
        │
        ▼
  Deduplicator / Merger

Coordinator doesn't scrape itself — only generates tasks and monitors progress. Workers stateless: any can take any task.

Task Structure

{
  "task_id": "uuid-v4",
  "job_type": "scrape_product_list",
  "url": "https://target.com/catalog?page=42",
  "site_id": 7,
  "depth": 2,
  "retry_count": 0,
  "max_retries": 3,
  "proxy_pool": "residential",
  "created_at": "2025-03-01T12:00:00Z"
}

Coordinator: Task Generation

For paginated site coordinator first determines page count, then issues task per page:

class CatalogScheduler:
    def __init__(self, queue: Queue, site_config: dict):
        self.queue = queue
        self.config = site_config

    async def schedule_full_crawl(self):
        total_pages = await self.detect_total_pages(self.config['catalog_url'])

        tasks = [
            ScrapeTask(
                url=self.config['catalog_url'] + f"?page={i}",
                job_type="scrape_listing",
                site_id=self.config['id'],
            )
            for i in range(1, total_pages + 1)
        ]
        await self.queue.bulk_add(tasks, priority=5)
        logger.info(f"Scheduled {len(tasks)} tasks for site {self.config['id']}")

Listing page tasks spawn secondary tasks — for individual product cards. This is two-level queue with different priorities.

Proxy Management

Each worker tied to proxy pool. Rotation — round-robin with "quarantine" for banned IPs:

class ProxyRotator:
    def __init__(self, proxies: list[str]):
        self.proxies = proxies
        self.banned: dict[str, datetime] = {}
        self.idx = 0

    def get_proxy(self) -> str:
        for _ in range(len(self.proxies)):
            proxy = self.proxies[self.idx % len(self.proxies)]
            self.idx += 1
            ban_until = self.banned.get(proxy)
            if ban_until and ban_until > datetime.utcnow():
                continue
            return proxy
        raise NoProxyAvailable("All proxies are in cooldown")

    def report_banned(self, proxy: str, cooldown_minutes: int = 30):
        self.banned[proxy] = datetime.utcnow() + timedelta(minutes=cooldown_minutes)

Residential proxies rotated less often, datacenter — more often (detected faster).

Preventing Duplicate Work

Before adding URL to queue coordinator checks Bloom filter or Redis Set:

async def should_schedule(self, url: str) -> bool:
    normalized = normalize_url(url)  # remove UTM, trailing slash, etc.
    key = f"visited:{self.site_id}"
    return not await self.redis.sismember(key, normalized)

async def mark_visited(self, url: str):
    normalized = normalize_url(url)
    await self.redis.sadd(f"visited:{self.site_id}", normalized)
    await self.redis.expire(f"visited:{self.site_id}", 86400 * 7)

Bloom filter preferable for scale > 10M URLs: needs 50–100× less memory than Set, with acceptable error < 0.1%.

Consistent Result Writing

Multiple workers write to same DB simultaneously. Conflicts handled via INSERT ... ON CONFLICT DO UPDATE:

INSERT INTO scraped_products (site_id, external_id, url, data, scraped_at)
VALUES (%s, %s, %s, %s, NOW())
ON CONFLICT (site_id, external_id)
DO UPDATE SET
    data       = EXCLUDED.data,
    scraped_at = EXCLUDED.scraped_at,
    updated_at = NOW()
WHERE scraped_products.scraped_at < EXCLUDED.scraped_at - INTERVAL '1 hour';

Condition WHERE scraped_at < EXCLUDED.scraped_at - '1 hour' prevents two workers hitting same product from endlessly overwriting each other.

Scaling Workers

Workers run in Docker containers. Horizontal scaling via docker-compose scale or Kubernetes HPA:

# docker-compose.yml
services:
  scraper-worker:
    image: scraper:latest
    environment:
      - REDIS_URL=redis://redis:6379
      - DB_URL=postgresql://...
      - PROXY_LIST=/run/secrets/proxies
    deploy:
      replicas: 5
    restart: unless-stopped

When new workers added queue automatically distributes load — no manual config.

Progress Monitoring

Coordinator maintains counters in Redis:

scrape_job:{job_id}:total      → 2400  (total tasks)
scrape_job:{job_id}:done       → 1837  (completed)
scrape_job:{job_id}:failed     → 12    (errors)
scrape_job:{job_id}:started_at → timestamp

Estimated completion time = (total - done) / (done / elapsed_seconds).

Dashboard (BullMQ Board or custom UI) shows active tasks, processing speed, distribution across workers, error stats.

Typical Configurations and Performance

Workers Proxies Speed Suitable for
3 10 datacenter ~500 pages/min Catalogs up to 100k products
10 50 residential ~1500 pages/min Large marketplaces
20+ 100+ residential ~5000 pages/min Daily full crawls

Implementation Timeline

Basic distributed system with 2–3 workers, Redis queue, PostgreSQL storage — 8–10 business days. Dynamic proxy rotation, Bloom filter deduplication, Docker/K8s autoscaling, monitoring dashboard — additional 5–7 days.