Distributed Scraping (Multiple Workers) for Scaling

Our company is engaged in the development, support and maintenance of sites of any complexity. From simple one-page sites to large-scale cluster systems built on micro services. Experience of developers is confirmed by certificates from vendors.

8+Years of workmore info 900+Completed projectsmore info 100+In house employeesmore info 19+Partnersmore info

Development and maintenance of all types of websites:

Informational websites or web applications

Business card websites, landing pages, corporate websites, online catalogs, quizzes, promo websites, blogs, news resources, informational portals, forums, aggregators

E-commerce websites or web applications

Online stores, B2B portals, marketplaces, online exchanges, cashback websites, exchanges, dropshipping platforms, product parsers

Business process management web applications

CRM systems, ERP systems, corporate portals, production management systems, information parsers

Electronic service websites or web applications

Classified ads platforms, online schools, online cinemas, website builders, portals for electronic services, video hosting platforms, thematic portals

These are just some of the technical types of websites we work with, and each of them can have its own specific features and functionality, as well as be customized to meet the specific needs and goals of the client.

Offered services

Showing 1 of 1 servicesAll 2065 services

Distributed Scraping (Multiple Workers) for Scaling

Complex

~5 business days

FAQ

Our competencies:

Free consultation

Book a free consultation if you have any questions. A dedicated specialist will advise you.

Cost calculation

If you know what exactly you need to develop, or you already have a ready-made technical task.

Development stages

Latest works

Development of a web application for FEEDME
1171
Development of an online store for the company FURNORO
1094
Development of a web application for Enviok
831
CRM development for Chasseurs
879
Website development for SBH Partners
999
Website development for Red Pear
453

Show more works

Implementing Distributed Scraping (Multiple Workers) for Scaling

One worker is limited by network speed, target site restrictions, and proxy capacity. Distributed system allows horizontal scaling: 5 workers give not just ×5 speed — they work with different proxies, different IPs, parallel scrape different sections, and don't interfere with each other when writing results.

Overall Architecture

Coordinator (Scheduler)
        │
        ▼
  Task Queue (Redis + BullMQ / Celery / Sidekiq)
   ┌────┴────┐
   │         │
Worker-1  Worker-2  ...  Worker-N
(proxy-1) (proxy-2)     (proxy-N)
   │         │
   └────┬────┘
        ▼
  Shared Storage (PostgreSQL + S3)
        │
        ▼
  Deduplicator / Merger

Coordinator doesn't scrape itself — only generates tasks and monitors progress. Workers stateless: any can take any task.

Task Structure

{
  "task_id": "uuid-v4",
  "job_type": "scrape_product_list",
  "url": "https://target.com/catalog?page=42",
  "site_id": 7,
  "depth": 2,
  "retry_count": 0,
  "max_retries": 3,
  "proxy_pool": "residential",
  "created_at": "2025-03-01T12:00:00Z"
}

Coordinator: Task Generation

For paginated site coordinator first determines page count, then issues task per page:

class CatalogScheduler:
    def __init__(self, queue: Queue, site_config: dict):
        self.queue = queue
        self.config = site_config

    async def schedule_full_crawl(self):
        total_pages = await self.detect_total_pages(self.config['catalog_url'])

        tasks = [
            ScrapeTask(
                url=self.config['catalog_url'] + f"?page={i}",
                job_type="scrape_listing",
                site_id=self.config['id'],
            )
            for i in range(1, total_pages + 1)
        ]
        await self.queue.bulk_add(tasks, priority=5)
        logger.info(f"Scheduled {len(tasks)} tasks for site {self.config['id']}")

Listing page tasks spawn secondary tasks — for individual product cards. This is two-level queue with different priorities.

Proxy Management

Each worker tied to proxy pool. Rotation — round-robin with "quarantine" for banned IPs:

class ProxyRotator:
    def __init__(self, proxies: list[str]):
        self.proxies = proxies
        self.banned: dict[str, datetime] = {}
        self.idx = 0

    def get_proxy(self) -> str:
        for _ in range(len(self.proxies)):
            proxy = self.proxies[self.idx % len(self.proxies)]
            self.idx += 1
            ban_until = self.banned.get(proxy)
            if ban_until and ban_until > datetime.utcnow():
                continue
            return proxy
        raise NoProxyAvailable("All proxies are in cooldown")

    def report_banned(self, proxy: str, cooldown_minutes: int = 30):
        self.banned[proxy] = datetime.utcnow() + timedelta(minutes=cooldown_minutes)

Residential proxies rotated less often, datacenter — more often (detected faster).

Preventing Duplicate Work

Before adding URL to queue coordinator checks Bloom filter or Redis Set:

async def should_schedule(self, url: str) -> bool:
    normalized = normalize_url(url)  # remove UTM, trailing slash, etc.
    key = f"visited:{self.site_id}"
    return not await self.redis.sismember(key, normalized)

async def mark_visited(self, url: str):
    normalized = normalize_url(url)
    await self.redis.sadd(f"visited:{self.site_id}", normalized)
    await self.redis.expire(f"visited:{self.site_id}", 86400 * 7)

Bloom filter preferable for scale > 10M URLs: needs 50–100× less memory than Set, with acceptable error < 0.1%.

Consistent Result Writing

Multiple workers write to same DB simultaneously. Conflicts handled via INSERT ... ON CONFLICT DO UPDATE:

INSERT INTO scraped_products (site_id, external_id, url, data, scraped_at)
VALUES (%s, %s, %s, %s, NOW())
ON CONFLICT (site_id, external_id)
DO UPDATE SET
    data       = EXCLUDED.data,
    scraped_at = EXCLUDED.scraped_at,
    updated_at = NOW()
WHERE scraped_products.scraped_at < EXCLUDED.scraped_at - INTERVAL '1 hour';

Condition WHERE scraped_at < EXCLUDED.scraped_at - '1 hour' prevents two workers hitting same product from endlessly overwriting each other.

Scaling Workers

Workers run in Docker containers. Horizontal scaling via docker-compose scale or Kubernetes HPA:

# docker-compose.yml
services:
  scraper-worker:
    image: scraper:latest
    environment:
      - REDIS_URL=redis://redis:6379
      - DB_URL=postgresql://...
      - PROXY_LIST=/run/secrets/proxies
    deploy:
      replicas: 5
    restart: unless-stopped

When new workers added queue automatically distributes load — no manual config.

Progress Monitoring

Coordinator maintains counters in Redis:

scrape_job:{job_id}:total      → 2400  (total tasks)
scrape_job:{job_id}:done       → 1837  (completed)
scrape_job:{job_id}:failed     → 12    (errors)
scrape_job:{job_id}:started_at → timestamp

Estimated completion time = (total - done) / (done / elapsed_seconds).

Dashboard (BullMQ Board or custom UI) shows active tasks, processing speed, distribution across workers, error stats.

Typical Configurations and Performance

Workers	Proxies	Speed	Suitable for
3	10 datacenter	~500 pages/min	Catalogs up to 100k products
10	50 residential	~1500 pages/min	Large marketplaces
20+	100+ residential	~5000 pages/min	Daily full crawls

Implementation Timeline

Basic distributed system with 2–3 workers, Redis queue, PostgreSQL storage — 8–10 business days. Dynamic proxy rotation, Bloom filter deduplication, Docker/K8s autoscaling, monitoring dashboard — additional 5–7 days.