Development of a Web Scraping System
Scraping is not just "download the HTML and extract tags." An industrial-grade data collection system includes request queue management, proxy rotation, anti-bot bypass, data normalization, and reliable storage. A script written once in Beautiful Soup is not a system; it's a script. The difference becomes obvious after a week of operation.
What a Full System Consists Of
Scheduler and task queue. Celery with Redis or RabbitMQ as the broker. Each URL is a separate task with a priority, a retry policy, and a TTL. Scrapy-cluster or a custom orchestrator coordinates multiple workers.
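In production this lives in Celery, but the core ideas (per-URL tasks with priority, retry budget, and TTL) can be sketched broker-free with the standard library. This is an illustrative sketch, not Celery's API; the class and field names are my own.

```python
import heapq
import time
from dataclasses import dataclass, field

@dataclass(order=True)
class UrlTask:
    priority: int                                  # lower value = served first
    url: str = field(compare=False)
    retries_left: int = field(default=3, compare=False)
    expires_at: float = field(                     # TTL: drop stale tasks
        default_factory=lambda: time.time() + 3600, compare=False)

class TaskQueue:
    """In-memory stand-in for a broker-backed priority queue."""

    def __init__(self):
        self._heap = []

    def put(self, task: UrlTask):
        heapq.heappush(self._heap, task)

    def get(self):
        """Pop the highest-priority task, silently skipping expired ones."""
        while self._heap:
            task = heapq.heappop(self._heap)
            if time.time() < task.expires_at:
                return task
        return None

    def retry(self, task: UrlTask):
        """Re-enqueue a failed task at lower priority until retries run out."""
        if task.retries_left > 0:
            task.retries_left -= 1
            task.priority += 1
            self.put(task)
```

With Celery the retry policy would instead be declared on the task itself (`max_retries`, `retry_backoff`), but the queueing semantics are the same.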
Page loader. Two modes:
- Static pages—httpx with async, connection pooling, keep-alive
- JavaScript rendering—Playwright (preferred) or Puppeteer, headless Chromium with browser profile management
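For the static mode, the essential pattern is bounded async concurrency over a pooled client. The sketch below uses a stub in place of an `httpx.AsyncClient` request so it stays self-contained; `fetch` and the URL format are placeholders, and in production the semaphore cap would mirror `httpx.Limits(max_connections=...)`.

```python
import asyncio

async def fetch(url: str) -> str:
    # Stand-in for an httpx.AsyncClient GET; sketch only.
    await asyncio.sleep(0)          # simulate network I/O yielding control
    return f"<html>{url}</html>"

async def fetch_all(urls, max_concurrency: int = 10):
    """Fetch pages concurrently, capped the way a connection pool caps sockets."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with sem:
            return await fetch(url)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(u) for u in urls))
```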
Identity rotation. A proxy pool (residential or datacenter, depending on the target), User-Agent rotation drawn from real fingerprint datasets, randomized delays following a normal distribution, and cookie/session management.
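A minimal rotation sketch, assuming a hypothetical proxy pool and User-Agent list (in production the UA strings come from a real fingerprint dataset, not a hardcoded list):

```python
import random

USER_AGENTS = [  # hypothetical; sample from a real fingerprint dataset in production
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]
PROXIES = ["http://proxy1:8080", "http://proxy2:8080"]  # hypothetical pool

def next_identity(rng: random.Random) -> dict:
    """Pick a proxy/User-Agent pair and a human-like inter-request delay."""
    # Normally distributed delay, floored so it never goes near-zero
    delay = max(0.5, rng.gauss(mu=3.0, sigma=1.0))
    return {
        "proxy": rng.choice(PROXIES),
        "user_agent": rng.choice(USER_AGENTS),
        "delay_s": delay,
    }
```

A fixed-seed `random.Random` makes runs reproducible in tests while staying unpredictable in production.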
Data extraction. CSS selectors or XPath for stable page structures; parsel (an lxml wrapper) for more complex logic. If the structure is unstable, LLM extraction via the OpenAI API or a local Ollama model with few-shot prompts.
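The key defensive habit is a fallback chain of selectors, so a markup change does not silently zero out a field. Sketched here with the standard library's `html.parser` rather than parsel (where it would be `Selector(text=html).css(...)`); the class names in `TARGET_CLASSES` are hypothetical.

```python
from html.parser import HTMLParser

class PriceExtractor(HTMLParser):
    """Capture text from the first element matching any class in a fallback chain."""
    TARGET_CLASSES = ("price", "product-price")   # hypothetical fallback chain

    def __init__(self):
        super().__init__()
        self._capture = False
        self.value = None

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "") or ""
        if self.value is None and any(c in classes.split() for c in self.TARGET_CLASSES):
            self._capture = True

    def handle_data(self, data):
        if self._capture and data.strip():
            self.value = data.strip()
            self._capture = False
```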
Storage and normalization. Raw HTML goes to S3/MinIO for reprocessing; extracted data goes to PostgreSQL, or to ClickHouse for analytics over billions of records. Deduplication uses a URL hash plus a content hash.
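The two-hash deduplication can be sketched as follows; in production the `seen` set would be a unique index in PostgreSQL or ClickHouse rather than in-memory state.

```python
import hashlib

def fingerprint(url: str, html: str) -> tuple:
    """URL hash catches re-queued duplicates; content hash catches unchanged pages."""
    url_hash = hashlib.sha256(url.encode()).hexdigest()
    content_hash = hashlib.sha256(html.encode()).hexdigest()
    return (url_hash, content_hash)

class Deduplicator:
    def __init__(self):
        self.seen = set()   # in production: a unique index in the database

    def is_new(self, url: str, html: str) -> bool:
        key = fingerprint(url, html)
        if key in self.seen:
            return False
        self.seen.add(key)
        return True
```

Keeping both hashes means a re-crawled URL with changed content is still stored as a new version, which is exactly what change-history use cases need.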
Anti-Bot Protections and Workarounds
| Protection | Bypass Method |
|---|---|
| Rate limiting | Adaptive delays, distribution across IPs |
| CAPTCHA (reCAPTCHA v2/v3) | 2captcha/Anti-Captcha API or train own model |
| Cloudflare Bot Management | Playwright with real fingerprint, TLS fingerprint cycling |
| JavaScript challenges | Headless browser with full JS execution |
| Honeypot links | Filter invisible elements before visiting |
| IP reputation blocks | Residential proxy (BrightData, Oxylabs, Smartproxy) |
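The honeypot row deserves a concrete heuristic: before following a link, drop anything a human could not see or click. A sketch, assuming links arrive as attribute dicts (as they would from an attribute-level parse); the exact attributes checked are a judgment call.

```python
import re

# Inline styles that hide an element from a human visitor
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden")

def is_honeypot(link: dict) -> bool:
    """Heuristic: a link styled invisible or marked hidden is likely a trap."""
    if HIDDEN_STYLE.search(link.get("style", "")):
        return True
    if link.get("hidden") is not None or link.get("aria-hidden") == "true":
        return True
    return False

def visible_links(links):
    return [l for l in links if not is_honeypot(l)]
```

CSS-class-based hiding (`class="hidden"` resolved in a stylesheet) needs a rendered DOM to detect, which is one more argument for the headless-browser path on protected sites.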
Cloudflare with "Bot Fight Mode" enabled is the hardest case. Solution: Playwright driving real Chromium, stealth patches via puppeteer-extra-plugin-stealth or playwright-stealth, and mouse-movement imitation via CDP.
Architecture for High-Load Scraping
[Scheduler] -> [Redis Queue] -> [Fetcher Workers x N]
                                         |
                                [Parser Workers x M]
                                         |
                               [Raw Store S3] + [DB Writer]
                                         |
                                [Monitor / Dashboard]
Fetcher and Parser are separate workers with different resource profiles. The Fetcher is I/O-bound and can run 100+ async tasks per process; the Parser is CPU-bound, so run one process per core.
Data quality monitoring. Great Expectations or custom checks: percentage of non-empty fields, value ranges for numeric columns, identifier uniqueness. When quality degrades, send an alert to Slack/Telegram and pause the workers.
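A custom-check sketch in the spirit of Great Expectations; the field names, the 0.95 threshold, and the price range are illustrative assumptions, not values from the source.

```python
def quality_report(records, required_fields=("title", "price"),
                   price_range=(0, 1_000_000)):
    """Completeness, range, and uniqueness checks over a batch of records."""
    total = len(records)
    report = {}
    for f in required_fields:                       # non-empty-field percentage
        filled = sum(1 for r in records if r.get(f) not in (None, ""))
        report[f + "_fill_rate"] = filled / total if total else 0.0
    lo, hi = price_range                            # numeric range check
    in_range = sum(1 for r in records
                   if isinstance(r.get("price"), (int, float))
                   and lo <= r["price"] <= hi)
    report["price_in_range_rate"] = in_range / total if total else 0.0
    ids = [r.get("id") for r in records if r.get("id") is not None]
    report["ids_unique"] = len(ids) == len(set(ids))
    return report

def should_alert(report, threshold=0.95):
    """Trigger the Slack/Telegram alert and worker pause below the threshold."""
    rates = [v for k, v in report.items() if k.endswith("_rate")]
    return any(r < threshold for r in rates) or not report["ids_unique"]
```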
Legal and Ethical Aspects
Before launching: check robots.txt, review the site's Terms of Service, and assess the load your crawl places on the target server. For public data this is usually acceptable; restricted sections require explicit permission.
Stack and Implementation Timeline
Python stack: Scrapy / httpx + parsel, Playwright, Celery, PostgreSQL/ClickHouse, MinIO.
Timeline by stages:
- Basic parser for one site—3-5 days
- Queue + proxy rotation + retry—5-7 days
- JS rendering + anti-bot bypass—7-14 days
- Monitoring, normalization, storage—5-10 days
- Full system for 10+ sources—4-8 weeks
Use Cases
Competitor monitoring. Prices, assortment, and stock levels collected hourly, with change history.
Ad aggregation. OLX and Avito-like platforms: tens of thousands of records daily, with deduplication and address geocoding.
Research tasks. Dataset collection for ML, brand mention sentiment monitoring, SEO position analysis.
Content projects. News syndication, job aggregation, catalog building from open sources.
System Maintenance
Websites change, and parsers break. You need a breakage-detection strategy: page-schema comparison against a baseline, monitoring of the successful-extraction percentage, and automated fixture tests. A typical indicator of healthy operation is a 95%+ extraction success rate.
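The extraction-rate check can be sketched as a rolling window; the window size and minimum-sample guard are illustrative choices, while the 95% threshold comes from the indicator above.

```python
from collections import deque

class ExtractionMonitor:
    """Rolling success rate over the last N pages; flags parser breakage."""

    def __init__(self, window: int = 1000, threshold: float = 0.95):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def record(self, success: bool):
        self.window.append(success)

    @property
    def success_rate(self) -> float:
        return sum(self.window) / len(self.window) if self.window else 1.0

    def is_broken(self) -> bool:
        # Require a minimum sample so one failed page doesn't trip the alarm
        return len(self.window) >= 100 and self.success_rate < self.threshold
```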
A well-designed scraping system is not a one-time development effort but infrastructure with a lifecycle. Budget time for support: roughly 20% of the initial development time per year.