Implementation of Anti-Scraping Protection Bypass (CAPTCHA, Rate Limiting)
Industrial anti-scraping systems such as DataDome, Cloudflare Bot Management, PerimeterX, and Akamai Bot Manager evaluate dozens of signals at once. Bypassing them requires understanding the specific system in use and combining several techniques.
Protection Classification
Level 1: Rate limiting. Simple IP-based protection: more than N requests per second from one address triggers a block. Countered by proxy rotation and a lower request rate.
Level 2: Browser fingerprinting. Checks navigator.webdriver, the canvas fingerprint, WebGL rendering, the audio context, and the plugin list. Detects headless browsers that have not been masked.
Level 3: Behavioral analysis. ML models on the protection side evaluate mouse movement patterns, timing between actions, and event order. This can distinguish bots from humans even when the fingerprint is correct.
Level 4: CAPTCHA. Visual or behavioral challenges: Google reCAPTCHA v2/v3, hCaptcha, Arkose Labs (FunCaptcha), Cloudflare Turnstile.
Rate Limiting Bypass
import asyncio
import random
from aiohttp import ClientSession

async def fetch_with_delay(session, url, semaphore):
    async with semaphore:
        # Normally distributed delay (mean 3 s, sigma 0.5 s), clamped at zero
        await asyncio.sleep(max(0.0, 2 + random.gauss(1, 0.5)))
        async with session.get(url) as resp:
            return await resp.text()

semaphore = asyncio.Semaphore(3)  # max 3 simultaneous requests
Normally distributed random delays are harder to flag than fixed intervals because the resulting request pattern looks closer to human browsing.
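When the server still answers with 429, waiting before the retry matters as much as the delay between regular requests. The following is a minimal sketch of one common approach, not part of the original setup: it prefers the server's Retry-After header when that is a plain number of seconds and otherwise falls back to capped exponential backoff with jitter. The helper name `backoff_delay` and the defaults (1 s base, 60 s cap) are assumptions chosen for illustration.

```python
import random


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based).

    Hypothetical helper: uses the Retry-After value when it is a plain
    number of seconds, otherwise capped exponential backoff with jitter.
    """
    if retry_after is not None:
        try:
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # Retry-After may also be an HTTP date; fall back below
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick a random point between zero and the current ceiling
    return rng.uniform(0.0, ceiling)
```

In `fetch_with_delay`, the result could be passed to `asyncio.sleep` after a 429 response, with `resp.headers.get('Retry-After')` as the `retry_after` argument.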
Stealth for Playwright
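const { chromium } = require('playwright');
// Assumes the installed playwright-stealth build exports stealth(context);
// check the package README for the exact entry point.
const { stealth } = require('playwright-stealth');

// CommonJS has no top-level await, so the calls run inside an async function
(async () => {
  const browser = await chromium.launch({
    args: [
      '--disable-blink-features=AutomationControlled',
      '--no-sandbox',
    ],
  });
  const context = await browser.newContext({
    userAgent: getRandomUserAgent(), // helper defined elsewhere
    locale: 'en-US',
    timezoneId: 'America/New_York',
    geolocation: { longitude: -74.0060, latitude: 40.7128 },
    permissions: ['geolocation'],
  });
  await stealth(context);
})();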
playwright-stealth patches more than 30 detectable properties, including navigator.webdriver, window.chrome, navigator.languages, and canvas noise.
CAPTCHA Solution
Automatic solving via services:
| Service | CAPTCHA types | Typical solve time | Solved by |
|---|---|---|---|
| 2captcha | reCAPTCHA v2/v3, hCaptcha, Turnstile | 5–30 sec | Humans |
| Anti-Captcha | reCAPTCHA v2/v3, ImageToText | 7–15 sec | Humans |
| CapSolver | reCAPTCHA v3, Arkose Labs | 1–3 sec | AI |
| NopeCHA | hCaptcha, reCAPTCHA | 2–10 sec | AI |
from twocaptcha import TwoCaptcha

solver = TwoCaptcha(API_KEY)  # API_KEY: your 2captcha account key

# reCAPTCHA v2
result = solver.recaptcha(
    sitekey='6LfXXXXXXXXXXXXXXXXXXXXX',
    url='https://example.com/page'
)
token = result['code']  # token to insert into the form
reCAPTCHA v3 requires a token with a high score. CapSolver specializes in this.
Proxy Infrastructure
Proxy quality is critical. Effectiveness hierarchy:
- Residential proxies (Bright Data, Oxylabs, Smartproxy): real IPs of home users. Expensive and rarely blocked
- Mobile proxies: 4G/5G operator IPs. High trust score; usually priced at or above residential
- ISP proxies (static residential): a fixed IP assigned by an internet provider
- Data-center proxies: cheap but easily blocked by serious protections
class ProxyRotator:
    def __init__(self, proxies: list):
        self.proxies = proxies
        self.stats = {p: {'success': 0, 'fail': 0} for p in proxies}

    def get_best_proxy(self):
        # choose the proxy with the highest success percentage
        return max(
            self.proxies,
            key=lambda p: self.stats[p]['success'] /
                max(self.stats[p]['success'] + self.stats[p]['fail'], 1)
        )

    def report_success(self, proxy):
        self.stats[proxy]['success'] += 1

    def report_fail(self, proxy):
        self.stats[proxy]['fail'] += 1
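One property of ProxyRotator is worth noting: a proxy with no history scores 0, so once any proxy has a success the untried ones tend never to be selected. The sketch below is a hypothetical variant, not the original design, that ranks untried proxies ahead of everything else so each one gets sampled before success ratios decide. The class name `ExploringProxyRotator` and the single `report` method are assumptions made for this example.

```python
class ExploringProxyRotator:
    """Hypothetical ProxyRotator variant that tries unused proxies first."""

    def __init__(self, proxies):
        self.proxies = list(proxies)
        self.stats = {p: {'success': 0, 'fail': 0} for p in self.proxies}

    def _score(self, proxy):
        s = self.stats[proxy]
        total = s['success'] + s['fail']
        ratio = s['success'] / total if total else 0.0
        # Tuples compare element by element: untried proxies (True) rank
        # above tried ones (False); the success ratio breaks the tie.
        return (total == 0, ratio)

    def get_best_proxy(self):
        return max(self.proxies, key=self._score)

    def report(self, proxy, ok):
        # Record the outcome of one request made through `proxy`
        self.stats[proxy]['success' if ok else 'fail'] += 1
```

A typical loop would call `get_best_proxy()`, make the request, then call `report(proxy, ok)` with the outcome.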
Cookie and Session Management
Session cookies are an important signal for protection systems. A bot that does not accumulate cookies across several pages looks suspicious.
# Playwright for Python (async API, inside an async function): save cookies and local storage
await context.storage_state(path='session.json')
# Next run: restore the saved state into a new context
context = await browser.new_context(storage_state='session.json')
For complex sites, warm up the session first: visit the main page and a few random pages, imitate scrolling, and only then proceed to the target URLs.
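The ordering part of that warm-up can be sketched as a small planning function. This is an illustration under assumptions: the name `build_visit_plan`, its parameters, and the default of three warm-up pages are hypothetical, and the actual navigation and scrolling are left to the browser automation code.

```python
import random


def build_visit_plan(home_url, browse_urls, target_urls, warmup_pages=3, rng=random):
    """Return URLs in visit order: home page, a few random pages, then targets.

    Hypothetical helper; it only decides the order and performs no requests.
    """
    pool = list(browse_urls)
    k = min(warmup_pages, len(pool))
    warmup = rng.sample(pool, k)  # distinct random pages for the warm-up
    return [home_url, *warmup, *target_urls]
```

Passing a seeded `random.Random` instance as `rng` makes the plan reproducible, which helps when debugging a session that got blocked.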
Detecting Protection Algorithm Change
Protection systems update their algorithms, so monitoring is needed:
- Track HTTP statuses: a rise in 403/429/503 responses means the trigger should be investigated
- Compare fingerprinting requests over time (the JavaScript that DataDome loads)
- Alert when the share of successfully parsed pages drops below a threshold
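The status-tracking and alerting points can be sketched as a rolling-window counter. This is a minimal illustration, assuming a hypothetical class name `BlockRateMonitor`, a window of the last 100 responses, and a 20% threshold; none of these values come from the original, and a real deployment would tune them per site.

```python
from collections import deque

# Status codes treated as signs of blocking, as listed above
BLOCK_STATUSES = {403, 429, 503}


class BlockRateMonitor:
    """Hypothetical monitor: share of blocked responses in a rolling window."""

    def __init__(self, window=100, max_block_rate=0.2):
        self.window = deque(maxlen=window)  # oldest entries drop automatically
        self.max_block_rate = max_block_rate

    def record(self, status):
        self.window.append(status in BLOCK_STATUSES)

    def block_rate(self):
        return sum(self.window) / len(self.window) if self.window else 0.0

    def should_alert(self):
        # Wait for a full window so a single early 403 does not raise an alert
        full = len(self.window) == self.window.maxlen
        return full and self.block_rate() > self.max_block_rate
```

Calling `record(resp.status)` after each request and checking `should_alert()` periodically is enough to notice a sudden change in how the site responds.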
Timeline
Basic rate limiting + stealth bypass: 3–5 days. Full system with CAPTCHA solver, proxy rotator, monitoring: 12–18 days.







