Implementation of Scraping via Puppeteer/Playwright (Headless Browser)
Headless browsers are the tool of choice for scraping sites that a static HTML parser can't handle: React/Vue/Angular pages, lazy-loaded content, data behind authentication, and dynamic tables with infinite scroll.
Puppeteer vs Playwright
| Parameter | Puppeteer | Playwright |
|---|---|---|
| Browsers | Chrome/Chromium (Firefox also supported) | Chromium, Firefox, WebKit (Safari's engine) |
| Language | Node.js | Node.js, Python, Java, C# |
| Auto-wait | No (explicit waits) | Yes (auto-waits for elements) |
| Development speed | Medium | Higher |
| Ecosystem maturity | High | Growing |
Playwright is generally preferred for new projects: its auto-waiting significantly reduces scraping errors, since there is no need to wait manually for each element.
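To make the auto-wait point concrete, here is a minimal sketch of reading a value through a Playwright locator. The helper name `readFirstPrice` and the `.price` selector are assumptions for illustration; `page` is expected to be a Playwright `Page` created by the caller.

```javascript
// Hypothetical helper: reads the text of the first element matching `selector`.
// `page` is a Playwright Page passed in by the caller.
async function readFirstPrice(page, selector = '.price') {
  // locator() is lazy: nothing is queried until an action is performed.
  const price = page.locator(selector).first();
  // textContent() waits (up to the default timeout) for the element to appear,
  // so no separate waitForSelector() call is needed before reading it.
  const text = await price.textContent();
  return text === null ? null : text.trim();
}
```

With Puppeteer's classic API the same read would typically be preceded by an explicit `page.waitForSelector(selector)` call.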
Typical Scraping Scenario
// Playwright: scrape a catalog with infinite scroll
// (ES module, so top-level await is available)
import { chromium } from 'playwright';

const browser = await chromium.launch({ headless: true });
const context = await browser.newContext({
  userAgent: 'Mozilla/5.0 ...',
  viewport: { width: 1280, height: 900 }
});
const page = await context.newPage();
await page.goto('https://example.com/catalog');

// Scroll to the bottom until the page height stops growing
let prevHeight = 0;
while (true) {
  const height = await page.evaluate(() => document.body.scrollHeight);
  if (height === prevHeight) break;
  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
  // Randomized pause so lazy-loaded items have time to render
  await page.waitForTimeout(1500 + Math.random() * 1000);
  prevHeight = height;
}

// Extract data from every product card
const items = await page.$$eval('.product-card', cards =>
  cards.map(card => ({
    title: card.querySelector('.title')?.textContent?.trim(),
    price: card.querySelector('.price')?.textContent?.trim(),
    url: card.querySelector('a')?.href
  }))
);

await browser.close();
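The scroll loop above stops only when the page height stops changing, so it would run indefinitely on a feed that never ends. One possible variant, sketched here under the assumption that a fixed cap on scroll rounds is acceptable, wraps the same logic in a helper. The name `scrollUntilStable` and its options are hypothetical.

```javascript
// Hypothetical helper: same scroll logic as above, with an upper bound on rounds.
// Returns the number of scrolls performed.
async function scrollUntilStable(page, { maxRounds = 30, pauseMs = 1500 } = {}) {
  let prevHeight = 0;
  for (let round = 0; round < maxRounds; round++) {
    const height = await page.evaluate(() => document.body.scrollHeight);
    if (height === prevHeight) return round; // nothing new was loaded
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await page.waitForTimeout(pauseMs);
    prevHeight = height;
  }
  return maxRounds; // cap reached; the page may still be growing
}
```

A caller would run `await scrollUntilStable(page, { maxRounds: 50 })` before extracting the items.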
Performance Optimization
Browser startup is expensive. For large-scale scraping:
- Browser context pool: one Chrome process, multiple isolated contexts
- Disable resources: block font, image, and analytics loading via page.route()
- Clustering: playwright-cluster or a self-written pool with worker_threads
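The context-pool idea could be sketched roughly as follows: one already-launched browser is shared, and each URL gets its own short-lived context. The function name `scrapeWithSharedBrowser`, the `scrapePage` callback, and the default concurrency of 4 are assumptions for illustration, not part of any library.

```javascript
// Hypothetical sketch: share one browser process across several isolated contexts.
// `browser` is a launched Playwright Browser; `scrapePage(page)` is supplied by the caller.
async function scrapeWithSharedBrowser(browser, urls, scrapePage, concurrency = 4) {
  const queue = [...urls];
  const results = [];

  async function worker() {
    while (queue.length > 0) {
      const url = queue.shift();
      // Each context has its own cookies and storage, but reuses the browser process.
      const context = await browser.newContext();
      try {
        const page = await context.newPage();
        await page.goto(url);
        results.push({ url, data: await scrapePage(page) });
      } finally {
        await context.close(); // always release the context, even on errors
      }
    }
  }

  const workerCount = Math.min(concurrency, urls.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Results arrive in completion order rather than input order, which is why each entry carries its `url`.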
Blocking unnecessary traffic can cut page load time by 40–70% and also lowers memory usage.
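A minimal sketch of such blocking with `page.route()` might look like this. The blocked resource types and the analytics host list are illustrative assumptions and should be tuned per target site; `shouldBlock` and `blockUnneededResources` are hypothetical helper names.

```javascript
// Assumption: the scraper only needs documents, scripts, and XHR/fetch responses.
const BLOCKED_TYPES = new Set(['image', 'font', 'media']);
// Illustrative analytics hosts; extend this list for the target site.
const BLOCKED_HOSTS = ['google-analytics.com', 'googletagmanager.com'];

// Decides whether a request should be aborted, based on its type and host.
function shouldBlock(resourceType, url) {
  if (BLOCKED_TYPES.has(resourceType)) return true;
  let host;
  try {
    host = new URL(url).hostname;
  } catch {
    return false; // not a parseable URL; let it through
  }
  return BLOCKED_HOSTS.some(h => host === h || host.endsWith('.' + h));
}

// Installs the filter on a Playwright Page supplied by the caller.
async function blockUnneededResources(page) {
  await page.route('**/*', route => {
    const request = route.request();
    return shouldBlock(request.resourceType(), request.url())
      ? route.abort()
      : route.continue();
  });
}
```

Calling `await blockUnneededResources(page)` before `page.goto()` applies the filter to the first navigation as well.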
Headless Browser Detection
Modern protections (DataDome, PerimeterX, Cloudflare Bot Management) detect automation by dozens of signals. Main bypass methods:
- playwright-stealth: patches navigator.webdriver and other detectable fields
- Realistic mouse movements via playwright-mouse-helper
- Unique fingerprints: different viewport, timezone, locale per session
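The per-session fingerprint idea can be sketched as a small generator of context options. The viewport and locale lists below are illustrative assumptions, as is the choice to keep locale and timezone paired; `makeSessionProfile` is a hypothetical helper whose output uses Playwright's `viewport`, `locale`, and `timezoneId` context options.

```javascript
// Illustrative pools; real projects would likely use larger, realistic sets.
const SESSION_VIEWPORTS = [
  { width: 1280, height: 800 },
  { width: 1366, height: 768 },
  { width: 1536, height: 864 },
  { width: 1920, height: 1080 },
];
// Locale and timezone are kept as pairs so they stay plausible together (a design assumption).
const SESSION_REGIONS = [
  { locale: 'en-US', timezoneId: 'America/New_York' },
  { locale: 'en-GB', timezoneId: 'Europe/London' },
  { locale: 'de-DE', timezoneId: 'Europe/Berlin' },
];

function pickRandom(list, random) {
  return list[Math.floor(random() * list.length)];
}

// Returns an options object suitable for browser.newContext().
// `random` is injectable so the choice can be made deterministic.
function makeSessionProfile(random = Math.random) {
  const viewport = pickRandom(SESSION_VIEWPORTS, random);
  const region = pickRandom(SESSION_REGIONS, random);
  return {
    viewport: { ...viewport },
    locale: region.locale,
    timezoneId: region.timezoneId,
  };
}
```

A session would then be opened with `await browser.newContext(makeSessionProfile())`.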
Timeline
Basic scraper for one site: 2–4 days. Scraper with protection bypass, proxy rotation, monitoring: 7–10 days.







