Competitor Site Structure Crawler
Analyzing a competitor's website structure by hand doesn't scale. Even a medium-sized project in a niche means 50–200 URLs to collect and analyze by nesting level, anchors, metadata, and markup schemas. A purpose-built crawler does this in minutes and can be rerun whenever a competitor rebrands or restructures.
Stack and Approach
Two working variants: Python with Scrapy/Playwright for complex SPAs with lazy loading, or Node.js with Puppeteer/Cheerio for most standard sites. For sites that don't require dynamic JS rendering, a plain HTTP client with an HTML parser is enough: 5–10 times faster and simpler to deploy.
Core Features
- HTTP crawling with requests/lxml
- JavaScript rendering with Playwright for SPAs
- Schema.org extraction
- Heading hierarchy analysis
- Export to JSON/CSV/SQLite
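As a sketch of the heading-hierarchy and Schema.org extraction, here is a stdlib-only parser. The class name and HTML sample are illustrative; a production version would use lxml or Playwright as discussed above, and Schema.org data can also live in microdata/RDFa, which this sketch ignores.

```python
import json
from html.parser import HTMLParser

class StructureParser(HTMLParser):
    """Collects the heading hierarchy and Schema.org JSON-LD blocks from a page."""
    def __init__(self):
        super().__init__()
        self.headings = []       # list of (level, text)
        self.schema_blocks = []  # parsed JSON-LD payloads
        self._heading = None     # heading level currently open, if any
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self._heading = int(tag[1])
            self._buffer = []
        elif tag == "script" and ("type", "application/ld+json") in attrs:
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._heading is not None or self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._heading is not None and tag == f"h{self._heading}":
            self.headings.append((self._heading, "".join(self._buffer).strip()))
            self._heading = None
        elif tag == "script" and self._in_jsonld:
            try:
                self.schema_blocks.append(json.loads("".join(self._buffer)))
            except ValueError:
                pass  # ignore malformed JSON-LD
            self._in_jsonld = False

html = """<html><body>
<h1>Pricing</h1><h2>Plans</h2>
<script type="application/ld+json">{"@type": "Product", "name": "Plan A"}</script>
</body></html>"""
parser = StructureParser()
parser.feed(html)
print(parser.headings)       # [(1, 'Pricing'), (2, 'Plans')]
print(parser.schema_blocks)  # [{'@type': 'Product', 'name': 'Plan A'}]
```

The `(level, text)` pairs are enough to reconstruct the H1–H6 hierarchy of each page and spot pages with broken heading order.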
Storage and Analysis
The collected structure can be exported to several formats, depending on the task:
- JSON: for further programmatic processing
- CSV: for analysis in Excel/Google Sheets
- SQLite: for comparing multiple competitors or tracking changes over time
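For the SQLite variant, a minimal sketch of what the storage could look like; the table layout, column names, and sample rows are assumptions, not a fixed schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for persistent storage
conn.execute("""
    CREATE TABLE pages (
        competitor TEXT NOT NULL,
        url        TEXT NOT NULL,
        h1         TEXT,
        crawled_at TEXT NOT NULL,  -- ISO date of the crawl run
        PRIMARY KEY (competitor, url, crawled_at)
    )
""")

rows = [
    ("acme.com", "https://acme.com/pricing", "Pricing", "2024-05-01"),
    ("acme.com", "https://acme.com/blog",    "Blog",    "2024-05-01"),
]
conn.executemany("INSERT INTO pages VALUES (?, ?, ?, ?)", rows)

# Pages per competitor in a given run: the kind of query SQLite makes
# trivial and a pile of JSON files makes awkward.
for competitor, count in conn.execute(
    "SELECT competitor, COUNT(*) FROM pages WHERE crawled_at = ? GROUP BY competitor",
    ("2024-05-01",),
):
    print(competitor, count)  # acme.com 2
```

Keying rows by `(competitor, url, crawled_at)` keeps every crawl run side by side, which is exactly what the diff step below needs.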
Regular Crawling and Diff
One-time data collection quickly becomes outdated. Competitors change structure, add sections, and reformat headings. It is useful to set up automatic runs weekly or monthly and compare the results.
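The comparison itself can be as simple as set operations over two crawl snapshots. The snapshot format below (a dict mapping URL to H1 text) is an assumption for the sketch:

```python
def diff_crawls(old, new):
    """Compare two crawl snapshots (url -> h1) and report what changed."""
    added   = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(u for u in set(old) & set(new) if old[u] != new[u])
    return {"added": added, "removed": removed, "changed": changed}

old = {"/": "Home", "/pricing": "Pricing", "/blog": "Blog"}
new = {"/": "Home", "/pricing": "Plans and Pricing", "/careers": "Careers"}
print(diff_crawls(old, new))
# {'added': ['/careers'], 'removed': ['/blog'], 'changed': ['/pricing']}
```

The same three buckets (added, removed, changed) are what a change notification would report after each scheduled run.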
Ethics and Limitations
The crawler must respect robots.txt. A delay between requests is mandatory; 1 second is the minimum reasonable value. For large sites, 2–3 seconds is better, both to avoid loading the server and to avoid an IP ban. If crawling is needed regularly, IP rotation and User-Agent variation make sense.
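A sketch of these politeness rules using the standard library's robots.txt parser. The robots.txt content and URLs are invented for the example; in a real crawler the rules would be fetched with `set_url()` and `read()`:

```python
import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Real crawler: rp.set_url("https://example.com/robots.txt"); rp.read()
rp.parse([
    "User-agent: *",
    "Disallow: /admin/",
    "Crawl-delay: 2",
])

DEFAULT_DELAY = 1.0  # seconds; the minimum reasonable value
delay = rp.crawl_delay("*") or DEFAULT_DELAY  # honor Crawl-delay if declared

def polite_fetch(url, fetch):
    """Fetch a URL only if robots.txt allows it, then sleep the delay."""
    if not rp.can_fetch("*", url):
        return None  # disallowed: skip silently
    result = fetch(url)
    time.sleep(delay)
    return result

print(rp.can_fetch("*", "https://example.com/admin/panel"))  # False
print(delay)  # 2
```

`fetch` here is a placeholder for whatever HTTP client the crawler uses; the point is that every request goes through the robots.txt check and the delay.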
Timeline
A basic crawler (HTTP only, no SPA support) with CSV/JSON export takes 1–2 working days. Adding JavaScript rendering, Schema.org collection, diff comparison, and SQLite storage takes 3–4 days. Integration with a scheduler (cron/Airflow) and change notifications adds another 1–2 days.







