Development of Social Media Data Parser
Social media platforms are a difficult scraping target: they actively fight automated data collection, require authentication to view some content, and frequently change their DOM structure and API endpoints. Yet publicly available data (posts, comments, profiles, statistics) remains legitimate material for business analytics, mention monitoring, and competitive analysis.
Official APIs vs Web Scraping
The first choice is an official API, where one is available:
| Platform | API | Limitations |
|---|---|---|
| VKontakte | VK API v5.199 | Public groups unrestricted |
| Telegram | MTProto / Bot API | Public channels only |
| Instagram | Graph API | Requires business account, limited fields |
| Twitter/X | API v2 | Strict rate limits on free tier |
| YouTube | Data API v3 | 10,000 units/day quota |
If an official API is unavailable or insufficient, fall back to headless scraping with Playwright, authenticating via session cookies.
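As an illustration of the API route, a VK request can be assembled with nothing but the standard library. A minimal sketch: `wall.get`, `owner_id`, `count`, and `v` are real VK API parameters, but the helper names and the token value are placeholders of our own.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

VK_API = "https://api.vk.com/method"
API_VERSION = "5.199"

def build_wall_get_url(owner_id: int, count: int, access_token: str) -> str:
    """Build a wall.get request URL for the VK API."""
    params = {
        "owner_id": owner_id,        # negative IDs address communities
        "count": count,              # posts per page
        "access_token": access_token,
        "v": API_VERSION,
    }
    return f"{VK_API}/wall.get?{urlencode(params)}"

def fetch_wall(owner_id: int, count: int, token: str) -> dict:
    """Fetch one page of wall posts as parsed JSON."""
    with urlopen(build_wall_get_url(owner_id, count, token)) as resp:
        return json.load(resp)
```

The same URL-building pattern carries over to other methods (`groups.getMembers`, `newsfeed.search`, and so on); only the method name and parameter set change.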
What We Collect
Typical tasks:
- Mention monitoring: searching posts by keywords or hashtags
- Audience analysis: likes, reposts, comments, reach
- Competitive analysis: competitors' publications and their engagement
- Contact collection: public profile data, group contact pages
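Whatever the task, collected items converge on a common shape before analysis. A sketch of such a normalized record; the field names are illustrative, not tied to any platform's schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Post:
    platform: str          # "vk", "telegram", ...
    post_id: str           # platform-native identifier
    author_id: str
    text: str
    published_at: datetime
    likes: int = 0
    reposts: int = 0
    comments: int = 0

    @property
    def engagement(self) -> int:
        # Simple engagement score: sum of visible interactions.
        return self.likes + self.reposts + self.comments
```

Keeping `platform` plus `post_id` in every record also gives the later deduplication step a natural key.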
Architecture
Scheduler (Celery Beat)
→ Task Queue (Redis)
→ Workers (Playwright / aiohttp)
→ Raw Storage (S3 / local disk)
→ Processor (normalization, deduplication)
→ PostgreSQL (final data)
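The processor stage can deduplicate on a stable key derived from the platform name and the platform-native post ID. A minimal sketch, assuming raw records arrive as dicts with `platform` and `id` keys (our assumed shape, not a fixed schema):

```python
import hashlib

def dedup_key(platform: str, post_id: str) -> str:
    """Stable deduplication key for a (platform, post_id) pair."""
    return hashlib.sha256(f"{platform}:{post_id}".encode()).hexdigest()

def deduplicate(records: list[dict]) -> list[dict]:
    """Drop repeated records, keeping the first occurrence of each key."""
    seen: set[str] = set()
    unique = []
    for rec in records:
        key = dedup_key(rec["platform"], rec["id"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

In production the `seen` set typically lives in Redis or a PostgreSQL unique constraint rather than in process memory, so deduplication survives worker restarts.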
Protection Bypass
Platforms track anomalous patterns: overly frequent requests from a single IP, the absence of human-like delays between actions, and mismatches between the user agent and the browser fingerprint. Solutions:
- Proxy rotation: residential proxies via Brightdata, Oxylabs, or your own pool
- Random delays between requests (2 to 15 seconds, normally distributed)
- Realistic fingerprints: Playwright with a unique profile per session
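The delay and rotation points above can be sketched as follows. The mean and spread are illustrative values; `random.gauss` draws from the normal distribution, clamped to the stated 2–15 second window:

```python
import itertools
import random

def human_delay(mean: float = 6.0, sigma: float = 2.5,
                lo: float = 2.0, hi: float = 15.0) -> float:
    """Normally distributed pause, clamped to the [lo, hi] window."""
    return min(hi, max(lo, random.gauss(mean, sigma)))

def proxy_cycle(proxies: list[str]):
    """Endless round-robin iterator over the proxy pool."""
    return itertools.cycle(proxies)
```

A worker would call `time.sleep(human_delay())` between actions and take `next(pool)` for each new session; weighted or health-aware rotation can replace the plain round-robin later without changing the call sites.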
Timeline
A parser for a single platform via an official API takes 3–5 days; a headless parser with protection bypass and proxy rotation, 7–12 days.