Development of Contact Data Parser from Open Sources
A contact data parser automatically collects email addresses, phone numbers, addresses, social media links, and names from publicly available sources: company websites, directories, aggregators, reference pages. The task is technically non-trivial: source structures differ dramatically, data is embedded in non-standard HTML, hidden behind JavaScript rendering, or protected from automatic collection.
Parser Architecture
Typical stack for such projects:
- Playwright or Puppeteer—for pages with dynamic content loading (SPA, lazy load)
- Cheerio (Node.js) or BeautifulSoup (Python)—for static HTML
- Scrapy with middlewares—if high performance and parallel traversal needed
- Redis—queue of URLs to visit, deduplication of already visited pages
- PostgreSQL—storage of results with full-text search
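The queue-and-dedup piece of this stack can be sketched as follows. In production Redis would back the visited set (SADD/SISMEMBER) so multiple workers share state; an in-memory set stands in here to keep the example self-contained:

```python
from collections import deque

class CrawlFrontier:
    """URL queue with deduplication; in production Redis backs `visited`."""

    def __init__(self):
        self.queue = deque()
        self.visited = set()  # stands in for a Redis SET (SADD / SISMEMBER)

    def add(self, url):
        # Trivial normalization so http://a/ and http://a dedupe together
        url = url.rstrip("/")
        if url not in self.visited:
            self.visited.add(url)
            self.queue.append(url)

    def next_url(self):
        return self.queue.popleft() if self.queue else None

frontier = CrawlFrontier()
frontier.add("https://example.com/contacts")
frontier.add("https://example.com/contacts/")  # duplicate after normalization
frontier.add("https://example.com/about")
```

The same structure scales to Scrapy's scheduler middleware: `add` becomes the dedup filter, `next_url` the request source.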
For extracting contacts, regular expressions account for regional formats: Russian numbers in formats +7 (XXX) XXX-XX-XX, 8-XXX-XXXXXXX, international per E.164. Email—standard RFC 5322 regex with post-filtering of technical addresses (noreply@, no-reply@, mailer-daemon@).
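The extraction patterns above can be sketched like this (a simplified sketch: real pages need more phone variants, and the email pattern below is far shorter than full RFC 5322):

```python
import re

# Russian phone formats: +7 (XXX) XXX-XX-XX and 8-XXX-XXXXXXX
PHONE_RE = re.compile(r"(?:\+7\s*\(\d{3}\)\s*\d{3}-\d{2}-\d{2}|8-\d{3}-\d{7})")

# Simplified email pattern (a full RFC 5322 regex is much longer)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Technical mailboxes to drop after matching
BLOCKED_PREFIXES = ("noreply@", "no-reply@", "mailer-daemon@")

def extract_contacts(text):
    """Return (phones, emails) found in text, filtering technical addresses."""
    phones = PHONE_RE.findall(text)
    emails = [e for e in EMAIL_RE.findall(text)
              if not e.lower().startswith(BLOCKED_PREFIXES)]
    return phones, emails

sample = ("Call +7 (495) 123-45-67 or write sales@example.com, "
          "not noreply@example.com")
phones, emails = extract_contacts(sample)
```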
Data Sources
The parser is configured per source:
| Source Type | Example | Complexity |
|---|---|---|
| Business directories | 2GIS, Yandex.Maps (public data) | High |
| Industry reference guides | Construction, medical portals | Medium |
| Company websites | Contact pages, About us | Low |
| Social profiles | LinkedIn, VKontakte (public) | High |
Each source type needs a separate spider class or handler with its own navigation and extraction logic.
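One way to organize this per-source routing is a registry mapping a source type to its handler (a sketch; the handler names and stubbed bodies are illustrative, not from the actual spec):

```python
# Registry of extraction handlers keyed by source type; each handler
# receives raw HTML and returns a list of contact dicts.
HANDLERS = {}

def register(source_type):
    """Decorator that records a handler in the registry."""
    def wrap(fn):
        HANDLERS[source_type] = fn
        return fn
    return wrap

@register("company_site")
def parse_company_site(html):
    # Real logic would walk the DOM with BeautifulSoup; stubbed here
    return [{"source": "company_site", "raw": html.strip()}]

@register("directory")
def parse_directory(html):
    return [{"source": "directory", "raw": html.strip()}]

def dispatch(source_type, html):
    """Route raw HTML to the handler for its source type."""
    handler = HANDLERS.get(source_type)
    if handler is None:
        raise ValueError(f"no handler for {source_type!r}")
    return handler(html)
```

Adding a new source then means writing one decorated function rather than touching the pipeline.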
Normalization and Validation
Raw data goes through several processing stages:
- Phone normalization—the libphonenumber library (Google) brings numbers to a single E.164 format
- Email validation—a DNS MX query to the domain checks that a mail server exists
- Deduplication—compare normalized values, not original strings
- Address geocoding—via Nominatim (OpenStreetMap) or Yandex Geocoder
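The normalization and dedup steps can be sketched as below. Production code would use the `phonenumbers` package (libphonenumber's Python port); here Russian numbers are normalized by hand to keep the example dependency-free:

```python
import re

def normalize_ru_phone(raw):
    """Bring a Russian number to E.164 (+7XXXXXXXXXX); None if it doesn't fit."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits[0] in "78":
        return "+7" + digits[1:]
    return None

def dedupe_phones(raw_phones):
    """Deduplicate on normalized values, not the original strings."""
    seen, result = set(), []
    for raw in raw_phones:
        norm = normalize_ru_phone(raw)
        if norm and norm not in seen:
            seen.add(norm)
            result.append(norm)
    return result

# Three spellings of the same number collapse into one record
phones = dedupe_phones(["+7 (495) 123-45-67", "8-495-1234567", "8 495 123 45 67"])
```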
Export and Formats
Results are available in several formats:
- CSV/XLSX—for CRM import
- JSON API—for integration with internal systems
- Direct PostgreSQL/MySQL write with normalized schema
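The CSV export for CRM import needs nothing beyond the stdlib `csv` module (a sketch with an illustrative column set; the real schema would follow the target CRM's import template):

```python
import csv
import io

def export_csv(records):
    """Serialize contact records to a CSV string ready for CRM import."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["name", "phone", "email"])
    writer.writeheader()
    for rec in records:
        writer.writerow(rec)
    return buf.getvalue()

csv_text = export_csv([
    {"name": "Acme LLC", "phone": "+74951234567", "email": "sales@example.com"},
])
```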
Timeline and Volume
For one or two sources with normalization and basic storage: 5–8 working days. A scalable system for 10+ sources with a web management interface: from 3 weeks.