Implementing a Scheduled Web Scraping Task Runner
A one-time parser run is a tool. A scheduled parser is a system: it needs regular execution, result logging, alerts on failures, and the ability to manage tasks without code changes.
Implementation Options
Cron (Linux crontab) — the simplest option for a small number of tasks:
# Run parser every 4 hours
0 */4 * * * /usr/bin/python3 /opt/scrapers/catalog_spider.py >> /var/log/scraper.log 2>&1
Drawbacks: no run history, no UI, and management becomes difficult once you have dozens of tasks.
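Another practical cron pitfall: if a run takes longer than the interval, invocations pile up. A minimal sketch of a lock-file wrapper that skips overlapping runs (the lock path and parser path are assumptions for illustration):

```python
# Sketch: run a parser under an exclusive file lock so overlapping
# cron invocations skip instead of stacking up. Unix-only (fcntl).
import fcntl
import subprocess
import sys

def run_exclusively(command, lock_path="/tmp/catalog_spider.lock"):
    """Run `command` only if no other instance holds the lock.
    Returns the exit code, or None when a previous run is still in progress."""
    with open(lock_path, "w") as lock:
        try:
            # LOCK_NB makes flock() fail immediately instead of blocking.
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            print("previous run still in progress, skipping", file=sys.stderr)
            return None
        return subprocess.call(command)
```

The cron entry would then invoke this wrapper instead of the parser directly.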
Celery Beat — a common choice for Python projects:
# celery_config.py
from celery.schedules import crontab
CELERYBEAT_SCHEDULE = {
    'parse-catalog': {
        'task': 'scrapers.tasks.run_catalog_parser',
        # minute=0 is required: crontab(hour='*/4') alone fires every minute
        # of those hours
        'schedule': crontab(minute=0, hour='*/4'),
        'options': {'queue': 'scraping'},
    },
    'parse-prices': {
        'task': 'scrapers.tasks.run_price_parser',
        'schedule': crontab(minute=0, hour=6),
    },
}
Run history is available through django-celery-results; Flower provides a web UI for monitoring.
Node.js: node-cron / Agenda
const Agenda = require('agenda');
const agenda = new Agenda({ db: { address: MONGODB_URI } });

agenda.define('parse catalog', async job => {
  const { sourceUrl } = job.attrs.data;
  await runCatalogScraper(sourceUrl);
});

(async () => {
  await agenda.start(); // the processor must be started before jobs run
  await agenda.every('4 hours', 'parse catalog', { sourceUrl: 'https://...' });
})();
Agenda stores jobs in MongoDB and supports retries on failure, priorities, and job locking.
What the Scheduler Must Support
- Scheduled execution (cron expression or interval)
- Parallel execution of multiple tasks with concurrency limits
- Automatic retry on error (with exponential backoff)
- Alert in Telegram/Slack when error threshold exceeded
- Execution history: when it ran, how many records collected, errors
Timeline for implementing a Celery Beat scheduler with history and alerts: 2–3 business days.