Website Recovery After Failure
Recovery is stressful, so the steps should be planned in advance. A runbook with clear steps can cut recovery time by a factor of 3–5 compared with diagnosing the problem on the fly.
Failure Classification and First Steps
Site is unreachable (HTTP 5xx or timeout):
# 1. Check service status
systemctl status nginx php8.2-fpm
# 2. Restart if a service has hung
sudo systemctl restart php8.2-fpm
sudo systemctl restart nginx
# 3. Check logs
journalctl -u nginx -n 100 --no-pager
sudo tail -n 100 /var/log/php8.2-fpm.log
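Telling a 5xx response apart from a timeout is the first branch in this checklist, so it can help to script it. The following is a minimal sketch, not part of the original procedure: the function names `classify_status` and `probe` and the 10-second limit are assumptions made for illustration. It relies on curl printing `000` as the status code when no HTTP response arrived.

```shell
#!/bin/sh
# Hypothetical helper: map an HTTP status code (as printed by curl) to a
# coarse failure class. curl reports "000" when it got no response at all.
classify_status() {
    case "$1" in
        000)     echo "timeout" ;;
        5??)     echo "server_error" ;;
        2??|3??) echo "ok" ;;
        *)       echo "other" ;;
    esac
}

# probe URL: fetch the URL and print its failure class.
# The 10-second limit (-m 10) is an arbitrary choice for this sketch.
probe() {
    code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$1")
    classify_status "$code"
}
```

Running `probe https://example.com/` (a placeholder URL) prints one of `ok`, `server_error`, `timeout`, or `other`, which indicates whether to start with the application services or with network and DNS checks.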
Disk full:
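df -h # find the full partition
sudo du -sh /var/log/* | sort -rh | head -n 10 # see what is using the space
# Clean logs (truncate in place so nginx keeps its open file handle)
sudo truncate -s 0 /var/log/nginx/access.log
sudo journalctl --vacuum-size=100M
# Clean Docker (removes stopped containers, unused networks, dangling images, and build cache)
docker system prune -f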
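Reading `df` output by eye is error-prone during an incident. As a rough sketch, the check can be reduced to a filter that prints only the filesystems above a usage threshold. The function name `full_partitions` and the default of 90% are assumptions, and the parsing expects the POSIX layout produced by `df -P` with mount points that contain no spaces.

```shell
#!/bin/sh
# Reads `df -P` output on stdin and prints "mountpoint usage%" for every
# filesystem at or above the threshold given as $1 (default 90, arbitrary).
# Assumes mount points without spaces.
full_partitions() {
    awk -v limit="${1:-90}" 'NR > 1 {
        use = $5                 # Capacity column, e.g. "95%"
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6, $5
    }'
}
```

Typical use would be `df -P | full_partitions 85`, which prints nothing when every partition is below 85%.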
OOM (Out of Memory):
sudo dmesg -T | grep -i "out of memory"
# Add swap (a temporary fix; it does not persist across reboots unless added to /etc/fstab)
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
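Before adding swap it is useful to know which process the kernel actually killed. The sketch below extracts the PID and process name from kernel log lines. The function name `oom_victims` is hypothetical, and the pattern assumes the common `Killed process <pid> (<name>)` wording; kernels that phrase the message differently would need a different pattern.

```shell
#!/bin/sh
# Reads kernel log lines on stdin and prints "pid name" for each process
# terminated by the OOM killer. Assumes the message contains
# "Killed process <pid> (<name>)".
oom_victims() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}
```

For example, `sudo dmesg | oom_victims` lists the victims since boot. If the same service appears repeatedly, its memory limits or worker count are a more likely root cause than a lack of swap.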
Database crashed:
# PostgreSQL
sudo systemctl restart postgresql
# Check logs
sudo tail -n 50 /var/log/postgresql/postgresql-14-main.log
# If the data files are corrupted, restore from backup
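After a restart the database may need a few seconds before it accepts connections, so a short wait loop avoids declaring failure too early. This is a minimal sketch: `wait_for` is a hypothetical helper, and the attempt count and delay are illustrative values.

```shell
#!/bin/sh
# wait_for ATTEMPTS DELAY CMD...: run CMD until it succeeds, at most
# ATTEMPTS times, sleeping DELAY seconds between tries.
# Returns 0 on the first success, 1 if every attempt failed.
wait_for() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        # Sleep only if another attempt follows.
        [ "$i" -lt "$attempts" ] && sleep "$delay"
        i=$((i + 1))
    done
    return 1
}
```

With PostgreSQL this could be used as `wait_for 10 2 pg_isready -q || echo "database still down"`, where `pg_isready` is the client utility that reports whether the server accepts connections.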
Restore from Backup: Step by Step
# 1. Determine latest working backup
aws s3 ls s3://my-backups/database/ | tail -5
# 2. Download
aws s3 cp s3://my-backups/database/mysite_20241201_060000.dump.gz /tmp/
# 3. Create a new DB (leave the old one untouched until the restore is tested)
createdb mysite_restored
gunzip -c /tmp/mysite_20241201_060000.dump.gz | pg_restore -d mysite_restored --no-owner
# 4. Sanity-check the restored data
psql -d mysite_restored -c "SELECT COUNT(*) FROM users;"
psql -d mysite_restored -c "SELECT MAX(created_at) FROM orders;"
# 5. Switch application to restored DB
# Change DB_NAME in .env
# Restart application
# 6. Optionally rename the databases so DB_NAME can go back to mysite (requires no active connections)
# ALTER DATABASE mysite RENAME TO mysite_broken;
# ALTER DATABASE mysite_restored RENAME TO mysite;
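Steps 1 and 2 can be made less error-prone with two small checks: picking the newest dump mechanically and verifying the downloaded archive before restoring from it. The sketch below assumes the listing format printed by `aws s3 ls` (date, time, size, key) and file names with a sortable timestamp such as `mysite_YYYYMMDD_HHMMSS.dump.gz`. The function names `latest_backup` and `archive_ok` are hypothetical.

```shell
#!/bin/sh
# Reads `aws s3 ls` output on stdin and prints the key that sorts last.
# This is the newest dump only if the names embed a sortable timestamp.
# Lines without four fields (such as "PRE prefix/") are skipped.
latest_backup() {
    awk 'NF >= 4 { print $4 }' | sort | tail -n 1
}

# archive_ok FILE: succeeds only if the gzip stream is intact.
# This catches truncated downloads, not logical errors inside the dump.
archive_ok() {
    gzip -t "$1" 2>/dev/null
}
```

A possible use is `aws s3 ls s3://my-backups/database/ | latest_backup` to choose the file, followed by `archive_ok /tmp/<file> || echo "corrupt archive, try an older backup"` before running `pg_restore`.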
File Recovery
# Restore uploads/ from S3
aws s3 sync s3://my-backups/files/ /var/www/mysite/storage/app/public/ \
--exact-timestamps
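# Fix ownership and permissions after restore (directories 755, files 644)
sudo chown -R www-data:www-data /var/www/mysite/storage/
sudo find /var/www/mysite/storage/ -type d -exec chmod 755 {} +
sudo find /var/www/mysite/storage/ -type f -exec chmod 644 {} +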
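Uploaded files normally have no reason to be executable, so a quick audit after the restore can flag permissions that are broader than needed. This is a small sketch with a hypothetical function name, `executable_files`. It only reports files and changes nothing.

```shell
#!/bin/sh
# executable_files DIR: print regular files under DIR whose owner-execute
# bit is set. For an uploads directory the expected output is empty.
executable_files() {
    find "$1" -type f -perm -100
}
```

For instance, `executable_files /var/www/mysite/storage/` lists any restored file that is still marked executable.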
Code Rollback
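# Via Git
git log --oneline -10 # find the last working commit
git checkout <commit-hash> # detached HEAD: quick rollback on the server
# or
git revert <bad-commit> # new commit that undoes the bad one, history preserved
# Via Docker (if used)
docker pull myregistry/myapp:previous-tag
docker stop myapp
docker rm myapp # free the container name before reusing it
docker run -d --name myapp myregistry/myapp:previous-tag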
Runbook Template
## Runbook: Recovery After Complete Failure
**Condition:** site is unreachable for more than 5 minutes
### Step 1: Diagnostics (5 min)
- [ ] `ssh deploy@server`
- [ ] `systemctl status nginx php8.2-fpm postgresql`
- [ ] `df -h` — check disk
- [ ] `free -m` — check memory
### Step 2: Quick Recovery (5–10 min)
- [ ] Service restart
- [ ] If not working — restore from backup (see section above)
### Step 3: Notifications
- [ ] Slack: #incidents — incident status
- [ ] Status page — incident created
### Step 4: Post-mortem
- [ ] RCA within 24 hours after recovery
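The post-mortem step is much easier when the incident timeline is recorded as it happens. The following sketch appends timestamped notes to a log file while working through the runbook. The `log_step` function and the default path `./incident.log` are assumptions for illustration, not part of the template.

```shell
#!/bin/sh
# Hypothetical log location; set INCIDENT_LOG to override it.
INCIDENT_LOG="${INCIDENT_LOG:-./incident.log}"

# log_step MESSAGE...: append one UTC-timestamped line to the incident log.
log_step() {
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$INCIDENT_LOG"
}
```

Calling `log_step "restarted php8.2-fpm, site still returns 502"` after each action yields a chronological record that can be pasted into the RCA document.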
With a prepared runbook and up-to-date backups, recovery typically takes 30 minutes to 2 hours.