Automatic restore from backup on failure

Implementing Automatic Restoration from Backup on Failure

Automatic restoration is the next step after having backups. The system detects the problem itself, chooses a recovery point, spins up infrastructure, and verifies the result. Humans are involved only in the final check.

Automatic Restoration Scenarios

DB data corruption. Trigger: monitoring detects an anomaly (a sharp spike in errors, a checksum mismatch). Automation: stop writes to the corrupted database, restore from the last valid snapshot, verify integrity, and switch traffic to the restored instance.
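The "sharp error spike" trigger needs a concrete rule before it can drive automation. Below is a minimal sketch. It assumes error and request counts are sampled at a fixed interval (for example, once a minute). The class name and the default thresholds (`window`, `factor`, `min_rate`) are hypothetical and need tuning against real traffic.

```python
from collections import deque


class ErrorSpikeDetector:
    """Flags an error-rate spike relative to a rolling baseline.

    A sample counts as a spike when its error rate is at least `min_rate`
    and at least `factor` times the average of the previous `window` samples.
    """

    def __init__(self, window: int = 30, factor: float = 5.0, min_rate: float = 0.01):
        self.history: deque[float] = deque(maxlen=window)
        self.factor = factor
        self.min_rate = min_rate

    def observe(self, errors: int, requests: int) -> bool:
        rate = errors / requests if requests else 0.0
        # Compute the baseline before adding the new sample
        baseline = sum(self.history) / len(self.history) if self.history else None
        self.history.append(rate)
        if baseline is None:
            return False  # no history yet, nothing to compare against
        return rate >= self.min_rate and rate >= self.factor * baseline
```

The `min_rate` floor keeps a jump from 1 to 5 errors on a quiet system from stopping writes to a production database.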

Filesystem failure. Trigger: a mount fails or the filesystem switches to read-only mode. Automation: Terraform creates a new instance with a clean disk, rsync or S3 sync restores the data, and the application restarts.
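Both filesystem triggers can be read from the kernel's mount table. This is a Linux sketch. It assumes the monitor runs on the host itself and reads /proc/mounts. The function names are illustrative.

```python
def mount_status(mount_point: str, mounts_text: str) -> str:
    """Return 'missing', 'ro' or 'rw' for mount_point, given /proc/mounts content."""
    status = "missing"
    for line in mounts_text.splitlines():
        # Format: device mountpoint fstype options dump pass
        # (spaces inside paths are escaped as \040, so split() is safe)
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # The last matching entry wins, because later mounts stack on top
            status = "ro" if "ro" in fields[3].split(",") else "rw"
    return status


def check_mount(mount_point: str, mounts_path: str = "/proc/mounts") -> str:
    # Read the live mount table and classify the given mount point
    with open(mounts_path) as f:
        return mount_status(mount_point, f.read())
```

If `check_mount("/data")` returns anything other than "rw", that would start the replacement flow. On Unix, `os.statvfs(path).f_flag & os.ST_RDONLY` is an alternative read-only test for a path that is known to be mounted.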

Complete server failure. Trigger: the health check fails N times in a row. Automation: an Auto Scaling Group (AWS) or equivalent launches a new instance from an AMI, cloud-init deploys the configuration, and data is mounted from persistent storage.
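The trigger counts consecutive failures rather than failures in a time window. That way, a single slow response does not replace a healthy server. Below is a minimal sketch of that rule; the class name and default threshold are hypothetical. Managed load balancers offer the same setting, for example the unhealthy threshold count on an AWS target group.

```python
class HealthTracker:
    """Declares a server failed after `threshold` consecutive failed checks."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, healthy: bool) -> bool:
        """Record one check result; return True when replacement should start."""
        if healthy:
            # Any success resets the streak, so isolated blips are ignored
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold
```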

PostgreSQL Architecture

Point-in-Time Recovery (PITR) is the foundation for automatic restoration in relational databases.

WAL archiving to S3:

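# postgresql.conf
wal_level = replica
archive_mode = on
# A non-zero exit status (e.g. a failed upload) makes PostgreSQL keep the segment and retry
archive_command = 'aws s3 cp %p s3://mybackups/wal/%f'
# Used only during recovery (PostgreSQL 12+: when recovery.signal or standby.signal exists)
restore_command = 'aws s3 cp s3://mybackups/wal/%f %p'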
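PITR can only reach points covered by an unbroken chain of WAL segments. A silently missing file therefore breaks recovery past that point. Below is a sketch of a gap check over archived file names, such as the names listed under s3://mybackups/wal/. It makes two assumptions:

- the default 16 MB segment size;
- the uncompressed names produced by the archive_command above.

pgBackRest and WAL-G use their own storage layout, so this check does not apply to them as written.

```python
import re

# Regular WAL segment names: 8 hex digits each for timeline, log and segment
WAL_NAME = re.compile(r"^([0-9A-F]{8})([0-9A-F]{8})([0-9A-F]{8})$")
SEGMENTS_PER_LOG = 256  # assumes the default 16 MB wal_segment_size


def find_wal_gaps(names: list[str]) -> list[str]:
    """Return segment names missing between the first and last archived segment.

    Gaps are checked per timeline; .history, .backup and .partial files are ignored.
    """
    by_timeline: dict[str, set[int]] = {}
    for name in names:
        m = WAL_NAME.match(name)
        if not m:
            continue  # not a regular WAL segment
        tli, log, seg = m.groups()
        seq = int(log, 16) * SEGMENTS_PER_LOG + int(seg, 16)
        by_timeline.setdefault(tli, set()).add(seq)
    missing = []
    for tli, seqs in sorted(by_timeline.items()):
        for seq in range(min(seqs), max(seqs) + 1):
            if seq not in seqs:
                log, seg = divmod(seq, SEGMENTS_PER_LOG)
                missing.append(f"{tli}{log:08X}{seg:08X}")
    return missing
```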

Take base backups daily to S3 with pgBackRest or pg_basebackup (pg_basebackup output has to be uploaded separately).
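Choosing the base backup is the first step of a PITR restore. WAL can only be replayed forward from a backup, so the backup must have finished at or before the target time. Below is a sketch that assumes the backup catalog is available as a list. Unlike the `find_latest_base_backup_before` helper called in the restoration code, this version takes that list as an explicit argument. The `BaseBackup` type is hypothetical.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class BaseBackup(NamedTuple):
    label: str
    finished_at: datetime  # when the backup completed; keep all values aware or all naive


def find_latest_base_backup_before(
    backups: list[BaseBackup], target_time: datetime
) -> Optional[BaseBackup]:
    """Pick the newest base backup that finished at or before target_time."""
    candidates = [b for b in backups if b.finished_at <= target_time]
    # None means no usable backup exists: escalate instead of restoring
    return max(candidates, key=lambda b: b.finished_at, default=None)
```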

Restoration automation:

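from datetime import datetime

# The helpers called below are placeholders for your own tooling
# (e.g. wrappers around pgBackRest, WAL-G or cloud APIs).
def auto_restore_postgres(target_time: datetime, db_config: dict):
    # 1. Find the most recent base backup finished before target_time
    base_backup = find_latest_base_backup_before(target_time)

    # 2. Provision a new PostgreSQL instance (the failed one stays intact for investigation)
    instance = provision_postgres_instance(db_config)

    # 3. Restore the base backup onto the new instance
    restore_base_backup(instance, base_backup)

    # 4. Replay archived WAL up to target_time
    apply_wal_until(instance, target_time)

    # 5. Verify integrity before the instance receives any traffic
    if not verify_database_integrity(instance):
        raise RuntimeError("restored database failed integrity checks")

    return instance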
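For step 4, PostgreSQL 12 and later perform targeted recovery when two conditions hold: a recovery.signal file exists in the data directory, and recovery_target_time is set. Below is one possible shape of the configuration step in `apply_wal_until`. It assumes the base backup is already restored into `data_dir` and the server is stopped. The function name and the choice of `recovery_target_inclusive = off` are assumptions.

```python
from datetime import datetime, timezone
from pathlib import Path


def write_recovery_config(data_dir: Path, target_time: datetime, restore_command: str) -> None:
    """Configure targeted PITR on a stopped, freshly restored data directory."""
    escaped = restore_command.replace("'", "''")  # quote for a PostgreSQL string literal
    settings = [
        f"restore_command = '{escaped}'",
        f"recovery_target_time = '{target_time.astimezone(timezone.utc).isoformat(sep=' ')}'",
        # Stop just before the target, so a faulty transaction at that moment is not replayed
        "recovery_target_inclusive = off",
        # Open the database for writes once the target is reached
        "recovery_target_action = 'promote'",
    ]
    with open(data_dir / "postgresql.auto.conf", "a") as f:
        f.write("\n".join(settings) + "\n")
    # The presence of this file tells PostgreSQL to start in targeted recovery mode
    (data_dir / "recovery.signal").touch()
```

On the next start, PostgreSQL replays WAL through restore_command and promotes once the target is reached. pgBackRest can do the same through its restore command's `--type=time` and `--target` options.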

Tools: pgBackRest (the most full-featured option for PostgreSQL), Barman, WAL-G (lightweight and popular in cloud deployments).

Automatic File and Media Restoration

For S3 and other object storage: S3 Versioning plus S3 Object Lock protect against accidental deletion. A specific file version can be restored by an AWS Lambda function triggered by an SNS event or an application request.
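Below is a sketch of the restore step such a Lambda function could run. It assumes versioning is enabled on the bucket and uses the boto3 `list_object_versions` and `copy_object` calls. The function names and the request payload format are hypothetical.

```python
from datetime import datetime


def restore_object_version(s3, bucket: str, key: str, as_of: datetime) -> str:
    """Restore `key` to the newest version that existed at or before `as_of`.

    `s3` is a boto3 S3 client. boto3 returns timezone-aware LastModified
    values, so `as_of` must be timezone-aware too. Returns the restored VersionId.
    """
    # Prefix also matches longer keys, so filter on the exact key.
    # Results are paginated (up to 1000 per call); very long histories need a paginator.
    response = s3.list_object_versions(Bucket=bucket, Prefix=key)
    versions = [
        v for v in response.get("Versions", [])
        if v["Key"] == key and v["LastModified"] <= as_of
    ]
    if not versions:
        raise LookupError(f"no version of s3://{bucket}/{key} at or before {as_of}")
    chosen = max(versions, key=lambda v: v["LastModified"])
    # Copying the old version onto the same key creates a new latest version
    s3.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key, "VersionId": chosen["VersionId"]},
    )
    return chosen["VersionId"]


def lambda_handler(event, context):
    # Hypothetical payload: {"bucket": ..., "key": ..., "as_of": "2024-05-01T12:00:00+00:00"}
    import boto3  # bundled with the AWS Lambda Python runtime

    as_of = datetime.fromisoformat(event["as_of"])
    version = restore_object_version(boto3.client("s3"), event["bucket"], event["key"], as_of)
    return {"restored_version": version}
```

Restoring by copying keeps the full version history, so a wrong restore can itself be rolled back. A single `copy_object` call handles objects up to 5 GB; larger objects need a multipart copy.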

For filesystems: schedule EBS snapshots (AWS) or Persistent Disk snapshots (GCP) every 4-6 hours. A Terraform script creates a volume from the latest snapshot and attaches it to a new instance.
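The same volume restore can also be driven from Python instead of Terraform. Below is a sketch with boto3 that assumes the snapshots are owned by the same account. The function name and the gp3 volume type are assumptions.

```python
def restore_volume_from_latest_snapshot(ec2, volume_id: str, availability_zone: str) -> str:
    """Create a new EBS volume from the newest completed snapshot of volume_id.

    `ec2` is a boto3 EC2 client. Returns the new VolumeId; attaching it
    (ec2.attach_volume) and mounting it are left to the caller.
    """
    response = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[
            {"Name": "volume-id", "Values": [volume_id]},
            {"Name": "status", "Values": ["completed"]},  # skip in-progress snapshots
        ],
    )
    snapshots = response["Snapshots"]
    if not snapshots:
        raise LookupError(f"no completed snapshot for {volume_id}")
    latest = max(snapshots, key=lambda s: s["StartTime"])
    new_volume = ec2.create_volume(
        SnapshotId=latest["SnapshotId"],
        AvailabilityZone=availability_zone,  # must match the replacement instance's AZ
        VolumeType="gp3",  # assumption: choose the type your workload needs
    )
    # Block until the volume is ready to attach
    ec2.get_waiter("volume_available").wait(VolumeIds=[new_volume["VolumeId"]])
    return new_volume["VolumeId"]
```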

Verification After Restoration

Automatic restoration without verification is only half a solution. Required checks:

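from typing import Callable

def verify_restoration(instance, expected_counts: dict) -> bool:
    # check_* helpers are placeholders. Each check is wrapped in a lambda so
    # all() short-circuits: once one check fails, the rest are skipped
    # instead of running against a broken instance.
    checks: list[Callable[[], bool]] = [
        lambda: check_db_connectivity(instance),
        lambda: check_row_counts(instance, expected_counts),
        lambda: check_referential_integrity(instance),
        # The newest rows must be no more than 5 minutes old relative to the recovery point
        lambda: check_recent_data_present(instance, max_age_minutes=5),
        lambda: run_application_smoke_tests(instance),
    ]
    return all(check() for check in checks)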
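Below is one of the checks above, sketched concretely. It assumes the restored instance exposes a DB-API 2.0 connection, so it takes `conn` instead of `instance`. The `tolerance` parameter is a hypothetical addition: it absorbs normal drift between when the expected counts were recorded and the recovery point.

```python
def check_row_counts(conn, expected_counts: dict[str, int], tolerance: float = 0.01) -> bool:
    """Return True if every table's row count is within `tolerance` of the expected count.

    `conn` is any DB-API 2.0 connection (psycopg, sqlite3, ...). Table names are
    interpolated into SQL, so they must come from trusted configuration.
    """
    cur = conn.cursor()
    for table, expected in expected_counts.items():
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        (actual,) = cur.fetchone()
        # The allowed deviation is a fraction of the expected count
        if abs(actual - expected) > expected * tolerance:
            return False
    return True
```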

If verification fails, the automation tries the previous recovery point or escalates an alert to the team.
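The fallback rule can be sketched as a loop over recovery points, newest first. The hooks passed in are hypothetical. In practice they would wrap functions like `auto_restore_postgres` and `verify_restoration`, plus a paging call.

```python
import logging
from typing import Any, Callable, Iterable, Optional

log = logging.getLogger(__name__)


def restore_with_fallback(
    recovery_points: Iterable[str],
    restore: Callable[[str], Any],
    verify: Callable[[Any], bool],
    escalate: Callable[[str], None],
    max_attempts: int = 3,
) -> Optional[Any]:
    """Try recovery points newest-first; escalate if none of them verifies.

    `recovery_points` must be ordered from newest to oldest.
    """
    tried: list[str] = []
    for point in recovery_points:
        if len(tried) >= max_attempts:
            break
        tried.append(point)
        try:
            instance = restore(point)
        except Exception:
            # A failed restore counts as a failed attempt; move on to an older point
            log.exception("restore from %s failed", point)
            continue
        if verify(instance):
            return instance
    escalate(f"automatic restore failed; tried recovery points: {tried}")
    return None
```

Capping the attempts keeps the automation from walking back indefinitely and silently discarding hours of data; beyond a few points, a human should decide.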

Restoration Orchestration

Orchestration runs through AWS Systems Manager (SSM) Automation or an Ansible playbook triggered by an event:

  1. CloudWatch Alarm → SNS Topic → Lambda function
  2. Lambda starts an SSM Automation document
  3. SSM executes steps: provision → restore → verify
  4. Depending on the result: switch DNS in Route 53 to the restored instance, or escalate to PagerDuty
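Steps 1 and 2 amount to a small Lambda function. Below is a sketch using the boto3 `start_automation_execution` call. It assumes the SNS topic carries standard CloudWatch alarm notifications. The document name `RestoreFromBackup` and its `AlarmName` parameter are hypothetical.

```python
import json


def handle_alarm(event: dict, ssm, document_name: str = "RestoreFromBackup") -> list[str]:
    """Start an SSM Automation execution for each alarm in an SNS event.

    `ssm` is a boto3 SSM client. Returns the started AutomationExecutionIds.
    """
    execution_ids = []
    for record in event.get("Records", []):
        # CloudWatch alarm notifications arrive as a JSON string in the SNS message
        alarm = json.loads(record["Sns"]["Message"])
        if alarm.get("NewStateValue") != "ALARM":
            continue  # ignore transitions back to OK or INSUFFICIENT_DATA
        response = ssm.start_automation_execution(
            DocumentName=document_name,
            # SSM Automation parameters are passed as lists of strings
            Parameters={"AlarmName": [alarm["AlarmName"]]},
        )
        execution_ids.append(response["AutomationExecutionId"])
    return execution_ids


def lambda_handler(event, context):
    import boto3  # bundled with the AWS Lambda Python runtime

    return {"executions": handle_alarm(event, boto3.client("ssm"))}
```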

For Kubernetes: Velero restores a namespace from a backup, including volume snapshots. With the Operator pattern, a custom Kubernetes Operator monitors PVC state and restores automatically when it detects a problem.
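Below is a sketch of triggering the Velero restore from automation by building the CLI call. It assumes the `velero` binary is installed and configured for the cluster; the function names are illustrative. By default, Velero skips resources that already exist. The broken namespace is therefore usually deleted first, or restored elsewhere with `--namespace-mappings`.

```python
import subprocess
from typing import Optional


def velero_restore_command(backup: str, namespace: str, restore_name: Optional[str] = None) -> list[str]:
    """Build a `velero restore create` command that restores one namespace."""
    cmd = ["velero", "restore", "create"]
    if restore_name:
        cmd.append(restore_name)  # optional explicit name for the Restore object
    # --wait blocks until the restore finishes, so the caller can verify afterwards
    cmd += ["--from-backup", backup, "--include-namespaces", namespace, "--wait"]
    return cmd


def restore_namespace(backup: str, namespace: str) -> None:
    # check=True raises CalledProcessError if velero exits with a non-zero status
    subprocess.run(velero_restore_command(backup, namespace), check=True)
```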

Testing Automatic Restoration

Weekly scheduled test: the automation restores a backup into an isolated environment, runs the verification, and sends a report. If verification passes, the backups are valid. If not, the team gets an alert before a real incident exposes the problem.
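Below is a sketch of one weekly drill run. It assumes the isolated environment is created and destroyed by hooks you supply; all of the callables are hypothetical. Measuring the duration also gives a test-condition estimate of RTO.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class DrillReport:
    recovery_point: str
    passed: bool
    duration_seconds: float  # restore + verify time, a rough proxy for RTO
    error: str = ""


def run_restore_drill(
    recovery_point: str,
    restore_isolated: Callable[[str], Any],
    verify: Callable[[Any], bool],
    teardown: Callable[[Any], None],
    alert: Callable[[DrillReport], None],
) -> DrillReport:
    """Restore into an isolated environment, verify, clean up, and report."""
    start = time.monotonic()
    instance = None
    try:
        instance = restore_isolated(recovery_point)
        passed = verify(instance)
        error = "" if passed else "verification failed"
    except Exception as exc:  # a crash during the drill is also a failed drill
        passed, error = False, repr(exc)
    elapsed = time.monotonic() - start  # measured before teardown
    if instance is not None:
        teardown(instance)  # do not leave drill environments running
    report = DrillReport(recovery_point, passed, elapsed, error)
    if not passed:
        alert(report)
    return report
```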

Metrics for Monitoring

  • Actual RTO: time from problem detection to verified restoration
  • Actual RPO: data lost, i.e. the time between the last recoverable point (last backup or archived WAL segment) and the moment of failure
  • Backup freshness: age of the last successful backup per component
  • Restore test success rate: percentage of successful automatic test restores per month
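The metrics above reduce to simple timestamp arithmetic. Below is a sketch that assumes the timestamps are collected by the orchestration and backup jobs. The function names and the `stale_components` helper are hypothetical.

```python
from datetime import datetime, timedelta


def actual_rto(detected_at: datetime, verified_at: datetime) -> timedelta:
    # Time from problem detection to verified restoration
    return verified_at - detected_at


def actual_rpo(failure_at: datetime, last_recoverable_at: datetime) -> timedelta:
    # With PITR, last_recoverable_at is the last archived WAL, not the last base backup
    return failure_at - last_recoverable_at


def backup_freshness(last_success: dict[str, datetime], now: datetime) -> dict[str, timedelta]:
    # Age of the last successful backup per component
    return {component: now - ts for component, ts in last_success.items()}


def stale_components(freshness: dict[str, timedelta], max_age: timedelta) -> list[str]:
    # Components whose last good backup is older than allowed; alert on these
    return sorted(c for c, age in freshness.items() if age > max_age)


def restore_test_success_rate(results: list[bool]) -> float:
    # Percentage of successful test restores; no tests at all is reported as 0%
    return 100.0 * sum(results) / len(results) if results else 0.0
```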

Implementation Timeline

  • PostgreSQL PITR with WAL archiving — 3-5 days
  • S3 versioning + Lambda auto-restoration — 2-3 days
  • ASG + cloud-init server auto-restoration — 3-5 days
  • Orchestration + verification + alerts — 3-5 days
  • Testing and documentation — 2-3 days

Total: 2-3 weeks for a complete automatic restoration system.