Implementing Cloud-Agnostic Architecture to Avoid Vendor Lock-in
Cloud-agnostic architecture means designing a system so that it runs on any cloud provider without major rework. The goal is not abstraction for abstraction's sake, but concrete design choices that preserve the ability to switch providers or to use several at once.
What Vendor Lock-in Is and When It's a Problem
Vendor lock-in arises when a system depends on a provider's proprietary services:
- AWS Lambda + API Gateway + DynamoDB + SQS — complete AWS dependency
- GCP Firestore + Cloud Run + Pub/Sub — GCP dependency
Lock-in isn't always bad. Managed services accelerate development and reduce operational burden. It becomes a problem when:
- The provider sharply changes pricing (increases of 200% within 3 months have happened in real cases)
- The provider is unavailable in a required jurisdiction
- An on-premise copy of the system must be deployed
- M&A activity forces a change of provider
Cloud-Agnostic Solution Levels
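Compute (containers). Docker plus Kubernetes is the de facto standard for cloud-agnostic compute. A manifest that sticks to core Kubernetes APIs runs on EKS, GKE, AKS, or k3s on-premise with little or no change; the usual differences are storage classes, ingress, and load-balancer annotations. Avoid: Fargate-specific annotations, GKE Autopilot-specific config.
Object storage. The S3 API has become a de facto standard: MinIO, Ceph, Wasabi, Backblaze B2, and GCS (via its compatibility layer) all support it. Write storage code against an S3-compatible client and take the endpoint from configuration: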
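One way to keep manifests portable is to flag provider-specific annotations before they reach a cluster. The sketch below assumes the manifest has already been parsed into a Python dict (for example by a YAML parser). The function name `find_provider_annotations` and the prefix table are illustrative assumptions, and the prefix list is not exhaustive.

```python
# Annotation key prefixes that tie a manifest to one provider.
# Illustrative and incomplete: extend the table for your own clusters.
PROVIDER_PREFIXES = {
    "eks.amazonaws.com/": "aws",
    "service.beta.kubernetes.io/aws-": "aws",
    "cloud.google.com/": "gcp",
    "iam.gke.io/": "gcp",
    "service.beta.kubernetes.io/azure-": "azure",
}


def find_provider_annotations(manifest: dict) -> list:
    """Return (annotation_key, provider) pairs found anywhere in the manifest."""
    found = []

    def walk(node):
        if isinstance(node, dict):
            annotations = node.get("annotations")
            if isinstance(annotations, dict):
                for key in annotations:
                    for prefix, provider in PROVIDER_PREFIXES.items():
                        if key.startswith(prefix):
                            found.append((key, provider))
            # Recurse so pod templates nested inside Deployments are covered too.
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(manifest)
    return found


# Hypothetical Service manifest with one AWS-only annotation.
service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {
        "name": "web",
        "annotations": {
            "service.beta.kubernetes.io/aws-load-balancer-type": "nlb",
            "example.com/owner": "platform-team",
        },
    },
}
```

Calling `find_provider_annotations(service)` reports the AWS load-balancer annotation and ignores the neutral one. A check like this could run in CI so that provider-specific settings end up in per-provider overlays rather than in the shared manifest.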
import os

import boto3

# The endpoint selects the backend: AWS S3, MinIO, or any other S3-compatible service.
s3 = boto3.client(
    's3',
    endpoint_url=os.environ['STORAGE_ENDPOINT'],
    aws_access_key_id=os.environ['STORAGE_KEY'],
    aws_secret_access_key=os.environ['STORAGE_SECRET'],
)
Database. PostgreSQL is available everywhere: AWS RDS/Aurora, GCP Cloud SQL, Azure Database for PostgreSQL, Railway, Supabase, and self-hosted. Avoid: Oracle-specific functions, SQL Server proprietary syntax.
Message queues. RabbitMQ and Kafka run everywhere. NATS is cloud-native and easy to self-host. Avoid SQS or Pub/Sub as the primary queue if portability is a requirement.
DNS and CDN. Cloudflare works on top of any cloud provider and does not tie you to one, although Cloudflare-only features are themselves a dependency to track.
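Whichever broker is chosen, application code stays portable when it depends on a small interface rather than on a broker SDK. The sketch below is a minimal illustration: `MessageQueue`, `InMemoryQueue`, and `send_order_created` are hypothetical names, and the in-memory class stands in for a RabbitMQ, Kafka, or NATS adapter that would expose the same two methods.

```python
from collections import deque
from typing import Callable, Protocol


class MessageQueue(Protocol):
    """Hypothetical port that application code depends on instead of a broker SDK."""

    def publish(self, topic: str, payload: bytes) -> None: ...

    def consume(self, topic: str, handler: Callable[[bytes], None]) -> int: ...


class InMemoryQueue:
    """Stand-in adapter for tests; a real adapter would wrap a broker client."""

    def __init__(self) -> None:
        self._topics = {}

    def publish(self, topic: str, payload: bytes) -> None:
        self._topics.setdefault(topic, deque()).append(payload)

    def consume(self, topic: str, handler: Callable[[bytes], None]) -> int:
        """Deliver every pending message on the topic; return how many were handled."""
        pending = self._topics.get(topic, deque())
        handled = 0
        while pending:
            handler(pending.popleft())
            handled += 1
        return handled


def send_order_created(queue: MessageQueue, order_id: int) -> None:
    """Business code sees only the port, never the broker client."""
    queue.publish("orders.created", f"order:{order_id}".encode())
```

Swapping brokers then means writing one new adapter class; `send_order_created` and its callers do not change.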
Terraform as IaC Abstraction Tool
Terraform has providers for AWS, GCP, Azure, and Cloudflare, so one tool covers everything. Terraform does not abstract the resources themselves, though, so reusable modules have to be written with care:
# Kubernetes cluster module: abstraction over EKS/GKE/AKS
module "k8s" {
  source        = "./modules/kubernetes"
  provider_type = var.cloud_provider # "aws" | "gcp" | "azure"
  node_count    = 3
  node_size     = "standard-4cpu-16gb"
}

# Inside the module: conditional resources selected by provider_type
resource "aws_eks_cluster" "this" {
  count = var.provider_type == "aws" ? 1 : 0
  ...
}

resource "google_container_cluster" "this" {
  count = var.provider_type == "gcp" ? 1 : 0
  ...
}
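An abstract size such as "standard-4cpu-16gb" has to be translated into a concrete instance type somewhere inside the module. The lookup logic is sketched here in Python for illustration; in the module itself the same table would typically live in a map variable or local value. The instance types listed are real 4 vCPU / 16 GiB types, but choosing these particular families is an assumption, and `resolve_node_size` is a hypothetical helper name.

```python
# Abstract node size -> concrete instance type per provider.
# The choice of instance families is an assumption for illustration.
NODE_SIZES = {
    "standard-4cpu-16gb": {
        "aws": "m5.xlarge",
        "gcp": "e2-standard-4",
        "azure": "Standard_D4s_v3",
    },
}


def resolve_node_size(node_size: str, provider_type: str) -> str:
    """Translate an abstract size into the provider's instance type."""
    try:
        return NODE_SIZES[node_size][provider_type]
    except KeyError:
        raise ValueError(
            f"no mapping for size {node_size!r} on provider {provider_type!r}"
        ) from None
```

Failing loudly on an unknown combination is deliberate: a silent fallback to some default instance type would hide cost and capacity differences between providers.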
OpenTofu Instead of Terraform
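HashiCorp moved Terraform to the Business Source License (BSL) in 2023. OpenTofu is the community fork under MPL-2.0; it started from the last MPL-licensed Terraform code and works as a drop-in replacement for configurations written for those versions. For a cloud-agnostic strategy OpenTofu is the preferable choice because it carries no licensing risk.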
Service Mesh as Network Abstraction
Istio and Linkerd provide a uniform service mesh on any Kubernetes cluster. Traffic policies, mutual TLS, and circuit breaking behave the same way on EKS and on GKE.
Observability: OpenTelemetry
OpenTelemetry is a vendor-neutral standard for metrics, traces, and logs. Instrument the application once, then send the telemetry to any compatible backend (Grafana, Datadog, Jaeger, Zipkin):
import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Batch spans and export them over OTLP/gRPC to whichever backend the endpoint points at.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint=os.environ['OTEL_EXPORTER_ENDPOINT']))
)
trace.set_tracer_provider(provider)
Configuration Through Environment Variables
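The Twelve-Factor App principle applies here: configuration comes from environment variables, not from hardcoded endpoints. This alone removes a large class of portability problems, because moving to another provider becomes a change of configuration rather than of code.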
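A minimal sketch of this principle: collect every provider-dependent value in one settings object that is loaded at startup and fails fast when something is missing. The variable names are the ones used in the snippets of this article; the `Settings` class itself is a hypothetical illustration, not a prescribed structure.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """All provider-dependent values in one place (hypothetical helper)."""

    storage_endpoint: str
    storage_key: str
    storage_secret: str
    otel_endpoint: str

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Settings":
        # Field name -> environment variable name.
        required = {
            "storage_endpoint": "STORAGE_ENDPOINT",
            "storage_key": "STORAGE_KEY",
            "storage_secret": "STORAGE_SECRET",
            "otel_endpoint": "OTEL_EXPORTER_ENDPOINT",
        }
        missing = sorted(name for name in required.values() if name not in env)
        if missing:
            # Fail at startup instead of at the first storage or tracing call.
            raise RuntimeError("missing environment variables: " + ", ".join(missing))
        return cls(**{field: env[name] for field, name in required.items()})
```

Passing the mapping explicitly (instead of always reading `os.environ`) keeps the loader easy to test and makes it clear that the rest of the code never reads the environment directly.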
What Can't Be Cloud-Agnostic
Some things only make sense in a specific cloud:
- AWS Lambda@Edge: runs only on AWS
- GCP BigQuery: unique capabilities; alternatives such as ClickHouse exist, but switching takes effort
- Azure Active Directory (now Microsoft Entra ID) integration: needed when clients use O365
The pragmatic approach is to isolate such dependencies in separate services behind a clear API. The rest of the system stays cloud-agnostic, and these components become replaceable adapters.
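The replaceable-adapter idea can be sketched as an interface plus a registry that picks the implementation from configuration. Everything named here (`AnalyticsStore`, `InMemoryAnalytics`, `build_analytics`, the backend keys) is hypothetical; a BigQuery or ClickHouse adapter would be a further class with the same methods, registered under its own key.

```python
from typing import Callable, Dict, Protocol


class AnalyticsStore(Protocol):
    """Hypothetical interface the rest of the system depends on."""

    def record(self, event_name: str) -> None: ...

    def count_events(self, event_name: str) -> int: ...


class InMemoryAnalytics:
    """Placeholder adapter; a provider-specific adapter would wrap the real client."""

    def __init__(self) -> None:
        self._events = []

    def record(self, event_name: str) -> None:
        self._events.append(event_name)

    def count_events(self, event_name: str) -> int:
        return self._events.count(event_name)


# Registry of available adapters, keyed by a configuration value.
# Provider-specific adapters would be added here under their own keys.
ADAPTERS: Dict[str, Callable[[], AnalyticsStore]] = {
    "memory": InMemoryAnalytics,
}


def build_analytics(backend: str) -> AnalyticsStore:
    """Create the adapter named in configuration; reject unknown names."""
    try:
        factory = ADAPTERS[backend]
    except KeyError:
        raise ValueError(f"unknown analytics backend: {backend!r}") from None
    return factory()
```

Only the registry and the adapter classes know which provider is in use, so replacing the provider touches one module instead of every caller.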
Implementation Timeline
- Audit current provider dependencies — 2-3 days
- Migrate to cloud-agnostic storage (S3 API) — 2-5 days
- Kubernetes-based compute (if not yet) — 5-10 days
- Terraform modules for abstraction — 5-10 days
- OpenTelemetry instrumentation — 3-7 days
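For planning purposes the ranges above can be added up. The short calculation below assumes the steps are done one after another by a single team; running steps in parallel would shorten the calendar time.

```python
# (step, min_days, max_days) copied from the list above.
ESTIMATES = [
    ("Audit current provider dependencies", 2, 3),
    ("Migrate to cloud-agnostic storage (S3 API)", 2, 5),
    ("Kubernetes-based compute", 5, 10),
    ("Terraform modules for abstraction", 5, 10),
    ("OpenTelemetry instrumentation", 3, 7),
]

total_min = sum(low for _, low, _ in ESTIMATES)
total_max = sum(high for _, _, high in ESTIMATES)

print(f"Total: {total_min}-{total_max} days")  # Total: 17-35 days
```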







