Setting up Spot/Preemptible Instances for batch jobs
Spot Instances (AWS) and Preemptible or Spot VMs (GCP) are spare provider capacity sold at roughly 60-90% below on-demand prices. The trade-off: the provider can reclaim them at short notice (2 minutes on AWS, about 30 seconds on GCP). This is not a problem for batch jobs that checkpoint their progress and can restart themselves.
Workloads suitable for Spot
Good fit:
- CI/CD workers (each build is a separate task)
- Image and video processing (transcoding, resize)
- ML training (checkpoint-based)
- Parsing and ETL pipelines
- Rendering
- Antivirus scans, analytics queries
Not suitable:
- Stateful databases (critical data)
- Web servers without fast replacement
- Services with strict SLAs and no DR
AWS Spot Instances: practice
Spot Fleet with multiple instance types is the key to stability. If m5.xlarge is unavailable in one AZ, Fleet takes m4.xlarge or c5.xlarge in another:
{
  "SpotFleetRequestConfig": {
    "AllocationStrategy": "capacityOptimized",
    "TargetCapacity": 10,
    "LaunchTemplateConfigs": [
      {
        "LaunchTemplateSpecification": {"LaunchTemplateId": "lt-xxx", "Version": "1"},
        "Overrides": [
          {"InstanceType": "m5.xlarge", "WeightedCapacity": 1},
          {"InstanceType": "m5a.xlarge", "WeightedCapacity": 1},
          {"InstanceType": "m4.xlarge", "WeightedCapacity": 1},
          {"InstanceType": "c5.xlarge", "WeightedCapacity": 1}
        ]
      }
    ]
  }
}
The capacityOptimized strategy reduces interruption probability by choosing pools with the highest available capacity.
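For the fleet to fall back across Availability Zones as well as instance types, the Overrides list can carry one entry per (instance type, subnet) pair, since each pair is a separate Spot capacity pool. A minimal sketch of generating that list in Python; the helper names and subnet IDs are hypothetical, and the resulting dict would still need the remaining required fields (such as IamFleetRole) before being passed to something like boto3's request_spot_fleet:

```python
from itertools import product


def build_overrides(instance_types, subnet_ids, weight=1):
    """One override per (instance type, subnet) pair, i.e. per capacity pool."""
    return [
        {"InstanceType": itype, "SubnetId": subnet, "WeightedCapacity": weight}
        for itype, subnet in product(instance_types, subnet_ids)
    ]


def build_fleet_config(launch_template_id, instance_types, subnet_ids, target_capacity):
    """Assemble the SpotFleetRequestConfig fragment shown in the JSON above."""
    return {
        "AllocationStrategy": "capacityOptimized",
        "TargetCapacity": target_capacity,
        "LaunchTemplateConfigs": [
            {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateId": launch_template_id,
                    "Version": "1",
                },
                "Overrides": build_overrides(instance_types, subnet_ids),
            }
        ],
    }


# Hypothetical subnets, one per Availability Zone
config = build_fleet_config(
    "lt-xxx",
    ["m5.xlarge", "m5a.xlarge", "m4.xlarge", "c5.xlarge"],
    ["subnet-aaa", "subnet-bbb"],
    target_capacity=10,
)
pools = config["LaunchTemplateConfigs"][0]["Overrides"]
print(len(pools))  # 4 instance types x 2 subnets = 8 pools
```

The more pools the fleet can draw from, the more room capacityOptimized has to avoid pools that are close to exhaustion.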
Handling Spot Interruption Notice
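Two minutes before reclaiming the instance, AWS publishes an interruption notice in the instance metadata: the spot/instance-action endpoint returns 404 until then and 200 afterwards. The application must detect the notice and checkpoint or finish the current task gracefully:
import sys

import requests

IMDS = 'http://169.254.169.254/latest'


def check_spot_interruption():
    """Return True once an interruption is scheduled. Poll about every 5 seconds."""
    try:
        # IMDSv2: fetch a short-lived session token first
        token = requests.put(
            f'{IMDS}/api/token',
            headers={'X-aws-ec2-metadata-token-ttl-seconds': '60'},
            timeout=1,
        ).text
        # 404 until an interruption is scheduled, then 200 with a JSON body
        response = requests.get(
            f'{IMDS}/meta-data/spot/instance-action',
            headers={'X-aws-ec2-metadata-token': token},
            timeout=1,
        )
        return response.status_code == 200
    except requests.exceptions.RequestException:
        # Metadata service unreachable: treat as "no notice"
        return False


class BatchWorker:
    def process_task(self, task):
        # Check every 100 items; lower this if 100 items can take
        # anywhere near the 2-minute notice window
        for i, item in enumerate(task.items):
            if i % 100 == 0 and check_spot_interruption():
                self.save_checkpoint(task.id, i)  # i = next unprocessed item
                sys.exit(0)  # Graceful exit, task will restart
            self.process_item(item)
        task.mark_complete()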
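The worker calls save_checkpoint but does not show it, nor how a restarted task picks up where it left off. A minimal sketch, assuming checkpoints are small JSON files on storage that outlives the instance; the directory layout and function names are hypothetical, and in practice the target would often be S3 or a database rather than a local path:

```python
import json
import os
import tempfile


def save_checkpoint(checkpoint_dir, task_id, next_index):
    """Atomically record the index of the next unprocessed item."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    final_path = os.path.join(checkpoint_dir, f"{task_id}.json")
    # Write to a temp file, then rename: being killed mid-write
    # cannot leave a half-written checkpoint behind
    fd, tmp_path = tempfile.mkstemp(dir=checkpoint_dir)
    with os.fdopen(fd, "w") as f:
        json.dump({"task_id": task_id, "next_index": next_index}, f)
    os.replace(tmp_path, final_path)


def load_checkpoint(checkpoint_dir, task_id):
    """Return the index to resume from, or 0 if the task never checkpointed."""
    path = os.path.join(checkpoint_dir, f"{task_id}.json")
    try:
        with open(path) as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0


def resume(items, start_index, process_item):
    """Process only the items that were not finished before the interruption."""
    for item in items[start_index:]:
        process_item(item)


# Usage: the first run was interrupted before item 300; the restart skips 0..299
ckpt_dir = tempfile.mkdtemp()
save_checkpoint(ckpt_dir, "task-42", 300)
start = load_checkpoint(ckpt_dir, "task-42")
processed = []
resume(list(range(500)), start, processed.append)
print(start, len(processed))
```

Storing the index of the next unprocessed item (rather than the last processed one) matches the worker loop, which checks for interruption before processing item i.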
AWS EventBridge for Spot Interruption
{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Spot Instance Interruption Warning"],
  "detail": {
    "instance-action": ["terminate"]
  }
}
Typical flow: EventBridge rule → Lambda function, which triggers a checkpoint save, removes the instance from the worker pool, and requeues the task.
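A minimal sketch of what such a Lambda handler might look like. The event fields (detail-type, detail.instance-id, detail.instance-action) follow the EC2 Spot interruption warning event; everything else is an assumption: the three injected callables are hypothetical stand-ins for whatever task registry and queue the system uses, which also keeps the sketch runnable without AWS access.

```python
def extract_interruption(event):
    """Return (instance_id, action) from a Spot interruption warning, else None."""
    if event.get("detail-type") != "EC2 Spot Instance Interruption Warning":
        return None
    detail = event.get("detail", {})
    return detail.get("instance-id"), detail.get("instance-action")


def make_handler(find_tasks_for_instance, requeue_task, remove_from_pool):
    """Build a Lambda-style handler around hypothetical integration callables."""

    def handler(event, context):
        parsed = extract_interruption(event)
        if parsed is None:
            return {"handled": False}
        instance_id, action = parsed
        # Stop routing new work to the instance before requeueing its tasks
        remove_from_pool(instance_id)
        requeued = []
        for task_id in find_tasks_for_instance(instance_id):
            requeue_task(task_id)
            requeued.append(task_id)
        return {"handled": True, "instance_id": instance_id,
                "action": action, "requeued": requeued}

    return handler


# Usage with in-memory stand-ins for the registry and the queue
running = {"i-0abc": ["task-1", "task-2"]}
queue, pool = [], {"i-0abc", "i-0def"}
handler = make_handler(
    find_tasks_for_instance=lambda iid: running.get(iid, []),
    requeue_task=queue.append,
    remove_from_pool=pool.discard,
)
sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "detail": {"instance-id": "i-0abc", "instance-action": "terminate"},
}
result = handler(sample_event, None)
print(result["requeued"], sorted(pool))
```

Requeueing should be idempotent, because the worker on the instance may also exit and trigger a restart of the same task.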
Kubernetes with Spot Nodes
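Karpenter (AWS) automatically selects the instance type and capacity type (Spot or on-demand) and handles interruptions. The example below uses the older v1alpha5 Provisioner API; newer Karpenter releases replace it with NodePool: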
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: batch-workers
spec:
  requirements:
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot", "on-demand"]
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values: ["m5.xlarge", "m5a.xlarge", "m4.xlarge", "c5.xlarge"]
  taints:
    - key: batch
      effect: NoSchedule
  consolidation:
    enabled: true
On a Spot interruption notice, Karpenter cordons and drains the node, and the evicted pods are rescheduled onto other nodes.
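When a node is drained, Kubernetes sends SIGTERM to the containers of each evicted pod and follows up with SIGKILL once the pod's terminationGracePeriodSeconds has elapsed. A batch worker can use that window to checkpoint. A minimal sketch, assuming the worker is a plain Python process running in the main thread; the class and function names are hypothetical:

```python
import signal


class ShutdownFlag:
    """Becomes set when the process receives SIGTERM (e.g. during a node drain)."""

    def __init__(self):
        self.requested = False
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Only set a flag here; the main loop does the actual checkpointing
        self.requested = True


def run(items, process_item, save_checkpoint, flag):
    """Process items until done or until shutdown is requested.

    Returns the index of the next unprocessed item.
    """
    for i, item in enumerate(items):
        if flag.requested:
            save_checkpoint(i)
            return i
        process_item(item)
    return len(items)
```

Checking the flag between items keeps the signal handler trivial; the grace period then has to be longer than the slowest single item plus the time to write a checkpoint.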
GCP Preemptible / Spot VMs
GCP Preemptible VMs: maximum 24-hour lifetime and a 30-second preemption notice (4 times shorter than the 2 minutes on AWS). Spot VMs: no 24-hour limit; they are reclaimed only when GCP needs the capacity back.
gcloud compute instances create batch-worker \
  --machine-type=n2-standard-4 \
  --provisioning-model=SPOT \
  --instance-termination-action=STOP \
  --zone=us-central1-a
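On GCP, a worker can detect preemption by reading the instance's preempted value from the metadata server, which returns TRUE once the VM has been preempted. A minimal sketch using only the standard library; the function names are hypothetical, and the fetch function is injectable so the logic can be exercised outside GCP:

```python
import urllib.error
import urllib.request

PREEMPTED_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
)


def fetch_preempted_value(timeout=1.0):
    """Read the raw 'TRUE'/'FALSE' value from the GCE metadata server."""
    request = urllib.request.Request(
        PREEMPTED_URL, headers={"Metadata-Flavor": "Google"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def is_preempted(fetch=fetch_preempted_value):
    """Return True if the VM reports preemption; False if not or if unreachable."""
    try:
        return fetch().strip().upper() == "TRUE"
    except (urllib.error.URLError, OSError):
        # Not on GCE, or metadata server unavailable: treat as "not preempted"
        return False


# Usage with stand-in fetchers (no network needed)
print(is_preempted(lambda: "TRUE"), is_preempted(lambda: "FALSE\n"))
```

With only about 30 seconds of notice, checkpoints on GCP need to be small and frequent; a checkpoint that takes a minute to write will not finish in time.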
Efficiency and savings
Real examples:
- CI/CD pipeline: moving from on-demand t3.xlarge to Spot → 70% savings
- ML training on Spot p3.2xlarge: $0.918/hour instead of $3.06/hour
Overhead from interruptions and restarts: typically 5-15% extra time. Total savings: 60-80% with proper checkpoint implementation.
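The effect of restart overhead on the savings can be checked with a line of arithmetic. A minimal sketch using the p3.2xlarge prices quoted above, and assuming the overhead simply lengthens the billed Spot runtime by the given fraction (the function name is hypothetical):

```python
def effective_savings(on_demand_price, spot_price, overhead):
    """Fraction saved versus on-demand after paying for extra Spot runtime."""
    spot_cost = spot_price * (1 + overhead)
    return 1 - spot_cost / on_demand_price


for overhead in (0.05, 0.15):
    saved = effective_savings(3.06, 0.918, overhead)
    print(f"overhead {overhead:.0%}: savings {saved:.1%}")
```

For these prices the nominal 70% discount drops to about 68.5% at 5% overhead and about 65.5% at 15% overhead, which is consistent with the 60-80% range.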
Setup timeline
- Spot Fleet / Launch Template — 1-2 days
- Interruption handling in application — 2-3 days
- Kubernetes Karpenter with Spot — 2-3 days
- Testing with interruption simulation — 1 day







