# Monitoring cloud infrastructure costs
Without cost monitoring, billing surprises become the norm: someone leaves a GPU instance running, a NAT Gateway generates unexpected traffic, S3 quietly accumulates gigabytes of stale logs. Systematic cost monitoring turns the monthly bill from a surprise into a predictable number.
## AWS Cost Explorer and Cost Anomaly Detection

AWS Cost Anomaly Detection automatically flags abnormal spending with an ML model, so no threshold tuning is needed:
```bash
# Create a monitor via the AWS CLI
aws ce create-anomaly-monitor \
  --anomaly-monitor '{
    "MonitorName": "service-monitor",
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE"
  }'

# Create a subscription (notification on anomaly)
aws ce create-anomaly-subscription \
  --anomaly-subscription '{
    "SubscriptionName": "cost-anomaly-alerts",
    "Threshold": 20,
    "Frequency": "DAILY",
    "MonitorArnList": ["arn:aws:ce::123456789:anomalymonitor/xxx"],
    "Subscribers": [{
      "Address": "arn:aws:sns:eu-central-1:123456789:cost-alerts",
      "Type": "SNS"
    }]
  }'
```
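Detected anomalies can also be pulled back with the `get_anomalies` API and triaged in code. A minimal sketch of the triage step, assuming the documented response shape (`Impact.TotalImpact` in USD, optional `RootCauses` per anomaly); the `min_impact` cutoff and the output format are arbitrary choices, not part of the API:

```python
def top_anomalies(response, min_impact=10.0):
    """Filter a Cost Explorer GetAnomalies response down to anomalies
    whose total cost impact is at least min_impact USD, largest first."""
    found = []
    for anomaly in response.get('Anomalies', []):
        impact = anomaly['Impact']['TotalImpact']
        if impact < min_impact:
            continue
        services = [rc.get('Service', '?') for rc in anomaly.get('RootCauses', [])]
        found.append({
            'id': anomaly['AnomalyId'],
            'impact': round(impact, 2),
            'services': services,
        })
    return sorted(found, key=lambda a: a['impact'], reverse=True)

# Example with a trimmed-down response
sample = {'Anomalies': [
    {'AnomalyId': 'a-1', 'Impact': {'TotalImpact': 4.2}, 'RootCauses': []},
    {'AnomalyId': 'a-2', 'Impact': {'TotalImpact': 87.5},
     'RootCauses': [{'Service': 'Amazon EC2'}]},
]}
print(top_anomalies(sample))
# → [{'id': 'a-2', 'impact': 87.5, 'services': ['Amazon EC2']}]
```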
The Cost Explorer API provides the same data programmatically:
```python
import boto3
from datetime import date, timedelta

# The Cost Explorer API is served from us-east-1 only
ce = boto3.client('ce', region_name='us-east-1')

def get_daily_costs_by_service(days=30):
    end = date.today()
    start = end - timedelta(days=days)
    response = ce.get_cost_and_usage(
        TimePeriod={
            'Start': start.strftime('%Y-%m-%d'),
            'End': end.strftime('%Y-%m-%d')
        },
        Granularity='DAILY',
        Metrics=['UnblendedCost'],
        GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
    )
    costs = {}
    for result in response['ResultsByTime']:
        date_str = result['TimePeriod']['Start']
        for group in result['Groups']:
            service = group['Keys'][0]
            amount = float(group['Metrics']['UnblendedCost']['Amount'])
            costs.setdefault(service, {})[date_str] = amount
    return costs

# Find services with > 50% cost growth week-over-week
def find_cost_spikes(threshold_pct=50):
    costs = get_daily_costs_by_service(14)
    spikes = []
    for service, daily in costs.items():
        dates = sorted(daily.keys())
        if len(dates) < 14:
            continue  # skip services without a full two weeks of data
        week1_avg = sum(daily[d] for d in dates[:7]) / 7
        week2_avg = sum(daily[d] for d in dates[7:]) / 7
        if week1_avg > 0 and week2_avg > week1_avg * (1 + threshold_pct / 100):
            spikes.append({
                'service': service,
                'prev_avg': round(week1_avg, 2),
                'curr_avg': round(week2_avg, 2),
                'increase_pct': round((week2_avg / week1_avg - 1) * 100, 1)
            })
    return sorted(spikes, key=lambda x: x['increase_pct'], reverse=True)
```
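The spike report is most useful when it reaches people. One sketch of that last mile: formatting `find_cost_spikes` output as a Slack incoming-webhook payload. The webhook URL and the message layout are assumptions for illustration, not part of any AWS API:

```python
import json
import urllib.request

def build_spike_message(spikes, limit=5):
    """Format a spike list (as returned by find_cost_spikes)
    into a Slack-style text payload; None if there is nothing to report."""
    if not spikes:
        return None
    lines = [':warning: Cost spikes over the last 7 days:']
    for s in spikes[:limit]:
        lines.append(
            f"- {s['service']}: ${s['prev_avg']:.2f}/day -> "
            f"${s['curr_avg']:.2f}/day (+{s['increase_pct']}%)"
        )
    return {'text': '\n'.join(lines)}

def post_to_slack(webhook_url, payload):
    # webhook_url is a hypothetical Slack incoming-webhook endpoint;
    # Slack expects a JSON body with a "text" field
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode(),
        headers={'Content-Type': 'application/json'},
    )
    urllib.request.urlopen(req)
```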
## Infracost for pre-deploy estimation

Infracost shows the cost impact of Terraform changes before they are applied:
```yaml
# .github/workflows/infracost.yml
name: Infracost
on: [pull_request]
permissions:
  contents: read
  pull-requests: write  # needed by "infracost comment"
jobs:
  infracost:
    runs-on: ubuntu-latest
    steps:
      - uses: infracost/actions/setup@v3
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}
      # The baseline must come from the base branch; generating both
      # reports from the PR checkout would make the diff always empty
      - name: Checkout base branch
        uses: actions/checkout@v4
        with:
          ref: ${{ github.event.pull_request.base.ref }}
      - name: Generate Infracost cost estimate baseline
        run: |
          infracost breakdown --path=. \
            --format=json \
            --out-file=/tmp/infracost-base.json
      - name: Checkout PR branch
        uses: actions/checkout@v4
      - name: Generate Infracost diff
        run: |
          infracost diff --path=. \
            --format=json \
            --compare-to=/tmp/infracost-base.json \
            --out-file=/tmp/infracost.json
      - name: Post Infracost comment
        run: |
          infracost comment github \
            --path=/tmp/infracost.json \
            --repo=$GITHUB_REPOSITORY \
            --github-token=${{ secrets.GITHUB_TOKEN }} \
            --pull-request=${{ github.event.pull_request.number }} \
            --behavior=update
```
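The same JSON report can gate the pipeline, not just comment on it. Infracost's JSON output carries a top-level `diffTotalMonthlyCost` field (a string, in USD); a sketch of a budget check, with the $200 limit as an arbitrary example:

```python
import json
import sys

def within_budget(report, max_monthly_increase=200.0):
    """Check an Infracost diff report against a monthly budget.
    Returns (ok, projected_increase_in_usd)."""
    increase = float(report.get('diffTotalMonthlyCost') or 0)
    return increase <= max_monthly_increase, increase

if __name__ == '__main__' and len(sys.argv) > 1:
    # e.g. python check_budget.py /tmp/infracost.json
    with open(sys.argv[1]) as f:
        report = json.load(f)
    ok, increase = within_budget(report)
    print(f'Projected monthly increase: ${increase:.2f}')
    if not ok:
        sys.exit(1)  # fail the CI job
```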
## CloudWatch Billing Alarms
```hcl
# billing_alarms.tf
# Billing metrics require "Receive Billing Alerts" to be enabled in
# account preferences, and they are published only in us-east-1.
resource "aws_cloudwatch_metric_alarm" "monthly_estimate" {
  alarm_name          = "monthly-bill-estimate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "EstimatedCharges"
  namespace           = "AWS/Billing"
  period              = 86400 # 1 day
  statistic           = "Maximum"
  threshold           = 500 # $500 alert threshold
  alarm_description   = "Monthly AWS estimate exceeds $500"
  alarm_actions       = [aws_sns_topic.billing_alerts.arn]

  dimensions = {
    Currency = "USD"
  }
}

# Alert for a specific service. Note that EstimatedCharges is the
# month-to-date total, not a daily value.
resource "aws_cloudwatch_metric_alarm" "ec2_cost" {
  alarm_name          = "ec2-monthly-cost"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "EstimatedCharges"
  namespace           = "AWS/Billing"
  period              = 86400
  statistic           = "Maximum"
  threshold           = 100
  alarm_actions       = [aws_sns_topic.billing_alerts.arn]

  dimensions = {
    Currency    = "USD"
    ServiceName = "AmazonEC2"
  }
}
```
## FinOps dashboard in Grafana

Grafana with the AWS CloudWatch data source and Metrics Insights queries can visualize the billing metrics:
```json
{
  "panels": [{
    "title": "Month-to-date Cost by Service",
    "type": "timeseries",
    "targets": [{
      "dimensions": {"Currency": "USD"},
      "expression": "SELECT SUM(EstimatedCharges) FROM SCHEMA(\"AWS/Billing\", Currency,ServiceName) GROUP BY ServiceName",
      "metricQueryType": 1,
      "refId": "A"
    }]
  }, {
    "title": "Cost by Linked Account",
    "type": "piechart",
    "targets": [{
      "dimensions": {"Currency": "USD"},
      "expression": "SELECT SUM(EstimatedCharges) FROM SCHEMA(\"AWS/Billing\", Currency,LinkedAccount) GROUP BY LinkedAccount",
      "metricQueryType": 1,
      "refId": "B"
    }]
  }]
}
```

Note that cost allocation tags are not CloudWatch dimensions, so `AWS/Billing` cannot be grouped by a tag such as `Environment`; the available dimensions are `Currency`, `ServiceName`, and `LinkedAccount`.
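For per-tag breakdowns (environment, team, project), query the Cost Explorer API with a `TAG` grouping instead. A sketch of the response handling, assuming an `Environment` cost allocation tag has been activated in the Billing console (until activation the grouping returns nothing):

```python
# GroupBy parameter for ce.get_cost_and_usage (Cost Explorer)
GROUP_BY_ENV_TAG = [{'Type': 'TAG', 'Key': 'Environment'}]

def sum_by_tag(response):
    """Collapse a get_cost_and_usage response grouped by a tag into
    {tag_value: total_cost}. Tag group keys come back in the form
    'Environment$prod'; untagged resources appear as 'Environment$'."""
    totals = {}
    for period in response['ResultsByTime']:
        for group in period['Groups']:
            tag_value = group['Keys'][0].split('$', 1)[-1] or 'untagged'
            amount = float(group['Metrics']['UnblendedCost']['Amount'])
            totals[tag_value] = totals.get(tag_value, 0.0) + amount
    return totals

# Example with a trimmed-down response
sample = {'ResultsByTime': [{'Groups': [
    {'Keys': ['Environment$prod'],
     'Metrics': {'UnblendedCost': {'Amount': '120.5'}}},
    {'Keys': ['Environment$'],
     'Metrics': {'UnblendedCost': {'Amount': '7.3'}}},
]}]}
print(sum_by_tag(sample))  # → {'prod': 120.5, 'untagged': 7.3}
```

The per-tag totals can then be pushed to Grafana through any source it can read, e.g. a custom metric or a database.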
## Implementation timeline

- AWS Cost Anomaly Detection + SNS notifications — 1 day
- CloudWatch billing alarms — 0.5 days
- Infracost in the CI/CD pipeline — 1-2 days
- Grafana cost dashboard — 1-2 days
- Tagging setup for cost allocation — 1-3 days (depends on resource count)