Automated Cloud Cost Anomaly Detection and Alerting
The $14,000 Friday Night Surprise
A developer at a previous company spun up a batch of p4d.24xlarge GPU instances for an ML experiment on a Friday afternoon. Each instance costs $32.77/hour. They launched 6 of them. Then they went home for the weekend.
By Monday morning, the bill was $14,162. Nobody noticed because there were no cost alerts. The instances ran for 72 hours straight doing nothing after the job completed in the first 3 hours.
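The math checks out, give or take a few dollars of storage and data transfer:

```python
# Back-of-the-envelope check of the weekend bill (compute only)
hourly_rate = 32.77   # p4d.24xlarge on-demand, USD/hour
instances = 6
hours = 72            # Friday afternoon to Monday morning

compute_cost = hourly_rate * instances * hours
print(f"${compute_cost:,.2f}")          # $14,156.64

# The job finished in the first 3 hours; everything after that was waste
wasted = hourly_rate * instances * (hours - 3)
print(f"${wasted:,.2f} of pure waste")  # $13,566.78 of pure waste
```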
This happens more than you'd think:
| Incident Type | How It Happens | Typical Cost Impact |
|---|---|---|
| Forgotten GPU instances | ML experiments left running | $5,000-$50,000 |
| NAT Gateway data explosion | Misconfigured logging or data pipeline | $2,000-$20,000/mo |
| S3 request floods | Retry loops hitting S3 millions of times | $1,000-$10,000 |
| Undeleted EBS snapshots | Snapshot lifecycle policies not set | $500-$5,000/mo (cumulative) |
| RDS storage auto-scaling | Runaway storage growth | $1,000-$8,000/mo |
| DynamoDB on-demand spikes | Traffic spike without capacity planning | $2,000-$30,000 |
Every one of these is catchable with anomaly detection. Let's set it up.
AWS Cost Anomaly Detection (Built-In)
AWS provides a native anomaly detection service in Cost Explorer. It uses machine learning to baseline your spend and alert on deviations. Start here — it's free and takes 5 minutes.
Setting Up via CLI
# Create a cost anomaly monitor for the entire account
aws ce create-anomaly-monitor \
  --anomaly-monitor '{
    "MonitorName": "account-wide-monitor",
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE"
  }'

# Create a monitor for a specific cost allocation tag
aws ce create-anomaly-monitor \
  --anomaly-monitor '{
    "MonitorName": "team-level-monitor",
    "MonitorType": "CUSTOM",
    "MonitorSpecification": {
      "Tags": {
        "Key": "team",
        "Values": ["search", "data-engineering", "platform"],
        "MatchOptions": ["EQUALS"]
      }
    }
  }'
Create Alert Subscriptions
# Alert when anomaly impact exceeds $100
MONITOR_ARN=$(aws ce list-anomaly-monitors \
  --query 'AnomalyMonitors[?MonitorName==`account-wide-monitor`].MonitorArn' \
  --output text)

# Note: IMMEDIATE alerts deliver via SNS only. Email subscribers must use a
# DAILY or WEEKLY summary frequency (create a second subscription for that).
aws ce create-anomaly-subscription \
  --anomaly-subscription "{
    \"SubscriptionName\": \"cost-anomaly-alerts\",
    \"MonitorArnList\": [\"$MONITOR_ARN\"],
    \"Subscribers\": [
      {\"Address\": \"arn:aws:sns:us-east-1:123456789012:cost-alerts\", \"Type\": \"SNS\"}
    ],
    \"Threshold\": 100,
    \"Frequency\": \"IMMEDIATE\"
  }"
Terraform Configuration
resource "aws_ce_anomaly_monitor" "service_monitor" {
  name              = "service-level-anomaly-monitor"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}

# Unlike the CLI example above, this monitor uses ABSENT with an empty Values
# list, so it watches spend on resources missing the "team" tag entirely
# (i.e. untagged workloads that per-team monitors would never see).
resource "aws_ce_anomaly_monitor" "team_monitor" {
  name         = "team-level-anomaly-monitor"
  monitor_type = "CUSTOM"

  monitor_specification = jsonencode({
    Tags = {
      Key          = "team"
      Values       = []
      MatchOptions = ["ABSENT"]
    }
  })
}

resource "aws_ce_anomaly_subscription" "alerts" {
  name = "cost-anomaly-subscription"

  monitor_arn_list = [
    aws_ce_anomaly_monitor.service_monitor.arn,
    aws_ce_anomaly_monitor.team_monitor.arn,
  ]

  subscriber {
    type    = "SNS"
    address = aws_sns_topic.cost_alerts.arn
  }

  threshold_expression {
    dimension {
      key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
      values        = ["100"]
      match_options = ["GREATER_THAN_OR_EQUAL"]
    }
  }

  frequency = "IMMEDIATE"
}

resource "aws_sns_topic" "cost_alerts" {
  name = "cloud-cost-anomaly-alerts"
}
Custom Lambda-Based Anomaly Detection
AWS Cost Anomaly Detection is good for broad monitoring, but it has limitations: 24-48 hour detection lag, no custom thresholds per service, and limited integration options. For faster, more granular detection, build a custom monitor.
Architecture
EventBridge (hourly) ──▶ Lambda ──▶ Cost Explorer API ──▶ compare vs 14-day baseline ──▶ Slack/PagerDuty

The implementation below queries the Cost Explorer API directly; swapping in CUR data via Athena gives finer, resource-level granularity at the cost of more setup.
The Detection Lambda
# cost_anomaly_detector.py
import boto3
import json
import os
from datetime import datetime, timedelta
from urllib.request import urlopen, Request

ce_client = boto3.client('ce')

SLACK_WEBHOOK = os.environ['SLACK_WEBHOOK_URL']
THRESHOLD_PERCENT = float(os.environ.get('THRESHOLD_PERCENT', '25'))
MIN_DOLLAR_THRESHOLD = float(os.environ.get('MIN_DOLLAR_THRESHOLD', '50'))

def get_daily_cost(start_date, end_date, granularity='DAILY'):
    """Get cost broken down by service."""
    response = ce_client.get_cost_and_usage(
        TimePeriod={
            'Start': start_date.strftime('%Y-%m-%d'),
            'End': end_date.strftime('%Y-%m-%d')
        },
        Granularity=granularity,
        Metrics=['UnblendedCost'],
        GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
    )
    costs = {}
    for result in response['ResultsByTime']:
        for group in result['Groups']:
            service = group['Keys'][0]
            amount = float(group['Metrics']['UnblendedCost']['Amount'])
            costs[service] = costs.get(service, 0) + amount
    return costs

def calculate_baseline(days=14):
    """Calculate average daily cost per service over the baseline window."""
    end = datetime.utcnow().replace(hour=0, minute=0, second=0, microsecond=0)
    start = end - timedelta(days=days)
    total_costs = get_daily_cost(start, end)
    return {svc: cost / days for svc, cost in total_costs.items()}

def detect_anomalies(current_costs, baseline_costs):
    """Compare current costs to baseline and flag anomalies."""
    anomalies = []
    for service, current in current_costs.items():
        baseline = baseline_costs.get(service, 0)
        if baseline == 0 and current > MIN_DOLLAR_THRESHOLD:
            anomalies.append({
                'service': service,
                'current': current,
                'baseline': 0,
                'deviation_pct': 100,
                'excess_spend': current,
                'type': 'NEW_SERVICE'
            })
            continue
        if baseline > 0:
            deviation = ((current - baseline) / baseline) * 100
            excess = current - baseline
            if deviation > THRESHOLD_PERCENT and excess > MIN_DOLLAR_THRESHOLD:
                anomalies.append({
                    'service': service,
                    'current': round(current, 2),
                    'baseline': round(baseline, 2),
                    'deviation_pct': round(deviation, 1),
                    'excess_spend': round(excess, 2),
                    'type': 'SPIKE'
                })
    return sorted(anomalies, key=lambda x: x['excess_spend'], reverse=True)

def send_slack_alert(anomalies, total_current, total_baseline):
    """Send anomaly report to Slack."""
    total_excess = sum(a['excess_spend'] for a in anomalies)
    blocks = [
        {
            "type": "header",
            "text": {
                "type": "plain_text",
                "text": f"Cost Anomaly Alert — ${total_excess:,.2f} above baseline"
            }
        },
        {
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": (
                    f"*Today's spend:* ${total_current:,.2f} | "
                    f"*14-day avg:* ${total_baseline:,.2f} | "
                    f"*Deviation:* {((total_current-total_baseline)/total_baseline)*100:.1f}%"
                )
            }
        },
        {"type": "divider"}
    ]
    for anomaly in anomalies[:5]:  # Top 5 anomalies
        emoji = "🆕" if anomaly['type'] == 'NEW_SERVICE' else "📈"
        blocks.append({
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": (
                    f"{emoji} *{anomaly['service']}*\n"
                    f"Today: ${anomaly['current']:,.2f} | "
                    f"Baseline: ${anomaly['baseline']:,.2f} | "
                    f"Excess: *${anomaly['excess_spend']:,.2f}* "
                    f"(+{anomaly['deviation_pct']}%)"
                )
            }
        })
    payload = json.dumps({"blocks": blocks}).encode('utf-8')
    req = Request(SLACK_WEBHOOK, data=payload,
                  headers={'Content-Type': 'application/json'})
    urlopen(req, timeout=10)

def handler(event, context):
    """Main Lambda handler — runs hourly."""
    today = datetime.utcnow().replace(hour=0, minute=0, second=0, microsecond=0)
    tomorrow = today + timedelta(days=1)
    current_costs = get_daily_cost(today, tomorrow)
    baseline_costs = calculate_baseline(days=14)
    anomalies = detect_anomalies(current_costs, baseline_costs)
    if anomalies:
        total_current = sum(current_costs.values())
        total_baseline = sum(baseline_costs.values())
        send_slack_alert(anomalies, total_current, total_baseline)
    return {
        'anomalies_detected': len(anomalies),
        'total_excess_spend': sum(a['excess_spend'] for a in anomalies),
        'details': anomalies
    }
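The threshold logic can be sanity-checked locally without AWS credentials. This standalone sketch copies the spike test out of detect_anomalies and runs it on made-up numbers:

```python
# Same thresholds as the Lambda's environment defaults
THRESHOLD_PERCENT = 25.0
MIN_DOLLAR_THRESHOLD = 50.0

def is_spike(current, baseline):
    """The core comparison from detect_anomalies: relative AND absolute tests."""
    if baseline == 0:
        return current > MIN_DOLLAR_THRESHOLD  # NEW_SERVICE case
    deviation = ((current - baseline) / baseline) * 100
    return deviation > THRESHOLD_PERCENT and (current - baseline) > MIN_DOLLAR_THRESHOLD

print(is_spike(620.0, 400.0))  # True: +55% and $220 excess
print(is_spike(480.0, 400.0))  # False: only +20%
print(is_spike(60.0, 40.0))    # False: +50%, but only $20 excess
print(is_spike(75.0, 0.0))     # True: new service over $50
```

Requiring both the percentage and the dollar test is what keeps small, volatile services from paging anyone.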
Deploy with Terraform
resource "aws_lambda_function" "cost_anomaly_detector" {
  function_name = "cost-anomaly-detector"
  runtime       = "python3.12"
  handler       = "cost_anomaly_detector.handler"
  architectures = ["arm64"]
  memory_size   = 256
  timeout       = 120
  role          = aws_iam_role.anomaly_detector.arn

  filename         = data.archive_file.anomaly_detector.output_path
  source_code_hash = data.archive_file.anomaly_detector.output_base64sha256

  environment {
    variables = {
      SLACK_WEBHOOK_URL    = var.slack_webhook_url
      THRESHOLD_PERCENT    = "25"
      MIN_DOLLAR_THRESHOLD = "50"
    }
  }
}

resource "aws_cloudwatch_event_rule" "hourly_check" {
  name                = "cost-anomaly-hourly-check"
  schedule_expression = "rate(1 hour)"
}

resource "aws_cloudwatch_event_target" "anomaly_detector" {
  rule = aws_cloudwatch_event_rule.hourly_check.name
  arn  = aws_lambda_function.cost_anomaly_detector.arn
}

resource "aws_lambda_permission" "allow_eventbridge" {
  statement_id  = "AllowEventBridge"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.cost_anomaly_detector.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.hourly_check.arn
}

resource "aws_iam_role" "anomaly_detector" {
  name = "cost-anomaly-detector-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy" "anomaly_detector" {
  name = "cost-anomaly-detector-policy"
  role = aws_iam_role.anomaly_detector.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "ce:GetCostAndUsage",
          "ce:GetAnomalies"
        ]
        Resource = "*"
      },
      {
        Effect = "Allow"
        Action = [
          "logs:CreateLogGroup",
          "logs:CreateLogStream",
          "logs:PutLogEvents"
        ]
        Resource = "arn:aws:logs:*:*:*"
      }
    ]
  })
}
Setting Up Budget Alerts as a Safety Net
Anomaly detection catches spikes. Budget alerts catch gradual creep. Use both.
resource "aws_budgets_budget" "monthly_total" {
  name         = "monthly-total-budget"
  budget_type  = "COST"
  limit_amount = "50000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["finops@company.com"]
    subscriber_sns_topic_arns  = [aws_sns_topic.cost_alerts.arn]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = ["finops@company.com", "engineering-lead@company.com"]
    subscriber_sns_topic_arns  = [aws_sns_topic.cost_alerts.arn]
  }
}

# Per-service budgets for the usual suspects
resource "aws_budgets_budget" "ec2_budget" {
  name         = "ec2-monthly-budget"
  budget_type  = "COST"
  limit_amount = "25000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  cost_filter {
    name   = "Service"
    values = ["Amazon Elastic Compute Cloud - Compute"]
  }

  notification {
    comparison_operator       = "GREATER_THAN"
    threshold                 = 90
    threshold_type            = "PERCENTAGE"
    notification_type         = "ACTUAL"
    subscriber_sns_topic_arns = [aws_sns_topic.cost_alerts.arn]
  }
}
Multi-Layer Alerting Strategy
No single alert catches everything. Layer your defenses:
| Layer | Tool | Detection Speed | Coverage | Cost |
|---|---|---|---|---|
| 1 — Budget alerts | AWS Budgets | Same day | Total spend thresholds | Free |
| 2 — AWS Anomaly Detection | Cost Anomaly Detection | 24-48 hours | ML-based per-service | Free |
| 3 — Custom Lambda monitor | Lambda + CUR | 1 hour | Custom rules, per-tag | ~$5/mo |
| 4 — CloudWatch billing alarm | CloudWatch billing metrics | 4-6 hours | Account-level totals | Free |
| 5 — Weekly report | Lambda + S3 + SES | Weekly | Trend analysis | ~$2/mo |
CloudWatch Billing Alarm (Layer 4)
# EstimatedCharges is the cumulative month-to-date bill. It is published only
# in us-east-1, and only after "Receive Billing Alerts" is enabled in the
# account's billing preferences.
resource "aws_cloudwatch_metric_alarm" "billing_alarm" {
  alarm_name          = "month-to-date-billing-alarm"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "EstimatedCharges"
  namespace           = "AWS/Billing"
  period              = 21600 # 6 hours
  statistic           = "Maximum"
  threshold           = 2000 # Alert once month-to-date charges exceed $2K
  alarm_actions       = [aws_sns_topic.cost_alerts.arn]

  dimensions = {
    Currency = "USD"
  }
}
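Because EstimatedCharges is cumulative, a flat threshold on it is either too loose at the start of the month or guaranteed to fire by the end. One refinement (a sketch; the $50K figure matches the budget above) is to pro-rate the monthly budget by day of month plus some headroom:

```python
import calendar
from datetime import date

def month_to_date_threshold(monthly_budget, today, headroom=1.25):
    """Expected month-to-date spend plus 25% headroom."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return monthly_budget * (today.day / days_in_month) * headroom

# $50K monthly budget, checked on October 15th
print(round(month_to_date_threshold(50_000, date(2024, 10, 15)), 2))  # 30241.94
```

A small scheduled job could push this value into the alarm each morning via CloudWatch's PutMetricAlarm; that scheduling piece is left as an exercise.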
Automatic Remediation
For known patterns, go beyond alerting — auto-remediate.
# auto_remediation.py
import boto3
from datetime import datetime, timezone

ec2 = boto3.client('ec2')

# get_instance_cost() and notify_slack() are assumed helpers defined elsewhere
# (e.g. a static price lookup table and a Slack webhook poster).

def stop_untagged_gpu_instances(event, context):
    """Find and stop GPU instances without a 'keep-alive' tag."""
    # GPU/accelerator instance-family prefixes; extend as needed
    gpu_types = ['p4d', 'p4de', 'p3', 'p3dn', 'p5', 'g5', 'g4dn', 'g6', 'trn1', 'inf2']
    response = ec2.describe_instances(
        Filters=[
            {'Name': 'instance-state-name', 'Values': ['running']},
        ]
    )
    stopped = []
    for reservation in response['Reservations']:
        for instance in reservation['Instances']:
            instance_type = instance['InstanceType']
            family = instance_type.split('.')[0]
            if family not in gpu_types:
                continue
            tags = {t['Key']: t['Value'] for t in instance.get('Tags', [])}
            # Skip instances with keep-alive tag
            if tags.get('keep-alive') == 'true':
                continue
            # Stop instances running more than 4 hours without keep-alive
            launch_time = instance['LaunchTime']  # timezone-aware UTC
            hours_running = (
                datetime.now(timezone.utc) - launch_time
            ).total_seconds() / 3600
            if hours_running > 4:
                ec2.stop_instances(InstanceIds=[instance['InstanceId']])
                stopped.append({
                    'instance_id': instance['InstanceId'],
                    'type': instance_type,
                    'hours_running': round(hours_running, 1),
                    'hourly_cost': get_instance_cost(instance_type),
                    'saved': round(get_instance_cost(instance_type) * hours_running, 2)
                })
    if stopped:
        notify_slack(stopped)
    return {'stopped_instances': len(stopped), 'details': stopped}
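The stop decision is easy to unit-test if it's pulled out into a pure function. This sketch mirrors the filter above (the function name and structure are mine, not part of the handler):

```python
from datetime import datetime, timedelta, timezone

GPU_FAMILIES = {'p4d', 'p3', 'p5', 'g5', 'g4dn', 'g6', 'trn1', 'inf2'}

def should_stop(instance_type, tags, launch_time, now, max_hours=4):
    """True if a GPU instance without keep-alive has run past max_hours."""
    if instance_type.split('.')[0] not in GPU_FAMILIES:
        return False
    if tags.get('keep-alive') == 'true':
        return False
    hours_running = (now - launch_time).total_seconds() / 3600
    return hours_running > max_hours

now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
six_hours_ago = now - timedelta(hours=6)

print(should_stop('p4d.24xlarge', {}, six_hours_ago, now))                      # True
print(should_stop('p4d.24xlarge', {'keep-alive': 'true'}, six_hours_ago, now))  # False
print(should_stop('m5.large', {}, six_hours_ago, now))                          # False
```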
What Good Alerting Looks Like
After implementing the full stack, here's what the alert cadence should look like:
| Alert Type | Frequency | Action Required |
|---|---|---|
| Budget at 80% of monthly target | Monthly | Review and forecast |
| Service cost anomaly (AWS native) | As detected | Investigate root cause |
| Custom hourly anomaly alert | When threshold breached | Investigate within 1 hour |
| GPU instance auto-stop | Automatic | Review stopped instances |
| Weekly cost summary | Every Monday | Trend review with team leads |
| Quarterly commitment review | Quarterly | Adjust RIs/Savings Plans |
The goal is catching anomalies within hours, not days. That $14,000 weekend GPU bill? Six p4d.24xlarge instances burn about $197/hour, so with hourly checks and a $100 threshold you'd catch them within the first hour or two, turning a $14,000 incident into a few hundred dollars.
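The detection-time claim is simple arithmetic, using the incident from the opening story and the $100 subscription threshold:

```python
import math

hourly_burn = 32.77 * 6   # six p4d.24xlarge instances, USD/hour
threshold = 100.0         # alert threshold, USD

checks_to_detect = math.ceil(threshold / hourly_burn)  # hourly checks needed
cost_at_detection = hourly_burn * checks_to_detect

print(checks_to_detect)             # 1
print(round(cost_at_detection, 2))  # 196.62
```

Even allowing a few extra hours for Cost Explorer data lag, the damage stays in the hundreds rather than the thousands.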
Getting Started in 30 Minutes
- Minute 0-5: Enable AWS Cost Anomaly Detection (CLI commands above)
- Minute 5-10: Create SNS topic and Slack integration for alerts
- Minute 10-15: Set up AWS Budgets for total spend and top 3 services
- Minute 15-25: Deploy the custom Lambda anomaly detector
- Minute 25-30: Test by creating a budget alert with a very low threshold
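The final test step can be scripted too. This is a sketch of a $1 test budget using the Budgets API via boto3; the account ID, budget name, and email are placeholders, and the actual API call is left commented out:

```python
# A $1 budget whose ACTUAL notification at 100% fires as soon as any spend
# registers, confirming the alert pipeline end to end.
test_budget = {
    "BudgetName": "anomaly-alert-test",
    "BudgetLimit": {"Amount": "1", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST",
}

notification = {
    "Notification": {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 100.0,
        "ThresholdType": "PERCENTAGE",
    },
    "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "finops@company.com"}],
}

# To create it for real:
# import boto3
# boto3.client("budgets", region_name="us-east-1").create_budget(
#     AccountId="123456789012",  # placeholder
#     Budget=test_budget,
#     NotificationsWithSubscribers=[notification],
# )
print(test_budget["BudgetName"])  # anomaly-alert-test
```

Delete the test budget once the alert arrives, or it will fire again every month.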
Don't wait for the next surprise bill. The detection infrastructure costs less than $10/month to run. A single caught anomaly pays for a lifetime of monitoring.
Related Articles
AWS Lambda Cost Optimization: Memory Tuning, Provisioned Concurrency, and ARM
Cut your AWS Lambda costs by 40-70% with memory right-sizing, ARM/Graviton migration, and smart provisioned concurrency strategies.
The Complete AWS Cost Optimization Playbook: Compute, Storage, Networking, and Reserved Capacity
A data-driven playbook for cutting AWS costs across compute, storage, networking, and reserved capacity with real numbers and actions.
Kubecost Setup for Kubernetes Cost Visibility and Showback
Deploy Kubecost for real-time Kubernetes cost monitoring with namespace-level showback, idle cost detection, and actionable Slack alerts.