DevOpsil

Automated Cloud Cost Anomaly Detection and Alerting

Dev Patel · 10 min read

The $14,000 Friday Night Surprise

A developer at a previous company spun up a batch of p4d.24xlarge GPU instances for an ML experiment on a Friday afternoon. Each instance costs $32.77/hour. They launched 6 of them. Then they went home for the weekend.

By Monday morning, the bill was $14,162. Nobody noticed because there were no cost alerts. The instances ran for 72 hours straight doing nothing after the job completed in the first 3 hours.
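The math is easy to verify. A back-of-the-envelope check using the on-demand rate quoted above (the headline bill also includes a bit of EBS and data transfer):

```python
# Back-of-the-envelope cost of the forgotten-instance incident.
HOURLY_RATE = 32.77   # p4d.24xlarge on-demand, USD/hour
INSTANCES = 6
HOURS = 72            # Friday afternoon to Monday morning

total = HOURLY_RATE * INSTANCES * HOURS
useful = HOURLY_RATE * INSTANCES * 3   # the job only needed 3 hours

print(f"Total compute: ${total:,.2f}")        # → Total compute: $14,156.64
print(f"Wasted spend:  ${total - useful:,.2f}")
```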

This happens more than you'd think:

| Incident Type | How It Happens | Typical Cost Impact |
|---|---|---|
| Forgotten GPU instances | ML experiments left running | $5,000-$50,000 |
| NAT Gateway data explosion | Misconfigured logging or data pipeline | $2,000-$20,000/mo |
| S3 request floods | Retry loops hitting S3 millions of times | $1,000-$10,000 |
| Undeleted EBS snapshots | Snapshot lifecycle policies not set | $500-$5,000/mo (cumulative) |
| RDS storage auto-scaling | Runaway storage growth | $1,000-$8,000/mo |
| DynamoDB on-demand spikes | Traffic spike without capacity planning | $2,000-$30,000 |

Every one of these is catchable with anomaly detection. Let's set it up.

AWS Cost Anomaly Detection (Built-In)

AWS provides a native anomaly detection service in Cost Explorer. It uses machine learning to baseline your spend and alert on deviations. Start here — it's free and takes 5 minutes.

Setting Up via CLI

# Create a cost anomaly monitor for the entire account
aws ce create-anomaly-monitor \
  --anomaly-monitor '{
    "MonitorName": "account-wide-monitor",
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE"
  }'

# Create a monitor for a specific cost allocation tag
aws ce create-anomaly-monitor \
  --anomaly-monitor '{
    "MonitorName": "team-level-monitor",
    "MonitorType": "CUSTOM",
    "MonitorSpecification": {
      "Tags": {
        "Key": "team",
        "Values": ["search", "data-engineering", "platform"],
        "MatchOptions": ["EQUALS"]
      }
    }
  }'

Create Alert Subscriptions

# Alert when anomaly impact exceeds $100
MONITOR_ARN=$(aws ce list-anomaly-monitors \
  --query 'AnomalyMonitors[?MonitorName==`account-wide-monitor`].MonitorArn' \
  --output text)

aws ce create-anomaly-subscription \
  --anomaly-subscription "{
    \"SubscriptionName\": \"cost-anomaly-alerts\",
    \"MonitorArnList\": [\"$MONITOR_ARN\"],
    \"Subscribers\": [
      {
        \"Address\": \"finops@company.com\",
        \"Type\": \"EMAIL\"
      },
      {
        \"Address\": \"arn:aws:sns:us-east-1:123456789012:cost-alerts\",
        \"Type\": \"SNS\"
      }
    ],
    \"Threshold\": 100,
    \"Frequency\": \"IMMEDIATE\"
  }"

Terraform Configuration

resource "aws_ce_anomaly_monitor" "service_monitor" {
  name              = "service-level-anomaly-monitor"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}

resource "aws_ce_anomaly_monitor" "team_monitor" {
  name         = "team-level-anomaly-monitor"
  monitor_type = "CUSTOM"

  monitor_specification = jsonencode({
    Tags = {
      Key          = "team"
      Values       = ["search", "data-engineering", "platform"]
      MatchOptions = ["EQUALS"]
    }
  })
}

resource "aws_ce_anomaly_subscription" "alerts" {
  name = "cost-anomaly-subscription"

  monitor_arn_list = [
    aws_ce_anomaly_monitor.service_monitor.arn,
    aws_ce_anomaly_monitor.team_monitor.arn,
  ]

  subscriber {
    type    = "SNS"
    address = aws_sns_topic.cost_alerts.arn
  }

  threshold_expression {
    dimension {
      key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
      values        = ["100"]
      match_options = ["GREATER_THAN_OR_EQUAL"]
    }
  }

  frequency = "IMMEDIATE"
}

resource "aws_sns_topic" "cost_alerts" {
  name = "cloud-cost-anomaly-alerts"
}

Custom Lambda-Based Anomaly Detection

AWS Cost Anomaly Detection is good for broad monitoring, but it has limitations: 24-48 hour detection lag, no custom thresholds per service, and limited integration options. For faster, more granular detection, build a custom monitor.

Architecture

CloudWatch Metrics ──▶ Lambda (hourly) ──▶ Compare vs baseline ──▶ Slack/PagerDuty
     │                                            │
     └── CUR data via Athena ◀────────────────────┘

The Detection Lambda

# cost_anomaly_detector.py
import boto3
import json
import os
from datetime import datetime, timedelta
from urllib.request import urlopen, Request

ce_client = boto3.client('ce')
SLACK_WEBHOOK = os.environ['SLACK_WEBHOOK_URL']
THRESHOLD_PERCENT = float(os.environ.get('THRESHOLD_PERCENT', '25'))
MIN_DOLLAR_THRESHOLD = float(os.environ.get('MIN_DOLLAR_THRESHOLD', '50'))


def get_daily_cost(start_date, end_date, granularity='DAILY'):
    """Get cost broken down by service, following pagination."""
    costs = {}
    kwargs = {
        'TimePeriod': {
            'Start': start_date.strftime('%Y-%m-%d'),
            'End': end_date.strftime('%Y-%m-%d')
        },
        'Granularity': granularity,
        'Metrics': ['UnblendedCost'],
        'GroupBy': [{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
    }
    while True:
        response = ce_client.get_cost_and_usage(**kwargs)
        for result in response['ResultsByTime']:
            for group in result['Groups']:
                service = group['Keys'][0]
                amount = float(group['Metrics']['UnblendedCost']['Amount'])
                costs[service] = costs.get(service, 0) + amount
        token = response.get('NextPageToken')
        if not token:
            return costs
        kwargs['NextPageToken'] = token


def calculate_baseline(days=14):
    """Calculate average daily cost per service over the baseline window."""
    end = datetime.utcnow().replace(hour=0, minute=0, second=0, microsecond=0)
    start = end - timedelta(days=days)
    total_costs = get_daily_cost(start, end)
    return {svc: cost / days for svc, cost in total_costs.items()}


def detect_anomalies(current_costs, baseline_costs):
    """Compare current costs to baseline and flag anomalies."""
    anomalies = []
    for service, current in current_costs.items():
        baseline = baseline_costs.get(service, 0)
        if baseline == 0 and current > MIN_DOLLAR_THRESHOLD:
            anomalies.append({
                'service': service,
                'current': current,
                'baseline': 0,
                'deviation_pct': 100,
                'excess_spend': current,
                'type': 'NEW_SERVICE'
            })
            continue

        if baseline > 0:
            deviation = ((current - baseline) / baseline) * 100
            excess = current - baseline
            if deviation > THRESHOLD_PERCENT and excess > MIN_DOLLAR_THRESHOLD:
                anomalies.append({
                    'service': service,
                    'current': round(current, 2),
                    'baseline': round(baseline, 2),
                    'deviation_pct': round(deviation, 1),
                    'excess_spend': round(excess, 2),
                    'type': 'SPIKE'
                })
    return sorted(anomalies, key=lambda x: x['excess_spend'], reverse=True)


def send_slack_alert(anomalies, total_current, total_baseline):
    """Send anomaly report to Slack."""
    total_excess = sum(a['excess_spend'] for a in anomalies)
    blocks = [
        {
            "type": "header",
            "text": {
                "type": "plain_text",
                "text": f"Cost Anomaly Alert — ${total_excess:,.2f} above baseline"
            }
        },
        {
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": (
                    f"*Today's spend:* ${total_current:,.2f} | "
                    f"*14-day avg:* ${total_baseline:,.2f} | "
                    f"*Deviation:* {((total_current-total_baseline)/max(total_baseline, 0.01))*100:.1f}%"  # guard div-by-zero
                )
            }
        },
        {"type": "divider"}
    ]

    for anomaly in anomalies[:5]:  # Top 5 anomalies
        emoji = "🆕" if anomaly['type'] == 'NEW_SERVICE' else "📈"
        blocks.append({
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": (
                    f"{emoji} *{anomaly['service']}*\n"
                    f"Today: ${anomaly['current']:,.2f} | "
                    f"Baseline: ${anomaly['baseline']:,.2f} | "
                    f"Excess: *${anomaly['excess_spend']:,.2f}* "
                    f"(+{anomaly['deviation_pct']}%)"
                )
            }
        })

    payload = json.dumps({"blocks": blocks}).encode('utf-8')
    req = Request(SLACK_WEBHOOK, data=payload,
                  headers={'Content-Type': 'application/json'})
    urlopen(req)


def handler(event, context):
    """Main Lambda handler — runs hourly."""
    today = datetime.utcnow().replace(hour=0, minute=0, second=0, microsecond=0)
    tomorrow = today + timedelta(days=1)

    current_costs = get_daily_cost(today, tomorrow)
    baseline_costs = calculate_baseline(days=14)

    anomalies = detect_anomalies(current_costs, baseline_costs)

    if anomalies:
        total_current = sum(current_costs.values())
        total_baseline = sum(baseline_costs.values())
        send_slack_alert(anomalies, total_current, total_baseline)

    return {
        'anomalies_detected': len(anomalies),
        'total_excess_spend': sum(a['excess_spend'] for a in anomalies),
        'details': anomalies
    }
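Before wiring this into AWS, it's worth sanity-checking the thresholding logic with synthetic numbers. The sketch below restates the comparison from detect_anomalies as a standalone function (is_spike is my name, not part of the Lambda) so it runs with no boto3 and no credentials:

```python
# Standalone restatement of the anomaly test in detect_anomalies():
# flag a service when it deviates more than THRESHOLD_PERCENT above
# baseline AND the excess dollars exceed MIN_DOLLAR_THRESHOLD.
THRESHOLD_PERCENT = 25.0
MIN_DOLLAR_THRESHOLD = 50.0

def is_spike(current: float, baseline: float) -> bool:
    if baseline <= 0:
        return current > MIN_DOLLAR_THRESHOLD  # NEW_SERVICE case
    deviation = (current - baseline) / baseline * 100
    return deviation > THRESHOLD_PERCENT and (current - baseline) > MIN_DOLLAR_THRESHOLD

assert is_spike(200.0, 100.0)        # +100%, $100 excess: alert
assert not is_spike(120.0, 100.0)    # +20% is under the 25% threshold
assert not is_spike(30.0, 10.0)      # +200%, but only $20 excess
assert is_spike(60.0, 0.0)           # brand-new service over $50
```

Both conditions matter: the percentage gate keeps small services from alerting on noise, and the dollar gate keeps large percentage swings on tiny line items quiet.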

Deploy with Terraform

resource "aws_lambda_function" "cost_anomaly_detector" {
  function_name = "cost-anomaly-detector"
  runtime       = "python3.12"
  handler       = "cost_anomaly_detector.handler"
  architectures = ["arm64"]
  memory_size   = 256
  timeout       = 120
  role          = aws_iam_role.anomaly_detector.arn

  filename         = data.archive_file.anomaly_detector.output_path
  source_code_hash = data.archive_file.anomaly_detector.output_base64sha256

  environment {
    variables = {
      SLACK_WEBHOOK_URL    = var.slack_webhook_url
      THRESHOLD_PERCENT    = "25"
      MIN_DOLLAR_THRESHOLD = "50"
    }
  }
}

resource "aws_cloudwatch_event_rule" "hourly_check" {
  name                = "cost-anomaly-hourly-check"
  schedule_expression = "rate(1 hour)"
}

resource "aws_cloudwatch_event_target" "anomaly_detector" {
  rule = aws_cloudwatch_event_rule.hourly_check.name
  arn  = aws_lambda_function.cost_anomaly_detector.arn
}

resource "aws_lambda_permission" "allow_eventbridge" {
  statement_id  = "AllowEventBridge"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.cost_anomaly_detector.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.hourly_check.arn
}

resource "aws_iam_role" "anomaly_detector" {
  name = "cost-anomaly-detector-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy" "anomaly_detector" {
  name = "cost-anomaly-detector-policy"
  role = aws_iam_role.anomaly_detector.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "ce:GetCostAndUsage",
          "ce:GetAnomalies"
        ]
        Resource = "*"
      },
      {
        Effect = "Allow"
        Action = [
          "logs:CreateLogGroup",
          "logs:CreateLogStream",
          "logs:PutLogEvents"
        ]
        Resource = "arn:aws:logs:*:*:*"
      }
    ]
  })
}

Setting Up Budget Alerts as a Safety Net

Anomaly detection catches spikes. Budget alerts catch gradual creep. Use both.

resource "aws_budgets_budget" "monthly_total" {
  name         = "monthly-total-budget"
  budget_type  = "COST"
  limit_amount = "50000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["finops@company.com"]
    subscriber_sns_topic_arns  = [aws_sns_topic.cost_alerts.arn]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = ["finops@company.com", "engineering-lead@company.com"]
    subscriber_sns_topic_arns  = [aws_sns_topic.cost_alerts.arn]
  }
}

# Per-service budgets for the usual suspects
resource "aws_budgets_budget" "ec2_budget" {
  name         = "ec2-monthly-budget"
  budget_type  = "COST"
  limit_amount = "25000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  cost_filter {
    name   = "Service"
    values = ["Amazon Elastic Compute Cloud - Compute"]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 90
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_sns_topic_arns  = [aws_sns_topic.cost_alerts.arn]
  }
}

Multi-Layer Alerting Strategy

No single alert catches everything. Layer your defenses:

| Layer | Tool | Detection Speed | Coverage | Cost |
|---|---|---|---|---|
| 1 — Budget alerts | AWS Budgets | Same day | Total spend thresholds | Free |
| 2 — AWS Anomaly Detection | Cost Anomaly Detection | 24-48 hours | ML-based per-service | Free |
| 3 — Custom Lambda monitor | Lambda + CUR | 1 hour | Custom rules, per-tag | ~$5/mo |
| 4 — Real-time CloudWatch | CloudWatch billing metrics | 4-6 hours | Account-level totals | Free |
| 5 — Weekly report | Lambda + S3 + SES | Weekly | Trend analysis | ~$2/mo |

CloudWatch Billing Alarm (Layer 4)

# Note: AWS/Billing metrics are published only in us-east-1 and require
# "Receive Billing Alerts" to be enabled in the account's billing preferences.
resource "aws_cloudwatch_metric_alarm" "billing_alarm" {
  alarm_name          = "month-to-date-billing-alarm"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "EstimatedCharges"
  namespace           = "AWS/Billing"
  period              = 21600  # the metric updates roughly every 6 hours
  statistic           = "Maximum"
  threshold           = 2000   # EstimatedCharges is cumulative month-to-date, not daily
  alarm_actions       = [aws_sns_topic.cost_alerts.arn]

  dimensions = {
    Currency = "USD"
  }
}
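Because EstimatedCharges is cumulative for the month, a fixed dollar threshold doesn't map cleanly to a daily pace: it trips at a different point in the month depending on total burn rate. One option is to alarm when spend runs ahead of a linear burn of the monthly budget. A minimal sketch (the $50,000 figure matches the monthly budget below; expected_mtd is my name):

```python
# Expected month-to-date spend for a linear burn of the monthly budget.
# If EstimatedCharges runs well ahead of this pace, something is off.
import calendar
from datetime import date

def expected_mtd(monthly_budget: float, today: date) -> float:
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return monthly_budget * today.day / days_in_month

# e.g. halfway through June on a $50,000/month budget:
print(round(expected_mtd(50_000, date(2025, 6, 15)), 2))  # → 25000.0
```

You could recompute this daily (e.g. in the hourly Lambda) and feed it into the alarm threshold, rather than hard-coding a single number for the whole month.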

Automatic Remediation

For known patterns, go beyond alerting — auto-remediate.

# auto_remediation.py
from datetime import datetime

import boto3

ec2 = boto3.client('ec2')

# Note: get_instance_cost() and notify_slack(), used below, are assumed
# to be defined elsewhere in this module.

def stop_untagged_gpu_instances(event, context):
    """Find and stop GPU instances without a 'keep-alive' tag."""
    # Instance families with GPU/accelerator hardware
    gpu_types = ['p4d', 'p3', 'p5', 'g5', 'g4dn', 'g6', 'trn1', 'inf2']

    response = ec2.describe_instances(
        Filters=[
            {'Name': 'instance-state-name', 'Values': ['running']},
        ]
    )

    stopped = []
    for reservation in response['Reservations']:
        for instance in reservation['Instances']:
            instance_type = instance['InstanceType']
            family = instance_type.split('.')[0]

            if family not in gpu_types:
                continue

            tags = {t['Key']: t['Value'] for t in instance.get('Tags', [])}

            # Skip instances with keep-alive tag
            if tags.get('keep-alive') == 'true':
                continue

            # Stop instances running more than 4 hours without keep-alive
            launch_time = instance['LaunchTime']
            hours_running = (
                datetime.utcnow().replace(tzinfo=launch_time.tzinfo) - launch_time
            ).total_seconds() / 3600

            if hours_running > 4:
                ec2.stop_instances(InstanceIds=[instance['InstanceId']])
                stopped.append({
                    'instance_id': instance['InstanceId'],
                    'type': instance_type,
                    'hours_running': round(hours_running, 1),
                    'hourly_cost': get_instance_cost(instance_type),
                    'saved': round(get_instance_cost(instance_type) * hours_running, 2)
                })

    if stopped:
        notify_slack(stopped)

    return {'stopped_instances': len(stopped), 'details': stopped}
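The stop/keep decision in that loop is easy to get subtly wrong (tag casing, family parsing), so it helps to restate it as a pure function you can test without touching EC2. A sketch (should_stop is my name for the extracted predicate):

```python
# Pure-function restatement of the decision in stop_untagged_gpu_instances():
# stop only GPU-family instances lacking keep-alive=true that have been
# running for more than 4 hours.
GPU_FAMILIES = {'p4d', 'p3', 'p5', 'g5', 'g4dn', 'g6', 'trn1', 'inf2'}

def should_stop(instance_type: str, tags: dict, hours_running: float) -> bool:
    family = instance_type.split('.')[0]   # 'p4d.24xlarge' -> 'p4d'
    if family not in GPU_FAMILIES:
        return False
    if tags.get('keep-alive') == 'true':
        return False
    return hours_running > 4

assert should_stop('p4d.24xlarge', {}, 72)
assert not should_stop('p4d.24xlarge', {'keep-alive': 'true'}, 72)
assert not should_stop('m5.large', {}, 72)              # not a GPU family
assert not should_stop('g5.xlarge', {'team': 'ml'}, 2)  # under 4 hours
```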

What Good Alerting Looks Like

After implementing the full stack, here's what the alert cadence should look like:

| Alert Type | Frequency | Action Required |
|---|---|---|
| Budget at 80% of monthly target | Monthly | Review and forecast |
| Service cost anomaly (AWS native) | As detected | Investigate root cause |
| Custom hourly anomaly alert | When threshold breached | Investigate within 1 hour |
| GPU instance auto-stop | Automatic | Review stopped instances |
| Weekly cost summary | Every Monday | Trend review with team leads |
| Quarterly commitment review | Quarterly | Adjust RIs/Savings Plans |

The goal is catching anomalies within hours, not days. That $14,000 weekend GPU bill? With hourly checks and a $100 threshold, you'd catch it within a few hours of the job finishing, turning a $14,000 incident into a few-hundred-dollar blip.

Getting Started in 30 Minutes

  1. Minute 0-5: Enable AWS Cost Anomaly Detection (CLI commands above)
  2. Minute 5-10: Create SNS topic and Slack integration for alerts
  3. Minute 10-15: Set up AWS Budgets for total spend and top 3 services
  4. Minute 15-25: Deploy the custom Lambda anomaly detector
  5. Minute 25-30: Test by creating a budget alert with a very low threshold

Don't wait for the next surprise bill. The detection infrastructure costs less than $10/month to run. A single caught anomaly pays for a lifetime of monitoring.

Dev Patel
Cloud Cost Optimization Specialist

I find the money your cloud is wasting. FinOps practitioner, data-driven analyst, and the person your CFO wishes they'd hired sooner. Every dollar saved is a dollar earned.