Part 3 of 5 in Cloud Cost Cutting

AWS EC2 Right-Sizing: Stop Overpaying for Compute

Dev Patel · 8 min read

Let Me Show You What This Actually Costs

The average company wastes 35% of its EC2 spend on oversized instances. Let me put that in dollars.

| Monthly EC2 Spend | Typical Waste (35%) | Annual Waste |
|---|---|---|
| $5,000 | $1,750 | $21,000 |
| $20,000 | $7,000 | $84,000 |
| $100,000 | $35,000 | $420,000 |

That's money you're burning every month because someone chose m5.2xlarge when m5.large would've been fine. Let's fix that.
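If you want to run the same arithmetic against your own bill, the table above reduces to a few lines. A minimal sketch; the 35% waste rate is the average quoted above, not a measurement of your account:

```python
# Waste calculator matching the table above.
WASTE_RATE = 0.35  # industry-average waste rate cited in the article

def annual_waste(monthly_spend: float) -> tuple[float, float]:
    """Return (monthly waste, annual waste) for a given EC2 bill."""
    monthly = monthly_spend * WASTE_RATE
    return monthly, monthly * 12

for spend in (5_000, 20_000, 100_000):
    m, a = annual_waste(spend)
    print(f"${spend:>7,}/mo -> ${m:>8,.0f} wasted/mo, ${a:>10,.0f}/yr")
```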

Step 1: Find the Waste

AWS Cost Explorer Right-Sizing Recommendations

The easiest starting point. AWS already knows which instances are oversized.

aws ce get-rightsizing-recommendation \
  --service "AmazonEC2" \
  --configuration '{
    "RecommendationTarget": "SAME_INSTANCE_FAMILY",
    "BenefitsConsidered": true
  }'

This returns recommendations like: "Your m5.2xlarge averages 12% CPU utilization. Downsize to m5.large and save $156/month."

CloudWatch Metrics Deep Dive

Don't trust recommendations blindly. Check the actual utilization:

aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123def456 \
  --start-time 2026-03-01T00:00:00Z \
  --end-time 2026-03-20T00:00:00Z \
  --period 3600 \
  --statistics Average Maximum

# p99 needs a separate call: get-metric-statistics accepts either
# --statistics or --extended-statistics, not both
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123def456 \
  --start-time 2026-03-01T00:00:00Z --end-time 2026-03-20T00:00:00Z \
  --period 3600 --extended-statistics p99

Key metrics to check:

  • CPU Average < 20% → Almost certainly oversized
  • CPU p99 < 60% → Safe to downsize
  • Memory < 40% (requires CloudWatch Agent) → Consider smaller instance
  • Network < 30% of baseline → Smaller instance handles the traffic

Step 2: Build Your Right-Sizing Plan

Here's the decision framework I use:

| Current Utilization | Action | Expected Savings |
|---|---|---|
| CPU avg < 10% | Downsize 2 levels (e.g., 2xlarge → large) | 60-75% |
| CPU avg 10-25% | Downsize 1 level | 40-50% |
| CPU avg 25-50% | Consider ARM (Graviton) | 20-30% |
| CPU avg 50-70% | Right-sized, look at Savings Plans | 10-20% |
| CPU avg > 70% | Monitor for headroom issues | 0% |
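The framework is easy to encode, which helps when you script it over a fleet. A minimal sketch (the function name is mine; thresholds come straight from the table):

```python
def rightsizing_action(cpu_avg: float) -> tuple[str, str]:
    """Map average CPU utilization (%) to the framework's action
    and expected savings band."""
    if cpu_avg < 10:
        return "Downsize 2 levels", "60-75%"
    if cpu_avg < 25:
        return "Downsize 1 level", "40-50%"
    if cpu_avg < 50:
        return "Consider ARM (Graviton)", "20-30%"
    if cpu_avg <= 70:
        return "Right-sized, look at Savings Plans", "10-20%"
    return "Monitor for headroom issues", "0%"

print(rightsizing_action(12))  # -> ('Downsize 1 level', '40-50%')
```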

The Graviton Play

This one change saved me $14,000/month at my last job.

# Before: x86 instance
resource "aws_instance" "app" {
  instance_type = "m5.xlarge"   # $0.192/hr = $140/month
  ami           = "ami-x86-app"
}

# After: ARM Graviton instance
resource "aws_instance" "app" {
  instance_type = "m7g.xlarge"  # $0.1632/hr = $119/month
  ami           = "ami-arm-app" # ARM-compatible AMI required
}

Savings: ~15% per instance. Graviton instances also deliver 20-30% better performance per dollar. It's not just cheaper — it's faster AND cheaper.
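The savings math is plain hourly-rate arithmetic, using AWS's usual 730-hours-per-month convention:

```python
HOURS_PER_MONTH = 730  # AWS's standard monthly-hours assumption

def monthly_cost(hourly_rate: float) -> float:
    return hourly_rate * HOURS_PER_MONTH

m5 = monthly_cost(0.192)     # m5.xlarge on-demand rate from above
m7g = monthly_cost(0.1632)   # m7g.xlarge on-demand rate from above
savings_pct = (m5 - m7g) / m5 * 100
print(f"m5.xlarge ${m5:.0f}/mo vs m7g.xlarge ${m7g:.0f}/mo: {savings_pct:.0f}% cheaper")
```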

Step 3: Implement Safely

Never right-size in production without a safety net.

Terraform Module for Gradual Right-Sizing

variable "instance_type" {
  description = "EC2 instance type — change this for right-sizing"
  type        = string
  default     = "m5.xlarge"
}

variable "min_healthy_percentage" {
  description = "Minimum healthy instances during resize"
  type        = number
  default     = 90
}

resource "aws_autoscaling_group" "app" {
  name                = "app-asg"
  min_size            = 2
  max_size            = 6
  desired_capacity    = 3

  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = var.min_healthy_percentage
    }
  }

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }
}

resource "aws_launch_template" "app" {
  instance_type = var.instance_type
  # ... other config
}

Change instance_type, run terraform apply, and the ASG rolls instances one at a time while maintaining 90% capacity.

Step 4: Enable the CloudWatch Agent for Memory Metrics

CPU is only half the story. AWS doesn't expose memory utilization by default. You need the CloudWatch Agent.

{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "namespace": "CWAgent",
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}",
      "InstanceType": "${aws:InstanceType}",
      "AutoScalingGroupName": "${aws:AutoScalingGroupName}"
    },
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent", "mem_available_percent"],
        "metrics_collection_interval": 60
      },
      "disk": {
        "measurement": ["disk_used_percent"],
        "metrics_collection_interval": 300,
        "resources": ["*"]
      },
      "net": {
        "measurement": ["bytes_sent", "bytes_recv"],
        "metrics_collection_interval": 60
      }
    }
  }
}

Deploy the agent via SSM for your fleet:

aws ssm send-command \
  --document-name "AWS-ConfigureAWSPackage" \
  --targets '[{"Key":"tag:Environment","Values":["production"]}]' \
  --parameters '{"action":["Install"],"name":["AmazonCloudWatchAgent"]}'

Once memory data flows in, query it alongside CPU:

aws cloudwatch get-metric-statistics \
  --namespace CWAgent \
  --metric-name mem_used_percent \
  --dimensions Name=InstanceId,Value=i-0abc123def456 \
  --start-time 2026-03-01T00:00:00Z \
  --end-time 2026-03-20T00:00:00Z \
  --period 3600 \
  --statistics Average Maximum

Instances running at 15% CPU and 20% memory are wasting 70-80% of their capacity. Without memory data, you're guessing.
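One rough way to quantify that waste: treat the busier of CPU and memory as the binding resource and count everything above it as unused. This is a simplification (it ignores disk and network), but it is a useful first pass:

```python
def capacity_waste(cpu_pct: float, mem_pct: float) -> float:
    """Estimate unused capacity (%) assuming the busier of CPU and
    memory is the binding resource. Simplified: ignores disk/network."""
    used = max(cpu_pct, mem_pct)
    return 100 - used

# The example from the text: 15% CPU, 20% memory
print(capacity_waste(15, 20))  # -> 80
```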

Step 5: Right-Size by Workload Type

Different workloads have different right-sizing strategies. Don't apply the same rule everywhere.

Compute-Bound (CI Runners, Batch Jobs)

These spike to 100% CPU during builds and sit idle otherwise. Look at the p99 CPU over a week, not the average.

# Get p99 CPU for a build server over 7 days
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123def456 \
  --start-time $(date -d '7 days ago' -u +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 86400 \
  --extended-statistics p99   # percentiles aren't valid in --statistics

If p99 is below 70%, downsize. If it's above 90%, the instance is properly sized — leave it alone.

Memory-Bound (Caches, JVM Apps)

Java apps preallocate heap. The memory usage graph looks flat. Use the r family (memory-optimized) instead of m (general purpose). Moving from m5.xlarge ($0.192/hr) to r5.large ($0.126/hr) gives you the same 16 GiB RAM at 34% less cost.

# Memory-optimized right-sizing
resource "aws_instance" "cache" {
  # Before: general purpose with 16 GiB
  # instance_type = "m5.xlarge"   # 4 vCPU, 16 GiB, $0.192/hr

  # After: memory-optimized with 16 GiB
  instance_type = "r7g.large"   # 2 vCPU, 16 GiB, $0.1008/hr (Graviton)
}
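The per-GiB arithmetic behind that swap, as a quick sketch:

```python
def dollars_per_gib_hour(hourly_rate: float, ram_gib: int) -> float:
    """Cost per GiB of RAM per hour for a memory-bound workload."""
    return hourly_rate / ram_gib

m5x = dollars_per_gib_hour(0.192, 16)   # m5.xlarge: 4 vCPU, 16 GiB
r5l = dollars_per_gib_hour(0.126, 16)   # r5.large: 2 vCPU, 16 GiB
cheaper_pct = (1 - r5l / m5x) * 100
print(f"m5.xlarge ${m5x:.4f}/GiB-hr vs r5.large ${r5l:.4f}/GiB-hr "
      f"({cheaper_pct:.0f}% cheaper per GiB)")
```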

Network-Bound (API Gateways, Proxies)

Check network bandwidth utilization. Each instance type has a baseline network performance. An m5.large provides "Up to 10 Gbps" but the sustained baseline is much lower.

aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name NetworkIn \
  --dimensions Name=InstanceId,Value=i-0abc123def456 \
  --start-time $(date -d '7 days ago' -u +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 3600 \
  --statistics Maximum

If the maximum network throughput is under 30% of the instance's baseline, a smaller instance handles it fine.
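To compare against the baseline, convert the CloudWatch datapoint (bytes during the sample interval) to Gbps first. The 0.75 Gbps baseline below is an assumed figure for illustration; look up the actual baseline for your instance type:

```python
def network_utilization_pct(max_bytes: float,
                            interval_seconds: int,
                            baseline_gbps: float) -> float:
    """Convert a CloudWatch NetworkIn Maximum datapoint (bytes during
    the sample interval; 300s for standard 5-minute EC2 metrics)
    into a percentage of the instance's baseline bandwidth."""
    gbps = max_bytes * 8 / interval_seconds / 1e9
    return gbps / baseline_gbps * 100

# 7.5 GB received in the busiest 5-minute window, assumed 0.75 Gbps baseline
print(f"{network_utilization_pct(7.5e9, 300, 0.75):.0f}% of baseline")
```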

Step 6: Automate Ongoing Right-Sizing

One-time right-sizing is good. Continuous right-sizing is better.

#!/bin/bash
# Monthly right-sizing report script
# Save as right-sizing-report.sh and run via cron

REPORT_DATE=$(date +%Y-%m-%d)
OUTPUT_FILE="right-sizing-report-${REPORT_DATE}.csv"

echo "Instance ID,Current Type,Recommended Type,Monthly Savings" > "$OUTPUT_FILE"

# RightsizingType is Modify or Terminate; Modify covers downsizing
aws ce get-rightsizing-recommendation \
  --service "AmazonEC2" \
  --configuration '{"RecommendationTarget":"SAME_INSTANCE_FAMILY","BenefitsConsidered":true}' \
  --query 'RightsizingRecommendations[?RightsizingType==`Modify`].[CurrentInstance.ResourceId,CurrentInstance.ResourceDetails.EC2ResourceDetails.InstanceType,ModifyRecommendationDetail.TargetInstances[0].ResourceDetails.EC2ResourceDetails.InstanceType,ModifyRecommendationDetail.TargetInstances[0].EstimatedMonthlySavings]' \
  --output text | tr '\t' ',' >> "$OUTPUT_FILE"

# Send a summary to Slack
RECOMMENDATION_COUNT=$(aws ce get-rightsizing-recommendation \
  --service "AmazonEC2" \
  --configuration '{"RecommendationTarget":"SAME_INSTANCE_FAMILY","BenefitsConsidered":true}' \
  --query 'Summary.TotalRecommendationCount' \
  --output text)

curl -X POST "$SLACK_WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d "{\"text\":\"Monthly Right-Sizing Report: ${RECOMMENDATION_COUNT} instances have right-sizing recommendations. Check #cloud-cost for details.\"}"

Schedule it with cron or a Lambda function:

# Crontab entry — first Monday of every month at 9 AM.
# Note: cron ORs day-of-month and day-of-week, so "0 9 1-7 * 1" would fire
# on days 1-7 AND every Monday. Restrict to Monday inside the command:
0 9 1-7 * * [ "$(date +\%u)" = 1 ] && /opt/scripts/right-sizing-report.sh

Step 7: Set Up Automated Tagging for Cost Attribution

Right-sizing without cost attribution is flying blind. You need to know which team owns which instances.

# Enforce tagging with AWS Organizations SCP
data "aws_iam_policy_document" "require_tags" {
  statement {
    sid    = "DenyEC2WithoutTags"
    effect = "Deny"
    actions = [
      "ec2:RunInstances"
    ]
    resources = ["arn:aws:ec2:*:*:instance/*"]
    condition {
      test     = "Null"
      variable = "aws:RequestTag/Team"
      values   = ["true"]
    }
    condition {
      test     = "Null"
      variable = "aws:RequestTag/Environment"
      values   = ["true"]
    }
  }
}

No tag, no instance. Teams can't spin up resources without ownership attribution.

Common Pitfalls

Pitfall 1: Right-sizing production without a canary. Never downsize your entire fleet at once. Start with one instance in the ASG. Monitor for 48 hours. Check response times, error rates, and queue depth. Then roll to the rest.

Pitfall 2: Ignoring burst workloads. A batch job that runs for 2 hours at 95% CPU and idles for 22 hours shows 8% average CPU. The average lies. Check the maximum and p99 before downsizing.
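The arithmetic behind that misleading average, as a sketch:

```python
def daily_average_cpu(busy_hours: float, busy_pct: float,
                      idle_pct: float = 0.0) -> float:
    """Average CPU over 24h for a bursty job: busy_hours at busy_pct,
    the remaining hours at idle_pct."""
    return (busy_hours * busy_pct + (24 - busy_hours) * idle_pct) / 24

# The batch job from the text: 2 hours at 95% CPU, idle otherwise
avg = daily_average_cpu(2, 95)
print(f"average: {avg:.0f}%, maximum: 95%")  # average: 8%, maximum: 95%
```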

Pitfall 3: Forgetting about Savings Plans after right-sizing. If you right-size FROM an instance covered by a Reserved Instance or Savings Plan, you might not save anything until the commitment expires. Check your RI/SP coverage before making changes.

Pitfall 4: Not accounting for headroom. Target 60-70% peak utilization after right-sizing, not 90%. Auto Scaling needs time to react, and your application needs room for traffic spikes. A perfectly right-sized instance with zero headroom is one spike away from degraded performance.
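A quick way to size with headroom built in: compute the vCPUs of actual work at peak, then divide by your target utilization. The 65% default below is just the midpoint of the 60-70% band:

```python
import math

def vcpus_needed(current_vcpus: int, peak_util_pct: float,
                 target_util_pct: float = 65.0) -> int:
    """Smallest vCPU count that keeps the observed peak at or below
    the target utilization (65% = midpoint of the 60-70% band)."""
    peak_work = current_vcpus * peak_util_pct / 100  # vCPUs busy at peak
    return math.ceil(peak_work / (target_util_pct / 100))

# 8 vCPUs peaking at 30% utilization: 2.4 vCPUs of work at peak
print(vcpus_needed(8, 30))  # -> 4
```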

Cost Impact Summary

For a typical 20-instance fleet averaging 25% CPU utilization:

| Action | Per-Instance Savings | Fleet Savings/Month | Annual |
|---|---|---|---|
| Downsize 1 level | $50-$80 | $1,000-$1,600 | $12,000-$19,200 |
| Switch to Graviton | $20-$40 | $400-$800 | $4,800-$9,600 |
| Combined | $70-$120 | $1,400-$2,400 | $16,800-$28,800 |

That's $17K-$29K/year in savings from a single afternoon's work. The ROI on right-sizing is the highest of any cloud optimization activity.

Tools Worth Knowing

Beyond AWS native tools, these help with right-sizing at scale:

  • AWS Compute Optimizer — ML-based recommendations considering CPU, memory, disk, and network. More accurate than Cost Explorer for complex workloads.
  • Spot.io (now Spot by NetApp) — Automatic instance selection and right-sizing with spot instance management.
  • Kubecost — For Kubernetes workloads, shows per-pod resource waste and recommends request/limit changes.

# Enable Compute Optimizer (one-time setup)
aws compute-optimizer update-enrollment-status \
  --status Active \
  --include-member-accounts

# Get recommendations
aws compute-optimizer get-ec2-instance-recommendations \
  --query 'instanceRecommendations[].{InstanceArn:instanceArn,Current:currentInstanceType,Recommended:recommendationOptions[0].instanceType,ProjectedUtilization:recommendationOptions[0].projectedUtilizationMetrics}' \
  --output table

Compute Optimizer uses 14 days of CloudWatch data by default. For more accurate results, enable enhanced infrastructure metrics (3-month lookback) for $0.0003 per resource per hour.
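At that rate, enhanced infrastructure metrics cost very little relative to the savings they inform. Back-of-envelope, again using 730 hours per month:

```python
HOURS_PER_MONTH = 730

def enhanced_metrics_monthly_cost(resources: int,
                                  rate_per_resource_hour: float = 0.0003) -> float:
    """Monthly cost of Compute Optimizer enhanced infrastructure
    metrics at the per-resource-hour rate quoted above."""
    return resources * rate_per_resource_hour * HOURS_PER_MONTH

# The 20-instance fleet from the cost summary
print(f"${enhanced_metrics_monthly_cost(20):.2f}/month")
```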
