AWS EC2 Right-Sizing: Stop Overpaying for Compute
Let Me Show You What This Actually Costs
The average company wastes 35% of their EC2 spend on oversized instances. Let me put that in dollars.
| Monthly EC2 Spend | Typical Waste (35%) | Annual Waste |
|---|---|---|
| $5,000 | $1,750 | $21,000 |
| $20,000 | $7,000 | $84,000 |
| $100,000 | $35,000 | $420,000 |
That's money you're burning every month because someone chose m5.2xlarge when m5.large would've been fine. Let's fix that.
Step 1: Find the Waste
AWS Cost Explorer Right-Sizing Recommendations
The easiest starting point. AWS already knows which instances are oversized.
aws ce get-rightsizing-recommendation \
--service "AmazonEC2" \
--configuration '{
"RecommendationTarget": "SAME_INSTANCE_FAMILY",
"BenefitsConsidered": true
}'
This returns recommendations like: "Your m5.2xlarge averages 12% CPU utilization. Downsize to m5.large and save $156/month."
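If you save the JSON response to a file, a short script can turn it into a table. This is a sketch against the shape of the Cost Explorer GetRightsizingRecommendation response; the `summarize` function name and file path are illustrative.

```python
import json

def summarize(path):
    """Tabulate Modify (downsize) recommendations from a saved
    get-rightsizing-recommendation response."""
    with open(path) as f:
        recs = json.load(f).get("RightsizingRecommendations", [])
    rows = []
    for r in recs:
        if r.get("RightsizingType") != "Modify":
            continue
        cur = r["CurrentInstance"]["ResourceDetails"]["EC2ResourceDetails"]["InstanceType"]
        tgt = r["ModifyRecommendationDetail"]["TargetInstances"][0]
        rows.append((
            cur,
            tgt["ResourceDetails"]["EC2ResourceDetails"]["InstanceType"],
            float(tgt["EstimatedMonthlySavings"]),
        ))
    return rows
```

Feed it the output of the CLI call above redirected to a file, then sort by the savings column to prioritize.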
CloudWatch Metrics Deep Dive
Don't trust recommendations blindly. Check the actual utilization:
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123def456 \
  --start-time 2026-03-01T00:00:00Z \
  --end-time 2026-03-20T00:00:00Z \
  --period 3600 \
  --statistics Average Maximum
# Percentiles use a separate flag and can't be combined with --statistics.
# Re-run the same call with: --extended-statistics p99
Key metrics to check:
- CPU Average < 20% → Almost certainly oversized
- CPU p99 < 60% → Safe to downsize
- Memory < 40% (requires CloudWatch Agent) → Consider smaller instance
- Network < 30% of baseline → Smaller instance handles the traffic
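The checklist above is easy to encode once you're pulling these numbers for a whole fleet. A minimal sketch, with thresholds mirroring the rules in this section; the metric values would come from the CloudWatch calls shown, and the example numbers are illustrative.

```python
def rightsizing_flags(cpu_avg, cpu_p99, mem_pct=None, net_pct_of_baseline=None):
    """Return the checklist findings for one instance's utilization metrics."""
    flags = []
    if cpu_avg < 20:
        flags.append("CPU avg < 20%: almost certainly oversized")
    if cpu_p99 < 60:
        flags.append("CPU p99 < 60%: safe to downsize")
    if mem_pct is not None and mem_pct < 40:
        flags.append("Memory < 40%: consider smaller instance")
    if net_pct_of_baseline is not None and net_pct_of_baseline < 30:
        flags.append("Network < 30% of baseline: smaller instance handles it")
    return flags

# An m5.2xlarge at 12% avg CPU, 45% p99, 30% memory trips three flags
print(rightsizing_flags(cpu_avg=12, cpu_p99=45, mem_pct=30))
```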
Step 2: Build Your Right-Sizing Plan
Here's the decision framework I use:
| Current Utilization | Action | Expected Savings |
|---|---|---|
| CPU avg < 10% | Downsize 2 levels (e.g., 2xlarge → large) | 60-75% |
| CPU avg 10-25% | Downsize 1 level | 40-50% |
| CPU avg 25-50% | Consider ARM (Graviton) | 20-30% |
| CPU avg 50-70% | Right-sized, look at Savings Plans | 10-20% |
| CPU avg > 70% | Monitor for headroom issues | 0% |
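The decision table translates directly into a lookup you can run over Cost Explorer output. A sketch of the framework above as a function; the band boundaries are exactly the table's rows.

```python
def rightsizing_action(cpu_avg):
    """Map average CPU utilization to the framework's recommended action."""
    if cpu_avg < 10:
        return "downsize 2 levels"          # e.g. 2xlarge -> large
    if cpu_avg < 25:
        return "downsize 1 level"
    if cpu_avg < 50:
        return "consider Graviton"
    if cpu_avg <= 70:
        return "right-sized; look at Savings Plans"
    return "monitor for headroom issues"
```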
The Graviton Play
This one change saved me $14,000/month at my last job.
# Before: x86 instance
resource "aws_instance" "app" {
instance_type = "m5.xlarge" # $0.192/hr = $140/month
ami = "ami-x86-app"
}
# After: ARM Graviton instance
resource "aws_instance" "app" {
instance_type = "m7g.xlarge" # $0.1632/hr = $119/month
ami = "ami-arm-app" # ARM-compatible AMI required
}
Savings: ~15% per instance. Graviton instances also deliver 20-30% better performance per dollar. It's not just cheaper — it's faster AND cheaper.
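The arithmetic behind those monthly figures, using the hourly rates quoted above and the common 730-hours-per-month convention:

```python
HOURS_PER_MONTH = 730
x86_hr = 0.192        # m5.xlarge on-demand
graviton_hr = 0.1632  # m7g.xlarge on-demand

x86_month = x86_hr * HOURS_PER_MONTH            # about $140/month
graviton_month = graviton_hr * HOURS_PER_MONTH  # about $119/month
savings_pct = (1 - graviton_hr / x86_hr) * 100  # 15%
```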
Step 3: Implement Safely
Never right-size in production without a safety net.
Terraform Module for Gradual Right-Sizing
variable "instance_type" {
description = "EC2 instance type — change this for right-sizing"
type = string
default = "m5.xlarge"
}
variable "min_healthy_percentage" {
description = "Minimum healthy instances during resize"
type = number
default = 90
}
resource "aws_autoscaling_group" "app" {
name = "app-asg"
min_size = 2
max_size = 6
desired_capacity = 3
instance_refresh {
strategy = "Rolling"
preferences {
min_healthy_percentage = var.min_healthy_percentage
}
}
launch_template {
id = aws_launch_template.app.id
version = "$Latest"
}
}
resource "aws_launch_template" "app" {
instance_type = var.instance_type
# ... other config
}
Change instance_type, run terraform apply, and the ASG rolls instances one at a time while maintaining 90% capacity.
Step 4: Enable the CloudWatch Agent for Memory Metrics
CPU is only half the story. AWS doesn't expose memory utilization by default. You need the CloudWatch Agent.
{
"agent": {
"metrics_collection_interval": 60,
"run_as_user": "cwagent"
},
"metrics": {
"namespace": "CWAgent",
"append_dimensions": {
"InstanceId": "${aws:InstanceId}",
"InstanceType": "${aws:InstanceType}",
"AutoScalingGroupName": "${aws:AutoScalingGroupName}"
},
"metrics_collected": {
"mem": {
"measurement": ["mem_used_percent", "mem_available_percent"],
"metrics_collection_interval": 60
},
"disk": {
"measurement": ["disk_used_percent"],
"metrics_collection_interval": 300,
"resources": ["*"]
},
"net": {
"measurement": ["bytes_sent", "bytes_recv"],
"metrics_collection_interval": 60
}
}
}
}
Deploy the agent via SSM for your fleet:
aws ssm send-command \
--document-name "AWS-ConfigureAWSPackage" \
--targets '[{"Key":"tag:Environment","Values":["production"]}]' \
--parameters '{"action":["Install"],"name":["AmazonCloudWatchAgent"]}'
Once memory data flows in, query it alongside CPU:
aws cloudwatch get-metric-statistics \
--namespace CWAgent \
--metric-name mem_used_percent \
--dimensions Name=InstanceId,Value=i-0abc123def456 \
--start-time 2026-03-01T00:00:00Z \
--end-time 2026-03-20T00:00:00Z \
--period 3600 \
--statistics Average Maximum
Instances running at 15% CPU and 20% memory are wasting 70-80% of their capacity. Without memory data, you're guessing.
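That waste figure follows from the binding resource: an instance is only as "used" as its most-utilized dimension. A one-line sketch of that idea:

```python
def wasted_capacity_pct(cpu_pct, mem_pct):
    """Idle capacity is driven by whichever resource is busiest."""
    return 100 - max(cpu_pct, mem_pct)

# 15% CPU and 20% memory: 80% of the box is idle
print(wasted_capacity_pct(15, 20))
```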
Step 5: Right-Size by Workload Type
Different workloads have different right-sizing strategies. Don't apply the same rule everywhere.
Compute-Bound (CI Runners, Batch Jobs)
These spike to 100% CPU during builds and sit idle otherwise. Look at the p99 CPU over a week, not the average.
# Get p99 CPU for a build server over 7 days (GNU date syntax; on macOS use date -v-7d)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123def456 \
  --start-time $(date -d '7 days ago' -u +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 86400 \
  --extended-statistics p99
If p99 is below 70%, downsize. If it's above 90%, the instance is properly sized — leave it alone.
Memory-Bound (Caches, JVM Apps)
Java apps preallocate heap. The memory usage graph looks flat. Use the r family (memory-optimized) instead of m (general purpose). Moving from m5.xlarge ($0.192/hr) to r5.large ($0.126/hr) gives you the same 16 GiB RAM at 34% less cost.
# Memory-optimized right-sizing
resource "aws_instance" "cache" {
# Before: general purpose with 16 GiB
# instance_type = "m5.xlarge" # 4 vCPU, 16 GiB, $0.192/hr
# After: memory-optimized with 16 GiB
instance_type = "r7g.large" # 2 vCPU, 16 GiB, $0.1071/hr (Graviton)
}
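Sanity-checking the memory-bound math above: same 16 GiB of RAM, roughly a third less cost. Prices are the us-east-1 on-demand rates quoted in this section.

```python
m5_xlarge_hr = 0.192  # 4 vCPU, 16 GiB
r5_large_hr = 0.126   # 2 vCPU, 16 GiB

savings_pct = (1 - r5_large_hr / m5_xlarge_hr) * 100  # about 34%
cost_per_gib_m5 = m5_xlarge_hr / 16
cost_per_gib_r5 = r5_large_hr / 16
```

The trade is explicit: you give up 2 vCPUs you weren't using to stop paying for them, while keeping every GiB the cache or JVM actually needs.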
Network-Bound (API Gateways, Proxies)
Check network bandwidth utilization. Each instance type has a baseline network performance. An m5.large provides "Up to 10 Gbps" but the sustained baseline is much lower.
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name NetworkIn \
--dimensions Name=InstanceId,Value=i-0abc123def456 \
--start-time $(date -d '7 days ago' -u +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 3600 \
--statistics Maximum
If the maximum network throughput is under 30% of the instance's baseline, a smaller instance handles it fine.
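Converting a NetworkIn datapoint into that percentage takes one step, since CloudWatch reports bytes over the sample interval, not bits per second. A sketch assuming 5-minute (300 s) samples and an illustrative 0.75 Gbps baseline for m5.large; check the EC2 docs for your instance type's real baseline.

```python
def network_pct_of_baseline(bytes_in_sample, sample_seconds, baseline_gbps):
    """Convert a CloudWatch NetworkIn datapoint (bytes over the sample
    interval) to sustained Gbps, as a percentage of the baseline."""
    gbps = bytes_in_sample * 8 / sample_seconds / 1e9
    return gbps / baseline_gbps * 100

# e.g. 5.6 GB received in 5 minutes is roughly 0.15 Gbps,
# about 20% of a 0.75 Gbps baseline: a downsize candidate
pct = network_pct_of_baseline(5.6e9, 300, 0.75)
```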
Step 6: Automate Ongoing Right-Sizing
One-time right-sizing is good. Continuous right-sizing is better.
#!/bin/bash
# Monthly right-sizing report script
# Save as right-sizing-report.sh and run via cron
REPORT_DATE=$(date +%Y-%m-%d)
OUTPUT_FILE="right-sizing-report-${REPORT_DATE}.csv"
echo "Instance ID,Current Type,Recommended Type,Monthly Savings" > "$OUTPUT_FILE"
# Note: the API's RightsizingType values are "Modify" and "Terminate";
# downsizing recommendations come back as Modify.
aws ce get-rightsizing-recommendation \
  --service "AmazonEC2" \
  --configuration '{"RecommendationTarget":"SAME_INSTANCE_FAMILY","BenefitsConsidered":true}' \
  --query 'RightsizingRecommendations[?RightsizingType==`Modify`].[CurrentInstance.ResourceId,CurrentInstance.ResourceDetails.EC2ResourceDetails.InstanceType,ModifyRecommendationDetail.TargetInstances[0].ResourceDetails.EC2ResourceDetails.InstanceType,ModifyRecommendationDetail.TargetInstances[0].EstimatedMonthlySavings]' \
  --output text | tr '\t' ',' >> "$OUTPUT_FILE"
# Send a summary to Slack
REC_COUNT=$(aws ce get-rightsizing-recommendation \
  --service "AmazonEC2" \
  --configuration '{"RecommendationTarget":"SAME_INSTANCE_FAMILY","BenefitsConsidered":true}' \
  --query 'Summary.TotalRecommendationCount' \
  --output text)
curl -X POST "$SLACK_WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d "{\"text\":\"Monthly Right-Sizing Report: ${REC_COUNT} instances have right-sizing recommendations. Check #cloud-cost for details.\"}"
Schedule it with cron or a Lambda function:
# Crontab entry: 9 AM on the first Monday of each month. Restricting both
# day-of-month AND day-of-week makes cron fire on EITHER match, so the
# weekday check has to live in the command itself.
0 9 1-7 * * [ "$(date +\%u)" -eq 1 ] && /opt/scripts/right-sizing-report.sh
Step 7: Set Up Automated Tagging for Cost Attribution
Right-sizing without cost attribution is flying blind. You need to know which team owns which instances.
# Generate the JSON for an AWS Organizations SCP that enforces tagging
# (attach it via an aws_organizations_policy resource)
data "aws_iam_policy_document" "require_tags" {
  # Conditions within a single statement are ANDed, so each required tag
  # needs its own Deny statement. Putting both Null conditions in one
  # statement would only deny launches missing BOTH tags.
  statement {
    sid       = "DenyEC2WithoutTeamTag"
    effect    = "Deny"
    actions   = ["ec2:RunInstances"]
    resources = ["arn:aws:ec2:*:*:instance/*"]
    condition {
      test     = "Null"
      variable = "aws:RequestTag/Team"
      values   = ["true"]
    }
  }
  statement {
    sid       = "DenyEC2WithoutEnvironmentTag"
    effect    = "Deny"
    actions   = ["ec2:RunInstances"]
    resources = ["arn:aws:ec2:*:*:instance/*"]
    condition {
      test     = "Null"
      variable = "aws:RequestTag/Environment"
      values   = ["true"]
    }
  }
}
No tag, no instance. Teams can't spin up resources without ownership attribution.
Common Pitfalls
Pitfall 1: Right-sizing production without a canary. Never downsize your entire fleet at once. Start with one instance in the ASG. Monitor for 48 hours. Check response times, error rates, and queue depth. Then roll to the rest.
Pitfall 2: Ignoring burst workloads. A batch job that runs for 2 hours at 95% CPU and idles for 22 hours shows 8% average CPU. The average lies. Check the maximum and p99 before downsizing.
Pitfall 3: Forgetting about Savings Plans after right-sizing. If you right-size FROM an instance covered by a Reserved Instance or Savings Plan, you might not save anything until the commitment expires. Check your RI/SP coverage before making changes.
Pitfall 4: Not accounting for headroom. Target 60-70% peak utilization after right-sizing, not 90%. Auto Scaling needs time to react, and your application needs room for traffic spikes. A perfectly right-sized instance with zero headroom is one spike away from degraded performance.
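The headroom target turns sizing into simple arithmetic: the vCPUs doing real work at peak, divided by the utilization you're willing to run at. A sketch of that calculation; the 65% default is the midpoint of the 60-70% range above.

```python
import math

def min_vcpus_for_target(current_vcpus, peak_util_pct, target_util_pct=65):
    """Smallest vCPU count that keeps the observed peak load at or
    below the target utilization."""
    used_vcpus = current_vcpus * peak_util_pct / 100
    return math.ceil(used_vcpus / (target_util_pct / 100))

# 8 vCPUs peaking at 30% is 2.4 vCPUs of real work; at a 65% target,
# 4 vCPUs suffice (one size down in most families)
print(min_vcpus_for_target(8, 30))
```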
Cost Impact Summary
For a typical 20-instance fleet averaging 25% CPU utilization:
| Action | Per-Instance Savings | Fleet Savings/Month | Annual |
|---|---|---|---|
| Downsize 1 level | $50-$80 | $1,000-$1,600 | $12,000-$19,200 |
| Switch to Graviton | $20-$40 | $400-$800 | $4,800-$9,600 |
| Combined | $70-$120 | $1,400-$2,400 | $16,800-$28,800 |
That's $17K-$29K/year in savings from a single afternoon's work. The ROI on right-sizing is the highest of any cloud optimization activity.
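The table's fleet numbers are just the per-instance ranges scaled up, which makes it easy to rerun for your own fleet size:

```python
FLEET_SIZE = 20
per_instance = {              # monthly savings ranges ($) per instance
    "downsize": (50, 80),
    "graviton": (20, 40),
}

low = sum(lo for lo, hi in per_instance.values()) * FLEET_SIZE   # $1,400/mo
high = sum(hi for lo, hi in per_instance.values()) * FLEET_SIZE  # $2,400/mo
annual = (low * 12, high * 12)                                   # ($16,800, $28,800)
```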
Tools Worth Knowing
Beyond AWS native tools, these help with right-sizing at scale:
- AWS Compute Optimizer — ML-based recommendations considering CPU, memory, disk, and network. More accurate than Cost Explorer for complex workloads.
- Spot.io (now Spot by NetApp) — Automatic instance selection and right-sizing with spot instance management.
- Kubecost — For Kubernetes workloads, shows per-pod resource waste and recommends request/limit changes.
# Enable Compute Optimizer (one-time setup)
aws compute-optimizer update-enrollment-status \
--status Active \
--include-member-accounts
# Get recommendations (savings live under each option's savingsOpportunity)
aws compute-optimizer get-ec2-instance-recommendations \
  --query 'instanceRecommendations[].{Instance:instanceArn,Current:currentInstanceType,Recommended:recommendationOptions[0].instanceType,Savings:recommendationOptions[0].savingsOpportunity.estimatedMonthlySavings.value}' \
  --output table
Compute Optimizer uses 14 days of CloudWatch data by default. For more accurate results, enable enhanced infrastructure metrics (3-month lookback) for $0.0003 per resource per hour.
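At the ~$0.0003 per resource-hour rate quoted above, the enhanced-metrics cost is trivial next to the savings it unlocks. Quick math, assuming a 730-hour month:

```python
rate_per_hour = 0.0003
monthly_per_resource = rate_per_hour * 730  # about $0.22 per resource
fleet_monthly = monthly_per_resource * 100  # about $22 for a 100-instance fleet
```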
Related Articles
The Complete AWS Cost Optimization Playbook: Compute, Storage, Networking, and Reserved Capacity
A data-driven playbook for cutting AWS costs across compute, storage, networking, and reserved capacity with real numbers and actions.
Reserved Instances vs Savings Plans: Which to Buy When
A data-driven comparison of AWS Reserved Instances vs Savings Plans — with decision frameworks, break-even math, and real purchase recommendations.
AWS Lambda Cost Optimization: Memory Tuning, Provisioned Concurrency, and ARM
Cut your AWS Lambda costs by 40-70% with memory right-sizing, ARM/Graviton migration, and smart provisioned concurrency strategies.