The Complete AWS Cost Optimization Playbook: Compute, Storage, Networking, and Reserved Capacity
The Number That Should Scare You
The average AWS customer wastes 32% of their cloud spend. Not my opinion — that's data from multiple FinOps Foundation studies. For a company spending $50,000/month, that's $192,000 per year set on fire.
I've run cost optimization engagements across dozens of organizations, from startups burning through runway to enterprises with seven-figure monthly bills. The savings are always there. Every single time. And they're usually larger than anyone expected.
This playbook is the complete system I use. We're covering every major cost category, from the obvious wins to the optimizations that require real engineering effort. Every recommendation includes the expected savings range so you can prioritize.
Before You Optimize: Build Visibility
You can't optimize what you can't see. Before touching anything, set up cost allocation.
Tagging Strategy
Every resource needs, at minimum, these four tags: `Environment`, `Team`, `Service`, and `CostCenter`. Enforce them with an AWS Config rule:
# Enforce required tags with AWS Config
aws configservice put-config-rule --config-rule '{
"ConfigRuleName": "required-tags",
"Source": {
"Owner": "AWS",
"SourceIdentifier": "REQUIRED_TAGS"
},
"InputParameters": "{\"tag1Key\":\"Environment\",\"tag2Key\":\"Team\",\"tag3Key\":\"Service\",\"tag4Key\":\"CostCenter\"}"
}'
Cost and Usage Report
Enable CUR with hourly granularity. This is your single source of truth.
aws cur put-report-definition --report-definition '{
"ReportName": "hourly-cost-report",
"TimeUnit": "HOURLY",
"Format": "Parquet",
"Compression": "Parquet",
"AdditionalSchemaElements": ["RESOURCES", "SPLIT_COST_ALLOCATION_DATA"],
"S3Bucket": "your-cur-bucket",
"S3Region": "us-east-1",
"S3Prefix": "cur",
"RefreshClosedReports": true,
"ReportVersioning": "OVERWRITE_REPORT"
}'
Query your CUR data with Athena to find waste:
-- Top 20 most expensive resources last 30 days
SELECT
line_item_resource_id,
product_product_name,
SUM(line_item_unblended_cost) AS total_cost,
MAX(resource_tags_user_team) AS team
FROM cur_database.cur_table
WHERE line_item_usage_start_date >= date_add('day', -30, current_date)
AND line_item_line_item_type = 'Usage'
GROUP BY 1, 2
ORDER BY total_cost DESC
LIMIT 20;
Category 1: Compute (Typically 50-60% of Spend)
EC2 Right-Sizing — Expected Savings: 20-40%
Most instances are oversized. Here's how to find them systematically.
# Get right-sizing recommendations
aws ce get-rightsizing-recommendation \
--service "AmazonEC2" \
--configuration '{
"RecommendationTarget": "SAME_INSTANCE_FAMILY",
"BenefitsConsidered": true
}' \
--query 'RightsizingRecommendations[*].{
Instance: CurrentInstance.ResourceId,
Current: CurrentInstance.ResourceDetails.EC2ResourceDetails.InstanceType,
Recommended: ModifyRecommendationDetail.TargetInstances[0].ResourceDetails.EC2ResourceDetails.InstanceType,
Savings: ModifyRecommendationDetail.TargetInstances[0].EstimatedMonthlySavings
}' \
--output table
For deeper analysis, pull CloudWatch metrics:
# Pull 14 days of hourly CPU stats for one instance (loop over your fleet; avg < 10% flags it)
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
--start-time $(date -d '14 days ago' -u +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 3600 \
--statistics Average Maximum \
--query 'Datapoints[*].[Timestamp,Average,Maximum]' \
--output table
Rules I follow:
- Average CPU < 10% for 14 days: downsize by 50%.
- Average CPU 10-30%: downsize one instance size.
- Memory utilization requires the CloudWatch agent — install it everywhere.
- Peak utilization matters. Check the p99, not just the average.
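The rules above can be sketched as a small classifier. This is a hedged sketch of the heuristics in this article, not an AWS API; the thresholds and the p99 guard are the assumptions from the list:

```python
# Classify one instance from ~14 days of hourly CPU datapoints (percent).
# Thresholds are this article's heuristics, not AWS-provided values.
def rightsizing_action(cpu_samples: list[float]) -> str:
    if not cpu_samples:
        return "no-data"
    avg = sum(cpu_samples) / len(cpu_samples)
    # Peak utilization matters: gate on the p99 before trusting the average.
    s = sorted(cpu_samples)
    p99 = s[min(len(s) - 1, int(0.99 * len(s)))]
    if p99 > 80:
        return "keep"  # real peaks exist; don't shrink
    if avg < 10:
        return "downsize-50%"
    if avg < 30:
        return "downsize-one-size"
    return "keep"
```

Feed it the `Average` datapoints from the CloudWatch call above, per instance, and act only on instances that stay in a downsize bucket across the full window.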
Graviton Migration — Expected Savings: 20%
AWS Graviton (ARM) instances are 20% cheaper and often faster than x86 equivalents. The migration is straightforward for most workloads.
| x86 Instance | Graviton Equivalent | Monthly Savings (on-demand) |
|---|---|---|
| m5.xlarge ($140) | m7g.xlarge ($113) | $27 (19%) |
| c5.2xlarge ($248) | c7g.2xlarge ($199) | $49 (20%) |
| r5.4xlarge ($731) | r7g.4xlarge ($590) | $141 (19%) |
# Identify instances eligible for Graviton migration
aws ec2 describe-instances \
--filters "Name=instance-type,Values=m5.*,m6i.*,c5.*,c6i.*,r5.*,r6i.*" \
--query 'Reservations[*].Instances[*].{
ID: InstanceId,
Type: InstanceType,
Name: Tags[?Key==`Name`].Value | [0]
}' --output table
Spot Instances for Fault-Tolerant Workloads — Expected Savings: 60-90%
Spot gives you 60-90% off on-demand prices. Use it for anything that can handle interruptions.
# EKS managed node group with Spot
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: production
  region: us-east-1
managedNodeGroups:
  - name: spot-workers
    # Keep one CPU architecture per node group (all x86_64 here)
    instanceTypes:
      - m5.large
      - m5a.large
      - m5d.large
      - m5n.large
      - m6i.large
    spot: true
    desiredCapacity: 5
    minSize: 2
    maxSize: 20
    labels:
      workload-type: fault-tolerant
    taints:
      - key: spot
        value: "true"
        effect: NoSchedule
Golden rule: never run Spot with a single instance type. Use at least 4-6 types across multiple sizes and families. Diversification reduces interruption rates dramatically.
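A back-of-the-envelope way to see why diversification works: if each capacity pool (instance type times AZ) independently has some chance of a capacity crunch in a given window, the chance that every pool is crunched at once shrinks exponentially with pool count. This is a toy model; the 5% per-pool rate is made up purely for illustration and is not an AWS statistic:

```python
# Toy model: probability that ALL Spot capacity pools are unavailable at
# once, assuming independent per-pool crunch rates. Illustrative only.
def p_all_pools_crunched(pools: int, per_pool_rate: float = 0.05) -> float:
    return per_pool_rate ** pools

for n in (1, 2, 4, 6):
    print(f"{n} pools: {p_all_pools_crunched(n):.0e}")
```

Real interruption events are correlated, so the true benefit is smaller than this model suggests, but the direction holds: more pools, fewer simultaneous interruptions.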
Lambda Optimization — Expected Savings: 30-50%
# Find over-provisioned Lambda functions using AWS Cost Optimization Hub
aws cost-optimization-hub list-recommendations \
--filter '{
"resourceTypes": ["Lambda"],
"actionTypes": ["Rightsize"]
}' \
--query 'items[*].{
Function: resourceId,
CurrentCost: currentResourceSummary.monthlyCost,
RecommendedCost: recommendedResourceSummary.monthlyCost,
Savings: estimatedMonthlySavings.value
}' --output table
Power-tune every function with the AWS Lambda Power Tuning tool:
# Deploy the power tuning Step Function
aws serverlessrepo create-cloud-formation-change-set \
--application-id arn:aws:serverlessrepo:us-east-1:451282441545:applications/aws-lambda-power-tuning \
--stack-name lambda-power-tuning \
--capabilities CAPABILITY_IAM
Category 2: Storage (Typically 15-25% of Spend)
S3 Lifecycle Policies — Expected Savings: 40-70%
Most S3 data is accessed once and then sits in Standard tier forever. Fix this with lifecycle rules.
{
"Rules": [
{
"ID": "intelligent-tiering-and-archive",
"Status": "Enabled",
"Filter": { "Prefix": "" },
"Transitions": [
{
"Days": 30,
"StorageClass": "INTELLIGENT_TIERING"
},
{
"Days": 90,
"StorageClass": "GLACIER_INSTANT_RETRIEVAL"
},
{
"Days": 365,
"StorageClass": "DEEP_ARCHIVE"
}
],
"NoncurrentVersionTransitions": [
{
"NoncurrentDays": 30,
"StorageClass": "GLACIER_INSTANT_RETRIEVAL"
}
],
"NoncurrentVersionExpiration": {
"NoncurrentDays": 90
},
"AbortIncompleteMultipartUpload": {
"DaysAfterInitiation": 7
}
}
]
}
The AbortIncompleteMultipartUpload rule is money you're throwing away right now: failed multipart uploads accumulate silently, and you pay full storage rates for parts that will never become objects.
# Find incomplete multipart uploads across all buckets
for bucket in $(aws s3api list-buckets --query 'Buckets[*].Name' --output text); do
count=$(aws s3api list-multipart-uploads --bucket "$bucket" \
--query 'length(Uploads)' --output text 2>/dev/null)
if [ "$count" != "None" ] && [ "$count" -gt 0 ]; then
echo "$bucket: $count incomplete uploads"
fi
done
EBS Optimization — Expected Savings: 20-40%
# Find unattached EBS volumes (you're paying for these right now)
aws ec2 describe-volumes \
--filters "Name=status,Values=available" \
--query 'Volumes[*].{
ID: VolumeId,
Size: Size,
Type: VolumeType,
Created: CreateTime
}' --output table
# Find volumes with low IOPS utilization (candidates for gp3 migration)
# gp3's baseline price is roughly 20% lower than gp2's
aws ec2 describe-volumes \
--filters "Name=volume-type,Values=gp2" \
--query 'Volumes[*].{
ID: VolumeId,
Size: Size,
Cost: "Migrate to gp3 for 20% savings"
}' --output table
Nearly every gp2 volume should be gp3. gp3 gives you 3,000 IOPS and 125 MB/s baseline for roughly 20% less money; for volumes over 1 TB, provision extra IOPS on gp3 to match gp2's 3 IOPS/GB baseline (usually still cheaper). The migration is online and zero-downtime:
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --volume-type gp3
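The savings are easy to quantify. A quick sketch, assuming us-east-1 list prices of $0.10/GB-month for gp2 and $0.08/GB-month for gp3 (check your region's current pricing):

```python
# Fleet-wide gp2 -> gp3 savings estimate. Prices are assumed us-east-1
# list rates -- verify against current AWS pricing before quoting.
GP2_PER_GB_MONTH = 0.10
GP3_PER_GB_MONTH = 0.08

def gp3_monthly_savings(total_gp2_gb: float) -> float:
    return total_gp2_gb * (GP2_PER_GB_MONTH - GP3_PER_GB_MONTH)

# 50 TB of gp2 across the fleet:
print(f"${gp3_monthly_savings(50_000):,.0f}/month")  # → $1,000/month
```

Sum the `Size` column from the describe-volumes output above to get your fleet total.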
Category 3: Networking (The Hidden Cost Monster)
NAT Gateway — Expected Savings: 50-80%
NAT Gateway charges $0.045/GB for data processing plus $0.045/hour. For a cluster doing heavy pulls from the internet, this adds up fast.
# Find NAT Gateway data-processing costs
# (usage types carry a region prefix outside us-east-1, e.g. USE2-NatGateway-Bytes)
aws ce get-cost-and-usage \
--time-period Start=2026-02-01,End=2026-03-01 \
--granularity MONTHLY \
--filter '{
"Dimensions": {
"Key": "USAGE_TYPE",
"Values": ["NatGateway-Bytes"]
}
}' \
--metrics "UnblendedCost" \
--query 'ResultsByTime[0].Total.UnblendedCost'
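At those rates, the monthly bill for a typical multi-AZ setup looks like this (using the us-east-1 rates quoted above; verify for your region):

```python
# Monthly NAT Gateway bill: hourly charge per gateway plus per-GB processing.
HOURLY_RATE = 0.045       # $/hour per gateway (us-east-1, from the text)
PER_GB_RATE = 0.045       # $/GB processed
HOURS_PER_MONTH = 730

def nat_monthly_cost(gb_processed: float, gateways: int = 1) -> float:
    return gateways * HOURLY_RATE * HOURS_PER_MONTH + gb_processed * PER_GB_RATE

# Three gateways (one per AZ) processing 10 TB/month in total:
print(f"${nat_monthly_cost(10_240, gateways=3):,.2f}")  # → $559.35
```

Note that the data-processing term dominates as traffic grows, which is why VPC endpoints (below) pay off so quickly.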
Optimizations:
- Use VPC endpoints for S3, DynamoDB, ECR, and other AWS services. This removes NAT Gateway from the path entirely.
- Consolidate non-critical traffic through a single NAT Gateway in one AZ (weigh the added cross-AZ transfer charges and the single-AZ failure risk).
- Consider NAT instances (Fck-NAT or a t4g.nano) for dev/staging environments.
# Create VPC endpoints for common services (free for Gateway endpoints)
aws ec2 create-vpc-endpoint \
--vpc-id vpc-12345678 \
--service-name com.amazonaws.us-east-1.s3 \
--route-table-ids rtb-12345678
aws ec2 create-vpc-endpoint \
--vpc-id vpc-12345678 \
--service-name com.amazonaws.us-east-1.dynamodb \
--route-table-ids rtb-12345678
Cross-AZ Data Transfer — Expected Savings: 10-20%
Every byte that crosses an AZ boundary costs $0.01/GB in each direction. For services communicating heavily across AZs, this adds up.
# Check cross-AZ transfer costs
aws ce get-cost-and-usage \
--time-period Start=2026-02-01,End=2026-03-01 \
--granularity MONTHLY \
--filter '{
"Dimensions": {
"Key": "USAGE_TYPE",
"Values": ["DataTransfer-Regional-Bytes"]
}
}' \
--metrics "UnblendedCost"
Use topology-aware routing in Kubernetes to keep traffic within AZs:
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    service.kubernetes.io/topology-mode: Auto
spec:
  selector:
    app: my-app
  ports:
    - port: 80
Category 4: Reserved Capacity — Expected Savings: 30-72%
Savings Plans vs Reserved Instances
| Commitment Type | Flexibility | Discount | Best For |
|---|---|---|---|
| Compute Savings Plans | Any instance, any region | Up to 66% | Most teams |
| EC2 Instance Savings Plans | Specific instance family, any size | Up to 72% | Stable workloads |
| Reserved Instances (Standard) | Specific instance type and AZ | Up to 72% | Very predictable usage |
| Reserved Instances (Convertible) | Can change instance type | Up to 66% | Evolving workloads |
My recommendation: Start with Compute Savings Plans. They cover EC2, Fargate, and Lambda, and you can change instance types freely. Only go to EC2-specific RIs when you have 6+ months of stable usage data.
# Analyze your commitment coverage
aws ce get-savings-plans-coverage \
--time-period Start=2026-02-01,End=2026-03-01 \
--granularity MONTHLY \
--query 'SavingsPlansCoverages[0].{
OnDemandCost: Coverage.OnDemandCost,
CoveredCost: Coverage.SpendCoveredBySavingsPlans,
CoveragePercent: Coverage.CoveragePercentage
}'
# Get purchase recommendations
aws ce get-savings-plans-purchase-recommendation \
--savings-plans-type "COMPUTE_SP" \
--term-in-years "ONE_YEAR" \
--payment-option "NO_UPFRONT" \
--lookback-period-in-days "THIRTY_DAYS"
The 80/20 Commitment Rule
Never commit to 100% of your usage. Here's the rule I follow:
- 80% of baseline: Covered by 1-year Savings Plans (No Upfront).
- Next 15%: On-demand, evaluated quarterly for additional commitments.
- Top 5% (peaks): Spot or on-demand.
This gives you the bulk of the savings without locking yourself into capacity you might not need after a re-architecture.
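Translating the split into numbers, as a sketch: the 30% effective Savings Plan discount below is a conservative assumption for a 1-year no-upfront commitment, not a quoted AWS rate.

```python
# Apply the 80/15/5 rule above to a steady hourly compute spend, plus a
# rough savings estimate. The 30% discount figure is an assumption.
HOURS_PER_MONTH = 730

def commitment_split(baseline_hourly: float) -> dict[str, float]:
    return {
        "savings_plan_commit": round(baseline_hourly * 0.80, 2),
        "on_demand_buffer": round(baseline_hourly * 0.15, 2),
        "spot_or_peak": round(baseline_hourly * 0.05, 2),
    }

def estimated_monthly_savings(baseline_hourly: float, sp_discount: float = 0.30) -> float:
    # Only the committed 80% earns the discount.
    return baseline_hourly * 0.80 * sp_discount * HOURS_PER_MONTH

split = commitment_split(100.0)  # $100/hour baseline
print(split, f"~${estimated_monthly_savings(100.0):,.0f}/month saved")
```

Use your actual hourly on-demand baseline from the coverage report above as the input.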
Category 5: Database Optimization — Expected Savings: 20-50%
RDS Right-Sizing
Database instances are the most commonly oversized resources I encounter. Teams provision for peak load and never revisit.
# Check RDS instance utilization
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name CPUUtilization \
--dimensions Name=DBInstanceIdentifier,Value=production-db \
--start-time $(date -d '14 days ago' -u +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 3600 \
--statistics Average Maximum \
--output table
# Check freeable memory (if consistently > 50% of total, downsize)
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name FreeableMemory \
--dimensions Name=DBInstanceIdentifier,Value=production-db \
--start-time $(date -d '14 days ago' -u +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 3600 \
--statistics Average Minimum \
--output table
Aurora Serverless v2 for Variable Workloads
If your database usage swings significantly between peak and off-peak, Aurora Serverless v2 can reduce costs by 40-60% compared to provisioned instances:
# Convert an existing Aurora cluster to Serverless v2:
# set the capacity range on the cluster FIRST...
aws rds modify-db-cluster \
--db-cluster-identifier production-cluster \
--serverless-v2-scaling-configuration MinCapacity=2,MaxCapacity=64
# ...then switch the instance to the serverless class
aws rds modify-db-instance \
--db-instance-identifier production-db-instance-1 \
--db-instance-class db.serverless \
--apply-immediately
The MinCapacity is your floor — you always pay for at least this many ACUs. Set it to handle your baseline traffic, and let the scaling handle peaks. I've seen teams save $3,000-$5,000/month per cluster by switching from a db.r6g.4xlarge to Aurora Serverless v2 with a 4-32 ACU range.
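Here is the rough math behind that kind of comparison. Both prices are illustrative assumptions, not quoted AWS rates; check current Aurora pricing for your region and engine:

```python
# Rough provisioned-vs-Serverless-v2 monthly cost comparison.
ACU_HOURLY = 0.12          # $/ACU-hour (assumed)
PROVISIONED_HOURLY = 2.00  # $/hour for a large provisioned instance (assumed)
HOURS_PER_MONTH = 730

def serverless_monthly(avg_acus: float) -> float:
    return avg_acus * ACU_HOURLY * HOURS_PER_MONTH

def provisioned_monthly() -> float:
    return PROVISIONED_HOURLY * HOURS_PER_MONTH

# A workload averaging 8 ACUs (bursting higher, idling near a 4-ACU floor):
print(round(serverless_monthly(8)), "vs", round(provisioned_monthly()))  # → 701 vs 1460
```

The comparison flips if your average ACU consumption approaches the provisioned instance's equivalent capacity, so measure average load before switching.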
DynamoDB On-Demand vs Provisioned
# Check your DynamoDB table's consumed capacity
aws dynamodb describe-table --table-name UserSessions \
--query 'Table.{
BillingMode: BillingModeSummary.BillingMode,
ReadCapacity: ProvisionedThroughput.ReadCapacityUnits,
WriteCapacity: ProvisionedThroughput.WriteCapacityUnits,
ItemCount: ItemCount,
TableSize: TableSizeBytes
}'
Rules I follow:
- Consistent traffic (less than 2x variance peak to trough): Use provisioned with auto-scaling. Add reserved capacity for the baseline.
- Spiky traffic (more than 4x variance): Use on-demand. The per-request price is higher but you don't pay for unused capacity.
- New tables with unknown traffic: Start on-demand, switch to provisioned once you have 2 weeks of data.
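The break-even behind these rules can be worked out directly. The prices below are illustrative us-east-1 assumptions (on-demand write pricing in particular has changed over time), so verify current DynamoDB rates before deciding:

```python
# Utilization at which provisioned write capacity beats on-demand.
OD_PER_MILLION_WRITES = 1.25  # $ per million on-demand write units (assumed)
WCU_HOURLY = 0.00065          # $ per provisioned WCU-hour (assumed)

def breakeven_utilization() -> float:
    # A fully utilized WCU performs 3,600 writes per hour.
    provisioned_per_million = WCU_HOURLY / 3600 * 1_000_000
    return provisioned_per_million / OD_PER_MILLION_WRITES

print(f"{breakeven_utilization():.1%}")  # → 14.4%
```

Under these assumed prices, provisioned capacity wins whenever you can keep it more than about 14% utilized, which is why steady traffic favors provisioned and spiky traffic favors on-demand.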
Category 6: Container and Serverless Optimization — Expected Savings: 25-40%
EKS Node Right-Sizing with Karpenter
Kubernetes clusters are often running nodes far larger than needed. Karpenter provides right-sized, just-in-time node provisioning:
# Karpenter NodePool for cost-optimized provisioning
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: cost-optimized
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m", "c", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["6"]
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "200"
    memory: 800Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
Karpenter's consolidation feature automatically replaces underutilized nodes with smaller ones. I've seen this reduce node costs by 30-40% compared to static node groups.
Lambda Right-Sizing with Power Tuning
Most Lambda functions are either over-provisioned (wasting money) or under-provisioned (slow and still wasting money because they take longer to execute). The AWS Lambda Power Tuning tool runs your function at different memory sizes and finds the optimal cost/performance balance:
# Deploy the power tuning state machine
aws serverlessrepo create-cloud-formation-change-set \
--application-id arn:aws:serverlessrepo:us-east-1:451282441545:applications/aws-lambda-power-tuning \
--stack-name lambda-power-tuning \
--capabilities CAPABILITY_IAM
# Run it against a function
aws stepfunctions start-execution \
--state-machine-arn arn:aws:states:us-east-1:123456789012:stateMachine:powerTuningStateMachine \
--input '{
"lambdaARN": "arn:aws:lambda:us-east-1:123456789012:function:my-function",
"powerValues": [128, 256, 512, 1024, 2048, 3072],
"num": 50,
"payload": "{}",
"parallelInvocation": true,
"strategy": "cost"
}'
The tool outputs a visualization showing cost vs execution time at each memory level. I've seen functions running at 1024MB that performed identically at 256MB — that's a 75% cost reduction for zero performance loss.
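The arithmetic behind that claim: Lambda bills GB-seconds, so memory scales cost linearly when duration is unchanged. The per-GB-second rate below is an assumption (a ballpark x86 rate; check current Lambda pricing):

```python
# Lambda cost model: cost = (memory in GB) * (duration in s) * rate.
GB_SECOND_RATE = 0.0000166667  # $/GB-second (assumed; verify current pricing)

def invocation_cost(memory_mb: int, duration_ms: float) -> float:
    return (memory_mb / 1024) * (duration_ms / 1000) * GB_SECOND_RATE

# Same 200 ms duration at 1024 MB vs 256 MB:
ratio = invocation_cost(1024, 200) / invocation_cost(256, 200)
print(ratio)  # → 4.0, i.e. dropping to 256 MB cuts cost by 75%
```

The catch is that lower memory also means lower CPU allocation, so duration often grows as memory shrinks; that trade-off is exactly what the power tuning tool measures for you.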
ECR Image Lifecycle Policies
Container images accumulate in ECR and cost $0.10/GB/month. Most teams never clean up old images:
# Apply lifecycle policy to expire untagged images older than 7 days
aws ecr put-lifecycle-policy \
--repository-name my-app \
--lifecycle-policy-text '{
"rules": [
{
"rulePriority": 1,
"description": "Expire untagged images after 7 days",
"selection": {
"tagStatus": "untagged",
"countType": "sinceImagePushed",
"countUnit": "days",
"countNumber": 7
},
"action": { "type": "expire" }
},
{
"rulePriority": 2,
"description": "Keep only last 20 tagged images",
"selection": {
"tagStatus": "tagged",
"tagPatternList": ["*"],
"countType": "imageCountMoreThan",
"countNumber": 20
},
"action": { "type": "expire" }
}
]
}'
Automated Cleanup for Non-Production
For dev and staging environments, schedule regular cleanup of abandoned resources:
# Find idle EKS node groups in dev
for ng in $(aws eks list-nodegroups --cluster-name dev-cluster \
--query 'nodegroups[]' --output text); do
DESIRED=$(aws eks describe-nodegroup \
--cluster-name dev-cluster \
--nodegroup-name "$ng" \
--query 'nodegroup.scalingConfig.desiredSize' --output text)
echo "$ng: desired=$DESIRED"
done
# Scale down dev cluster outside business hours (cron job)
aws eks update-nodegroup-config \
--cluster-name dev-cluster \
--nodegroup-name general \
--scaling-config minSize=0,maxSize=5,desiredSize=0
Building a Cost Culture: FinOps Practices
Technical optimizations only stick if the organization supports them. Here's what I've seen work.
Weekly Cost Review Meeting
Set up a 30-minute weekly meeting with one dashboard and three questions:
- What changed this week? Look at the cost delta from the previous week.
- What's the top-growing service? Identify the fastest cost increase.
- What's the next action item? Pick one optimization to implement before next week.
Team Cost Accountability
# Generate per-team cost report using tags
aws ce get-cost-and-usage \
--time-period Start=2026-03-01,End=2026-03-23 \
--granularity MONTHLY \
--group-by Type=TAG,Key=Team \
--metrics "UnblendedCost" \
--query 'ResultsByTime[0].Groups[*].{
Team: Keys[0],
Cost: Metrics.UnblendedCost.Amount
}' --output table
Send this to team leads monthly. When teams see their own costs, behavior changes. I've watched a team cut 40% of their spend within a month of getting their first cost report — they didn't even know they had 15 unused RDS snapshots.
Budget Alerts
# Create a budget with alerts at 80% and 100%
aws budgets create-budget --account-id 123456789012 --budget '{
"BudgetName": "monthly-infrastructure",
"BudgetLimit": {"Amount": "50000", "Unit": "USD"},
"BudgetType": "COST",
"TimeUnit": "MONTHLY",
"CostFilters": {}
}' --notifications-with-subscribers '[
{
"Notification": {
"NotificationType": "ACTUAL",
"ComparisonOperator": "GREATER_THAN",
"Threshold": 80,
"ThresholdType": "PERCENTAGE"
},
"Subscribers": [
{"SubscriptionType": "EMAIL", "Address": "devops-team@company.com"}
]
},
{
"Notification": {
"NotificationType": "ACTUAL",
"ComparisonOperator": "GREATER_THAN",
"Threshold": 100,
"ThresholdType": "PERCENTAGE"
},
"Subscribers": [
{"SubscriptionType": "EMAIL", "Address": "engineering-leads@company.com"}
]
}
]'
The Optimization Checklist
Run through this quarterly. Every item has a dollar amount attached.
| Priority | Action | Expected Savings | Effort |
|---|---|---|---|
| 1 | Delete unattached EBS volumes | 100% of the volume cost | 30 min |
| 2 | Delete unused Elastic IPs | 100% of the EIP cost | 10 min |
| 3 | Migrate gp2 to gp3 | 20% on EBS | 1 hour |
| 4 | Add S3 lifecycle policies | 40-70% on S3 | 2 hours |
| 5 | Right-size EC2 instances | 20-40% on EC2 | 1 week |
| 6 | Add VPC endpoints for S3/DynamoDB | 50%+ on NAT | 1 hour |
| 7 | Purchase Savings Plans | 30-66% on compute | 2 hours |
| 8 | Migrate to Graviton | 20% on EC2 | 2-4 weeks |
| 9 | Implement Spot for fault-tolerant | 60-90% on batch | 1-2 weeks |
| 10 | Optimize cross-AZ traffic | 10-20% on networking | 1-2 weeks |
| 11 | Right-size RDS instances | 20-40% on databases | 1 week |
| 12 | Evaluate Aurora Serverless v2 | 40-60% on Aurora | 1-2 weeks |
| 13 | Implement Karpenter for EKS | 30-40% on nodes | 2 weeks |
| 14 | Schedule dev/staging shutdowns | 60%+ on non-prod | 1-2 days |
The Bottom Line
AWS cost optimization isn't a one-time project. It's a continuous practice. The companies that save the most money are the ones that review costs weekly, tag everything, and treat cloud spend as an engineering metric — not just a finance problem.
Start with the quick wins at the top of the checklist. They take hours, not weeks, and they'll fund the engineering time for the bigger optimizations. I've never run this playbook and found less than 25% savings. Usually it's north of 35%.
The hardest part isn't the technical implementation — it's building the organizational habit. Set up the dashboards, send the reports, celebrate the wins publicly. When an engineer saves $2,000/month by right-sizing a database, make sure the whole team knows about it. Cost consciousness is a culture, and cultures are built one visible success at a time.
Your CFO will thank you. Your runway will thank you. And the next time someone spins up an m5.4xlarge for a cron job, you'll have the dashboards to catch it.
One last point: cost optimization never ends because AWS never stops adding services, and your infrastructure never stops growing. Build the review cadence into your team's rhythm. Make it a weekly habit, not an annual crisis. The teams that treat cost as a first-class engineering concern — right alongside performance, reliability, and security — are the ones that sustain their optimizations long term.