DevOpsil
Part 1 of 5 in Cloud Cost Cutting

The Complete AWS Cost Optimization Playbook: Compute, Storage, Networking, and Reserved Capacity

Dev Patel · 15 min read

The Number That Should Scare You

The average AWS customer wastes 32% of their cloud spend. Not my opinion — that's data from multiple FinOps Foundation studies. For a company spending $50,000/month, that's $192,000 per year set on fire.

I've run cost optimization engagements across dozens of organizations, from startups burning through runway to enterprises with seven-figure monthly bills. The savings are always there. Every single time. And they're usually larger than anyone expected.

This playbook is the complete system I use. We're covering every major cost category, from the obvious wins to the optimizations that require real engineering effort. Every recommendation includes the expected savings range so you can prioritize.

Before You Optimize: Build Visibility

You can't optimize what you can't see. Before touching anything, set up cost allocation.

Tagging Strategy

Every resource needs at minimum these tags:

# Enforce required tags with AWS Config
aws configservice put-config-rule --config-rule '{
  "ConfigRuleName": "required-tags",
  "Source": {
    "Owner": "AWS",
    "SourceIdentifier": "REQUIRED_TAGS"
  },
  "InputParameters": "{\"tag1Key\":\"Environment\",\"tag2Key\":\"Team\",\"tag3Key\":\"Service\",\"tag4Key\":\"CostCenter\"}"
}'

Cost and Usage Report

Enable CUR with hourly granularity. This is your single source of truth.

aws cur put-report-definition --report-definition '{
  "ReportName": "hourly-cost-report",
  "TimeUnit": "HOURLY",
  "Format": "Parquet",
  "Compression": "Parquet",
  "AdditionalSchemaElements": ["RESOURCES", "SPLIT_COST_ALLOCATION_DATA"],
  "S3Bucket": "your-cur-bucket",
  "S3Region": "us-east-1",
  "S3Prefix": "cur",
  "RefreshClosedReports": true,
  "ReportVersioning": "OVERWRITE_REPORT"
}'

Query your CUR data with Athena to find waste:

-- Top 20 most expensive resources last 30 days
SELECT
  line_item_resource_id,
  product_product_name,
  SUM(line_item_unblended_cost) AS total_cost,
  MAX(resource_tags_user_team) AS team
FROM cur_database.cur_table
WHERE line_item_usage_start_date >= date_add('day', -30, current_date)
  AND line_item_line_item_type = 'Usage'
GROUP BY 1, 2
ORDER BY total_cost DESC
LIMIT 20;

Category 1: Compute (Typically 50-60% of Spend)

EC2 Right-Sizing — Expected Savings: 20-40%

Most instances are oversized. Here's how to find them systematically.

# Get right-sizing recommendations
aws ce get-rightsizing-recommendation \
  --service "AmazonEC2" \
  --configuration '{
    "RecommendationTarget": "SAME_INSTANCE_FAMILY",
    "BenefitsConsidered": true
  }' \
  --query 'RightsizingRecommendations[*].{
    Instance: CurrentInstance.ResourceId,
    Current: CurrentInstance.InstanceType,
    Recommended: ModifyRecommendationDetail.TargetInstances[0].ResourceDetails.EC2ResourceDetails.InstanceType,
    Savings: ModifyRecommendationDetail.TargetInstances[0].EstimatedMonthlySavings
  }' \
  --output table

For deeper analysis, pull CloudWatch metrics:

# Pull 14 days of CPU stats for a candidate instance (GNU date syntax; on macOS use date -v-14d)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time $(date -d '14 days ago' -u +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Average Maximum \
  --query 'Datapoints[*].[Timestamp,Average,Maximum]' \
  --output table

Rules I follow:

  • Average CPU < 10% for 14 days: downsize by 50%.
  • Average CPU 10-30%: downsize one instance size.
  • Memory utilization requires the CloudWatch agent — install it everywhere.
  • Peak utilization matters. Check the p99, not just the average.
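That last rule is easy to check once you have raw datapoints. A minimal sketch, assuming you've dumped one CPU value per line from the CloudWatch query above (the sample numbers here are illustrative): sort the samples and take the ceiling-index 99th percentile with awk.

```shell
# Compute p99 from a file of CPU datapoints (one value per line).
# The printf line stands in for real exported CloudWatch output.
printf '%s\n' 4 6 8 5 7 9 95 6 5 4 > /tmp/cpu_samples.txt
p99=$(sort -n /tmp/cpu_samples.txt | awk '
  {a[NR]=$1}
  END {
    idx = int(NR * 0.99)
    if (NR * 0.99 > idx) idx++   # ceiling index for the 99th percentile
    print a[idx]
  }')
echo "p99 CPU: ${p99}%"
```

This sample instance averages about 15% CPU but peaks near 95% — the average says downsize, the p99 says don't.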

Graviton Migration — Expected Savings: 20%

AWS Graviton (ARM) instances are 20% cheaper and often faster than x86 equivalents. The migration is straightforward for most workloads.

x86 Instance       | Graviton Equivalent | Monthly Savings (on-demand)
m5.xlarge ($140)   | m7g.xlarge ($113)   | $27 (19%)
c5.2xlarge ($248)  | c7g.2xlarge ($199)  | $49 (20%)
r5.4xlarge ($731)  | r7g.4xlarge ($590)  | $141 (19%)

# Identify instances eligible for Graviton migration
aws ec2 describe-instances \
  --filters "Name=instance-type,Values=m5.*,m6i.*,c5.*,c6i.*,r5.*,r6i.*" \
  --query 'Reservations[*].Instances[*].{
    ID: InstanceId,
    Type: InstanceType,
    Name: Tags[?Key==`Name`].Value | [0]
  }' --output table

Spot Instances for Fault-Tolerant Workloads — Expected Savings: 60-90%

Spot gives you 60-90% off on-demand prices. Use it for anything that can handle interruptions.

# EKS managed node group with Spot
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: production
  region: us-east-1

managedNodeGroups:
  - name: spot-workers
    instanceTypes:
      - m5.large
      - m5a.large
      - m5d.large
      - m6i.large
      - m7g.large
    spot: true
    desiredCapacity: 5
    minSize: 2
    maxSize: 20
    labels:
      workload-type: fault-tolerant
    taints:
      - key: spot
        value: "true"
        effect: NoSchedule

Golden rule: never run Spot with a single instance type. Use at least 4-6 types across multiple sizes and families. Diversification reduces interruption rates dramatically.

Lambda Optimization — Expected Savings: 30-50%

# Find over-provisioned Lambda functions using AWS Cost Optimization Hub
aws cost-optimization-hub list-recommendations \
  --filter '{
    "resourceTypes": ["Lambda"],
    "actionTypes": ["Rightsize"]
  }' \
  --query 'items[*].{
    Function: resourceId,
    CurrentCost: currentResourceSummary.monthlyCost,
    RecommendedCost: recommendedResourceSummary.monthlyCost,
    Savings: estimatedMonthlySavings.value
  }' --output table

Then power-tune every function with the AWS Lambda Power Tuning tool — deployment and usage are covered in detail under Category 6 below.

Category 2: Storage (Typically 15-25% of Spend)

S3 Lifecycle Policies — Expected Savings: 40-70%

Most S3 data is accessed once and then sits in Standard tier forever. Fix this with lifecycle rules.

{
  "Rules": [
    {
      "ID": "intelligent-tiering-and-archive",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "INTELLIGENT_TIERING"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER_INSTANT_RETRIEVAL"
        },
        {
          "Days": 365,
          "StorageClass": "DEEP_ARCHIVE"
        }
      ],
      "NoncurrentVersionTransitions": [
        {
          "NoncurrentDays": 30,
          "StorageClass": "GLACIER_INSTANT_RETRIEVAL"
        }
      ],
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 90
      },
      "AbortIncompleteMultipartUpload": {
        "DaysAfterInitiation": 7
      }
    }
  ]
}

The AbortIncompleteMultipartUpload rule is money you're throwing away right now. Incomplete multipart uploads accumulate silently and cost real money.

# Find incomplete multipart uploads across all buckets
for bucket in $(aws s3api list-buckets --query 'Buckets[*].Name' --output text); do
  count=$(aws s3api list-multipart-uploads --bucket "$bucket" \
    --query 'length(Uploads || `[]`)' --output text 2>/dev/null)
  if [ -n "$count" ] && [ "$count" -gt 0 ] 2>/dev/null; then
    echo "$bucket: $count incomplete uploads"
  fi
done

EBS Optimization — Expected Savings: 20-40%

# Find unattached EBS volumes (you're paying for these right now)
aws ec2 describe-volumes \
  --filters "Name=status,Values=available" \
  --query 'Volumes[*].{
    ID: VolumeId,
    Size: Size,
    Type: VolumeType,
    Created: CreateTime
  }' --output table

# List remaining gp2 volumes (every one is a gp3 migration candidate)
# gp3 is cheaper than gp2 in every scenario
aws ec2 describe-volumes \
  --filters "Name=volume-type,Values=gp2" \
  --query 'Volumes[*].{
    ID: VolumeId,
    Size: Size,
    Iops: Iops
  }' --output table

Every gp2 volume should be gp3. No exceptions. gp3 gives you 3000 IOPS and 125 MB/s baseline for 20% less money. The migration is online and zero-downtime:

aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --volume-type gp3
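The savings are easy to sanity-check. A quick back-of-envelope in shell, using us-east-1 list prices ($0.10/GB-month for gp2, $0.08/GB-month for gp3) — prices vary by region, so treat these as illustrative:

```shell
# gp2 vs gp3 monthly storage cost for a 500 GB volume (us-east-1 list prices)
SIZE_GB=500
GP2=$(awk -v s="$SIZE_GB" 'BEGIN {printf "%.2f", s * 0.10}')
GP3=$(awk -v s="$SIZE_GB" 'BEGIN {printf "%.2f", s * 0.08}')
SAVED=$(awk -v a="$GP2" -v b="$GP3" 'BEGIN {printf "%.2f", a - b}')
echo "gp2: \$${GP2}/mo  gp3: \$${GP3}/mo  saved: \$${SAVED}/mo"
```

Multiply by your volume count — a fleet with 200 such volumes saves $2,000/month from a one-line migration per volume.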

Category 3: Networking (The Hidden Cost Monster)

NAT Gateway — Expected Savings: 50-80%

NAT Gateway charges $0.045/GB for data processing plus $0.045/hour. For a cluster doing heavy pulls from the internet, this adds up fast.

# Find NAT Gateway costs
aws ce get-cost-and-usage \
  --time-period Start=2026-02-01,End=2026-03-01 \
  --granularity MONTHLY \
  --filter '{
    "Dimensions": {
      "Key": "USAGE_TYPE",
      "Values": ["NatGateway-Bytes"]
    }
  }' \
  --metrics "UnblendedCost" \
  --query 'ResultsByTime[0].Total.UnblendedCost'
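To see what that rate structure means in practice, here's a rough monthly estimate for one gateway processing 10 TB, using the $0.045/GB and $0.045/hour figures above (~730 hours in a month):

```shell
# Rough NAT Gateway monthly bill: data processing charge + hourly charge
GB_PROCESSED=10240   # 10 TB
COST=$(awk -v gb="$GB_PROCESSED" 'BEGIN {printf "%.2f", gb * 0.045 + 730 * 0.045}')
echo "Estimated NAT Gateway cost: \$${COST}/month"
```

And that's before any separate data-transfer charges — a free S3 Gateway endpoint removes that class of traffic from the NAT path entirely.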

Optimizations:

  1. Use VPC endpoints for S3, DynamoDB, ECR, and other AWS services. This removes NAT Gateway from the path entirely.
  2. Route non-critical traffic through a single NAT Gateway in one AZ. You trade AZ-failure resilience (and pick up some cross-AZ transfer charges) for two fewer hourly gateway bills.
  3. Consider NAT instances (Fck-NAT or a t4g.nano) for dev/staging environments.
# Create VPC endpoints for common services (free for Gateway endpoints)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-12345678 \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-12345678

aws ec2 create-vpc-endpoint \
  --vpc-id vpc-12345678 \
  --service-name com.amazonaws.us-east-1.dynamodb \
  --route-table-ids rtb-12345678

Cross-AZ Data Transfer — Expected Savings: 10-20%

Every byte that crosses an AZ boundary costs $0.01/GB in each direction. For services communicating heavily across AZs, this adds up.

# Check cross-AZ transfer costs
aws ce get-cost-and-usage \
  --time-period Start=2026-02-01,End=2026-03-01 \
  --granularity MONTHLY \
  --filter '{
    "Dimensions": {
      "Key": "USAGE_TYPE",
      "Values": ["DataTransfer-Regional-Bytes"]
    }
  }' \
  --metrics "UnblendedCost"

Use topology-aware routing in Kubernetes to keep traffic within AZs:

apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    service.kubernetes.io/topology-mode: Auto
spec:
  selector:
    app: my-app
  ports:
    - port: 80

Category 4: Reserved Capacity — Expected Savings: 30-72%

Savings Plans vs Reserved Instances

Commitment Type                  | Flexibility                        | Discount  | Best For
Compute Savings Plans            | Any instance, any region           | Up to 66% | Most teams
EC2 Instance Savings Plans       | Specific instance family, any size | Up to 72% | Stable workloads
Reserved Instances (Standard)    | Specific instance type and AZ      | Up to 72% | Very predictable usage
Reserved Instances (Convertible) | Can change instance type           | Up to 66% | Evolving workloads

My recommendation: Start with Compute Savings Plans. They cover EC2, Fargate, and Lambda, and you can change instance types freely. Only go to EC2-specific RIs when you have 6+ months of stable usage data.

# Analyze your commitment coverage
aws ce get-savings-plans-coverage \
  --time-period Start=2026-02-01,End=2026-03-01 \
  --granularity MONTHLY \
  --query 'SavingsPlansCoverages[0].{
    OnDemandCost: Coverage.OnDemandCost,
    CoveredCost: Coverage.SpendCoveredBySavingsPlans,
    CoveragePercent: Coverage.CoveragePercentage
  }'

# Get purchase recommendations
aws ce get-savings-plans-purchase-recommendation \
  --savings-plans-type "COMPUTE_SP" \
  --term-in-years "ONE_YEAR" \
  --payment-option "NO_UPFRONT" \
  --lookback-period-in-days "THIRTY_DAYS"

The 80/20 Commitment Rule

Never commit to 100% of your usage. Here's the rule I follow:

  • 80% of baseline: Covered by 1-year Savings Plans (No Upfront).
  • Next 15%: On-demand, evaluated quarterly for additional commitments.
  • Top 5% (peaks): Spot or on-demand.

This gives you the bulk of the savings without locking yourself into capacity you might not need after a re-architecture.
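Here's a worked example of what the 80/20 split buys. Assume $100,000/month of on-demand compute and an illustrative ~28% discount for a 1-year No Upfront Compute Savings Plan (actual rates vary by instance family, region, and OS):

```shell
# Blended savings from covering 80% of baseline with a Savings Plan
MONTHLY=100000
DISCOUNT=0.28   # illustrative 1-yr No Upfront Compute SP rate (assumption)
SAVED=$(awk -v m="$MONTHLY" -v d="$DISCOUNT" 'BEGIN {printf "%.0f", m * 0.80 * d}')
EFF=$(awk -v d="$DISCOUNT" 'BEGIN {printf "%.1f", 0.80 * d * 100}')
echo "Monthly savings: \$${SAVED} (effective discount: ${EFF}% of total spend)"
```

Covering the remaining 20% at the same rate would add only a few points of effective discount while exposing you to paying for capacity you no longer run.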

Category 5: Database Optimization — Expected Savings: 20-50%

RDS Right-Sizing

Database instances are the most commonly oversized resources I encounter. Teams provision for peak load and never revisit.

# Check RDS instance utilization
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name CPUUtilization \
  --dimensions Name=DBInstanceIdentifier,Value=production-db \
  --start-time $(date -d '14 days ago' -u +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Average Maximum \
  --output table

# Check freeable memory (if consistently > 50% of total, downsize)
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name FreeableMemory \
  --dimensions Name=DBInstanceIdentifier,Value=production-db \
  --start-time $(date -d '14 days ago' -u +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Average Minimum \
  --output table

Aurora Serverless v2 for Variable Workloads

If your database usage swings significantly between peak and off-peak, Aurora Serverless v2 can reduce costs by 40-60% compared to provisioned instances:

# 1) Set the Serverless v2 capacity range on the cluster first
aws rds modify-db-cluster \
  --db-cluster-identifier production-cluster \
  --serverless-v2-scaling-configuration MinCapacity=2,MaxCapacity=64

# 2) Then switch the instance class to db.serverless
aws rds modify-db-instance \
  --db-instance-identifier production-db-instance-1 \
  --db-instance-class db.serverless \
  --apply-immediately

The MinCapacity is your floor — you always pay for at least this many ACUs. Set it to handle your baseline traffic, and let the scaling handle peaks. I've seen teams save $3,000-$5,000/month per cluster by switching from a db.r6g.4xlarge to Aurora Serverless v2 with a 4-32 ACU range.
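The arithmetic behind those savings, with hedged numbers: Serverless v2 bills per ACU-hour (about $0.12 in us-east-1), so a cluster averaging 8 ACUs costs far less than a provisioned instance sized for its peak. The db.r6g.4xlarge hourly rate below is an approximation — check current pricing for your engine and region:

```shell
# Provisioned 24/7 vs Serverless v2 at an 8-ACU average (illustrative prices)
PROVISIONED=$(awk 'BEGIN {printf "%.2f", 2.32 * 730}')    # ~db.r6g.4xlarge $/hr (assumed)
SERVERLESS=$(awk 'BEGIN {printf "%.2f", 8 * 0.12 * 730}') # avg ACUs * $/ACU-hr * hrs/mo
echo "provisioned: \$${PROVISIONED}/mo  serverless v2: \$${SERVERLESS}/mo"
```

The gap widens further when you count replicas, which scale down with the same configuration.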

DynamoDB On-Demand vs Provisioned

# Check your DynamoDB table's consumed capacity
aws dynamodb describe-table --table-name UserSessions \
  --query 'Table.{
    BillingMode: BillingModeSummary.BillingMode,
    ReadCapacity: ProvisionedThroughput.ReadCapacityUnits,
    WriteCapacity: ProvisionedThroughput.WriteCapacityUnits,
    ItemCount: ItemCount,
    TableSize: TableSizeBytes
  }'

Rules I follow:

  • Consistent traffic (less than 2x variance peak to trough): Use provisioned with auto-scaling. Add reserved capacity for the baseline.
  • Spiky traffic (more than 4x variance): Use on-demand. The per-request price is higher but you don't pay for unused capacity.
  • New tables with unknown traffic: Start on-demand, switch to provisioned once you have 2 weeks of data.
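The break-even behind those rules can be computed directly. Using assumed us-east-1 list prices (on-demand writes at $1.25 per million, provisioned WCUs at $0.00065/hour, where one WCU handles 3,600 writes/hour at full utilization — prices change, so recompute with current rates):

```shell
# Utilization break-even: provisioned cost per million writes at 100% use,
# expressed as a fraction of the on-demand price per million writes
BREAKEVEN=$(awk 'BEGIN {printf "%.1f", (0.00065 / 3600 * 1000000) / 1.25 * 100}')
echo "On-demand wins below ~${BREAKEVEN}% average utilization of provisioned capacity"
```

That's why spiky tables belong on on-demand: if peaks force you to provision capacity that sits below that utilization threshold on average, you're overpaying.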

Category 6: Container and Serverless Optimization — Expected Savings: 25-40%

EKS Node Right-Sizing with Karpenter

Kubernetes clusters are often running nodes far larger than needed. Karpenter provides right-sized, just-in-time node provisioning:

# Karpenter NodePool for cost-optimized provisioning
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: cost-optimized
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m", "c", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["6"]
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "200"
    memory: 800Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s

Karpenter's consolidation feature automatically replaces underutilized nodes with smaller ones. I've seen this reduce node costs by 30-40% compared to static node groups.

Lambda Right-Sizing with Power Tuning

Most Lambda functions are either over-provisioned (wasting money) or under-provisioned (slow and still wasting money because they take longer to execute). The AWS Lambda Power Tuning tool runs your function at different memory sizes and finds the optimal cost/performance balance:

# Deploy the power tuning state machine
aws serverlessrepo create-cloud-formation-change-set \
  --application-id arn:aws:serverlessrepo:us-east-1:451282441545:applications/aws-lambda-power-tuning \
  --stack-name lambda-power-tuning \
  --capabilities CAPABILITY_IAM

# Run it against a function
aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:us-east-1:123456789012:stateMachine:powerTuningStateMachine \
  --input '{
    "lambdaARN": "arn:aws:lambda:us-east-1:123456789012:function:my-function",
    "powerValues": [128, 256, 512, 1024, 2048, 3072],
    "num": 50,
    "payload": "{}",
    "parallelInvocation": true,
    "strategy": "cost"
  }'

The tool outputs a visualization showing cost vs execution time at each memory level. I've seen functions running at 1024MB that performed identically at 256MB — that's a 75% cost reduction for zero performance loss.
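Lambda's compute charge is linear in configured memory (billed in GB-seconds, roughly $0.0000166667 per GB-second on x86), so when execution time doesn't change, cost scales directly with the memory setting:

```shell
# Compute cost for 1M invocations at 200ms, comparing 1024MB vs 256MB
AT_1024=$(awk 'BEGIN {printf "%.2f", 1000000 * 0.2 * 1.0  * 0.0000166667}')
AT_256=$(awk  'BEGIN {printf "%.2f", 1000000 * 0.2 * 0.25 * 0.0000166667}')
echo "1024MB: \$${AT_1024}  256MB: \$${AT_256}  (per 1M invocations, excl. request fees)"
```

Note the caveat: below 1769MB a function gets a proportional slice of a vCPU, so CPU-bound functions may run slower at lower memory — which is exactly why you measure with Power Tuning instead of guessing.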

ECR Image Lifecycle Policies

Container images accumulate in ECR and cost $0.10/GB/month. Most teams never clean up old images:

# Apply lifecycle policy to expire untagged images older than 7 days
aws ecr put-lifecycle-policy \
  --repository-name my-app \
  --lifecycle-policy-text '{
    "rules": [
      {
        "rulePriority": 1,
        "description": "Expire untagged images after 7 days",
        "selection": {
          "tagStatus": "untagged",
          "countType": "sinceImagePushed",
          "countUnit": "days",
          "countNumber": 7
        },
        "action": { "type": "expire" }
      },
      {
        "rulePriority": 2,
        "description": "Keep only last 20 tagged images",
        "selection": {
          "tagStatus": "tagged",
          "tagPatternList": ["*"],
          "countType": "imageCountMoreThan",
          "countNumber": 20
        },
        "action": { "type": "expire" }
      }
    ]
  }'

Automated Cleanup for Non-Production

For dev and staging environments, schedule regular cleanup of abandoned resources:

# Find idle EKS node groups in dev
for ng in $(aws eks list-nodegroups --cluster-name dev-cluster \
  --query 'nodegroups[]' --output text); do
  DESIRED=$(aws eks describe-nodegroup \
    --cluster-name dev-cluster \
    --nodegroup-name "$ng" \
    --query 'nodegroup.scalingConfig.desiredSize' --output text)
  echo "$ng: desired=$DESIRED"
done

# Scale down dev cluster outside business hours (cron job)
aws eks update-nodegroup-config \
  --cluster-name dev-cluster \
  --nodegroup-name general \
  --scaling-config minSize=0,maxSize=5,desiredSize=0
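The savings math for a nights-and-weekends schedule is simple: a week has 168 hours, and keeping dev up only ~50 business hours eliminates most of them:

```shell
# Percentage of node-hours eliminated by an off-hours shutdown schedule
PCT=$(awk 'BEGIN {printf "%.0f", (168 - 50) / 168 * 100}')
echo "Dev cluster runs ${PCT}% fewer hours"
```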

Building a Cost Culture: FinOps Practices

Technical optimizations only stick if the organization supports them. Here's what I've seen work.

Weekly Cost Review Meeting

Set up a 30-minute weekly meeting with one dashboard and three questions:

  1. What changed this week? Look at the cost delta from the previous week.
  2. What's the top-growing service? Identify the fastest cost increase.
  3. What's the next action item? Pick one optimization to implement before next week.

Team Cost Accountability

# Generate per-team cost report using tags
aws ce get-cost-and-usage \
  --time-period Start=2026-03-01,End=2026-03-23 \
  --granularity MONTHLY \
  --group-by Type=TAG,Key=Team \
  --metrics "UnblendedCost" \
  --query 'ResultsByTime[0].Groups[*].{
    Team: Keys[0],
    Cost: Metrics.UnblendedCost.Amount
  }' --output table

Send this to team leads monthly. When teams see their own costs, behavior changes. I've watched a team cut 40% of their spend within a month of getting their first cost report — they didn't even know they had 15 unused RDS snapshots.

Budget Alerts

# Create a budget with alerts at 80% and 100%
aws budgets create-budget --account-id 123456789012 --budget '{
  "BudgetName": "monthly-infrastructure",
  "BudgetLimit": {"Amount": "50000", "Unit": "USD"},
  "BudgetType": "COST",
  "TimeUnit": "MONTHLY",
  "CostFilters": {}
}' --notifications-with-subscribers '[
  {
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80,
      "ThresholdType": "PERCENTAGE"
    },
    "Subscribers": [
      {"SubscriptionType": "EMAIL", "Address": "devops-team@company.com"}
    ]
  },
  {
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 100,
      "ThresholdType": "PERCENTAGE"
    },
    "Subscribers": [
      {"SubscriptionType": "EMAIL", "Address": "engineering-leads@company.com"}
    ]
  }
]'

The Optimization Checklist

Run through this quarterly. Every item has a dollar amount attached.

Priority | Action                            | Expected Savings     | Effort
1        | Delete unattached EBS volumes     | Immediate            | 30 min
2        | Delete unused Elastic IPs         | Immediate            | 10 min
3        | Migrate gp2 to gp3                | 20% on EBS           | 1 hour
4        | Add S3 lifecycle policies         | 40-70% on S3         | 2 hours
5        | Right-size EC2 instances          | 20-40% on EC2        | 1 week
6        | Add VPC endpoints for S3/DynamoDB | 50%+ on NAT          | 1 hour
7        | Purchase Savings Plans            | 30-66% on compute    | 2 hours
8        | Migrate to Graviton               | 20% on EC2           | 2-4 weeks
9        | Spot for fault-tolerant workloads | 60-90% on batch      | 1-2 weeks
10       | Optimize cross-AZ traffic         | 10-20% on networking | 1-2 weeks
11       | Right-size RDS instances          | 20-40% on databases  | 1 week
12       | Evaluate Aurora Serverless v2     | 40-60% on Aurora     | 1-2 weeks
13       | Implement Karpenter for EKS       | 30-40% on nodes      | 2 weeks
14       | Schedule dev/staging shutdowns    | 60%+ on non-prod     | 1-2 days

The Bottom Line

AWS cost optimization isn't a one-time project. It's a continuous practice. The companies that save the most money are the ones that review costs weekly, tag everything, and treat cloud spend as an engineering metric — not just a finance problem.

Start with the quick wins at the top of the checklist. They take hours, not weeks, and they'll fund the engineering time for the bigger optimizations. I've never run this playbook and found less than 25% savings. Usually it's north of 35%.

The hardest part isn't the technical implementation — it's building the organizational habit. Set up the dashboards, send the reports, celebrate the wins publicly. When an engineer saves $2,000/month by right-sizing a database, make sure the whole team knows about it. Cost consciousness is a culture, and cultures are built one visible success at a time.

Your CFO will thank you. Your runway will thank you. And the next time someone spins up an m5.4xlarge for a cron job, you'll have the dashboards to catch it.

One last point: cost optimization never ends because AWS never stops adding services, and your infrastructure never stops growing. Build the review cadence into your team's rhythm. Make it a weekly habit, not an annual crisis. The teams that treat cost as a first-class engineering concern — right alongside performance, reliability, and security — are the ones that sustain their optimizations long term.

Dev Patel

Cloud Cost Optimization Specialist

I find the money your cloud is wasting. FinOps practitioner, data-driven analyst, and the person your CFO wishes they'd hired sooner. Every dollar saved is a dollar earned.