AWS Core Services: The DevOps Engineer's Essential Guide

Aareez Asif · 26 min read

AWS has over 200 services. Nobody uses all of them. As a DevOps engineer, your job is to know the core services deeply and the rest well enough to know when they solve a problem. This guide covers the services you will touch every single week, with real CLI examples, pricing context, architecture patterns, and the operational details that matter when you are on call at 2 AM.

AWS Account Structure and Organizations

Before you provision a single resource, understand how AWS organizes access. A production-grade setup uses AWS Organizations with multiple accounts, and getting this right early prevents painful migrations later.

The Multi-Account Strategy

A typical enterprise structure looks like this:

  • Management Account -- billing, consolidated logs, Organization policies. No workloads run here.
  • Security Account -- GuardDuty, Security Hub, centralized CloudTrail, AWS Config aggregator.
  • Log Archive Account -- immutable storage for CloudTrail logs, VPC flow logs, and audit trails.
  • Shared Services Account -- DNS (Route 53), shared container registries (ECR), CI/CD tooling, artifact storage.
  • Network Account -- Transit Gateway, Direct Connect, shared VPC infrastructure.
  • Workload Accounts -- dev, staging, production, each fully isolated with separate IAM boundaries.

This separation exists because a single AWS account becomes a blast radius. If an attacker compromises your production account, they should not be able to touch your audit logs or billing configuration. AWS Organizations lets you manage all these accounts centrally.

Service Control Policies (SCPs)

SCPs are guardrails applied at the Organization or Organizational Unit (OU) level. They restrict what member accounts can do, even if the account's IAM policies allow it. Think of SCPs as a ceiling on permissions.

For example, deny all activity outside your approved regions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUnapprovedRegions",
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": ["us-east-1", "eu-west-1"]
        }
      }
    }
  ]
}

Other common SCPs include preventing member accounts from leaving the Organization, blocking the creation of IAM users with console access (forcing SSO instead), and denying public S3 bucket policies. These guardrails catch mistakes before they become incidents.

AWS Control Tower

For teams setting up multi-account environments from scratch, AWS Control Tower automates the creation of accounts, OUs, and baseline guardrails. It provisions a landing zone with pre-configured security baselines, SSO configuration, and centralized logging. Control Tower uses Account Factory to let teams request new accounts through a self-service catalog, ensuring every account starts with the correct configuration.

IAM: Identity and Access Management

IAM is the service you will interact with the most and get wrong the most. Every API call in AWS is authorized through IAM. Mastering it is not optional.

Core Concepts

| Concept | What It Is | When to Use |
| --- | --- | --- |
| User | Long-lived identity with credentials | Human access (prefer SSO instead) |
| Group | Collection of users sharing policies | Organizing human permissions |
| Role | Assumable identity, temporary credentials | EC2 instances, Lambda, cross-account access |
| Policy | JSON document defining permissions | Attached to users, groups, or roles |
| Instance Profile | Wrapper that lets EC2 assume a role | Every EC2 instance that calls AWS APIs |
| Permission Boundary | Maximum permissions an entity can have | Delegated administration |
| Session Policy | Inline policy passed during role assumption | Temporary scope reduction |

IAM Policy Evaluation Logic

Understanding how AWS evaluates policies prevents hours of debugging. The evaluation order is:

  1. Explicit Deny -- if any policy says Deny, the request is denied. Period.
  2. SCPs -- the Organization-level ceiling. If the SCP does not allow it, it is denied.
  3. Permission Boundaries -- if set, the effective permissions are the intersection of the boundary and the identity policy.
  4. Session Policies -- further restricts permissions during an assumed role session.
  5. Identity Policies -- the policies attached to the user, group, or role.
  6. Resource Policies -- policies on the resource itself (S3 bucket policy, SQS queue policy).
  7. Default Deny -- if nothing explicitly allows the action, it is denied.
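
That flow can be modeled as a short function. This is a simplified sketch of same-account evaluation only; it ignores the cross-account resource-policy rules and treats each stage as a boolean:

```python
def is_allowed(action, *, explicit_deny=False, scp_allows=True,
               boundary_allows=True, identity_allows=False,
               resource_allows=False):
    """Simplified model of IAM evaluation for a same-account request."""
    if explicit_deny:        # 1. any explicit Deny wins, period
        return False
    if not scp_allows:       # 2. the Organization-level ceiling
        return False
    if not boundary_allows:  # 3. permission boundary intersection
        return False
    # 5/6. something must explicitly allow the action...
    # 7. ...otherwise the default is deny
    return identity_allows or resource_allows

# Identity policy allows s3:GetObject, but an SCP blocks the region:
print(is_allowed("s3:GetObject", identity_allows=True, scp_allows=False))  # False
# Same request inside an approved region:
print(is_allowed("s3:GetObject", identity_allows=True))                    # True
```

The key takeaway the model makes concrete: an Allow in an identity policy is necessary but never sufficient — every layer above it can still veto the request.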

IAM Best Practices for DevOps

  1. Never use the root account for daily work. Lock it down with MFA and use it only for billing changes and account recovery.
  2. Use roles, not users, for workloads. EC2 instances, Lambda functions, and ECS tasks should all assume roles.
  3. Least privilege always. Start with zero permissions and add what is needed. Use IAM Access Analyzer to identify unused permissions.
  4. Use conditions. Restrict by source IP, MFA presence, request time, or resource tags.
  5. Enable CloudTrail in every account. Every IAM action generates an API event that CloudTrail records.
  6. Rotate credentials. If you must use access keys, rotate them every 90 days. Better yet, use IAM Identity Center (SSO) for human access.

Create a role for an EC2 instance that can read from a specific S3 bucket:

# Create the trust policy
cat > trust-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Create the role
aws iam create-role \
  --role-name AppServerRole \
  --assume-role-policy-document file://trust-policy.json

# Attach a scoped policy
aws iam put-role-policy \
  --role-name AppServerRole \
  --policy-name S3ReadAccess \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
          "arn:aws:s3:::my-app-config-bucket",
          "arn:aws:s3:::my-app-config-bucket/*"
        ]
      }
    ]
  }'

# Create the instance profile and add the role
aws iam create-instance-profile --instance-profile-name AppServerProfile
aws iam add-role-to-instance-profile \
  --instance-profile-name AppServerProfile \
  --role-name AppServerRole

Cross-Account Access

Cross-account role assumption is how you securely access resources in other accounts without sharing credentials. The pattern is:

  1. Account B creates a role with a trust policy allowing Account A to assume it.
  2. Account A's IAM entity calls sts:AssumeRole targeting the role ARN in Account B.
  3. STS returns temporary credentials scoped to Account B's role.

# From Account A, assume a role in Account B
CREDS=$(aws sts assume-role \
  --role-arn arn:aws:iam::987654321098:role/CrossAccountDeployRole \
  --role-session-name deploy-session \
  --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
  --output text)

# Export the temporary credentials
export AWS_ACCESS_KEY_ID=$(echo $CREDS | cut -d' ' -f1)
export AWS_SECRET_ACCESS_KEY=$(echo $CREDS | cut -d' ' -f2)
export AWS_SESSION_TOKEN=$(echo $CREDS | cut -d' ' -f3)

# Now all AWS CLI commands operate in Account B
aws s3 ls

Cross-Cloud IAM Comparison

| Feature | AWS IAM | Azure RBAC | GCP IAM | Alibaba RAM |
| --- | --- | --- | --- | --- |
| Identity for workloads | IAM Roles | Managed Identities | Service Accounts | RAM Roles |
| Human access | IAM Identity Center (SSO) | Entra ID | Cloud Identity | IDaaS |
| Policy language | JSON | JSON (Azure Policy) | YAML/JSON bindings | JSON |
| Permission inheritance | None (explicit) | Scope hierarchy | Resource hierarchy | None (explicit) |
| Temporary credentials | STS AssumeRole | Managed Identity tokens | Workload Identity | STS AssumeRole |
| Condition keys | 50+ global keys | Conditions in policies | IAM Conditions | Limited conditions |

EC2: Elastic Compute Cloud

EC2 is the foundational compute service. Even if you run containers or serverless, you need to understand EC2 because many managed services run on it under the hood, and EC2 knowledge translates directly to cost optimization.

Instance Types That Matter

| Family | Use Case | Example | On-Demand Price (us-east-1) |
| --- | --- | --- | --- |
| t3/t3a | Burstable, dev/test, small workloads | t3.medium (2 vCPU, 4 GB) | ~$0.0416/hr |
| m6i/m6a | General purpose, production web servers | m6i.xlarge (4 vCPU, 16 GB) | ~$0.192/hr |
| m7g | General purpose, Graviton3 ARM | m7g.xlarge (4 vCPU, 16 GB) | ~$0.163/hr |
| c6i/c7g | CPU-intensive, CI/CD build agents | c6i.2xlarge (8 vCPU, 16 GB) | ~$0.34/hr |
| r6i | Memory-intensive, caches, in-memory DBs | r6i.xlarge (4 vCPU, 32 GB) | ~$0.252/hr |
| g5 | GPU, ML inference | g5.xlarge (4 vCPU, 16 GB, 1 GPU) | ~$1.006/hr |
| i3en | Storage-optimized, databases | i3en.xlarge (4 vCPU, 32 GB, 2.5 TB NVMe) | ~$0.452/hr |

The a suffix means AMD (cheaper), g suffix means Graviton (ARM, cheaper and often faster). Graviton instances typically give you 20-40% better price-performance for Linux workloads. If your application runs on Linux and does not depend on x86 architecture, Graviton should be your default.

Purchasing Options and Pricing

| Option | Savings | Commitment | Best For |
| --- | --- | --- | --- |
| On-Demand | 0% (baseline) | None | Unpredictable workloads, short-term |
| Reserved Instances (RI) | 30-60% | 1 or 3 years | Steady-state production workloads |
| Savings Plans | 30-60% | 1 or 3 years | Flexible across instance families |
| Spot Instances | 60-90% | None (can be interrupted) | CI/CD, batch, fault-tolerant |
| Dedicated Hosts | Varies | Hourly or reserved | Licensing compliance, regulatory |

Savings Plans are generally preferred over Reserved Instances because they offer flexibility across instance families, sizes, and even between EC2 and Fargate. Compute Savings Plans apply to any instance family in any region. EC2 Instance Savings Plans are cheaper but locked to a specific instance family in a specific region.
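
To make the discount concrete, here is an illustrative calculation using the m6i.xlarge list price from the table above. The 30% figure is an assumed 1-year, no-upfront discount for the sketch, not a quoted rate:

```python
# Illustrative Savings Plan math; the 30% discount is an assumption
on_demand = 0.192                    # m6i.xlarge, us-east-1, $/hr
discount = 0.30                      # assumed 1-year, no-upfront rate
sp_rate = on_demand * (1 - discount)
hours_per_month = 730

print(f"On-demand: ${on_demand * hours_per_month:.2f}/mo")
print(f"Savings Plan: ${sp_rate * hours_per_month:.2f}/mo")
```

Note that a Savings Plan commits you to a dollar-per-hour spend, not a specific instance; usage above the commitment is simply billed at on-demand rates.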

Launching an Instance with the CLI

# Find the latest Amazon Linux 2023 AMI
AMI_ID=$(aws ec2 describe-images \
  --owners amazon \
  --filters "Name=name,Values=al2023-ami-2023.*-x86_64" \
            "Name=state,Values=available" \
  --query 'sort_by(Images, &CreationDate)[-1].ImageId' \
  --output text)

# Launch the instance
aws ec2 run-instances \
  --image-id "$AMI_ID" \
  --instance-type t3.medium \
  --key-name my-key-pair \
  --security-group-ids sg-0abc1234def56789 \
  --subnet-id subnet-0abc1234 \
  --iam-instance-profile Name=AppServerProfile \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=web-server-01},{Key=Environment,Value=production}]' \
  --user-data file://bootstrap.sh \
  --block-device-mappings '[{
    "DeviceName": "/dev/xvda",
    "Ebs": {
      "VolumeSize": 50,
      "VolumeType": "gp3",
      "Iops": 3000,
      "Throughput": 125,
      "Encrypted": true
    }
  }]'

Always tag your instances. Tags are how you track costs, automate operations, and enforce policies. A good tagging strategy includes at minimum: Name, Environment, Team, CostCenter, and Application.

EBS Volume Types

| Volume Type | IOPS | Throughput | Use Case | Cost |
| --- | --- | --- | --- | --- |
| gp3 | 3,000 baseline (up to 16,000) | 125 MB/s (up to 1,000) | General purpose, boot volumes | $0.08/GB/mo |
| gp2 | Burst to 3,000 | 250 MB/s max | Legacy, migrate to gp3 | $0.10/GB/mo |
| io2 | Up to 64,000 | 1,000 MB/s | Mission-critical databases | $0.125/GB/mo + IOPS |
| st1 | 500 baseline | 500 MB/s | Throughput-heavy sequential | $0.045/GB/mo |
| sc1 | 250 baseline | 250 MB/s | Cold storage, infrequent access | $0.015/GB/mo |

Always use gp3 over gp2 for new deployments. gp3 is 20% cheaper and lets you independently provision IOPS and throughput.
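
The saving is easy to quantify. A quick sketch using the list prices from the table above and a hypothetical 500 GB volume:

```python
# gp2 vs gp3 monthly cost for a hypothetical 500 GB volume (us-east-1 list prices)
size_gb = 500
gp2_cost = size_gb * 0.10   # $/GB-month
gp3_cost = size_gb * 0.08

savings_pct = 100 * (gp2_cost - gp3_cost) / gp2_cost
print(f"gp2: ${gp2_cost:.2f}/mo, gp3: ${gp3_cost:.2f}/mo ({savings_pct:.0f}% saved)")
```

Across a fleet with hundreds of volumes, that 20% is a one-line Terraform change.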

VPC: Virtual Private Cloud

Every resource you deploy lives inside a VPC. Understanding VPC architecture is non-negotiable for DevOps work. A poorly designed VPC leads to security gaps, routing headaches, and expensive re-architecture.

Standard Three-Tier VPC Layout

VPC: 10.0.0.0/16 (65,534 usable IPs)
|-- Public Subnets (10.0.1.0/24, 10.0.2.0/24, 10.0.3.0/24)
|   |-- Internet Gateway attached
|   |-- NAT Gateways (one per AZ for high availability)
|   |-- Application Load Balancers
|   +-- Bastion hosts (if not using SSM)
|-- Private Subnets (10.0.11.0/24, 10.0.12.0/24, 10.0.13.0/24)
|   |-- Route to NAT Gateway for outbound internet
|   |-- Application servers (EC2, ECS tasks)
|   +-- EKS worker nodes
+-- Data Subnets (10.0.21.0/24, 10.0.22.0/24, 10.0.23.0/24)
    |-- No internet route at all
    |-- RDS instances
    |-- ElastiCache clusters
    +-- OpenSearch domains

Each subnet tier spans three availability zones for high availability. The key networking components:

  • Internet Gateway (IGW) -- allows resources with public IPs to reach the internet. Free, one per VPC.
  • NAT Gateway -- allows private subnet resources to make outbound internet calls without being directly reachable. Costs $0.045/hr plus $0.045/GB processed. Deploy one per AZ.
  • Route Tables -- control where traffic flows. Each subnet associates with exactly one route table.
  • Security Groups -- stateful firewalls at the instance level. Default deny inbound, allow outbound. Up to 5 security groups per ENI.
  • Network ACLs -- stateless firewalls at the subnet level. Used as a secondary defense layer. Process rules in order by rule number.
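
NAT Gateways are often the surprise line item on a VPC bill. A rough estimate for the three-AZ layout above, assuming a hypothetical 500 GB/month of processed data:

```python
# Rough monthly NAT Gateway cost for the three-AZ layout
# (us-east-1 list prices; data volume is a made-up example)
hours_per_month = 730
azs = 3

hourly_cost = azs * hours_per_month * 0.045   # $0.045/hr per gateway
data_cost = 500 * 0.045                       # 500 GB at $0.045/GB processed
total = hourly_cost + data_cost

print(f"Gateways: ${hourly_cost:.2f} + data: ${data_cost:.2f} = ${total:.2f}/mo")
```

Roughly $120/month before any serious traffic — one reason the VPC endpoints discussed below pay off.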

CIDR Planning

CIDR planning deserves careful thought because VPC CIDRs cannot overlap if you want to peer them. A common approach:

  • 10.0.0.0/16 for production
  • 10.1.0.0/16 for staging
  • 10.2.0.0/16 for development
  • 10.10.0.0/16 for shared services
  • 10.100.0.0/16 for management

Leave room for growth. A /16 gives you 65,534 addresses. Each /24 subnet provides 251 usable IPs (AWS reserves 5). For EKS clusters, plan for larger subnets (/20 or bigger) because each pod gets an IP address with the AWS VPC CNI plugin.
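
Python's standard ipaddress module is handy for sanity-checking a CIDR plan. Note that num_addresses counts the full block, while AWS reserves 5 addresses in every subnet:

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
subnet = ipaddress.ip_network("10.0.1.0/24")

AWS_RESERVED = 5  # network, VPC router, DNS, "future use", broadcast
print(vpc.num_addresses)                    # 65536 total in the /16
print(subnet.num_addresses - AWS_RESERVED)  # 251 usable per /24

# How many /24 subnets fit in the /16?
print(sum(1 for _ in vpc.subnets(new_prefix=24)))  # 256
```

Running this kind of check before carving up a VPC is cheaper than renumbering it later.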

VPC Endpoints

VPC endpoints let your private subnets reach AWS services without going through the NAT Gateway, saving money and improving security.

# Gateway endpoint for S3 (free)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-aaa111 \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-private1 rtb-private2 rtb-private3

# Interface endpoint for Secrets Manager (charges apply)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-aaa111 \
  --service-name com.amazonaws.us-east-1.secretsmanager \
  --vpc-endpoint-type Interface \
  --subnet-ids subnet-private1 subnet-private2 \
  --security-group-ids sg-vpce-allow \
  --private-dns-enabled

Gateway endpoints (S3, DynamoDB) are free. Interface endpoints cost $0.01/hr per AZ plus data processing. Despite the cost, interface endpoints can save money if your private instances do heavy AWS API traffic that would otherwise go through the NAT Gateway.
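
You can estimate the break-even point at which an interface endpoint pays for itself. A sketch assuming two AZs and the list prices above (interface endpoint data processing assumed at $0.01/GB):

```python
# Break-even for an interface endpoint vs routing API traffic through NAT
# (us-east-1 list prices; 2 AZs; $0.01/GB endpoint processing is an assumption)
hours_per_month = 730
endpoint_fixed = 2 * hours_per_month * 0.01   # $0.01/hr per AZ

nat_per_gb = 0.045
vpce_per_gb = 0.01
breakeven_gb = endpoint_fixed / (nat_per_gb - vpce_per_gb)

print(f"Fixed cost: ${endpoint_fixed:.2f}/mo; break-even ≈ {breakeven_gb:.0f} GB/mo")
```

Above roughly 400 GB of monthly AWS API traffic from private subnets, the endpoint is cheaper — and it keeps the traffic off the public internet regardless.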

VPC Peering and Transit Gateway

When you have multiple VPCs (multi-account setup), connect them with:

  • VPC Peering -- simple, direct connection between two VPCs. Works cross-region and cross-account. Good for small numbers of VPCs. No transitive routing (A-B and B-C does not mean A-C).
  • Transit Gateway -- hub-and-spoke model. When you have more than 3-4 VPCs, this is the way to go. Centralizes routing. Supports transitive routing. Costs $0.05/hr per attachment plus $0.02/GB.

# Create a VPC peering connection
aws ec2 create-vpc-peering-connection \
  --vpc-id vpc-aaa111 \
  --peer-vpc-id vpc-bbb222 \
  --peer-owner-id 123456789012 \
  --peer-region eu-west-1

# Accept the peering (run from the peer account/region)
aws ec2 accept-vpc-peering-connection \
  --vpc-peering-connection-id pcx-0abc1234

# Add routes in both VPCs
aws ec2 create-route \
  --route-table-id rtb-aaa111 \
  --destination-cidr-block 10.1.0.0/16 \
  --vpc-peering-connection-id pcx-0abc1234

S3: Simple Storage Service

S3 is effectively infinite object storage with 99.999999999% (eleven nines) durability. You will use it for everything: Terraform state, application assets, log archives, data lake, backup targets, static website hosting, and as a data transfer medium between services.

Storage Classes and Cost

| Storage Class | Use Case | Retrieval Cost | Monthly Cost (per GB) | Min Duration |
| --- | --- | --- | --- | --- |
| S3 Standard | Frequently accessed data | None | ~$0.023 | None |
| S3 Intelligent-Tiering | Unknown or changing access patterns | None | ~$0.023 + $0.0025/1K objects monitoring | None |
| S3 Standard-IA | Infrequent access, rapid retrieval | $0.01/GB | ~$0.0125 | 30 days |
| S3 One Zone-IA | Infrequent, reproducible data | $0.01/GB | ~$0.01 | 30 days |
| S3 Glacier Instant | Archive, millisecond retrieval | $0.03/GB | ~$0.004 | 90 days |
| S3 Glacier Flexible | Archive, minutes to hours retrieval | Varies by speed | ~$0.0036 | 90 days |
| S3 Glacier Deep Archive | Long-term archive, 12-hour retrieval | $0.02/GB | ~$0.00099 | 180 days |

S3 Intelligent-Tiering is the low-effort option: it automatically moves objects between tiers based on access patterns. The monitoring fee is negligible for large objects but adds up for millions of small files.
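
A quick illustration of when the monitoring fee matters, using hypothetical object counts:

```python
# Intelligent-Tiering monitoring fee: $0.0025 per 1,000 objects per month
fee_per_1k = 0.0025

# A bucket of 10 million small log files:
many_small = 10_000_000 / 1000 * fee_per_1k
# The same total data as 10,000 large archives:
few_large = 10_000 / 1000 * fee_per_1k

print(f"10M small objects: ${many_small:.2f}/mo monitoring")
print(f"10K large objects: ${few_large:.4f}/mo monitoring")
```

For the small-file bucket, $25/month of monitoring can exceed the tiering savings; aggregating small files (or skipping Intelligent-Tiering for them) is usually the right call.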

Lifecycle Policies

Automate transitions between storage classes to save money:

aws s3api put-bucket-lifecycle-configuration \
  --bucket my-app-logs \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "ArchiveOldLogs",
        "Status": "Enabled",
        "Filter": { "Prefix": "logs/" },
        "Transitions": [
          { "Days": 30, "StorageClass": "STANDARD_IA" },
          { "Days": 90, "StorageClass": "GLACIER_IR" },
          { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
        ],
        "Expiration": { "Days": 2555 },
        "NoncurrentVersionTransitions": [
          { "NoncurrentDays": 30, "StorageClass": "GLACIER_IR" }
        ],
        "NoncurrentVersionExpiration": { "NoncurrentDays": 90 }
      }
    ]
  }'

S3 Security

Every S3 bucket should follow these security practices:

# Enable versioning for state files and configuration
aws s3api put-bucket-versioning \
  --bucket my-terraform-state \
  --versioning-configuration Status=Enabled

# Block all public access (account-level)
aws s3control put-public-access-block \
  --account-id 123456789012 \
  --public-access-block-configuration \
    BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true

# Enable default encryption
aws s3api put-bucket-encryption \
  --bucket my-terraform-state \
  --server-side-encryption-configuration '{
    "Rules": [
      {
        "ApplyServerSideEncryptionByDefault": {
          "SSEAlgorithm": "aws:kms",
          "KMSMasterKeyID": "alias/s3-encryption-key"
        },
        "BucketKeyEnabled": true
      }
    ]
  }'

# Enable access logging
aws s3api put-bucket-logging \
  --bucket my-terraform-state \
  --bucket-logging-status '{
    "LoggingEnabled": {
      "TargetBucket": "my-access-logs-bucket",
      "TargetPrefix": "s3-logs/terraform-state/"
    }
  }'

S3 Performance Optimization

S3 automatically handles 5,500 GET/HEAD and 3,500 PUT/COPY/POST/DELETE requests per second per prefix. For higher throughput:

  • Use multiple prefixes. Distribute objects across prefixes to parallelize.
  • Enable S3 Transfer Acceleration for cross-region uploads (uses CloudFront edge locations).
  • Use multipart upload for objects larger than 100 MB. The AWS CLI does this automatically for aws s3 cp.
  • S3 Select and Glacier Select let you query CSV/JSON/Parquet files in place without downloading the full object.
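
The per-prefix limits compose linearly, which is why prefix sharding works. A minimal sketch, where the sharding scheme is a made-up example:

```python
# S3 request ceilings scale with the number of distinct key prefixes
PER_PREFIX_GET = 5500   # GET/HEAD per second per prefix
PER_PREFIX_PUT = 3500   # PUT/COPY/POST/DELETE per second per prefix

prefixes = 16  # e.g. keys sharded as data/00/..., data/01/..., ..., data/0f/...
print(prefixes * PER_PREFIX_GET, "GET/s aggregate")  # 88000
print(prefixes * PER_PREFIX_PUT, "PUT/s aggregate")  # 56000
```

Hash-derived prefixes (rather than date-based ones) spread load evenly, since date prefixes concentrate all writes on today's prefix.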

RDS: Managed Relational Databases

RDS handles patching, backups, replication, and failover for your relational databases. Supported engines include PostgreSQL, MySQL, MariaDB, Oracle, SQL Server, and Aurora.

Key Decisions for DevOps

  • Multi-AZ -- synchronous standby in another AZ, automatic failover in 60-120 seconds. Always enable for production. Doubles your cost.
  • Read Replicas -- asynchronous copies for read-heavy workloads. Can be cross-region for disaster recovery. Up to 15 read replicas for Aurora, 5 for other engines.
  • Aurora -- AWS-proprietary engine compatible with PostgreSQL and MySQL. Up to 5x throughput of standard MySQL, 3x of standard PostgreSQL. Storage auto-scales up to 128 TB. More expensive but often worth it for production.
  • Aurora Serverless v2 -- scales compute capacity automatically based on load. Pay for what you use. Excellent for variable workloads.
  • Automated Backups -- enabled by default, retention up to 35 days. Test your restore process regularly.
  • Storage -- gp3 for most workloads, io2 for high-performance databases.

aws rds create-db-instance \
  --db-instance-identifier prod-postgres \
  --db-instance-class db.r6g.xlarge \
  --engine postgres \
  --engine-version 15.4 \
  --master-username dbadmin \
  --master-user-password "$(aws secretsmanager get-random-password \
    --password-length 32 --require-each-included-type \
    --exclude-characters '/@"' --output text)" \
  --allocated-storage 100 \
  --max-allocated-storage 500 \
  --storage-type gp3 \
  --multi-az \
  --vpc-security-group-ids sg-0abc1234 \
  --db-subnet-group-name prod-db-subnets \
  --backup-retention-period 14 \
  --preferred-backup-window "03:00-04:00" \
  --preferred-maintenance-window "Mon:04:00-Mon:05:00" \
  --storage-encrypted \
  --kms-key-id alias/rds-encryption-key \
  --performance-insights-enabled \
  --monitoring-interval 60 \
  --monitoring-role-arn arn:aws:iam::123456789012:role/rds-monitoring-role \
  --enable-cloudwatch-logs-exports '["postgresql","upgrade"]' \
  --deletion-protection \
  --copy-tags-to-snapshot \
  --tags Key=Environment,Value=production Key=Team,Value=platform

RDS Pricing Considerations

RDS pricing has several components: instance hours, storage (per GB/month), I/O (for Aurora), backup storage beyond the free allocation, and data transfer. Reserved Instances offer 30-60% savings for steady-state databases. Aurora I/O-Optimized is a newer pricing model that bundles I/O costs into the instance price -- worth evaluating if your Aurora cluster has heavy I/O.

RDS vs Aurora Decision Matrix

| Factor | RDS PostgreSQL/MySQL | Aurora |
| --- | --- | --- |
| Cost (small workloads) | Lower | Higher base cost |
| Cost (large workloads) | Comparable | Often lower (better efficiency) |
| Failover time | 60-120 seconds | Typically under 30 seconds |
| Storage scaling | Manual (with downtime risk) | Automatic up to 128 TB |
| Read replicas | Up to 5, replication lag | Up to 15, lower replication lag |
| Backtrack | Not available | Rewind database to any point in time |
| Global Database | Cross-region read replicas | Sub-second cross-region replication |

Lambda: Serverless Compute

Lambda runs code without you managing servers. It scales from zero to thousands of concurrent executions automatically. You pay only for execution time, billed in 1ms increments.

Pricing Model

  • Requests: $0.20 per 1 million requests (first 1M free per month).
  • Duration: $0.0000166667 per GB-second. A 256 MB function running for 1 second costs $0.0000042.
  • Free Tier: 1 million requests and 400,000 GB-seconds per month, every month, permanently.

For most DevOps automation tasks (event handlers, cleanup scripts, webhook processors), Lambda falls well within the free tier.
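
Plugging the numbers above into the billing formula makes the model concrete:

```python
# Lambda cost for one invocation: 256 MB memory, 1 second duration
gb_seconds = (256 / 1024) * 1.0            # memory in GB x duration in seconds
duration_cost = gb_seconds * 0.0000166667  # $/GB-second
request_cost = 0.20 / 1_000_000            # $/request

print(f"Per invocation: ${duration_cost + request_cost:.7f}")

# One million such invocations per month, ignoring the free tier:
monthly = 1_000_000 * (duration_cost + request_cost)
print(f"Monthly: ${monthly:.2f}")
```

A million one-second invocations for under $5/month is why Lambda is the default answer for glue automation — and why the free tier covers most of it entirely.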

Common DevOps Uses

Lambda excels at event-driven automation: CloudWatch alarm handlers, S3 event processing, API Gateway backends, scheduled cleanup jobs, custom CloudFormation resources, CodePipeline approval actions, and infrastructure compliance checks.

# Package and deploy a simple function
zip function.zip index.js

aws lambda create-function \
  --function-name process-s3-uploads \
  --runtime nodejs20.x \
  --handler index.handler \
  --role arn:aws:iam::123456789012:role/LambdaS3Role \
  --zip-file fileb://function.zip \
  --timeout 30 \
  --memory-size 256 \
  --ephemeral-storage Size=1024 \
  --environment Variables='{DEST_BUCKET=processed-data}' \
  --tracing-config Mode=Active \
  --architectures arm64 \
  --tags Environment=production

# Add an S3 trigger
aws lambda add-permission \
  --function-name process-s3-uploads \
  --statement-id s3-trigger \
  --action lambda:InvokeFunction \
  --principal s3.amazonaws.com \
  --source-arn arn:aws:s3:::my-upload-bucket \
  --source-account 123456789012

aws s3api put-bucket-notification-configuration \
  --bucket my-upload-bucket \
  --notification-configuration '{
    "LambdaFunctionConfigurations": [
      {
        "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:process-s3-uploads",
        "Events": ["s3:ObjectCreated:*"],
        "Filter": {
          "Key": {
            "FilterRules": [
              { "Name": "prefix", "Value": "uploads/" },
              { "Name": "suffix", "Value": ".csv" }
            ]
          }
        }
      }
    ]
  }'

Lambda Optimization Tips

  • Use ARM64 (Graviton2) -- 20% cheaper, often faster for compute workloads.
  • Right-size memory. More memory also means more CPU. Use AWS Lambda Power Tuning to find the optimal configuration.
  • Minimize cold starts. Use Provisioned Concurrency for latency-sensitive functions ($0.0000041667 per GB-second provisioned).
  • Keep functions small and focused. One function per responsibility.
  • Use layers for shared dependencies. Keeps deployment packages small.
  • Use environment variables for configuration and Secrets Manager for credentials.

EKS: Elastic Kubernetes Service

EKS is AWS-managed Kubernetes. AWS manages the control plane (etcd, API server, scheduler); you manage the worker nodes. The control plane costs $0.10/hr ($73/month) regardless of cluster size.

Cluster Setup Choices

| Option | Control | Effort | Cost |
| --- | --- | --- | --- |
| Managed Node Groups | You choose instance types, EKS handles ASG and updates | Medium | EC2 pricing + $73/mo control plane |
| Self-Managed Nodes | Full control, you manage everything | High | EC2 pricing + $73/mo control plane |
| Fargate | No nodes to manage, per-pod pricing | Low | $0.04048/vCPU/hr + $0.004445/GB/hr |
| EKS Auto Mode | AWS manages everything including nodes | Lowest | EC2 pricing + premium |

For most teams, managed node groups with Karpenter for autoscaling is the best balance of control and operational simplicity. Karpenter provisions right-sized nodes based on pending pod requirements, consolidates underutilized nodes, and can mix Spot and On-Demand instances intelligently.

# Create a cluster with eksctl
eksctl create cluster \
  --name production \
  --region us-east-1 \
  --version 1.29 \
  --nodegroup-name workers \
  --node-type m6i.xlarge \
  --nodes-min 2 \
  --nodes-max 10 \
  --managed \
  --with-oidc \
  --alb-ingress-access \
  --node-private-networking \
  --asg-access

# Install Karpenter for intelligent autoscaling
helm install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter --create-namespace \
  --set "settings.clusterName=production" \
  --set "settings.interruptionQueue=production" \
  --wait

EKS Networking and Service Mesh

EKS uses the AWS VPC CNI plugin by default, which assigns real VPC IP addresses to pods. This means pods can communicate directly with other AWS resources using VPC networking, security groups, and NACLs. The tradeoff is IP address consumption -- plan your VPC CIDRs accordingly.

For service-to-service communication, AWS offers App Mesh (Envoy-based) or you can deploy Istio, Linkerd, or Cilium. For most teams, Cilium provides a good balance of networking, observability, and security without the complexity of a full service mesh.

Kubernetes on AWS vs Other Clouds

| Feature | EKS (AWS) | AKS (Azure) | GKE (GCP) | ACK (Alibaba) |
| --- | --- | --- | --- | --- |
| Control plane cost | $73/mo | Free | Free (Autopilot) / $73/mo (Standard) | Free (Managed) |
| Pod networking | VPC CNI (real IPs) | Azure CNI or Kubenet | GKE VPC-native | Terway or Flannel |
| Autoscaler | Karpenter or Cluster Autoscaler | KEDA, Cluster Autoscaler | GKE Autopilot or Cluster Autoscaler | Cluster Autoscaler |
| Serverless pods | Fargate | Virtual Kubelet (ACI) | Autopilot | ECI (Elastic Container Instance) |
| Max nodes per cluster | 5,000 | 5,000 | 15,000 | 5,000 |
| GPU support | Full | Full | Full | Full |

CloudWatch: Monitoring and Observability

CloudWatch collects metrics, logs, and traces across your AWS resources. It is the default observability platform, and even teams using Datadog or Grafana still rely on CloudWatch for AWS-native integrations.

CloudWatch Components

  • Metrics -- CPU, memory, disk, network for EC2; request count and latency for ALB; and custom metrics from your applications. Standard resolution is 1 minute; high resolution is 1 second.
  • Logs -- centralized log storage. Use Logs Insights and its purpose-built query language for ad-hoc searches across log groups. Supports metric filters to create metrics from log patterns.
  • Alarms -- trigger notifications or auto-scaling actions based on metric thresholds. Composite alarms combine multiple alarms with AND/OR logic.
  • Dashboards -- visualize operational health. Up to 3 free dashboards, then $3/month each.
  • X-Ray -- distributed tracing for microservices. Helps identify latency bottlenecks across service boundaries.
  • Synthetics -- canary functions that monitor your endpoints on a schedule.
  • Application Signals -- APM for applications running on EKS, ECS, and EC2.

# Create an alarm for high CPU
aws cloudwatch put-metric-alarm \
  --alarm-name "HighCPU-WebServers" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
  --ok-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
  --dimensions Name=AutoScalingGroupName,Value=web-asg \
  --treat-missing-data notBreaching

# Query logs with Log Insights
aws logs start-query \
  --log-group-name /ecs/web-app \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message
    | filter @message like /ERROR/
    | stats count(*) as errors by bin(5m)
    | sort errors desc'

# Create a metric filter for error counting
aws logs put-metric-filter \
  --log-group-name /ecs/web-app \
  --filter-name ErrorCount \
  --filter-pattern '"ERROR"' \
  --metric-transformations \
    metricName=ApplicationErrors,metricNamespace=Custom/WebApp,metricValue=1,defaultValue=0

CloudWatch Pricing

CloudWatch costs add up quickly. The main cost drivers are:

  • Custom metrics: $0.30/metric/month for the first 10,000.
  • Log ingestion: $0.50/GB ingested.
  • Log storage: $0.03/GB/month.
  • Log Insights queries: $0.005 per GB scanned.
  • Dashboards: $3/month each (beyond 3 free).
  • Alarms: $0.10/alarm/month (standard), $0.30 (high-resolution).

To control costs: filter logs before sending them to CloudWatch, use log retention policies aggressively, and avoid high-cardinality custom metrics.
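
A back-of-the-envelope monthly estimate using the rates above (all volumes are hypothetical) shows where the money actually goes:

```python
# Hypothetical monthly CloudWatch bill for a mid-size service
ingest_gb = 200        # logs ingested, $0.50/GB
stored_gb = 600        # logs retained, $0.03/GB/mo
custom_metrics = 50    # $0.30 each
alarms = 40            # standard, $0.10 each
dashboards = 5         # first 3 free, then $3 each

cost = (ingest_gb * 0.50
        + stored_gb * 0.03
        + custom_metrics * 0.30
        + alarms * 0.10
        + max(0, dashboards - 3) * 3.0)
print(f"≈ ${cost:.2f}/mo")  # ingestion is the dominant term
```

Ingestion dwarfs everything else, which is why filtering logs at the agent and setting short retention periods are the two highest-leverage cost controls.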

AWS CLI Essentials

The CLI is your primary interface for automation. Install v2 and configure named profiles for each account:

# Configure a profile
aws configure --profile production
# Use SSO-based authentication (preferred)
aws configure sso --profile production

# Common operations
aws sts get-caller-identity --profile production  # Who am I?
aws ec2 describe-instances --filters "Name=tag:Environment,Values=production" \
  --query 'Reservations[].Instances[].[InstanceId,State.Name,PrivateIpAddress]' \
  --output table

# Use --query for JMESPath filtering (saves piping to jq)
aws s3api list-buckets --query 'Buckets[?starts_with(Name, `prod-`)].Name' --output text

# Batch operations with waiter
aws ec2 start-instances --instance-ids i-0abc123 i-0def456
aws ec2 wait instance-running --instance-ids i-0abc123 i-0def456
echo "Instances are now running"

# Use CloudShell for quick tasks (browser-based, pre-authenticated)
# Access at https://console.aws.amazon.com/cloudshell/

AWS SDK Usage Patterns

For automation scripts, the AWS SDKs (boto3 for Python, aws-sdk for JavaScript/TypeScript) provide programmatic access:

# Python boto3 example: find and clean up unattached EBS volumes
import boto3
from datetime import datetime, timezone

ec2 = boto3.client('ec2', region_name='us-east-1')

response = ec2.describe_volumes(
    Filters=[{'Name': 'status', 'Values': ['available']}]
)

for volume in response['Volumes']:
    age_days = (datetime.now(timezone.utc) - volume['CreateTime']).days
    if age_days > 30:
        print(f"Deleting {volume['VolumeId']} - {volume['Size']}GB, {age_days} days old")
        ec2.delete_volume(VolumeId=volume['VolumeId'])

Cost-Aware Architecture Decisions

Every architectural choice has a cost implication. Build cost awareness into your DevOps practice:

  1. Right-size instances. Use AWS Compute Optimizer recommendations. Most instances are over-provisioned by 30-50%.
  2. Use Savings Plans for predictable workloads -- 30-60% savings over on-demand. Compute Savings Plans offer the most flexibility.
  3. Spot Instances for fault-tolerant workloads (CI/CD runners, batch processing, EKS node pools) -- up to 90% savings.
  4. Delete unused resources. Unattached EBS volumes, idle load balancers, old snapshots, and unused Elastic IPs add up silently.
  5. Use S3 lifecycle policies aggressively. Logs older than 30 days rarely need Standard storage.
  6. VPC endpoints save NAT Gateway data processing charges for S3 and DynamoDB (gateway endpoints are free).
  7. Set up billing alerts in every account. Use AWS Budgets to get notified before you overspend.
  8. Use AWS Cost Explorer to analyze spending by service, account, and tag. Enable hourly granularity for detailed analysis.
  9. Review Reserved Instance utilization monthly. Unused reservations are wasted money.
  10. Consider region pricing. us-east-1 and us-west-2 are typically the cheapest US regions.

# Create a monthly budget alert
aws budgets create-budget \
  --account-id 123456789012 \
  --budget '{
    "BudgetName": "MonthlySpend",
    "BudgetLimit": { "Amount": "5000", "Unit": "USD" },
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST"
  }' \
  --notifications-with-subscribers '[
    {
      "Notification": {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 80,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        { "SubscriptionType": "EMAIL", "Address": "ops@example.com" }
      ]
    },
    {
      "Notification": {
        "NotificationType": "FORECASTED",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 100,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        { "SubscriptionType": "EMAIL", "Address": "ops@example.com" }
      ]
    }
  ]'

Migration Considerations

When migrating to AWS from on-premises or another cloud:

  • AWS Migration Hub provides a central dashboard for tracking migrations across multiple tools.
  • AWS Application Migration Service (MGN) handles lift-and-shift server migrations with minimal downtime.
  • AWS Database Migration Service (DMS) migrates databases with continuous replication. Supports heterogeneous migrations (Oracle to PostgreSQL, SQL Server to Aurora).
  • AWS Transfer Family provides managed SFTP, FTPS, and FTP servers that store data in S3.
  • S3 Transfer Acceleration speeds up cross-region uploads to S3.
  • AWS Snow Family (Snowball, Snowcone, Snowmobile) for large-scale offline data transfer when network bandwidth is insufficient.

The typical migration pattern is: assess (Migration Hub, Application Discovery Service), mobilize (set up landing zone, networking, security), then migrate (rehost with MGN, replatform with managed services, or refactor to cloud-native). Start with the simplest approach (lift-and-shift) and modernize incrementally.

AWS is vast, but mastering these core services gives you the foundation to build and operate production infrastructure confidently. Start with IAM and VPC -- get those right and everything else becomes easier to reason about. The services covered here represent 80% of what a DevOps engineer interacts with on a daily basis. Deep knowledge of these fundamentals is far more valuable than shallow knowledge of all 200+ services.

Aareez Asif

Senior Kubernetes Architect

10+ years orchestrating containers in production. Battle-tested opinions on everything from pod scheduling to service mesh. I've seen clusters burn and helped rebuild them better.

Related Articles


AWS CLI: Cheat Sheet

AWS CLI cheat sheet with copy-paste commands for EC2, S3, IAM, Lambda, ECS, CloudFormation, SSM, and Secrets Manager operations.

Dev Patel · 3 min read