AWS Core Services: The DevOps Engineer's Essential Guide
AWS has over 200 services. Nobody uses all of them. As a DevOps engineer, your job is to know the core services deeply and the rest well enough to know when they solve a problem. This guide covers the services you will touch every single week, with real CLI examples, pricing context, architecture patterns, and the operational details that matter when you are on call at 2 AM.
AWS Account Structure and Organizations
Before you provision a single resource, understand how AWS organizes access. A production-grade setup uses AWS Organizations with multiple accounts, and getting this right early prevents painful migrations later.
The Multi-Account Strategy
A typical enterprise structure looks like this:
- Management Account -- billing, consolidated logs, Organization policies. No workloads run here.
- Security Account -- GuardDuty, Security Hub, centralized CloudTrail, AWS Config aggregator.
- Log Archive Account -- immutable storage for CloudTrail logs, VPC flow logs, and audit trails.
- Shared Services Account -- DNS (Route 53), shared container registries (ECR), CI/CD tooling, artifact storage.
- Network Account -- Transit Gateway, Direct Connect, shared VPC infrastructure.
- Workload Accounts -- dev, staging, production, each fully isolated with separate IAM boundaries.
This separation exists because a single AWS account becomes a blast radius. If an attacker compromises your production account, they should not be able to touch your audit logs or billing configuration. AWS Organizations lets you manage all these accounts centrally.
Service Control Policies (SCPs)
SCPs are guardrails applied at the Organization or Organizational Unit (OU) level. They restrict what member accounts can do, even if the account's IAM policies allow it. Think of SCPs as a ceiling on permissions.
For example, deny all activity outside your approved regions:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "DenyUnapprovedRegions",
"Effect": "Deny",
"Action": "*",
"Resource": "*",
"Condition": {
"StringNotEquals": {
"aws:RequestedRegion": ["us-east-1", "eu-west-1"]
}
}
}
]
}
Other common SCPs include preventing member accounts from leaving the Organization, blocking the creation of IAM users with console access (forcing SSO instead), and denying public S3 bucket policies. These guardrails catch mistakes before they become incidents.
AWS Control Tower
For teams setting up multi-account environments from scratch, AWS Control Tower automates the creation of accounts, OUs, and baseline guardrails. It provisions a landing zone with pre-configured security baselines, SSO configuration, and centralized logging. Control Tower uses Account Factory to let teams request new accounts through a self-service catalog, ensuring every account starts with the correct configuration.
IAM: Identity and Access Management
IAM is the service you will interact with the most and get wrong the most. Every API call in AWS is authorized through IAM. Mastering it is not optional.
Core Concepts
| Concept | What It Is | When to Use |
|---|---|---|
| User | Long-lived identity with credentials | Human access (prefer SSO instead) |
| Group | Collection of users sharing policies | Organizing human permissions |
| Role | Assumable identity, temporary credentials | EC2 instances, Lambda, cross-account access |
| Policy | JSON document defining permissions | Attached to users, groups, or roles |
| Instance Profile | Wrapper that lets EC2 assume a role | Every EC2 instance that calls AWS APIs |
| Permission Boundary | Maximum permissions an entity can have | Delegated administration |
| Session Policy | Inline policy passed during role assumption | Temporary scope reduction |
IAM Policy Evaluation Logic
Understanding how AWS evaluates policies prevents hours of debugging. The evaluation order is:
- Explicit Deny -- if any policy says Deny, the request is denied. Period.
- SCPs -- the Organization-level ceiling. If the SCP does not allow it, it is denied.
- Permission Boundaries -- if set, the effective permissions are the intersection of the boundary and the identity policy.
- Session Policies -- further restricts permissions during an assumed role session.
- Identity Policies -- the policies attached to the user, group, or role.
- Resource Policies -- policies on the resource itself (S3 bucket policy, SQS queue policy).
- Default Deny -- if nothing explicitly allows the action, it is denied.
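The precedence above can be sketched as a toy evaluator. This is a deliberately simplified model covering only identity-policy statements: it ignores SCPs, permission boundaries, session policies, and resource policies, but it captures the two rules that matter most in practice (explicit deny always wins; nothing is allowed by default):

```python
import fnmatch

def evaluate(statements, action):
    """Evaluate a list of (effect, action_pattern) statements for one action."""
    matched = [eff for eff, pat in statements if fnmatch.fnmatch(action, pat)]
    if "Deny" in matched:
        return "Deny"    # an explicit deny always wins
    if "Allow" in matched:
        return "Allow"
    return "Deny"        # nothing allowed it: implicit default deny

policies = [
    ("Allow", "s3:*"),
    ("Deny", "s3:DeleteBucket"),
]

print(evaluate(policies, "s3:GetObject"))      # Allow
print(evaluate(policies, "s3:DeleteBucket"))   # Deny: explicit deny beats the Allow
print(evaluate(policies, "ec2:RunInstances"))  # Deny: nothing allows it
```

The same mental model explains most "AccessDenied" debugging sessions: look for an explicit Deny (often in an SCP) before assuming your Allow statement is wrong.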
IAM Best Practices for DevOps
- Never use the root account for daily work. Lock it down with MFA and use it only for billing changes and account recovery.
- Use roles, not users, for workloads. EC2 instances, Lambda functions, and ECS tasks should all assume roles.
- Least privilege always. Start with zero permissions and add what is needed. Use IAM Access Analyzer to identify unused permissions.
- Use conditions. Restrict by source IP, MFA presence, request time, or resource tags.
- Enable CloudTrail in every account. Every IAM action generates an API event that CloudTrail records.
- Rotate credentials. If you must use access keys, rotate them every 90 days. Better yet, use IAM Identity Center (SSO) for human access.
Create a role for an EC2 instance that can read from a specific S3 bucket:
# Create the trust policy
cat > trust-policy.json << 'EOF'
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": { "Service": "ec2.amazonaws.com" },
"Action": "sts:AssumeRole"
}
]
}
EOF
# Create the role
aws iam create-role \
--role-name AppServerRole \
--assume-role-policy-document file://trust-policy.json
# Attach a scoped policy
aws iam put-role-policy \
--role-name AppServerRole \
--policy-name S3ReadAccess \
--policy-document '{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": ["s3:GetObject", "s3:ListBucket"],
"Resource": [
"arn:aws:s3:::my-app-config-bucket",
"arn:aws:s3:::my-app-config-bucket/*"
]
}
]
}'
# Create the instance profile and add the role
aws iam create-instance-profile --instance-profile-name AppServerProfile
aws iam add-role-to-instance-profile \
--instance-profile-name AppServerProfile \
--role-name AppServerRole
Cross-Account Access
Cross-account role assumption is how you securely access resources in other accounts without sharing credentials. The pattern is:
- Account B creates a role with a trust policy allowing Account A to assume it.
- Account A's IAM entity calls sts:AssumeRole targeting the role ARN in Account B.
- STS returns temporary credentials scoped to Account B's role.
# From Account A, assume a role in Account B
CREDS=$(aws sts assume-role \
--role-arn arn:aws:iam::987654321098:role/CrossAccountDeployRole \
--role-session-name deploy-session \
--query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
--output text)
# Export the temporary credentials
export AWS_ACCESS_KEY_ID=$(echo $CREDS | cut -d' ' -f1)
export AWS_SECRET_ACCESS_KEY=$(echo $CREDS | cut -d' ' -f2)
export AWS_SESSION_TOKEN=$(echo $CREDS | cut -d' ' -f3)
# Now all AWS CLI commands operate in Account B
aws s3 ls
Cross-Cloud IAM Comparison
| Feature | AWS IAM | Azure RBAC | GCP IAM | Alibaba RAM |
|---|---|---|---|---|
| Identity for workloads | IAM Roles | Managed Identities | Service Accounts | RAM Roles |
| Human access | IAM Identity Center (SSO) | Entra ID | Cloud Identity | IDaaS |
| Policy language | JSON | JSON (Azure Policy) | YAML/JSON bindings | JSON |
| Permission inheritance | None (explicit) | Scope hierarchy | Resource hierarchy | None (explicit) |
| Temporary credentials | STS AssumeRole | Managed Identity tokens | Workload Identity | STS AssumeRole |
| Condition keys | 50+ global keys | Conditions in policies | IAM Conditions | Limited conditions |
EC2: Elastic Compute Cloud
EC2 is the foundational compute service. Even if you run containers or serverless, you need to understand EC2 because many managed services run on it under the hood, and EC2 knowledge translates directly to cost optimization.
Instance Types That Matter
| Family | Use Case | Example | On-Demand Price (us-east-1) |
|---|---|---|---|
| t3/t3a | Burstable, dev/test, small workloads | t3.medium (2 vCPU, 4 GB) | ~$0.0416/hr |
| m6i/m6a | General purpose, production web servers | m6i.xlarge (4 vCPU, 16 GB) | ~$0.192/hr |
| m7g | General purpose, Graviton3 ARM | m7g.xlarge (4 vCPU, 16 GB) | ~$0.163/hr |
| c6i/c7g | CPU-intensive, CI/CD build agents | c6i.2xlarge (8 vCPU, 16 GB) | ~$0.34/hr |
| r6i | Memory-intensive, caches, in-memory DBs | r6i.xlarge (4 vCPU, 32 GB) | ~$0.252/hr |
| g5 | GPU, ML inference | g5.xlarge (4 vCPU, 16 GB, 1 GPU) | ~$1.006/hr |
| i3en | Storage-optimized, databases | i3en.xlarge (4 vCPU, 32 GB, 2.5TB NVMe) | ~$0.452/hr |
The a suffix means AMD (cheaper), g suffix means Graviton (ARM, cheaper and often faster). Graviton instances typically give you 20-40% better price-performance for Linux workloads. If your application runs on Linux and does not depend on x86 architecture, Graviton should be your default.
Purchasing Options and Pricing
| Option | Savings | Commitment | Best For |
|---|---|---|---|
| On-Demand | 0% (baseline) | None | Unpredictable workloads, short-term |
| Reserved Instances (RI) | 30-60% | 1 or 3 years | Steady-state production workloads |
| Savings Plans | 30-60% | 1 or 3 years | Flexible across instance families |
| Spot Instances | 60-90% | None (can be interrupted) | CI/CD, batch, fault-tolerant |
| Dedicated Hosts | Varies | Hourly or reserved | Licensing compliance, regulatory |
Savings Plans are generally preferred over Reserved Instances because they offer flexibility across instance families, sizes, and even between EC2 and Fargate. Compute Savings Plans apply to any instance family in any region. EC2 Instance Savings Plans are cheaper but locked to a specific instance family in a specific region.
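A quick back-of-envelope comparison using the m6i.xlarge on-demand rate from the table above and an illustrative 35% discount (actual Savings Plan discounts vary by term, payment option, and instance family -- check current pricing):

```python
# Annual cost of one m6i.xlarge: on-demand vs a hypothetical 35% Savings Plan.
# The $0.192/hr rate comes from the instance table above (us-east-1).
HOURS_PER_YEAR = 8760
on_demand_rate = 0.192  # $/hr

on_demand_annual = on_demand_rate * HOURS_PER_YEAR
savings_plan_annual = on_demand_annual * (1 - 0.35)

print(f"On-demand:    ${on_demand_annual:,.0f}/yr")
print(f"Savings Plan: ${savings_plan_annual:,.0f}/yr")
print(f"Saved:        ${on_demand_annual - savings_plan_annual:,.0f}/yr")
```

Multiply by a fleet of dozens of instances and the case for committing to steady-state capacity makes itself.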
Launching an Instance with the CLI
# Find the latest Amazon Linux 2023 AMI
AMI_ID=$(aws ec2 describe-images \
--owners amazon \
--filters "Name=name,Values=al2023-ami-2023.*-x86_64" \
"Name=state,Values=available" \
--query 'sort_by(Images, &CreationDate)[-1].ImageId' \
--output text)
# Launch the instance
aws ec2 run-instances \
--image-id "$AMI_ID" \
--instance-type t3.medium \
--key-name my-key-pair \
--security-group-ids sg-0abc1234def56789 \
--subnet-id subnet-0abc1234 \
--iam-instance-profile Name=AppServerProfile \
--tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=web-server-01},{Key=Environment,Value=production}]' \
--user-data file://bootstrap.sh \
--block-device-mappings '[{
"DeviceName": "/dev/xvda",
"Ebs": {
"VolumeSize": 50,
"VolumeType": "gp3",
"Iops": 3000,
"Throughput": 125,
"Encrypted": true
}
}]'
Always tag your instances. Tags are how you track costs, automate operations, and enforce policies. A good tagging strategy includes at minimum: Name, Environment, Team, CostCenter, and Application.
EBS Volume Types
| Volume Type | IOPS | Throughput | Use Case | Cost |
|---|---|---|---|---|
| gp3 | 3,000 baseline (up to 16,000) | 125 MB/s (up to 1,000) | General purpose, boot volumes | $0.08/GB/mo |
| gp2 | Burst to 3,000 | 250 MB/s max | Legacy, migrate to gp3 | $0.10/GB/mo |
| io2 | Up to 64,000 | 1,000 MB/s | Mission-critical databases | $0.125/GB/mo + IOPS |
| st1 | 500 baseline | 500 MB/s | Throughput-heavy sequential | $0.045/GB/mo |
| sc1 | 250 baseline | 250 MB/s | Cold storage, infrequent access | $0.015/GB/mo |
Always use gp3 over gp2 for new deployments. gp3 is 20% cheaper and lets you independently provision IOPS and throughput.
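The migration savings are easy to estimate from the per-GB rates in the table above (a sketch with a hypothetical fleet size):

```python
# Fleet-wide savings from migrating gp2 volumes to gp3, using the
# us-east-1 per-GB-month rates from the EBS table above.
gp2_rate, gp3_rate = 0.10, 0.08  # $/GB-month
fleet_gb = 10_000                # e.g. 200 volumes x 50 GB (illustrative)

monthly_savings = fleet_gb * (gp2_rate - gp3_rate)
print(f"Migrating {fleet_gb:,} GB saves ${monthly_savings:,.0f}/month")
```

The migration itself is an online operation (modify-volume), so there is rarely a reason not to do it.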
VPC: Virtual Private Cloud
Every resource you deploy lives inside a VPC. Understanding VPC architecture is non-negotiable for DevOps work. A poorly designed VPC leads to security gaps, routing headaches, and expensive re-architecture.
Standard Three-Tier VPC Layout
VPC: 10.0.0.0/16 (65,534 usable IPs)
|-- Public Subnets (10.0.1.0/24, 10.0.2.0/24, 10.0.3.0/24)
| |-- Internet Gateway attached
| |-- NAT Gateways (one per AZ for high availability)
| |-- Application Load Balancers
| +-- Bastion hosts (if not using SSM)
|-- Private Subnets (10.0.11.0/24, 10.0.12.0/24, 10.0.13.0/24)
| |-- Route to NAT Gateway for outbound internet
| |-- Application servers (EC2, ECS tasks)
| +-- EKS worker nodes
+-- Data Subnets (10.0.21.0/24, 10.0.22.0/24, 10.0.23.0/24)
|-- No internet route at all
|-- RDS instances
|-- ElastiCache clusters
+-- OpenSearch domains
Each subnet tier spans three availability zones for high availability. The key networking components:
- Internet Gateway (IGW) -- allows resources with public IPs to reach the internet. Free, one per VPC.
- NAT Gateway -- allows private subnet resources to make outbound internet calls without being directly reachable. Costs $0.045/hr plus $0.045/GB processed. Deploy one per AZ.
- Route Tables -- control where traffic flows. Each subnet associates with exactly one route table.
- Security Groups -- stateful firewalls at the instance level. Default deny inbound, allow outbound. Up to 5 security groups per ENI.
- Network ACLs -- stateless firewalls at the subnet level. Used as a secondary defense layer. Process rules in order by rule number.
CIDR Planning
CIDR planning deserves careful thought because VPC CIDRs cannot overlap if you want to peer them. A common approach:
- 10.0.0.0/16 for production
- 10.1.0.0/16 for staging
- 10.2.0.0/16 for development
- 10.10.0.0/16 for shared services
- 10.100.0.0/16 for management
Leave room for growth. A /16 gives you 65,534 addresses. Each /24 subnet provides 251 usable IPs (AWS reserves 5). For EKS clusters, plan for larger subnets (/20 or bigger) because each pod gets an IP address with the AWS VPC CNI plugin.
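The subnet math is easy to verify with Python's stdlib ipaddress module:

```python
# Carving the production VPC (10.0.0.0/16) into /24 subnets.
# AWS reserves 5 addresses per subnet (network, VPC router, DNS,
# future use, broadcast), leaving 251 usable in each /24.
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vpc.subnets(new_prefix=24))

print(f"{len(subnets)} /24 subnets available")          # 256
print(f"First three: {[str(s) for s in subnets[:3]]}")
usable = subnets[0].num_addresses - 5                   # AWS reserves 5
print(f"Usable IPs per /24: {usable}")                  # 251
```

Running the same calculation with new_prefix=20 shows why EKS subnets get big fast: a /20 yields 4,091 usable IPs, and a busy node pool can consume them quickly when every pod takes one.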
VPC Endpoints
VPC endpoints let your private subnets reach AWS services without going through the NAT Gateway, saving money and improving security.
# Gateway endpoint for S3 (free)
aws ec2 create-vpc-endpoint \
--vpc-id vpc-aaa111 \
--service-name com.amazonaws.us-east-1.s3 \
--route-table-ids rtb-private1 rtb-private2 rtb-private3
# Interface endpoint for Secrets Manager (charges apply)
aws ec2 create-vpc-endpoint \
--vpc-id vpc-aaa111 \
--service-name com.amazonaws.us-east-1.secretsmanager \
--vpc-endpoint-type Interface \
--subnet-ids subnet-private1 subnet-private2 \
--security-group-ids sg-vpce-allow \
--private-dns-enabled
Gateway endpoints (S3, DynamoDB) are free. Interface endpoints cost $0.01/hr per AZ plus data processing. Despite the cost, interface endpoints can save money if your private instances do heavy AWS API traffic that would otherwise go through the NAT Gateway.
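The break-even point is straightforward to estimate. The sketch below uses the NAT and endpoint hourly rates quoted above plus an assumed $0.01/GB interface-endpoint data-processing charge -- verify all three figures against current pricing for your region:

```python
# When does an interface endpoint beat routing AWS API traffic
# through the NAT Gateway? Rates: $0.045/GB NAT processing,
# $0.01/hr per endpoint AZ, assumed $0.01/GB endpoint processing.
HOURS_PER_MONTH = 730
azs = 3

endpoint_fixed = 0.01 * HOURS_PER_MONTH * azs  # fixed $/month for 3 AZs
per_gb_saved = 0.045 - 0.01                    # NAT rate minus endpoint rate

break_even_gb = endpoint_fixed / per_gb_saved
print(f"Endpoint pays for itself above {break_even_gb:,.0f} GB/month")
```

A few hundred GB of API traffic per month is common for chatty workloads, so the endpoint often wins on cost as well as on security posture.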
VPC Peering and Transit Gateway
When you have multiple VPCs (multi-account setup), connect them with:
- VPC Peering -- simple, direct connection between two VPCs. Works cross-region and cross-account. Good for small numbers of VPCs. No transitive routing (A-B and B-C does not mean A-C).
- Transit Gateway -- hub-and-spoke model. When you have more than 3-4 VPCs, this is the way to go. Centralizes routing. Supports transitive routing. Costs $0.05/hr per attachment plus $0.02/GB.
# Create a VPC peering connection
aws ec2 create-vpc-peering-connection \
--vpc-id vpc-aaa111 \
--peer-vpc-id vpc-bbb222 \
--peer-owner-id 123456789012 \
--peer-region eu-west-1
# Accept the peering (run from the peer account/region)
aws ec2 accept-vpc-peering-connection \
--vpc-peering-connection-id pcx-0abc1234
# Add routes in both VPCs
aws ec2 create-route \
--route-table-id rtb-aaa111 \
--destination-cidr-block 10.1.0.0/16 \
--vpc-peering-connection-id pcx-0abc1234
S3: Simple Storage Service
S3 is effectively infinite object storage with 99.999999999% (eleven nines) durability. You will use it for everything: Terraform state, application assets, log archives, data lake, backup targets, static website hosting, and as a data transfer medium between services.
Storage Classes and Cost
| Storage Class | Use Case | Retrieval Cost | Monthly Cost (per GB) | Min Duration |
|---|---|---|---|---|
| S3 Standard | Frequently accessed data | None | ~$0.023 | None |
| S3 Intelligent-Tiering | Unknown or changing access patterns | None | ~$0.023 + $0.0025/1K objects monitoring | None |
| S3 Standard-IA | Infrequent access, rapid retrieval | $0.01/GB | ~$0.0125 | 30 days |
| S3 One Zone-IA | Infrequent, reproducible data | $0.01/GB | ~$0.01 | 30 days |
| S3 Glacier Instant | Archive, millisecond retrieval | $0.03/GB | ~$0.004 | 90 days |
| S3 Glacier Flexible | Archive, minutes to hours retrieval | Varies by speed | ~$0.0036 | 90 days |
| S3 Glacier Deep Archive | Long-term archive, 12-hour retrieval | $0.02/GB | ~$0.00099 | 180 days |
S3 Intelligent-Tiering is the low-effort option: it automatically moves objects between tiers based on access patterns. The monitoring fee is negligible for large objects but adds up for millions of small files.
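The small-object caveat is worth quantifying. Using the monitoring rate from the table above, the fee depends only on object count, not size:

```python
# Intelligent-Tiering monitoring fee: $0.0025 per 1,000 monitored
# objects per month (rate from the storage class table above).
def monitoring_cost(num_objects):
    return num_objects / 1000 * 0.0025

print(f"10K large objects: ${monitoring_cost(10_000):.2f}/month")
print(f"10M small objects: ${monitoring_cost(10_000_000):.2f}/month")
```

Ten million small log files cost $25/month just to monitor, which can exceed what the tiering saves; for that shape of data, an explicit lifecycle policy is usually cheaper.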
Lifecycle Policies
Automate transitions between storage classes to save money:
aws s3api put-bucket-lifecycle-configuration \
--bucket my-app-logs \
--lifecycle-configuration '{
"Rules": [
{
"ID": "ArchiveOldLogs",
"Status": "Enabled",
"Filter": { "Prefix": "logs/" },
"Transitions": [
{ "Days": 30, "StorageClass": "STANDARD_IA" },
{ "Days": 90, "StorageClass": "GLACIER_IR" },
{ "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
],
"Expiration": { "Days": 2555 },
"NoncurrentVersionTransitions": [
{ "NoncurrentDays": 30, "StorageClass": "GLACIER_IR" }
],
"NoncurrentVersionExpiration": { "NoncurrentDays": 90 }
}
]
}'
S3 Security
Every S3 bucket should follow these security practices:
# Enable versioning for state files and configuration
aws s3api put-bucket-versioning \
--bucket my-terraform-state \
--versioning-configuration Status=Enabled
# Block all public access (account-level)
aws s3control put-public-access-block \
--account-id 123456789012 \
--public-access-block-configuration \
BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
# Enable default encryption
aws s3api put-bucket-encryption \
--bucket my-terraform-state \
--server-side-encryption-configuration '{
"Rules": [
{
"ApplyServerSideEncryptionByDefault": {
"SSEAlgorithm": "aws:kms",
"KMSMasterKeyID": "alias/s3-encryption-key"
},
"BucketKeyEnabled": true
}
]
}'
# Enable access logging
aws s3api put-bucket-logging \
--bucket my-terraform-state \
--bucket-logging-status '{
"LoggingEnabled": {
"TargetBucket": "my-access-logs-bucket",
"TargetPrefix": "s3-logs/terraform-state/"
}
}'
S3 Performance Optimization
S3 automatically handles 5,500 GET/HEAD and 3,500 PUT/COPY/POST/DELETE requests per second per prefix. For higher throughput:
- Use multiple prefixes. Distribute objects across prefixes to parallelize.
- Enable S3 Transfer Acceleration for cross-region uploads (uses CloudFront edge locations).
- Use multipart upload for objects larger than 100 MB. The AWS CLI does this automatically for aws s3 cp.
- S3 Select and Glacier Select let you query CSV/JSON/Parquet files in place without downloading the full object.
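Since the request-rate limit applies per prefix, aggregate throughput is simply the per-prefix limit times the number of prefixes you spread keys across (a back-of-envelope sketch; real-world throughput also depends on object size and client parallelism):

```python
# Aggregate S3 read throughput scales with the number of key prefixes,
# because the 5,500 GET/s limit quoted above applies per prefix.
GET_PER_PREFIX = 5500
prefixes = ["2024/01/", "2024/02/", "2024/03/", "2024/04/"]  # hypothetical layout

aggregate = len(prefixes) * GET_PER_PREFIX
print(f"{len(prefixes)} prefixes -> ~{aggregate:,} GET/s aggregate")
```

This is why high-throughput designs hash or date-partition their key space instead of writing everything under one prefix.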
RDS: Managed Relational Databases
RDS handles patching, backups, replication, and failover for your relational databases. Supported engines include PostgreSQL, MySQL, MariaDB, Oracle, SQL Server, and Aurora.
Key Decisions for DevOps
- Multi-AZ -- synchronous standby in another AZ, automatic failover in 60-120 seconds. Always enable for production. Doubles your cost.
- Read Replicas -- asynchronous copies for read-heavy workloads. Can be cross-region for disaster recovery. Up to 15 read replicas for Aurora, 5 for other engines.
- Aurora -- AWS-proprietary engine compatible with PostgreSQL and MySQL. Up to 5x throughput of standard MySQL, 3x of standard PostgreSQL. Storage auto-scales up to 128 TB. More expensive but often worth it for production.
- Aurora Serverless v2 -- scales compute capacity automatically based on load. Pay for what you use. Excellent for variable workloads.
- Automated Backups -- enabled by default, retention up to 35 days. Test your restore process regularly.
- Storage -- gp3 for most workloads, io2 for high-performance databases.
aws rds create-db-instance \
--db-instance-identifier prod-postgres \
--db-instance-class db.r6g.xlarge \
--engine postgres \
--engine-version 15.4 \
--master-username dbadmin \
--master-user-password "$(aws secretsmanager get-random-password \
--password-length 32 --require-each-included-type --output text)" \
--allocated-storage 100 \
--max-allocated-storage 500 \
--storage-type gp3 \
--multi-az \
--vpc-security-group-ids sg-0abc1234 \
--db-subnet-group-name prod-db-subnets \
--backup-retention-period 14 \
--preferred-backup-window "03:00-04:00" \
--preferred-maintenance-window "Mon:04:00-Mon:05:00" \
--storage-encrypted \
--kms-key-id alias/rds-encryption-key \
--performance-insights-enabled \
--monitoring-interval 60 \
--monitoring-role-arn arn:aws:iam::123456789012:role/rds-monitoring-role \
--enable-cloudwatch-logs-exports '["postgresql","upgrade"]' \
--deletion-protection \
--copy-tags-to-snapshot \
--tags Key=Environment,Value=production Key=Team,Value=platform
RDS Pricing Considerations
RDS pricing has several components: instance hours, storage (per GB/month), I/O (for Aurora), backup storage beyond the free allocation, and data transfer. Reserved Instances offer 30-60% savings for steady-state databases. Aurora I/O-Optimized is a newer pricing model that bundles I/O costs into the instance price -- worth evaluating if your Aurora cluster has heavy I/O.
RDS vs Aurora Decision Matrix
| Factor | RDS PostgreSQL/MySQL | Aurora |
|---|---|---|
| Cost (small workloads) | Lower | Higher base cost |
| Cost (large workloads) | Comparable | Often lower (better efficiency) |
| Failover time | 60-120 seconds | Typically under 30 seconds |
| Storage scaling | Manual (with downtime risk) | Automatic up to 128 TB |
| Read replicas | Up to 5, replication lag | Up to 15, lower replication lag |
| Backtrack | Not available | Rewind database to any point in time |
| Global Database | Cross-region read replicas | Sub-second cross-region replication |
Lambda: Serverless Compute
Lambda runs code without you managing servers. It scales from zero to thousands of concurrent executions automatically. You pay only for execution time, billed in 1ms increments.
Pricing Model
- Requests: $0.20 per 1 million requests (first 1M free per month).
- Duration: $0.0000166667 per GB-second. A 256 MB function running for 1 second costs $0.0000042.
- Free Tier: 1 million requests and 400,000 GB-seconds per month, every month, permanently.
For most DevOps automation tasks (event handlers, cleanup scripts, webhook processors), Lambda falls well within the free tier.
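The pricing model above reduces to a two-term formula: request count plus GB-seconds of duration. A sketch with hypothetical workload figures (us-east-1 rates from above, before free tier):

```python
# Monthly Lambda bill: $0.20 per 1M requests plus
# $0.0000166667 per GB-second of duration.
def lambda_monthly_cost(invocations, avg_ms, memory_mb):
    request_cost = invocations / 1_000_000 * 0.20
    gb_seconds = invocations * (avg_ms / 1000) * (memory_mb / 1024)
    duration_cost = gb_seconds * 0.0000166667
    return request_cost + duration_cost, gb_seconds

# Hypothetical: 5M invocations/month, 200 ms average, 256 MB memory.
cost, gb_s = lambda_monthly_cost(invocations=5_000_000, avg_ms=200, memory_mb=256)
print(f"{gb_s:,.0f} GB-seconds -> ${cost:.2f}/month before free tier")
```

At this scale the 250,000 GB-seconds of duration sit entirely inside the 400,000 GB-second free tier, leaving only the request charge beyond the first million.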
Common DevOps Uses
Lambda excels at event-driven automation: CloudWatch alarm handlers, S3 event processing, API Gateway backends, scheduled cleanup jobs, custom CloudFormation resources, CodePipeline approval actions, and infrastructure compliance checks.
# Package and deploy a simple function
zip function.zip index.js
aws lambda create-function \
--function-name process-s3-uploads \
--runtime nodejs20.x \
--handler index.handler \
--role arn:aws:iam::123456789012:role/LambdaS3Role \
--zip-file fileb://function.zip \
--timeout 30 \
--memory-size 256 \
--ephemeral-storage Size=1024 \
--environment Variables='{DEST_BUCKET=processed-data}' \
--tracing-config Mode=Active \
--architectures arm64 \
--tags Environment=production
# Add an S3 trigger
aws lambda add-permission \
--function-name process-s3-uploads \
--statement-id s3-trigger \
--action lambda:InvokeFunction \
--principal s3.amazonaws.com \
--source-arn arn:aws:s3:::my-upload-bucket \
--source-account 123456789012
aws s3api put-bucket-notification-configuration \
--bucket my-upload-bucket \
--notification-configuration '{
"LambdaFunctionConfigurations": [
{
"LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:process-s3-uploads",
"Events": ["s3:ObjectCreated:*"],
"Filter": {
"Key": {
"FilterRules": [
{ "Name": "prefix", "Value": "uploads/" },
{ "Name": "suffix", "Value": ".csv" }
]
}
}
}
]
}'
Lambda Optimization Tips
- Use ARM64 (Graviton2) -- 20% cheaper, often faster for compute workloads.
- Right-size memory. More memory also means more CPU. Use AWS Lambda Power Tuning to find the optimal configuration.
- Minimize cold starts. Use Provisioned Concurrency for latency-sensitive functions ($0.0000041667 per GB-second provisioned).
- Keep functions small and focused. One function per responsibility.
- Use layers for shared dependencies. Keeps deployment packages small.
- Use environment variables for configuration and Secrets Manager for credentials.
EKS: Elastic Kubernetes Service
EKS is AWS-managed Kubernetes. AWS manages the control plane (etcd, API server, scheduler); you manage the worker nodes. The control plane costs $0.10/hr ($73/month) regardless of cluster size.
Cluster Setup Choices
| Option | Control | Effort | Cost |
|---|---|---|---|
| Managed Node Groups | You choose instance types, EKS handles ASG and updates | Medium | EC2 pricing + $73/mo control plane |
| Self-Managed Nodes | Full control, you manage everything | High | EC2 pricing + $73/mo control plane |
| Fargate | No nodes to manage, per-pod pricing | Low | $0.04048/vCPU/hr + $0.004445/GB/hr |
| EKS Auto Mode | AWS manages everything including nodes | Lowest | EC2 pricing + premium |
For most teams, managed node groups with Karpenter for autoscaling is the best balance of control and operational simplicity. Karpenter provisions right-sized nodes based on pending pod requirements, consolidates underutilized nodes, and can mix Spot and On-Demand instances intelligently.
# Create a cluster with eksctl
eksctl create cluster \
--name production \
--region us-east-1 \
--version 1.29 \
--nodegroup-name workers \
--node-type m6i.xlarge \
--nodes-min 2 \
--nodes-max 10 \
--managed \
--with-oidc \
--alb-ingress-access \
--node-private-networking \
--asg-access
# Install Karpenter for intelligent autoscaling
helm install karpenter oci://public.ecr.aws/karpenter/karpenter \
--namespace karpenter --create-namespace \
--set "settings.clusterName=production" \
--set "settings.interruptionQueue=production" \
--wait
EKS Networking and Service Mesh
EKS uses the AWS VPC CNI plugin by default, which assigns real VPC IP addresses to pods. This means pods can communicate directly with other AWS resources using VPC networking, security groups, and NACLs. The tradeoff is IP address consumption -- plan your VPC CIDRs accordingly.
For service-to-service communication, AWS offers App Mesh (Envoy-based) or you can deploy Istio, Linkerd, or Cilium. For most teams, Cilium provides a good balance of networking, observability, and security without the complexity of a full service mesh.
Kubernetes on AWS vs Other Clouds
| Feature | EKS (AWS) | AKS (Azure) | GKE (GCP) | ACK (Alibaba) |
|---|---|---|---|---|
| Control plane cost | $73/mo | Free (Standard tier $73/mo) | $73/mo (one zonal cluster free) | Free (Managed) |
| Pod networking | VPC CNI (real IPs) | Azure CNI or Kubenet | GKE VPC-native | Terway or Flannel |
| Autoscaler | Karpenter or Cluster Autoscaler | KEDA, Cluster Autoscaler | GKE Autopilot or Cluster Autoscaler | Cluster Autoscaler |
| Serverless pods | Fargate | Virtual Kubelet (ACI) | Autopilot | ECI (Elastic Container Instance) |
| Max nodes per cluster | 5,000 | 5,000 | 15,000 | 5,000 |
| GPU support | Full | Full | Full | Full |
CloudWatch: Monitoring and Observability
CloudWatch collects metrics, logs, and traces across your AWS resources. It is the default observability platform, and even teams using Datadog or Grafana still rely on CloudWatch for AWS-native integrations.
CloudWatch Components
- Metrics -- CPU, network, and disk I/O for EC2 (memory and disk space utilization require the CloudWatch agent); request count and latency for ALB; and custom metrics from your applications. Standard resolution is 1 minute; high resolution is 1 second.
- Logs -- centralized log storage. Use Log Insights for SQL-like querying across log groups. Supports metric filters to create metrics from log patterns.
- Alarms -- trigger notifications or auto-scaling actions based on metric thresholds. Composite alarms combine multiple alarms with AND/OR logic.
- Dashboards -- visualize operational health. Up to 3 free dashboards, then $3/month each.
- X-Ray -- distributed tracing for microservices. Helps identify latency bottlenecks across service boundaries.
- Synthetics -- canary functions that monitor your endpoints on a schedule.
- Application Signals -- APM for applications running on EKS, ECS, and EC2.
# Create an alarm for high CPU
aws cloudwatch put-metric-alarm \
--alarm-name "HighCPU-WebServers" \
--metric-name CPUUtilization \
--namespace AWS/EC2 \
--statistic Average \
--period 300 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
--ok-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
--dimensions Name=AutoScalingGroupName,Value=web-asg \
--treat-missing-data notBreaching
# Query logs with Log Insights
aws logs start-query \
--log-group-name /ecs/web-app \
--start-time $(date -d '1 hour ago' +%s) \
--end-time $(date +%s) \
--query-string 'fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) by bin(5m)
| sort @timestamp desc'
# Create a metric filter for error counting
aws logs put-metric-filter \
--log-group-name /ecs/web-app \
--filter-name ErrorCount \
--filter-pattern '"ERROR"' \
--metric-transformations \
metricName=ApplicationErrors,metricNamespace=Custom/WebApp,metricValue=1,defaultValue=0
CloudWatch Pricing
CloudWatch costs add up quickly. The main cost drivers are:
- Custom metrics: $0.30/metric/month for the first 10,000.
- Log ingestion: $0.50/GB ingested.
- Log storage: $0.03/GB/month.
- Log Insights queries: $0.005 per GB scanned.
- Dashboards: $3/month each (beyond 3 free).
- Alarms: $0.10/alarm/month (standard), $0.30 (high-resolution).
To control costs: filter logs before sending them to CloudWatch, use log retention policies aggressively, and avoid high-cardinality custom metrics.
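To see how quickly ingestion dominates, here is a back-of-envelope estimate for a hypothetical service logging 20 GB/day, using the rates listed above:

```python
# CloudWatch Logs cost estimate: $0.50/GB ingested, $0.03/GB-month stored.
gb_per_day = 20       # hypothetical chatty service
retention_days = 30

ingest_monthly = gb_per_day * 30 * 0.50
stored_gb = gb_per_day * retention_days  # steady state with 30-day retention
storage_monthly = stored_gb * 0.03

print(f"Ingestion: ${ingest_monthly:,.0f}/month")
print(f"Storage:   ${storage_monthly:,.0f}/month")
```

Ingestion is more than 16x the storage cost here, which is why filtering logs before they reach CloudWatch saves far more than shortening retention does.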
AWS CLI Essentials
The CLI is your primary interface for automation. Install v2 and configure named profiles for each account:
# Configure a profile
aws configure --profile production
# Use SSO-based authentication (preferred)
aws configure sso --profile production
# Common operations
aws sts get-caller-identity --profile production # Who am I?
aws ec2 describe-instances --filters "Name=tag:Environment,Values=production" \
--query 'Reservations[].Instances[].[InstanceId,State.Name,PrivateIpAddress]' \
--output table
# Use --query for JMESPath filtering (saves piping to jq)
aws s3api list-buckets --query 'Buckets[?starts_with(Name, `prod-`)].Name' --output text
# Batch operations with waiter
aws ec2 start-instances --instance-ids i-0abc123 i-0def456
aws ec2 wait instance-running --instance-ids i-0abc123 i-0def456
echo "Instances are now running"
# Use CloudShell for quick tasks (browser-based, pre-authenticated)
# Access at https://console.aws.amazon.com/cloudshell/
AWS SDK Usage Patterns
For automation scripts, the AWS SDKs (boto3 for Python, aws-sdk for JavaScript/TypeScript) provide programmatic access:
# Python boto3 example: find and clean up unattached EBS volumes
from datetime import datetime, timezone
import boto3
ec2 = boto3.client('ec2', region_name='us-east-1')
response = ec2.describe_volumes(
    Filters=[{'Name': 'status', 'Values': ['available']}]
)
for volume in response['Volumes']:
    # CreateTime is timezone-aware, so compare against an aware "now"
    age_days = (datetime.now(timezone.utc) - volume['CreateTime']).days
    if age_days > 30:
        print(f"Deleting {volume['VolumeId']} - {volume['Size']}GB, {age_days} days old")
        ec2.delete_volume(VolumeId=volume['VolumeId'])
Cost-Aware Architecture Decisions
Every architectural choice has a cost implication. Build cost awareness into your DevOps practice:
- Right-size instances. Use AWS Compute Optimizer recommendations. Most instances are over-provisioned by 30-50%.
- Use Savings Plans for predictable workloads -- 30-60% savings over on-demand. Compute Savings Plans offer the most flexibility.
- Spot Instances for fault-tolerant workloads (CI/CD runners, batch processing, EKS node pools) -- up to 90% savings.
- Delete unused resources. Unattached EBS volumes, idle load balancers, old snapshots, and unused Elastic IPs add up silently.
- Use S3 lifecycle policies aggressively. Logs older than 30 days rarely need Standard storage.
- VPC endpoints save NAT Gateway data processing charges for S3 and DynamoDB (gateway endpoints are free).
- Set up billing alerts in every account. Use AWS Budgets to get notified before you overspend.
- Use AWS Cost Explorer to analyze spending by service, account, and tag. Enable hourly granularity for detailed analysis.
- Review Reserved Instance utilization monthly. Unused reservations are wasted money.
- Consider region pricing. us-east-1 and us-west-2 are typically the cheapest US regions.
# Create a monthly budget alert
aws budgets create-budget \
--account-id 123456789012 \
--budget '{
"BudgetName": "MonthlySpend",
"BudgetLimit": { "Amount": "5000", "Unit": "USD" },
"TimeUnit": "MONTHLY",
"BudgetType": "COST"
}' \
--notifications-with-subscribers '[
{
"Notification": {
"NotificationType": "ACTUAL",
"ComparisonOperator": "GREATER_THAN",
"Threshold": 80,
"ThresholdType": "PERCENTAGE"
},
"Subscribers": [
{ "SubscriptionType": "EMAIL", "Address": "ops@example.com" }
]
},
{
"Notification": {
"NotificationType": "FORECASTED",
"ComparisonOperator": "GREATER_THAN",
"Threshold": 100,
"ThresholdType": "PERCENTAGE"
},
"Subscribers": [
{ "SubscriptionType": "EMAIL", "Address": "ops@example.com" }
]
}
]'
Migration Considerations
When migrating to AWS from on-premises or another cloud:
- AWS Migration Hub provides a central dashboard for tracking migrations across multiple tools.
- AWS Application Migration Service (MGN) handles lift-and-shift server migrations with minimal downtime.
- AWS Database Migration Service (DMS) migrates databases with continuous replication. Supports heterogeneous migrations (Oracle to PostgreSQL, SQL Server to Aurora).
- AWS Transfer Family provides managed SFTP, FTPS, and FTP servers that store data in S3.
- S3 Transfer Acceleration speeds up cross-region uploads to S3.
- AWS Snow Family (Snowball, Snowcone, Snowmobile) for large-scale offline data transfer when network bandwidth is insufficient.
The typical migration pattern is: assess (Migration Hub, Application Discovery Service), mobilize (set up landing zone, networking, security), then migrate (rehost with MGN, replatform with managed services, or refactor to cloud-native). Start with the simplest approach (lift-and-shift) and modernize incrementally.
AWS is vast, but mastering these core services gives you the foundation to build and operate production infrastructure confidently. Start with IAM and VPC -- get those right and everything else becomes easier to reason about. The services covered here represent 80% of what a DevOps engineer interacts with on a daily basis. Deep knowledge of these fundamentals is far more valuable than shallow knowledge of all 200+ services.