AWS Core Services: The DevOps Engineer's Essential Guide
AWS has over 200 services. Nobody uses all of them. As a DevOps engineer, your job is to know the core services deeply and the rest well enough to know when they solve a problem. This guide covers the services you will touch every single week, with real CLI examples, pricing context, architecture patterns, and the operational details that matter when you are on call at 2 AM.
AWS Account Structure and Organizations
Before you provision a single resource, understand how AWS organizes access. A production-grade setup uses AWS Organizations with multiple accounts, and getting this right early prevents painful migrations later.
The Multi-Account Strategy
A typical enterprise structure looks like this:
- Management Account -- billing, consolidated logs, Organization policies. No workloads run here.
- Security Account -- GuardDuty, Security Hub, centralized CloudTrail, AWS Config aggregator.
- Log Archive Account -- immutable storage for CloudTrail logs, VPC flow logs, and audit trails.
- Shared Services Account -- DNS (Route 53), shared container registries (ECR), CI/CD tooling, artifact storage.
- Network Account -- Transit Gateway, Direct Connect, shared VPC infrastructure.
- Workload Accounts -- dev, staging, production, each fully isolated with separate IAM boundaries.
This separation exists because a single AWS account becomes a blast radius. If an attacker compromises your production account, they should not be able to touch your audit logs or billing configuration. AWS Organizations lets you manage all these accounts centrally.
Service Control Policies (SCPs)
SCPs are guardrails applied at the Organization or Organizational Unit (OU) level. They restrict what member accounts can do, even if the account's IAM policies allow it. Think of SCPs as a ceiling on permissions.
For example, deny all activity outside your approved regions:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "DenyUnapprovedRegions",
"Effect": "Deny",
"Action": "*",
"Resource": "*",
"Condition": {
"StringNotEquals": {
"aws:RequestedRegion": ["us-east-1", "eu-west-1"]
}
}
}
]
}
Other common SCPs include preventing member accounts from leaving the Organization, blocking the creation of IAM users with console access (forcing SSO instead), and denying public S3 bucket policies. These guardrails catch mistakes before they become incidents.
AWS Control Tower
For teams setting up multi-account environments from scratch, AWS Control Tower automates the creation of accounts, OUs, and baseline guardrails. It provisions a landing zone with pre-configured security baselines, SSO configuration, and centralized logging. Control Tower uses Account Factory to let teams request new accounts through a self-service catalog, ensuring every account starts with the correct configuration.
IAM: Identity and Access Management
IAM is the service you will interact with the most and get wrong the most. Every API call in AWS is authorized through IAM. Mastering it is not optional.
Core Concepts
| Concept | What It Is | When to Use |
|---|---|---|
| User | Long-lived identity with credentials | Human access (prefer SSO instead) |
| Group | Collection of users sharing policies | Organizing human permissions |
| Role | Assumable identity, temporary credentials | EC2 instances, Lambda, cross-account access |
| Policy | JSON document defining permissions | Attached to users, groups, or roles |
| Instance Profile | Wrapper that lets EC2 assume a role | Every EC2 instance that calls AWS APIs |
| Permission Boundary | Maximum permissions an entity can have | Delegated administration |
| Session Policy | Inline policy passed during role assumption | Temporary scope reduction |
IAM Policy Evaluation Logic
Understanding how AWS evaluates policies prevents hours of debugging. The evaluation order is:
- Explicit Deny -- if any policy says Deny, the request is denied. Period.
- SCPs -- the Organization-level ceiling. If the SCP does not allow it, it is denied.
- Permission Boundaries -- if set, the effective permissions are the intersection of the boundary and the identity policy.
- Session Policies -- further restricts permissions during an assumed role session.
- Identity Policies -- the policies attached to the user, group, or role.
- Resource Policies -- policies on the resource itself (S3 bucket policy, SQS queue policy).
- Default Deny -- if nothing explicitly allows the action, it is denied.
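The precedence above can be sketched as a toy evaluator. This is a deliberately simplified model covering only identity-policy statements: it ignores SCPs, permission boundaries, session policies, and resource policies, but it captures the two rules that matter most in practice (explicit deny always wins; nothing is allowed by default):

```python
import fnmatch

def evaluate(statements, action):
    """Evaluate a list of (effect, action_pattern) statements for one action."""
    matched = [eff for eff, pat in statements if fnmatch.fnmatch(action, pat)]
    if "Deny" in matched:
        return "Deny"    # an explicit deny always wins
    if "Allow" in matched:
        return "Allow"
    return "Deny"        # nothing allowed it: implicit default deny

policies = [
    ("Allow", "s3:*"),
    ("Deny", "s3:DeleteBucket"),
]

print(evaluate(policies, "s3:GetObject"))      # Allow
print(evaluate(policies, "s3:DeleteBucket"))   # Deny: explicit deny beats the Allow
print(evaluate(policies, "ec2:RunInstances"))  # Deny: nothing allows it
```

The same mental model explains most "AccessDenied" debugging sessions: look for an explicit Deny (often in an SCP) before assuming your Allow statement is wrong.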
IAM Best Practices for DevOps
- Never use the root account for daily work. Lock it down with MFA and use it only for billing changes and account recovery.
- Use roles, not users, for workloads. EC2 instances, Lambda functions, and ECS tasks should all assume roles.
- Least privilege always. Start with zero permissions and add what is needed. Use IAM Access Analyzer to identify unused permissions.
- Use conditions. Restrict by source IP, MFA presence, request time, or resource tags.
- Enable CloudTrail in every account. Every IAM action generates an API event that CloudTrail records.
- Rotate credentials. If you must use access keys, rotate them every 90 days. Better yet, use IAM Identity Center (SSO) for human access.
Create a role for an EC2 instance that can read from a specific S3 bucket:
# Create the trust policy
cat > trust-policy.json << 'EOF'
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": { "Service": "ec2.amazonaws.com" },
"Action": "sts:AssumeRole"
}
]
}
EOF
# Create the role
aws iam create-role \
--role-name AppServerRole \
--assume-role-policy-document file://trust-policy.json
# Attach a scoped policy
aws iam put-role-policy \
--role-name AppServerRole \
--policy-name S3ReadAccess \
--policy-document '{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": ["s3:GetObject", "s3:ListBucket"],
"Resource": [
"arn:aws:s3:::my-app-config-bucket",
"arn:aws:s3:::my-app-config-bucket/*"
]
}
]
}'
# Create the instance profile and add the role
aws iam create-instance-profile --instance-profile-name AppServerProfile
aws iam add-role-to-instance-profile \
--instance-profile-name AppServerProfile \
--role-name AppServerRole
Cross-Account Access
Cross-account role assumption is how you securely access resources in other accounts without sharing credentials. The pattern is:
- Account B creates a role with a trust policy allowing Account A to assume it.
- Account A's IAM entity calls sts:AssumeRole targeting the role ARN in Account B.
- STS returns temporary credentials scoped to Account B's role.
# From Account A, assume a role in Account B
CREDS=$(aws sts assume-role \
--role-arn arn:aws:iam::987654321098:role/CrossAccountDeployRole \
--role-session-name deploy-session \
--query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
--output text)
# Export the temporary credentials
export AWS_ACCESS_KEY_ID=$(echo $CREDS | cut -d' ' -f1)
export AWS_SECRET_ACCESS_KEY=$(echo $CREDS | cut -d' ' -f2)
export AWS_SESSION_TOKEN=$(echo $CREDS | cut -d' ' -f3)
# Now all AWS CLI commands operate in Account B
aws s3 ls
Cross-Cloud IAM Comparison
| Feature | AWS IAM | Azure RBAC | GCP IAM | Alibaba RAM |
|---|---|---|---|---|
| Identity for workloads | IAM Roles | Managed Identities | Service Accounts | RAM Roles |
| Human access | IAM Identity Center (SSO) | Entra ID | Cloud Identity | IDaaS |
| Policy language | JSON | JSON (Azure Policy) | YAML/JSON bindings | JSON |
| Permission inheritance | None (explicit) | Scope hierarchy | Resource hierarchy | None (explicit) |
| Temporary credentials | STS AssumeRole | Managed Identity tokens | Workload Identity | STS AssumeRole |
| Condition keys | 50+ global keys | Conditions in policies | IAM Conditions | Limited conditions |
EC2: Elastic Compute Cloud
EC2 is the foundational compute service. Even if you run containers or serverless, you need to understand EC2 because many managed services run on it under the hood, and EC2 knowledge translates directly to cost optimization.
Instance Types That Matter
| Family | Use Case | Example | On-Demand Price (us-east-1) |
|---|---|---|---|
| t3/t3a | Burstable, dev/test, small workloads | t3.medium (2 vCPU, 4 GB) | ~$0.0416/hr |
| m6i/m6a | General purpose, production web servers | m6i.xlarge (4 vCPU, 16 GB) | ~$0.192/hr |
| m7g | General purpose, Graviton3 ARM | m7g.xlarge (4 vCPU, 16 GB) | ~$0.163/hr |
| c6i/c7g | CPU-intensive, CI/CD build agents | c6i.2xlarge (8 vCPU, 16 GB) | ~$0.34/hr |
| r6i | Memory-intensive, caches, in-memory DBs | r6i.xlarge (4 vCPU, 32 GB) | ~$0.252/hr |
| g5 | GPU, ML inference | g5.xlarge (4 vCPU, 16 GB, 1 GPU) | ~$1.006/hr |
| i3en | Storage-optimized, databases | i3en.xlarge (4 vCPU, 32 GB, 2.5TB NVMe) | ~$0.452/hr |
The a suffix means AMD (cheaper), g suffix means Graviton (ARM, cheaper and often faster). Graviton instances typically give you 20-40% better price-performance for Linux workloads. If your application runs on Linux and does not depend on x86 architecture, Graviton should be your default.
Purchasing Options and Pricing
| Option | Savings | Commitment | Best For |
|---|---|---|---|
| On-Demand | 0% (baseline) | None | Unpredictable workloads, short-term |
| Reserved Instances (RI) | 30-60% | 1 or 3 years | Steady-state production workloads |
| Savings Plans | 30-60% | 1 or 3 years | Flexible across instance families |
| Spot Instances | 60-90% | None (can be interrupted) | CI/CD, batch, fault-tolerant |
| Dedicated Hosts | Varies | Hourly or reserved | Licensing compliance, regulatory |
Savings Plans are generally preferred over Reserved Instances because they offer flexibility across instance families, sizes, and even between EC2 and Fargate. Compute Savings Plans apply to any instance family in any region. EC2 Instance Savings Plans are cheaper but locked to a specific instance family in a specific region.
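A quick back-of-envelope comparison using the m6i.xlarge on-demand rate from the table above and an illustrative 35% discount (actual Savings Plan discounts vary by term, payment option, and instance family -- check current pricing):

```python
# Annual cost of one m6i.xlarge: on-demand vs a hypothetical 35% Savings Plan.
# The $0.192/hr rate comes from the instance table above (us-east-1).
HOURS_PER_YEAR = 8760
on_demand_rate = 0.192  # $/hr

on_demand_annual = on_demand_rate * HOURS_PER_YEAR
savings_plan_annual = on_demand_annual * (1 - 0.35)

print(f"On-demand:    ${on_demand_annual:,.0f}/yr")
print(f"Savings Plan: ${savings_plan_annual:,.0f}/yr")
print(f"Saved:        ${on_demand_annual - savings_plan_annual:,.0f}/yr")
```

Multiply by a fleet of dozens of instances and the case for committing to steady-state capacity makes itself.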
Launching an Instance with the CLI
# Find the latest Amazon Linux 2023 AMI
AMI_ID=$(aws ec2 describe-images \
--owners amazon \
--filters "Name=name,Values=al2023-ami-2023.*-x86_64" \
"Name=state,Values=available" \
--query 'sort_by(Images, &CreationDate)[-1].ImageId' \
--output text)
# Launch the instance
aws ec2 run-instances \
--image-id "$AMI_ID" \
--instance-type t3.medium \
--key-name my-key-pair \
--security-group-ids sg-0abc1234def56789 \
--subnet-id subnet-0abc1234 \
--iam-instance-profile Name=AppServerProfile \
--tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=web-server-01},{Key=Environment,Value=production}]' \
--user-data file://bootstrap.sh \
--block-device-mappings '[{
"DeviceName": "/dev/xvda",
"Ebs": {
"VolumeSize": 50,
"VolumeType": "gp3",
"Iops": 3000,
"Throughput": 125,
"Encrypted": true
}
}]'
Always tag your instances. Tags are how you track costs, automate operations, and enforce policies. A good tagging strategy includes at minimum: Name, Environment, Team, CostCenter, and Application.
EBS Volume Types
| Volume Type | IOPS | Throughput | Use Case | Cost |
|---|---|---|---|---|
| gp3 | 3,000 baseline (up to 16,000) | 125 MB/s (up to 1,000) | General purpose, boot volumes | $0.08/GB/mo |
| gp2 | Burst to 3,000 | 250 MB/s max | Legacy, migrate to gp3 | $0.10/GB/mo |
| io2 | Up to 64,000 | 1,000 MB/s | Mission-critical databases | $0.125/GB/mo + IOPS |
| st1 | 500 baseline | 500 MB/s | Throughput-heavy sequential | $0.045/GB/mo |
| sc1 | 250 baseline | 250 MB/s | Cold storage, infrequent access | $0.015/GB/mo |
Always use gp3 over gp2 for new deployments. gp3 is 20% cheaper and lets you independently provision IOPS and throughput.
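The migration savings are easy to estimate from the per-GB rates in the table above (a sketch with a hypothetical fleet size):

```python
# Fleet-wide savings from migrating gp2 volumes to gp3, using the
# us-east-1 per-GB-month rates from the EBS table above.
gp2_rate, gp3_rate = 0.10, 0.08  # $/GB-month
fleet_gb = 10_000                # e.g. 200 volumes x 50 GB (illustrative)

monthly_savings = fleet_gb * (gp2_rate - gp3_rate)
print(f"Migrating {fleet_gb:,} GB saves ${monthly_savings:,.0f}/month")
```

The migration itself is an online operation (modify-volume), so there is rarely a reason not to do it.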
VPC: Virtual Private Cloud
Every resource you deploy lives inside a VPC. Understanding VPC architecture is non-negotiable for DevOps work. A poorly designed VPC leads to security gaps, routing headaches, and expensive re-architecture.
Standard Three-Tier VPC Layout
VPC: 10.0.0.0/16 (65,534 usable IPs)
|-- Public Subnets (10.0.1.0/24, 10.0.2.0/24, 10.0.3.0/24)
| |-- Internet Gateway attached
| |-- NAT Gateways (one per AZ for high availability)
| |-- Application Load Balancers
| +-- Bastion hosts (if not using SSM)
|-- Private Subnets (10.0.11.0/24, 10.0.12.0/24, 10.0.13.0/24)
| |-- Route to NAT Gateway for outbound internet
| |-- Application servers (EC2, ECS tasks)
| +-- EKS worker nodes
+-- Data Subnets (10.0.21.0/24, 10.0.22.0/24, 10.0.23.0/24)
|-- No internet route at all
|-- RDS instances
|-- ElastiCache clusters
+-- OpenSearch domains
Each subnet tier spans three availability zones for high availability. The key networking components:
- Internet Gateway (IGW) -- allows resources with public IPs to reach the internet. Free, one per VPC.
- NAT Gateway -- allows private subnet resources to make outbound internet calls without being directly reachable. Costs $0.045/hr plus $0.045/GB processed. Deploy one per AZ.
- Route Tables -- control where traffic flows. Each subnet associates with exactly one route table.
- Security Groups -- stateful firewalls at the instance level. Default deny inbound, allow outbound. Up to 5 security groups per ENI.
- Network ACLs -- stateless firewalls at the subnet level. Used as a secondary defense layer. Process rules in order by rule number.
CIDR Planning
CIDR planning deserves careful thought because VPC CIDRs cannot overlap if you want to peer them. A common approach:
- 10.0.0.0/16 for production
- 10.1.0.0/16 for staging
- 10.2.0.0/16 for development
- 10.10.0.0/16 for shared services
- 10.100.0.0/16 for management
Leave room for growth. A /16 gives you 65,534 addresses. Each /24 subnet provides 251 usable IPs (AWS reserves 5). For EKS clusters, plan for larger subnets (/20 or bigger) because each pod gets an IP address with the AWS VPC CNI plugin.
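The subnet math is easy to verify with Python's stdlib ipaddress module:

```python
# Carving the production VPC (10.0.0.0/16) into /24 subnets.
# AWS reserves 5 addresses per subnet (network, VPC router, DNS,
# future use, broadcast), leaving 251 usable in each /24.
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vpc.subnets(new_prefix=24))

print(f"{len(subnets)} /24 subnets available")          # 256
print(f"First three: {[str(s) for s in subnets[:3]]}")
usable = subnets[0].num_addresses - 5                   # AWS reserves 5
print(f"Usable IPs per /24: {usable}")                  # 251
```

Running the same calculation with new_prefix=20 shows why EKS subnets get big fast: a /20 yields 4,091 usable IPs, and a busy node pool can consume them quickly when every pod takes one.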
VPC Endpoints
VPC endpoints let your private subnets reach AWS services without going through the NAT Gateway, saving money and improving security.
# Gateway endpoint for S3 (free)
aws ec2 create-vpc-endpoint \
--vpc-id vpc-aaa111 \
--service-name com.amazonaws.us-east-1.s3 \
--route-table-ids rtb-private1 rtb-private2 rtb-private3
# Interface endpoint for Secrets Manager (charges apply)
aws ec2 create-vpc-endpoint \
--vpc-id vpc-aaa111 \
--service-name com.amazonaws.us-east-1.secretsmanager \
--vpc-endpoint-type Interface \
--subnet-ids subnet-private1 subnet-private2 \
--security-group-ids sg-vpce-allow \
--private-dns-enabled
Gateway endpoints (S3, DynamoDB) are free. Interface endpoints cost $0.01/hr per AZ plus data processing. Despite the cost, interface endpoints can save money if your private instances do heavy AWS API traffic that would otherwise go through the NAT Gateway.
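The break-even point is straightforward to estimate. The sketch below uses the NAT and endpoint hourly rates quoted above plus an assumed $0.01/GB interface-endpoint data-processing charge -- verify all three figures against current pricing for your region:

```python
# When does an interface endpoint beat routing AWS API traffic
# through the NAT Gateway? Rates: $0.045/GB NAT processing,
# $0.01/hr per endpoint AZ, assumed $0.01/GB endpoint processing.
HOURS_PER_MONTH = 730
azs = 3

endpoint_fixed = 0.01 * HOURS_PER_MONTH * azs  # fixed $/month for 3 AZs
per_gb_saved = 0.045 - 0.01                    # NAT rate minus endpoint rate

break_even_gb = endpoint_fixed / per_gb_saved
print(f"Endpoint pays for itself above {break_even_gb:,.0f} GB/month")
```

A few hundred GB of API traffic per month is common for chatty workloads, so the endpoint often wins on cost as well as on security posture.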
VPC Peering and Transit Gateway
When you have multiple VPCs (multi-account setup), connect them with:
- VPC Peering -- simple, direct connection between two VPCs. Works cross-region and cross-account. Good for small numbers of VPCs. No transitive routing (A-B and B-C does not mean A-C).
- Transit Gateway -- hub-and-spoke model. When you have more than 3-4 VPCs, this is the way to go. Centralizes routing. Supports transitive routing. Costs $0.05/hr per attachment plus $0.02/GB.
# Create a VPC peering connection
aws ec2 create-vpc-peering-connection \
--vpc-id vpc-aaa111 \
--peer-vpc-id vpc-bbb222 \
--peer-owner-id 123456789012 \
--peer-region eu-west-1
# Accept the peering (run from the peer account/region)
aws ec2 accept-vpc-peering-connection \
--vpc-peering-connection-id pcx-0abc1234
# Add routes in both VPCs
aws ec2 create-route \
--route-table-id rtb-aaa111 \
--destination-cidr-block 10.1.0.0/16 \
--vpc-peering-connection-id pcx-0abc1234
S3: Simple Storage Service
S3 is effectively infinite object storage with 99.999999999% (eleven nines) durability. You will use it for everything: Terraform state, application assets, log archives, data lake, backup targets, static website hosting, and as a data transfer medium between services.
Storage Classes and Cost
| Storage Class | Use Case | Retrieval Cost | Monthly Cost (per GB) | Min Duration |
|---|---|---|---|---|
| S3 Standard | Frequently accessed data | None | ~$0.023 | None |
| S3 Intelligent-Tiering | Unknown or changing access patterns | None | ~$0.023 + $0.0025/1K objects monitoring | None |
| S3 Standard-IA | Infrequent access, rapid retrieval | $0.01/GB | ~$0.0125 | 30 days |
| S3 One Zone-IA | Infrequent, reproducible data | $0.01/GB | ~$0.01 | 30 days |
| S3 Glacier Instant | Archive, millisecond retrieval | $0.03/GB | ~$0.004 | 90 days |
| S3 Glacier Flexible | Archive, minutes to hours retrieval | Varies by speed | ~$0.0036 | 90 days |
| S3 Glacier Deep Archive | Long-term archive, 12-hour retrieval | $0.02/GB | ~$0.00099 | 180 days |
S3 Intelligent-Tiering is the low-effort option: it automatically moves objects between tiers based on access patterns. The monitoring fee is negligible for large objects but adds up for millions of small files.
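The small-object caveat is worth quantifying. Using the monitoring rate from the table above, the fee depends only on object count, not size:

```python
# Intelligent-Tiering monitoring fee: $0.0025 per 1,000 monitored
# objects per month (rate from the storage class table above).
def monitoring_cost(num_objects):
    return num_objects / 1000 * 0.0025

print(f"10K large objects: ${monitoring_cost(10_000):.2f}/month")
print(f"10M small objects: ${monitoring_cost(10_000_000):.2f}/month")
```

Ten million small log files cost $25/month just to monitor, which can exceed what the tiering saves; for that shape of data, an explicit lifecycle policy is usually cheaper.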
Lifecycle Policies
Automate transitions between storage classes to save money:
aws s3api put-bucket-lifecycle-configuration \
--bucket my-app-logs \
--lifecycle-configuration '{
"Rules": [
{
"ID": "ArchiveOldLogs",
"Status": "Enabled",
"Filter": { "Prefix": "logs/" },
"Transitions": [
{ "Days": 30, "StorageClass": "STANDARD_IA" },
{ "Days": 90, "StorageClass": "GLACIER_IR" },
{ "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
],
"Expiration": { "Days": 2555 },
"NoncurrentVersionTransitions": [
{ "NoncurrentDays": 30, "StorageClass": "GLACIER_IR" }
],
"NoncurrentVersionExpiration": { "NoncurrentDays": 90 }
}
]
}'
S3 Security
Every S3 bucket should follow these security practices:
# Enable versioning for state files and configuration
aws s3api put-bucket-versioning \
--bucket my-terraform-state \
--versioning-configuration Status=Enabled
# Block all public access (account-level)
aws s3control put-public-access-block \
--account-id 123456789012 \
--public-access-block-configuration \
BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
# Enable default encryption
aws s3api put-bucket-encryption \
--bucket my-terraform-state \
--server-side-encryption-configuration '{
"Rules": [
{
"ApplyServerSideEncryptionByDefault": {
"SSEAlgorithm": "aws:kms",
"KMSMasterKeyID": "alias/s3-encryption-key"
},
"BucketKeyEnabled": true
}
]
}'
# Enable access logging
aws s3api put-bucket-logging \
--bucket my-terraform-state \
--bucket-logging-status '{
"LoggingEnabled": {
"TargetBucket": "my-access-logs-bucket",
"TargetPrefix": "s3-logs/terraform-state/"
}
}'
S3 Performance Optimization
S3 automatically handles 5,500 GET/HEAD and 3,500 PUT/COPY/POST/DELETE requests per second per prefix. For higher throughput:
- Use multiple prefixes. Distribute objects across prefixes to parallelize.
- Enable S3 Transfer Acceleration for cross-region uploads (uses CloudFront edge locations).
- Use multipart upload for objects larger than 100 MB. The AWS CLI does this automatically for aws s3 cp.
- S3 Select and Glacier Select let you query CSV/JSON/Parquet files in place without downloading the full object.
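Since the request-rate limit applies per prefix, aggregate throughput is simply the per-prefix limit times the number of prefixes you spread keys across (a back-of-envelope sketch; real-world throughput also depends on object size and client parallelism):

```python
# Aggregate S3 read throughput scales with the number of key prefixes,
# because the 5,500 GET/s limit quoted above applies per prefix.
GET_PER_PREFIX = 5500
prefixes = ["2024/01/", "2024/02/", "2024/03/", "2024/04/"]  # hypothetical layout

aggregate = len(prefixes) * GET_PER_PREFIX
print(f"{len(prefixes)} prefixes -> ~{aggregate:,} GET/s aggregate")
```

This is why high-throughput designs hash or date-partition their key space instead of writing everything under one prefix.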
RDS: Managed Relational Databases
RDS handles patching, backups, replication, and failover for your relational databases. Supported engines include PostgreSQL, MySQL, MariaDB, Oracle, SQL Server, and Aurora.
Key Decisions for DevOps
- Multi-AZ -- synchronous standby in another AZ, automatic failover in 60-120 seconds. Always enable for production. Doubles your cost.
- Read Replicas -- asynchronous copies for read-heavy workloads. Can be cross-region for disaster recovery. Up to 15 read replicas for Aurora, 5 for other engines.
- Aurora -- AWS-proprietary engine compatible with PostgreSQL and MySQL. Up to 5x throughput of standard MySQL, 3x of standard PostgreSQL. Storage auto-scales up to 128 TB. More expensive but often worth it for production.
- Aurora Serverless v2 -- scales compute capacity automatically based on load. Pay for what you use. Excellent for variable workloads.
- Automated Backups -- enabled by default, retention up to 35 days. Test your restore process regularly.
- Storage -- gp3 for most workloads, io2 for high-performance databases.
aws rds create-db-instance \
--db-instance-identifier prod-postgres \
--db-instance-class db.r6g.xlarge \
--engine postgres \
--engine-version 15.4 \
--master-username dbadmin \
--master-user-password "$(aws secretsmanager get-random-password \
--password-length 32 --require-each-included-type --output text)" \
--allocated-storage 100 \
--max-allocated-storage 500 \
--storage-type gp3 \
--multi-az \
--vpc-security-group-ids sg-0abc1234 \
--db-subnet-group-name prod-db-subnets \
--backup-retention-period 14 \
--preferred-backup-window "03:00-04:00" \
--preferred-maintenance-window "Mon:04:00-Mon:05:00" \
--storage-encrypted \
--kms-key-id alias/rds-encryption-key \
--performance-insights-enabled \
--monitoring-interval 60 \
--monitoring-role-arn arn:aws:iam::123456789012:role/rds-monitoring-role \
--enable-cloudwatch-logs-exports '["postgresql","upgrade"]' \
--deletion-protection \
--copy-tags-to-snapshot \
--tags Key=Environment,Value=production Key=Team,Value=platform
RDS Pricing Considerations
RDS pricing has several components: instance hours, storage (per GB/month), I/O (for Aurora), backup storage beyond the free allocation, and data transfer. Reserved Instances offer 30-60% savings for steady-state databases. Aurora I/O-Optimized is a newer pricing model that bundles I/O costs into the instance price -- worth evaluating if your Aurora cluster has heavy I/O.
RDS vs Aurora Decision Matrix
| Factor | RDS PostgreSQL/MySQL | Aurora |
|---|---|---|
| Cost (small workloads) | Lower | Higher base cost |
| Cost (large workloads) | Comparable | Often lower (better efficiency) |
| Failover time | 60-120 seconds | Typically under 30 seconds |
| Storage scaling | Manual (with downtime risk) | Automatic up to 128 TB |
| Read replicas | Up to 5, replication lag | Up to 15, lower replication lag |
| Backtrack | Not available | Rewind database to any point in time |
| Global Database | Cross-region read replicas | Sub-second cross-region replication |
Lambda: Serverless Compute
Lambda runs code without you managing servers. It scales from zero to thousands of concurrent executions automatically. You pay only for execution time, billed in 1ms increments.
Pricing Model
- Requests: $0.20 per 1 million requests (first 1M free per month).
- Duration: $0.0000166667 per GB-second. A 256 MB function running for 1 second costs $0.0000042.
- Free Tier: 1 million requests and 400,000 GB-seconds per month, every month, permanently.
For most DevOps automation tasks (event handlers, cleanup scripts, webhook processors), Lambda falls well within the free tier.
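The pricing model above reduces to a two-term formula: request count plus GB-seconds of duration. A sketch with hypothetical workload figures (us-east-1 rates from above, before free tier):

```python
# Monthly Lambda bill: $0.20 per 1M requests plus
# $0.0000166667 per GB-second of duration.
def lambda_monthly_cost(invocations, avg_ms, memory_mb):
    request_cost = invocations / 1_000_000 * 0.20
    gb_seconds = invocations * (avg_ms / 1000) * (memory_mb / 1024)
    duration_cost = gb_seconds * 0.0000166667
    return request_cost + duration_cost, gb_seconds

# Hypothetical: 5M invocations/month, 200 ms average, 256 MB memory.
cost, gb_s = lambda_monthly_cost(invocations=5_000_000, avg_ms=200, memory_mb=256)
print(f"{gb_s:,.0f} GB-seconds -> ${cost:.2f}/month before free tier")
```

At this scale the 250,000 GB-seconds of duration sit entirely inside the 400,000 GB-second free tier, leaving only the request charge beyond the first million.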
Common DevOps Uses
Lambda excels at event-driven automation: CloudWatch alarm handlers, S3 event processing, API Gateway backends, scheduled cleanup jobs, custom CloudFormation resources, CodePipeline approval actions, and infrastructure compliance checks.
# Package and deploy a simple function
zip function.zip index.js
aws lambda create-function \
--function-name process-s3-uploads \
--runtime nodejs20.x \
--handler index.handler \
--role arn:aws:iam::123456789012:role/LambdaS3Role \
--zip-file fileb://function.zip \
--timeout 30 \
--memory-size 256 \
--ephemeral-storage Size=1024 \
--environment Variables='{DEST_BUCKET=processed-data}' \
--tracing-config Mode=Active \
--architectures arm64 \
--tags Environment=production
# Add an S3 trigger
aws lambda add-permission \
--function-name process-s3-uploads \
--statement-id s3-trigger \
--action lambda:InvokeFunction \
--principal s3.amazonaws.com \
--source-arn arn:aws:s3:::my-upload-bucket \
--source-account 123456789012
aws s3api put-bucket-notification-configuration \
--bucket my-upload-bucket \
--notification-configuration '{
"LambdaFunctionConfigurations": [
{
"LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:process-s3-uploads",
"Events": ["s3:ObjectCreated:*"],
"Filter": {
"Key": {
"FilterRules": [
{ "Name": "prefix", "Value": "uploads/" },
{ "Name": "suffix", "Value": ".csv" }
]
}
}
}
]
}'
Lambda Optimization Tips
- Use ARM64 (Graviton2) -- 20% cheaper, often faster for compute workloads.
- Right-size memory. More memory also means more CPU. Use AWS Lambda Power Tuning to find the optimal configuration.
- Minimize cold starts. Use Provisioned Concurrency for latency-sensitive functions ($0.0000041667 per GB-second provisioned).
- Keep functions small and focused. One function per responsibility.
- Use layers for shared dependencies. Keeps deployment packages small.
- Use environment variables for configuration and Secrets Manager for credentials.
EKS: Elastic Kubernetes Service
EKS is AWS-managed Kubernetes. AWS manages the control plane (etcd, API server, scheduler); you manage the worker nodes. The control plane costs $0.10/hr ($73/month) regardless of cluster size.
Cluster Setup Choices
| Option | Control | Effort | Cost |
|---|---|---|---|
| Managed Node Groups | You choose instance types, EKS handles ASG and updates | Medium | EC2 pricing + $73/mo control plane |
| Self-Managed Nodes | Full control, you manage everything | High | EC2 pricing + $73/mo control plane |
| Fargate | No nodes to manage, per-pod pricing | Low | $0.04048/vCPU/hr + $0.004445/GB/hr |
| EKS Auto Mode | AWS manages everything including nodes | Lowest | EC2 pricing + premium |
For most teams, managed node groups with Karpenter for autoscaling is the best balance of control and operational simplicity. Karpenter provisions right-sized nodes based on pending pod requirements, consolidates underutilized nodes, and can mix Spot and On-Demand instances intelligently.
# Create a cluster with eksctl
eksctl create cluster \
--name production \
--region us-east-1 \
--version 1.29 \
--nodegroup-name workers \
--node-type m6i.xlarge \
--nodes-min 2 \
--nodes-max 10 \
--managed \
--with-oidc \
--alb-ingress-access \
--node-private-networking \
--asg-access
# Install Karpenter for intelligent autoscaling
helm install karpenter oci://public.ecr.aws/karpenter/karpenter \
--namespace karpenter --create-namespace \
--set "settings.clusterName=production" \
--set "settings.interruptionQueue=production" \
--wait
EKS Networking and Service Mesh
EKS uses the AWS VPC CNI plugin by default, which assigns real VPC IP addresses to pods. This means pods can communicate directly with other AWS resources using VPC networking, security groups, and NACLs. The tradeoff is IP address consumption -- plan your VPC CIDRs accordingly.
For service-to-service communication, AWS offers App Mesh (Envoy-based) or you can deploy Istio, Linkerd, or Cilium. For most teams, Cilium provides a good balance of networking, observability, and security without the complexity of a full service mesh.
Kubernetes on AWS vs Other Clouds
| Feature | EKS (AWS) | AKS (Azure) | GKE (GCP) | ACK (Alibaba) |
|---|---|---|---|---|
| Control plane cost | $73/mo | Free (Standard tier $73/mo) | $73/mo (one zonal cluster free) | Free (Managed) |
| Pod networking | VPC CNI (real IPs) | Azure CNI or Kubenet | GKE VPC-native | Terway or Flannel |
| Autoscaler | Karpenter or Cluster Autoscaler | KEDA, Cluster Autoscaler | GKE Autopilot or Cluster Autoscaler | Cluster Autoscaler |
| Serverless pods | Fargate | Virtual Kubelet (ACI) | Autopilot | ECI (Elastic Container Instance) |
| Max nodes per cluster | 5,000 | 5,000 | 15,000 | 5,000 |
| GPU support | Full | Full | Full | Full |
CloudWatch: Monitoring and Observability
CloudWatch collects metrics, logs, and traces across your AWS resources. It is the default observability platform, and even teams using Datadog or Grafana still rely on CloudWatch for AWS-native integrations.
CloudWatch Components
- Metrics -- CPU, network, and disk I/O for EC2 (memory and disk space utilization require the CloudWatch agent); request count and latency for ALB; and custom metrics from your applications. Standard resolution is 1 minute; high resolution is 1 second.
- Logs -- centralized log storage. Use Log Insights for SQL-like querying across log groups. Supports metric filters to create metrics from log patterns.
- Alarms -- trigger notifications or auto-scaling actions based on metric thresholds. Composite alarms combine multiple alarms with AND/OR logic.
- Dashboards -- visualize operational health. Up to 3 free dashboards, then $3/month each.
- X-Ray -- distributed tracing for microservices. Helps identify latency bottlenecks across service boundaries.
- Synthetics -- canary functions that monitor your endpoints on a schedule.
- Application Signals -- APM for applications running on EKS, ECS, and EC2.
# Create an alarm for high CPU
aws cloudwatch put-metric-alarm \
--alarm-name "HighCPU-WebServers" \
--metric-name CPUUtilization \
--namespace AWS/EC2 \
--statistic Average \
--period 300 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
--ok-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
--dimensions Name=AutoScalingGroupName,Value=web-asg \
--treat-missing-data notBreaching
# Query logs with Log Insights
aws logs start-query \
--log-group-name /ecs/web-app \
--start-time $(date -d '1 hour ago' +%s) \
--end-time $(date +%s) \
--query-string 'fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) by bin(5m)
| sort @timestamp desc'
# Create a metric filter for error counting
aws logs put-metric-filter \
--log-group-name /ecs/web-app \
--filter-name ErrorCount \
--filter-pattern '"ERROR"' \
--metric-transformations \
metricName=ApplicationErrors,metricNamespace=Custom/WebApp,metricValue=1,defaultValue=0
CloudWatch Pricing
CloudWatch costs add up quickly. The main cost drivers are:
- Custom metrics: $0.30/metric/month for the first 10,000.
- Log ingestion: $0.50/GB ingested.
- Log storage: $0.03/GB/month.
- Log Insights queries: $0.005 per GB scanned.
- Dashboards: $3/month each (beyond 3 free).
- Alarms: $0.10/alarm/month (standard), $0.30 (high-resolution).
To control costs: filter logs before sending them to CloudWatch, use log retention policies aggressively, and avoid high-cardinality custom metrics.
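To see how quickly ingestion dominates, here is a back-of-envelope estimate for a hypothetical service logging 20 GB/day, using the rates listed above:

```python
# CloudWatch Logs cost estimate: $0.50/GB ingested, $0.03/GB-month stored.
gb_per_day = 20       # hypothetical chatty service
retention_days = 30

ingest_monthly = gb_per_day * 30 * 0.50
stored_gb = gb_per_day * retention_days  # steady state with 30-day retention
storage_monthly = stored_gb * 0.03

print(f"Ingestion: ${ingest_monthly:,.0f}/month")
print(f"Storage:   ${storage_monthly:,.0f}/month")
```

Ingestion is more than 16x the storage cost here, which is why filtering logs before they reach CloudWatch saves far more than shortening retention does.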
AWS CLI Essentials
The CLI is your primary interface for automation. Install v2 and configure named profiles for each account:
# Configure a profile
aws configure --profile production
# Use SSO-based authentication (preferred)
aws configure sso --profile production
# Common operations
aws sts get-caller-identity --profile production # Who am I?
aws ec2 describe-instances --filters "Name=tag:Environment,Values=production" \
--query 'Reservations[].Instances[].[InstanceId,State.Name,PrivateIpAddress]' \
--output table
# Use --query for JMESPath filtering (saves piping to jq)
aws s3api list-buckets --query 'Buckets[?starts_with(Name, `prod-`)].Name' --output text
# Batch operations with waiter
aws ec2 start-instances --instance-ids i-0abc123 i-0def456
aws ec2 wait instance-running --instance-ids i-0abc123 i-0def456
echo "Instances are now running"
# Use CloudShell for quick tasks (browser-based, pre-authenticated)
# Access at https://console.aws.amazon.com/cloudshell/
AWS SDK Usage Patterns
For automation scripts, the AWS SDKs (boto3 for Python, aws-sdk for JavaScript/TypeScript) provide programmatic access:
# Python boto3 example: find and clean up unattached EBS volumes
from datetime import datetime, timezone
import boto3
ec2 = boto3.client('ec2', region_name='us-east-1')
response = ec2.describe_volumes(
    Filters=[{'Name': 'status', 'Values': ['available']}]
)
for volume in response['Volumes']:
    # CreateTime is timezone-aware, so compare against an aware "now"
    age_days = (datetime.now(timezone.utc) - volume['CreateTime']).days
    if age_days > 30:
        print(f"Deleting {volume['VolumeId']} - {volume['Size']}GB, {age_days} days old")
        ec2.delete_volume(VolumeId=volume['VolumeId'])
Cost-Aware Architecture Decisions
Every architectural choice has a cost implication. Build cost awareness into your DevOps practice:
- Right-size instances. Use AWS Compute Optimizer recommendations. Most instances are over-provisioned by 30-50%.
- Use Savings Plans for predictable workloads -- 30-60% savings over on-demand. Compute Savings Plans offer the most flexibility.
- Spot Instances for fault-tolerant workloads (CI/CD runners, batch processing, EKS node pools) -- up to 90% savings.
- Delete unused resources. Unattached EBS volumes, idle load balancers, old snapshots, and unused Elastic IPs add up silently.
- Use S3 lifecycle policies aggressively. Logs older than 30 days rarely need Standard storage.
- VPC endpoints save NAT Gateway data processing charges for S3 and DynamoDB (gateway endpoints are free).
- Set up billing alerts in every account. Use AWS Budgets to get notified before you overspend.
- Use AWS Cost Explorer to analyze spending by service, account, and tag. Enable hourly granularity for detailed analysis.
- Review Reserved Instance utilization monthly. Unused reservations are wasted money.
- Consider region pricing. us-east-1 and us-west-2 are typically the cheapest US regions.
# Create a monthly budget alert
aws budgets create-budget \
--account-id 123456789012 \
--budget '{
"BudgetName": "MonthlySpend",
"BudgetLimit": { "Amount": "5000", "Unit": "USD" },
"TimeUnit": "MONTHLY",
"BudgetType": "COST"
}' \
--notifications-with-subscribers '[
{
"Notification": {
"NotificationType": "ACTUAL",
"ComparisonOperator": "GREATER_THAN",
"Threshold": 80,
"ThresholdType": "PERCENTAGE"
},
"Subscribers": [
{ "SubscriptionType": "EMAIL", "Address": "ops@example.com" }
]
},
{
"Notification": {
"NotificationType": "FORECASTED",
"ComparisonOperator": "GREATER_THAN",
"Threshold": 100,
"ThresholdType": "PERCENTAGE"
},
"Subscribers": [
{ "SubscriptionType": "EMAIL", "Address": "ops@example.com" }
]
}
]'
Migration Considerations
When migrating to AWS from on-premises or another cloud:
- AWS Migration Hub provides a central dashboard for tracking migrations across multiple tools.
- AWS Application Migration Service (MGN) handles lift-and-shift server migrations with minimal downtime.
- AWS Database Migration Service (DMS) migrates databases with continuous replication. Supports heterogeneous migrations (Oracle to PostgreSQL, SQL Server to Aurora).
- AWS Transfer Family provides managed SFTP, FTPS, and FTP servers that store data in S3.
- S3 Transfer Acceleration speeds up cross-region uploads to S3.
- AWS Snow Family (Snowball, Snowcone, Snowmobile) for large-scale offline data transfer when network bandwidth is insufficient.
The typical migration pattern is: assess (Migration Hub, Application Discovery Service), mobilize (set up landing zone, networking, security), then migrate (rehost with MGN, replatform with managed services, or refactor to cloud-native). Start with the simplest approach (lift-and-shift) and modernize incrementally.
AWS is vast, but mastering these core services gives you the foundation to build and operate production infrastructure confidently. Start with IAM and VPC -- get those right and everything else becomes easier to reason about. The services covered here represent 80% of what a DevOps engineer interacts with on a daily basis. Deep knowledge of these fundamentals is far more valuable than shallow knowledge of all 200+ services.