
LIVE SIMULATIONS

DevOps Flight Simulator

Practice real-world incident response in a safe environment. Navigate production outages, security breaches, and infrastructure failures with expert guidance — before they happen for real.

20 Scenarios

93 Decision Points

9,600 Total Points Available

The Monday Morning OOM Kill

It's 9:02 AM Monday. Slack is blowing up: pods in the payments namespace keep getting OOMKilled after a weekend deploy. Customers can't check out. You're the on-call engineer. Triage and fix this before the standup at 9:30.

Aareez Asif

5 steps·60s/step

CI/CD · Intermediate

Pipeline Paralysis

Your team's CI/CD pipeline takes 45 minutes per run. Developers are stacking PRs, context-switching constantly, and velocity has tanked. The VP of Engineering wants this fixed by end of week. Time to optimize.

Sarah Chen

5 steps·60s/step

Cloud · Intermediate

The $50K Weekend

Monday morning. Finance pings you: 'AWS bill is $50K higher than last month... and it's only the 5th.' Your heart rate spikes. Something spun up over the weekend and nobody knows what. Find it, kill it, and make sure it never happens again.

Dev Patel

5 steps·60s/step

Secrets in the Open

A security researcher just DMed your company's Twitter account: 'Hey, your production AWS keys and Stripe API keys are in a public GitHub repo. You might want to fix that.' Your stomach drops. The clock is ticking: every second those keys are exposed is a second an attacker could use them.

Amara Okafor

5 steps·60s/step

The Cascade Failure

3:47 AM. PagerDuty wakes you up. Every dashboard is red. The API gateway is returning 503s, the order service is down, payments are failing, and even the search service is timing out. It's a cascading failure across your microservices architecture. Multiple teams are waking up. You need to lead the incident response.

Riku Tanaka

6 steps·60s/step

State File Corruption

Two engineers ran `terraform apply` at the same time against the same state file. Now the state is corrupted — some resources show as tainted, others are duplicated, and one critical RDS instance has disappeared from state entirely (but it's still running in AWS). Production infrastructure is at risk.

Zara Blackwood

5 steps·60s/step

networking · Intermediate

The DNS Nightmare

It's 2 PM on a Tuesday. After migrating DNS providers, your main application is unreachable for half your users. Some regions work, others don't. Support tickets are flooding in. Diagnose and fix the DNS propagation disaster.

Aareez Asif

5 steps·60s/step

containers · Beginner

The Bloated Build

Deployments that used to take 3 minutes now take 25 minutes. The Docker image for your microservice has grown from 150MB to 2.4GB. CI/CD pipelines are timing out, and the team is pushing fixes by SSH-ing directly into production. Stop the bleeding.

Aareez Asif

5 steps·60s/step

The Expired Certificate

Friday at 5 PM. Your site shows 'Your connection is not private' errors across all browsers. The SSL certificate expired 30 minutes ago. Let's Encrypt auto-renewal failed silently weeks ago. Customers are panicking on social media.

Aareez Asif

4 steps·45s/step

The Actions Secret Leak

A junior developer's PR accidentally prints AWS credentials in GitHub Actions logs. The logs are public on an open-source repo. The credentials have been exposed for 6 hours. Time to contain the damage and harden the pipeline.

Aareez Asif

4 steps·45s/step

databases · Advanced

The Failed Migration

A database migration ran in production and corrupted data for 15,000 users. The migration was supposed to merge two tables, but an UPDATE with a missing WHERE clause modified every row. The team is in panic mode. Recover the data and prevent this from ever happening again.

Aareez Asif

4 steps·60s/step

observability · Intermediate

The Alert Storm

Your on-call phone won't stop buzzing. Prometheus is firing 200+ alerts per minute. PagerDuty is auto-escalating to management. Half the alerts are false positives from a network blip that resolved itself 10 minutes ago. But buried in the noise, there's a real production issue. Find it.

Aareez Asif

5 steps·60s/step

K8s · Intermediate

The Invisible Firewall

After deploying a new microservice, the frontend team reports that their app can't reach the new API. All pods are running and healthy. DNS resolves correctly. The issue? Someone applied a restrictive NetworkPolicy yesterday and forgot to update it. You have 15 minutes before the product demo.

Aareez Asif

4 steps·60s/step

The Botched Helm Release

A Helm upgrade to v3.2.0 of the user-service went sideways. The new chart introduced a breaking ConfigMap change that crashes the app on startup. All 8 replicas are in CrashLoopBackOff. The previous release was v3.1.4 and was stable. Revenue is dropping $200/minute.

Aareez Asif

4 steps·45s/step

The GitOps Sync Storm

ArgoCD is stuck in a sync loop — it keeps deploying, detecting drift, and redeploying every 30 seconds. CPU on the cluster is spiking from constant pod recreations. A HorizontalPodAutoscaler keeps changing the replica count, which ArgoCD sees as drift from the Git-declared value. Three teams are blocked.

Aareez Asif

5 steps·60s/step

observability · Intermediate

The Log Tsunami

Your centralized logging pipeline (Fluentd → Elasticsearch) is backed up. A microservice started debug logging in a hot loop, producing 50,000 logs/second. Elasticsearch is rejecting bulk inserts, Fluentd buffers are full, and now ALL services are losing logs — including the ones you need for an active P1 investigation.

Aareez Asif

4 steps·60s/step

cloud-security · Advanced

The Open Bucket

A security scanner flagged an S3 bucket as publicly accessible. It contains customer invoices uploaded by the billing service. The bucket has been public for 3 weeks, since a Terraform change removed the block_public_access setting. Legal needs to know the blast radius. The clock is ticking on the 72-hour GDPR breach notification deadline.

Aareez Asif

5 steps·60s/step

load-balancing · Intermediate

The Phantom 504s

Users are intermittently getting 504 Gateway Timeout errors — but only about 10% of requests. The backend appears healthy and responds quickly when you test it directly. The load balancer health checks pass. A deeper investigation reveals the LB is routing traffic to a node that's technically 'healthy' but has a saturated connection table.

Aareez Asif

4 steps·60s/step

The Silent Disk Killer

Your production Kubernetes nodes are going NotReady one by one. Pods get evicted but there's nowhere to schedule them. The root cause? Container logs and unused Docker images have filled the node disks to 95%. Kubelet's garbage collection can't keep up. The cluster is slowly dying.

Aareez Asif

4 steps·45s/step

The Poisoned Pipeline

Your CI/CD pipeline started producing builds with a cryptominer embedded in the Docker image. The build logs look clean. The cause is a typosquatting attack: a malicious package named `lodashs` (instead of `lodash`) was added to your package-lock.json in a PR three days ago. You need to contain the breach and secure the supply chain.

Aareez Asif

5 steps·60s/step