DevOpsil
LIVE SIMULATIONS

DevOps Flight Simulator

Practice real-world incident response in a safe environment. Navigate production outages, security breaches, and infrastructure failures with expert guidance — before they happen for real.

6 Scenarios
31 Decision Points
3200 Total Points Available
K8s · Beginner

The Monday Morning OOM Kill

It's 9:02 AM Monday. Slack is blowing up: pods in the payments namespace keep getting OOMKilled after a weekend deploy. Customers can't check out. You're the on-call engineer. Triage and fix this before the standup at 9:30.

Aareez Asif
5 steps · 60s/step
CI/CD · Intermediate

Pipeline Paralysis

Your team's CI/CD pipeline takes 45 minutes per run. Developers are stacking PRs, context-switching constantly, and velocity has tanked. The VP of Engineering wants this fixed by end of week. Time to optimize.

Sarah Chen
5 steps · 60s/step
Cloud · Intermediate

The $50K Weekend

Monday morning. Finance pings you: 'AWS bill is $50K higher than last month... and it's only the 5th.' Your heart rate spikes. Something spun up over the weekend and nobody knows what. Find it, kill it, and make sure it never happens again.

Dev Patel
5 steps · 60s/step
Sec · Advanced

Secrets in the Open

A security researcher just DMed your company's Twitter account: 'Hey, your production AWS keys and Stripe API keys are in a public GitHub repo. You might want to fix that.' Your stomach drops. The clock is ticking: every second those keys are exposed is a second an attacker could use them.

Amara Okafor
5 steps · 60s/step
SRE · Advanced

The Cascade Failure

3:47 AM. PagerDuty wakes you up. Every dashboard is red. The API gateway is returning 503s, the order service is down, payments are failing, and even the search service is timing out. It's a cascading failure across your microservices architecture. Multiple teams are waking up. You need to lead the incident response.

Riku Tanaka
6 steps · 60s/step
TF · Intermediate

State File Corruption

Two engineers ran `terraform apply` at the same time against the same state file. Now the state is corrupted — some resources show as tainted, others are duplicated, and one critical RDS instance has disappeared from state entirely (but it's still running in AWS). Production infrastructure is at risk.

Zara Blackwood
5 steps · 60s/step