DevOps Flight Simulator
Practice real-world incident response in a safe environment. Navigate production outages, security breaches, and infrastructure failures with expert guidance — before they happen for real.
The Monday Morning OOM Kill
It's 9:02 AM Monday. Slack is blowing up: pods in the payments namespace keep getting OOMKilled after a weekend deploy. Customers can't check out. You're the on-call engineer. Triage and fix this before the standup at 9:30.
Pipeline Paralysis
Your team's CI/CD pipeline takes 45 minutes per run. Developers are stacking PRs, context-switching constantly, and velocity has tanked. The VP of Engineering wants this fixed by end of week. Time to optimize.
The $50K Weekend
Monday morning. Finance pings you: 'AWS bill is $50K higher than last month... and it's only the 5th.' Your heart rate spikes. Something spun up over the weekend and nobody knows what. Find it, kill it, and make sure it never happens again.
Secrets in the Open
A security researcher just DMed your company's Twitter account: 'Hey, your production AWS keys and Stripe API keys are in a public GitHub repo. You might want to fix that.' Your stomach drops. The clock is ticking. Every second those keys are exposed is a second an attacker could use them.
The Cascade Failure
3:47 AM. PagerDuty wakes you up. Every dashboard is red. The API gateway is returning 503s, the order service is down, payments are failing, and even the search service is timing out. It's a cascading failure across your microservices architecture. Multiple teams are waking up. You need to lead the incident response.
State File Corruption
Two engineers ran `terraform apply` at the same time against the same state file. Now the state is corrupted — some resources show as tainted, others are duplicated, and one critical RDS instance has disappeared from state entirely (but it's still running in AWS). Production infrastructure is at risk.