The Complete Guide to Kubernetes Deployment Strategies: Rolling, Blue-Green, Canary, and Progressive Delivery
Every Deployment Is a Risk. Manage It.
I've deployed to Kubernetes clusters thousands of times. The deployments that go wrong aren't usually the ones with bad code — they're the ones with bad deployment strategy. A pod that starts successfully but degrades performance by 40% will sail right past a rolling update's readiness check. A breaking database schema change will pass every health probe and then fail when real traffic hits it.
The deployment strategy you choose determines how quickly you detect problems and how many users are affected when something goes wrong. Get this wrong, and a bad deploy means downtime for everyone. Get it right, and the blast radius of any failure is a fraction of your traffic for a few minutes.
This guide covers every deployment strategy available in Kubernetes — when to use each one, how to implement it, and the failure modes I've seen in production.
Strategy 1: Rolling Updates (The Default)
How It Works
Rolling updates gradually replace old pods with new ones. Kubernetes terminates old pods and creates new ones in batches, controlled by maxSurge and maxUnavailable.
```
Time 0: [v1] [v1] [v1] [v1] [v1]
Time 1: [v1] [v1] [v1] [v1] [v2]   ← 1 new pod created
Time 2: [v1] [v1] [v1] [v2] [v2]   ← old pod terminated, new created
Time 3: [v1] [v1] [v2] [v2] [v2]
Time 4: [v1] [v2] [v2] [v2] [v2]
Time 5: [v2] [v2] [v2] [v2] [v2]   ← complete
```
Production-Grade Configuration
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  namespace: production
spec:
  replicas: 5
  revisionHistoryLimit: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1         # Create at most 1 extra pod during update
      maxUnavailable: 0   # Never reduce below desired count
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
        version: v2.3.1
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: api
        image: myapp/api-server:v2.3.1
        ports:
        - containerPort: 8080
          name: http
        readinessProbe:
          httpGet:
            path: /healthz/ready
            port: http
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 2   # Must pass twice before receiving traffic
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /healthz/live
            port: http
          initialDelaySeconds: 15
          periodSeconds: 10
          failureThreshold: 5
        startupProbe:
          httpGet:
            path: /healthz/started
            port: http
          initialDelaySeconds: 5
          periodSeconds: 5
          failureThreshold: 30  # Allow up to 150s for startup
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 10"]
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "1000m"
```
The details matter here. Let me explain the non-obvious settings:
- `maxSurge: 1, maxUnavailable: 0`: This ensures you always have full capacity during rollout. The tradeoff is speed — the rollout takes longer because Kubernetes waits for each new pod to be ready before terminating an old one.
- `successThreshold: 2`: A single successful health check isn't enough. Two consecutive passes reduce the chance of routing traffic to a pod that's technically up but not ready.
- `preStop` sleep: When a pod is terminated, the endpoint is removed from the Service, but in-flight requests may still arrive during propagation. The 10-second sleep gives load balancers time to stop sending traffic before the pod shuts down.
- Three different probes: `startupProbe` for slow-starting apps (prevents liveness kills during startup), `readinessProbe` for traffic routing, `livenessProbe` for restart-on-deadlock.
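Exercising this configuration is an image change plus a watch on its progress. A minimal sketch (the `v2.3.2` tag and change-cause text are illustrative, not from the manifest above):

```shell
# Record why this change happened (it shows up in `kubectl rollout history`)
kubectl annotate deployment/api-server -n production \
  kubernetes.io/change-cause="upgrade api-server to v2.3.2" --overwrite

# Trigger the rolling update by changing the container image
kubectl set image deployment/api-server api=myapp/api-server:v2.3.2 \
  -n production

# Block until every pod is replaced and ready, or fail after 10 minutes
kubectl rollout status deployment/api-server -n production --timeout=600s
```

With `maxSurge: 1` and `maxUnavailable: 0`, expect the status command to report one pod at a time being surged, readied, and swapped in.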
Rollback
```bash
# Check rollout history
kubectl rollout history deployment/api-server -n production

# Roll back to previous version
kubectl rollout undo deployment/api-server -n production

# Roll back to specific revision
kubectl rollout undo deployment/api-server -n production --to-revision=3
```
When to Use Rolling Updates
- Good for: Stateless services, APIs, web servers — anything where running two versions simultaneously is safe.
- Bad for: Services that require database migrations, breaking API changes, or strict version consistency across all pods.
Strategy 2: Blue-Green Deployments
How It Works
Run two identical environments (blue and green). Deploy the new version to the inactive environment, test it, then switch all traffic at once.
```
Before: Traffic → [Blue v1]  [Blue v1]  [Blue v1]
                  [Green — idle]

Deploy: Traffic → [Blue v1]  [Blue v1]  [Blue v1]
                  [Green v2] [Green v2] [Green v2]  ← deploy + test

Switch: Traffic → [Green v2] [Green v2] [Green v2]
                  [Blue v1]  [Blue v1]  [Blue v1]   ← standby for rollback
```
Implementation with Services
```yaml
# deployment-blue.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server-blue
  namespace: production
  labels:
    app: api-server
    slot: blue
spec:
  replicas: 5
  selector:
    matchLabels:
      app: api-server
      slot: blue
  template:
    metadata:
      labels:
        app: api-server
        slot: blue
        version: v2.3.0
    spec:
      containers:
      - name: api
        image: myapp/api-server:v2.3.0
        # ... full container spec
---
# deployment-green.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server-green
  namespace: production
  labels:
    app: api-server
    slot: green
spec:
  replicas: 5
  selector:
    matchLabels:
      app: api-server
      slot: green
  template:
    metadata:
      labels:
        app: api-server
        slot: green
        version: v2.3.1
    spec:
      containers:
      - name: api
        image: myapp/api-server:v2.3.1
        # ... full container spec
---
# service.yaml — Switch traffic by changing the selector
apiVersion: v1
kind: Service
metadata:
  name: api-server
  namespace: production
spec:
  selector:
    app: api-server
    slot: blue    # ← Change to "green" to switch traffic
  ports:
  - port: 80
    targetPort: 8080
---
# test-service.yaml — Points to the inactive slot for testing
apiVersion: v1
kind: Service
metadata:
  name: api-server-test
  namespace: production
spec:
  selector:
    app: api-server
    slot: green   # ← Keep pointed at the inactive slot; flip it when you switch
  ports:
  - port: 80
    targetPort: 8080
```
Automated Blue-Green Switch Script
```bash
#!/bin/bash
set -euo pipefail

NAMESPACE="production"
SERVICE="api-server"
NEW_VERSION="$1"

# Determine current and target slots
CURRENT_SLOT=$(kubectl get svc "$SERVICE" -n "$NAMESPACE" \
  -o jsonpath='{.spec.selector.slot}')
if [ "$CURRENT_SLOT" = "blue" ]; then
  TARGET_SLOT="green"
else
  TARGET_SLOT="blue"
fi

echo "Current: $CURRENT_SLOT | Target: $TARGET_SLOT | Version: $NEW_VERSION"

# Deploy new version to target slot
kubectl set image "deployment/${SERVICE}-${TARGET_SLOT}" \
  api="myapp/api-server:${NEW_VERSION}" \
  -n "$NAMESPACE"

# Wait for rollout to complete
kubectl rollout status "deployment/${SERVICE}-${TARGET_SLOT}" \
  -n "$NAMESPACE" --timeout=300s

# Point the test Service at the target slot so smoke tests hit the new version
kubectl patch svc "${SERVICE}-test" -n "$NAMESPACE" \
  -p "{\"spec\":{\"selector\":{\"slot\":\"$TARGET_SLOT\"}}}"

# Run smoke tests against test service
echo "Running smoke tests against ${SERVICE}-test..."
for i in {1..10}; do
  STATUS=$(kubectl exec -n "$NAMESPACE" deploy/curl-pod -- \
    curl -s -o /dev/null -w "%{http_code}" "http://${SERVICE}-test/health")
  if [ "$STATUS" != "200" ]; then
    echo "Smoke test failed with status $STATUS. Aborting switch."
    exit 1
  fi
done
echo "Smoke tests passed."

# Switch traffic
kubectl patch svc "$SERVICE" -n "$NAMESPACE" \
  -p "{\"spec\":{\"selector\":{\"slot\":\"$TARGET_SLOT\"}}}"

echo "Traffic switched to $TARGET_SLOT (version $NEW_VERSION)"
echo "Previous version running on $CURRENT_SLOT — ready for rollback"
```
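Rolling back after a bad switch is the same patch pointed at the old slot. A sketch, assuming blue was the previous production slot:

```shell
# Instant rollback: repoint the Service selector at the previous slot
kubectl patch svc api-server -n production \
  -p '{"spec":{"selector":{"slot":"blue"}}}'
```

The rollback is instant because the old pods never stopped running; no pods need to be scheduled or warmed up, only the Service selector changes.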
When to Use Blue-Green
- Good for: Applications that need atomic switchover, database migrations that require all pods on the same version, compliance requirements for pre-production testing of the exact production deployment.
- Bad for: Teams without budget for double the infrastructure. Blue-green literally doubles your running compute during deployments.
Strategy 3: Canary Deployments
How It Works
Route a small percentage of traffic to the new version. Monitor metrics. Gradually increase traffic if everything looks good. Roll back instantly if it doesn't.
```
Phase 1: [v1] [v1] [v1] [v1] [v1]   95% traffic
         [v2]                        5% traffic

Phase 2: [v1] [v1] [v1] [v1]        80% traffic
         [v2] [v2]                  20% traffic

Phase 3: [v1] [v1]                  40% traffic
         [v2] [v2] [v2] [v2]        60% traffic

Phase 4: [v2] [v2] [v2] [v2] [v2]  100% traffic
```
Canary with Argo Rollouts
Argo Rollouts is purpose-built for advanced deployment strategies. It replaces the Deployment resource with a Rollout resource.
```bash
# Install Argo Rollouts
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts \
  -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
```
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-server
  namespace: production
spec:
  replicas: 10
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
      - name: api
        image: myapp/api-server:v2.3.1
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /healthz/ready
            port: 8080
          periodSeconds: 5
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "1000m"
  strategy:
    canary:
      canaryService: api-server-canary
      stableService: api-server-stable
      trafficRouting:
        nginx:
          stableIngress: api-server-ingress
          additionalIngressAnnotations:
            canary-by-header: X-Canary
      steps:
      # Step 1: 5% traffic to canary
      - setWeight: 5
      - pause: { duration: 5m }
      # Step 2: Automated analysis
      - analysis:
          templates:
          - templateName: success-rate
          args:
          - name: service-name
            value: api-server-canary
      # Step 3: 20% traffic
      - setWeight: 20
      - pause: { duration: 5m }
      # Step 4: Another analysis
      - analysis:
          templates:
          - templateName: success-rate
          - templateName: latency-check
          args:
          - name: service-name
            value: api-server-canary
      # Step 5: 50% traffic
      - setWeight: 50
      - pause: { duration: 10m }
      # Step 6: Final analysis before full promotion
      - analysis:
          templates:
          - templateName: success-rate
          - templateName: latency-check
          args:
          - name: service-name
            value: api-server-canary
      # Step 7: Full traffic (implicit at end of steps)
  # Automatic rollback on failure
  rollbackWindow:
    revisions: 2
```
Analysis Templates for Automated Canary Verification
This is the critical piece. Manual canary deployments are just rolling updates with extra steps. Automated analysis is what makes canary deployments actually work.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
  namespace: production
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 60s
    count: 5
    successCondition: result[0] > 0.99
    failureLimit: 2
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(
            http_requests_total{
              service="{{args.service-name}}",
              status!~"5.."
            }[2m]
          )) /
          sum(rate(
            http_requests_total{
              service="{{args.service-name}}"
            }[2m]
          ))
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency-check
  namespace: production
spec:
  args:
  - name: service-name
  metrics:
  - name: p99-latency
    interval: 60s
    count: 5
    successCondition: result[0] < 0.5
    failureLimit: 2
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          histogram_quantile(0.99,
            sum by (le) (
              rate(http_request_duration_seconds_bucket{
                service="{{args.service-name}}"
              }[2m])
            )
          )
```
The analysis template queries Prometheus every 60 seconds, 5 times. If the success rate drops below 99% or p99 latency exceeds 500ms more than twice, the rollout automatically aborts and rolls back. No human intervention needed at 3 AM.
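When you do want eyes on a rollout, or a manual override, the Argo Rollouts kubectl plugin covers it. These commands assume the plugin is installed and the Rollout above is deployed:

```shell
# Live view of steps, traffic weights, and analysis runs
kubectl argo rollouts get rollout api-server -n production --watch

# Skip the current pause or analysis step and continue the rollout
kubectl argo rollouts promote api-server -n production

# Abort: shift all traffic back to the stable version
kubectl argo rollouts abort api-server -n production

# Retry an aborted rollout after the problem is fixed
kubectl argo rollouts retry rollout api-server -n production
```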
Services for Canary Traffic Splitting
```yaml
apiVersion: v1
kind: Service
metadata:
  name: api-server-stable
  namespace: production
spec:
  selector:
    app: api-server
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: api-server-canary
  namespace: production
spec:
  selector:
    app: api-server
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-server-ingress
  namespace: production
  annotations:
    nginx.ingress.kubernetes.io/canary: "false"
spec:
  ingressClassName: nginx
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: api-server-stable
            port:
              number: 80
```
Canary with Flagger (Istio/Linkerd)
If you're running a service mesh, Flagger provides canary automation with mesh-level traffic splitting:
```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: api-server
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  service:
    port: 8080
    targetPort: 8080
  analysis:
    interval: 1m
    threshold: 5    # Max failed checks before rollback
    maxWeight: 50   # Max canary traffic percentage
    stepWeight: 10  # Increment per interval
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500
      interval: 1m
    webhooks:
    - name: smoke-test
      type: pre-rollout
      url: http://flagger-loadtester.test/
      timeout: 30s
      metadata:
        type: bash
        cmd: "curl -s http://api-server-canary.production/health | grep ok"
    - name: load-test
      type: rollout
      url: http://flagger-loadtester.test/
      timeout: 60s
      metadata:
        type: cmd
        cmd: "hey -z 1m -q 10 -c 2 http://api-server-canary.production/"
```
Strategy Comparison
| Strategy | Zero Downtime | Rollback Speed | Resource Cost | Traffic Control | Complexity |
|---|---|---|---|---|---|
| Rolling Update | Yes | 30s-2min | 1x + surge | None (all-or-nothing per pod) | Low |
| Blue-Green | Yes | Instant | 2x | Binary switch | Medium |
| Canary | Yes | Instant | 1x + canary pods | Percentage-based | High |
| Progressive Delivery | Yes | Automatic | 1x + canary pods | Metric-driven | Highest |
Choosing the Right Strategy
My decision framework after running all of these in production:
Use Rolling Updates when:
- Your app is stateless and backward-compatible.
- You don't have a service mesh or Argo Rollouts installed.
- The team is small and deployments are infrequent.
Use Blue-Green when:
- You need atomic switchover (database migrations, strict version consistency).
- You require a tested-in-place production environment before traffic hits it.
- Budget for double compute exists and is justified.
Use Canary with Argo Rollouts when:
- You deploy frequently (multiple times per day).
- You have Prometheus metrics that can validate deployment health.
- The service handles enough traffic for metrics to be statistically meaningful.
- You want automated rollback without human intervention.
Use Progressive Delivery with Flagger when:
- You already run a service mesh (Istio, Linkerd).
- You need mesh-level traffic management (header routing, mirroring).
- You want the most granular control over traffic distribution.
Strategy 4: Traffic Mirroring (Shadow Deployments)
There's a strategy that doesn't get enough attention: traffic mirroring. Instead of sending real user traffic to the new version, you send a copy of production traffic to the canary and compare the responses. Users never see the new version's responses, but you get real-world validation.
How It Works
```
Client Request ──> [v1 Production] ──> Response to Client
        │
        └────────> [v2 Shadow] ──────> Response Discarded (logged for analysis)
```
Implementation with Istio
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api-server
  namespace: production
spec:
  hosts:
  - api-server
  http:
  - route:
    - destination:
        host: api-server
        subset: stable
      weight: 100
    mirror:
      host: api-server
      subset: canary
    mirrorPercentage:
      value: 100.0
```
The shadow deployment receives a copy of every request but its responses are discarded. This is perfect for:
- Testing database-heavy queries under real load patterns
- Validating new algorithm outputs against the current version
- Smoke-testing major refactors without any user impact
The catch: mirrored traffic still hits downstream dependencies. If your new version writes to a database, those writes are real. Use read-only database connections or a separate test database for shadow deployments that involve writes.
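One way to keep shadow writes harmless, assuming the app takes its connection string from an environment variable (the `DATABASE_URL` name and the replica host below are illustrative, not from the setup above): point the shadow Deployment at a read-only replica.

```yaml
# Fragment of the shadow Deployment's pod template — identical to
# production except for the database endpoint (names illustrative)
spec:
  template:
    spec:
      containers:
      - name: api
        image: myapp/api-server:v2.3.1
        env:
        - name: DATABASE_URL
          value: "postgres://readonly@db-replica.production:5432/app"
```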
Deployment Readiness Checklist
Before deploying anything to production, run through this checklist. I've seen every item on this list cause a production incident when skipped.
| Check | Why It Matters | How to Verify |
|---|---|---|
| Readiness probe configured | Prevents routing traffic to unready pods | kubectl describe deployment |
| Liveness probe configured | Restarts deadlocked containers | Check probe endpoints respond |
| Startup probe for slow starters | Prevents liveness kills during startup | initialDelaySeconds + failureThreshold |
| preStop hook for graceful shutdown | Drains in-flight requests | lifecycle.preStop in pod spec |
| Resource requests and limits set | Prevents OOM kills and noisy neighbors | resources.requests / resources.limits |
| PodDisruptionBudget exists | Prevents too many pods going down at once | kubectl get pdb |
| Rollback plan documented | Reduces MTTR when things go wrong | Runbook link in deployment manifest |
| Metrics and alerts in place | Detects issues the deployment introduces | Check Grafana dashboard |
PodDisruptionBudget — Don't Skip This
A PDB tells Kubernetes how many pods must remain available during voluntary disruptions (node drains, cluster upgrades, rolling updates):
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
  namespace: production
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: api-server
```
Without a PDB, a node drain during a rolling update could take down more pods than your maxUnavailable setting allows. The PDB adds a hard constraint that Kubernetes respects across all disruption sources.
Graceful Shutdown Pattern
The preStop hook and terminationGracePeriodSeconds work together to prevent dropped requests:
```yaml
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - |
            # Keep serving while endpoint removal propagates to
            # load balancers; Kubernetes sends SIGTERM to the app
            # only after this hook completes.
            sleep 15
```
The sequence during pod termination:
1. Pod is marked for deletion.
2. Pod is removed from Service endpoints (but propagation takes time).
3. preStop hook runs (the `sleep 15` gives load balancers time to stop sending traffic).
4. SIGTERM is sent to the main process.
5. App has until `terminationGracePeriodSeconds` to shut down cleanly.
6. SIGKILL is sent if the app hasn't exited.
If your app handles long-running requests (file uploads, WebSocket connections), increase both the preStop sleep and the termination grace period accordingly.
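As a rough sketch (numbers illustrative): for a service that holds requests open for up to five minutes, size the grace period to cover the preStop drain plus the longest request.

```yaml
spec:
  # 30s preStop drain + 300s longest request + small buffer
  terminationGracePeriodSeconds: 340
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 30"]
```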
The Deployment I Wish I'd Done Differently
Early in my career, I rolled out a breaking change to a user-facing API using a standard rolling update. The new version passed every health check — the application started, the endpoints responded, the readiness probe returned 200. But the response payload format had changed, and every client that depended on the old format started failing silently.
By the time we noticed, 100% of pods were on the new version. The rollback took 3 minutes, but the damage was done — thousands of failed requests, corrupted client caches, and a postmortem that concluded with "we should have used canary."
The lesson: health checks tell you if the process is alive. They don't tell you if the service is correct. Canary analysis against real traffic metrics — error rates, latency percentiles, business metrics — catches the failures that health probes miss.
Conclusion
Choose your deployment strategy based on the blast radius you can tolerate. For most production services, that answer should be "as small as possible, verified by metrics, with automatic rollback." That's canary. Build toward it.
Start with rolling updates — they're built in and require no extra tooling. Add proper health checks, preStop hooks, and PodDisruptionBudgets. When you're ready for more control, install Argo Rollouts and implement canary with automated analysis. The progression is natural: each step gives you more confidence and smaller blast radius.
The investment in deployment infrastructure pays for itself not on the good days, but on the bad ones. When a deploy goes wrong at 2 AM, the difference between "automatic rollback in 30 seconds" and "page the on-call engineer who pages the team lead who approves the rollback" is the difference between a blip and an outage.
Whatever strategy you choose, measure your deployment metrics: deployment frequency, lead time for changes, change failure rate, and time to recover. These are the DORA metrics, and they directly correlate with engineering team performance. A team deploying daily with canary analysis and automatic rollback will outship a team deploying weekly with manual verification every time — not because they're moving faster, but because they're moving safer.