Part 2 of 8 in Kubernetes from Zero to Hero

The Complete Guide to Kubernetes Deployment Strategies: Rolling, Blue-Green, Canary, and Progressive Delivery

Aareez Asif · 15 min read

Every Deployment Is a Risk. Manage It.

I've deployed to Kubernetes clusters thousands of times. The deployments that go wrong aren't usually the ones with bad code — they're the ones with bad deployment strategy. A pod that starts successfully but degrades performance by 40% will sail right past a rolling update's readiness check. A breaking database schema change will pass every health probe and then fail when real traffic hits it.

The deployment strategy you choose determines how quickly you detect problems and how many users are affected when something goes wrong. Get this wrong, and a bad deploy means downtime for everyone. Get it right, and the blast radius of any failure is a fraction of your traffic for a few minutes.

This guide covers every deployment strategy available in Kubernetes — when to use each one, how to implement it, and the failure modes I've seen in production.

Strategy 1: Rolling Updates (The Default)

How It Works

Rolling updates gradually replace old pods with new ones. Kubernetes terminates old pods and creates new ones in batches, controlled by maxSurge and maxUnavailable.

Time 0:  [v1] [v1] [v1] [v1] [v1]
Time 1:  [v1] [v1] [v1] [v1] [v2]  ← 1 new pod created
Time 2:  [v1] [v1] [v1] [v2] [v2]  ← old pod terminated, new created
Time 3:  [v1] [v1] [v2] [v2] [v2]
Time 4:  [v1] [v2] [v2] [v2] [v2]
Time 5:  [v2] [v2] [v2] [v2] [v2]  ← complete

Production-Grade Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  namespace: production
spec:
  replicas: 5
  revisionHistoryLimit: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # Create at most 1 extra pod during update
      maxUnavailable: 0    # Never reduce below desired count
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
        version: v2.3.1
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: api
          image: myapp/api-server:v2.3.1
          ports:
            - containerPort: 8080
              name: http
          readinessProbe:
            httpGet:
              path: /healthz/ready
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 2    # Must pass twice before receiving traffic
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /healthz/live
              port: http
            initialDelaySeconds: 15
            periodSeconds: 10
            failureThreshold: 5
          startupProbe:
            httpGet:
              path: /healthz/started
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 30   # Allow up to 150s for startup
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "1000m"

The details matter here. Let me explain the non-obvious settings:

  • maxSurge: 1, maxUnavailable: 0: This ensures you always have full capacity during rollout. The tradeoff is speed — the rollout takes longer because Kubernetes waits for each new pod to be ready before terminating an old one.
  • successThreshold: 2: A single successful health check isn't enough. Two consecutive passes reduce the chance of routing traffic to a pod that's technically up but not actually ready.
  • preStop sleep: When a pod is terminated, the endpoint is removed from the Service, but in-flight requests may still arrive during propagation. The 10-second sleep gives load balancers time to stop sending traffic before the pod shuts down.
  • Three different probes: startupProbe for slow-starting apps (prevents liveness kills during startup), readinessProbe for traffic routing, livenessProbe for restart-on-deadlock.
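A quick sanity check on that startup window, using the values from the manifest above (failureThreshold: 30, periodSeconds: 5):

```shell
# startupProbe budget before the container is restarted:
# failureThreshold consecutive failures, periodSeconds apart
budget=$((30 * 5))
echo "startup budget: ${budget}s"   # prints: startup budget: 150s
```

If your app can take longer than this to warm up, raise failureThreshold rather than periodSeconds, so a healthy pod is still detected quickly.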

Rollback

# Check rollout history
kubectl rollout history deployment/api-server -n production

# Roll back to previous version
kubectl rollout undo deployment/api-server -n production

# Roll back to specific revision
kubectl rollout undo deployment/api-server -n production --to-revision=3

When to Use Rolling Updates

  • Good for: Stateless services, APIs, web servers — anything where running two versions simultaneously is safe.
  • Bad for: Services that require database migrations, breaking API changes, or strict version consistency across all pods.

Strategy 2: Blue-Green Deployments

How It Works

Run two identical environments (blue and green). Deploy the new version to the inactive environment, test it, then switch all traffic at once.

Before:   Traffic → [Blue v1] [Blue v1] [Blue v1]
                     [Green — idle]

Deploy:   Traffic → [Blue v1] [Blue v1] [Blue v1]
                     [Green v2] [Green v2] [Green v2]  ← deploy + test

Switch:   Traffic → [Green v2] [Green v2] [Green v2]
                     [Blue v1] [Blue v1] [Blue v1]  ← standby for rollback

Implementation with Services

# deployment-blue.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server-blue
  namespace: production
  labels:
    app: api-server
    slot: blue
spec:
  replicas: 5
  selector:
    matchLabels:
      app: api-server
      slot: blue
  template:
    metadata:
      labels:
        app: api-server
        slot: blue
        version: v2.3.0
    spec:
      containers:
        - name: api
          image: myapp/api-server:v2.3.0
          # ... full container spec

---
# deployment-green.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server-green
  namespace: production
  labels:
    app: api-server
    slot: green
spec:
  replicas: 5
  selector:
    matchLabels:
      app: api-server
      slot: green
  template:
    metadata:
      labels:
        app: api-server
        slot: green
        version: v2.3.1
    spec:
      containers:
        - name: api
          image: myapp/api-server:v2.3.1
          # ... full container spec

---
# service.yaml — Switch traffic by changing the selector
apiVersion: v1
kind: Service
metadata:
  name: api-server
  namespace: production
spec:
  selector:
    app: api-server
    slot: blue     # ← Change to "green" to switch traffic
  ports:
    - port: 80
      targetPort: 8080

---
# test-service.yaml — Always points to the inactive slot for testing
apiVersion: v1
kind: Service
metadata:
  name: api-server-test
  namespace: production
spec:
  selector:
    app: api-server
    slot: green    # ← Always the opposite of the production service
  ports:
    - port: 80
      targetPort: 8080

Automated Blue-Green Switch Script

#!/bin/bash
set -euo pipefail

NAMESPACE="production"
SERVICE="api-server"
NEW_VERSION="$1"

# Determine current and target slots
CURRENT_SLOT=$(kubectl get svc "$SERVICE" -n "$NAMESPACE" \
  -o jsonpath='{.spec.selector.slot}')

if [ "$CURRENT_SLOT" = "blue" ]; then
  TARGET_SLOT="green"
else
  TARGET_SLOT="blue"
fi

echo "Current: $CURRENT_SLOT | Target: $TARGET_SLOT | Version: $NEW_VERSION"

# Deploy new version to target slot
kubectl set image "deployment/${SERVICE}-${TARGET_SLOT}" \
  api="myapp/api-server:${NEW_VERSION}" \
  -n "$NAMESPACE"

# Wait for rollout to complete
kubectl rollout status "deployment/${SERVICE}-${TARGET_SLOT}" \
  -n "$NAMESPACE" --timeout=300s

# Run smoke tests against test service
echo "Running smoke tests against ${SERVICE}-test..."
for i in {1..10}; do
  STATUS=$(kubectl exec -n "$NAMESPACE" deploy/curl-pod -- \
    curl -s -o /dev/null -w "%{http_code}" "http://${SERVICE}-test/health")
  if [ "$STATUS" != "200" ]; then
    echo "Smoke test failed with status $STATUS. Aborting switch."
    exit 1
  fi
done
echo "Smoke tests passed."

# Switch traffic
kubectl patch svc "$SERVICE" -n "$NAMESPACE" \
  -p "{\"spec\":{\"selector\":{\"slot\":\"$TARGET_SLOT\"}}}"

echo "Traffic switched to $TARGET_SLOT (version $NEW_VERSION)"
echo "Previous version running on $CURRENT_SLOT — ready for rollback"
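The slot-toggle logic in the script is easy to get backwards. Isolated as a function, it can be sanity-checked locally with no cluster involved (a sketch of the same logic, not part of the script above):

```shell
#!/bin/sh
# Same blue/green toggle the deploy script performs on the Service selector
next_slot() {
  if [ "$1" = "blue" ]; then
    echo "green"
  else
    echo "blue"
  fi
}

next_slot blue    # prints: green
next_slot green   # prints: blue
```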

When to Use Blue-Green

  • Good for: Applications that need atomic switchover, database migrations that require all pods on the same version, compliance requirements for pre-production testing of the exact production deployment.
  • Bad for: Teams without budget for double the infrastructure. Blue-green literally doubles your running compute during deployments.

Strategy 3: Canary Deployments

How It Works

Route a small percentage of traffic to the new version. Monitor metrics. Gradually increase traffic if everything looks good. Roll back instantly if it doesn't.

Phase 1:  [v1] [v1] [v1] [v1] [v1]    95% traffic
          [v2]                           5% traffic

Phase 2:  [v1] [v1] [v1] [v1]          80% traffic
          [v2] [v2]                     20% traffic

Phase 3:  [v1] [v1]                    40% traffic
          [v2] [v2] [v2] [v2]          60% traffic

Phase 4:  [v2] [v2] [v2] [v2] [v2]   100% traffic

Canary with Argo Rollouts

Argo Rollouts is purpose-built for advanced deployment strategies. It replaces the Deployment resource with a Rollout resource.

# Install Argo Rollouts
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts \
  -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml

With the controller installed, define a Rollout in place of your Deployment:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-server
  namespace: production
spec:
  replicas: 10
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api
          image: myapp/api-server:v2.3.1
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz/ready
              port: 8080
            periodSeconds: 5
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "1000m"
  strategy:
    canary:
      canaryService: api-server-canary
      stableService: api-server-stable
      trafficRouting:
        nginx:
          stableIngress: api-server-ingress
          additionalIngressAnnotations:
            canary-by-header: X-Canary
      steps:
        # Step 1: 5% traffic to canary
        - setWeight: 5
        - pause: { duration: 5m }

        # Step 2: Automated analysis
        - analysis:
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: api-server-canary

        # Step 3: 20% traffic
        - setWeight: 20
        - pause: { duration: 5m }

        # Step 4: Another analysis
        - analysis:
            templates:
              - templateName: success-rate
              - templateName: latency-check

        # Step 5: 50% traffic
        - setWeight: 50
        - pause: { duration: 10m }

        # Step 6: Final analysis before full promotion
        - analysis:
            templates:
              - templateName: success-rate
              - templateName: latency-check

        # Step 7: Full traffic (implicit at end of steps)

  # Fast-track rollback: redeploying one of the last 2 revisions
  # skips the canary steps (spec-level field, not part of the canary strategy)
  rollbackWindow:
    revisions: 2

Analysis Templates for Automated Canary Verification

This is the critical piece. Manual canary deployments are just rolling updates with extra steps. Automated analysis is what makes canary deployments actually work.

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
  namespace: production
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 60s
      count: 5
      successCondition: result[0] > 0.99
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(
              http_requests_total{
                service="{{args.service-name}}",
                status!~"5.."
              }[2m]
            )) /
            sum(rate(
              http_requests_total{
                service="{{args.service-name}}"
              }[2m]
            ))

---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency-check
  namespace: production
spec:
  args:
    - name: service-name
  metrics:
    - name: p99-latency
      interval: 60s
      count: 5
      successCondition: result[0] < 0.5
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum by (le) (
                rate(http_request_duration_seconds_bucket{
                  service="{{args.service-name}}"
                }[2m])
              )
            )

The analysis template queries Prometheus every 60 seconds, 5 times. If the success rate drops below 99% or p99 latency exceeds 500ms more than twice, the rollout automatically aborts and rolls back. No human intervention needed at 3 AM.
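To build intuition for the threshold, here is the same success-rate computation the Prometheus query performs, done on raw request counters (the numbers are hypothetical, not from any real service):

```shell
# 10,000 requests in the window, 85 of them 5xx
total=10000
errors=85
awk -v t="$total" -v e="$errors" \
  'BEGIN { printf "success rate: %.4f\n", (t - e) / t }'
# prints: success rate: 0.9915 — passes the result[0] > 0.99 condition
```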

Services for Canary Traffic Splitting

apiVersion: v1
kind: Service
metadata:
  name: api-server-stable
  namespace: production
spec:
  selector:
    app: api-server
  ports:
    - port: 80
      targetPort: 8080

---
apiVersion: v1
kind: Service
metadata:
  name: api-server-canary
  namespace: production
spec:
  selector:
    app: api-server
  ports:
    - port: 80
      targetPort: 8080

---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-server-ingress
  namespace: production
  annotations:
    nginx.ingress.kubernetes.io/canary: "false"
spec:
  ingressClassName: nginx
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-server-stable
                port:
                  number: 80

Canary with Flagger (Istio/Linkerd)

If you're running a service mesh, Flagger provides canary automation with mesh-level traffic splitting:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: api-server
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  service:
    port: 8080
    targetPort: 8080
  analysis:
    interval: 1m
    threshold: 5       # Max failed checks before rollback
    maxWeight: 50      # Max canary traffic percentage
    stepWeight: 10     # Increment per interval
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 1m
    webhooks:
      - name: smoke-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -s http://api-server-canary.production/health | grep ok"
      - name: load-test
        type: rollout
        url: http://flagger-loadtester.test/
        timeout: 60s
        metadata:
          type: cmd
          cmd: "hey -z 1m -q 10 -c 2 http://api-server-canary.production/"

Strategy Comparison

Strategy                Zero Downtime   Rollback Speed   Resource Cost       Traffic Control                  Complexity
Rolling Update          Yes             30s–2min         1x + surge          None (all-or-nothing per pod)    Low
Blue-Green              Yes             Instant          2x                  Binary switch                    Medium
Canary                  Yes             Instant          1x + canary pods    Percentage-based                 High
Progressive Delivery    Yes             Automatic        1x + canary pods    Metric-driven                    Highest

Choosing the Right Strategy

My decision framework after running all of these in production:

Use Rolling Updates when:

  • Your app is stateless and backward-compatible.
  • You don't have a service mesh or Argo Rollouts installed.
  • The team is small and deployments are infrequent.

Use Blue-Green when:

  • You need atomic switchover (database migrations, strict version consistency).
  • You require a tested-in-place production environment before traffic hits it.
  • Budget for double compute exists and is justified.

Use Canary with Argo Rollouts when:

  • You deploy frequently (multiple times per day).
  • You have Prometheus metrics that can validate deployment health.
  • The service handles enough traffic for metrics to be statistically meaningful.
  • You want automated rollback without human intervention.

Use Progressive Delivery with Flagger when:

  • You already run a service mesh (Istio, Linkerd).
  • You need mesh-level traffic management (header routing, mirroring).
  • You want the most granular control over traffic distribution.

Strategy 4: Traffic Mirroring (Shadow Deployments)

There's a strategy that doesn't get enough attention: traffic mirroring. Instead of sending real user traffic to the new version, you send a copy of production traffic to the canary and compare the responses. Users never see the new version's responses, but you get real-world validation.

How It Works

Client Request ──> [v1 Production] ──> Response to Client
                      └──> [v2 Shadow] ──> Response Discarded (logged for analysis)

Implementation with Istio

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api-server
  namespace: production
spec:
  hosts:
    - api-server
  http:
    - route:
        - destination:
            host: api-server
            subset: stable
          weight: 100
      mirror:
        host: api-server
        subset: canary
      mirrorPercentage:
        value: 100.0
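For the stable and canary subsets to resolve, Istio also needs a DestinationRule mapping them to pod labels. A minimal sketch, assuming the deployments carry a version label as in the earlier manifests:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: api-server
  namespace: production
spec:
  host: api-server
  subsets:
    - name: stable
      labels:
        version: v2.3.0   # label on the current production pods
    - name: canary
      labels:
        version: v2.3.1   # label on the shadow pods
```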

The shadow deployment receives a copy of every request but its responses are discarded. This is perfect for:

  • Testing database-heavy queries under real load patterns
  • Validating new algorithm outputs against the current version
  • Smoke-testing major refactors without any user impact

The catch: mirrored traffic still hits downstream dependencies. If your new version writes to a database, those writes are real. Use read-only database connections or a separate test database for shadow deployments that involve writes.

Deployment Readiness Checklist

Before deploying anything to production, run through this checklist. I've seen every item on this list cause a production incident when skipped.

Check                                 Why It Matters                              How to Verify
Readiness probe configured            Prevents routing traffic to unready pods    kubectl describe deployment
Liveness probe configured             Restarts deadlocked containers              Check probe endpoints respond
Startup probe for slow starters       Prevents liveness kills during startup      initialDelaySeconds + failureThreshold
preStop hook for graceful shutdown    Drains in-flight requests                   lifecycle.preStop in pod spec
Resource requests and limits set      Prevents OOM kills and noisy neighbors      resources.requests / resources.limits
PodDisruptionBudget exists            Prevents too many pods going down at once   kubectl get pdb
Rollback plan documented              Reduces MTTR when things go wrong           Runbook link in deployment manifest
Metrics and alerts in place           Detects issues the deployment introduces    Check Grafana dashboard

PodDisruptionBudget — Don't Skip This

A PDB tells Kubernetes how many pods must remain available during voluntary disruptions (node drains, cluster upgrades, rolling updates):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
  namespace: production
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: api-server

Without a PDB, a node drain during a rolling update could take down more pods than your maxUnavailable setting allows. The PDB adds a hard constraint that Kubernetes respects across all disruption sources.
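The same budget can be expressed from the other direction with minAvailable; with the 5-replica deployment from earlier, this is equivalent:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
  namespace: production
spec:
  minAvailable: 4   # with replicas: 5, same effect as maxUnavailable: 1
  selector:
    matchLabels:
      app: api-server
```

Prefer maxUnavailable for workloads that autoscale — a fixed minAvailable doesn't track the replica count as it changes.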

Graceful Shutdown Pattern

The preStop hook and terminationGracePeriodSeconds work together to prevent dropped requests:

spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            command:
              - /bin/sh
              - -c
              - |
                # Wait for endpoint removal to propagate so load
                # balancers stop sending new traffic; the kubelet
                # sends SIGTERM itself once this hook completes
                sleep 15

The sequence during pod termination:

  1. Pod is marked for deletion
  2. Pod is removed from Service endpoints (but propagation takes time)
  3. preStop hook runs (sleep 15 gives load balancers time to stop sending traffic)
  4. SIGTERM is sent to the main process
  5. App has until terminationGracePeriodSeconds to shut down cleanly
  6. SIGKILL if the app hasn't exited

If your app handles long-running requests (file uploads, WebSocket connections), increase both the preStop sleep and the termination grace period accordingly.
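You can see the SIGTERM half of this sequence without a cluster. This minimal sketch simulates an app that traps SIGTERM and exits cleanly, which is the behavior your service needs within the grace period (plain POSIX shell, no Kubernetes involved):

```shell
#!/bin/sh
# Simulated app: trap SIGTERM, drain, exit 0 within the grace period.
# "sleep 30 & wait" keeps the shell interruptible by the signal.
sh -c 'trap "echo draining; exit 0" TERM; sleep 30 & wait' &
pid=$!
sleep 1             # give the subshell time to install its trap
kill -TERM "$pid"   # what the kubelet sends after preStop completes
wait "$pid"
status=$?
echo "exit status: $status"   # 0 means a clean shutdown, no SIGKILL needed
```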

The Deployment I Wish I'd Done Differently

Early in my career, I rolled out a breaking change to a user-facing API using a standard rolling update. The new version passed every health check — the application started, the endpoints responded, the readiness probe returned 200. But the response payload format had changed, and every client that depended on the old format started failing silently.

By the time we noticed, 100% of pods were on the new version. The rollback took 3 minutes, but the damage was done — thousands of failed requests, corrupted client caches, and a postmortem that concluded with "we should have used canary."

The lesson: health checks tell you if the process is alive. They don't tell you if the service is correct. Canary analysis against real traffic metrics — error rates, latency percentiles, business metrics — catches the failures that health probes miss.

Conclusion

Choose your deployment strategy based on the blast radius you can tolerate. For most production services, that answer should be "as small as possible, verified by metrics, with automatic rollback." That's canary. Build toward it.

Start with rolling updates — they're built in and require no extra tooling. Add proper health checks, preStop hooks, and PodDisruptionBudgets. When you're ready for more control, install Argo Rollouts and implement canary with automated analysis. The progression is natural: each step gives you more confidence and smaller blast radius.

The investment in deployment infrastructure pays for itself not on the good days, but on the bad ones. When a deploy goes wrong at 2 AM, the difference between "automatic rollback in 30 seconds" and "page the on-call engineer who pages the team lead who approves the rollback" is the difference between a blip and an outage.

Whatever strategy you choose, measure your deployment metrics: deployment frequency, lead time for changes, change failure rate, and time to recover. These are the DORA metrics, and they directly correlate with engineering team performance. A team deploying daily with canary analysis and automatic rollback will outship a team deploying weekly with manual verification every time — not because they're moving faster, but because they're moving safer.

Aareez Asif

Senior Kubernetes Architect

10+ years orchestrating containers in production. Battle-tested opinions on everything from pod scheduling to service mesh. I've seen clusters burn and helped rebuild them better.
