Part 4 of 8 in Kubernetes from Zero to Hero

Kubernetes HPA with Custom Metrics: Stop Scaling on CPU Alone

Aareez Asif · 15 min read

CPU-Based Scaling Is a Lie (for Most Workloads)

Here's the thing — if you're scaling your pods based on CPU utilization alone, you're probably doing it wrong. I've watched teams burn through thousands of dollars in cloud spend because their HPA was thrashing pods up and down based on CPU, while the actual bottleneck was request queue depth.

CPU and memory are lagging indicators. By the time CPU spikes, your users have already felt the pain. What you actually want is to scale on leading indicators: request rate, queue length, connection count, or whatever metric tells you "load is coming" before it arrives.

Let me tell you why custom metrics change the game, and how to wire Prometheus into your HPA so your workloads scale on signals that matter.

The Architecture: How Custom Metrics Flow

Before we touch any YAML, you need to understand the data path:

  1. Your application exposes metrics (or a ServiceMonitor scrapes them)
  2. Prometheus collects those metrics
  3. The Prometheus Adapter translates Prometheus queries into the Kubernetes Custom Metrics API
  4. The HPA controller queries the Custom Metrics API to make scaling decisions

If any link in this chain breaks, your HPA sits there doing nothing. I've debugged this exact issue at 2 AM more times than I'd like to admit.

App Pods --> Prometheus --> Prometheus Adapter --> Custom Metrics API --> HPA Controller
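The first link in that chain is on you: the application has to expose something Prometheus can scrape. A minimal sketch using Node with express and prom-client (both real libraries; the route layout and port are illustrative):

// Counter that the adapter rules later in this post will match on.
// Note: the namespace and pod labels come from Prometheus's Kubernetes
// service discovery at scrape time, not from the application itself.
const express = require('express');
const client = require('prom-client');

const app = express();
const httpRequests = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests handled',
  labelNames: ['method', 'path'],
});

// Count every request as it arrives
app.use((req, res, next) => {
  httpRequests.inc({ method: req.method, path: req.path });
  next();
});

// Prometheus scrapes this endpoint (wired up via a ServiceMonitor or scrape
// annotations); register.metrics() returns a promise in prom-client v13+
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(3000);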

Step 1: Deploy the Prometheus Adapter

The Prometheus Adapter bridges Prometheus and the Kubernetes metrics API. Install it with Helm:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --set prometheus.url=http://prometheus-server.monitoring.svc \
  --set prometheus.port=9090

Here's the thing most guides skip: the adapter needs to be able to reach your Prometheus instance. If you're running Prometheus Operator, the URL is usually http://prometheus-operated.monitoring.svc:9090. Get this wrong and every custom metric query returns empty.
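A quick way to confirm the adapter's target URL actually answers from inside the cluster is a throwaway busybox pod (swap in whatever Prometheus service you run):

kubectl run prom-check --rm -i --restart=Never -n monitoring \
  --image=busybox -- wget -qO- http://prometheus-server.monitoring.svc:9090/-/ready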

Verify the adapter is registered:

kubectl get apiservices | grep custom.metrics

You should see:

v1beta1.custom.metrics.k8s.io   monitoring/prometheus-adapter   True    5m

If that shows False in the Available column, check the adapter pod logs immediately.
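These two commands surface the usual culprits, from TLS errors and a wrong service reference to an adapter crash loop:

kubectl describe apiservice v1beta1.custom.metrics.k8s.io
kubectl logs -n monitoring deploy/prometheus-adapter --tail=50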

Step 2: Configure Metric Discovery Rules

This is where people get stuck. The adapter needs rules that tell it how to translate Prometheus metrics into Kubernetes-style metrics. Here's a real-world configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter
  namespace: monitoring
data:
  config.yaml: |
    rules:
    # Rule for HTTP request rate per pod
    - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
      seriesFilters: []
      resources:
        overrides:
          namespace:
            resource: namespace
          pod:
            resource: pod
      name:
        matches: "^(.*)_total$"
        as: "${1}_per_second"
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'

    # Rule for queue depth
    - seriesQuery: 'rabbitmq_queue_messages{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace:
            resource: namespace
          pod:
            resource: pod
      name:
        matches: "^(.*)$"
        as: "${1}"
      metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

    # Rule for active WebSocket connections
    - seriesQuery: 'websocket_active_connections{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace:
            resource: namespace
          pod:
            resource: pod
      name:
        matches: "^(.*)$"
        as: "${1}"
      metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

Let me break down what's happening here because this syntax is not intuitive:

  • seriesQuery: The Prometheus metric name with required label filters
  • resources.overrides: Maps Prometheus labels to Kubernetes resource types — this is how the adapter knows which pod or namespace the metric belongs to
  • name: Transforms the Prometheus metric name into the custom metric name
  • metricsQuery: The actual PromQL query, with template variables that the adapter fills in

The <<.Series>>, <<.LabelMatchers>>, and <<.GroupBy>> placeholders are critical. The adapter substitutes these at query time based on which pods the HPA is asking about.
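To make the templating concrete, here is roughly what the first rule expands to when the HPA asks about two pods of a deployment in production (an illustrative expansion; the pod names are made up):

sum(rate(http_requests_total{namespace="production",pod=~"api-server-abc12|api-server-def34"}[2m])) by (pod)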

Step 3: Verify Custom Metrics Are Available

After deploying the config, you need to restart the adapter so it picks up the new rules.
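A rollout restart is the simplest way (the deployment name matches the Helm install above):

kubectl rollout restart deployment/prometheus-adapter -n monitoring

Then check what the custom metrics API advertises: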

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq '.resources[].name'

You should see your metrics listed. To query a specific metric:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/production/pods/*/http_requests_per_second" | jq .

If this returns empty, your seriesQuery doesn't match any actual Prometheus series. Go back and verify the metric exists in Prometheus first:

http_requests_total{namespace="production",pod=~"api-server.*"}

Step 4: Create the HPA with Custom Metrics

Now the part you've been waiting for. Here's an HPA that scales on HTTP request rate instead of CPU:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
      - type: Pods
        value: 4
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
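
For reference, the replica math behind that AverageValue target follows the standard HPA algorithm:

desiredReplicas = ceil( sum(http_requests_per_second across pods) / 100 )

If the deployment's pods serve 1,500 requests/s in total, the HPA asks for ceil(1500 / 100) = 15 replicas, clamped to the 3-20 bounds.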

Let me tell you why the behavior section matters more than most people think. Without it, the HPA uses defaults that scale down aggressively. I once watched a deployment scale from 15 pods to 3 in sixty seconds during a traffic lull, then get crushed when the next wave hit. The stabilization window and gradual scale-down policy prevent that whiplash.

Combining Multiple Metrics

In production, you rarely want to scale on a single metric. Here's a more realistic example that considers both request rate and response latency:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 25
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
  - type: Pods
    pods:
      metric:
        name: http_request_duration_seconds_p95
      target:
        type: AverageValue
        averageValue: "0.5"
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

When you specify multiple metrics, the HPA evaluates each one independently and picks the highest replica count. This is a "scale to the worst case" approach, and it's exactly what you want. If request rate is fine but latency is spiking, you still scale up.
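A quick illustration: if the request-rate metric alone recommends 8 replicas, the p95 latency metric recommends 12, and CPU recommends 6, the HPA sets the deployment to 12.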

Debugging When Things Go Wrong

Here's the thing about custom metrics HPA — when it doesn't work, the error messages are terrible. Here's my debugging checklist:

# 1. Check the HPA status and conditions
kubectl describe hpa api-server-hpa -n production

# 2. Look for "unable to fetch metrics" errors
kubectl get hpa api-server-hpa -n production -o yaml | grep -A5 conditions

# 3. Verify the adapter can see the metric
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/production/pods/*/http_requests_per_second"

# 4. Check the adapter logs
kubectl logs -n monitoring deploy/prometheus-adapter --tail=50

# 5. Verify the metric exists in Prometheus
kubectl port-forward -n monitoring svc/prometheus-server 9090:9090
# Then query: http_requests_total{namespace="production"}

The most common failure mode I see: the Prometheus metric has labels that don't match the adapter's resources.overrides mapping. If your metric uses kubernetes_pod_name instead of pod, the adapter can't map it to a Kubernetes pod resource, and the HPA gets nothing.
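The fix is to point the overrides at the labels you actually have. A sketch, assuming your series carry kubernetes_namespace and kubernetes_pod_name labels:

resources:
  overrides:
    kubernetes_namespace:
      resource: namespace
    kubernetes_pod_name:
      resource: pod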

External Metrics: Scaling on Non-Pod Metrics

Custom metrics are tied to Kubernetes objects (pods, services). But sometimes the signal you need doesn't come from a pod at all. External metrics let you scale based on any metric — a cloud queue length, a database connection count, or a third-party API response time.

Scaling on SQS Queue Depth

This is one of the most common patterns I deploy. A worker deployment scales based on how many messages are waiting in an SQS queue:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: External
    external:
      metric:
        name: sqs_queue_visible_messages
        selector:
          matchLabels:
            queue_name: "order-processing"
      target:
        type: AverageValue
        averageValue: "20"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately for queues
      policies:
      - type: Pods
        value: 10
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 25
        periodSeconds: 120

The AverageValue of 20 means: for every 20 messages in the queue, maintain one pod. If the queue has 200 messages, the HPA targets 10 pods. This keeps the queue draining at a consistent rate regardless of depth.

The adapter configuration for external metrics is slightly different: external rules live under their own externalRules key, not rules:

externalRules:
- seriesQuery: 'aws_sqs_approximate_number_of_messages_visible'
  seriesFilters: []
  resources:
    overrides:
      namespace:
        resource: namespace
  name:
    matches: "^aws_sqs_(.*)$"
    as: "sqs_${1}"
  metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>})'
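
Once the external rule is loaded, confirm the external metrics API is registered and serving (the Helm chart registers v1beta1.external.metrics.k8s.io when external rules are enabled):

kubectl get apiservices | grep external.metrics
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq '.resources[].name'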

Scaling on Kafka Consumer Lag

For event-driven architectures using Kafka, consumer lag is the metric that matters:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kafka-consumer-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kafka-consumer
  minReplicas: 3
  maxReplicas: 30
  metrics:
  - type: External
    external:
      metric:
        name: kafka_consumer_lag
        selector:
          matchLabels:
            consumer_group: "order-processor"
            topic: "orders"
      target:
        type: Value
        value: "1000"
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Note the use of type: Value instead of type: AverageValue for Kafka lag. Consumer lag is a total across all partitions, not a per-pod metric. Setting the target to 1000 means the HPA will scale up whenever total lag exceeds 1000 messages.
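The math for a Value target scales proportionally from the current state: desiredReplicas = ceil(currentReplicas * currentValue / targetValue). Three consumers with a total lag of 3,000 against the 1,000 target become ceil(3 * 3.0) = 9.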

KEDA: The Alternative to Prometheus Adapter

If the Prometheus Adapter feels like too much plumbing, KEDA (Kubernetes Event-Driven Autoscaling) provides a more integrated solution. It supports 60+ scalers out of the box, including AWS SQS, Kafka, RabbitMQ, PostgreSQL, and Prometheus.

Installing KEDA

helm repo add kedacore https://kedacore.github.io/charts
helm repo update

helm install keda kedacore/keda \
  --namespace keda \
  --create-namespace
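
Sanity-check the install before creating any ScaledObjects:

kubectl get pods -n keda
kubectl get crd | grep keda.sh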

KEDA ScaledObject for Prometheus Metrics

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-server-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: api-server
  minReplicaCount: 3
  maxReplicaCount: 25
  pollingInterval: 15
  cooldownPeriod: 300
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: http_requests_per_second
        query: |
          sum(rate(http_requests_total{
            namespace="production",
            deployment="api-server"
          }[2m]))
        threshold: "100"
        activationThreshold: "10"
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: http_p99_latency
        query: |
          histogram_quantile(0.99,
            sum by (le) (
              rate(http_request_duration_seconds_bucket{
                namespace="production",
                deployment="api-server"
              }[2m])
            )
          )
        threshold: "0.5"

The activationThreshold is KEDA's killer feature for cost optimization. When the metric drops below this threshold, KEDA can scale the deployment to zero. This is perfect for dev/staging environments or batch workloads that only need to run when there's work to do.
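Note that the ScaledObject above keeps minReplicaCount at 3, so it never actually idles. Scale-to-zero requires minReplicaCount: 0. A minimal sketch for a staging environment (names and thresholds are illustrative):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: staging-api-scaler
  namespace: staging
spec:
  scaleTargetRef:
    name: api-server
  minReplicaCount: 0        # allow KEDA to deactivate the workload entirely
  maxReplicaCount: 5
  cooldownPeriod: 300       # wait 5 minutes of inactivity before dropping to zero
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: staging_http_requests_per_second
        query: sum(rate(http_requests_total{namespace="staging"}[2m]))
        threshold: "50"
        activationThreshold: "1"   # below 1 req/s, scale to zero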

KEDA vs Prometheus Adapter: When to Use Which

Feature                 Prometheus Adapter            KEDA
Scale to zero           No                            Yes
Built-in scalers        Prometheus only               60+ (SQS, Kafka, etc.)
HPA integration         Native (custom metrics API)   Creates HPA automatically
Configuration           ConfigMap + rules             CRD per workload
Operational overhead    Lower (one adapter)           Higher (operator + CRDs)
Community adoption      Moderate                      Growing rapidly

My rule of thumb: if you only need Prometheus-based metrics and never scale to zero, the Prometheus Adapter is simpler. If you need event-driven scaling, scale-to-zero, or integration with cloud-native queues, KEDA is the better choice.

Advanced Behavior Configuration

The behavior section of the HPA spec is where you tune scaling responsiveness. Most guides skip this, but it's the difference between a stable system and one that thrashes.

Understanding Stabilization Windows

behavior:
  scaleUp:
    stabilizationWindowSeconds: 30
    selectPolicy: Max
    policies:
    - type: Pods
      value: 4
      periodSeconds: 60
    - type: Percent
      value: 100
      periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300
    selectPolicy: Min
    policies:
    - type: Pods
      value: 2
      periodSeconds: 120
    - type: Percent
      value: 10
      periodSeconds: 120

Here's what each field does:

  • stabilizationWindowSeconds: The HPA looks at all recommended replica counts within this window and picks the most conservative one: the lowest recommendation when scaling up and the highest when scaling down. A 300-second scale-down window means the HPA won't reduce replicas until every recommendation from the past 5 full minutes sits below the current count (see the worked example after this list).

  • selectPolicy: When multiple policies are defined, Max picks the policy that allows the most change (aggressive), Min picks the policy that allows the least change (conservative). Use Max for scale-up (respond quickly to load) and Min for scale-down (be cautious about removing capacity).

  • Pods vs Percent policies: Pod-based policies set an absolute limit ("add at most 4 pods per minute"). Percent-based policies scale relative to current size ("add at most 100% of current pods"). Having both gives you bounded behavior at any scale.
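
A worked example of the scale-down window: with a 300-second window and successive recommendations of 10, 8, 12, then 9 replicas, the HPA holds at 12 (the highest value in the window) and only starts shrinking once that 12 ages out.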

Behavior Profiles for Common Workloads

API servers — scale up fast, scale down slowly:

behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Percent
      value: 100
      periodSeconds: 30
  scaleDown:
    stabilizationWindowSeconds: 600
    policies:
    - type: Percent
      value: 10
      periodSeconds: 120

Background workers — scale both directions moderately:

behavior:
  scaleUp:
    stabilizationWindowSeconds: 60
    policies:
    - type: Pods
      value: 5
      periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 120
    policies:
    - type: Pods
      value: 3
      periodSeconds: 60

Batch processors with KEDA — scale up aggressively, scale to zero when idle:

behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Pods
      value: 20
      periodSeconds: 30
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent
      value: 50
      periodSeconds: 60

Monitoring HPA Health

An HPA that's silently failing is worse than no HPA at all. Set up monitoring for the autoscaler itself.

Prometheus Alerts for HPA Issues

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hpa-alerts
  namespace: monitoring
spec:
  groups:
    - name: hpa.rules
      rules:
        - alert: HPAMaxedOut
          expr: |
            kube_horizontalpodautoscaler_status_current_replicas
            ==
            kube_horizontalpodautoscaler_spec_max_replicas
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} is at max replicas"
            description: "The HPA has been at its maximum replica count for 15 minutes. The workload may need a higher max or the scaling metric target needs adjustment."

        - alert: HPAUnableToScale
          expr: |
            kube_horizontalpodautoscaler_status_condition{condition="AbleToScale",status="false"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} cannot scale"
            description: "The HPA reports AbleToScale=false. The controller is blocked from scaling, e.g. it cannot fetch or update the target's scale subresource."

        - alert: HPAMetricUnavailable
          expr: |
            kube_horizontalpodautoscaler_status_condition{condition="ScalingActive",status="false"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} cannot fetch metrics"
            description: "ScalingActive=false almost always means the scaling metric cannot be retrieved. Check the Prometheus Adapter and the custom metrics API."

Grafana Dashboard Queries

Add these panels to your autoscaling dashboard:

# Current vs desired replicas (shows scaling lag)
kube_horizontalpodautoscaler_status_current_replicas{namespace="production"}
kube_horizontalpodautoscaler_status_desired_replicas{namespace="production"}

# Scaling events over time
changes(kube_horizontalpodautoscaler_status_current_replicas{namespace="production"}[1h])

# HPA target vs actual metric values
kube_horizontalpodautoscaler_spec_target_metric{namespace="production"}
kube_horizontalpodautoscaler_status_target_metric{namespace="production"}

Load Testing Your HPA Configuration

Never ship HPA config to production without testing it under synthetic load. Here's my testing workflow.

Using k6 for Load Testing

# Install k6
brew install k6

# Create a load test script
cat > load-test.js <<'EOF'
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 50 },   // Ramp up
    { duration: '5m', target: 50 },   // Sustain
    { duration: '2m', target: 200 },  // Spike
    { duration: '5m', target: 200 },  // Sustain spike
    { duration: '3m', target: 0 },    // Ramp down
  ],
};

export default function () {
  const res = http.get('http://api-server.production.svc:3000/api/data');
  check(res, {
    'status is 200': (r) => r.status === 200,
    'latency < 500ms': (r) => r.timings.duration < 500,
  });
  sleep(0.1);
}
EOF

# Run from inside the cluster
kubectl run k6-test --rm -i --restart=Never \
  --image=grafana/k6 \
  -n production \
  -- run - < load-test.js

Watching HPA During Load Tests

Open a terminal and watch the HPA respond in real time:

# Watch HPA scaling decisions
kubectl get hpa -n production -w

# In another terminal, watch pod count
kubectl get pods -n production -l app=api-server -w

# Check HPA events for scaling decisions
kubectl describe hpa api-server-hpa -n production | tail -20

What to look for during the test (the events command after this list helps timestamp the first two items):

  1. Scale-up latency: How long between the load increase and the first new pod? Target under 2 minutes.
  2. Overshoot: Does the HPA create too many pods? Check if the stabilization window is too short.
  3. Scale-down timing: After load drops, how long before pods are removed? Ensure it's not too aggressive.
  4. Metric accuracy: Do the HPA's reported metric values match what you see in Prometheus?
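
To put timestamps on the scale-up latency and overshoot checks, pull the HPA's event stream:

kubectl get events -n production \
  --field-selector involvedObject.name=api-server-hpa \
  --sort-by=.lastTimestamp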

Production Recommendations

After running custom metrics HPA across dozens of clusters, here's what I've learned:

  1. Always keep a CPU/memory metric as a fallback. If your custom metrics pipeline breaks, you still want basic autoscaling.

  2. Set sensible min/max replicas. A minReplicas of 1 is asking for trouble. Keep at least 2-3 for availability.

  3. Use stabilization windows. Scale up fast (30-60s window), scale down slow (300-600s window). Traffic is bursty, and you don't want to shed capacity prematurely.

  4. Monitor the HPA itself. Set up alerts for when the HPA reports the ScalingLimited condition or emits FailedGetPodsMetric events.

  5. Test with load generators. Before going to production, use tools like hey or k6 to verify the HPA responds correctly to load patterns.

  6. Watch out for metric cardinality. If your Prometheus adapter config is too broad, it'll try to register thousands of metrics with the API server. Be explicit about which series you want.

  7. Version your adapter config. Treat the Prometheus Adapter ConfigMap like application code. Review changes in PRs, test in staging first.

  8. Set resource requests on the adapter. The Prometheus Adapter itself needs resources. Underprovisioned adapters return slow or empty responses, and the HPA logs "unable to fetch metrics" without telling you why.

# Prometheus Adapter resource requirements
resources:
  requests:
    memory: 128Mi
    cpu: 100m
  limits:
    memory: 256Mi
    cpu: 500m

Final Thoughts

Custom metrics HPA is one of those features that separates "we run Kubernetes" from "we run Kubernetes well." CPU-based scaling is a blunt instrument. Your workloads deserve better.

The setup isn't trivial — there's a real pipeline to build and maintain. But once it's running, your applications scale on the signals that actually predict capacity needs, not trailing indicators that tell you about problems after they've already started.

Start with one workload, one custom metric, and get comfortable with the debugging workflow. Then expand from there. That's how you build confidence in the system without gambling your production stability.

The investment pays for itself the first time your API scales up 30 seconds before the traffic spike hits instead of 2 minutes after users start seeing errors. That's the difference between proactive scaling and reactive firefighting.

And remember: autoscaling is not a substitute for capacity planning. Custom metrics tell you when to scale, but you still need to understand your workload's baseline requirements, set appropriate min/max bounds, and ensure your cluster has enough headroom to accommodate new pods. The best HPA configuration in the world does nothing if there's no node capacity to schedule the pods it requests.

Aareez Asif

Senior Kubernetes Architect

10+ years orchestrating containers in production. Battle-tested opinions on everything from pod scheduling to service mesh. I've seen clusters burn and helped rebuild them better.
