Kubernetes HPA with Custom Metrics: Stop Scaling on CPU Alone
CPU-Based Scaling Is a Lie (for Most Workloads)
Here's the thing — if you're scaling your pods based on CPU utilization alone, you're probably doing it wrong. I've watched teams burn through thousands of dollars in cloud spend because their HPA was thrashing pods up and down based on CPU, while the actual bottleneck was request queue depth.
CPU and memory are lagging indicators. By the time CPU spikes, your users have already felt the pain. What you actually want is to scale on leading indicators: request rate, queue length, connection count, or whatever metric tells you "load is coming" before it arrives.
Let me tell you why custom metrics change the game, and how to wire Prometheus into your HPA so your workloads scale on signals that matter.
The Architecture: How Custom Metrics Flow
Before we touch any YAML, you need to understand the data path:
- Your application exposes metrics (or a ServiceMonitor scrapes them)
- Prometheus collects those metrics
- The Prometheus Adapter translates Prometheus queries into the Kubernetes Custom Metrics API
- The HPA controller queries the Custom Metrics API to make scaling decisions
If any link in this chain breaks, your HPA sits there doing nothing. I've debugged this exact issue at 2 AM more times than I'd like to admit.
```
App Pods --> Prometheus --> Prometheus Adapter --> Custom Metrics API --> HPA Controller
```
Step 1: Deploy the Prometheus Adapter
The Prometheus Adapter bridges Prometheus and the Kubernetes metrics API. Install it with Helm:
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --set prometheus.url=http://prometheus-server.monitoring.svc \
  --set prometheus.port=9090
```
Here's the thing most guides skip: the adapter needs to be able to reach your Prometheus instance. If you're running Prometheus Operator, the URL is usually http://prometheus-operated.monitoring.svc:9090. Get this wrong and every custom metric query returns empty.
Verify the adapter is registered:
```bash
kubectl get apiservices | grep custom.metrics
```
You should see:
```
NAME                            SERVICE                         AVAILABLE   AGE
v1beta1.custom.metrics.k8s.io   monitoring/prometheus-adapter   True        5m
```
If that shows False in the Available column, check the adapter pod logs immediately.
Step 2: Configure Metric Discovery Rules
This is where people get stuck. The adapter needs rules that tell it how to translate Prometheus metrics into Kubernetes-style metrics. Here's a real-world configuration:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter
  namespace: monitoring
data:
  config.yaml: |
    rules:
      # Rule for HTTP request rate per pod
      - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
        seriesFilters: []
        resources:
          overrides:
            namespace:
              resource: namespace
            pod:
              resource: pod
        name:
          matches: "^(.*)_total$"
          as: "${1}_per_second"
        metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'

      # Rule for queue depth
      - seriesQuery: 'rabbitmq_queue_messages{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace:
              resource: namespace
            pod:
              resource: pod
        name:
          matches: "^(.*)$"
          as: "${1}"
        metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

      # Rule for active WebSocket connections
      - seriesQuery: 'websocket_active_connections{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace:
              resource: namespace
            pod:
              resource: pod
        name:
          matches: "^(.*)$"
          as: "${1}"
        metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```
Let me break down what's happening here because this syntax is not intuitive:
- `seriesQuery`: the Prometheus metric name with required label filters
- `resources.overrides`: maps Prometheus labels to Kubernetes resource types — this is how the adapter knows which pod or namespace the metric belongs to
- `name`: transforms the Prometheus metric name into the custom metric name
- `metricsQuery`: the actual PromQL query, with template variables that the adapter fills in

The `<<.Series>>`, `<<.LabelMatchers>>`, and `<<.GroupBy>>` placeholders are critical. The adapter substitutes these at query time based on which pods the HPA is asking about.
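To make the templating concrete: suppose the HPA asks the adapter for `http_requests_per_second` on two pods of a deployment in `production`. The adapter would expand the first rule's `metricsQuery` into something like the following (the pod names here are hypothetical, purely for illustration):

```promql
# <<.Series>> -> http_requests_total
# <<.LabelMatchers>> -> namespace and the specific pods being asked about
# <<.GroupBy>> -> pod
sum(rate(http_requests_total{namespace="production",pod=~"api-server-abc|api-server-def"}[2m])) by (pod)
```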
Step 3: Verify Custom Metrics Are Available
After deploying the config, restart the adapter and check:
```bash
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq '.resources[].name'
```
You should see your metrics listed. To query a specific metric:
```bash
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/production/pods/*/http_requests_per_second" | jq .
```
If this returns empty, your seriesQuery doesn't match any actual Prometheus series. Go back and verify the metric exists in Prometheus first:
```promql
http_requests_total{namespace="production",pod=~"api-server.*"}
```
Step 4: Create the HPA with Custom Metrics
Now the part you've been waiting for. Here's an HPA that scales on HTTP request rate instead of CPU:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
```
Let me tell you why the behavior section matters more than most people think. Without it, the HPA uses defaults that scale down aggressively. I once watched a deployment scale from 15 pods to 3 in sixty seconds during a traffic lull, then get crushed when the next wave hit. The stabilization window and gradual scale-down policy prevent that whiplash.
Combining Multiple Metrics
In production, you rarely want to scale on a single metric. Here's a more realistic example that considers both request rate and response latency:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 25
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
    - type: Pods
      pods:
        metric:
          name: http_request_duration_seconds_p95
        target:
          type: AverageValue
          averageValue: "0.5"
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
When you specify multiple metrics, the HPA evaluates each one independently and picks the highest replica count. This is a "scale to the worst case" approach, and it's exactly what you want. If request rate is fine but latency is spiking, you still scale up.
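You can sanity-check the "highest wins" behavior with the HPA's core formula, desiredReplicas = ceil(currentReplicas × currentValue / targetValue), evaluated per metric. A sketch with hypothetical snapshot values (the numbers are made up for illustration):

```shell
#!/usr/bin/env bash
# Per-metric recommendation: desired = ceil(currentReplicas * currentValue / targetValue)
current_replicas=5

# Integer ceiling division: ceil(a / b)
ceil_div() { echo $(( ($1 + $2 - 1) / $2 )); }

# Request rate: 120 req/s per pod average vs target 100 -> total 600 -> ceil(600/100) = 6
rps_desired=$(ceil_div $((120 * current_replicas)) 100)

# p95 latency: 0.9s vs target 0.5s (in milliseconds to stay in integer math) -> ceil(5*900/500) = 9
lat_desired=$(ceil_div $((current_replicas * 900)) 500)

# CPU: 60% vs target 70% -> ceil(5*60/70) = 5 (no scale-down recommended here)
cpu_desired=$(ceil_div $((current_replicas * 60)) 70)

# The HPA takes the maximum of all per-metric recommendations
desired=$rps_desired
for d in $lat_desired $cpu_desired; do (( d > desired )) && desired=$d; done
echo "per-metric: $rps_desired $lat_desired $cpu_desired -> scale to $desired"
```

Even though CPU alone would keep the deployment at 5 replicas, the latency metric drives it to 9.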
Debugging When Things Go Wrong
Here's the thing about custom metrics HPA — when it doesn't work, the error messages are terrible. Here's my debugging checklist:
```bash
# 1. Check the HPA status and conditions
kubectl describe hpa api-server-hpa -n production

# 2. Look for "unable to fetch metrics" errors
kubectl get hpa api-server-hpa -n production -o yaml | grep -A5 conditions

# 3. Verify the adapter can see the metric
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/production/pods/*/http_requests_per_second"

# 4. Check the adapter logs
kubectl logs -n monitoring deploy/prometheus-adapter --tail=50

# 5. Verify the metric exists in Prometheus
kubectl port-forward -n monitoring svc/prometheus-server 9090:9090
# Then query: http_requests_total{namespace="production"}
```
The most common failure mode I see: the Prometheus metric has labels that don't match the adapter's `resources.overrides` mapping. If your metric uses `kubernetes_pod_name` instead of `pod`, the adapter can't map it to a Kubernetes pod resource, and the HPA gets nothing.
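If you can't relabel at the source, one way out is to point the override at the label you actually have. A sketch, assuming the metric carries `kubernetes_pod_name` rather than `pod` (the override key is simply the Prometheus label name):

```yaml
rules:
  - seriesQuery: 'http_requests_total{namespace!="",kubernetes_pod_name!=""}'
    resources:
      overrides:
        namespace:
          resource: namespace
        kubernetes_pod_name:   # nonstandard label mapped to the pod resource
          resource: pod
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```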
External Metrics: Scaling on Non-Pod Metrics
Custom metrics are tied to Kubernetes objects (pods, services). But sometimes the signal you need doesn't come from a pod at all. External metrics let you scale based on any metric — a cloud queue length, a database connection count, or a third-party API response time.
Scaling on SQS Queue Depth
This is one of the most common patterns I deploy. A worker deployment scales based on how many messages are waiting in an SQS queue:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: sqs_queue_visible_messages
          selector:
            matchLabels:
              queue_name: "order-processing"
        target:
          type: AverageValue
          averageValue: "20"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately for queues
      policies:
        - type: Pods
          value: 10
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 120
```
The AverageValue of 20 means: for every 20 messages in the queue, maintain one pod. If the queue has 200 messages, the HPA targets 10 pods. This keeps the queue draining at a consistent rate regardless of depth.
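The arithmetic behind that, with the clamping to minReplicas/maxReplicas included, works out as follows (a sketch with the numbers from the example above):

```shell
#!/usr/bin/env bash
# AverageValue math for external metrics: desired = ceil(totalQueueDepth / targetPerPod)
target_per_pod=20
queue_depth=200

desired=$(( (queue_depth + target_per_pod - 1) / target_per_pod ))  # ceil(200/20) = 10

# The HPA then clamps the result to the configured bounds
min_replicas=2; max_replicas=50
(( desired < min_replicas )) && desired=$min_replicas
(( desired > max_replicas )) && desired=$max_replicas

echo "$queue_depth messages at $target_per_pod per pod -> $desired workers"
```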
The adapter configuration for external metrics is slightly different: external metrics live under `externalRules` rather than `rules`, and the exposed name must match what the HPA asks for:

```yaml
externalRules:
  - seriesQuery: 'aws_sqs_approximate_number_of_messages_visible'
    seriesFilters: []
    resources:
      overrides:
        namespace:
          resource: namespace
    name:
      matches: "^aws_sqs_approximate_number_of_messages_visible$"
      as: "sqs_queue_visible_messages"  # must match the metric name in the HPA
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>})'
```
Scaling on Kafka Consumer Lag
For event-driven architectures using Kafka, consumer lag is the metric that matters:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kafka-consumer-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kafka-consumer
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: External
      external:
        metric:
          name: kafka_consumer_lag
          selector:
            matchLabels:
              consumer_group: "order-processor"
              topic: "orders"
        target:
          type: Value
          value: "1000"
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
Note the use of `type: Value` instead of `type: AverageValue` for Kafka lag. Consumer lag is a total across all partitions, not a per-pod metric. Setting the target to 1000 means the HPA will scale up whenever total lag exceeds 1000 messages.
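The resulting math differs from the AverageValue case: the ratio of the raw total to the target is multiplied by the current replica count. A quick sketch with hypothetical numbers (and one caveat worth remembering: consumers beyond the topic's partition count sit idle, so keep maxReplicas at or below the partition count):

```shell
#!/usr/bin/env bash
# Value math for external metrics: desired = ceil(currentReplicas * currentValue / target)
current_replicas=3
lag=3000      # hypothetical total lag across partitions
target=1000

desired=$(( (current_replicas * lag + target - 1) / target ))  # ceil(3 * 3000 / 1000) = 9
echo "lag=$lag -> scale from $current_replicas to $desired consumers"
```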
KEDA: The Alternative to Prometheus Adapter
If the Prometheus Adapter feels like too much plumbing, KEDA (Kubernetes Event-Driven Autoscaling) provides a more integrated solution. It supports 60+ scalers out of the box, including AWS SQS, Kafka, RabbitMQ, PostgreSQL, and Prometheus.
Installing KEDA
```bash
helm repo add kedacore https://kedacore.github.io/charts
helm repo update

helm install keda kedacore/keda \
  --namespace keda \
  --create-namespace
```
KEDA ScaledObject for Prometheus Metrics
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-server-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: api-server
  minReplicaCount: 3
  maxReplicaCount: 25
  pollingInterval: 15
  cooldownPeriod: 300
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: http_requests_per_second
        query: |
          sum(rate(http_requests_total{
            namespace="production",
            deployment="api-server"
          }[2m]))
        threshold: "100"
        activationThreshold: "10"
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: http_p99_latency
        query: |
          histogram_quantile(0.99,
            sum by (le) (
              rate(http_request_duration_seconds_bucket{
                namespace="production",
                deployment="api-server"
              }[2m])
            )
          )
        threshold: "0.5"
```
The activationThreshold is KEDA's killer feature for cost optimization. When the metric sits below this threshold, KEDA can scale the deployment all the way to zero, provided minReplicaCount is 0 (the example above keeps a floor of 3). This is perfect for dev/staging environments or batch workloads that only need to run when there's work to do.
KEDA vs Prometheus Adapter: When to Use Which
| Feature | Prometheus Adapter | KEDA |
|---|---|---|
| Scale to zero | No | Yes |
| Built-in scalers | Prometheus only | 60+ (SQS, Kafka, etc.) |
| HPA integration | Native (custom metrics API) | Creates HPA automatically |
| Configuration | ConfigMap + rules | CRD per workload |
| Operational overhead | Lower (one adapter) | Higher (operator + CRDs) |
| Community adoption | Moderate | Growing rapidly |
My rule of thumb: if you only need Prometheus-based metrics and never scale to zero, the Prometheus Adapter is simpler. If you need event-driven scaling, scale-to-zero, or integration with cloud-native queues, KEDA is the better choice.
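To make the comparison concrete, here's how the SQS worker from earlier might look as a KEDA ScaledObject. This is a sketch: the queue URL uses a placeholder account, and the `aws-sqs-queue` scaler still needs AWS credentials supplied separately (for example via a KEDA TriggerAuthentication):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-worker-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: queue-worker
  minReplicaCount: 0   # scale to zero when the queue is empty
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/order-processing  # placeholder account
        queueLength: "20"   # same "messages per pod" target as the HPA version
        awsRegion: us-east-1
```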
Advanced Behavior Configuration
The behavior section of the HPA spec is where you tune scaling responsiveness. Most guides skip this, but it's the difference between a stable system and one that thrashes.
Understanding Stabilization Windows
```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 30
    selectPolicy: Max
    policies:
      - type: Pods
        value: 4
        periodSeconds: 60
      - type: Percent
        value: 100
        periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300
    selectPolicy: Min
    policies:
      - type: Pods
        value: 2
        periodSeconds: 120
      - type: Percent
        value: 10
        periodSeconds: 120
```
Here's what each field does:

- `stabilizationWindowSeconds`: the HPA looks at all recommended replica counts within this window and picks the most conservative one: the lowest for scale-up, the highest for scale-down. A 300-second scale-down window means the HPA won't drop below the highest replica count recommended in the last 5 minutes.
- `selectPolicy`: when multiple policies are defined, `Max` picks the policy that allows the most change (aggressive) and `Min` picks the policy that allows the least change (conservative). Use `Max` for scale-up (respond quickly to load) and `Min` for scale-down (be cautious about removing capacity).
- Pods vs Percent policies: pod-based policies set an absolute limit ("add at most 4 pods per minute"), while percent-based policies scale relative to current size ("add at most 100% of current pods"). Having both gives you bounded behavior at any scale.
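Simulating the config above with a hypothetical current size shows how the two policies interact each evaluation period:

```shell
#!/usr/bin/env bash
# How selectPolicy arbitrates between Pods and Percent policies (hypothetical current size)
current=10

# scaleUp with selectPolicy: Max -- the more permissive limit wins
pods_limit=$(( current + 4 ))    # "add at most 4 pods"        -> 14
percent_limit=$(( current * 2 )) # "add at most 100% of pods"  -> 20
up_limit=$pods_limit
(( percent_limit > up_limit )) && up_limit=$percent_limit
echo "scale-up ceiling this period: $up_limit"

# scaleDown with selectPolicy: Min -- the more conservative limit wins
pods_floor=$(( current - 2 ))                      # "remove at most 2 pods" -> 8
percent_floor=$(( current - current * 10 / 100 ))  # "remove at most 10%"    -> 9
down_limit=$pods_floor
(( percent_floor > down_limit )) && down_limit=$percent_floor
echo "scale-down floor this period: $down_limit"
```

At 10 replicas the Percent policy dominates scale-up (up to 20) while the 10% policy constrains scale-down to shedding a single pod per period.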
Behavior Profiles for Common Workloads
API servers — scale up fast, scale down slowly:
```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
      - type: Percent
        value: 100
        periodSeconds: 30
  scaleDown:
    stabilizationWindowSeconds: 600
    policies:
      - type: Percent
        value: 10
        periodSeconds: 120
```
Background workers — scale both directions moderately:
```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60
    policies:
      - type: Pods
        value: 5
        periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 120
    policies:
      - type: Pods
        value: 3
        periodSeconds: 60
```
Batch processors with KEDA — scale up aggressively, scale to zero when idle:
```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
      - type: Pods
        value: 20
        periodSeconds: 30
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
      - type: Percent
        value: 50
        periodSeconds: 60
```
Monitoring HPA Health
An HPA that's silently failing is worse than no HPA at all. Set up monitoring for the autoscaler itself.
Prometheus Alerts for HPA Issues
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hpa-alerts
  namespace: monitoring
spec:
  groups:
    - name: hpa.rules
      rules:
        - alert: HPAMaxedOut
          expr: |
            kube_horizontalpodautoscaler_status_current_replicas
              ==
            kube_horizontalpodautoscaler_spec_max_replicas
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} is at max replicas"
            description: "The HPA has been at its maximum replica count for 15 minutes. The workload may need a higher max or the scaling metric target needs adjustment."
        - alert: HPAUnableToScale
          expr: |
            kube_horizontalpodautoscaler_status_condition{condition="AbleToScale",status="false"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} cannot scale"
            description: "The HPA reports AbleToScale=false, so it cannot update the target's replica count."
        - alert: HPAMetricUnavailable
          expr: |
            kube_horizontalpodautoscaler_status_condition{condition="ScalingActive",status="false"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} cannot fetch metrics"
            description: "The HPA reports ScalingActive=false, which usually means the scaling metrics cannot be retrieved."
```
Grafana Dashboard Queries
Add these panels to your autoscaling dashboard:
```promql
# Current vs desired replicas (shows scaling lag)
kube_horizontalpodautoscaler_status_current_replicas{namespace="production"}
kube_horizontalpodautoscaler_status_desired_replicas{namespace="production"}

# Scaling events over time
changes(kube_horizontalpodautoscaler_status_current_replicas{namespace="production"}[1h])

# HPA target utilization vs actual
kube_horizontalpodautoscaler_status_target_metric{namespace="production"}
```
Load Testing Your HPA Configuration
Never ship HPA config to production without testing it under synthetic load. Here's my testing workflow.
Using k6 for Load Testing
```bash
# Install k6
brew install k6

# Create a load test script
cat > load-test.js <<'EOF'
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 50 },   // Ramp up
    { duration: '5m', target: 50 },   // Sustain
    { duration: '2m', target: 200 },  // Spike
    { duration: '5m', target: 200 },  // Sustain spike
    { duration: '3m', target: 0 },    // Ramp down
  ],
};

export default function () {
  const res = http.get('http://api-server.production.svc:3000/api/data');
  check(res, {
    'status is 200': (r) => r.status === 200,
    'latency < 500ms': (r) => r.timings.duration < 500,
  });
  sleep(0.1);
}
EOF

# Run from inside the cluster
kubectl run k6-test --rm -i --restart=Never \
  --image=grafana/k6 \
  -n production \
  -- run - < load-test.js
```
Watching HPA During Load Tests
Open a terminal and watch the HPA respond in real time:
```bash
# Watch HPA scaling decisions
kubectl get hpa -n production -w

# In another terminal, watch pod count
kubectl get pods -n production -l app=api-server -w

# Check HPA events for scaling decisions
kubectl describe hpa api-server-hpa -n production | tail -20
```
What to look for during the test:
- Scale-up latency: How long between the load increase and the first new pod? Target under 2 minutes.
- Overshoot: Does the HPA create too many pods? Check if the stabilization window is too short.
- Scale-down timing: After load drops, how long before pods are removed? Ensure it's not too aggressive.
- Metric accuracy: Do the HPA's reported metric values match what you see in Prometheus?
Production Recommendations
After running custom metrics HPA across dozens of clusters, here's what I've learned:
- Always keep a CPU/memory metric as a fallback. If your custom metrics pipeline breaks, you still want basic autoscaling.
- Set sensible min/max replicas. A `minReplicas` of 1 is asking for trouble. Keep at least 2-3 for availability.
- Use stabilization windows. Scale up fast (30-60s window), scale down slow (300-600s window). Traffic is bursty, and you don't want to shed capacity prematurely.
- Monitor the HPA itself. Set up alerts for when the HPA reports `ScalingLimited` or `FailedGetPodsMetric` conditions.
- Test with load generators. Before going to production, use tools like `hey` or `k6` to verify the HPA responds correctly to load patterns.
- Watch out for metric cardinality. If your Prometheus adapter config is too broad, it'll try to register thousands of metrics with the API server. Be explicit about which series you want.
- Version your adapter config. Treat the Prometheus Adapter ConfigMap like application code. Review changes in PRs, test in staging first.
- Set resource requests on the adapter. The Prometheus Adapter itself needs resources. Underprovisioned adapters return slow or empty responses, and the HPA logs "unable to fetch metrics" without telling you why.
```yaml
# Prometheus Adapter resource requirements
resources:
  requests:
    memory: 128Mi
    cpu: 100m
  limits:
    memory: 256Mi
    cpu: 500m
```
Final Thoughts
Custom metrics HPA is one of those features that separates "we run Kubernetes" from "we run Kubernetes well." CPU-based scaling is a blunt instrument. Your workloads deserve better.
The setup isn't trivial — there's a real pipeline to build and maintain. But once it's running, your applications scale on the signals that actually predict capacity needs, not trailing indicators that tell you about problems after they've already started.
Start with one workload, one custom metric, and get comfortable with the debugging workflow. Then expand from there. That's how you build confidence in the system without gambling your production stability.
The investment pays for itself the first time your API scales up 30 seconds before the traffic spike hits instead of 2 minutes after users start seeing errors. That's the difference between proactive scaling and reactive firefighting.
And remember: autoscaling is not a substitute for capacity planning. Custom metrics tell you when to scale, but you still need to understand your workload's baseline requirements, set appropriate min/max bounds, and ensure your cluster has enough headroom to accommodate new pods. The best HPA configuration in the world does nothing if there's no node capacity to schedule the pods it requests.