Istio Observability: Kiali, Jaeger, and Prometheus Integration
One of Istio's most compelling features is that observability comes for free. Because every request flows through an Envoy sidecar, the mesh can automatically generate metrics, traces, and access logs without any instrumentation in your application code. This is a fundamental shift from traditional observability, where each team must manually instrument their services, configure exporters, and maintain consistency across languages and frameworks. With Istio, the moment a service joins the mesh, it immediately starts producing golden signal metrics (latency, traffic, errors, saturation), trace spans, and structured access logs.
Combine this with Kiali's service graph visualization, Jaeger's distributed tracing, Prometheus's metrics collection, and Grafana's dashboards, and you get a comprehensive observability stack that covers your entire microservices architecture from a single vantage point. This guide covers how to deploy and configure each observability component for production use, how to use them effectively during incident response, performance tuning for high-traffic environments, and advanced customization patterns.
Istio's Observability Architecture
Understanding how telemetry flows through the mesh is essential for both configuration and troubleshooting.
[ Application ] --> [ Envoy Sidecar ] --> [ Destination Envoy ] --> [ Application ]
                          |                         |
                     (generates)               (generates)
                          |                         |
                    [ Metrics    ]           [ Metrics    ]
                    [ Trace Span ]           [ Trace Span ]
                    [ Access Log ]           [ Access Log ]
                          |                         |
                          v                         v
         [ Prometheus ]        <--- scrapes metrics from each sidecar
         [ Jaeger Collector ]  <--- receives trace spans from each sidecar
         [ stdout / Fluentd ]  <--- streams access logs from each sidecar
Istio's Envoy sidecars generate three types of telemetry automatically:
| Telemetry Type | What It Provides | Default Backend | Overhead |
|---|---|---|---|
| Metrics | Request count, latency, error rate per service | Prometheus | Low (counters and histograms) |
| Traces | End-to-end request flow across services | Jaeger, Zipkin, or Datadog | Medium (sampling-dependent) |
| Access Logs | Detailed per-request log entries | stdout, Fluentd, Loki | High (one log per request) |
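The "high overhead" rating for access logs is worth quantifying before you enable them mesh-wide. A rough back-of-the-envelope sketch (the 2,000 RPS traffic level and ~600-byte JSON line size are illustrative assumptions, not measurements):

```python
def access_log_gb_per_day(rps: float, bytes_per_line: int = 600) -> float:
    """One log line per request per reporting sidecar, at a fixed average line size."""
    return rps * 86400 * bytes_per_line / 1e9

# Illustrative: a 2,000 RPS mesh with ~600-byte JSON log lines
print(f"{access_log_gb_per_day(2000):.1f} GB/day")  # 103.7 GB/day
```

Remember that both the source and destination sidecars can log the same request, so double this figure if both sides report.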
Key metrics Istio generates out of the box:
| Metric | Type | Description |
|---|---|---|
| `istio_requests_total` | Counter | Total request count by source, destination, response code |
| `istio_request_duration_milliseconds` | Histogram | Request latency distribution |
| `istio_request_bytes` | Histogram | Request body size distribution |
| `istio_response_bytes` | Histogram | Response body size distribution |
| `istio_tcp_connections_opened_total` | Counter | TCP connections opened |
| `istio_tcp_connections_closed_total` | Counter | TCP connections closed |
| `istio_tcp_sent_bytes_total` | Counter | Total bytes sent over TCP |
| `istio_tcp_received_bytes_total` | Counter | Total bytes received over TCP |
Installing Observability Addons
Istio provides sample manifests for common observability tools. For production, use Helm charts with custom configuration, but the samples are an excellent starting point for evaluation.
Quick Install (All Addons)
# From the Istio installation directory
kubectl apply -f samples/addons/prometheus.yaml
kubectl apply -f samples/addons/grafana.yaml
kubectl apply -f samples/addons/jaeger.yaml
kubectl apply -f samples/addons/kiali.yaml
# Wait for all pods to be ready
kubectl rollout status deployment/prometheus -n istio-system --timeout=120s
kubectl rollout status deployment/grafana -n istio-system --timeout=120s
kubectl rollout status deployment/jaeger -n istio-system --timeout=120s
kubectl rollout status deployment/kiali -n istio-system --timeout=120s
Production Prometheus Installation
For production, deploy Prometheus with the kube-prometheus-stack Helm chart and configure it to scrape Istio metrics:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
kubectl create namespace monitoring
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--values prometheus-values.yaml \
--wait
Production prometheus-values.yaml:
# prometheus-values.yaml
prometheus:
prometheusSpec:
# Accept ServiceMonitor/PodMonitor from any namespace
serviceMonitorSelectorNilUsesHelmValues: false
podMonitorSelectorNilUsesHelmValues: false
# Storage configuration
storageSpec:
volumeClaimTemplate:
spec:
storageClassName: gp3
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 200Gi
# Retention settings
retention: 15d
retentionSize: "180GB"
# Resource allocation for production
resources:
requests:
cpu: "2"
memory: 4Gi
limits:
cpu: "4"
memory: 8Gi
# Additional scrape configs for Istio
additionalScrapeConfigs:
- job_name: 'istio-mesh'
kubernetes_sd_configs:
- role: endpoints
namespaces:
names:
- istio-system
relabel_configs:
- source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: istio-telemetry;prometheus
grafana:
enabled: true
persistence:
enabled: true
size: 10Gi
resources:
requests:
cpu: 250m
memory: 512Mi
dashboardProviders:
dashboardproviders.yaml:
apiVersion: 1
providers:
- name: istio
orgId: 1
folder: Istio
type: file
disableDeletion: false
editable: true
options:
path: /var/lib/grafana/dashboards/istio
alertmanager:
config:
route:
group_by: ['namespace', 'alertname']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'slack-notifications'
receivers:
- name: 'slack-notifications'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#alerts'
send_resolved: true
Create a ServiceMonitor for Istio's control plane:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: istio-component-monitor
namespace: monitoring
labels:
release: prometheus
spec:
jobLabel: istio
targetLabels: [app]
selector:
matchExpressions:
- key: istio
operator: In
values: [pilot]
namespaceSelector:
matchNames:
- istio-system
endpoints:
- port: http-monitoring
interval: 15s
path: /metrics
And a PodMonitor for Envoy sidecar metrics:
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
name: envoy-stats-monitor
namespace: monitoring
labels:
release: prometheus
spec:
selector:
matchExpressions:
- key: security.istio.io/tlsMode
operator: Exists
namespaceSelector:
any: true
jobLabel: envoy-stats
podMetricsEndpoints:
- path: /stats/prometheus
interval: 15s
relabelings:
- action: keep
sourceLabels: [__meta_kubernetes_pod_container_name]
regex: "istio-proxy"
- action: keep
sourceLabels: [__meta_kubernetes_pod_annotationpresent_prometheus_io_scrape]
# Drop high-cardinality metrics to reduce storage
- action: drop
sourceLabels: [__name__]
regex: "envoy_.*_bucket"
Kiali: Service Graph and Configuration
Kiali is the management console for Istio. It provides a visual service graph showing how your services communicate, along with real-time traffic metrics, configuration validation, and troubleshooting tools.
Production Kiali Installation
helm repo add kiali https://kiali.org/helm-charts
helm repo update
helm install kiali-server kiali/kiali-server \
--namespace istio-system \
--values kiali-values.yaml
Production kiali-values.yaml:
# kiali-values.yaml
auth:
strategy: openid
openid:
client_id: kiali
issuer_uri: https://auth.example.com
username_claim: email
scopes:
- openid
- email
disable_rbac: false
deployment:
replicas: 2
resources:
requests:
cpu: 200m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
external_services:
prometheus:
url: http://prometheus-kube-prometheus-prometheus.monitoring:9090
cache_enabled: true
cache_duration: 10
grafana:
enabled: true
in_cluster_url: http://prometheus-grafana.monitoring:80
url: https://grafana.example.com
tracing:
enabled: true
in_cluster_url: http://jaeger-query.observability:16685/jaeger
url: https://jaeger.example.com
use_grpc: true
server:
web_root: /kiali
web_fqdn: kiali.example.com
web_schema: https
Accessing Kiali
# Port-forward for quick access
kubectl port-forward svc/kiali -n istio-system 20001:20001
# Or use istioctl
istioctl dashboard kiali
What Kiali Shows You
Service Graph --- The most valuable view. It displays:
- All services in the mesh and their connections
- Request rate, error rate, and latency on each edge
- TCP vs HTTP traffic distinction
- Health status of each service (green, yellow, red)
- mTLS lock icons showing encryption status
- Traffic animation showing real-time request flow
Configuration Validation --- Kiali validates your Istio configuration and flags common issues:
- VirtualService referencing a non-existent DestinationRule subset
- Conflicting policies (overlapping VirtualServices)
- Missing sidecar injection
- Unreachable services
- Misconfigured gateway hosts
Workload Details --- For each service, view:
- Inbound and outbound metrics with time-series graphs
- Istio configuration (VirtualServices, DestinationRules, AuthorizationPolicies)
- Logs from both the app container and sidecar
- Envoy proxy configuration and sync status
- Distributed traces associated with the service
Using Kiali During Incident Response
During an incident, Kiali's service graph is your first stop:
- Open the graph view filtered to the affected namespace
- Look for red edges (5xx errors) or yellow edges (4xx errors)
- Click on the affected edge to see detailed metrics
- Check whether the error is reported on the client side or the server side using the `reporter` label
- Navigate to the workload details to see individual pod health
- Check the Envoy access logs for specific error messages
- Link to Jaeger traces for detailed request-level analysis
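The same triage can be scripted against Prometheus when Kiali is unavailable. A minimal sketch, assuming a port-forwarded Prometheus at `localhost:9090`; the `health_color` thresholds approximate Kiali's default request-health settings (degraded at 0.1% errors, failing at 20%) and should be tuned to your own SLOs:

```python
import urllib.parse

def error_rate_query(namespace: str) -> str:
    """PromQL: per-service 5xx ratio, mirroring what Kiali's graph edges show."""
    sel = 'reporter="destination",destination_service_namespace="%s"' % namespace
    return (
        'sum(rate(istio_requests_total{%s,response_code=~"5.."}[5m])) by (destination_service_name)'
        ' / sum(rate(istio_requests_total{%s}[5m])) by (destination_service_name)'
    ) % (sel, sel)

def health_color(error_rate: float) -> str:
    """Approximate Kiali's defaults (assumption): yellow >= 0.1%, red >= 20%."""
    if error_rate >= 0.20:
        return "red"
    if error_rate >= 0.001:
        return "yellow"
    return "green"

# Build the query URL for a port-forwarded Prometheus (kubectl port-forward ... 9090)
url = "http://localhost:9090/api/v1/query?" + urllib.parse.urlencode(
    {"query": error_rate_query("production")}
)
```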
Jaeger: Distributed Tracing
Distributed tracing follows a single request as it flows through multiple services, showing exactly where time is spent and where failures occur. This is indispensable for debugging latency issues and understanding the request flow in complex architectures.
How Tracing Works in Istio
Envoy sidecars automatically generate trace spans for every request. However, for traces to connect across service boundaries, your application must propagate specific HTTP headers. Istio does not do this automatically because the application may make multiple outbound calls from a single inbound request, and only the application knows which outbound calls are part of the same trace.
Headers to propagate:
# W3C Trace Context (preferred, modern standard)
traceparent
tracestate
# B3 headers (Zipkin-style, widely supported)
x-b3-traceid
x-b3-spanid
x-b3-parentspanid
x-b3-sampled
x-b3-flags
b3
# Istio-specific
x-request-id
Header Propagation Example (Python/Flask)
import requests
from flask import Flask, request
app = Flask(__name__)
TRACE_HEADERS = [
'x-request-id', 'x-b3-traceid', 'x-b3-spanid',
'x-b3-parentspanid', 'x-b3-sampled', 'x-b3-flags',
'b3', 'traceparent', 'tracestate'
]
def propagate_headers():
"""Extract trace headers from incoming request for propagation."""
headers = {}
for header in TRACE_HEADERS:
value = request.headers.get(header)
if value:
headers[header] = value
return headers
@app.route('/api/orders')
def get_orders():
headers = propagate_headers()
# These outbound calls will be part of the same trace
users = requests.get('http://users-service:8080/api/users', headers=headers)
inventory = requests.get('http://inventory-service:8080/api/stock', headers=headers)
pricing = requests.get('http://pricing-service:8080/api/prices', headers=headers)
return {
"orders": [],
"users": users.json(),
"stock": inventory.json(),
"prices": pricing.json()
}
@app.route('/health')
def health():
return {"status": "healthy"}
if __name__ == '__main__':
app.run(host='0.0.0.0', port=8080)
Header Propagation Example (Go)
package main
import (
"io"
"net/http"
)
var traceHeaders = []string{
"x-request-id", "x-b3-traceid", "x-b3-spanid",
"x-b3-parentspanid", "x-b3-sampled", "x-b3-flags",
"b3", "traceparent", "tracestate",
}
func propagateHeaders(r *http.Request) http.Header {
headers := make(http.Header)
for _, h := range traceHeaders {
if val := r.Header.Get(h); val != "" {
headers.Set(h, val)
}
}
return headers
}
func ordersHandler(w http.ResponseWriter, r *http.Request) {
headers := propagateHeaders(r)
// Create a new request with propagated headers
req, _ := http.NewRequest("GET", "http://users-service:8080/api/users", nil)
req.Header = headers
client := &http.Client{}
resp, err := client.Do(req)
if err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
defer resp.Body.Close()
w.Header().Set("Content-Type", "application/json")
io.Copy(w, resp.Body)
}
func main() {
http.HandleFunc("/api/orders", ordersHandler)
http.ListenAndServe(":8080", nil)
}
Header Propagation Example (Node.js/Express)
const express = require('express');
const axios = require('axios');
const app = express();
const TRACE_HEADERS = [
'x-request-id', 'x-b3-traceid', 'x-b3-spanid',
'x-b3-parentspanid', 'x-b3-sampled', 'x-b3-flags',
'b3', 'traceparent', 'tracestate'
];
function propagateHeaders(req) {
const headers = {};
for (const header of TRACE_HEADERS) {
const value = req.headers[header];
if (value) {
headers[header] = value;
}
}
return headers;
}
app.get('/api/orders', async (req, res) => {
const headers = propagateHeaders(req);
try {
const [users, inventory] = await Promise.all([
axios.get('http://users-service:8080/api/users', { headers }),
axios.get('http://inventory-service:8080/api/stock', { headers })
]);
res.json({
orders: [],
users: users.data,
stock: inventory.data
});
} catch (error) {
res.status(500).json({ error: error.message });
}
});
app.listen(8080, () => console.log('Server running on port 8080'));
Sampling Strategies
Not every request needs to be traced. Sampling reduces the overhead and storage costs of tracing.
# Configure sampling in Istio mesh config
meshConfig:
defaultConfig:
tracing:
sampling: 1.0 # 1% of requests (value is percentage, 0-100)
extensionProviders:
- name: jaeger
opentelemetry:
service: jaeger-collector.observability.svc.cluster.local
port: 4317
| Sampling Rate | Use Case | Storage Impact |
|---|---|---|
| 100% | Development and debugging | Very high |
| 10% | Staging environments | High |
| 1% | Production (moderate traffic) | Moderate |
| 0.1% | Very high traffic (more than 10k RPS) | Low |
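To pick a row from this table for your own mesh, work backward from Jaeger storage capacity. A quick estimator (the 5,000 RPS figure is an illustrative assumption):

```python
def traces_per_day(mesh_rps: float, sampling_pct: float) -> int:
    """Root traces retained per day under head-based sampling."""
    return int(mesh_rps * 86400 * sampling_pct / 100)

print(traces_per_day(5000, 1.0))  # 4320000 traces/day at the 1% default
print(traces_per_day(5000, 0.1))  # 432000 traces/day at 0.1%
```

Multiply by your average trace size (typically a few KB per trace, depending on span count) to size the Elasticsearch or Cassandra backend.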
You can also configure per-workload sampling overrides:
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
name: high-sampling
namespace: production
spec:
selector:
matchLabels:
app: payment-service # Critical service gets higher sampling
tracing:
- providers:
- name: jaeger
randomSamplingPercentage: 10.0 # 10% for this specific service
Production Jaeger Deployment
The sample Jaeger deployment uses in-memory storage. For production, use Elasticsearch or Cassandra as the backend:
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update
kubectl create namespace observability
helm install jaeger jaegertracing/jaeger \
--namespace observability \
--values jaeger-values.yaml
Production jaeger-values.yaml:
# jaeger-values.yaml
provisionDataStore:
cassandra: false
elasticsearch: true
storage:
type: elasticsearch
elasticsearch:
replicas: 3
minimumMasterNodes: 2
persistence:
enabled: true
size: 200Gi
storageClassName: gp3
resources:
requests:
cpu: "1"
memory: 4Gi
limits:
cpu: "2"
memory: 8Gi
esJavaOpts: "-Xms4g -Xmx4g"
collector:
replicaCount: 2
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: "1"
memory: 1Gi
service:
grpc:
port: 14250
otlp:
grpc:
port: 4317
http:
port: 4318
query:
replicaCount: 2
resources:
requests:
cpu: 250m
memory: 256Mi
service:
port: 16686
# Configure index cleaner to manage storage
esIndexCleaner:
enabled: true
schedule: "55 23 * * *"
numberOfDays: 7
resources:
requests:
cpu: 100m
memory: 128Mi
Accessing Jaeger
# Port-forward
kubectl port-forward svc/jaeger-query -n observability 16686:16686
# Or use istioctl
istioctl dashboard jaeger
In the Jaeger UI, you can:
- Search traces by service, operation, duration, and tags
- View the full call graph for a single request
- Compare two traces side-by-side to identify performance regressions
- Find the exact span where an error occurred
- Analyze the critical path through your request flow
Using Jaeger for Debugging
When investigating a slow request:
- Search for traces with duration above your SLA threshold
- Open a slow trace and look at the waterfall view
- Identify which service span consumes the most time
- Check if the slow span has error tags
- Look at the span's logs for additional context
- Compare with a fast trace of the same operation to spot differences
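Step 3 above (finding the span that consumes the most time) means looking at *self time*, not total duration: a parent span's duration includes all of its children. A minimal sketch of that calculation, using a hypothetical trace; it assumes children run sequentially without overlap, which is a simplification:

```python
def self_times(spans: dict) -> dict:
    """Span duration minus its direct children's durations.
    Assumes children do not overlap (a simplification)."""
    child_total = {sid: 0.0 for sid in spans}
    for s in spans.values():
        if s["parent"] is not None:
            child_total[s["parent"]] += s["duration_ms"]
    return {sid: s["duration_ms"] - child_total[sid] for sid, s in spans.items()}

# Hypothetical trace: names and timings are illustrative
trace = {
    "gateway": {"parent": None,      "duration_ms": 480.0},
    "orders":  {"parent": "gateway", "duration_ms": 450.0},
    "users":   {"parent": "orders",  "duration_ms": 40.0},
    "pricing": {"parent": "orders",  "duration_ms": 390.0},
}
st = self_times(trace)
print(max(st, key=st.get))  # pricing
```

Here `orders` looks slow (450 ms) but only spends 20 ms of its own time; the real culprit is `pricing`.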
Prometheus: Istio Metrics
Key Istio Metrics and Queries
Query these metrics in Prometheus to understand mesh behavior:
# Request rate by service (requests per second)
sum(rate(istio_requests_total{reporter="destination"}[5m]))
by (destination_service_name)
# Error rate (5xx responses) by service
sum(rate(istio_requests_total{reporter="destination",response_code=~"5.."}[5m]))
by (destination_service_name)
/
sum(rate(istio_requests_total{reporter="destination"}[5m]))
by (destination_service_name)
# P50 latency by service
histogram_quantile(0.50,
sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
by (destination_service_name, le)
)
# P95 latency by service
histogram_quantile(0.95,
sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
by (destination_service_name, le)
)
# P99 latency by service
histogram_quantile(0.99,
sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
by (destination_service_name, le)
)
# Request volume by source and destination (service dependency map)
sum(rate(istio_requests_total{reporter="source"}[5m]))
by (source_workload, destination_service_name)
# TCP bytes sent between services
sum(rate(istio_tcp_sent_bytes_total[5m]))
by (source_workload, destination_workload)
# Connection pool utilization
sum(envoy_cluster_upstream_cx_active)
by (cluster_name)
# Circuit breaker ejections
sum(rate(envoy_cluster_outlier_detection_ejections_total[5m]))
by (cluster_name)
# Sidecar resource usage
sum(container_memory_usage_bytes{container="istio-proxy"})
by (namespace, pod)
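The error-rate query above is the input to SLO-based alerting. One common pattern is to convert it to an error-budget burn rate; a minimal sketch (the 99.9% SLO is an illustrative assumption):

```python
def burn_rate(error_rate: float, slo: float = 0.999) -> float:
    """Error-budget burn: measured error rate over the rate the SLO allows."""
    return error_rate / (1.0 - slo)

# A 0.5% error rate against a 99.9% SLO consumes budget 5x faster than allowed
print(round(burn_rate(0.005), 1))  # 5.0
```

A sustained burn rate above 1.0 means the service will miss its SLO over the evaluation window; multi-window burn-rate alerts (e.g. fast burn over 5m plus slow burn over 1h) reduce false pages.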
Custom Metrics with Telemetry API
Istio's Telemetry API lets you customize which metrics are collected and what labels they include:
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
name: custom-metrics
namespace: istio-system # Mesh-wide
spec:
metrics:
- providers:
- name: prometheus
overrides:
# Add request host to REQUEST_COUNT metric
- match:
metric: REQUEST_COUNT
mode: CLIENT_AND_SERVER
tagOverrides:
request_host:
operation: UPSERT
value: "request.host"
request_path:
operation: UPSERT
value: "request.url_path"
# Ensure REQUEST_DURATION is always enabled
- match:
metric: REQUEST_DURATION
disabled: false
Reduce Metric Cardinality
High cardinality is the biggest performance threat to Prometheus in a mesh environment. Disable metrics you do not need:
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
name: reduce-metrics
namespace: istio-system
spec:
metrics:
- providers:
- name: prometheus
overrides:
# Disable byte-counting metrics (rarely needed)
- match:
metric: REQUEST_BYTES
disabled: true
- match:
metric: RESPONSE_BYTES
disabled: true
# Disable TCP metrics if you primarily use HTTP
- match:
metric: TCP_OPENED_CONNECTIONS
disabled: true
- match:
metric: TCP_CLOSED_CONNECTIONS
disabled: true
# Remove high-cardinality labels from remaining metrics
- match:
metric: REQUEST_COUNT
tagOverrides:
request_protocol:
operation: REMOVE
connection_security_policy:
operation: REMOVE
Estimating Metric Cardinality
Before deploying to production, estimate the number of time series your mesh will generate:
Time Series = num_metrics * num_source_services * num_dest_services * num_response_codes * num_methods * num_reporters
Example:
6 metrics * 50 services * 50 services * 5 response codes * 4 methods * 2 reporters
= 600,000 time series (with all labels)
After optimization (reducing labels):
4 metrics * 50 services * 50 services * 3 response code groups * 2 reporters
= 60,000 time series (much more manageable)
In practice this is an upper bound --- not every label combination actually occurs --- but it is the right order of magnitude for capacity planning.
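The arithmetic is just a product over label cardinalities, so it is easy to script for your own service counts:

```python
import math

def estimate_series(*label_cardinalities: int) -> int:
    """Crude upper bound: product of each dimension's distinct-value count."""
    return math.prod(label_cardinalities)

# 6 metrics x 50 sources x 50 destinations x 5 codes x 4 methods x 2 reporters
print(estimate_series(6, 50, 50, 5, 4, 2))  # 600000
# 4 metrics x 50 x 50 x 3 response-code groups x 2 reporters
print(estimate_series(4, 50, 50, 3, 2))     # 60000
```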
Grafana: Pre-Built Istio Dashboards
Istio ships with several Grafana dashboards that provide immediate insight into mesh health.
Available Dashboards
| Dashboard | ID | Shows |
|---|---|---|
| Mesh Dashboard | 7639 | Global view of all services --- request volume, success rate, latency |
| Service Dashboard | 7636 | Detailed metrics for a single service --- inbound/outbound traffic |
| Workload Dashboard | 7630 | Per-pod metrics for a specific workload |
| Performance Dashboard | 11829 | Resource usage of Envoy proxies and istiod |
| Control Plane Dashboard | 7645 | istiod health --- config push latency, xDS connections, errors |
Accessing Grafana
# Port-forward
kubectl port-forward svc/grafana -n monitoring 3000:3000
# Or use istioctl
istioctl dashboard grafana
Importing Dashboards into Existing Grafana
If you already have a Grafana instance, import the Istio dashboards:
# Download dashboard JSON files
DASHBOARDS=(
"7639" # Istio Mesh Dashboard
"7636" # Istio Service Dashboard
"7630" # Istio Workload Dashboard
"11829" # Istio Performance Dashboard
"7645" # Istio Control Plane Dashboard
)
for id in "${DASHBOARDS[@]}"; do
curl -s "https://grafana.com/api/dashboards/${id}/revisions/latest/download" \
-o "istio-dashboard-${id}.json"
done
# Import via Grafana API
for file in istio-dashboard-*.json; do
curl -X POST http://admin:admin@grafana:3000/api/dashboards/db \
-H "Content-Type: application/json" \
-d "{
\"dashboard\": $(cat $file),
\"folderId\": 0,
\"overwrite\": true
}"
done
Custom Dashboard: Service Health Overview
Create a custom dashboard that shows the most important metrics at a glance:
{
"title": "Service Mesh Health",
"panels": [
{
"title": "Global Success Rate",
"type": "gauge",
"targets": [{
"expr": "sum(rate(istio_requests_total{reporter='destination',response_code!~'5..'}[5m])) / sum(rate(istio_requests_total{reporter='destination'}[5m])) * 100"
}],
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{"color": "red", "value": null},
{"color": "orange", "value": 95},
{"color": "green", "value": 99}
]
},
"unit": "percent"
}
}
},
{
"title": "Services with Errors",
"type": "table",
"targets": [{
"expr": "topk(10, sum(rate(istio_requests_total{reporter='destination',response_code=~'5..'}[5m])) by (destination_service_name))"
}]
}
]
}
Access Logging Configuration
Access logs record every request passing through the mesh. They are invaluable for debugging but generate significant volume in high-traffic environments.
Enable Access Logging
# In IstioOperator or Helm values
meshConfig:
accessLogFile: /dev/stdout
accessLogEncoding: JSON
accessLogFormat: |
{
"start_time": "%START_TIME%",
"method": "%REQ(:METHOD)%",
"path": "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%",
"protocol": "%PROTOCOL%",
"response_code": "%RESPONSE_CODE%",
"response_flags": "%RESPONSE_FLAGS%",
"upstream_host": "%UPSTREAM_HOST%",
"upstream_cluster": "%UPSTREAM_CLUSTER%",
"upstream_service_time": "%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%",
"duration": "%DURATION%",
"request_id": "%REQ(X-REQUEST-ID)%",
"source_address": "%DOWNSTREAM_REMOTE_ADDRESS%",
"destination_address": "%UPSTREAM_HOST%",
"user_agent": "%REQ(USER-AGENT)%",
"trace_id": "%REQ(X-B3-TRACEID)%",
"authority": "%REQ(:AUTHORITY)%",
"bytes_received": "%BYTES_RECEIVED%",
"bytes_sent": "%BYTES_SENT%"
}
Response Flags Reference
The response_flags field in access logs tells you exactly what went wrong:
| Flag | Meaning | Common Cause |
|---|---|---|
| `UH` | No healthy upstream | All endpoints ejected by circuit breaker |
| `UF` | Upstream connection failure | Service crashed or unreachable |
| `UO` | Upstream overflow (circuit breaking) | Connection pool exhausted |
| `NR` | No route configured | Missing VirtualService or DestinationRule |
| `URX` | Upstream retry limit exceeded | All retries failed |
| `DT` | Downstream request timeout | Client gave up waiting |
| `UT` | Upstream request timeout | Backend too slow |
| `DC` | Downstream connection termination | Client disconnected |
| `RL` | Rate limited | Local or global rate limit hit |
| `UAEX` | Unauthorized (ext authz) | External auth provider denied |
| `RLSE` | Rate limit service error | Rate limit service unreachable |
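Because the JSON log format above includes `response_flags`, a small helper can annotate log lines during triage. A sketch covering a subset of the flags from the table (the sample log line is illustrative):

```python
import json

FLAG_MEANINGS = {
    "UH": "no healthy upstream",
    "UF": "upstream connection failure",
    "UO": "upstream overflow (circuit breaking)",
    "NR": "no route configured",
    "UT": "upstream request timeout",
    "URX": "upstream retry limit exceeded",
}

def explain(log_line: str) -> str:
    """Annotate a JSON access-log line with its response_flags meaning."""
    entry = json.loads(log_line)
    flags = entry.get("response_flags", "-")
    meaning = FLAG_MEANINGS.get(flags, "no flag set")
    return f'{entry["response_code"]} {entry["path"]} -> {flags}: {meaning}'

line = '{"path": "/api/orders", "response_code": "503", "response_flags": "UO"}'
print(explain(line))  # 503 /api/orders -> UO: upstream overflow (circuit breaking)
```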
Selective Access Logging with Telemetry API
Enable access logs only for specific workloads or conditions to reduce noise:
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
name: access-logs
namespace: production
spec:
selector:
matchLabels:
app: api-service
accessLogging:
- providers:
- name: envoy
filter:
expression: "response.code >= 400" # Only log errors
---
# Enable full logging for the payment service (audit requirement)
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
name: payment-audit-logs
namespace: production
spec:
selector:
matchLabels:
app: payment-service
accessLogging:
- providers:
- name: envoy
# No filter = log everything
Envoy Stats and Deep Debugging
Each Envoy proxy exposes detailed statistics about its operation:
# View all Envoy stats for a specific pod
kubectl exec deploy/api-service -c istio-proxy -- \
pilot-agent request GET stats
# Filter for specific stats
kubectl exec deploy/api-service -c istio-proxy -- \
pilot-agent request GET "stats?filter=cluster.outbound"
# View active clusters (upstream services)
kubectl exec deploy/api-service -c istio-proxy -- \
pilot-agent request GET clusters
# View listeners (what ports Envoy is listening on)
istioctl proxy-config listeners deploy/api-service
# View routes (how requests are routed)
istioctl proxy-config routes deploy/api-service
# View endpoints (which pod IPs are available)
istioctl proxy-config endpoints deploy/api-service
# Full configuration dump (large output)
istioctl proxy-config all deploy/api-service -o json > proxy-dump.json
Key Envoy stats to monitor:
| Stat | Meaning | Alert Threshold |
|---|---|---|
| `upstream_cx_active` | Active connections to upstream | Near connection pool max |
| `upstream_rq_pending_active` | Requests waiting in queue | Consistently above 0 |
| `upstream_rq_retry` | Number of retries | High ratio to total requests |
| `upstream_rq_timeout` | Number of timeouts | Any sustained increase |
| `upstream_cx_connect_fail` | Connection failures | Any non-zero value |
| `upstream_rq_pending_overflow` | Rejected due to circuit breaker | Any non-zero value |
| `membership_healthy` | Healthy endpoints in cluster | Below expected count |
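These thresholds are simple enough to check mechanically once you have parsed a stats snapshot into a dict. A sketch for a subset of the table (stat names match Envoy's cluster stats; the snapshot values are illustrative):

```python
def stat_alerts(stats: dict) -> list:
    """Apply the alert thresholds from the table above to a stats snapshot."""
    alerts = []
    if stats.get("upstream_rq_pending_active", 0) > 0:
        alerts.append("requests queuing for a connection")
    if stats.get("upstream_cx_connect_fail", 0) > 0:
        alerts.append("upstream connection failures")
    if stats.get("upstream_rq_pending_overflow", 0) > 0:
        alerts.append("circuit breaker rejecting requests")
    return alerts

print(stat_alerts({"upstream_cx_connect_fail": 3, "upstream_rq_pending_active": 0}))
# ['upstream connection failures']
```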
Setting Up Alerting for Mesh Health
Create Prometheus alerting rules for critical mesh conditions:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: istio-alerts
namespace: monitoring
labels:
release: prometheus
spec:
groups:
- name: istio.service.rules
rules:
# High error rate on a service
- alert: IstioHighErrorRate
expr: |
(
sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name, namespace)
/
sum(rate(istio_requests_total[5m])) by (destination_service_name, namespace)
) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High 5xx error rate on {{ $labels.destination_service_name }}"
description: "Error rate is {{ $value | humanizePercentage }} in namespace {{ $labels.namespace }}."
runbook_url: "https://runbooks.company.com/istio-high-error-rate"
# High P99 latency
- alert: IstioHighLatency
expr: |
histogram_quantile(0.99,
sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
by (destination_service_name, le)
) > 1000
for: 5m
labels:
severity: warning
annotations:
summary: "High P99 latency on {{ $labels.destination_service_name }}"
description: "P99 latency is {{ $value }}ms."
# Request volume drop (potential outage indicator)
- alert: IstioRequestVolumeDrop
expr: |
(
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)
/
sum(rate(istio_requests_total{reporter="destination"}[5m] offset 1h)) by (destination_service_name)
) < 0.5
for: 10m
labels:
severity: warning
annotations:
summary: "Request volume dropped 50%+ for {{ $labels.destination_service_name }}"
- name: istio.mesh.rules
rules:
# Sidecar injection missing
- alert: IstioPodWithoutSidecar
expr: |
count(
kube_pod_info{namespace!~"kube-system|istio-system|monitoring"}
) by (namespace)
-
count(
kube_pod_container_info{container="istio-proxy",namespace!~"kube-system|istio-system|monitoring"}
) by (namespace)
> 0
for: 10m
labels:
severity: warning
annotations:
summary: "Pods without Istio sidecar in {{ $labels.namespace }}"
# Control plane health
- alert: IstiodUnhealthy
expr: |
sum(rate(pilot_xds_pushes{type="cds"}[5m])) == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Istiod is not pushing CDS configuration updates"
runbook_url: "https://runbooks.company.com/istiod-unhealthy"
# Config push errors
- alert: IstiodPushErrors
expr: |
sum(rate(pilot_xds_push_errors[5m])) > 0
for: 5m
labels:
severity: warning
annotations:
summary: "Istiod is experiencing configuration push errors"
# Proxy config out of sync
- alert: IstioProxyConfigStale
expr: |
sum(pilot_proxy_convergence_time_bucket{le="30"}) / sum(pilot_proxy_convergence_time_count) < 0.9
for: 10m
labels:
severity: warning
annotations:
summary: "More than 10% of proxy configs are taking over 30s to converge"
Production Observability Best Practices
- Set sampling rates appropriately --- 100% tracing in production will overwhelm your Jaeger backend. Start at 1% and adjust based on traffic volume. Increase sampling for critical paths like payment processing.
- Use the Telemetry API to reduce metric cardinality --- High-cardinality labels (like request paths with path parameters) can cause Prometheus performance issues. Disable metrics you do not query and remove labels you do not need.
- Separate observability namespaces --- Run Prometheus, Grafana, Jaeger, and Kiali in a dedicated `monitoring` or `observability` namespace with its own resource quotas. This prevents observability tools from competing with application workloads for resources.
- Retain traces strategically --- Keep traces for 7 days in production. For specific incident investigations, export relevant traces to long-term storage before they expire.
- Set up dashboards before you need them --- Import Istio's Grafana dashboards on day one. When an incident happens, you want the dashboards already there. Create custom dashboards for your specific SLOs.
- Monitor the control plane --- Istiod configuration push failures or high latency mean your routing rules and security policies are not reaching the sidecars. Alert on `pilot_xds_push_time` and `pilot_xds_push_errors`.
- Propagate trace headers in every service --- This is the most commonly missed step. Without header propagation, you get disconnected single-hop spans instead of end-to-end traces. Add header propagation to your service template or shared middleware.
- Use Kiali for day-to-day operations --- Before diving into Prometheus queries, check Kiali's service graph. It often reveals the problem immediately through visual traffic flow and error highlighting.
- Set up log aggregation --- Send Envoy access logs to a centralized system (Loki, Elasticsearch, Datadog). When traces and metrics point to a problem, access logs give you the request-level detail needed for root cause analysis.
- Right-size your observability infrastructure --- Monitor the resource usage of Prometheus, Jaeger, and Elasticsearch themselves. These tools can become resource-hungry as your mesh grows. Plan for 2-3x storage growth over 6 months.
Summary
Istio's observability stack transforms how you understand your microservices architecture. Prometheus metrics give you the numbers, Jaeger traces give you the request flow, Kiali gives you the visual map, and Grafana ties it all together in dashboards. The key to making this work in production is proper sampling configuration, strategic metric collection to manage cardinality, and ensuring every service propagates tracing headers. Deploy the observability addons alongside Istio from day one --- retrofitting observability is always harder than starting with it. Invest time in building alerting rules that catch problems early, and train your team to use Kiali's service graph as the first step in incident response. The combination of automatic telemetry generation and a well-configured observability stack gives you unprecedented visibility into your distributed system, turning "we have no idea why it is slow" into "the payment service's database connection pool is saturated at 14:23."