
Istio Observability: Kiali, Jaeger, and Prometheus Integration

Riku Tanaka · 21 min read

One of Istio's most compelling features is that observability comes for free. Because every request flows through an Envoy sidecar, the mesh can automatically generate metrics, traces, and access logs without any instrumentation in your application code. This is a fundamental shift from traditional observability, where each team must manually instrument their services, configure exporters, and maintain consistency across languages and frameworks. With Istio, the moment a service joins the mesh, it immediately starts producing golden signal metrics (latency, traffic, errors, saturation), trace spans, and structured access logs.

Combine this with Kiali's service graph visualization, Jaeger's distributed tracing, Prometheus's metrics collection, and Grafana's dashboards, and you get a comprehensive observability stack that covers your entire microservices architecture from a single vantage point. This guide covers how to deploy and configure each observability component for production use, how to use them effectively during incident response, performance tuning for high-traffic environments, and advanced customization patterns.

Istio's Observability Architecture

Understanding how telemetry flows through the mesh is essential for both configuration and troubleshooting.

[ Application ] --> [ Envoy Sidecar ] --> [ Destination Envoy ] --> [ Application ]
                         |                        |
                    (generates)              (generates)
                         |                        |
                    [ Metrics ]              [ Metrics ]
                    [ Trace Span ]           [ Trace Span ]
                    [ Access Log ]           [ Access Log ]
                         |                        |
                         v                        v
                   [ Prometheus ]       <--- scrapes metrics from both sidecars
                   [ Jaeger Collector ] <--- receives spans from both sidecars
                   [ stdout / Fluentd ] <--- receives access logs from both sidecars

Istio's Envoy sidecars generate three types of telemetry automatically:

| Telemetry Type | What It Provides | Default Backend | Overhead |
|---|---|---|---|
| Metrics | Request count, latency, error rate per service | Prometheus | Low (counters and histograms) |
| Traces | End-to-end request flow across services | Jaeger, Zipkin, or Datadog | Medium (sampling-dependent) |
| Access Logs | Detailed per-request log entries | stdout, Fluentd, Loki | High (one log per request) |

Key metrics Istio generates out of the box:

| Metric | Type | Description |
|---|---|---|
| istio_requests_total | Counter | Total request count by source, destination, response code |
| istio_request_duration_milliseconds | Histogram | Request latency distribution |
| istio_request_bytes | Histogram | Request body size distribution |
| istio_response_bytes | Histogram | Response body size distribution |
| istio_tcp_connections_opened_total | Counter | TCP connections opened |
| istio_tcp_connections_closed_total | Counter | TCP connections closed |
| istio_tcp_sent_bytes_total | Counter | Total bytes sent over TCP |
| istio_tcp_received_bytes_total | Counter | Total bytes received over TCP |

Installing Observability Addons

Istio provides sample manifests for common observability tools. For production, use Helm charts with custom configuration, but the samples are an excellent starting point for evaluation.

Quick Install (All Addons)

# From the Istio installation directory
kubectl apply -f samples/addons/prometheus.yaml
kubectl apply -f samples/addons/grafana.yaml
kubectl apply -f samples/addons/jaeger.yaml
kubectl apply -f samples/addons/kiali.yaml

# Wait for all pods to be ready
kubectl rollout status deployment/prometheus -n istio-system --timeout=120s
kubectl rollout status deployment/grafana -n istio-system --timeout=120s
kubectl rollout status deployment/jaeger -n istio-system --timeout=120s
kubectl rollout status deployment/kiali -n istio-system --timeout=120s

Production Prometheus Installation

For production, deploy Prometheus with the kube-prometheus-stack Helm chart and configure it to scrape Istio metrics:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

kubectl create namespace monitoring

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values prometheus-values.yaml \
  --wait

Production prometheus-values.yaml:

# prometheus-values.yaml
prometheus:
  prometheusSpec:
    # Accept ServiceMonitor/PodMonitor from any namespace
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false

    # Storage configuration
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 200Gi

    # Retention settings
    retention: 15d
    retentionSize: "180GB"

    # Resource allocation for production
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
      limits:
        cpu: "4"
        memory: 8Gi

    # Additional scrape configs for Istio
    additionalScrapeConfigs:
      - job_name: 'istio-mesh'
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - istio-system
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: istio-telemetry;prometheus

grafana:
  enabled: true
  persistence:
    enabled: true
    size: 10Gi
  resources:
    requests:
      cpu: 250m
      memory: 512Mi
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: istio
          orgId: 1
          folder: Istio
          type: file
          disableDeletion: false
          editable: true
          options:
            path: /var/lib/grafana/dashboards/istio

alertmanager:
  config:
    route:
      group_by: ['namespace', 'alertname']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: 'slack-notifications'
    receivers:
      - name: 'slack-notifications'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
            channel: '#alerts'
            send_resolved: true

Create a ServiceMonitor for Istio's control plane:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istio-component-monitor
  namespace: monitoring
  labels:
    release: prometheus
spec:
  jobLabel: istio
  targetLabels: [app]
  selector:
    matchExpressions:
      - key: istio
        operator: In
        values: [pilot]
  namespaceSelector:
    matchNames:
      - istio-system
  endpoints:
    - port: http-monitoring
      interval: 15s
      path: /metrics

And a PodMonitor for Envoy sidecar metrics:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: envoy-stats-monitor
  namespace: monitoring
  labels:
    release: prometheus
spec:
  selector:
    matchExpressions:
      - key: security.istio.io/tlsMode
        operator: Exists
  namespaceSelector:
    any: true
  jobLabel: envoy-stats
  podMetricsEndpoints:
    - path: /stats/prometheus
      interval: 15s
      relabelings:
        - action: keep
          sourceLabels: [__meta_kubernetes_pod_container_name]
          regex: "istio-proxy"
        - action: keep
          sourceLabels: [__meta_kubernetes_pod_annotationpresent_prometheus_io_scrape]
          regex: "true"
        # Drop high-cardinality metrics to reduce storage
        - action: drop
          sourceLabels: [__name__]
          regex: "envoy_.*_bucket"

Kiali: Service Graph and Configuration

Kiali is the management console for Istio. It provides a visual service graph showing how your services communicate, along with real-time traffic metrics, configuration validation, and troubleshooting tools.

Production Kiali Installation

helm repo add kiali https://kiali.org/helm-charts
helm repo update

helm install kiali-server kiali/kiali-server \
  --namespace istio-system \
  --values kiali-values.yaml

Production kiali-values.yaml:

# kiali-values.yaml
auth:
  strategy: openid
  openid:
    client_id: kiali
    issuer_uri: https://auth.example.com
    username_claim: email
    scopes:
      - openid
      - email
    disable_rbac: false

deployment:
  replicas: 2
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi

external_services:
  prometheus:
    url: http://prometheus-kube-prometheus-prometheus.monitoring:9090
    cache_enabled: true
    cache_duration: 10
  grafana:
    enabled: true
    in_cluster_url: http://prometheus-grafana.monitoring:80
    url: https://grafana.example.com
  tracing:
    enabled: true
    in_cluster_url: http://jaeger-query.observability:16685/jaeger
    url: https://jaeger.example.com
    use_grpc: true

server:
  web_root: /kiali
  web_fqdn: kiali.example.com
  web_schema: https

Accessing Kiali

# Port-forward for quick access
kubectl port-forward svc/kiali -n istio-system 20001:20001

# Or use istioctl
istioctl dashboard kiali

What Kiali Shows You

Service Graph --- The most valuable view. It displays:

  • All services in the mesh and their connections
  • Request rate, error rate, and latency on each edge
  • TCP vs HTTP traffic distinction
  • Health status of each service (green, yellow, red)
  • mTLS lock icons showing encryption status
  • Traffic animation showing real-time request flow

Configuration Validation --- Kiali validates your Istio configuration and flags common issues:

  • VirtualService referencing a non-existent DestinationRule subset
  • Conflicting policies (overlapping VirtualServices)
  • Missing sidecar injection
  • Unreachable services
  • Misconfigured gateway hosts

Workload Details --- For each service, view:

  • Inbound and outbound metrics with time-series graphs
  • Istio configuration (VirtualServices, DestinationRules, AuthorizationPolicies)
  • Logs from both the app container and sidecar
  • Envoy proxy configuration and sync status
  • Distributed traces associated with the service

Using Kiali During Incident Response

During an incident, Kiali's service graph is your first stop:

  1. Open the graph view filtered to the affected namespace
  2. Look for red edges (5xx errors) or yellow edges (4xx errors)
  3. Click on the affected edge to see detailed metrics
  4. Check if the error is on the client side or server side using the reporter label
  5. Navigate to the workload details to see individual pod health
  6. Check the Envoy access logs for specific error messages
  7. Link to Jaeger traces for detailed request-level analysis
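Kiali draws its graph from the same Istio metrics stored in Prometheus, so the error-edge check in steps 2 and 3 can also be scripted against the Prometheus HTTP API (`/api/v1/query`). A minimal sketch, assuming a kube-prometheus-stack install; the `PROM_URL` value and the `build_error_edge_query`/`error_edges` helpers are illustrative, not part of Kiali:

```python
import json
import urllib.parse
import urllib.request

# Placeholder for your in-cluster Prometheus address; adjust to your install.
PROM_URL = "http://prometheus-kube-prometheus-prometheus.monitoring:9090"

def build_error_edge_query(namespace):
    """PromQL for the per-edge 5xx ratio -- the same signal Kiali colors red."""
    sel = 'reporter="destination",destination_service_namespace="%s"' % namespace
    by = "by (source_workload, destination_service_name)"
    return (
        'sum(rate(istio_requests_total{%s,response_code=~"5.."}[5m])) %s'
        " / sum(rate(istio_requests_total{%s}[5m])) %s" % (sel, by, sel, by)
    )

def error_edges(namespace, threshold=0.01):
    """Return (source, destination, error_ratio) tuples above the threshold."""
    url = PROM_URL + "/api/v1/query?" + urllib.parse.urlencode(
        {"query": build_error_edge_query(namespace)}
    )
    with urllib.request.urlopen(url) as resp:
        series = json.load(resp)["data"]["result"]
    edges = [
        (s["metric"].get("source_workload", "unknown"),
         s["metric"]["destination_service_name"],
         float(s["value"][1]))
        for s in series
        if float(s["value"][1]) > threshold
    ]
    # Worst edges first, mirroring where you would click in the graph view.
    return sorted(edges, key=lambda edge: -edge[2])
```

Running `error_edges("production")` during an incident gives you the same prioritized list of red edges without waiting for the UI to load.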

Jaeger: Distributed Tracing

Distributed tracing follows a single request as it flows through multiple services, showing exactly where time is spent and where failures occur. This is indispensable for debugging latency issues and understanding the request flow in complex architectures.

How Tracing Works in Istio

Envoy sidecars automatically generate trace spans for every request. However, for traces to connect across service boundaries, your application must propagate specific HTTP headers. Istio does not do this automatically because the application may make multiple outbound calls from a single inbound request, and only the application knows which outbound calls are part of the same trace.

Headers to propagate:

# W3C Trace Context (preferred, modern standard)
traceparent
tracestate

# B3 headers (Zipkin-style, widely supported)
x-b3-traceid
x-b3-spanid
x-b3-parentspanid
x-b3-sampled
x-b3-flags
b3

# Istio-specific
x-request-id

Header Propagation Example (Python/Flask)

import requests
from flask import Flask, request

app = Flask(__name__)

TRACE_HEADERS = [
    'x-request-id', 'x-b3-traceid', 'x-b3-spanid',
    'x-b3-parentspanid', 'x-b3-sampled', 'x-b3-flags',
    'b3', 'traceparent', 'tracestate'
]

def propagate_headers():
    """Extract trace headers from incoming request for propagation."""
    headers = {}
    for header in TRACE_HEADERS:
        value = request.headers.get(header)
        if value:
            headers[header] = value
    return headers

@app.route('/api/orders')
def get_orders():
    headers = propagate_headers()

    # These outbound calls will be part of the same trace
    users = requests.get('http://users-service:8080/api/users', headers=headers)
    inventory = requests.get('http://inventory-service:8080/api/stock', headers=headers)
    pricing = requests.get('http://pricing-service:8080/api/prices', headers=headers)

    return {
        "orders": [],
        "users": users.json(),
        "stock": inventory.json(),
        "prices": pricing.json()
    }

@app.route('/health')
def health():
    return {"status": "healthy"}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)

Header Propagation Example (Go)

package main

import (
    "io"
    "net/http"
)

var traceHeaders = []string{
    "x-request-id", "x-b3-traceid", "x-b3-spanid",
    "x-b3-parentspanid", "x-b3-sampled", "x-b3-flags",
    "b3", "traceparent", "tracestate",
}

func propagateHeaders(r *http.Request) http.Header {
    headers := make(http.Header)
    for _, h := range traceHeaders {
        if val := r.Header.Get(h); val != "" {
            headers.Set(h, val)
        }
    }
    return headers
}

func ordersHandler(w http.ResponseWriter, r *http.Request) {
    headers := propagateHeaders(r)

    // Create a new request with propagated headers
    req, err := http.NewRequest("GET", "http://users-service:8080/api/users", nil)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    req.Header = headers

    client := &http.Client{}
    resp, err := client.Do(req)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    defer resp.Body.Close()

    w.Header().Set("Content-Type", "application/json")
    io.Copy(w, resp.Body)
}

func main() {
    http.HandleFunc("/api/orders", ordersHandler)
    http.ListenAndServe(":8080", nil)
}

Header Propagation Example (Node.js/Express)

const express = require('express');
const axios = require('axios');

const app = express();

const TRACE_HEADERS = [
    'x-request-id', 'x-b3-traceid', 'x-b3-spanid',
    'x-b3-parentspanid', 'x-b3-sampled', 'x-b3-flags',
    'b3', 'traceparent', 'tracestate'
];

function propagateHeaders(req) {
    const headers = {};
    for (const header of TRACE_HEADERS) {
        const value = req.headers[header];
        if (value) {
            headers[header] = value;
        }
    }
    return headers;
}

app.get('/api/orders', async (req, res) => {
    const headers = propagateHeaders(req);

    try {
        const [users, inventory] = await Promise.all([
            axios.get('http://users-service:8080/api/users', { headers }),
            axios.get('http://inventory-service:8080/api/stock', { headers })
        ]);

        res.json({
            orders: [],
            users: users.data,
            stock: inventory.data
        });
    } catch (error) {
        res.status(500).json({ error: error.message });
    }
});

app.listen(8080, () => console.log('Server running on port 8080'));

Sampling Strategies

Not every request needs to be traced. Sampling reduces the overhead and storage costs of tracing.

# Configure sampling in Istio mesh config
meshConfig:
  defaultConfig:
    tracing:
      sampling: 1.0  # 1% of requests (value is percentage, 0-100)
  extensionProviders:
    - name: jaeger
      opentelemetry:
        service: jaeger-collector.observability.svc.cluster.local
        port: 4317

| Sampling Rate | Use Case | Storage Impact |
|---|---|---|
| 100% | Development and debugging | Very high |
| 10% | Staging environments | High |
| 1% | Production (moderate traffic) | Moderate |
| 0.1% | Very high traffic (more than 10k RPS) | Low |
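To pick a rate for your own traffic, it helps to turn the table above into arithmetic. A back-of-the-envelope sketch; the `avg_spans_per_trace` and `bytes_per_span` defaults are rough assumptions, so substitute measurements from your own mesh:

```python
def traces_per_day(requests_per_second, sampling_percent,
                   avg_spans_per_trace=8, bytes_per_span=500):
    """Estimate sampled traces per day and raw span bytes per day.

    avg_spans_per_trace and bytes_per_span are ballpark assumptions,
    not Istio or Jaeger constants -- measure your own workload.
    """
    traces = requests_per_second * 86_400 * (sampling_percent / 100.0)
    return traces, traces * avg_spans_per_trace * bytes_per_span

# 10,000 RPS at 1% sampling
traces, span_bytes = traces_per_day(10_000, 1.0)
```

At 10k RPS even 1% sampling produces millions of traces a day, which is why the highest-traffic tier in the table drops to 0.1%.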

You can also configure per-workload sampling overrides:

apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: high-sampling
  namespace: production
spec:
  selector:
    matchLabels:
      app: payment-service  # Critical service gets higher sampling
  tracing:
    - providers:
        - name: jaeger
      randomSamplingPercentage: 10.0  # 10% for this specific service

Production Jaeger Deployment

The sample Jaeger deployment uses in-memory storage. For production, use Elasticsearch or Cassandra as the backend:

helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update

kubectl create namespace observability

helm install jaeger jaegertracing/jaeger \
  --namespace observability \
  --values jaeger-values.yaml

Production jaeger-values.yaml:

# jaeger-values.yaml
provisionDataStore:
  cassandra: false
  elasticsearch: true

storage:
  type: elasticsearch

elasticsearch:
  replicas: 3
  minimumMasterNodes: 2
  persistence:
    enabled: true
    size: 200Gi
    storageClassName: gp3
  resources:
    requests:
      cpu: "1"
      memory: 4Gi
    limits:
      cpu: "2"
      memory: 8Gi
  esJavaOpts: "-Xms4g -Xmx4g"

collector:
  replicaCount: 2
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: "1"
      memory: 1Gi
  service:
    grpc:
      port: 14250
    otlp:
      grpc:
        port: 4317
      http:
        port: 4318

query:
  replicaCount: 2
  resources:
    requests:
      cpu: 250m
      memory: 256Mi
  service:
    port: 16686

# Configure index cleaner to manage storage
esIndexCleaner:
  enabled: true
  schedule: "55 23 * * *"
  numberOfDays: 7
  resources:
    requests:
      cpu: 100m
      memory: 128Mi

Accessing Jaeger

# Port-forward
kubectl port-forward svc/jaeger-query -n observability 16686:16686

# Or use istioctl
istioctl dashboard jaeger

In the Jaeger UI, you can:

  • Search traces by service, operation, duration, and tags
  • View the full call graph for a single request
  • Compare two traces side-by-side to identify performance regressions
  • Find the exact span where an error occurred
  • Analyze the critical path through your request flow

Using Jaeger for Debugging

When investigating a slow request:

  1. Search for traces with duration above your SLA threshold
  2. Open a slow trace and look at the waterfall view
  3. Identify which service span consumes the most time
  4. Check if the slow span has error tags
  5. Look at the span's logs for additional context
  6. Compare with a fast trace of the same operation to spot differences
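Step 1 can be automated against Jaeger's query service. The sketch below uses the `GET /api/traces` endpoint that the Jaeger UI itself calls; note this API is internal and not a stable contract, and the `JAEGER_URL` value and helper names are assumptions for illustration:

```python
import json
import urllib.parse
import urllib.request

# Placeholder for your Jaeger query service; adjust namespace/port to your install.
JAEGER_URL = "http://jaeger-query.observability:16686"

def summarize_trace(trace):
    """Reduce one Jaeger trace JSON object to total duration and slowest span."""
    start = min(s["startTime"] for s in trace["spans"])               # microseconds
    end = max(s["startTime"] + s["duration"] for s in trace["spans"])
    slowest = max(trace["spans"], key=lambda s: s["duration"])
    return {
        "trace_id": trace["traceID"],
        "duration_ms": (end - start) / 1000.0,
        "slowest_span": slowest["operationName"],
    }

def slow_traces(service, min_duration_ms, lookback="1h", limit=20):
    """Fetch traces slower than min_duration_ms, slowest first."""
    params = urllib.parse.urlencode({
        "service": service,
        "minDuration": "%dms" % min_duration_ms,
        "lookback": lookback,
        "limit": limit,
    })
    with urllib.request.urlopen(JAEGER_URL + "/api/traces?" + params) as resp:
        data = json.load(resp)["data"]
    return sorted((summarize_trace(t) for t in data),
                  key=lambda s: -s["duration_ms"])
```

`slow_traces("api-service", 1000)` returns the slowest recent traces with the span most likely responsible, which is usually enough to decide which waterfall to open first.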

Prometheus: Istio Metrics

Key Istio Metrics and Queries

Query these metrics in Prometheus to understand mesh behavior:

# Request rate by service (requests per second)
sum(rate(istio_requests_total{reporter="destination"}[5m]))
by (destination_service_name)

# Error rate (5xx responses) by service
sum(rate(istio_requests_total{reporter="destination",response_code=~"5.."}[5m]))
by (destination_service_name)
/
sum(rate(istio_requests_total{reporter="destination"}[5m]))
by (destination_service_name)

# P50 latency by service
histogram_quantile(0.50,
  sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
  by (destination_service_name, le)
)

# P95 latency by service
histogram_quantile(0.95,
  sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
  by (destination_service_name, le)
)

# P99 latency by service
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
  by (destination_service_name, le)
)

# Request volume by source and destination (service dependency map)
sum(rate(istio_requests_total{reporter="source"}[5m]))
by (source_workload, destination_service_name)

# TCP bytes sent between services
sum(rate(istio_tcp_sent_bytes_total[5m]))
by (source_workload, destination_workload)

# Connection pool utilization
sum(envoy_cluster_upstream_cx_active)
by (cluster_name)

# Circuit breaker ejections
sum(rate(envoy_cluster_outlier_detection_ejections_total[5m]))
by (cluster_name)

# Sidecar resource usage
sum(container_memory_usage_bytes{container="istio-proxy"})
by (namespace, pod)

Custom Metrics with Telemetry API

Istio's Telemetry API lets you customize which metrics are collected and what labels they include:

apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: custom-metrics
  namespace: istio-system  # Mesh-wide
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        # Add request host to REQUEST_COUNT metric
        - match:
            metric: REQUEST_COUNT
            mode: CLIENT_AND_SERVER
          tagOverrides:
            request_host:
              operation: UPSERT
              value: "request.host"
            request_path:
              operation: UPSERT
              value: "request.url_path"
        # Ensure REQUEST_DURATION is always enabled
        - match:
            metric: REQUEST_DURATION
          disabled: false

Reduce Metric Cardinality

High cardinality is the biggest performance threat to Prometheus in a mesh environment. Disable metrics you do not need:

apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: reduce-metrics
  namespace: istio-system
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        # Disable byte-counting metrics (rarely needed)
        - match:
            metric: REQUEST_BYTES
          disabled: true
        - match:
            metric: RESPONSE_BYTES
          disabled: true
        # Disable TCP metrics if you primarily use HTTP
        - match:
            metric: TCP_OPENED_CONNECTIONS
          disabled: true
        - match:
            metric: TCP_CLOSED_CONNECTIONS
          disabled: true
        # Remove high-cardinality labels from remaining metrics
        - match:
            metric: REQUEST_COUNT
          tagOverrides:
            request_protocol:
              operation: REMOVE
            connection_security_policy:
              operation: REMOVE

Estimating Metric Cardinality

Before deploying to production, estimate the number of time series your mesh will generate:

Time Series = num_metrics * num_source_services * num_dest_services * num_response_codes * num_methods * num_reporters

Example:
  6 metrics * 50 services * 50 services * 5 response codes * 4 methods * 2 reporters
  = 600,000 time series (with all labels)

After optimization (reducing labels):
  4 metrics * 50 services * 50 services * 3 response code groups * 2 reporters
  = 60,000 time series (much more manageable)
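The multiplication above is easy to rerun for your own mesh size. A tiny helper (the function is hypothetical; the formula is the one from the text):

```python
def series_estimate(metrics, services, response_codes, methods=1, reporters=2):
    """Multiply out the label dimensions, as in the worked example above."""
    return metrics * services * services * response_codes * methods * reporters

# All default labels vs. the reduced-label configuration
all_labels = series_estimate(6, 50, response_codes=5, methods=4)
reduced = series_estimate(4, 50, response_codes=3)
```

Plug in your real service count before sizing Prometheus storage; the estimate grows quadratically with the number of services, so a 200-service mesh is 16x worse than a 50-service one.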

Grafana: Pre-Built Istio Dashboards

Istio ships with several Grafana dashboards that provide immediate insight into mesh health.

Available Dashboards

| Dashboard | ID | Shows |
|---|---|---|
| Mesh Dashboard | 7639 | Global view of all services --- request volume, success rate, latency |
| Service Dashboard | 7636 | Detailed metrics for a single service --- inbound/outbound traffic |
| Workload Dashboard | 7630 | Per-pod metrics for a specific workload |
| Performance Dashboard | 11829 | Resource usage of Envoy proxies and istiod |
| Control Plane Dashboard | 7645 | istiod health --- config push latency, xDS connections, errors |

Accessing Grafana

# Port-forward
kubectl port-forward svc/grafana -n monitoring 3000:3000

# Or use istioctl
istioctl dashboard grafana

Importing Dashboards into Existing Grafana

If you already have a Grafana instance, import the Istio dashboards:

# Download dashboard JSON files
DASHBOARDS=(
  "7639"   # Istio Mesh Dashboard
  "7636"   # Istio Service Dashboard
  "7630"   # Istio Workload Dashboard
  "11829"  # Istio Performance Dashboard
  "7645"   # Istio Control Plane Dashboard
)

for id in "${DASHBOARDS[@]}"; do
  curl -s "https://grafana.com/api/dashboards/${id}/revisions/latest/download" \
    -o "istio-dashboard-${id}.json"
done

# Import via Grafana API
for file in istio-dashboard-*.json; do
  curl -X POST http://admin:admin@grafana:3000/api/dashboards/db \
    -H "Content-Type: application/json" \
    -d "{
      \"dashboard\": $(cat $file),
      \"folderId\": 0,
      \"overwrite\": true
    }"
done

Custom Dashboard: Service Health Overview

Create a custom dashboard that shows the most important metrics at a glance:

{
  "title": "Service Mesh Health",
  "panels": [
    {
      "title": "Global Success Rate",
      "type": "gauge",
      "targets": [{
        "expr": "sum(rate(istio_requests_total{reporter='destination',response_code!~'5..'}[5m])) / sum(rate(istio_requests_total{reporter='destination'}[5m])) * 100"
      }],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "steps": [
              {"color": "red", "value": null},
              {"color": "orange", "value": 95},
              {"color": "green", "value": 99}
            ]
          },
          "unit": "percent"
        }
      }
    },
    {
      "title": "Services with Errors",
      "type": "table",
      "targets": [{
        "expr": "topk(10, sum(rate(istio_requests_total{reporter='destination',response_code=~'5..'}[5m])) by (destination_service_name))"
      }]
    }
  ]
}

Access Logging Configuration

Access logs record every request passing through the mesh. They are invaluable for debugging but generate significant volume in high-traffic environments.

Enable Access Logging

# In IstioOperator or Helm values
meshConfig:
  accessLogFile: /dev/stdout
  accessLogEncoding: JSON
  accessLogFormat: |
    {
      "start_time": "%START_TIME%",
      "method": "%REQ(:METHOD)%",
      "path": "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%",
      "protocol": "%PROTOCOL%",
      "response_code": "%RESPONSE_CODE%",
      "response_flags": "%RESPONSE_FLAGS%",
      "upstream_host": "%UPSTREAM_HOST%",
      "upstream_cluster": "%UPSTREAM_CLUSTER%",
      "upstream_service_time": "%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%",
      "duration": "%DURATION%",
      "request_id": "%REQ(X-REQUEST-ID)%",
      "source_address": "%DOWNSTREAM_REMOTE_ADDRESS%",
      "destination_address": "%UPSTREAM_HOST%",
      "user_agent": "%REQ(USER-AGENT)%",
      "trace_id": "%REQ(X-B3-TRACEID)%",
      "authority": "%REQ(:AUTHORITY)%",
      "bytes_received": "%BYTES_RECEIVED%",
      "bytes_sent": "%BYTES_SENT%"
    }

Response Flags Reference

The response_flags field in access logs tells you exactly what went wrong:

| Flag | Meaning | Common Cause |
|---|---|---|
| UH | No healthy upstream | All endpoints ejected by circuit breaker |
| UF | Upstream connection failure | Service crashed or unreachable |
| UO | Upstream overflow (circuit breaking) | Connection pool exhausted |
| NR | No route configured | Missing VirtualService or DestinationRule |
| URX | Upstream retry limit exceeded | All retries failed |
| DT | Downstream request timeout | Client gave up waiting |
| UT | Upstream request timeout | Backend too slow |
| DC | Downstream connection termination | Client disconnected |
| RL | Rate limited | Local or global rate limit hit |
| UAEX | Unauthorized (ext authz) | External auth provider denied |
| RLSE | Rate limit service error | Rate limit service unreachable |
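Because the JSON log format above puts the flags in a stable `response_flags` field, tallying them across a pod's logs is a one-screen script. A sketch (pipe in `kubectl logs deploy/api-service -c istio-proxy`; the helper name is illustrative):

```python
import json
from collections import Counter

def count_response_flags(log_lines):
    """Tally response_flags from JSON access logs (Envoy writes '-' when none)."""
    flags = Counter()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON proxy output such as startup messages
        flag = entry.get("response_flags")
        if flag and flag != "-":
            flags[flag] += 1
    return flags
```

A sudden spike in `UO` or `UF` counts points you straight at circuit breaking or upstream failures before you start reading individual log lines.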

Selective Access Logging with Telemetry API

Enable access logs only for specific workloads or conditions to reduce noise:

apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: access-logs
  namespace: production
spec:
  selector:
    matchLabels:
      app: api-service
  accessLogging:
    - providers:
        - name: envoy
      filter:
        expression: "response.code >= 400"  # Only log errors
---
# Enable full logging for the payment service (audit requirement)
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: payment-audit-logs
  namespace: production
spec:
  selector:
    matchLabels:
      app: payment-service
  accessLogging:
    - providers:
        - name: envoy
      # No filter = log everything

Envoy Stats and Deep Debugging

Each Envoy proxy exposes detailed statistics about its operation:

# View all Envoy stats for a specific pod
kubectl exec deploy/api-service -c istio-proxy -- \
  pilot-agent request GET stats

# Filter for specific stats
kubectl exec deploy/api-service -c istio-proxy -- \
  pilot-agent request GET "stats?filter=cluster.outbound"

# View active clusters (upstream services)
kubectl exec deploy/api-service -c istio-proxy -- \
  pilot-agent request GET clusters

# View listeners (what ports Envoy is listening on)
istioctl proxy-config listeners deploy/api-service

# View routes (how requests are routed)
istioctl proxy-config routes deploy/api-service

# View endpoints (which pod IPs are available)
istioctl proxy-config endpoints deploy/api-service

# Full configuration dump (large output)
istioctl proxy-config all deploy/api-service -o json > proxy-dump.json

Key Envoy stats to monitor:

| Stat | Meaning | Alert Threshold |
|---|---|---|
| upstream_cx_active | Active connections to upstream | Near connection pool max |
| upstream_rq_pending_active | Requests waiting in queue | Consistently above 0 |
| upstream_rq_retry | Number of retries | High ratio to total requests |
| upstream_rq_timeout | Number of timeouts | Any sustained increase |
| upstream_cx_connect_fail | Connection failures | Any non-zero value |
| upstream_rq_pending_overflow | Rejected due to circuit breaker | Any non-zero value |
| membership_healthy | Healthy endpoints in cluster | Below expected count |
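The `stats` output is plain `name: value` lines, so the "high ratio to total requests" check from the table can be computed directly. A sketch assuming Envoy's standard `cluster.<name>.upstream_rq_total` and `upstream_rq_retry` counters; the helper names are illustrative:

```python
def parse_envoy_stats(text):
    """Parse the 'name: value' lines emitted by `pilot-agent request GET stats`."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        # Keep integer counters/gauges; skip histogram summary lines.
        if sep and value.strip().lstrip("-").isdigit():
            stats[name] = int(value)
    return stats

def retry_ratio(stats, cluster):
    """upstream_rq_retry relative to upstream_rq_total for one cluster."""
    prefix = "cluster.%s." % cluster
    total = stats.get(prefix + "upstream_rq_total", 0)
    retries = stats.get(prefix + "upstream_rq_retry", 0)
    return retries / total if total else 0.0
```

Feed it the output of the `kubectl exec ... pilot-agent request GET stats` command shown earlier and alert when the ratio stays high for a sustained window.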

Setting Up Alerting for Mesh Health

Create Prometheus alerting rules for critical mesh conditions:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: istio-alerts
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: istio.service.rules
      rules:
        # High error rate on a service
        - alert: IstioHighErrorRate
          expr: |
            (
              sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name, namespace)
              /
              sum(rate(istio_requests_total[5m])) by (destination_service_name, namespace)
            ) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High 5xx error rate on {{ $labels.destination_service_name }}"
            description: "Error rate is {{ $value | humanizePercentage }} in namespace {{ $labels.namespace }}."
            runbook_url: "https://runbooks.company.com/istio-high-error-rate"

        # High P99 latency
        - alert: IstioHighLatency
          expr: |
            histogram_quantile(0.99,
              sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
              by (destination_service_name, le)
            ) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High P99 latency on {{ $labels.destination_service_name }}"
            description: "P99 latency is {{ $value }}ms."

        # Request volume drop (potential outage indicator)
        - alert: IstioRequestVolumeDrop
          expr: |
            (
              sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)
              /
              sum(rate(istio_requests_total{reporter="destination"}[5m] offset 1h)) by (destination_service_name)
            ) < 0.5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Request volume dropped 50%+ for {{ $labels.destination_service_name }}"

    - name: istio.mesh.rules
      rules:
        # Sidecar injection missing
        - alert: IstioPodWithoutSidecar
          expr: |
            count(
              kube_pod_info{namespace!~"kube-system|istio-system|monitoring"}
            ) by (namespace)
            -
            count(
              kube_pod_container_info{container="istio-proxy",namespace!~"kube-system|istio-system|monitoring"}
            ) by (namespace)
            > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pods without Istio sidecar in {{ $labels.namespace }}"

        # Control plane health
        - alert: IstiodUnhealthy
          expr: |
            sum(rate(pilot_xds_pushes{type="cds"}[5m])) == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Istiod is not pushing CDS configuration updates"
            runbook_url: "https://runbooks.company.com/istiod-unhealthy"

        # Config push errors
        - alert: IstiodPushErrors
          expr: |
            sum(rate(pilot_xds_push_errors[5m])) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Istiod is experiencing configuration push errors"

        # Proxy config out of sync
        - alert: IstioProxyConfigStale
          expr: |
            sum(rate(pilot_proxy_convergence_time_bucket{le="30"}[10m])) / sum(rate(pilot_proxy_convergence_time_count[10m])) < 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "More than 10% of proxy configs are taking over 30s to converge"

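These rules assume the default Istio standard metrics are flowing. In high-traffic meshes you will also want to shape that telemetry at the source rather than only alerting on it. A minimal Telemetry API sketch (the resource name is illustrative, and the exact `apiVersion` and provider name depend on your Istio version and mesh config) that sets mesh-wide 1% trace sampling and removes a label you do not query:

```yaml
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: mesh-default            # mesh-wide defaults live in the Istio root namespace
  namespace: istio-system
spec:
  tracing:
    - randomSamplingPercentage: 1.0   # 1% sampling; raise per-namespace for critical paths
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: REQUEST_COUNT     # istio_requests_total
          tagOverrides:
            request_protocol:         # example label to drop if it is never queried
              operation: REMOVE
```

A namespace-scoped `Telemetry` resource with a higher `randomSamplingPercentage` can then override the mesh default for critical paths such as payments.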
Production Observability Best Practices

  1. Set sampling rates appropriately --- 100% tracing in production will overwhelm your Jaeger backend. Start at 1% and adjust based on traffic volume. Increase sampling for critical paths like payment processing.

  2. Use the Telemetry API to reduce metric cardinality --- High cardinality labels (like request path with path parameters) can cause Prometheus performance issues. Disable metrics you do not query and remove labels you do not need.

  3. Separate observability namespaces --- Run Prometheus, Grafana, Jaeger, and Kiali in a dedicated monitoring or observability namespace with its own resource quotas. This prevents observability tools from competing with application workloads for resources.

  4. Retain traces strategically --- Keep traces for 7 days in production. For specific incident investigations, export relevant traces to long-term storage before they expire.

  5. Set up dashboards before you need them --- Import Istio's Grafana dashboards on day one. When an incident happens, you want the dashboards already there. Create custom dashboards for your specific SLOs.

  6. Monitor the control plane --- Istiod configuration push failures or high latency mean your routing rules and security policies are not reaching the sidecars. Alert on pilot_xds_push_time and pilot_xds_push_errors.

  7. Propagate trace headers in every service --- This is the most commonly missed step. Without header propagation, you get disconnected single-hop spans instead of end-to-end traces. Add header propagation to your service template or shared middleware.

  8. Use Kiali for day-to-day operations --- Before diving into Prometheus queries, check Kiali's service graph. It often reveals the problem immediately through visual traffic flow and error highlighting.

  9. Set up log aggregation --- Send Envoy access logs to a centralized system (Loki, Elasticsearch, Datadog). When traces and metrics point to a problem, access logs give you the request-level detail needed for root cause analysis.

  10. Right-size your observability infrastructure --- Monitor the resource usage of Prometheus, Jaeger, and Elasticsearch themselves. These tools can become resource-hungry as your mesh grows. Plan for 2-3x storage growth over 6 months.
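Practice 7 deserves a concrete sketch, since it is the step teams most often miss. The header names below are the B3 and W3C trace-context headers Envoy-based tracing commonly relies on; the helper function itself is illustrative (not an Istio API) and is meant to live in whatever shared middleware layer your services already use:

```python
# Trace-header propagation sketch: copy the inbound trace context so it can
# be attached to every outbound call made while handling the request.
# Header list: B3 headers plus W3C traceparent/tracestate and x-request-id.

TRACE_HEADERS = [
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "traceparent",
    "tracestate",
]

def extract_trace_headers(incoming_headers: dict) -> dict:
    """Return only the trace-context headers from an inbound request,
    normalized to lowercase, ready to merge into outbound request headers."""
    lowered = {k.lower(): v for k, v in incoming_headers.items()}
    return {h: lowered[h] for h in TRACE_HEADERS if h in lowered}

# Example: forwarding the context on a downstream call
inbound = {
    "X-B3-TraceId": "463ac35c9f6413ad",
    "X-B3-SpanId": "a2fb4a1d1a96d312",
    "Content-Type": "application/json",
}
outbound = extract_trace_headers(inbound)
# outbound carries only the trace context, not unrelated headers
```

Whether your mesh uses B3 or W3C headers depends on the configured tracing provider, so propagating both sets, as above, is the safe default.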

Summary

Istio's observability stack transforms how you understand your microservices architecture. Prometheus metrics give you the numbers, Jaeger traces give you the request flow, Kiali gives you the visual map, and Grafana ties it all together in dashboards. The key to making this work in production is proper sampling configuration, strategic metric collection to manage cardinality, and ensuring every service propagates tracing headers. Deploy the observability addons alongside Istio from day one --- retrofitting observability is always harder than starting with it. Invest time in building alerting rules that catch problems early, and train your team to use Kiali's service graph as the first step in incident response. The combination of automatic telemetry generation and a well-configured observability stack gives you unprecedented visibility into your distributed system, turning "we have no idea why it is slow" into "the payment service's database connection pool is saturated at 14:23."
