
OpenTelemetry Collector: Deploying Your Observability Pipeline the Right Way

Riku Tanaka · 8 min read

The Vendor Lock-In Problem Observability Forgot

Every time I've watched a team migrate from one monitoring backend to another, it's been painful. Re-instrument every service, update libraries, rewrite exporters, hope nothing breaks in production. It takes months.

OpenTelemetry solves this by separating instrumentation from destination. You instrument once, then route data wherever you need it. The Collector is the central piece — a vendor-agnostic pipeline that receives, processes, and exports telemetry data. The Google SRE book emphasizes that your monitoring system should be treated as a production service itself. The Collector deserves that same rigor.

Let me walk through deploying and configuring it properly.

Architecture: Agent vs. Gateway

There are two deployment patterns, and most production setups use both.

Agent mode: A Collector instance runs as a sidecar or DaemonSet on every node. It collects telemetry from local applications with minimal network overhead.

Gateway mode: A standalone Collector deployment receives data from agents, applies heavy processing (batching, sampling, enrichment), and exports to backends.

┌───────────┐     ┌──────────────┐     ┌───────────────┐
│  App Pod  │────▶│  Agent       │────▶│  Gateway      │────▶ Backends
│  (OTLP)   │     │  (DaemonSet) │     │  (Deployment) │      (Prometheus,
└───────────┘     └──────────────┘     └───────────────┘       Jaeger, Loki)

Start with agents. Add gateways when you need centralized processing or tail-based sampling.

Deploying the Agent as a DaemonSet

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector-agent
  namespace: observability
spec:
  selector:
    matchLabels:
      app: otel-collector-agent
  template:
    metadata:
      labels:
        app: otel-collector-agent
    spec:
      serviceAccountName: otel-collector
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:0.96.0
          args: ["--config=/etc/otel/config.yaml"]
          ports:
            - containerPort: 4317   # OTLP gRPC
            - containerPort: 4318   # OTLP HTTP
            - containerPort: 8888   # Collector metrics
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          volumeMounts:
            - name: config
              mountPath: /etc/otel
      volumes:
        - name: config
          configMap:
            name: otel-agent-config

Resource limits matter here. An unbounded Collector will happily consume all available memory during a traffic spike. Set limits, then monitor them — I'll show you how below.

The Pipeline Configuration

The Collector's config has four sections: receivers, processors, exporters, and service. Think of it as a DAG — data flows from receivers through processors to exporters.

# otel-agent-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  # Scrape the Collector's own metrics
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-collector
          scrape_interval: 10s
          static_configs:
            - targets: ["localhost:8888"]

processors:
  batch:
    send_batch_size: 1024
    send_batch_max_size: 2048
    timeout: 5s

  memory_limiter:
    check_interval: 1s
    limit_mib: 400
    spike_limit_mib: 100

  resource:
    attributes:
      - key: k8s.cluster.name
        value: production
        action: upsert

exporters:
  otlphttp:
    endpoint: http://otel-gateway.observability:4318
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s

  prometheus:
    endpoint: 0.0.0.0:9090
    namespace: otel

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlphttp]

  telemetry:
    metrics:
      address: 0.0.0.0:8888

Two things to notice. First, memory_limiter appears before batch in every pipeline. This is deliberate — you want the Collector to start refusing or dropping data before the kernel OOM-kills the process. Second, processor order matters: the Collector runs each pipeline's processors sequentially, in the order listed.
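The Collector checks this wiring at startup (newer builds also ship a `validate` subcommand for offline checks). To illustrate the rule, here's a rough Python sketch — a plain dict stands in for the parsed YAML — that verifies every component a pipeline references is actually defined at the top level:

```python
# Sketch: check that every receiver/processor/exporter referenced in
# service.pipelines is defined at the top level, mirroring the validation
# the Collector performs at startup. The dict stands in for parsed YAML.
config = {
    "receivers": {"otlp": {}, "prometheus": {}},
    "processors": {"batch": {}, "memory_limiter": {}, "resource": {}},
    "exporters": {"otlphttp": {}, "prometheus": {}},
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["memory_limiter", "batch"],
                "exporters": ["otlphttp"],
            },
        },
    },
}

def undefined_components(cfg):
    """Return (pipeline, section, name) for every dangling reference."""
    problems = []
    for pname, pipeline in cfg["service"]["pipelines"].items():
        for section in ("receivers", "processors", "exporters"):
            for name in pipeline.get(section, []):
                # Suffixed names like "otlp/jaeger" must be defined
                # under that exact key in the top-level section.
                if name not in cfg.get(section, {}):
                    problems.append((pname, section, name))
    return problems

print(undefined_components(config))  # [] — everything referenced is defined
```

A typo in a pipeline entry (say, referencing `otlp/jaeger` without defining it) shows up immediately instead of at deploy time.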

The Gateway Configuration

The gateway handles heavier processing — tail-based sampling, attribute enrichment, and fan-out to multiple backends.

# otel-gateway-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500
    spike_limit_mib: 512

  batch:
    send_batch_size: 4096
    timeout: 10s

  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: errors-always
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 1000
      - name: probabilistic-sample
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

  attributes:
    actions:
      - key: environment
        value: production
        action: upsert
      - key: collector.version
        value: "0.96.0"
        action: insert

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector.observability:4317
    tls:
      insecure: true

  prometheusremotewrite:
    endpoint: http://prometheus.observability:9090/api/v1/write
    resource_to_telemetry_conversion:
      enabled: true

  loki:
    endpoint: http://loki.observability:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, attributes, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [loki]

The tail_sampling processor is the gateway's most valuable feature. It keeps 100% of error traces and slow traces, while sampling only 10% of normal traffic. This gives you full visibility into problems without storing terabytes of healthy traces.
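The policies OR together: a trace is kept if any policy matches it. With some hypothetical traffic figures (the rates below are assumptions, not measurements), the arithmetic looks like this:

```python
# Back-of-the-envelope retention under the three policies above.
# Traffic figures are hypothetical; a trace is kept if ANY policy matches.
total_traces = 100_000          # traces per hour, assumed
error_rate = 0.02               # 2% of traces contain an error span
slow_rate = 0.03                # 3% exceed the 1000 ms latency threshold
prob_pct = 0.10                 # probabilistic policy keeps 10%

always_kept = total_traces * (error_rate + slow_rate)  # assume disjoint sets
remainder = total_traces - always_kept
kept = always_kept + remainder * prob_pct

print(int(kept))                     # 14500 traces stored
print(f"{kept / total_traces:.1%}")  # 14.5% effective retention
```

Roughly 85% of the storage bill disappears while every error and every slow trace survives — that's the trade the gateway buys you.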

Monitoring the Collector Itself

A Collector that silently drops data is worse than no Collector. Monitor it with the same rigor as any production service.

# Dropped spans — this should be zero under normal conditions
sum(rate(otelcol_exporter_send_failed_spans_total[5m])) by (exporter)

# Queue saturation — approaching 1.0 means you're about to drop data
otelcol_exporter_queue_size / otelcol_exporter_queue_capacity

# Collector memory usage relative to the 400 MiB memory_limiter threshold
otelcol_process_memory_rss / (400 * 1024 * 1024)

# Receiver accepted vs refused — refused means backpressure is active
sum(rate(otelcol_receiver_accepted_spans_total[5m])) by (receiver)
sum(rate(otelcol_receiver_refused_spans_total[5m])) by (receiver)

Create alerting rules for these. If otelcol_exporter_send_failed_spans_total is increasing, your pipeline has a problem and you're losing observability data — the one time you need it most.

groups:
  - name: otel-collector
    rules:
      - alert: OtelCollectorExportFailures
        expr: |
          sum(rate(otelcol_exporter_send_failed_spans_total[5m])) by (exporter) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "OTel Collector failing to export spans via {{ $labels.exporter }}"
          runbook: "https://wiki.internal/runbooks/otel-collector-export-failure"

      - alert: OtelCollectorQueueSaturation
        expr: |
          otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "OTel Collector export queue above 80% capacity"

Common Mistakes

Running without memory_limiter. The Collector will get OOM-killed during spikes. Always set it, and set it below your container memory limit to leave headroom.

Batching too aggressively. A send_batch_size of 50,000 looks efficient on paper, but it increases latency and memory usage. Start at 1024, measure, and adjust.

Forgetting retry configuration. Network hiccups between the Collector and backends are normal. Without retry_on_failure, those become data loss.

Not versioning your config. Treat Collector configs like application code. Store them in Git, review changes, deploy through CI/CD. A bad config change can blind your entire observability stack.

Kubernetes Auto-Instrumentation

The OpenTelemetry Operator can inject instrumentation into your applications without code changes.

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrumentation
  namespace: observability
spec:
  exporter:
    endpoint: http://otel-collector-agent.observability:4317
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.25"  # sample 25% of root traces; children follow the parent's decision
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:0.44b0
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:2.3.0
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-node:0.49.1

Annotate your deployment to enable auto-instrumentation:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-python: "observability/auto-instrumentation"
    spec:
      containers:
        - name: api
          image: api-service:v1.4.0

The operator injects an init container that loads the instrumentation agent. Your application produces traces, metrics, and (for some languages) logs without any SDK integration. This is the fastest path from zero to full observability.

Resource Attributes and Semantic Conventions

Consistent resource attributes make your telemetry queryable across services.

processors:
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert
      - key: service.namespace
        value: payments
        action: upsert

  k8sattributes:
    auth_type: "serviceAccount"
    passthrough: false
    extract:
      metadata:
        - k8s.pod.name
        - k8s.namespace.name
        - k8s.deployment.name
        - k8s.node.name
      labels:
        - tag_name: app.kubernetes.io/name
          key: app.kubernetes.io/name
        - tag_name: app.kubernetes.io/version
          key: app.kubernetes.io/version
    pod_association:
      - sources:
          - from: resource_attribute
            name: k8s.pod.ip

The k8sattributes processor enriches every span and metric with Kubernetes metadata. When you query Jaeger for slow traces, you see the pod name, node, and deployment — not just a service name.

Scaling the Collector

Horizontal Pod Autoscaling for Gateways

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-gateway
  namespace: observability
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-collector-gateway
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: otelcol_exporter_queue_size
        target:
          type: AverageValue
          averageValue: "500"

Scale gateways based on queue depth. When the export queue grows, add more Collector instances to drain it before data is dropped. Note that a Pods-type metric like otelcol_exporter_queue_size must be exposed through a custom metrics adapter (for example, prometheus-adapter) before the HPA can read it.

Load Balancing Between Agents and Gateways

apiVersion: v1
kind: Service
metadata:
  name: otel-gateway
  namespace: observability
spec:
  type: ClusterIP
  selector:
    app: otel-collector-gateway
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
    - name: otlp-http
      port: 4318
      targetPort: 4318

Agents send to the otel-gateway Service, and Kubernetes distributes requests across gateway pods — but only per connection. gRPC multiplexes everything over a few long-lived connections, so each agent tends to stick to a single gateway pod. To spread load, use a headless Service with client-side load balancing, put an HTTP/2-aware proxy or service mesh in front of the gateways, or have agents export over OTLP/HTTP, whose shorter-lived connections distribute more evenly.
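Since gRPC multiplexes many requests over one long-lived connection, plain ClusterIP balancing can pin each agent to a single gateway pod. A headless Service is a common workaround — a sketch with assumed names:

```yaml
# Hypothetical headless variant of the gateway Service. clusterIP: None makes
# DNS return every pod IP, so gRPC clients can balance across gateways.
apiVersion: v1
kind: Service
metadata:
  name: otel-gateway-headless
  namespace: observability
spec:
  clusterIP: None
  selector:
    app: otel-collector-gateway
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
```

Agents would then point their OTLP exporter at otel-gateway-headless.observability:4317 and rely on the gRPC client's round-robin balancing across the resolved pod IPs.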

Troubleshooting

Problem: Traces appear in Jaeger but metrics don't show in Prometheus. Fix: Check that your metrics pipeline is configured. Traces and metrics are separate pipelines in the Collector. A working trace pipeline doesn't mean metrics are flowing.

Problem: High memory usage on agents. Fix: Reduce send_batch_size and lower the memory_limiter threshold. Also check if tail_sampling is running on agents — it should only run on gateways because it buffers entire traces.

Problem: Duplicate spans in Jaeger. Fix: Ensure only one Collector pipeline sends to Jaeger. If both agents and gateways export to Jaeger, you get duplicates. Agents should export to the gateway, and only the gateway exports to backends.

Problem: Collector pods restart with OOMKilled. Fix: The memory_limiter processor must be set below the container memory limit. If your container limit is 512Mi, set limit_mib: 400 and spike_limit_mib: 80. The processor drops data gracefully instead of crashing.

Problem: Missing traces for short-lived services (CronJobs, Lambdas). Fix: Short-lived processes can exit before the SDK's batch span processor flushes. Shorten the batch delay (OTEL_BSP_SCHEDULE_DELAY=1000, in milliseconds), lower the OTLP export timeout, and flush the tracer provider before the process exits.
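For a Kubernetes CronJob, that tuning lands in the container's environment. A sketch with assumed names (the job name and image are hypothetical; the env vars are standard OpenTelemetry SDK settings):

```yaml
# Hypothetical CronJob container snippet: shorten the batch delay and the
# export timeout so pending spans flush before the process exits.
containers:
  - name: nightly-report-job
    image: report-job:v2.1.0              # assumed image
    env:
      - name: OTEL_BSP_SCHEDULE_DELAY     # batch span processor delay, ms
        value: "1000"
      - name: OTEL_EXPORTER_OTLP_TIMEOUT  # per-export timeout, ms
        value: "5000"
```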

Start Simple, Then Grow

Deploy agents first with OTLP receivers and a single exporter. Get data flowing. Then add a gateway when you need sampling or multi-backend routing. Add processors one at a time and validate that data still arrives correctly.

The Collector is infrastructure. Treat it like infrastructure — monitor it, alert on it, and give it the resources it needs to handle your worst day, not just your average one.
