OpenTelemetry Collector: Deploying Your Observability Pipeline the Right Way
The Vendor Lock-In Problem Observability Forgot
Every time I've watched a team migrate from one monitoring backend to another, it's been painful. Re-instrument every service, update libraries, rewrite exporters, hope nothing breaks in production. It takes months.
OpenTelemetry solves this by separating instrumentation from destination. You instrument once, then route data wherever you need it. The Collector is the central piece — a vendor-agnostic pipeline that receives, processes, and exports telemetry data. The Google SRE book emphasizes that your monitoring system should be treated as a production service itself. The Collector deserves that same rigor.
Let me walk through deploying and configuring it properly.
Architecture: Agent vs. Gateway
There are two deployment patterns, and most production setups use both.
Agent mode: A Collector instance runs close to each workload, either as a sidecar per pod or (more commonly) as a DaemonSet pod on every node. It collects telemetry from local applications with minimal network overhead.
Gateway mode: A standalone Collector deployment receives data from agents, applies heavy processing (batching, sampling, enrichment), and exports to backends.
┌─────────┐ ┌─────────────┐ ┌──────────────┐
│ App Pod │────▶│ Agent │────▶│ Gateway │────▶ Backends
│ (OTLP) │ │ (DaemonSet) │ │ (Deployment) │ (Prometheus,
└─────────┘ └─────────────┘ └──────────────┘ Jaeger, Loki)
Start with agents. Add gateways when you need centralized processing or tail-based sampling.
Deploying the Agent as a DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector-agent
  namespace: observability
spec:
  selector:
    matchLabels:
      app: otel-collector-agent
  template:
    metadata:
      labels:
        app: otel-collector-agent
    spec:
      serviceAccountName: otel-collector
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:0.96.0
          args: ["--config=/etc/otel/config.yaml"]
          ports:
            - containerPort: 4317   # OTLP gRPC
            - containerPort: 4318   # OTLP HTTP
            - containerPort: 8888   # Collector metrics
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          volumeMounts:
            - name: config
              mountPath: /etc/otel
      volumes:
        - name: config
          configMap:
            name: otel-agent-config
Resource limits matter here. An unbounded Collector will happily consume all available memory during a traffic spike. Set limits, then monitor them — I'll show you how below.
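A reasonable rule of thumb: set limit_mib to roughly 80% of the container memory limit (leaving headroom for the Go runtime) and spike_limit_mib to about 20% of limit_mib, which is the processor's default when unset. A quick sketch of that arithmetic; the helper function is mine, not part of any SDK:

```python
def memory_limiter_settings(container_limit_mib: int) -> dict:
    """Suggest memory_limiter values from a container memory limit.

    Rule of thumb (not gospel): limit_mib at ~80% of the container
    limit to leave headroom for the Go runtime, spike_limit_mib at
    ~20% of limit_mib, matching the processor's default when unset.
    """
    limit_mib = int(container_limit_mib * 0.8)
    spike_limit_mib = int(limit_mib * 0.2)
    return {"limit_mib": limit_mib, "spike_limit_mib": spike_limit_mib}

# For the 512Mi container limit above:
print(memory_limiter_settings(512))  # {'limit_mib': 409, 'spike_limit_mib': 81}
```

That lands close to the limit_mib: 400 / spike_limit_mib: 100 used in the agent config below; measure under real load and adjust.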
The Pipeline Configuration
The Collector's config has four sections: receivers, processors, exporters, and service. Think of it as a DAG — data flows from receivers through processors to exporters.
# otel-agent-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  # Scrape the Collector's own metrics
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-collector
          scrape_interval: 10s
          static_configs:
            - targets: ["localhost:8888"]

processors:
  batch:
    send_batch_size: 1024
    send_batch_max_size: 2048
    timeout: 5s
  memory_limiter:
    check_interval: 1s
    limit_mib: 400
    spike_limit_mib: 100
  resource:
    attributes:
      - key: k8s.cluster.name
        value: production
        action: upsert

exporters:
  otlphttp:
    endpoint: http://otel-gateway.observability:4318
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
  prometheus:
    endpoint: 0.0.0.0:9090
    namespace: otel

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlphttp]
  telemetry:
    metrics:
      address: 0.0.0.0:8888
Two things to notice. First, memory_limiter appears before batch in every pipeline. This is deliberate: you want the Collector to refuse or drop data early, applying backpressure to clients, rather than let the kernel OOM-kill the process. Second, the processors list is the execution order; the Collector runs them sequentially, so order matters.
The Gateway Configuration
The gateway handles heavier processing — tail-based sampling, attribute enrichment, and fan-out to multiple backends.
# otel-gateway-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500
    spike_limit_mib: 512
  batch:
    send_batch_size: 4096
    timeout: 10s
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: errors-always
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 1000
      - name: probabilistic-sample
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
  attributes:
    actions:
      - key: environment
        value: production
        action: upsert
      - key: collector.version
        value: "0.96.0"
        action: insert

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector.observability:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://prometheus.observability:9090/api/v1/write
    resource_to_telemetry_conversion:
      enabled: true
  loki:
    endpoint: http://loki.observability:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, attributes, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [loki]
The tail_sampling processor is the gateway's most valuable feature. It keeps 100% of error traces and slow traces, while sampling only 10% of normal traffic. This gives you full visibility into problems without storing terabytes of healthy traces.
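To estimate the storage impact, here is some back-of-envelope arithmetic with made-up traffic numbers. The policies are OR-combined, so a trace survives if any policy matches; the sketch assumes the error and slow populations don't overlap, which is a simplification:

```python
def retained_traces(total: int, error_rate: float, slow_rate: float,
                    sample_pct: float) -> int:
    """Estimate how many traces tail sampling keeps per interval.

    Errors and slow traces are always kept; the remainder is sampled
    probabilistically. Overlap between errors and slow traces is
    ignored, so this slightly overestimates.
    """
    errors = int(total * error_rate)
    slow = int(total * slow_rate)
    rest = total - errors - slow
    return errors + slow + int(rest * sample_pct / 100)

# Hypothetical hour of traffic: 1M traces, 1% errors, 2% slow, 10% sampling
print(retained_traces(1_000_000, 0.01, 0.02, 10))  # 127000
```

Roughly 13% of traces retained, but 100% of the interesting ones.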
Monitoring the Collector Itself
A Collector that silently drops data is worse than no Collector. Monitor it with the same rigor as any production service.
# Dropped spans — this should be zero under normal conditions
sum(rate(otelcol_exporter_send_failed_spans_total[5m])) by (exporter)
# Queue saturation — approaching 1.0 means you're about to drop data
otelcol_exporter_queue_size / otelcol_exporter_queue_capacity
# Collector memory usage relative to the memory_limiter setting (400 MiB = the agent's limit_mib)
otelcol_process_memory_rss / (400 * 1024 * 1024)
# Receiver accepted vs refused — refused means backpressure is active
sum(rate(otelcol_receiver_accepted_spans_total[5m])) by (receiver)
sum(rate(otelcol_receiver_refused_spans_total[5m])) by (receiver)
Create alerting rules for these. If otelcol_exporter_send_failed_spans_total is increasing, your pipeline has a problem and you're losing observability data — the one time you need it most.
groups:
  - name: otel-collector
    rules:
      - alert: OtelCollectorExportFailures
        expr: |
          sum(rate(otelcol_exporter_send_failed_spans_total[5m])) by (exporter) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "OTel Collector failing to export spans via {{ $labels.exporter }}"
          runbook: "https://wiki.internal/runbooks/otel-collector-export-failure"
      - alert: OtelCollectorQueueSaturation
        expr: |
          otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "OTel Collector export queue above 80% capacity"
Common Mistakes
Running without memory_limiter. The Collector will get OOM-killed during spikes. Always set it, and set it below your container memory limit to leave headroom.
Batching too aggressively. A send_batch_size of 50,000 looks efficient on paper, but it increases latency and memory usage. Start at 1024, measure, and adjust.
Forgetting retry configuration. Network hiccups between the Collector and backends are normal. Without retry_on_failure, those become data loss.
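A resilient exporter block pairs retries with a bounded sending queue; these exporterhelper settings are shared by most exporters, and the numbers below are starting points to measure against, not recommendations:

```yaml
exporters:
  otlphttp:
    endpoint: http://otel-gateway.observability:4318
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s   # give up on a batch after 5 minutes
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000         # batches buffered while the backend is unreachable
```

The queue absorbs short outages; anything longer than max_elapsed_time is dropped, which is exactly what the failed-spans alert above should catch.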
Not versioning your config. Treat Collector configs like application code. Store them in Git, review changes, deploy through CI/CD. A bad config change can blind your entire observability stack.
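As a cheap CI gate, recent Collector binaries ship a validate subcommand that checks a config without starting any pipelines. A sketch of such a check, assuming GitHub Actions; the workflow name and file paths are made up, and the download URL should be adjusted for your version and platform:

```yaml
# .github/workflows/otel-config.yaml (illustrative)
name: validate-otel-config
on:
  pull_request:
    paths: ["otel/**.yaml"]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate Collector configs
        run: |
          curl -sSL -o otelcol.tar.gz \
            https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.96.0/otelcol-contrib_0.96.0_linux_amd64.tar.gz
          tar -xzf otelcol.tar.gz
          ./otelcol-contrib validate --config=otel/agent-config.yaml
          ./otelcol-contrib validate --config=otel/gateway-config.yaml
```

Pin the binary version in CI to the version you run in the cluster, since component config schemas change between releases.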
Kubernetes Auto-Instrumentation
The OpenTelemetry Operator can inject instrumentation into your applications without code changes.
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrumentation
  namespace: observability
spec:
  exporter:
    endpoint: http://otel-collector-agent.observability:4317
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.25"
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:0.44b0
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:2.3.0
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-node:0.49.1
Annotate your deployment to enable auto-instrumentation:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-python: "observability/auto-instrumentation"
    spec:
      containers:
        - name: api
          image: api-service:v1.4.0
The operator injects an init container that loads the instrumentation agent. Your application produces traces, metrics, and (for some languages) logs without any SDK integration. This is the fastest path from zero to full observability.
Resource Attributes and Semantic Conventions
Consistent resource attributes make your telemetry queryable across services.
processors:
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert
      - key: service.namespace
        value: payments
        action: upsert
  k8sattributes:
    auth_type: "serviceAccount"
    passthrough: false
    extract:
      metadata:
        - k8s.pod.name
        - k8s.namespace.name
        - k8s.deployment.name
        - k8s.node.name
      labels:
        - tag_name: app.kubernetes.io/name
          key: app.kubernetes.io/name
        - tag_name: app.kubernetes.io/version
          key: app.kubernetes.io/version
    pod_association:
      - sources:
          - from: resource_attribute
            name: k8s.pod.ip
The k8sattributes processor enriches every span and metric with Kubernetes metadata. When you query Jaeger for slow traces, you see the pod name, node, and deployment — not just a service name.
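The k8sattributes processor talks to the Kubernetes API, so the Collector's service account needs read access to pods, and to replicasets if you want deployment names resolved. A minimal RBAC sketch, reusing the otel-collector service account from the DaemonSet above:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector
rules:
  - apiGroups: [""]
    resources: ["pods", "namespaces", "nodes"]
    verbs: ["get", "watch", "list"]
  - apiGroups: ["apps"]
    resources: ["replicasets"]   # needed to map pod -> deployment
    verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector
subjects:
  - kind: ServiceAccount
    name: otel-collector
    namespace: observability
```

Without these permissions the processor starts but silently fails to enrich, which shows up as spans missing their k8s.* attributes.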
Scaling the Collector
Horizontal Pod Autoscaling for Gateways
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-gateway
  namespace: observability
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-collector-gateway
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: otelcol_exporter_queue_size
        target:
          type: AverageValue
          averageValue: "500"
Scale gateways based on queue depth. When the export queue grows, add more Collector instances to drain it. Note that a Pods-type metric requires a custom metrics adapter (such as prometheus-adapter) to expose Collector metrics through the Kubernetes custom metrics API. Done right, this prevents data loss during traffic spikes.
Load Balancing Between Agents and Gateways
apiVersion: v1
kind: Service
metadata:
  name: otel-gateway
  namespace: observability
spec:
  type: ClusterIP
  selector:
    app: otel-collector-gateway
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
    - name: otlp-http
      port: 4318
      targetPort: 4318
Agents send to the otel-gateway service. For OTLP over HTTP this distributes reasonably well, but gRPC multiplexes everything over a few long-lived connections, and kube-proxy balances only at connection-establishment time, so each agent tends to pin itself to a single gateway pod. For even distribution, use a headless service with gRPC client-side load balancing, or put an L7-aware proxy or service mesh in front of the gateways.
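One workaround for gRPC connection pinning is a headless service: DNS then returns every gateway pod IP, and a gRPC client configured for round-robin (the Collector's OTLP exporter accepts balancer_name: round_robin in its gRPC settings) spreads load across them. A sketch; the headless service name is my own:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: otel-gateway-headless
  namespace: observability
spec:
  clusterIP: None          # headless: DNS resolves to every pod IP
  selector:
    app: otel-collector-gateway
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
```

Agents would then target dns:///otel-gateway-headless.observability:4317. Keep in mind pod churn changes the DNS answer, so clients must re-resolve; gRPC's round_robin balancer handles this.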
Troubleshooting
Problem: Traces appear in Jaeger but metrics don't show in Prometheus.
Fix: Check that your metrics pipeline is configured. Traces and metrics are separate pipelines in the Collector. A working trace pipeline doesn't mean metrics are flowing.
Problem: High memory usage on agents.
Fix: Reduce send_batch_size and lower the memory_limiter threshold. Also check if tail_sampling is running on agents — it should only run on gateways because it buffers entire traces.
Problem: Duplicate spans in Jaeger.
Fix: Ensure only one Collector pipeline sends to Jaeger. If both agents and gateways export to Jaeger, you get duplicates. Agents should export to the gateway, and only the gateway exports to backends.
Problem: Collector pods restart with OOMKilled.
Fix: The memory_limiter processor must be set below the container memory limit. If your container limit is 512Mi, set limit_mib: 400 and spike_limit_mib: 80. The processor drops data gracefully instead of crashing.
Problem: Missing traces for short-lived services (CronJobs, Lambdas).
Fix: Short-lived processes can exit before the SDK's batch span processor flushes its buffer. Lower the batch delay (OTEL_BSP_SCHEDULE_DELAY=1000, in milliseconds; note this shortens the flush interval rather than making export synchronous) and, most importantly, call force_flush and shutdown on the tracer provider before the process exits.
Start Simple, Then Grow
Deploy agents first with OTLP receivers and a single exporter. Get data flowing. Then add a gateway when you need sampling or multi-backend routing. Add processors one at a time and validate that data still arrives correctly.
The Collector is infrastructure. Treat it like infrastructure — monitor it, alert on it, and give it the resources it needs to handle your worst day, not just your average one.