Kubernetes Observability Without the Complexity
Kubernetes breaks traditional monitoring approaches. Ephemeral pods, high-cardinality dynamic naming, and massive metric volumes demand new patterns. Here is what works.
Why Kubernetes breaks traditional monitoring
The fundamental assumptions of infrastructure monitoring do not hold in Kubernetes:
- Ephemeral containers: Pods spin up and down constantly. Data disappears with them.
- Dynamic naming: Pod names include unique hashes. Every deployment creates new identifiers.
- High-cardinality explosion: 30,000 pods times 45 metrics each is 1.35 million active series; at a 10-second scrape interval, that is 135,000 samples per second flowing into storage.
- Multi-layer complexity: You need visibility into pods, nodes, services, ingress, and the control plane simultaneously.
Traditional per-host billing models compound the problem. A 50-node cluster with 500 pods should not cost 10x more to monitor than the same workload on 50 VMs.
OpenTelemetry Collector deployment patterns
The OpenTelemetry Collector can run in three patterns, each with different tradeoffs.
| Pattern | K8s Mode | Use Case | Typical Memory |
|---|---|---|---|
| Agent | DaemonSet | Node-level metrics/logs | ~400Mi |
| Sidecar | Per-pod | High-fidelity app context | ~128Mi per pod |
| Gateway | Deployment | Central sampling/routing | ~1Gi |
DaemonSet: the standard approach
Most deployments start here. One collector per node handles all telemetry from pods on that node. Benefits include:
- Consistent resource footprint regardless of pod count
- Automatic collection from new pods without configuration
- Node-level metrics (disk, network, kernel) included
The downside: you lose some application context. The collector sees all traffic but may not correlate it perfectly with specific pod behavior.
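As a concrete starting point, here is a minimal sketch of a DaemonSet-mode collector using the OTel Operator's OpenTelemetryCollector resource. The resource name, backend endpoint, and pipeline are illustrative assumptions, and the k8sattributes processor assumes a contrib-based collector image:

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: node-agent            # hypothetical name
spec:
  mode: daemonset             # one collector pod per node
  config:
    receivers:
      otlp:
        protocols:
          grpc: {}
    processors:
      k8sattributes: {}       # enrich telemetry with pod/namespace/node metadata
      batch: {}
    exporters:
      otlphttp:
        endpoint: http://gateway-collector:4318   # placeholder backend
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [k8sattributes, batch]
          exporters: [otlphttp]
```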
Sidecar: maximum fidelity
When you need rich application-level context, deploy the collector as a sidecar. The OTel Operator can inject sidecars automatically via pod annotations:
```yaml
metadata:
  annotations:
    sidecar.opentelemetry.io/inject: "true"
```
The cost is ~128Mi per pod. For a cluster with 1,000 pods, that is roughly 128GB of memory just for collectors. Calculate carefully.
Gateway: central processing
A gateway deployment handles centralized processing: tail-based sampling decisions, routing to multiple backends, or transformation. This is where you run expensive operations once instead of on every node.
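For illustration, a gateway pipeline could centralize tail-based sampling with the contrib tail_sampling processor. The specific policies below are assumptions, not a recommendation:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s        # buffer each trace before deciding
    policies:
      - name: keep-errors     # always keep failed requests
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow       # always keep slow requests
        type: latency
        latency:
          threshold_ms: 500
      - name: sample-rest     # keep 10% of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

Running this once at the gateway means sampling decisions see the whole trace, which per-node agents cannot do.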
OTel Operator capabilities
The OpenTelemetry Operator for Kubernetes provides features that reduce operational complexity:
- Automatic sidecar injection: Add annotations, get collectors
- Auto-instrumentation injection: Inject the OTel SDK for Java, Python, Node.js, .NET, and Go without code changes (see the sketch after this list)
- Target Allocator: Distributes Prometheus scrape targets across collectors to prevent single-node bottlenecks
- CRD-based configuration: Manage collector config as Kubernetes resources
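As a sketch, the auto-instrumentation flow pairs an Instrumentation resource, which tells the operator where injected SDKs should export, with a per-workload annotation. The name and endpoint here are placeholders:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation     # hypothetical name
spec:
  exporter:
    endpoint: http://node-agent-collector:4317   # placeholder collector endpoint
```

Workloads then opt in with a pod annotation such as `instrumentation.opentelemetry.io/inject-java: "true"` (or `inject-python`, `inject-nodejs`, and so on), and the operator injects the SDK at pod creation.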
eBPF vs instrumentation: the tradeoffs
eBPF-based observability runs in the Linux kernel, observing system calls without modifying application code. Traditional instrumentation runs inside your application.
| Aspect | eBPF | Instrumentation (OTel) |
|---|---|---|
| Code Changes | None | Required |
| Visibility | Kernel/syscall level | Application level |
| Application Context | Limited | Rich (business logic) |
| Kernel Requirements | Linux 4.14+ (5.x better) | None |
| CPU Overhead | ~1-5% | 5-15% typical |
eBPF vendors
Several vendors have built observability on eBPF:
- groundcover: Flora Sensor, BYOC deployment, claims lowest overhead
- Pixie: Open-source eBPF platform, acquired by New Relic
- Cilium/Hubble: Network observability built into the Cilium CNI
When to use each
eBPF excels at infrastructure visibility: network flows, syscall patterns, resource consumption. Instrumentation excels at application visibility: business transactions, user journeys, custom metrics.
Many production deployments use both: eBPF for the infrastructure baseline, OTel instrumentation for application traces.
Agent resource overhead benchmarks
groundcover published benchmarks comparing agent CPU overhead at a 3,000 req/s baseline:
| Agent | CPU Overhead | Notes |
|---|---|---|
| groundcover (eBPF) | +9% | Kernel-level, no code changes |
| Pixie | +32% | eBPF-based, open-source |
| OpenTelemetry | +59% | Application instrumentation |
| Datadog | +249% | Full agent stack |
At scale, these differences compound. A 100-node cluster running Datadog agents consumes significantly more CPU than the same cluster with eBPF-based collection.
Resource requests by agent
| Agent | CPU Request | Memory Request |
|---|---|---|
| Datadog Agent | 200m | 256Mi |
| Datadog Trace Agent | 100m | 200Mi |
| Dynatrace OneAgent | 100m | 512Mi |
| OTel Collector (DaemonSet) | 200m | 400Mi |
| groundcover (eBPF) | ~50m | ~100Mi |
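To hold a collector to numbers like these, pin them in the spec. A minimal sketch against the OpenTelemetryCollector resource from earlier; the memory limit is an assumption:

```yaml
spec:
  mode: daemonset
  resources:            # applied to the collector container
    requests:
      cpu: 200m
      memory: 400Mi
    limits:
      memory: 800Mi     # assumption; size to your telemetry volume
```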
Cardinality management
Kubernetes generates high-cardinality metrics by default. Pod names, container IDs, commit SHAs, and build IDs create unique label combinations that explode storage costs.
Strategies to manage cardinality:
- Drop high-cardinality labels: Pod name hashes rarely help debugging; consider dropping them (see the processor sketch after this list).
- Aggregate at collection: Use OTel Collector processors to pre-aggregate before storage.
- Use recording rules: Pre-compute common queries to avoid expensive aggregations at query time.
- Choose your dimensions: Service, namespace, and node are usually sufficient. Pod-level granularity is rarely needed in metrics.
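As one illustration of the first strategy, the contrib resource processor can delete labels at the collector before export. The attribute keys below assume standard k8sattributes conventions:

```yaml
processors:
  resource/drop-high-cardinality:
    attributes:
      - key: k8s.pod.name     # unique per replica; rarely useful in metrics
        action: delete
      - key: container.id     # unique per container instance
        action: delete
```

Add the processor to each pipeline that exports metrics, and the unique-per-replica labels never reach storage.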
The Sampleless approach
Sampleless takes a different path for Kubernetes observability:
- BYOC deployment: The collector runs in your cluster, eliminating egress costs regardless of data volume.
- No sampling required: When egress is not a cost constraint, you can collect everything.
- Flat pricing: 50 nodes or 500 nodes, the cost is based on cloud accounts, not hosts.
- OpenTelemetry native: Standard instrumentation, no proprietary agents with high overhead.
Kubernetes observability should not require choosing between visibility and cost. With the right architecture, you get both.
Frequently asked questions
Should I use DaemonSet or sidecar for OpenTelemetry Collector?
DaemonSet is the standard choice for most deployments. It runs one collector per node (~400Mi memory) and handles node-level metrics and logs. Use sidecars (~128Mi per pod) only when you need high-fidelity application context or strict resource isolation between services.
Is eBPF-based observability better than instrumentation?
eBPF offers lower overhead (~1-5% CPU vs 5-15% for instrumentation) and requires no code changes, but provides limited application-level context. Instrumentation gives rich business logic visibility. Many production deployments use both: eBPF for infrastructure, OTel for application traces.
How much overhead does the Datadog agent add to Kubernetes?
groundcover's benchmarks show the Datadog agent adding +249% CPU overhead at a 3,000 req/s baseline. By comparison, OpenTelemetry instrumentation adds +59%, Pixie +32%, and groundcover's eBPF sensor +9%. These numbers matter at scale.
Kubernetes observability without the overhead
See how BYOC architecture makes full-fidelity collection economical for K8s.