Kubernetes Observability Without the Complexity
Kubernetes breaks traditional monitoring approaches. Ephemeral pods, high-cardinality dynamic naming, and massive metric volumes demand new patterns. Here is what works.
Why Kubernetes breaks traditional monitoring
The fundamental assumptions of infrastructure monitoring do not hold in Kubernetes:
- Ephemeral containers: Pods spin up and down constantly. Data disappears with them.
- Dynamic naming: Pod names include unique hashes. Every deployment creates new identifiers.
- High-cardinality explosion: 30,000 pods times 45 metrics each is 1.35 million active series; at a 10-second scrape interval, that is 135,000 samples per second flowing into storage.
- Multi-layer complexity: You need visibility into pods, nodes, services, ingress, and the control plane simultaneously.
Traditional per-host billing models compound the problem. A 50-node cluster with 500 pods should not cost 10x more to monitor than the same workload on 50 VMs.
OpenTelemetry Collector deployment patterns
The OpenTelemetry Collector can run in three patterns, each with different tradeoffs.
| Pattern | K8s Mode | Use Case | Typical Memory |
|---|---|---|---|
| Agent | DaemonSet | Node-level metrics/logs | ~400Mi |
| Sidecar | Per-pod | High-fidelity app context | ~128Mi per pod |
| Gateway | Deployment | Central sampling/routing | ~1Gi |
DaemonSet: the standard approach
Most deployments start here. One collector per node handles all telemetry from pods on that node. Benefits include:
- Consistent resource footprint regardless of pod count
- Automatic collection from new pods without configuration
- Node-level metrics (disk, network, kernel) included
The downside: you lose some application context. The collector sees all traffic but may not correlate it perfectly with specific pod behavior.
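As a concrete starting point, here is a minimal sketch of a DaemonSet-mode collector using the OTel Operator's OpenTelemetryCollector resource. The resource name, backend endpoint, and pipeline are illustrative assumptions, and the k8sattributes processor assumes a contrib-based collector image:

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: node-agent            # hypothetical name
spec:
  mode: daemonset             # one collector pod per node
  config:
    receivers:
      otlp:
        protocols:
          grpc: {}
    processors:
      k8sattributes: {}       # enrich telemetry with pod/namespace/node metadata
      batch: {}
    exporters:
      otlphttp:
        endpoint: http://gateway-collector:4318   # placeholder backend
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [k8sattributes, batch]
          exporters: [otlphttp]
```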
Sidecar: maximum fidelity
When you need rich application-level context, deploy the collector as a sidecar. The OTel Operator can inject sidecars automatically via pod annotations:
```yaml
metadata:
  annotations:
    sidecar.opentelemetry.io/inject: "true"
```
The cost is ~128Mi per pod. For a cluster with 1,000 pods, that is roughly 128GB of memory just for collectors. Calculate carefully.
Gateway: central processing
A gateway deployment handles centralized processing: tail-based sampling decisions, routing to multiple backends, or transformation. This is where you run expensive operations once instead of on every node.
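For illustration, a gateway pipeline could centralize tail-based sampling with the contrib tail_sampling processor. The specific policies below are assumptions, not a recommendation:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s        # buffer each trace before deciding
    policies:
      - name: keep-errors     # always keep failed requests
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow       # always keep slow requests
        type: latency
        latency:
          threshold_ms: 500
      - name: sample-rest     # keep 10% of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

Running this once at the gateway means sampling decisions see the whole trace, which per-node agents cannot do.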
OTel Operator capabilities
The OpenTelemetry Operator for Kubernetes provides features that reduce operational complexity:
- Automatic sidecar injection: Add annotations, get collectors
- Auto-instrumentation injection: Inject the OTel SDK for Java, Python, Node.js, .NET, and Go without code changes (see the sketch after this list)
- Target Allocator: Distributes Prometheus scrape targets across collectors to prevent single-node bottlenecks
- CRD-based configuration: Manage collector config as Kubernetes resources
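As a sketch, the auto-instrumentation flow pairs an Instrumentation resource, which tells the operator where injected SDKs should export, with a per-workload annotation. The name and endpoint here are placeholders:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation     # hypothetical name
spec:
  exporter:
    endpoint: http://node-agent-collector:4317   # placeholder collector endpoint
```

Workloads then opt in with a pod annotation such as `instrumentation.opentelemetry.io/inject-java: "true"` (or `inject-python`, `inject-nodejs`, and so on), and the operator injects the SDK at pod creation.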
eBPF vs instrumentation: the tradeoffs
eBPF-based observability runs in the Linux kernel, observing system calls without modifying application code. Traditional instrumentation runs inside your application.
| Aspect | eBPF | Instrumentation (OTel) |
|---|---|---|
| Code Changes | None | Required |
| Visibility | Kernel/syscall level | Application level |
| Application Context | Limited | Rich (business logic) |
| Kernel Requirements | Linux 4.14+ (5.x better) | None |
| CPU Overhead | ~1-5% | 5-15% typical |
eBPF vendors
Several vendors have built observability on eBPF:
- groundcover: Flora Sensor, BYOC deployment, claims lowest overhead
- Pixie: Open-source eBPF platform, acquired by New Relic
- Cilium/Hubble: Network observability built into the Cilium CNI
When to use each
eBPF excels at infrastructure visibility: network flows, syscall patterns, resource consumption. Instrumentation excels at application visibility: business transactions, user journeys, custom metrics.
Many production deployments use both: eBPF for the infrastructure baseline, OTel instrumentation for application traces.
Agent resource overhead benchmarks
groundcover published benchmarks comparing agent CPU overhead at a 3,000 req/s baseline:
| Agent | CPU Overhead | Notes |
|---|---|---|
| groundcover (eBPF) | +9% | Kernel-level, no code changes |
| Pixie | +32% | eBPF-based, open-source |
| OpenTelemetry | +59% | Application instrumentation |
| Datadog | +249% | Full agent stack |
At scale, these differences compound. A 100-node cluster running Datadog agents consumes significantly more CPU than the same cluster with eBPF-based collection.
Resource requests by agent
| Agent | CPU Request | Memory Request |
|---|---|---|
| Datadog Agent | 200m | 256Mi |
| Datadog Trace Agent | 100m | 200Mi |
| Dynatrace OneAgent | 100m | 512Mi |
| OTel Collector (DaemonSet) | 200m | 400Mi |
| groundcover (eBPF) | ~50m | ~100Mi |
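To hold a collector to numbers like these, pin them in the spec. A minimal sketch against the OpenTelemetryCollector resource from earlier; the memory limit is an assumption:

```yaml
spec:
  mode: daemonset
  resources:            # applied to the collector container
    requests:
      cpu: 200m
      memory: 400Mi
    limits:
      memory: 800Mi     # assumption; size to your telemetry volume
```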
Cardinality management
Kubernetes generates high-cardinality metrics by default. Pod names, container IDs, commit SHAs, and build IDs create unique label combinations that explode storage costs.
Strategies to manage cardinality:
- Drop high-cardinality labels: Pod name hashes rarely help debugging; consider dropping them (see the processor sketch after this list).
- Aggregate at collection: Use OTel Collector processors to pre-aggregate before storage.
- Use recording rules: Pre-compute common queries to avoid expensive aggregations at query time.
- Choose your dimensions: Service, namespace, and node are usually sufficient. Pod-level granularity is rarely needed in metrics.
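As one illustration of the first strategy, the contrib resource processor can delete labels at the collector before export. The attribute keys below assume standard k8sattributes conventions:

```yaml
processors:
  resource/drop-high-cardinality:
    attributes:
      - key: k8s.pod.name     # unique per replica; rarely useful in metrics
        action: delete
      - key: container.id     # unique per container instance
        action: delete
```

Add the processor to each pipeline that exports metrics, and the unique-per-replica labels never reach storage.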
The Sampleless approach
Sampleless takes a different path for Kubernetes observability:
- BYOC deployment: The collector runs in your cluster, eliminating egress costs regardless of data volume.
- No sampling required: When egress is not a cost constraint, you can collect everything.
- Flat pricing: 50 nodes or 500 nodes, the cost is based on cloud accounts, not hosts.
- OpenTelemetry native: Standard instrumentation, no proprietary agents with high overhead.
Kubernetes observability should not require choosing between visibility and cost. With the right architecture, you get both.
Frequently asked questions
Should I use DaemonSet or sidecar for OpenTelemetry Collector?
DaemonSet is the standard choice for most deployments. It runs one collector per node (~400Mi memory) and handles node-level metrics and logs. Use sidecars (~128Mi per pod) only when you need high-fidelity application context or strict resource isolation between services.
Is eBPF-based observability better than instrumentation?
eBPF offers lower overhead (~1-5% CPU vs 5-15% for instrumentation) and requires no code changes, but provides limited application-level context. Instrumentation gives rich business logic visibility. Many production deployments use both: eBPF for infrastructure, OTel for application traces.
How much overhead does the Datadog agent add to Kubernetes?
groundcover's benchmarks show the Datadog agent adding +249% CPU overhead at a 3,000 req/s baseline. By comparison, OpenTelemetry instrumentation adds +59%, Pixie +32%, and groundcover's eBPF sensor +9%. These numbers matter at scale.
Kubernetes observability without the overhead
See how BYOC architecture makes full-fidelity collection economical for K8s.