Why Sampling Creates Observability Blind Spots
At 1% sampling with a 0.1% error rate, you need 230,000 requests to achieve 90% confidence of capturing at least one error. Here is the math, the tradeoffs, and why we built Sampleless to avoid sampling entirely.
The mathematics of missed events
The probability of detecting at least one occurrence of an event follows the formula:

P(at least one detection) = 1 − (1 − p × s)^n

where p is the event's base rate, s is the sampling rate, and n is the number of requests.

Let's apply this to a real scenario: a 0.1% error rate (1 in 1,000 requests fail) with 1% sampling.
| Requests | Errors occurring (at 0.1%) | Expected sampled errors | P(sampling at least 1) |
|---|---|---|---|
| 23,000 | 23 | 0.23 | ~21% |
| 100,000 | 100 | 1.0 | ~63% |
| 230,000 | 230 | 2.3 | ~90% |
| 460,000 | 460 | 4.6 | ~99% |
At 1% sampling, you need approximately 230,000 requests to achieve 90% confidence of capturing at least one 0.1% error.
For a service handling 100 requests per second, that is 38 minutes of traffic before you have reasonable confidence of seeing an error that is already affecting 1 in 1,000 users.
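A few lines of Python reproduce the table directly from the formula:

```python
# P(at least one error is sampled) = 1 - (1 - p*s)^n
# p = error rate, s = sampling rate, n = total requests.
def p_detect(n: int, p: float = 0.001, s: float = 0.01) -> float:
    return 1 - (1 - p * s) ** n

for n in (23_000, 100_000, 230_000, 460_000):
    print(f"{n:>7,} requests -> {p_detect(n):.0%}")
#  23,000 requests -> 21%
# 100,000 requests -> 63%
# 230,000 requests -> 90%
# 460,000 requests -> 99%
```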
Sampling techniques explained
Head-based sampling
Decision made at trace start using deterministic trace ID hashing.
- Pros: Simple, efficient, guarantees complete traces
- Cons: "Blind" to trace content. Cannot prioritize errors or high latency.
Most agents default to 10% or 10 traces/second head-based sampling.
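For intuition, here is a minimal sketch of that hashing decision, modeled loosely on OpenTelemetry's TraceIdRatioBased sampler (the 63-bit detail is an implementation assumption):

```python
# Minimal sketch of head-based sampling via deterministic trace-ID hashing.
import random

SAMPLE_RATE = 0.10                    # keep 10% of traces
BOUND = int(SAMPLE_RATE * (1 << 63))  # threshold over a 63-bit ID space

def should_sample(trace_id: int) -> bool:
    # Every service derives the same verdict from the trace ID alone,
    # so traces are kept or dropped whole -- but at this point nothing
    # is known yet about errors or latency.
    return (trace_id & ((1 << 63) - 1)) < BOUND

trace_id = random.getrandbits(128)    # assigned once, at trace start
print(should_sample(trace_id))        # True for ~10% of trace IDs
```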
Tail-based sampling
Decision deferred until trace completes. Evaluates full context before deciding.
- Pros: Can always sample errors and high-latency traces
- Cons: Requires stateful infrastructure, buffering, and routing by trace ID
Tail-based sampling is difficult to operate at scale. High-volume services may need dozens or hundreds of compute nodes just for sampling decisions.
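A minimal sketch of the decision logic shows where the state comes from (the Span type and thresholds here are illustrative; they mirror the collector policies configured later in this post):

```python
# Minimal sketch of a tail-based decision: evaluate the complete trace,
# which means buffering every span for that trace first.
import random
from dataclasses import dataclass

@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool

def keep_trace(spans: list[Span], base_rate: float = 0.05) -> bool:
    if any(s.is_error for s in spans):            # always keep errors
        return True
    if max(s.duration_ms for s in spans) > 200:   # always keep slow traces
        return True
    return random.random() < base_rate            # 5% of everything else

# The hard part at scale: spans for one trace land on different collector
# nodes, so they must be routed by trace ID into a shared buffer and held
# in memory until the trace completes.
trace = [Span("abc", 35.0, False), Span("abc", 240.0, False)]
print(keep_trace(trace))  # True -- exceeds the 200 ms latency policy
```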
Adaptive/dynamic sampling (Google Dapper approach)
Adjusts rate based on system load:
- High-load services: 0.01% (1 in 10,000)
- Moderate services: ~0.1%
- Errors and rare endpoints: "Dynamically cranks sampling to 100%"
The Google Dapper paper notes: "For high-throughput services, aggressive sampling does not hinder most important analyses." But Google's scale is unusual. Most companies are not processing enough traffic for statistical sufficiency at 0.01% sampling.
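A hedged sketch of the idea, using the rate tiers from the list above; the throughput breakpoints are assumptions for demonstration:

```python
# Illustrative adaptive sampling: pick the rate from recent throughput,
# and force errors to 100%.
def adaptive_rate(requests_per_sec: float, is_error: bool) -> float:
    if is_error:
        return 1.0        # errors: crank sampling to 100%
    if requests_per_sec > 10_000:
        return 0.0001     # high-load services: 1 in 10,000
    if requests_per_sec > 100:
        return 0.001      # moderate services: ~0.1%
    return 1.0            # low-traffic endpoints can afford everything

print(adaptive_rate(50_000, is_error=False))  # 0.0001
print(adaptive_rate(50_000, is_error=True))   # 1.0
```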
Industry sampling statistics
| Traffic Level | Typical Rate | Notes |
|---|---|---|
| Development | 100% | Full visibility, low volume |
| Low volume (<100 req/s) | 25-100% | Can often afford full traces |
| Medium volume | 10-20% | Balance cost and visibility |
| High volume (>1000 req/s) | 1-5% | Cost-driven |
| Ultra-high (Google scale) | 0.01-0.1% | Statistical sufficiency at volume |
Alibaba's 2025 research reveals the scale of modern tracing challenges: they generate 18.6-20.5 PB of trace data per day. Even with aggressive tail-based sampling, query miss rates for normal traces can reach 27.17%.
Impact on ML and anomaly detection
This is where sampling causes the most insidious problems.
ML models require representative training data to establish accurate behavioral baselines. Datadog's Watchdog requires a minimum of 2 weeks of historical data to train metric baselines. Netdata's ML trains on 6 hours of data and retrains every 3 hours.
If you sample during the baseline period, the ML model learns from an incomplete picture.
Consider what sampling misses:
- Rare but important error classes whose occurrence rate falls below your sampling rate
- Latency spikes that happen to not get sampled
- Entire user segments with low traffic
- Intermittent failures that only affect certain request patterns
The model cannot learn what it never sees. Anomaly detection trained on sampled data will have blind spots that mirror your sampling gaps.
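A toy simulation makes the blind spot concrete. The traffic volume below matches the earlier 100 req/s example over a 6-hour training window; the 1-in-100,000 failure mode is an assumption:

```python
# Toy illustration of a baseline blind spot under 1% sampling.
import random

requests = 100 * 60 * 60 * 6                # 100 req/s over a 6-hour window
rare_failures = round(requests * 0.00001)   # ~22 occurrences of a rare class
sampled = sum(random.random() < 0.01 for _ in range(rare_failures))
print(f"{rare_failures} failures occurred, {sampled} reached the training set")
# At 1% sampling, the most likely outcome is 0: the baseline never
# learns that this failure mode exists.
```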
OpenTelemetry sampling configuration
If you must sample, here is how to configure it properly in OpenTelemetry.
TraceIdRatioBased with ParentBased
```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# 10% sampling: the root span makes the decision; children inherit it
sampler = ParentBased(root=TraceIdRatioBased(0.1))
provider = TracerProvider(sampler=sampler)
```

Environment variables
```bash
export OTEL_TRACES_SAMPLER="parentbased_traceidratio"
export OTEL_TRACES_SAMPLER_ARG="0.1"
```

OTel Collector tail sampling
```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: error-policy
        type: status_code
        status_code: {status_codes: ["ERROR"]}
      - name: latency-policy
        type: latency
        latency: {threshold_ms: 200}
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
```

This configuration always keeps errors and high-latency traces while sampling 5% of everything else. Better than pure head-based sampling, but still misses events that do not match your policies.
The alternative: eliminate sampling entirely
Sampling exists because of economics. SaaS observability vendors charge per GB ingested, and cloud egress adds at least $0.135/GB on top. At scale, full-fidelity collection becomes prohibitively expensive.
BYOC architecture changes the equation:
- Data stays in your cloud. Zero egress costs.
- No per-GB charges from the observability vendor.
- Full-fidelity collection becomes economically viable.
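A back-of-the-envelope sketch using the $0.135/GB figure above (the daily telemetry volume is an assumption for illustration):

```python
# Monthly egress cost at full fidelity, before any per-GB vendor charges.
gb_per_day = 500        # assumed telemetry volume
egress_per_gb = 0.135   # the egress figure cited above
print(f"${gb_per_day * 30 * egress_per_gb:,.0f}/month in egress alone")
# $2,025/month -- under BYOC this line item is zero, because the data
# never leaves your cloud.
```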
Sampleless collects 100% of your telemetry because BYOC makes it cost-effective. Complete data means:
- ML baselines trained on representative data
- The trace you need is always there
- No blind spots from sampling gaps
- Accurate anomaly detection without data bias
Frequently asked questions
What sampling rate should I use?
There is no universal answer. Higher traffic requires lower rates to control costs: low volume (<100 req/s) can often use 25-100%, medium volume typically uses 10-20%, high volume (>1000 req/s) often drops to 1-5%. But every reduction increases the risk of missing important events.
Does tail-based sampling solve the problem?
Tail-based sampling helps by always capturing errors and high-latency traces, but it requires stateful infrastructure, buffering, and routing by trace ID. It is difficult to operate at scale and still misses events that do not match your defined policies.
How does sampling affect ML-based anomaly detection?
ML models require representative training data. If you sample during the baseline period, the model learns from an incomplete picture and may miss entire classes of normal or anomalous behavior. Datadog Watchdog requires 2 weeks of data; if that data is sampled, baselines are biased.
Conclusion
Sampling is a necessary compromise when economics force the choice between visibility and cost. But every reduction in sampling rate increases the probability of missing the exact event you need to debug a production issue.
The question is not whether to sample 1% or 5%. The question is whether you can afford to miss 99% or 95% of your data.
If you cannot, BYOC architecture makes full-fidelity collection economically viable. Sampleless collects everything because we believe observability should not require gambling on which data to keep.
Stop gambling on which data to keep
See how Sampleless collects 100% of your telemetry with predictable pricing.