Detecting the Invisible: How Big Tech Finds Anomalies in Billions of Data Points

January 28, 2026

At 3:47 AM on a Tuesday, a single metric at Uber shifted by 0.3%.

Among the millions of metrics flowing through their monitoring systems every second — CPU utilization, request latency, error rates, queue depths, cache hit ratios — one metric on one service in one datacenter moved slightly outside its normal range.

Twelve minutes later, that 0.3% shift cascaded into a full regional outage affecting ride matching.

The question isn't whether anomalies will happen. It's whether you'll detect them before your users do.

Why Traditional Alerting Fails at Scale

Most teams start with threshold-based alerts:

```
IF cpu_usage > 80% THEN alert
IF error_rate > 5% THEN alert
IF latency_p99 > 500ms THEN alert
```

This works for ten services. At 1,000 services with 100 metrics each, you have 100,000 potential alerts. Setting and maintaining thresholds for each one is impossible. And static thresholds ignore context: traffic that is normal at noon is an anomaly at 3 AM, weekends look nothing like weekdays, and baselines drift as the business grows.

Big tech companies solved this with automated anomaly detection — systems that learn what "normal" looks like and alert when reality diverges from the model.

Uber's Argos: Real-Time Root Cause Analysis

Uber built Argos, a system that doesn't just detect anomalies — it traces them to their root cause automatically.

The Architecture

```
Metrics Collection (millions/sec)
        │
        ▼
┌───────────────────┐
│  Time Series DB   │  ← Store all metrics
└───────┬───────────┘
        │
        ▼
┌───────────────────┐
│  Anomaly Detector │  ← Statistical models per metric
└───────┬───────────┘
        │ (anomalies detected)
        ▼
┌───────────────────┐
│  Correlation      │  ← Which anomalies are related?
│  Engine           │
└───────┬───────────┘
        │
        ▼
┌───────────────────┐
│  Root Cause       │  ← Dependency graph + temporal analysis
│  Analyzer         │
└───────────────────┘
```

How It Works

Step 1: Learn Normal. For every metric, Argos builds a model of expected behavior that accounts for time-of-day and day-of-week seasonality, longer-term trends, and the metric's natural variance.

Step 2: Detect Deviations. When a metric falls outside its expected range (accounting for natural variance), it's flagged as an anomaly.

Step 3: Correlate. When 50 anomalies fire simultaneously, they're usually related. Argos uses the service dependency graph to cluster related anomalies.

Step 4: Root Cause. Using temporal ordering (which anomaly appeared first?) and dependency analysis (which service do the affected services depend on?), Argos identifies the probable root cause.

Instead of an on-call engineer seeing 50 alerts and spending 30 minutes figuring out what went wrong, they see: "Database connection pool exhaustion on Service X caused cascading latency increases in 12 downstream services."
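Argos's internals aren't public, but the core idea behind Steps 3 and 4 — walk the dependency graph and break ties by who fired first — can be sketched in a few lines. The function name, input shapes, and service names below are all hypothetical:

```python
def probable_root_cause(anomalies, depends_on):
    """Pick the most likely root cause from a batch of correlated anomalies.

    anomalies:  list of (service, detected_at) tuples
    depends_on: dict mapping each service to the set of services it calls
    (Hypothetical interfaces; Argos's real ones are not public.)
    """
    anomalous = {svc for svc, _ in anomalies}
    first_seen = {}
    for svc, t in anomalies:
        first_seen[svc] = min(t, first_seen.get(svc, t))

    # A service is a root-cause candidate if none of the services it
    # depends on are themselves anomalous: the cascade stops there.
    candidates = [
        svc for svc in anomalous
        if not (depends_on.get(svc, set()) & anomalous)
    ]
    # Temporal tie-break: the earliest anomaly among the candidates wins.
    return min(candidates, key=lambda svc: first_seen[svc]) if candidates else None

deps = {"matching": {"api"}, "api": {"db"}, "db": set()}
alerts = [("matching", 12), ("api", 8), ("db", 3)]
print(probable_root_cause(alerts, deps))  # "db": earliest anomaly with no anomalous dependencies
```

Dependency pruning alone usually narrows 50 alerts to one or two candidates; the timestamp ordering resolves what's left.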

LinkedIn's Isolation Forests

LinkedIn uses a fascinating algorithm called Isolation Forest for anomaly detection. The intuition is elegant:

Normal data points are hard to isolate. Anomalies are easy to isolate.

Here's how it works:

```
Given a dataset, build a random tree:
1. Pick a random feature (e.g., latency)
2. Pick a random split value
3. Partition data into left (< value) and right (>= value)
4. Repeat recursively

Anomalies end up in their own partition quickly (short path length)
Normal points require many splits to isolate (long path length)
```

Visually:

```
Response time samples (one outlier among normal values):
[10ms, 12ms, 11ms, 13ms, 10ms, 500ms, 11ms, 12ms]

Isolation tree:
  Split at 100ms?
  ├── Yes: [500ms] ← Isolated in ONE split → ANOMALY
  └── No: [10, 12, 11, 13, 10, 11, 12]
           Split at 11.5ms?
           ├── Yes: [12, 13, 12]
           │        Split at 12.5ms?
           │        ... (many more splits to isolate any single point)
           └── No: [10, 11, 10, 11]
                   ... (many more splits)
```

The 500ms response time was isolated in a single split. The normal values take many splits. The path length in the isolation tree becomes the anomaly score.
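This intuition maps directly onto scikit-learn's `IsolationForest`. A minimal sketch on the sample values above, assuming scikit-learn is available (the `contamination` setting is illustrative, not LinkedIn's configuration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Response times from the example: mostly 10-13ms, one 500ms outlier.
latencies = np.array([10, 12, 11, 13, 10, 500, 11, 12], dtype=float).reshape(-1, 1)

forest = IsolationForest(n_estimators=100, contamination=1 / 8, random_state=42)
labels = forest.fit_predict(latencies)  # -1 = anomaly, 1 = normal

# score_samples: lower (more negative) means a shorter average path length,
# i.e. easier to isolate, i.e. more anomalous.
scores = forest.score_samples(latencies)
print(labels)
```

The 500ms point gets the lowest score and the only `-1` label; everything else is classified as normal.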

Why LinkedIn chose this: it runs in linear time, needs no labeled training data, handles high-dimensional metrics, and makes no assumptions about the underlying data distribution.

Microsoft's Anomaly Detection Service

Microsoft took a different approach. Instead of building per-team solutions, they created a centralized Anomaly Detection service used across Azure, Office 365, and Bing.

Their key innovation: Spectral Residual (SR) algorithm combined with Convolutional Neural Networks.

Spectral Residual: Finding What's Unusual in Frequency Space

The SR algorithm comes from computer vision (originally used to detect salient objects in images). Microsoft adapted it for time series:

```
1. Transform the time series to the frequency domain (FFT)
2. Compute the "spectral residual" — the difference between
   the log spectrum and its smoothed version
3. Transform back to the time domain
4. Points with high spectral residual are anomalies
```

Why this works: Regular patterns (daily cycles, weekly patterns) show up as dominant frequencies. When you subtract these regular patterns, what remains is the anomalous signal.

This is computationally efficient and handles seasonal patterns without explicit modeling.
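The four steps above can be sketched with plain NumPy. This is a simplified SR transform, not Microsoft's production implementation; the smoothing window and the test series are illustrative:

```python
import numpy as np

def spectral_residual_saliency(series, window=3):
    """Minimal sketch of the Spectral Residual transform (SR only, no CNN)."""
    fft = np.fft.fft(series)
    amplitude = np.abs(fft)
    phase = np.angle(fft)

    # Spectral residual = log spectrum minus its local average.
    log_amp = np.log(amplitude + 1e-8)
    smoothed = np.convolve(log_amp, np.ones(window) / window, mode="same")
    residual = log_amp - smoothed

    # Back to the time domain: what remains after removing regular patterns.
    saliency = np.abs(np.fft.ifft(np.exp(residual + 1j * phase)))
    return saliency

# A clean "daily" cycle (period 24) with one injected spike at t=120.
t = np.arange(192)
series = np.sin(2 * np.pi * t / 24)
series[120] += 5.0

saliency = spectral_residual_saliency(series)
print(int(np.argmax(saliency)))  # → 120
```

The dominant sine frequency is flattened by the smoothing step, so the saliency map peaks at the spike rather than at the seasonal pattern.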

Facebook's Prophet-Based Detection

Facebook (Meta) open-sourced Prophet, a time series forecasting tool that's become a foundation for anomaly detection across the industry.

Prophet decomposes a time series into:

```
y(t) = trend(t) + seasonality(t) + holidays(t) + error(t)
```

For anomaly detection, the approach is:

  1. Fit a Prophet model to historical data
  2. Generate predictions with uncertainty intervals
  3. Flag any actual values outside the uncertainty interval
```python
from prophet import Prophet

# Train on historical metrics (a DataFrame with 'ds' timestamp and 'y' value columns)
model = Prophet(interval_width=0.99)  # 99% uncertainty interval
model.fit(historical_data)

# Predict the expected range for the new timestamps
forecast = model.predict(new_data[['ds']])

# Detect anomalies: actual values outside the predicted interval
# (.values avoids pandas index-alignment surprises between the two frames)
anomalies = new_data[
    (new_data['y'].values < forecast['yhat_lower'].values) |
    (new_data['y'].values > forecast['yhat_upper'].values)
]
```

The strength of this approach: it explicitly models trend changes, multiple seasonalities (daily, weekly, yearly), and holiday effects, exactly the patterns that drown simpler detectors in false positives.

Building Your Own Anomaly Detection: A Practical Approach

You don't need Uber's infrastructure to benefit from anomaly detection. Here's a practical approach that works at any scale:

Level 1: Dynamic Thresholds

Replace static thresholds with rolling statistics:

```python
import numpy as np

def is_anomaly(current_value, recent_values, num_std=3):
    """Flag values more than num_std standard deviations from the rolling mean."""
    mean = np.mean(recent_values)
    std = np.std(recent_values)

    lower_bound = mean - (num_std * std)
    upper_bound = mean + (num_std * std)

    return current_value < lower_bound or current_value > upper_bound
```

This catches obvious anomalies while adapting to changing baselines. It takes 30 minutes to implement.

Level 2: Seasonal Decomposition

Account for time-of-day and day-of-week patterns:

```python
from statsmodels.tsa.seasonal import seasonal_decompose

# Decompose into trend + seasonal + residual (hourly data, weekly period)
result = seasonal_decompose(metrics, model='additive', period=24 * 7)

# Anomalies are extreme residuals
residual_std = result.resid.std()
anomalies = abs(result.resid) > 3 * residual_std
```

Level 3: Multi-Metric Correlation

The most powerful anomalies aren't in individual metrics — they're in the relationships between metrics:

```
Normal: CPU up → Latency stable → Error rate stable
  (healthy scaling under load)

Anomalous: CPU up → Latency up → Error rate up
  (something is wrong — resource contention, memory leak, etc.)
```

Monitoring the correlation between metrics catches issues that no single metric would flag.
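A minimal sketch of this idea: flag windows where CPU and latency move together too tightly. The function name, window length, and threshold below are illustrative assumptions, not tuned recommendations:

```python
import numpy as np

def correlation_alert(cpu, latency, window=60, threshold=0.8):
    """Flag windows where CPU and latency are strongly positively correlated.

    Under healthy scaling, CPU can climb while latency stays flat; a strong
    positive correlation between the two suggests resource contention.
    """
    cpu = np.asarray(cpu[-window:], dtype=float)
    latency = np.asarray(latency[-window:], dtype=float)
    if cpu.std() == 0 or latency.std() == 0:
        return False  # a flat series has no meaningful correlation
    r = np.corrcoef(cpu, latency)[0, 1]
    return r > threshold
```

With a CPU ramp and flat-but-noisy latency the correlation hovers near zero (no alert); when latency tracks CPU, the correlation approaches 1 and the window is flagged.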

The Alert Fatigue Problem

The hardest part of anomaly detection isn't detecting anomalies — it's not crying wolf.

At LinkedIn's scale, naive anomaly detection generates thousands of alerts per day. Most are false positives. Engineers start ignoring alerts. Then a real anomaly gets missed.

Solutions that work:

Alert Aggregation

Group related alerts by service, time window, and dependency chain. Instead of 50 individual alerts, show one incident with context.
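A sketch of that grouping, assuming alerts arrive as dicts with hypothetical `service`, `metric`, and `ts` keys (adapt the keys and window to your pipeline):

```python
from collections import defaultdict

def aggregate_alerts(alerts, window_seconds=300):
    """Group raw alerts into incidents by service and time bucket.

    alerts: list of dicts like {"service": "api", "metric": "p99", "ts": 1700000000}
    Returns {(service, bucket): [alerts...]} — one entry per incident.
    """
    incidents = defaultdict(list)
    for alert in alerts:
        bucket = alert["ts"] // window_seconds
        incidents[(alert["service"], bucket)].append(alert)
    return dict(incidents)
```

The on-call view then renders one incident per key, with the member alerts as supporting context, instead of paging once per alert.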

Severity Scoring

Not all anomalies are equal. Score each one on factors such as: the magnitude of the deviation, the number of users or downstream services affected, and how long the anomaly has persisted.
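One way to combine such factors into a single score. The weights and caps here are illustrative assumptions, not values from any of the systems described above:

```python
def severity(deviation_sigmas, affected_services, duration_minutes):
    """Toy severity score combining magnitude, blast radius, and persistence.

    Each factor is normalized and capped so no single dimension dominates;
    the weights are illustrative assumptions.
    """
    return (
        min(deviation_sigmas / 3.0, 3.0) * 1.0      # how far outside normal
        + min(affected_services / 5.0, 3.0) * 2.0   # blast radius matters most
        + min(duration_minutes / 10.0, 3.0) * 1.0   # sustained beats transient
    )
```

A sustained 6-sigma anomaly hitting 12 services scores far higher than a brief 3-sigma blip on one service, so it sorts to the top of the on-call queue.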

Feedback Loops

Let engineers mark alerts as true/false positives. Feed this back into the model. Over time, the system learns which anomalies matter.

What I've Learned From Studying These Systems

  1. The right algorithm matters less than the right data. An Isolation Forest with clean, well-structured metrics outperforms a deep learning model with noisy data every time.

  2. Seasonality is not optional. Any system that doesn't account for time-of-day and day-of-week patterns will drown in false positives.

  3. Detection without root cause analysis is just noise. Knowing something is wrong is useless if the engineer still spends 30 minutes figuring out what.

  4. Start simple, iterate. Dynamic thresholds (Level 1 above) catch 80% of meaningful anomalies. Add complexity only when the simple approach fails.

  5. The on-call engineer is the final algorithm. Every system ultimately relies on a human making a judgment call. Your anomaly detection system should reduce the noise they have to process, not replace their judgment.

The companies that do monitoring well don't have fancier algorithms. They have systems that respect their engineers' attention as the scarce resource it is.


This analysis synthesizes public information from Uber's Argos system, LinkedIn's anomaly detection research, Microsoft's Anomaly Detection service, and Facebook's Prophet framework, combined with practical experience implementing monitoring systems at scale.

Up next

Consistent Hashing: The Algorithm That Quietly Powers Half the Internet

From DynamoDB to Discord, from Akamai to Cassandra — one algorithm decides where your data lives. Here's how consistent hashing works, why it's brilliant, and how Google improved it.