At 3:47 AM on a Tuesday, a single metric at Uber shifted by 0.3%.
Among the millions of metrics flowing through their monitoring systems every second — CPU utilization, request latency, error rates, queue depths, cache hit ratios — one metric on one service in one datacenter moved slightly outside its normal range.
Twelve minutes later, that 0.3% shift cascaded into a full regional outage affecting ride matching.
The question isn't whether anomalies will happen. It's whether you'll detect them before your users do.
Why Traditional Alerting Fails at Scale
Most teams start with threshold-based alerts:
IF cpu_usage > 80% THEN alert
IF error_rate > 5% THEN alert
IF latency_p99 > 500ms THEN alert
This works for ten services. At 1,000 services with 100 metrics each, you have 100,000 potential alerts. Setting and maintaining thresholds for each one is impossible. And static thresholds ignore context:
- CPU at 80% during peak hours? Normal.
- CPU at 80% at 3 AM? Something is very wrong.
- Latency at 200ms on Black Friday? Expected.
- Latency at 200ms on a random Tuesday? Investigate.
Big tech companies solved this with automated anomaly detection — systems that learn what "normal" looks like and alert when reality diverges from the model.
Uber's Argos: Real-Time Root Cause Analysis
Uber built Argos, a system that doesn't just detect anomalies — it traces them to their root cause automatically.
The Architecture
Metrics Collection (millions/sec)
        │
        ▼
┌───────────────────┐
│  Time Series DB   │ ← Store all metrics
└───────┬───────────┘
        │
        ▼
┌───────────────────┐
│ Anomaly Detector  │ ← Statistical models per metric
└───────┬───────────┘
        │ (anomalies detected)
        ▼
┌───────────────────┐
│   Correlation     │ ← Which anomalies are related?
│     Engine        │
└───────┬───────────┘
        │
        ▼
┌───────────────────┐
│   Root Cause      │ ← Dependency graph + temporal analysis
│    Analyzer       │
└───────────────────┘
How It Works
Step 1: Learn Normal. For every metric, Argos builds a model of expected behavior. This model accounts for:
- Time of day (traffic is higher at noon than 3 AM)
- Day of week (Monday mornings differ from Saturday nights)
- Seasonal trends (summer vs. winter patterns)
- Special events (holidays, promotions)
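A toy version of such a per-metric baseline can bucket values by (weekday, hour) and derive an expected range per bucket. This is an illustrative sketch, not Argos's actual model, which is not public:

```python
import numpy as np
from collections import defaultdict

class SeasonalBaseline:
    """Expected range per (weekday, hour) bucket: a toy stand-in for a
    per-metric model of 'normal' that respects time of day and day of week."""

    def __init__(self):
        self.samples = defaultdict(list)

    def observe(self, weekday, hour, value):
        self.samples[(weekday, hour)].append(value)

    def expected_range(self, weekday, hour, num_std=3):
        vals = self.samples[(weekday, hour)]
        mean, std = np.mean(vals), np.std(vals)
        return mean - num_std * std, mean + num_std * std
```

The same observed value can fall inside the range for one bucket (noon on a weekday) and far outside it for another (3 AM), which is exactly the context that static thresholds ignore.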
Step 2: Detect Deviations. When a metric falls outside its expected range (accounting for natural variance), it's flagged as an anomaly.
Step 3: Correlate. When 50 anomalies fire simultaneously, they're usually related. Argos uses the service dependency graph to cluster related anomalies.
Step 4: Root Cause. Using temporal ordering (which anomaly appeared first?) and dependency analysis (which service do the affected services depend on?), Argos identifies the probable root cause.
Instead of an on-call engineer seeing 50 alerts and spending 30 minutes figuring out what went wrong, they see: "Database connection pool exhaustion on Service X caused cascading latency increases in 12 downstream services."
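Steps 3 and 4 can be sketched as a ranking over anomalous services: prefer the service the most other anomalous services depend on, breaking ties by which anomaly appeared first. The data shapes and function below are hypothetical, not Uber's code:

```python
def probable_root_cause(anomalies, depends_on):
    """Rank anomalous services: most depended-upon first, earliest first.

    anomalies:  {service: first_seen_timestamp}
    depends_on: {service: set of services it calls}
    """
    def key(svc):
        # How many other anomalous services call this one?
        dependents = sum(1 for s in anomalies if svc in depends_on.get(s, set()))
        return (-dependents, anomalies[svc])

    return min(anomalies, key=key)
```

For example, if `web` and `api` both depend on `db` and all three went anomalous, `db` wins on both criteria and is reported as the probable root cause.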
LinkedIn's Isolation Forests
LinkedIn uses a fascinating algorithm called Isolation Forest for anomaly detection. The intuition is elegant:
Normal data points are hard to isolate. Anomalies are easy to isolate.
Here's how it works:
Given a dataset, build a random tree:
1. Pick a random feature (e.g., latency)
2. Pick a random split value
3. Partition data into left (< value) and right (>= value)
4. Repeat recursively

Anomalies end up in their own partition quickly (short path length)
Normal points require many splits to isolate (long path length)
Visually:
Normal distribution of response times:
[10ms, 12ms, 11ms, 13ms, 10ms, 500ms, 11ms, 12ms]

Isolation tree:
    Split at 100ms?
    ├── Yes: [500ms] ← Isolated in ONE split → ANOMALY
    └── No: [10, 12, 11, 13, 10, 11, 12]
        Split at 11.5ms?
        ├── Yes: [12, 13, 12]
        │        Split at 12.5ms?
        │        ... (many more splits to isolate any single point)
        └── No: [10, 11, 10, 11]
                ... (many more splits)
The 500ms response time was isolated in a single split. The normal values take many splits. The path length in the isolation tree becomes the anomaly score.
Why LinkedIn chose this:
- Linear time complexity: O(n) — critical when processing millions of data points
- No need to define "normal" upfront — it's learned from data
- Works on high-dimensional data (many metrics simultaneously)
- Handles concept drift naturally (retrain periodically)
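As a minimal runnable illustration, scikit-learn's off-the-shelf `IsolationForest` (an assumption for the sketch; LinkedIn's production system is not public) flags the 500ms outlier from the example above directly:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Response times from the example above, one feature per column
latencies = np.array([10, 12, 11, 13, 10, 500, 11, 12], dtype=float).reshape(-1, 1)

# contamination is the expected fraction of anomalies (1 of 8 here)
model = IsolationForest(n_estimators=100, contamination=0.125, random_state=0)
labels = model.fit_predict(latencies)  # -1 = anomaly, 1 = normal
```

The 500ms point receives the shortest average path length across the trees and is the one sample labeled `-1`.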
Microsoft's Anomaly Detection Service
Microsoft took a different approach. Instead of building per-team solutions, they created a centralized Anomaly Detection service used across Azure, Office 365, and Bing.
Their key innovation: Spectral Residual (SR) algorithm combined with Convolutional Neural Networks.
Spectral Residual: Finding What's Unusual in Frequency Space
The SR algorithm comes from computer vision (originally used to detect salient objects in images). Microsoft adapted it for time series:
1. Transform the time series to frequency domain (FFT)
2. Compute the "spectral residual" — the difference between
   the log spectrum and its smoothed version
3. Transform back to time domain
4. Points with high spectral residual are anomalies
Why this works: Regular patterns (daily cycles, weekly patterns) show up as dominant frequencies. When you subtract these regular patterns, what remains is the anomalous signal.
This is computationally efficient and handles seasonal patterns without explicit modeling.
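The four steps can be sketched with NumPy's FFT. The smoothing window below is an illustrative choice, and Microsoft's production service additionally feeds the resulting saliency map into a CNN:

```python
import numpy as np

def spectral_residual(series, window=3):
    """Saliency map via the Spectral Residual transform (sketch)."""
    fft = np.fft.fft(series)
    log_amp = np.log(np.abs(fft) + 1e-8)              # log amplitude spectrum
    kernel = np.ones(window) / window
    smoothed = np.convolve(log_amp, kernel, mode='same')
    residual = log_amp - smoothed                      # the spectral residual
    # Back to the time domain, keeping the original phase
    return np.abs(np.fft.ifft(np.exp(residual + 1j * np.angle(fft))))
```

Dominant periodic frequencies are suppressed by the subtraction, so points where the saliency map spikes well above its mean are the anomaly candidates.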
Facebook's Prophet-Based Detection
Facebook (Meta) open-sourced Prophet, a time series forecasting tool that's become a foundation for anomaly detection across the industry.
Prophet decomposes a time series into:
y(t) = trend(t) + seasonality(t) + holidays(t) + error(t)
For anomaly detection, the approach is:
- Fit a Prophet model to historical data
- Generate predictions with uncertainty intervals
- Flag any actual values outside the uncertainty interval
from prophet import Prophet

# Train on historical metrics (DataFrame with 'ds' and 'y' columns)
model = Prophet(interval_width=0.99)  # 99% uncertainty interval
model.fit(historical_data)

# Predict the expected range for the new timestamps
forecast = model.predict(new_data)

# Detect anomalies: actual values outside the uncertainty interval
# (assumes new_data and forecast share a default, aligned index)
anomalies = new_data[
    (new_data['y'] < forecast['yhat_lower']) |
    (new_data['y'] > forecast['yhat_upper'])
]
The strength of this approach is that it explicitly models:
- Trend: Long-term increase or decrease
- Weekly seasonality: Monday-Sunday patterns
- Annual seasonality: Holiday and seasonal patterns
- Holiday effects: Known events that cause unusual behavior
Building Your Own Anomaly Detection: A Practical Approach
You don't need Uber's infrastructure to benefit from anomaly detection. Here's a practical approach that works at any scale:
Level 1: Dynamic Thresholds
Replace static thresholds with rolling statistics:
import numpy as np

def is_anomaly(current_value, recent_values, num_std=3):
    mean = np.mean(recent_values)
    std = np.std(recent_values)

    lower_bound = mean - (num_std * std)
    upper_bound = mean + (num_std * std)

    return current_value < lower_bound or current_value > upper_bound
This catches obvious anomalies while adapting to changing baselines. It takes 30 minutes to implement.
Level 2: Seasonal Decomposition
Account for time-of-day and day-of-week patterns:
from statsmodels.tsa.seasonal import seasonal_decompose

# Decompose into trend + seasonal + residual
# (period=24*7 assumes hourly samples with weekly seasonality)
result = seasonal_decompose(metrics, model='additive', period=24*7)

# Anomalies are extreme residuals
residual_std = result.resid.std()
anomalies = abs(result.resid) > 3 * residual_std
Level 3: Multi-Metric Correlation
The most powerful anomalies aren't in individual metrics — they're in the relationships between metrics:
Normal:    CPU up → Latency stable → Error rate stable
           (healthy scaling under load)

Anomalous: CPU up → Latency up → Error rate up
           (something is wrong — resource contention, memory leak, etc.)
Monitoring the correlation between metrics catches issues that no single metric would flag.
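A rolling correlation check is one simple way to watch that relationship. The window size and threshold below are illustrative assumptions:

```python
import numpy as np

def coupling_flags(cpu, latency, window=30, threshold=0.8):
    """Flag windows where CPU and latency move together.

    Healthy scaling keeps latency roughly flat as CPU rises (low
    correlation); sustained high correlation suggests contention.
    """
    flags = []
    for i in range(window, len(cpu) + 1):
        r = np.corrcoef(cpu[i - window:i], latency[i - window:i])[0, 1]
        flags.append(bool(r > threshold))
    return flags
```

A service whose latency tracks its CPU almost perfectly is flagged even though neither metric alone has crossed a threshold.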
The Alert Fatigue Problem
The hardest part of anomaly detection isn't detecting anomalies — it's not crying wolf.
At LinkedIn's scale, naive anomaly detection generates thousands of alerts per day. Most are false positives. Engineers start ignoring alerts. Then a real anomaly gets missed.
Solutions that work:
Alert Aggregation
Group related alerts by service, time window, and dependency chain. Instead of 50 individual alerts, show one incident with context.
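A minimal grouping step might key alerts by service and time bucket. The alert tuple shape here is hypothetical, and a real pipeline would also merge buckets along the dependency chain:

```python
from collections import defaultdict

def aggregate_alerts(alerts, window_s=300):
    """Group raw alerts into incidents keyed by (service, 5-minute bucket).

    alerts: list of (timestamp_s, service, metric) tuples.
    """
    incidents = defaultdict(list)
    for ts, service, metric in alerts:
        incidents[(service, ts // window_s)].append(metric)
    return dict(incidents)
```

Three alerts on the same service within one window collapse into a single incident with the affected metrics attached as context.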
Severity Scoring
Not all anomalies are equal. Score based on:
- Blast radius: How many users are affected?
- Duration: Has this persisted for 5 seconds or 5 minutes?
- Trend: Is it getting worse?
- Historical impact: Has this pattern caused outages before?
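One way to blend these factors is a weighted score with saturating terms. The weights and saturation points below are illustrative, not tuned values:

```python
def severity(users_affected, duration_s, worsening, seen_before):
    """Blend the four factors above into a single score in [0, 1]."""
    blast = min(users_affected / 10_000, 1.0)   # saturates at 10k users
    persistence = min(duration_s / 300, 1.0)    # saturates at 5 minutes
    trend = 1.0 if worsening else 0.5           # is it getting worse?
    history = 1.0 if seen_before else 0.3       # has this pattern hurt before?
    return 0.4 * blast + 0.3 * persistence + 0.2 * trend + 0.1 * history
```

A large, long-lived, worsening, previously-seen anomaly scores near 1.0, while a brief blip affecting a handful of users stays near the bottom of the queue.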
Feedback Loops
Let engineers mark alerts as true/false positives. Feed this back into the model. Over time, the system learns which anomalies matter.
What I've Learned From Studying These Systems
- The right algorithm matters less than the right data. An Isolation Forest with clean, well-structured metrics outperforms a deep learning model with noisy data every time.
- Seasonality is not optional. Any system that doesn't account for time-of-day and day-of-week patterns will drown in false positives.
- Detection without root cause analysis is just noise. Knowing something is wrong is useless if the engineer still spends 30 minutes figuring out what.
- Start simple, iterate. Dynamic thresholds (Level 1 above) catch 80% of meaningful anomalies. Add complexity only when the simple approach fails.
- The on-call engineer is the final algorithm. Every system ultimately relies on a human making a judgment call. Your anomaly detection system should reduce the noise they have to process, not replace their judgment.
The companies that do monitoring well don't have fancier algorithms. They have systems that respect their engineers' attention as the scarce resource it is.
This analysis synthesizes public information from Uber's Argos system, LinkedIn's anomaly detection research, Microsoft's Anomaly Detection service, and Facebook's Prophet framework, combined with practical experience implementing monitoring systems at scale.