Forty million requests per second.
Let that number sink in. That's roughly 3.5 trillion requests hitting Uber's infrastructure every single day. And the only reason their systems don't collapse under this load is a carefully designed caching architecture that most engineers never see.
I've spent years studying how companies like Uber, Google, and Microsoft design their caching layers. What I've found is that the difference between amateur and production-grade caching isn't the cache itself — it's the strategy around it.
Why Caching Is the Most Underestimated Skill in System Design
Most engineers think of caching as "put Redis in front of your database." That's like saying surgery is "cut and stitch." Technically true. Practically useless.
Effective caching requires answering questions most people never ask:
- What happens when the cache goes down?
- How do you prevent thundering herds after a cache miss?
- How do you keep cache and database consistent?
- When should you NOT cache something?
Let's walk through the patterns that answer these questions at scale.
Pattern 1: Cache-Aside (Lazy Loading)
This is the most common pattern and what most people mean when they say "caching."
```
Read Request Flow:
1. Check cache → Hit? Return data
2. Cache miss → Read from database
3. Write result to cache
4. Return data
```
```typescript
async function getUser(userId: string): Promise<User> {
  // Step 1: Check cache
  const cached = await redis.get(`user:${userId}`);
  if (cached) return JSON.parse(cached);

  // Step 2: Cache miss - read from DB
  const { rows } = await db.query('SELECT * FROM users WHERE id = $1', [userId]);
  const user = rows[0];

  // Step 3: Populate cache with a 1-hour TTL
  await redis.setex(`user:${userId}`, 3600, JSON.stringify(user));

  return user;
}
```
When to use it: Read-heavy workloads where stale data is acceptable for short periods.
The trap: On a cache restart, every request becomes a cache miss simultaneously. This is the thundering herd problem, and it can take down your database instantly.
Pattern 2: Write-Through Cache
Every write goes to both the cache and the database simultaneously.
```
Write Request Flow:
1. Write to cache
2. Cache writes to database
3. Confirm to client
```
The key insight from Uber's architecture: Write-through caching eliminates the inconsistency window that cache-aside creates. When you write to the DB and then update the cache (cache-aside), there's a brief moment where the cache is stale. At 40M RPS, "brief" means thousands of requests served stale data.
The trade-off: Higher write latency (every write hits two systems) but perfect read consistency.
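The flow above can be sketched in a few lines. This is a minimal illustration, using in-memory `Map`s as stand-ins for the cache and the database (a real setup would use Redis and a SQL client); the point is the ordering: both systems are updated before the write is confirmed.

```typescript
// In-memory stand-ins for the cache and the database (illustration only)
const cache = new Map<string, string>();
const database = new Map<string, string>();

// Write-through: every write updates the cache AND the database
// before confirming to the client, so reads never see stale data.
async function writeThrough(key: string, value: string): Promise<void> {
  cache.set(key, value);     // 1. Write to cache
  database.set(key, value);  // 2. Cache writes to database
  // 3. Only now does the write return to the caller
}

async function read(key: string): Promise<string | undefined> {
  // Reads can always trust the cache; fall back to the DB on a miss
  return cache.get(key) ?? database.get(key);
}
```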
Pattern 3: Write-Behind (Write-Back) Cache
This is where it gets interesting. Writes go to the cache only, and the cache asynchronously flushes to the database.
```
Write Request Flow:
1. Write to cache → Return immediately
2. (Async) Cache writes to database in batches
```
Why Uber uses this for specific workloads: When write latency matters more than durability. Location updates from millions of active drivers are a perfect example — you can afford to lose a few seconds of location data, but you can't afford each update to take 50ms instead of 2ms.
The risk: If the cache crashes before flushing, you lose data. This pattern is only appropriate when you explicitly accept that trade-off.
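A minimal write-behind sketch, again with `Map`s standing in for the cache and database (hypothetical; a production system would pair Redis with a durable queue or write-ahead log). The dirty set tracks keys awaiting a batched flush, and the data-loss window is exactly the contents of that set:

```typescript
const cache = new Map<string, string>();
const database = new Map<string, string>();
const dirty = new Set<string>();  // keys written but not yet flushed

// Write-behind: writes touch only the cache and return immediately.
function writeBehind(key: string, value: string): void {
  cache.set(key, value);
  dirty.add(key);  // mark for a later batched flush
}

// Flush all dirty entries to the database in one batch.
// In a real system this would run on a timer or size threshold.
function flush(): number {
  let flushed = 0;
  for (const key of dirty) {
    const value = cache.get(key);
    if (value !== undefined) {
      database.set(key, value);
      flushed++;
    }
  }
  dirty.clear();
  return flushed;
}
```

If the process dies before `flush()` runs, everything in `dirty` is lost, which is precisely the trade-off described above.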
The Thundering Herd Solution
This is the problem that separates toy systems from production systems.
Imagine your cache for a popular item expires. In the microseconds before any single request can repopulate it, 10,000 requests arrive and all see a cache miss. All 10,000 hit the database simultaneously.
The solution: Request coalescing (also called cache stampede protection)
```typescript
const inFlightRequests = new Map<string, Promise<any>>();

async function getWithCoalescing(key: string, fetcher: () => Promise<any>) {
  // Check cache first
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);

  // If someone is already fetching this key, wait for their result
  const inFlight = inFlightRequests.get(key);
  if (inFlight) return inFlight;

  // First request wins - everyone else waits
  const promise = fetcher()
    .then(async (result) => {
      await redis.setex(key, 3600, JSON.stringify(result));
      return result;
    })
    .finally(() => {
      // Clean up even if the fetch fails, so the key can be retried
      inFlightRequests.delete(key);
    });

  inFlightRequests.set(key, promise);
  return promise;
}
```
Google's Guava cache implements this natively with LoadingCache: only one thread computes the value; all others block and wait for it. This unglamorous pattern has likely prevented more outages than any other single caching strategy.
Cache Invalidation: One of the Two Hard Problems
"There are only two hard things in Computer Science: cache invalidation and naming things." — Phil Karlton
Here's why cache invalidation is hard. Consider this scenario:
```
Time 0ms: User updates email to "new@email.com"
Time 1ms: DB updated successfully
Time 2ms: Another request reads user data (cache hit — returns old email)
Time 3ms: Cache invalidated
Time 4ms: Requests now see updated email
```
For 2 milliseconds, the system served stale data. At Uber's scale, that's 80,000 requests with wrong data.
The strategies that actually work at scale:
Event-Driven Invalidation (CDC)
Netflix pioneered using Change Data Capture for cache invalidation. Instead of application code invalidating the cache, a CDC pipeline watches the database transaction log and invalidates cache entries as data changes.
This decouples your application from cache management entirely and guarantees no write can be missed.
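The consumer side of a CDC pipeline can be sketched simply. This assumes change events arrive as `{table, primaryKey}` records (e.g. from a Debezium-style connector tailing the transaction log); the event shape and key-naming scheme here are illustrative, not a real Debezium schema:

```typescript
// Shape of a change event as it might arrive from a CDC pipeline
// (illustrative; real event schemas carry much more detail).
interface ChangeEvent {
  table: string;
  primaryKey: string;
}

const cache = new Map<string, string>();  // stand-in for Redis

// The CDC consumer maps each database change to the cache key(s) it
// invalidates. Application code never touches the cache on writes —
// every committed change reaches this handler via the transaction log.
function onChangeEvent(event: ChangeEvent): void {
  const cacheKey = `${event.table}:${event.primaryKey}`;
  cache.delete(cacheKey);
}
```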
TTL with Jitter
Never set the same TTL for all cache entries:
```typescript
// Bad: All entries expire at exactly the same time
await redis.setex(key, 3600, value);

// Good: Random jitter prevents synchronized expiration
const jitter = Math.floor(Math.random() * 600); // 0-10 minutes
await redis.setex(key, 3600 + jitter, value);
```
Without jitter, popular cache entries expire simultaneously, causing periodic load spikes that look like heartbeats on your monitoring dashboard.
The Cache Hierarchy That Handles 40M RPS
Uber's actual architecture uses multiple cache layers:
```
L1: In-process cache (microsecond access, limited size)
      ↓ miss
L2: Distributed cache - Redis/Memcached (sub-millisecond, large capacity)
      ↓ miss
L3: Database with connection pooling
```
The in-process cache (L1) is the secret weapon. Libraries like Caffeine (the JVM successor to Guava's cache) provide:
- Window TinyLFU eviction — near-optimal hit rates
- Async refresh — expired entries serve stale while refreshing in the background
- Size-based eviction — automatically manages memory pressure
At Uber's scale, an L1 cache absorbing even 30% of requests means 12 million fewer requests hitting the distributed cache per second.
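The L1/L2 lookup path can be sketched as follows. This is a simplified illustration: the `Map` stands in for Redis, and the sizes and TTLs are made-up numbers, not Uber's actual configuration. The key move is promotion — an L2 hit gets copied into L1 so the next lookup never leaves the process:

```typescript
interface L1Entry { value: string; expiresAt: number; }

const l1 = new Map<string, L1Entry>();  // in-process cache
const l2 = new Map<string, string>();   // stand-in for Redis
const L1_TTL_MS = 1000;                 // short TTL keeps L1 reasonably fresh

function getLayered(key: string): string | undefined {
  // L1: microsecond access, bounded staleness via a short TTL
  const entry = l1.get(key);
  if (entry && entry.expiresAt > Date.now()) return entry.value;

  // L2: distributed cache; on a hit, promote the value into L1
  const value = l2.get(key);
  if (value !== undefined) {
    l1.set(key, { value, expiresAt: Date.now() + L1_TTL_MS });
  }
  return value; // undefined here would fall through to the database (L3)
}
```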
What I've Learned
After studying caching architectures at Uber, Google, Netflix, and Microsoft, here are the principles that actually matter:
- Cache the computation, not just the data. If a query involves joining 5 tables, cache the result, not the individual tables.
- Monitor your hit rates religiously. A cache with 50% hit rate is worse than no cache: you have all the complexity of cache invalidation with half the benefit.
- Design for cache failure. Your system must survive a total cache wipeout. If it can't, your cache isn't a cache; it's a single point of failure.
- Start with cache-aside, evolve as needed. Most systems never need write-behind caching. Don't over-engineer.
- The best cache is the request you never make. Before adding a cache layer, ask if you can redesign the access pattern to avoid the hot path entirely.
Caching isn't glamorous. It doesn't make for exciting architecture diagrams. But the difference between a system that handles 40 million RPS and one that falls over at 40 thousand is almost always the caching layer.
This analysis draws from public engineering blogs and architecture talks from Uber, Google, Netflix, and Microsoft, combined with patterns documented in Martin Kleppmann's "Designing Data-Intensive Applications."