A few years ago, I sat in a design review where a team proposed building a real-time analytics pipeline. They'd spent three weeks designing the architecture. Kafka, Flink, Elasticsearch, Redis — the whole zoo.
A senior architect asked one question: "How many events per second are we expecting?"
"About 50."
Fifty events per second. A single PostgreSQL instance with a basic index handles that without breaking a sweat. The team had designed a system 1,000x more complex than necessary because nobody did the math first.
This is why back-of-the-envelope calculations matter. They're not about getting exact numbers. They're about getting the order of magnitude right so you don't build a rocket ship when you need a bicycle.
The Numbers Every Engineer Should Memorize
Before you can estimate anything, you need reference points. Memorize these:
Latency Numbers
- L1 cache reference: 1 ns
- L2 cache reference: 4 ns
- Main memory reference: 100 ns
- SSD random read: 150 μs
- HDD random read: 10 ms
- Network round trip (same DC): 500 μs
- Network round trip (cross-country): 60 ms
- Network round trip (cross-ocean): 150 ms
The key insight: each level is separated by orders of magnitude. Memory is roughly 1,000x faster than SSD. SSD is roughly 100x faster than HDD. A same-datacenter round trip is roughly 100x faster than cross-country.
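Those ratios are easy to verify. A minimal sketch, using the table's values converted to nanoseconds:

```python
# Latency reference points from the table above, in nanoseconds.
latency_ns = {
    "main_memory": 100,
    "ssd_random_read": 150_000,      # 150 microseconds
    "hdd_random_read": 10_000_000,   # 10 milliseconds
}

# Memory vs. SSD: about three orders of magnitude.
mem_vs_ssd = latency_ns["ssd_random_read"] / latency_ns["main_memory"]
# SSD vs. HDD: about two orders of magnitude.
ssd_vs_hdd = latency_ns["hdd_random_read"] / latency_ns["ssd_random_read"]

print(f"Memory is ~{mem_vs_ssd:,.0f}x faster than SSD")  # ~1,500x
print(f"SSD is ~{ssd_vs_hdd:,.0f}x faster than HDD")     # ~67x
```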
Throughput Numbers
- Sequential read from SSD: 1 GB/s
- Sequential read from HDD: 100 MB/s
- Network bandwidth (1 Gbps): 125 MB/s
- Network bandwidth (10 Gbps): 1.25 GB/s
Storage Numbers
- 1 ASCII character: 1 byte
- Average English word: 5 bytes
- Average tweet/message: 200 bytes
- Average JSON API response: 2 KB
- Average web page: 2 MB
- Average photo (compressed): 2 MB
- 1 minute of HD video: 150 MB
Capacity Numbers
- QPS a single web server handles: 1,000 - 10,000
- QPS a single database handles: 5,000 - 10,000
- QPS Redis handles (single node): 100,000 - 200,000
- Requests a load balancer handles: 100,000+
The Framework: 5 Steps to Any Estimate
I use the same framework every time, whether I'm estimating storage for a social media feed or throughput for a payment system.
Step 1: Define the Scale
Start with the number of users and their behavior.
Example: Estimate Twitter's storage needs

Users: 500 million total
Daily active users: 200 million (40% DAU ratio)
Tweets per user: 2 per day (average)
Total tweets/day: 400 million
Step 2: Estimate the Data Size Per Unit
Average tweet:
  - Text: 280 chars → 280 bytes
  - Metadata: user_id, timestamp, etc. → 200 bytes
  - Media link (30% have media) → 100 bytes average
Total per tweet: ~600 bytes ≈ 1 KB (round up for safety)
Pro tip: Always round up. It's better to overestimate capacity than underestimate. In system design, the cost of having too much capacity is low; the cost of too little is an outage.
Step 3: Calculate Daily / Monthly / Yearly
Daily storage: 400M tweets × 1 KB = 400 GB/day
Monthly storage: 400 GB/day × 30 = 12 TB/month
Yearly storage: 12 TB/month × 12 = 144 TB/year
5-year storage: 144 TB/year × 5 = 720 TB
Step 4: Factor in Replication and Overhead
Real systems don't store one copy of anything.
Replication factor: 3 (standard for most distributed systems)
Storage with replication: 720 TB × 3 = 2.16 PB
Index overhead: ~30%
Total 5-year storage: 2.16 PB × 1.3 ≈ 2.8 PB
Step 5: Sanity Check
Does 2.8 PB over 5 years for Twitter's text data sound reasonable? Twitter (X) has reported petabyte-scale storage requirements, so yes — we're in the right ballpark.
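The five steps above condense into a few lines of arithmetic. A sketch using the rounded assumptions from the text (30-day months, so 360 days/year):

```python
# Back-of-the-envelope Twitter storage estimate, following the
# five-step framework. All inputs are the rounded numbers above.

# Step 1: scale
dau = 200_000_000                 # daily active users
tweets_per_day = dau * 2          # 2 tweets/user -> 400 million/day

# Step 2: size per unit, rounded up for safety
bytes_per_tweet = 1_000           # ~600 bytes actual, 1 KB rounded

# Step 3: daily -> 5-year (30-day months, as in the text)
daily_gb = tweets_per_day * bytes_per_tweet / 1e9   # 400 GB/day
five_year_tb = daily_gb * 360 * 5 / 1e3             # 720 TB

# Step 4: replication and index overhead
replication = 3
index_overhead = 1.3
total_pb = five_year_tb * replication * index_overhead / 1e3

print(f"Raw 5-year storage: {five_year_tb:,.0f} TB")   # 720 TB
print(f"With replication + indexes: {total_pb:.1f} PB") # ~2.8 PB
```

Parameterizing the inputs like this makes the sanity check in Step 5 cheap: change one assumption and the whole estimate updates.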
Real-World Estimation: YouTube's Bandwidth
Let's do a more complex estimation that Google's own capacity planning teams have discussed.
YouTube statistics:
- 2.5 billion monthly active users
- Average session: 40 minutes/day
- Average video bitrate: 5 Mbps (mix of qualities)
Daily bandwidth calculation
Concurrent viewers (peak):
  2.5B MAU × 0.3 (DAU ratio) × 0.1 (peak fraction)
  = 75 million concurrent viewers

Bandwidth:
  75M viewers × 5 Mbps = 375 Terabits/second
  = 375 Tbps ÷ 8 = ~47 TB/s

Wait, that seems too high. Let me reconsider...

Actually, not all users watch simultaneously.
Peak concurrent = total_daily_minutes / minutes_in_day × peak_factor

Total daily minutes: 750M DAU × 40 min = 30 billion minutes
Average concurrent: 30B min / 1440 min = ~20 million
Peak concurrent (2x average): ~40 million

Bandwidth: 40M × 5 Mbps = 200 Tbps = 25 TB/s
YouTube has reported peak bandwidth in the hundreds of terabits per second range. Our estimate of 200 Tbps is reasonable.
Notice what happened: My first estimate was wrong, but the framework helped me catch it. I sanity-checked, found an error in my concurrent user assumption, and corrected it. This iterative process is the whole point.
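The corrected reasoning can be sketched the same way. A minimal version, with the document's assumed inputs (2.5B MAU, 30% DAU ratio, 40 min/day, 5 Mbps, 2x peak factor):

```python
# YouTube peak-bandwidth estimate, following the corrected
# derivation above (concurrency from total watch time, not a
# guessed "peak fraction" of users).
mau = 2_500_000_000
dau = mau * 0.3                   # 750M daily actives (assumed ratio)
minutes_per_user = 40

total_daily_minutes = dau * minutes_per_user     # 30 billion minutes
avg_concurrent = total_daily_minutes / 1440      # ~20.8 million
peak_concurrent = avg_concurrent * 2             # ~41.7 million

bitrate_mbps = 5
peak_tbps = peak_concurrent * bitrate_mbps / 1e6 # Mbps -> Tbps

# ~208 Tbps, which rounds to the ~200 Tbps figure in the text.
print(f"Peak bandwidth: ~{peak_tbps:.0f} Tbps")
```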
The Powers of 2 Cheat Sheet
These come up constantly in capacity estimation:
2^10 = 1,024 ≈ 1 Thousand (1 KB)
2^20 = 1,048,576 ≈ 1 Million (1 MB)
2^30 = 1,073,741,824 ≈ 1 Billion (1 GB)
2^40 = 1,099,511,627,776 ≈ 1 Trillion (1 TB)

Daily seconds: 86,400 ≈ 10^5
Monthly seconds: 2,592,000 ≈ 2.5 × 10^6
Yearly seconds: 31,536,000 ≈ 3 × 10^7
Quick QPS conversions:
1 million requests/day = ~12 QPS
10 million requests/day = ~120 QPS
100 million requests/day = ~1,200 QPS
1 billion requests/day = ~12,000 QPS
Memorize these conversions. They let you go from "we have X million daily users" to "we need Y QPS" in seconds.
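The conversion itself is one line of arithmetic. A sketch of a hypothetical helper (the name `daily_to_qps` is mine, not a standard API):

```python
def daily_to_qps(requests_per_day: float) -> float:
    """Convert daily request volume to average QPS.

    Uses 86,400 seconds per day. Round up in real capacity
    plans, and remember this is average, not peak.
    """
    return requests_per_day / 86_400

print(round(daily_to_qps(1_000_000), 1))      # ~11.6 QPS
print(round(daily_to_qps(1_000_000_000)))     # ~11,574 QPS
```

Note the table's "~12 QPS per million/day" is the rounded-up version of 11.6; for mental math, dividing by 100,000 (≈ 10^5 seconds/day) gets you close enough.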
Common Estimation Mistakes
Mistake 1: Confusing Peak with Average
Your system needs to handle peak load, not average load. Peak is typically 2-10x average, depending on the application.
E-commerce: Peak (Black Friday) = 10x average
Social media: Peak (evening) = 3x average
B2B SaaS: Peak (Monday morning) = 2x average
Estimate average first, then multiply by the peak factor — and always design for peak.
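A minimal sketch of that sizing step, using the illustrative peak factors from the table (the `PEAK_FACTORS` lookup and `design_qps` helper are hypothetical names):

```python
# Illustrative peak multipliers from the text above.
PEAK_FACTORS = {"ecommerce": 10, "social_media": 3, "b2b_saas": 2}

def design_qps(avg_qps: float, workload: str) -> float:
    """Capacity to design for: average load times the peak factor."""
    return avg_qps * PEAK_FACTORS[workload]

# A 100M requests/day e-commerce site averages ~1,200 QPS,
# but must be sized for ~12,000 QPS.
print(design_qps(1_200, "ecommerce"))  # 12000
```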
Mistake 2: Forgetting the 80/20 Rule
80% of traffic typically hits 20% of your data. This has massive implications for caching:
Total data: 10 TB
Hot data (20%): 2 TB
If 2 TB fits in cache → 80% cache hit rate
Your database only handles 20% of read traffic
Mistake 3: Ignoring Write Amplification
When you write 1 KB to a database, the actual disk I/O is much higher:
- Write-ahead log
- MemTable flush
- Compaction (LSM trees rewrite data multiple times)
- Replication to other nodes
- Index updates
Rule of thumb: Actual disk writes are 10-30x the logical write size for LSM-tree databases.
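Combined with replication, the multiplier compounds. A rough sketch (the 20x midpoint and 3 replicas are assumptions for illustration):

```python
# Write amplification: logical write vs. cluster-wide disk I/O.
logical_write_kb = 1
amplification = 20       # midpoint of the 10-30x rule of thumb
replicas = 3             # each replica does its own WAL/flush/compaction

physical_write_kb = logical_write_kb * amplification * replicas
print(f"1 KB logical write -> ~{physical_write_kb} KB of disk I/O")
```

So a workload of 10,000 logical writes/second of 1 KB each can translate to hundreds of MB/s of actual disk traffic across the cluster.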
Mistake 4: Assuming Linear Scaling
Doubling your servers doesn't double your capacity. Coordination overhead, network bottlenecks, and shared resources mean you typically get 70-80% efficiency at scale.
1 server: 10,000 QPS
2 servers: 18,000 QPS (not 20,000)
10 servers: 70,000 QPS (not 100,000)
100 servers: 500,000 QPS (not 1,000,000)
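The table above amounts to an efficiency factor that decays with cluster size. A sketch that reproduces those numbers (the factors are illustrative, not a law — real efficiency curves come from load testing):

```python
# Sublinear scaling: per-server efficiency drops as the cluster grows.
PER_SERVER_QPS = 10_000
EFFICIENCY = {1: 1.0, 2: 0.9, 10: 0.7, 100: 0.5}  # assumed, matches table

def cluster_qps(servers: int) -> int:
    """Effective cluster capacity under the assumed efficiency curve."""
    return round(servers * PER_SERVER_QPS * EFFICIENCY[servers])

print(cluster_qps(2))    # 18,000 -- not 20,000
print(cluster_qps(100))  # 500,000 -- not 1,000,000
```

When estimating server counts, divide your required peak QPS by the *effective* per-server throughput at your target scale, not the single-server benchmark number.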
The Meta-Skill: Thinking in Orders of Magnitude
The real power of back-of-the-envelope calculations isn't producing exact numbers. It's developing intuition about scale.
When someone proposes a design, you should instantly know:
- "That's a ~1,000 QPS problem" → single server with caching
- "That's a ~100,000 QPS problem" → distributed system with load balancing
- "That's a ~10,000,000 QPS problem" → multi-region with CDN and aggressive caching
This intuition prevents the most expensive mistake in engineering: building the wrong thing at the wrong scale.
A system designed for 10x your actual needs is fine (growth headroom). A system designed for 1,000x your actual needs is a maintenance nightmare that costs 100x more than necessary. A system designed for 0.1x your actual needs falls over in production.
Get within 10x of the right answer, and you'll make the right architectural decisions. That's all back-of-the-envelope calculations need to do.
The numbers and frameworks in this post draw from Jeff Dean's famous latency numbers, Google's capacity planning methodologies, AWS's back-of-the-envelope estimation guides, and real-world system design interview patterns.