A few years ago, I sat in a design review where a team proposed building a real-time analytics pipeline. They'd spent three weeks designing the architecture. Kafka, Flink, Elasticsearch, Redis — the whole zoo.
A senior architect asked one question: "How many events per second are we expecting?"
"About 50."
Fifty events per second. A single PostgreSQL instance with a basic index handles that without breaking a sweat. The team had designed a system 1,000x more complex than necessary because nobody did the math first.
This is why back-of-the-envelope calculations matter. They're not about getting exact numbers. They're about getting the order of magnitude right so you don't build a rocket ship when you need a bicycle.
The Numbers Every Engineer Should Memorize
Before you can estimate anything, you need reference points. Memorize these:
Latency Numbers
- L1 cache reference: 1 ns
- L2 cache reference: 4 ns
- Main memory reference: 100 ns
- SSD random read: 150 μs
- HDD random read: 10 ms
- Network round trip (same DC): 500 μs
- Network round trip (cross-country): 60 ms
- Network round trip (cross-ocean): 150 ms
The key insight: each level is separated by orders of magnitude. Memory is roughly 1,000x faster than SSD. SSD is roughly 100x faster than HDD. A same-datacenter round trip is roughly 100x faster than cross-country.
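Those ratios are easy to verify. A minimal sketch, using the table's values converted to nanoseconds:

```python
# Latency reference points from the table above, in nanoseconds.
latency_ns = {
    "main_memory": 100,
    "ssd_random_read": 150_000,      # 150 microseconds
    "hdd_random_read": 10_000_000,   # 10 milliseconds
}

# Memory vs. SSD: about three orders of magnitude.
mem_vs_ssd = latency_ns["ssd_random_read"] / latency_ns["main_memory"]
# SSD vs. HDD: about two orders of magnitude.
ssd_vs_hdd = latency_ns["hdd_random_read"] / latency_ns["ssd_random_read"]

print(f"Memory is ~{mem_vs_ssd:,.0f}x faster than SSD")  # ~1,500x
print(f"SSD is ~{ssd_vs_hdd:,.0f}x faster than HDD")     # ~67x
```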
Throughput Numbers
- Sequential read from SSD: 1 GB/s
- Sequential read from HDD: 100 MB/s
- Network bandwidth (1 Gbps): 125 MB/s
- Network bandwidth (10 Gbps): 1.25 GB/s
Storage Numbers
- 1 ASCII character: 1 byte
- Average English word: 5 bytes
- Average tweet/message: 200 bytes
- Average JSON API response: 2 KB
- Average web page: 2 MB
- Average photo (compressed): 2 MB
- 1 minute of HD video: 150 MB
Capacity Numbers
- QPS a single web server handles: 1,000 - 10,000
- QPS a single database handles: 5,000 - 10,000
- QPS Redis handles (single node): 100,000 - 200,000
- Requests a load balancer handles: 100,000+
The Framework: 5 Steps to Any Estimate
I use the same framework every time, whether I'm estimating storage for a social media feed or throughput for a payment system.
Step 1: Define the Scale
Start with the number of users and their behavior.
Example: Estimate Twitter's storage needs

Users: 500 million total
Daily active users: 200 million (40% DAU ratio)
Tweets per user: 2 per day (average)
Total tweets/day: 400 million
Step 2: Estimate the Data Size Per Unit
Average tweet:
  - Text: 280 chars → 280 bytes
  - Metadata: user_id, timestamp, etc. → 200 bytes
  - Media link (30% have media) → 100 bytes average
Total per tweet: ~600 bytes ≈ 1 KB (round up for safety)
Pro tip: Always round up. It's better to overestimate capacity than underestimate. In system design, the cost of having too much capacity is low; the cost of too little is an outage.
Step 3: Calculate Daily / Monthly / Yearly
Daily storage: 400M tweets × 1 KB = 400 GB/day
Monthly storage: 400 GB/day × 30 = 12 TB/month
Yearly storage: 12 TB/month × 12 = 144 TB/year
5-year storage: 144 TB/year × 5 = 720 TB
Step 4: Factor in Replication and Overhead
Real systems don't store one copy of anything.
Replication factor: 3 (standard for most distributed systems)
Storage with replication: 720 TB × 3 = 2.16 PB
Index overhead: ~30%
Total 5-year storage: 2.16 PB × 1.3 ≈ 2.8 PB
Step 5: Sanity Check
Does 2.8 PB over 5 years for Twitter's text data sound reasonable? Twitter (X) has reported petabyte-scale storage requirements, so yes — we're in the right ballpark.
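The five steps above condense into a few lines of arithmetic. A sketch using the rounded assumptions from the text (30-day months, so 360 days/year):

```python
# Back-of-the-envelope Twitter storage estimate, following the
# five-step framework. All inputs are the rounded numbers above.

# Step 1: scale
dau = 200_000_000                 # daily active users
tweets_per_day = dau * 2          # 2 tweets/user -> 400 million/day

# Step 2: size per unit, rounded up for safety
bytes_per_tweet = 1_000           # ~600 bytes actual, 1 KB rounded

# Step 3: daily -> 5-year (30-day months, as in the text)
daily_gb = tweets_per_day * bytes_per_tweet / 1e9   # 400 GB/day
five_year_tb = daily_gb * 360 * 5 / 1e3             # 720 TB

# Step 4: replication and index overhead
replication = 3
index_overhead = 1.3
total_pb = five_year_tb * replication * index_overhead / 1e3

print(f"Raw 5-year storage: {five_year_tb:,.0f} TB")   # 720 TB
print(f"With replication + indexes: {total_pb:.1f} PB") # ~2.8 PB
```

Parameterizing the inputs like this makes the sanity check in Step 5 cheap: change one assumption and the whole estimate updates.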
Real-World Estimation: YouTube's Bandwidth
Let's do a more complex estimation that Google's own capacity planning teams have discussed.
YouTube statistics:
- 2.5 billion monthly active users
- Average session: 40 minutes/day
- Average video bitrate: 5 Mbps (mix of qualities)
Daily bandwidth calculation
Concurrent viewers (peak):
  2.5B MAU × 0.3 (DAU ratio) × 0.1 (peak fraction)
  = 75 million concurrent viewers

Bandwidth:
  75M viewers × 5 Mbps = 375 Terabits/second
  = 375 Tbps ÷ 8 = ~47 TB/s

Wait, that seems too high. Let me reconsider...

Actually, not all users watch simultaneously.
Peak concurrent = total_daily_minutes / minutes_in_day × peak_factor

Total daily minutes: 750M DAU × 40 min = 30 billion minutes
Average concurrent: 30B min / 1440 min = ~20 million
Peak concurrent (2x average): ~40 million

Bandwidth: 40M × 5 Mbps = 200 Tbps = 25 TB/s
YouTube has reported peak bandwidth in the hundreds of terabits per second range. Our estimate of 200 Tbps is reasonable.
Notice what happened: My first estimate was wrong, but the framework helped me catch it. I sanity-checked, found an error in my concurrent user assumption, and corrected it. This iterative process is the whole point.
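The corrected reasoning can be sketched the same way. A minimal version, with the document's assumed inputs (2.5B MAU, 30% DAU ratio, 40 min/day, 5 Mbps, 2x peak factor):

```python
# YouTube peak-bandwidth estimate, following the corrected
# derivation above (concurrency from total watch time, not a
# guessed "peak fraction" of users).
mau = 2_500_000_000
dau = mau * 0.3                   # 750M daily actives (assumed ratio)
minutes_per_user = 40

total_daily_minutes = dau * minutes_per_user     # 30 billion minutes
avg_concurrent = total_daily_minutes / 1440      # ~20.8 million
peak_concurrent = avg_concurrent * 2             # ~41.7 million

bitrate_mbps = 5
peak_tbps = peak_concurrent * bitrate_mbps / 1e6 # Mbps -> Tbps

# ~208 Tbps, which rounds to the ~200 Tbps figure in the text.
print(f"Peak bandwidth: ~{peak_tbps:.0f} Tbps")
```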
The Powers of 2 Cheat Sheet
These come up constantly in capacity estimation:
2^10 = 1,024 ≈ 1 Thousand (1 KB)
2^20 = 1,048,576 ≈ 1 Million (1 MB)
2^30 = 1,073,741,824 ≈ 1 Billion (1 GB)
2^40 = 1,099,511,627,776 ≈ 1 Trillion (1 TB)

Daily seconds: 86,400 ≈ 10^5
Monthly seconds: 2,592,000 ≈ 2.5 × 10^6
Yearly seconds: 31,536,000 ≈ 3 × 10^7
Quick QPS conversions:
1 million requests/day = ~12 QPS
10 million requests/day = ~120 QPS
100 million requests/day = ~1,200 QPS
1 billion requests/day = ~12,000 QPS
Memorize these conversions. They let you go from "we have X million daily users" to "we need Y QPS" in seconds.
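The conversion itself is one line of arithmetic. A sketch of a hypothetical helper (the name `daily_to_qps` is mine, not a standard API):

```python
def daily_to_qps(requests_per_day: float) -> float:
    """Convert daily request volume to average QPS.

    Uses 86,400 seconds per day. Round up in real capacity
    plans, and remember this is average, not peak.
    """
    return requests_per_day / 86_400

print(round(daily_to_qps(1_000_000), 1))      # ~11.6 QPS
print(round(daily_to_qps(1_000_000_000)))     # ~11,574 QPS
```

Note the table's "~12 QPS per million/day" is the rounded-up version of 11.6; for mental math, dividing by 100,000 (≈ 10^5 seconds/day) gets you close enough.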
Common Estimation Mistakes
Mistake 1: Confusing Peak with Average
Your system needs to handle peak load, not average load. Peak is typically 2-10x average, depending on the application.
E-commerce: Peak (Black Friday) = 10x average
Social media: Peak (evening) = 3x average
B2B SaaS: Peak (Monday morning) = 2x average
Estimate average first, then multiply by the peak factor — and always design for peak.
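A minimal sketch of that sizing step, using the illustrative peak factors from the table (the `PEAK_FACTORS` lookup and `design_qps` helper are hypothetical names):

```python
# Illustrative peak multipliers from the text above.
PEAK_FACTORS = {"ecommerce": 10, "social_media": 3, "b2b_saas": 2}

def design_qps(avg_qps: float, workload: str) -> float:
    """Capacity to design for: average load times the peak factor."""
    return avg_qps * PEAK_FACTORS[workload]

# A 100M requests/day e-commerce site averages ~1,200 QPS,
# but must be sized for ~12,000 QPS.
print(design_qps(1_200, "ecommerce"))  # 12000
```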
Mistake 2: Forgetting the 80/20 Rule
80% of traffic typically hits 20% of your data. This has massive implications for caching:
Total data: 10 TB
Hot data (20%): 2 TB
If 2 TB fits in cache → 80% cache hit rate
Your database only handles 20% of read traffic
Mistake 3: Ignoring Write Amplification
When you write 1 KB to a database, the actual disk I/O is much higher:
- Write-ahead log
- MemTable flush
- Compaction (LSM trees rewrite data multiple times)
- Replication to other nodes
- Index updates
Rule of thumb: Actual disk writes are 10-30x the logical write size for LSM-tree databases.
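Combined with replication, the multiplier compounds. A rough sketch (the 20x midpoint and 3 replicas are assumptions for illustration):

```python
# Write amplification: logical write vs. cluster-wide disk I/O.
logical_write_kb = 1
amplification = 20       # midpoint of the 10-30x rule of thumb
replicas = 3             # each replica does its own WAL/flush/compaction

physical_write_kb = logical_write_kb * amplification * replicas
print(f"1 KB logical write -> ~{physical_write_kb} KB of disk I/O")
```

So a workload of 10,000 logical writes/second of 1 KB each can translate to hundreds of MB/s of actual disk traffic across the cluster.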
Mistake 4: Assuming Linear Scaling
Doubling your servers doesn't double your capacity. Coordination overhead, network bottlenecks, and shared resources mean you typically get 70-80% efficiency at scale.
1 server: 10,000 QPS
2 servers: 18,000 QPS (not 20,000)
10 servers: 70,000 QPS (not 100,000)
100 servers: 500,000 QPS (not 1,000,000)
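The table above amounts to an efficiency factor that decays with cluster size. A sketch that reproduces those numbers (the factors are illustrative, not a law — real efficiency curves come from load testing):

```python
# Sublinear scaling: per-server efficiency drops as the cluster grows.
PER_SERVER_QPS = 10_000
EFFICIENCY = {1: 1.0, 2: 0.9, 10: 0.7, 100: 0.5}  # assumed, matches table

def cluster_qps(servers: int) -> int:
    """Effective cluster capacity under the assumed efficiency curve."""
    return round(servers * PER_SERVER_QPS * EFFICIENCY[servers])

print(cluster_qps(2))    # 18,000 -- not 20,000
print(cluster_qps(100))  # 500,000 -- not 1,000,000
```

When estimating server counts, divide your required peak QPS by the *effective* per-server throughput at your target scale, not the single-server benchmark number.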
The Meta-Skill: Thinking in Orders of Magnitude
The real power of back-of-the-envelope calculations isn't producing exact numbers. It's developing intuition about scale.
When someone proposes a design, you should instantly know:
- "That's a ~1,000 QPS problem" → single server with caching
- "That's a ~100,000 QPS problem" → distributed system with load balancing
- "That's a ~10,000,000 QPS problem" → multi-region with CDN and aggressive caching
This intuition prevents the most expensive mistake in engineering: building the wrong thing at the wrong scale.
A system designed for 10x your actual needs is fine (growth headroom). A system designed for 1,000x your actual needs is a maintenance nightmare that costs 100x more than necessary. A system designed for 0.1x your actual needs falls over in production.
Get within 10x of the right answer, and you'll make the right architectural decisions. That's all back-of-the-envelope calculations need to do.
The numbers and frameworks in this post draw from Jeff Dean's famous latency numbers, Google's capacity planning methodologies, AWS's back-of-the-envelope estimation guides, and real-world system design interview patterns.