Most engineers use Cassandra, DynamoDB, or BigTable without understanding how they work internally. They treat them as magical key-value stores that scale infinitely.
Then something goes wrong — a hot partition, a read amplification issue, a compaction storm — and they have no mental model to diagnose it.
I've been that engineer. I've stared at Cassandra metrics wondering why read latency spiked 10x. The answer was always in the internals I'd never bothered to learn.
Let's fix that.
The Fundamental Trade-Off: LSM Trees vs B-Trees
Every database stores data on disk using a data structure. Traditional databases (PostgreSQL, MySQL) use B-Trees. Most NoSQL databases use Log-Structured Merge Trees (LSM Trees).
This single design choice explains almost every behavioral difference between SQL and NoSQL databases.
B-Trees: Optimize for Reads
```
B-Tree Write:
1. Find the correct leaf node
2. Update the value in place
3. If the node is full, split it

B-Tree Read:
1. Traverse from root to leaf (O(log n))
2. Read the value directly
```
B-Trees are read-optimized. Every value has exactly one location on disk. Reads are fast and predictable. But writes require random I/O — updating a value means seeking to its location on disk.
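As a rough illustration, here is a minimal Python sketch of the B-Tree access pattern, using a single sorted array to stand in for the leaf level (a real B-Tree fans out across many nodes, but the read-one-location, update-in-place behavior is the same):

```python
import bisect

# A single "node" of sorted keys stands in for the B-Tree's leaf level.
keys = [10, 20, 30, 40, 50]
values = ["a", "b", "c", "d", "e"]

def btree_read(key):
    i = bisect.bisect_left(keys, key)          # O(log n) binary search
    if i < len(keys) and keys[i] == key:
        return values[i]                       # exactly one location per key
    return None

def btree_write(key, value):
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        values[i] = value                      # update in place (random I/O on disk)
    else:
        keys.insert(i, key)                    # insert; a full node would split here
        values.insert(i, value)

btree_write(30, "C")
print(btree_read(30))  # "C" -- the old value is gone; one copy exists
```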
LSM Trees: Optimize for Writes
```
LSM Tree Write:
1. Write to in-memory buffer (MemTable)
2. When MemTable is full, flush to disk as sorted file (SSTable)
3. Background compaction merges SSTables

LSM Tree Read:
1. Check MemTable
2. Check each SSTable level (newest to oldest)
3. Return first match found
```
LSM Trees are write-optimized. Every write is sequential — you never update in place. You just append. This makes writes extremely fast.
The cost? Reads might need to check multiple SSTables. This is called read amplification, and it's the primary challenge of LSM-based databases.
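The whole write/read cycle above fits in a few lines. Here is a minimal, illustrative LSM sketch in Python (dicts stand in for on-disk files; a real engine adds a commit log, Bloom filters, and compaction):

```python
# Minimal LSM-tree sketch: writes go to an in-memory MemTable, flushes
# produce immutable sorted SSTables, reads check newest-first.
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.sstables = []          # list of dicts, oldest first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value              # sequential, in-memory write
        if len(self.memtable) >= self.limit:
            # Flush: freeze the MemTable as a sorted, immutable SSTable.
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:                # 1. check MemTable
            return self.memtable[key]
        for sstable in reversed(self.sstables): # 2. newest SSTable first
            if key in sstable:
                return sstable[key]             # 3. first match wins
        return None    # read amplification: we may have scanned every SSTable

db = TinyLSM()
for i in range(10):
    db.put(f"k{i}", i)
db.put("k0", 999)          # a newer write shadows the old one
print(db.get("k0"))        # 999
```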
How Cassandra Works Inside
Cassandra combines LSM Trees with a distributed architecture. Here's what happens when you write data:
The Write Path
```
Client Write Request
 │
 ├── 1. Coordinator node receives request
 │
 ├── 2. Determine replica nodes using consistent hashing
 │      (partition key → token → node mapping)
 │
 ├── 3. Send write to all replica nodes simultaneously
 │
 └── On each replica node:
      ├── Write to commit log (durability)
      ├── Write to MemTable (in-memory, sorted)
      ├── Return acknowledgment
      └── (Eventually) Flush MemTable to SSTable on disk
```
Key insight: Cassandra doesn't wait for all replicas to acknowledge. With a replication factor of 3 and consistency level QUORUM, it only needs 2 out of 3 nodes to acknowledge. This is why Cassandra writes are so fast — it trades strict consistency for availability.
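The quorum guarantee is just arithmetic: if a write is acknowledged by W replicas and a read consults R replicas out of N, the two sets must overlap on at least one up-to-date replica whenever W + R > N. A quick sketch:

```python
# Quorum arithmetic: a write set of size W and a read set of size R drawn
# from N replicas are guaranteed to intersect whenever W + R > N.
def overlaps(n, w, r):
    return w + r > n

N = 3
W = R = N // 2 + 1                # QUORUM = majority = 2 of 3
print(W, R, overlaps(N, W, R))    # 2 2 True

# ONE/ONE is faster but loses the read-your-writes guarantee:
print(overlaps(N, 1, 1))          # False
```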
The Read Path
```
Client Read Request
 │
 ├── 1. Coordinator determines replica nodes
 │
 ├── 2. Send read to enough replicas for the consistency level
 │
 └── On each replica node:
      ├── Check MemTable
      ├── Check SSTables (newest to oldest); for each SSTable:
      │    ├── Check Bloom filter (is this key possibly in this SSTable?)
      │    ├── Check partition key cache
      │    ├── Check partition index
      │    ├── Check compression offset map
      │    └── Read data from disk
      └── Return result
```
The Bloom filter is critical. Without it, every read would need to check every SSTable. A Bloom filter never produces false negatives: when it says "this key is definitely NOT in this SSTable," that is always correct. When it says "this key MIGHT be in this SSTable," there is a small chance of a false positive (typically around 1%, and tunable). This eliminates most unnecessary disk reads.
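A toy Bloom filter makes the asymmetry concrete. This sketch derives k bit positions per key from SHA-256 (real implementations use faster hashes and size the filter for a target false-positive rate):

```python
import hashlib

# Toy Bloom filter: k hash functions set k bits per key. Lookups never miss
# a present key (no false negatives), but may occasionally report an absent
# key as present (false positive).
class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0            # an int used as a bit array

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True: present keys always hit
# Absent keys are usually reported absent; rarely, a false positive.
```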
Compaction: The Hidden Performance Driver
As SSTables accumulate, read performance degrades (more files to check). Compaction merges SSTables to keep the count manageable.
Size-Tiered Compaction (default):
- Merges SSTables of similar sizes
- Good for write-heavy workloads
- Can temporarily double disk usage during compaction
Leveled Compaction:
- Organizes SSTables into levels of increasing size
- Guarantees each key exists in at most one SSTable per level
- Much better read performance
- Higher write amplification (data is rewritten more often)
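At its core, compaction is a merge that keeps only the newest version of each key. A simplified sketch, with dicts standing in for SSTables ordered oldest to newest:

```python
# Compaction sketch: merge several SSTables into one sorted file, keeping
# only the newest value for each key. `sstables` is ordered oldest -> newest.
def compact(sstables):
    merged = {}
    for sstable in sstables:           # later (newer) tables overwrite earlier
        merged.update(sstable)
    return dict(sorted(merged.items()))

old = {"a": 1, "b": 2, "c": 3}
mid = {"b": 20, "d": 4}
new = {"a": 100}
print(compact([old, mid, new]))
# {'a': 100, 'b': 20, 'c': 3, 'd': 4} -- one file, one version per key
```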
Choosing the wrong compaction strategy is the #1 performance mistake I see with Cassandra. ScyllaDB (a Cassandra-compatible database) even introduced incremental compaction to address the trade-offs of both approaches.
How DynamoDB Works Inside
DynamoDB takes a fundamentally different approach from Cassandra, despite both being "NoSQL databases."
Single-Digit Millisecond Guarantee
DynamoDB's core promise is predictable, single-digit-millisecond latency regardless of scale. It achieves this through:
Partition-based architecture: Every table is split into partitions based on the partition key. Each partition handles up to:
- 3,000 read capacity units (RCUs)
- 1,000 write capacity units (WCUs)
- 10 GB of data
Automatic splitting: When a partition exceeds limits, DynamoDB transparently splits it. You never manage partitions manually.
Request routing: A request router front-ends all DynamoDB traffic, directing each request to the correct partition. This router is itself a distributed, replicated system.
The Hot Partition Problem
DynamoDB's most common production issue is the hot partition. If your partition key distribution is skewed (e.g., all writes go to today's date), one partition handles disproportionate traffic while others sit idle.
```
Bad partition key: date (2026-02-04)
  → All today's writes hit ONE partition
  → Partition throttled at 1,000 WCU
  → Your 10,000 WCU table capacity is useless

Good partition key: user_id
  → Writes distributed across millions of users
  → All partitions share load evenly
```
DynamoDB added adaptive capacity to partially mitigate this — it can temporarily boost a hot partition's capacity by borrowing from underutilized partitions. But that is a band-aid, not a solution. Design your partition keys well.
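You can see the skew directly by hashing keys into partitions. This simulation uses MD5 as a stand-in for DynamoDB's internal hash function (which is not public), with 8 partitions for illustration:

```python
import hashlib
from collections import Counter

# Simulate how partition-key choice spreads load across partitions.
def partition_for(key, num_partitions=8):
    h = hashlib.md5(key.encode()).digest()   # stand-in for DynamoDB's hash
    return int.from_bytes(h[:4], "big") % num_partitions

# Bad key: every write today shares a single partition key.
bad = Counter(partition_for("2026-02-04") for _ in range(10_000))

# Good key: high-cardinality user_id spreads writes out.
good = Counter(partition_for(f"user_{i}") for i in range(10_000))

print(len(bad), len(good))   # 1 8 -- one hot partition vs. all eight in use
```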
DynamoDB Streams: CDC Built In
One of DynamoDB's most powerful features is DynamoDB Streams — a change data capture (CDC) system built directly into the database. Every write to a DynamoDB table can trigger a stream event, enabling:
- Real-time replication to other systems
- Event-driven architectures (Lambda triggers)
- Cross-region replication (Global Tables)
This is the same pattern Netflix uses for cache invalidation, but built into the database itself.
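A stream consumer is typically a Lambda handler iterating over stream records. This sketch follows the documented stream event shape (`Records`, `eventName`, `dynamodb.Keys`); `invalidate_cache` is a hypothetical downstream hook:

```python
# Lambda-style handler for a DynamoDB Streams event.
# `invalidate_cache` is a made-up stand-in for a real cache client.
def invalidate_cache(key):
    print(f"invalidate {key}")

def handler(event, context=None):
    touched = []
    for record in event["Records"]:
        keys = record["dynamodb"]["Keys"]
        # DynamoDB JSON wraps each value in a type tag, e.g. {"S": "user:42"}
        pk = list(keys.values())[0]["S"]
        if record["eventName"] in ("MODIFY", "REMOVE"):
            invalidate_cache(pk)               # react to changes/deletes
        touched.append((record["eventName"], pk))
    return touched

sample_event = {"Records": [
    {"eventName": "MODIFY",
     "dynamodb": {"Keys": {"pk": {"S": "user:42"}}}},
]}
print(handler(sample_event))   # [('MODIFY', 'user:42')]
```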
How BigTable Works Inside
Google's BigTable is the grandfather of modern NoSQL databases. Cassandra and HBase both drew heavily from its design. Understanding BigTable helps you understand the entire family.
The Architecture
```
┌─────────────────────┐
│     Client API      │
└──────────┬──────────┘
           │
┌──────────┴──────────┐
│    Master Server    │ ← Manages tablet assignments
│  (Chubby lock svc)  │   and load balancing
└──────────┬──────────┘
           │
    ┌──────┴──────┐
    │   Tablet    │ ← Each serves a range of rows
    │   Servers   │   (sorted by row key)
    └──────┬──────┘
           │
┌──────────┴──────────┐
│ Google File System  │ ← Distributed storage layer
│   (GFS/Colossus)    │
└─────────────────────┘
```
Tablets are the unit of distribution. Each tablet contains a contiguous range of rows, sorted by row key. As a tablet grows, it splits. The master server assigns tablets to tablet servers for load balancing.
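Routing a request to a tablet is a binary search over the tablets' sorted start keys. A sketch with made-up split points:

```python
import bisect

# Tablets cover contiguous, sorted row-key ranges; these split points and
# server names are illustrative.
tablet_starts = ["", "m", "t"]    # tablet 0: [""..m), 1: [m..t), 2: [t..)
tablet_servers = ["server-a", "server-b", "server-c"]

def route(row_key):
    # Find the last tablet whose start key is <= row_key.
    i = bisect.bisect_right(tablet_starts, row_key) - 1
    return tablet_servers[i]

print(route("apple"), route("mango"), route("zebra"))
# server-a server-b server-c
```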
The Column Family Model
BigTable organizes data in a unique way:
```
Row Key → Column Family:Column Qualifier → Timestamp → Value

Example:
"user:12345" → "profile:name"   → 2026-02-04T10:00 → "Alice"
"user:12345" → "profile:email"  → 2026-02-04T10:00 → "alice@example.com"
"user:12345" → "activity:login" → 2026-02-04T09:00 → "success"
"user:12345" → "activity:login" → 2026-02-03T15:00 → "failed"
```
Multiple versions of a cell can exist (identified by timestamp). This is incredibly powerful for time-series data — you get versioning built into the storage model.
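The cell model is easy to mimic: map (row key, "family:qualifier") to a list of timestamped versions, newest first. A minimal sketch:

```python
from collections import defaultdict

# Sketch of BigTable's cell model: each (row, column) holds a list of
# (timestamp, value) versions, kept newest-first.
table = defaultdict(list)

def put(row, column, ts, value):
    cells = table[(row, column)]
    cells.append((ts, value))
    cells.sort(reverse=True)     # ISO timestamps sort lexicographically

def get(row, column, max_versions=1):
    return table[(row, column)][:max_versions]

put("user:12345", "activity:login", "2026-02-03T15:00", "failed")
put("user:12345", "activity:login", "2026-02-04T09:00", "success")

print(get("user:12345", "activity:login"))                 # newest only
print(get("user:12345", "activity:login", max_versions=2)) # full history
```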
HyperLogLog: Counting Things at Scale
One algorithm that appears across all these databases is HyperLogLog — a probabilistic data structure for counting distinct elements.
Imagine you need to count unique visitors to a website. The naive approach (storing every visitor ID in a set) requires memory proportional to the number of unique visitors. At billions of visitors, that's terabytes.
HyperLogLog counts billions of unique elements using approximately 12 KB of memory with ~2% error rate.
```
Standard approach: Count 1 billion unique items → ~8 GB memory
HyperLogLog:       Count 1 billion unique items → ~12 KB memory
                   (with ~2% accuracy trade-off)
```
Redis, BigTable, and most analytics databases implement HyperLogLog natively. Every time you see "approximate distinct count" in a dashboard, HyperLogLog is probably behind it.
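For the curious, a complete (if unoptimized) HyperLogLog fits in about 30 lines. This sketch uses p = 14, i.e. 16,384 registers of conceptually 6 bits each (roughly the 12 KB figure); SHA-256 stands in for the faster hashes used in production:

```python
import hashlib
import math

# Toy HyperLogLog. Each item's hash picks one of m registers (low p bits)
# and the register records the longest run of leading zeros seen (rank).
class HyperLogLog:
    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p                       # 16,384 registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        h = int.from_bytes(
            hashlib.sha256(str(item).encode()).digest()[:8], "big")
        idx = h & (self.m - 1)                # low p bits pick a register
        w = h >> self.p                       # remaining 64 - p bits
        rho = (64 - self.p) - w.bit_length() + 1   # leading-zero rank
        self.registers[idx] = max(self.registers[idx], rho)

    def count(self):
        est = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:     # small-range correction
            est = self.m * math.log(self.m / zeros)
        return int(est)

hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"visitor-{i}")
print(hll.count())    # ≈ 100,000; error is typically under 1% at this size
```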
Choosing Between Them
After deep-diving into all three, here's my decision framework:
| Factor | Cassandra | DynamoDB | BigTable |
|---|---|---|---|
| Best for | Write-heavy, high throughput | Predictable latency, serverless | Large-scale analytics, time-series |
| Manage yourself? | Yes (or use Astra) | No (fully managed) | No (fully managed) |
| Data model | Wide column | Key-value / document | Wide column |
| Consistency | Tunable | Strong or eventual | Strong |
| Scale limit | Practically unlimited | Practically unlimited | Practically unlimited |
| Cost model | Infrastructure | Per-request | Per-node-hour |
Choose Cassandra when you need multi-cloud, open-source, and can invest in operational expertise.
Choose DynamoDB when you want zero operational overhead and your workload fits the partition key model.
Choose BigTable when you're in GCP and your workload involves time-series data, analytics, or large sequential scans.
The Real Lesson
The databases are different, but the underlying principles are the same: LSM Trees for write optimization, consistent hashing for data distribution, replication for fault tolerance, and compaction for maintaining read performance.
Learn the principles, and every new database becomes a variation on a theme you already understand.
This deep dive synthesizes information from the original Google BigTable paper, Amazon's DynamoDB papers, Apache Cassandra's architecture documentation, and ScyllaDB's compaction strategy research.