Distributed Databases Demystified: How Cassandra, DynamoDB, and BigTable Actually Work Under the Hood

February 4, 2026

Most engineers use Cassandra, DynamoDB, or BigTable without understanding how they work internally. They treat them as magical key-value stores that scale infinitely.

Then something goes wrong — a hot partition, a read amplification issue, a compaction storm — and they have no mental model to diagnose it.

I've been that engineer. I've stared at Cassandra metrics wondering why read latency spiked 10x. The answer was always in the internals I'd never bothered to learn.

Let's fix that.

The Fundamental Trade-Off: LSM Trees vs B-Trees

Every database stores data on disk using a data structure. Traditional databases (PostgreSQL, MySQL) use B-Trees. Most NoSQL databases use Log-Structured Merge Trees (LSM Trees).

This single design choice explains almost every behavioral difference between SQL and NoSQL databases.

B-Trees: Optimize for Reads

B-Tree Write:
1. Find the correct leaf node
2. Update the value in place
3. If the node is full, split it

B-Tree Read:
1. Traverse from root to leaf (O(log n))
2. Read the value directly

B-Trees are read-optimized. Every value has exactly one location on disk. Reads are fast and predictable. But writes require random I/O — updating a value means seeking to its location on disk.
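To make the read path concrete, here's a minimal Python sketch. A real B-Tree is a multi-level structure of nodes; this stand-in uses a single sorted array to show the two properties that matter here: binary-search lookup in O(log n), and updates that happen in place at a key's one fixed location.

```python
import bisect

# Simplified stand-in for a B-Tree leaf: keys kept sorted, so a binary
# search finds the slot, and existing values are updated in place.
keys = ["apple", "cherry", "mango", "peach"]
values = [1, 2, 3, 4]

def btree_read(key):
    i = bisect.bisect_left(keys, key)       # O(log n) search
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None

def btree_write(key, value):
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        values[i] = value                   # update in place (random I/O on disk)
    else:
        keys.insert(i, key)                 # insert; a full node would split here
        values.insert(i, value)

btree_write("cherry", 20)
print(btree_read("cherry"))  # 20
```

The key takeaway: every key has exactly one home, which is what makes reads predictable and writes require a seek.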

LSM Trees: Optimize for Writes

LSM Tree Write:
1. Write to in-memory buffer (MemTable)
2. When MemTable is full, flush to disk as sorted file (SSTable)
3. Background compaction merges SSTables
4. 
LSM Tree Read:
1. Check MemTable
2. Check each SSTable level (newest to oldest)
3. Return first match found

LSM Trees are write-optimized. Every write is sequential — you never update in place. You just append. This makes writes extremely fast.

The cost? Reads might need to check multiple SSTables. This is called read amplification, and it's the primary challenge of LSM-based databases.
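The write and read paths above fit in a few lines of Python. This is a toy (a tiny MemTable limit, dicts instead of on-disk files, no compaction), but it demonstrates both the append-only write path and the read amplification: a read may have to walk several SSTables, newest to oldest, before finding its key.

```python
# Toy LSM tree: writes go to an in-memory MemTable; when it fills, it is
# frozen into an immutable sorted SSTable. Reads check the MemTable
# first, then every SSTable newest-to-oldest (read amplification).
MEMTABLE_LIMIT = 2

memtable = {}
sstables = []  # flushed snapshots, newest last

def write(key, value):
    memtable[key] = value                  # never updates disk in place
    if len(memtable) >= MEMTABLE_LIMIT:
        sstables.append(dict(sorted(memtable.items())))  # flush as sorted file
        memtable.clear()

def read(key):
    if key in memtable:
        return memtable[key]
    for sstable in reversed(sstables):     # newest to oldest
        if key in sstable:
            return sstable[key]            # first match wins
    return None

write("a", 1); write("b", 2)    # flush #1
write("a", 99); write("c", 3)   # flush #2 shadows the older "a"
print(read("a"))  # 99 -- found in the newest SSTable, older copy ignored
```

Note how the stale `"a"` is never deleted; it just becomes unreachable until compaction (which this sketch omits) merges it away.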

How Cassandra Works Inside

Cassandra combines LSM Trees with a distributed architecture. Here's what happens when you write data:

The Write Path

Client Write Request
    ├── 1. Coordinator node receives request
    ├── 2. Determine replica nodes using consistent hashing
    │       (partition key → token → node mapping)
    ├── 3. Send write to all replica nodes simultaneously
    └── On each replica node:
        ├── Write to commit log (durability)
        ├── Write to MemTable (in-memory, sorted)
        ├── Return acknowledgment
        └── (Eventually) Flush MemTable to SSTable on disk

Key insight: Cassandra doesn't wait for all replicas to acknowledge. With a replication factor of 3 and consistency level QUORUM, it only needs 2 out of 3 nodes to acknowledge. This is why Cassandra writes are so fast — it trades strict consistency for availability.
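The "partition key → token → node" step is worth seeing in code. Here's a hedged sketch of consistent hashing: each node owns a token on a ring, a key hashes to a token, and its replicas are the next RF nodes clockwise. The node names are invented, and MD5 stands in for Cassandra's actual Murmur3 partitioner.

```python
import bisect
import hashlib

def _token(s):
    # Stand-in hash; Cassandra uses the Murmur3 partitioner in practice.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

# Each node owns one token on the ring (real clusters use many vnodes).
ring = sorted((_token(n), n) for n in ["node-a", "node-b", "node-c", "node-d"])
tokens = [t for t, _ in ring]

def replicas_for(partition_key, rf=3):
    """Walk clockwise from the key's token to pick RF replica nodes."""
    start = bisect.bisect(tokens, _token(partition_key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(rf)]

print(replicas_for("user:12345"))  # three distinct nodes from the ring
```

The quorum arithmetic follows directly: with `rf=3`, QUORUM is `rf // 2 + 1 = 2` acknowledgments, so one slow or dead replica never blocks the write.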

The Read Path

Client Read Request
    ├── 1. Coordinator determines replica nodes
    ├── 2. Send read to enough replicas for consistency level
    └── On each replica node:
        ├── Check Bloom filter (is this key possibly in this SSTable?)
        ├── Check partition key cache
        ├── Check MemTable
        ├── Check SSTables (newest to oldest)
        │     ├── Check partition index
        │     ├── Check compression offset map
        │     └── Read data from disk
        └── Return result

The Bloom filter is critical. Without it, every read would need to check every SSTable. A Bloom filter answers "this key is definitely NOT in this SSTable" with certainty (no false negatives), or "this key MIGHT be in this SSTable" with a small false-positive rate (typically around 1%). This eliminates most unnecessary disk reads.
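A Bloom filter is simple enough to sketch in full. This is a minimal version (fixed bit-array size, SHA-256 with salts standing in for k independent hash functions), not how Cassandra sizes its per-SSTable filters, but the one-sided guarantee is the same: it can false-positive, but it can never false-negative.

```python
import hashlib

# Minimal Bloom filter: k salted hashes set k bits per key.
M, K = 1024, 3  # bit-array size and hash count, sized for this example

bits = [False] * M

def _positions(key):
    # Derive K positions by salting one hash; real filters use k hashes.
    return [
        int(hashlib.sha256(f"{i}:{key}".encode()).hexdigest(), 16) % M
        for i in range(K)
    ]

def add(key):
    for p in _positions(key):
        bits[p] = True

def might_contain(key):
    # All K bits set → "might be present"; any bit clear → definitely absent.
    return all(bits[p] for p in _positions(key))

add("user:12345")
print(might_contain("user:12345"))  # True -- added keys are never missed
print(might_contain("user:99999"))  # almost certainly False with this load
```

In an LSM read path, `might_contain` returning False lets the database skip an SSTable without touching disk at all.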

Compaction: The Hidden Performance Driver

As SSTables accumulate, read performance degrades (more files to check). Compaction merges SSTables to keep the count manageable.

Size-Tiered Compaction (default): merges SSTables of similar size once enough of them accumulate. It is write-friendly and cheap to run, but a single key can live in many overlapping SSTables, so reads may touch several files and disk usage can temporarily spike during a merge.

Leveled Compaction: organizes SSTables into levels of non-overlapping key ranges, each level roughly 10x larger than the last. Reads touch at most one SSTable per level, at the cost of much higher write amplification.

Choosing the wrong compaction strategy is the #1 performance mistake I see with Cassandra. ScyllaDB (a Cassandra-compatible database) even introduced incremental compaction to address the trade-offs of both approaches.
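To make size-tiered compaction less abstract, here's a sketch of its trigger logic: bucket SSTables by similar size and compact any bucket that reaches the minimum threshold. The threshold of 4 and the 0.5x–1.5x bucket bounds mirror Cassandra's STCS defaults; the sizes themselves are invented.

```python
# Sketch of size-tiered compaction's bucketing: group SSTables whose
# sizes are within [0.5x, 1.5x] of a bucket's running average, then
# compact any bucket holding >= MIN_THRESHOLD files. Sizes are in MB.
MIN_THRESHOLD = 4

def buckets_by_size(sstable_sizes, low=0.5, high=1.5):
    buckets = []
    for size in sorted(sstable_sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if low * avg <= size <= high * avg:
                bucket.append(size)   # similar size: same tier
                break
        else:
            buckets.append([size])    # no tier fits: start a new one
    return buckets

sizes = [10, 11, 12, 13, 100, 110, 950]
for bucket in buckets_by_size(sizes):
    action = "compact" if len(bucket) >= MIN_THRESHOLD else "leave"
    print(action, bucket)  # the four ~10 MB files get merged together
```

This also shows why STCS degrades reads: until the four small files are merged, a key might live in any of them.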

How DynamoDB Works Inside

DynamoDB takes a fundamentally different approach from Cassandra, despite both being "NoSQL databases."

Single-Digit Millisecond Guarantee

DynamoDB's core promise is predictable, single-digit millisecond latency regardless of scale. They achieve this through:

Partition-based architecture: Every table is split into partitions based on the partition key. Each partition handles up to 3,000 read capacity units, 1,000 write capacity units, and about 10 GB of data.

Automatic splitting: When a partition exceeds limits, DynamoDB transparently splits it. You never manage partitions manually.

Request routing: A request router front-ends all DynamoDB traffic, directing each request to the correct partition. This router is itself a distributed, replicated system.

The Hot Partition Problem

DynamoDB's most common production issue is the hot partition. If your partition key distribution is skewed (e.g., all writes go to today's date), one partition handles disproportionate traffic while others sit idle.

Bad partition key: date (2026-02-04)
  → All today's writes hit ONE partition
  → Partition throttled at 1,000 WCU
  → Your 10,000 WCU table capacity is useless

Good partition key: user_id
  → Writes distributed across millions of users
  → All partitions share load evenly

DynamoDB added adaptive capacity to partially mitigate this — it can temporarily boost a hot partition's capacity by borrowing from underutilized partitions. But this is a bandaid, not a solution. Design your partition keys well.
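When you are stuck with a naturally skewed key (like a date), one common mitigation is write sharding: append a random suffix so writes spread across N partitions, at the cost of readers fanning out across all N suffixed keys. A sketch, with a made-up shard count:

```python
import random

# Write sharding for a skewed partition key: "2026-02-04" becomes one
# of "2026-02-04#0" .. "2026-02-04#9", spreading writes over 10
# partitions. SHARDS is an invented tuning knob.
SHARDS = 10

def sharded_key(date):
    """Key to write under: the hot key plus a random shard suffix."""
    return f"{date}#{random.randrange(SHARDS)}"

def all_shard_keys(date):
    """Keys a reader must fan out across to see all of the day's data."""
    return [f"{date}#{i}" for i in range(SHARDS)]

print(sharded_key("2026-02-04"))        # e.g. "2026-02-04#7"
print(all_shard_keys("2026-02-04")[:3])
```

The trade-off is explicit: writes now scale to roughly `SHARDS * 1,000` WCU for that date, but every read of "today" becomes N queries merged client-side.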

DynamoDB Streams: CDC Built In

One of DynamoDB's most powerful features is DynamoDB Streams — a change data capture (CDC) system built directly into the database. Every write to a DynamoDB table can trigger a stream event, enabling Lambda functions fired on data changes, cross-region replication, cache invalidation, and keeping derived data like search indexes or aggregates in sync.

This is the same pattern Netflix uses for cache invalidation, but built into the database itself.

How BigTable Works Inside

Google's BigTable is the grandfather of modern NoSQL databases. Cassandra and HBase both drew heavily from its design. Understanding BigTable helps you understand the entire family.

The Architecture

┌─────────────────────┐
│     Client API      │
└──────────┬──────────┘
           │
┌──────────┴──────────┐
│    Master Server    │  ← Manages tablet assignments
│ (uses Chubby locks) │    and load balancing
└──────────┬──────────┘
           │
    ┌──────┴──────┐
    │   Tablet    │  ← Each serves a range of rows
    │   Servers   │    (sorted by row key)
    └──────┬──────┘
           │
┌──────────┴──────────┐
│  Google File System │  ← Distributed storage layer
│    (GFS/Colossus)   │
└─────────────────────┘

Tablets are the unit of distribution. Each tablet contains a contiguous range of rows, sorted by row key. As a tablet grows, it splits. The master server assigns tablets to tablet servers for load balancing.
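Because each tablet owns a contiguous, sorted range of row keys, routing a request is just a binary search over tablet start keys. A sketch, with invented tablet boundaries and server names:

```python
import bisect

# Toy tablet routing: each tablet serves rows from its start key up to
# the next tablet's start key. Boundaries and server names are made up.
tablet_starts = ["", "f", "m", "t"]           # sorted range start keys
tablet_servers = ["ts-1", "ts-2", "ts-3", "ts-4"]

def tablet_for(row_key):
    """Binary-search the start keys to find the tablet serving row_key."""
    i = bisect.bisect_right(tablet_starts, row_key) - 1
    return tablet_servers[i]

print(tablet_for("user:12345"))  # "ts-4", since "u" falls in the ["t", end) range
```

When a tablet grows too large and splits, the system just inserts a new boundary into this sorted list; clients rediscover the mapping and routing stays O(log n).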

The Column Family Model

BigTable organizes data in a unique way:

Row Key → Column Family:Column Qualifier → Timestamp → Value

Example:
"user:12345" → "profile:name"    → 2026-02-04T10:00 → "Alice"
"user:12345" → "profile:email"   → 2026-02-04T10:00 → "alice@example.com"
"user:12345" → "activity:login"  → 2026-02-04T09:00 → "success"
"user:12345" → "activity:login"  → 2026-02-03T15:00 → "failed"

Multiple versions of a cell can exist (identified by timestamp). This is incredibly powerful for time-series data — you get versioning built into the storage model.
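The cell model above maps naturally onto a dictionary keyed by (row, column) holding a list of timestamped versions. This toy sketch shows the part that matters: a read asks for the N newest versions of a cell, and older versions remain queryable.

```python
# Toy version of BigTable's cell model: each (row, "family:qualifier")
# cell holds a list of (timestamp, value) versions, newest first.
cells = {}

def put(row, column, timestamp, value):
    versions = cells.setdefault((row, column), [])
    versions.append((timestamp, value))
    versions.sort(reverse=True)  # ISO timestamps sort newest-first this way

def get(row, column, versions=1):
    """Return up to `versions` newest versions of the cell."""
    return cells.get((row, column), [])[:versions]

put("user:12345", "activity:login", "2026-02-03T15:00", "failed")
put("user:12345", "activity:login", "2026-02-04T09:00", "success")

print(get("user:12345", "activity:login"))
# [('2026-02-04T09:00', 'success')]
print(get("user:12345", "activity:login", versions=2))
# both versions, newest first
```

A default read returns only the latest version, which is why time-series history comes "for free": it's the same storage model with a larger `versions` argument.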

HyperLogLog: Counting Things at Scale

One algorithm that appears across all these databases is HyperLogLog — a probabilistic data structure for counting distinct elements.

Imagine you need to count unique visitors to a website. The naive approach (storing every visitor ID in a set) requires memory proportional to the number of unique visitors. At billions of visitors, that's terabytes.

HyperLogLog counts billions of unique elements using approximately 12 KB of memory with ~2% error rate.

Standard approach: Count 1 billion unique items → ~8 GB memory
HyperLogLog:       Count 1 billion unique items → ~12 KB memory
                   (with ~2% accuracy trade-off)

Redis, BigTable, and most analytics databases implement HyperLogLog natively. Every time you see "approximate distinct count" in a dashboard, HyperLogLog is probably behind it.
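The core algorithm is compact enough to sketch. The idea: hash each item, use the first p bits to pick one of 2^p registers, and record the longest run of leading zeros seen in the remaining bits; rare long runs imply many distinct items, and a harmonic mean over the registers turns that into an estimate. This is a simplified version — production implementations like Redis's add sparse encodings and better bias correction.

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, p=10):
        self.p = p                    # 2^p registers; p=10 → ~3.25% std error
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = h >> (64 - self.p)                       # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1   # leading-zero run + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias correction for m >= 128
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        if est <= 2.5 * self.m:                # small-range correction
            zeros = self.registers.count(0)
            if zeros:
                est = self.m * math.log(self.m / zeros)
        return int(est)

hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"visitor-{i}")
print(hll.count())  # close to 100000, using only 1024 small registers
```

Note that memory depends only on the register count, never on how many items you add — which is the entire point.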

Choosing Between Them

After deep-diving into all three, here's my decision framework:

| Factor | Cassandra | DynamoDB | BigTable |
| --- | --- | --- | --- |
| Best for | Write-heavy, high throughput | Predictable latency, serverless | Large-scale analytics, time-series |
| Manage yourself? | Yes (or use Astra) | No (fully managed) | No (fully managed) |
| Data model | Wide column | Key-value / document | Wide column |
| Consistency | Tunable | Strong or eventual | Strong |
| Scale limit | Practically unlimited | Practically unlimited | Practically unlimited |
| Cost model | Infrastructure | Per-request | Per-node-hour |

Choose Cassandra when you need multi-cloud, open-source, and can invest in operational expertise.

Choose DynamoDB when you want zero operational overhead and your workload fits the partition key model.

Choose BigTable when you're in GCP and your workload involves time-series data, analytics, or large sequential scans.

The Real Lesson

The databases are different, but the underlying principles are the same: LSM Trees for write optimization, consistent hashing for data distribution, replication for fault tolerance, and compaction for maintaining read performance.

Learn the principles, and every new database becomes a variation on a theme you already understand.


This deep dive synthesizes information from the original Google BigTable paper, Amazon's DynamoDB papers, Apache Cassandra's architecture documentation, and ScyllaDB's compaction strategy research.
