
Chapter 7. Sharding

Clearly, we must break away from the sequential and not limit the computers. We must state definitions and provide for priorities and descriptions of data. We must state relationships, not procedures.

Grace Murray Hopper, Management and the Computer of the Future (1962)

Table of Contents

  1. Sharding and Partitioning
  2. Pros and Cons of Sharding
  3. Sharding for Multitenancy
  4. Sharding of Key-Value Data
  5. Sharding by Key Range
  6. Sharding by Hash of Key
  7. Partitioning and Range Queries in Data Warehouses
  8. Skewed Workloads and Relieving Hot Spots
  9. Operations: Automatic or Manual Rebalancing
  10. Request Routing
  11. Sharding and Secondary Indexes
  12. Summary

In plain English: Think of sharding like organizing a massive library. When one building can't hold all the books anymore, you create multiple branches across the city. Each branch holds different books (different shards), and you need a system to know which branch has which books.

In technical terms: Sharding splits a large dataset into smaller partitions, with each shard stored on a different node. Each piece of data belongs to exactly one shard, though shards are typically replicated for fault tolerance.

Why it matters: Sharding enables horizontal scaling; you can grow your database by adding more machines instead of needing a single massive server. This makes it possible to handle petabyte-scale datasets and millions of queries per second.


[Figure: Combining replication and sharding. Shard 1 and Shard 2 each have one leader and two followers (A and B). Each shard has one leader and multiple followers, and each node acts as leader for some shards and follower for others.]

A distributed database typically distributes data across nodes in two ways:

  • Replication: having a copy of the same data on multiple nodes (covered in Chapter 6)
  • Sharding: splitting data into smaller partitions stored on different nodes (this chapter)

Key characteristics of sharding:

  • Each piece of data (record, row, or document) belongs to exactly one shard
  • Each shard is essentially a small database of its own
  • Sharding is usually combined with replication for fault tolerance
  • A node may store multiple shards and be leader for some, follower for others

💡 Insight

The choice of sharding scheme is mostly independent of the choice of replication scheme. Everything from Chapter 6 about replication applies equally to replication of shards.


1. Sharding and Partitioning

The same concept, many names:

Database                     Term Used
Kafka                        partition
CockroachDB                  range
HBase, TiDB                  region
Bigtable, YugabyteDB         tablet
Cassandra, ScyllaDB, Riak    vnode
Couchbase                    vBucket

PostgreSQL distinction:

  • Partitioning: splitting a table into files on the same machine (fast deletes)
  • Sharding: splitting data across multiple machines

In most other systems, partitioning = sharding.

💡 Insight

The terminology is confusing, but the core idea is simple: break a big dataset into smaller pieces. Whether you call them shards, partitions, tablets, or vnodes, the fundamental challenge is the same: how do you split the data evenly while maintaining fast lookups?

Fun fact: The term "shard" may have originated from:

  • The game Ultima Online, where a magic crystal shattered into pieces (shards), each reflecting a game world
  • Or an acronym: System for Highly Available Replicated Data (a 1980s database)

⚠️ Don't confuse: Partitioning β‰  network partitions (netsplits). We cover network faults in Chapter 9.


2. Pros and Cons of Sharding

In plain English: Sharding is like choosing between one massive warehouse or many smaller warehouses distributed across the country. The distributed approach handles more volume, but adds complexity in tracking where everything is stored.

In technical terms: The primary reason for sharding is scalability when:

  • Data volume exceeds single-node capacity
  • Write throughput overwhelms one server

(If read throughput is the problem, use read scaling from Chapter 6 instead.)

Why it matters: Sharding enables horizontal scaling (scale-out):

  • Grow capacity by adding more machines, not bigger machines
  • Each shard handles a share of the load
  • Process data and queries in parallel

✅ When sharding helps:

  • Data volume exceeds single machine capacity
  • Write throughput overwhelms one server
  • Need horizontal scaling for cost efficiency
  • Data naturally partitions (e.g., by tenant)

⚠️ When to avoid sharding:

  • Single machine can handle the load
  • Queries frequently join across shards
  • Distributed transactions required
  • Team lacks operational experience

Why you may not need sharding:

  • Replication works at any scale for fault tolerance
  • Single machines are powerful today; don't shard prematurely
  • Sharding adds significant complexity

The complexity cost:

  • You must choose a partition key upfront
  • All records with the same partition key → same shard
  • Knowing the shard = fast lookup; not knowing = search all shards
  • The sharding scheme is difficult to change later

💡 Insight

The "know your shard" principle is crucial. When you query by partition key (e.g., user_id = 123), the system routes directly to one shard. But when you query without it (e.g., "find all users named Alice"), you must scatter the query to all shards and gather results, potentially 100x slower.

Sharding works best for:

  • Key-value data: shard by key, queries go to one shard
  • Not ideal for relational data: secondary indexes and cross-shard joins are hard

Cross-shard writes are tricky:

  • Writes may need to update records in multiple shards
  • Single-node transactions are common; cross-shard transactions are slow
  • Distributed transactions (Chapter 8) may become a bottleneck

Special case: Single-machine sharding:

  • Some systems shard even on one machine (one process per CPU core)
  • Takes advantage of CPU parallelism and NUMA architecture
  • Examples: Redis, VoltDB, FoundationDB

3. Sharding for Multitenancy

In plain English: Think of a multitenant system like an apartment building. Each tenant (customer) has their own apartment (shard), with privacy from neighbors and independent utilities. If one tenant's plumbing breaks, it doesn't flood everyone else's apartment.

In technical terms: SaaS products are often multitenant:

  • Each tenant = one customer (e.g., a business)
  • Multiple users may log in under one tenant
  • Each tenant's dataset is separate from others

Why it matters: Sharding for multitenancy provides strong isolation: bugs, performance problems, or security issues in one tenant are less likely to affect others.

Implementation options:

  • One shard per tenant
  • Group small tenants into larger shards
  • Shards can be separate databases or portions of a logical database

Advantages of sharding for multitenancy:

3.1. Resource isolation

  • Expensive operations by one tenant don't affect others on different shards
  • Each shard has its own compute resources

3.2. Permission isolation

  • Bugs in access control are less likely to leak data between tenants
  • Physical separation = stronger security boundaries

3.3. Cell-based architecture

In plain English: A cell-based architecture is like having separate mini data centers for different customer groups. If one mini data center has a fire, only customers in that cell are affected.

In technical terms:

  • Shard not just data, but also application services
  • Each cell contains services + storage for a set of tenants
  • Cells run independently from each other

Why it matters: Limits blast radius: a bad deployment or bug affects only one cell, not the entire system.

3.4. Per-tenant backup and restore

  • Backup each tenant's shard separately
  • Restore one tenant without affecting others
  • Useful for accidental data deletion recovery

3.5. Regulatory compliance (GDPR)

  • Each person's data in a separate shard
  • Data export = simple operation on their shard
  • Data deletion = simple operation on their shard

3.6. Data residence

  • Some tenants require data in specific jurisdictions
  • Region-aware databases can assign shards to specific regions

3.7. Gradual schema rollout

  • Roll out migrations one tenant at a time
  • Detect problems before they affect everyone
  • Reduces risk of system-wide failures

Main challenges with multitenancy sharding:

  • Large tenants: if one tenant exceeds a single node, you need within-tenant sharding
  • Many small tenants: per-tenant shards create overhead; grouping complicates rebalancing
  • Cross-tenant features: queries joining multiple tenants require cross-shard operations

4. Sharding of Key-Value Data

The fundamental question: You have a large amount of data; how do you decide which records go on which nodes?

In plain English: Imagine you have 10 million user records to spread across 10 servers. Ideally, each server gets exactly 1 million users, and when you look up a user, you can quickly figure out which server has their data.

In technical terms: Our goal is to spread data and query load evenly:

  • 10 nodes → handle 10x the data and throughput of 1 node
  • Adding/removing nodes → rebalance to maintain even distribution

Why it matters: Uneven distribution (skew) wastes resources. If 90% of your data lands on one shard, that shard becomes the bottleneck.

Key terminology:

  • Skewed: unfair sharding where some shards have more load than others
  • Hot shard / hot spot: a shard with disproportionately high load
  • Hot key: a single key with exceptionally high traffic (e.g., a celebrity account)

We need a sharding algorithm that:

  • Takes a partition key → tells us the shard
  • Partition key = primary key (or first part) in key-value stores
  • Partition key = some column (not necessarily PK) in relational DBs
  • Must support rebalancing to relieve hot spots

5. Sharding by Key Range

In plain English: Key range sharding is like organizing a library by book title. Books starting with A-D go on shelf 1, E-K on shelf 2, and so on. If you want "Harry Potter," you know exactly which shelf to check.

In technical terms:

  • Assign a contiguous range of keys to each shard (min to max)
  • Like encyclopedia volumes: A-B in Vol 1, C-D in Vol 2, etc.
  • Keys are sorted within each shard

Why it matters: Range queries are extremely efficient:

  • "Get all events from March 2025" β†’ touches one shard
  • No need to scan all shards for time-based queries

[Figure: Encyclopedia volumes as key-range sharding. Vol 1 holds A-B, Vol 2 holds C-D, Vol 3 holds E-G, ..., Vol 12 holds T-Z. Ranges are sized to balance data volume, not to be alphabetically even.]

Key range boundaries:

  • Ranges are not evenly spaced; they adapt to data distribution
  • Vol 1 might cover A-B, while Vol 12 covers T-Z (more letters, less data)
  • Boundaries can be set manually (Vitess) or automatically (Bigtable, HBase, MongoDB, CockroachDB)

Within each shard:

  • Keys stored in sorted order (B-tree or SSTables)
  • Enables efficient range scans
  • Concatenated keys = fetch related records in one query (e.g., sensor readings by month)

💡 Insight

Key range sharding has a critical vulnerability: sequential writes create hot spots. If you write logs with timestamp keys, all writes go to the "current time" shard while other shards sit idle. The solution: prefix timestamps with another dimension like sensor_id, spreading writes across shards at the cost of harder range queries.

The timestamp hot spot problem:

  • Timestamp keys → all writes go to "current month" shard
  • Other shards sit idle while one is overloaded

Solution: Prefix with another dimension:

  • Key = sensor_id + timestamp instead of just timestamp
  • Writes spread across shards (different sensors → different shards; see the sketch below)
  • Tradeoff: Multi-sensor range queries now require one query per sensor
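
A small sketch of this compound-key idea, assuming a sensor_id#timestamp string format (the exact key encoding is an assumption, not from the text):

```python
from datetime import datetime, timezone

def reading_key(sensor_id: str, ts: datetime) -> str:
    # sensor_id acts as the partition key; the timestamp keeps one sensor's
    # readings sorted together within its shard.
    return f"{sensor_id}#{ts.strftime('%Y-%m-%dT%H:%M:%SZ')}"

# Writes from different sensors land on different shards:
k = reading_key("sensor-42", datetime(2025, 3, 1, 12, 0, tzinfo=timezone.utc))

# A month of one sensor's readings is a single-shard range scan:
start = reading_key("sensor-42", datetime(2025, 3, 1, tzinfo=timezone.utc))
end = reading_key("sensor-42", datetime(2025, 4, 1, tzinfo=timezone.utc))
# scan(start, end)  -> all March 2025 readings for sensor-42, in key order
```

The tradeoff from the list above shows up directly: fetching one time range across many sensors now needs one such scan per sensor.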

5.1. Rebalancing key-range sharded data

Initial setup options:

  • Start with no key ranges (database grows organically)
  • Pre-splitting (HBase, MongoDB): Configure initial shards on empty DB
    • Requires knowing key distribution upfront

How key-range sharding grows:

  • Data grows → split shard at median key → two smaller shards
  • Distribute resulting shards across nodes
  • Data shrinks → merge adjacent small shards
  • Similar to B-tree splitting/merging

Example split flow:

  1. 📦 Initial shard grows to 10 GB (threshold reached)
  2. ✂️ Split triggered: the system splits at the median key
  3. ✅ Two new shards of 5 GB each, distributed to different nodes

Automatic split triggers:

  • Shard reaches configured size (e.g., 10GB default in HBase)
  • Write throughput exceeds threshold (hot shard → split even if small)

Pros of key-range sharding:

  • ✅ Shard count adapts to data volume
  • ✅ Small data = few shards = low overhead
  • ✅ Large data = many shards with configurable max size

Cons of key-range sharding:

  • ❌ Splitting is expensive (rewrites all data to new files)
  • ❌ Hot shards need splitting while already under load
  • ❌ Risk of overload during split operation

6. Sharding by Hash of Key

In plain English: Hash sharding is like shuffling a deck of cards before dealing them out. Even if the cards were originally ordered, shuffling distributes them randomly.

In technical terms:

  • Don't care about key proximity? Hash first, then map to shard
  • Example: Tenant IDs don't need to be near each other

Why it matters: Hashing destroys key ordering (bad for range queries) but distributes load evenly (good for hot spots).

How hash functions work:

  • 32-bit hash: any string → number between 0 and 2³² − 1
  • Similar inputs → evenly distributed outputs
  • Same input always → same output (deterministic)

Hash functions used for sharding:

Database               Hash Function
MongoDB                MD5
Cassandra, ScyllaDB    Murmur3

⚠️ Caution: Built-in language hash functions (Java's Object.hashCode(), Ruby's Object#hash) may produce different values across processesβ€”not suitable for sharding.

6.1. Hash modulo number of nodes

In plain English: Taking hash(key) % 10 is like rolling a 10-sided die for each key: each key randomly gets assigned a number from 0-9, determining which of 10 nodes stores it.

In technical terms:

  • hash(key) % N → which of N nodes stores the key
  • Seems simple: 10 nodes = hash(key) % 10 = node 0-9

Why it matters: DON'T DO THIS! The mod N approach fails catastrophically:

  • Add or remove one node → nearly every key moves
  • Massive data migration for a simple cluster change

[Figure: Hash modulo rebalancing problem. Before: 3 nodes assigned by hash % 3 = 0, 1, 2. After adding a fourth node, assignment becomes hash % 4 = 0, 1, 2, 3, and most keys move: e.g., hash=6 moves from Node 0 to Node 2, and hash=9 moves from Node 0 to Node 1.]

The core problem:

  • N changes → most keys change shards
  • Easy to compute, but terrible for rebalancing
  • We need: An approach that moves data as little as possible
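
A quick way to see the damage, as a sketch: count how many keys change nodes when a 3-node cluster using hash mod N grows to 4 nodes.

```python
import hashlib

def stable_hash(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")

keys = [f"user-{i}" for i in range(100_000)]
moved = sum(1 for k in keys if stable_hash(k) % 3 != stable_hash(k) % 4)
print(f"{moved / len(keys):.0%} of keys moved")  # about 75% with these moduli
```

Only keys whose hash is congruent to 0, 1, or 2 modulo 12 keep the same assignment, so roughly three quarters of the data would have to migrate for a single added node.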

6.2. Fixed number of shards

In plain English: Instead of assigning keys directly to nodes, create many more "buckets" (shards) than you have nodes, then assign multiple buckets to each node. When you add a node, just reassign some buckets; the keys inside each bucket don't move.

In technical terms:

  • Create many more shards than nodes (e.g., 1,000 shards for 10 nodes)
  • Each node gets 100 shards
  • Key → hash(key) % 1,000 → shard number
  • System tracks shard → node mapping separately

Why it matters: Dramatically reduces rebalancing cost:

  • Move entire shards (cheap bulk copies)
  • Not individual keys (expensive random I/O)

[Figure: Fixed shards with rebalancing. With 10 nodes and 1,000 shards, each node holds 100 shards (Node 1: shards 0-99, Node 2: 100-199, Node 3: 200-299, ...). After adding Node 11, a few shards from each existing node (e.g., 91-99, 191-199, ...) move to it. Only entire shards move; keys stay in the same shard number.]

How rebalancing works:

  • Add node → reassign some shards to the new node
  • Remove node → redistribute its shards to remaining nodes
  • Only the shard → node mapping changes; keys don't move between shards
  • Transfer takes time; old mapping used during migration
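
A sketch of the fixed-shard scheme with hypothetical node names; the point is that rebalancing edits only the shard → node map, never the key → shard mapping:

```python
import hashlib

NUM_SHARDS = 1_000  # chosen at cluster creation, far larger than the node count

def shard_of(key: str) -> int:
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# 10 nodes, 100 shards each (real systems keep this map in a coordination service).
shard_to_node = {s: f"node-{s // 100}" for s in range(NUM_SHARDS)}

def node_for(key: str) -> str:
    return shard_to_node[shard_of(key)]

# Adding an 11th node: hand it one whole shard from each old node; no key re-hashes.
for s in range(90, NUM_SHARDS, 100):
    shard_to_node[s] = "node-10"
```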

💡 Insight

Choosing the right number of shards is an art. Too few shards and you can't scale past a certain number of nodes. Too many shards and overhead becomes significant. A common heuristic: start with 10-100x more shards than initial nodes, allowing room to grow.

Best practices for shard count:

  • Choose a number divisible by many factors (not just powers of 2)
  • Assign more shards to powerful nodes for weighted load distribution
  • Databases using this: Citus, Riak, Elasticsearch, Couchbase

Limitations:

  • Can't have more nodes than shards
  • Resharding is expensive (splits every shard, lots of disk I/O)
  • Some systems require downtime for resharding

The "shard size" dilemma:

  • Fixed fraction of total data per shard
  • Shards too large → expensive rebalancing and recovery
  • Shards too small → overhead becomes significant
  • Hard to get "just right" when data size varies

6.3. Sharding by hash range

In plain English: Combine the best of both worlds: use a hash to scramble keys (avoiding hot spots), then use ranges of hash values instead of individual shards.

In technical terms:

  • Can't predict shard count? → need adaptive sharding
  • Key-range adapts but creates hot spots
  • Solution: Range of hash values instead of range of keys

Why it matters:

  • ✅ Adapts to data growth (like key-range)
  • ✅ Distributes load evenly (like hash)
  • ❌ Loses efficient range queries over original keys

[Figure: Hash range sharding over a 16-bit hash space (0 to 65,535). Shard 0 covers 0-16,383, Shard 1 covers 16,384-32,767, Shard 2 covers 32,768-49,151, Shard 3 covers 49,152-65,535. Keys "2025-01-15" and "2025-01-16" have unrelated hashes and land in different shards.]

How it works:

  • Similar keys (e.g., consecutive timestamps) → uniformly distributed hashes
  • Assign hash ranges to shards: 0-16,383 → Shard 0, etc.
  • Shards split when too big (expensive, but adaptive)

Range queries with hash-range sharding:

  • Range over partition key? ❌ Keys scattered across shards
  • Range over secondary columns? ✅ Works if partition key is constant
  • Example: (user_id=123, timestamp BETWEEN ...) → all in same shard
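
A sketch of the shard lookup for hash-range sharding, mirroring the 16-bit example above (the boundary values are an illustration, not a real system's defaults):

```python
import bisect
import hashlib

# Upper bound (inclusive) of each shard's hash range.
UPPER_BOUNDS = [16_383, 32_767, 49_151, 65_535]

def hash16(partition_key: str) -> int:
    return int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest()[:2], "big")

def shard_for(partition_key: str) -> int:
    # Binary-search the boundary list for the range containing the hash value.
    return bisect.bisect_left(UPPER_BOUNDS, hash16(partition_key))

# Consecutive dates hash to unrelated values, so they land in different shards:
print(shard_for("2025-01-15"), shard_for("2025-01-16"))
```

Splitting a shard just inserts a new boundary into the list, and a query like (user_id=123, timestamp BETWEEN ...) hashes only user_id, so it still touches a single shard.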

7. Partitioning and Range Queries in Data Warehouses

Data warehouse terminology:

System        Partition Concept            Sorting Concept
BigQuery      partition key                cluster columns
Snowflake     micro-partitions (auto)      cluster keys
Delta Lake    manual or auto partitions    cluster keys

Benefits of clustering:

  • Improved range scan performance
  • Better compression
  • More efficient filtering

Hash-range sharding in practice:

  • Used by: YugabyteDB, DynamoDB, MongoDB (option)
  • Cassandra: 8 ranges per node (default)
  • ScyllaDB: 256 ranges per node
  • Random boundaries → some ranges are larger than others, but multiple ranges per node balance this out

[Figure: Cassandra/ScyllaDB hash range distribution. The hash space 0-1023 is split into ranges with random boundaries: Node 1 holds 0-125, 340-502, and 780-890; Node 2 holds 126-339, 503-600, and 891-1023; Node 3 holds 601-779. Multiple ranges per node balance out size differences.]

Adding/removing nodes:

  • Range boundaries added/removed, shards split/merged
  • New node gets fair share of data from existing nodes
  • Minimal data movement required

7.1. Consistent hashing

In plain English: Consistent hashing is like a smart shuffling algorithm that tries to keep cards in the same pile when you add or remove piles.

In technical terms: Two key properties:

  1. Keys mapped roughly equally to each shard
  2. When shard count changes, minimal keys move

Why it matters: "Consistent" means stable assignment, NOT ACID consistency or replica consistency.

💡 Insight

Despite its name, consistent hashing doesn't guarantee consistency in the ACID sense. It simply tries to keep a key in the same shard across cluster changes. Multiple algorithms achieve this goal: Cassandra's token ring, rendezvous hashing, and jump consistent hash are all "consistent hashing" variants with different tradeoffs.

Consistent hashing algorithms:

Algorithm               How it works
Cassandra/ScyllaDB      Token ring; split existing shards for new nodes
Rendezvous (HRW)        New node gets individual keys from all nodes
Jump consistent hash    Mathematically minimal key movement
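
Of these, jump consistent hash is compact enough to show in full. This is a Python rendering of the published algorithm (Lamping and Veach); it expects the key to already be a 64-bit integer hash:

```python
def jump_consistent_hash(key: int, num_buckets: int) -> int:
    # Maps a 64-bit key to a bucket in [0, num_buckets); when the bucket count
    # grows from n to n+1, only about 1/(n+1) of keys change buckets.
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * ((1 << 31) / ((key >> 33) + 1)))
    return b

moved = sum(1 for k in range(100_000)
            if jump_consistent_hash(k, 10) != jump_consistent_hash(k, 11))
print(moved / 100_000)  # roughly 1/11 of keys move, versus most keys under hash mod N
```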

8. Skewed Workloads and Relieving Hot Spots

In plain English: Imagine a celebrity tweets and gets 10 million replies. Even perfect sharding can't help if all those writes go to a single key (the celebrity's tweet ID).

In technical terms:

  • Consistent hashing distributes keys evenly
  • But load can still be skewed (some keys hotter than others)
  • Overloaded shards while others sit idle

Why it matters: Hot keys are a fundamental limit of sharding:

  • Adding more nodes doesn't help if all traffic goes to one key
  • Must split the key itself (adds complexity)

Example: Celebrity post on social media → millions of comments on one key

  1. 🔥 Detect the hot key: a celebrity tweet getting millions of writes
  2. ✂️ Split the key: append a random number 0-99 to the key
  3. ✅ Distribute writes: 100 keys spread across different shards
  4. 🔄 Read all variants: query all 100 keys and merge results

Solutions for hot keys:

  1. Dedicated shard: give the hot key its own shard/machine
  2. Key splitting: append a random suffix (0-99) to split writes across 100 keys

Key splitting tradeoffs:

Aspect               Writes                        Reads
Load distribution    ✅ Split across 100 shards    ❌ Must query all 100 keys and merge
Complexity           ✅ Simple append              ❌ Requires bookkeeping of which keys are split
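
A sketch of key splitting with an in-memory dict standing in for the sharded store (the split factor of 100 and the "#suffix" key format are assumptions for illustration):

```python
import random

SPLIT_FACTOR = 100  # number of sub-keys a hot key is split into

def write_comment(store: dict, hot_key: str, comment: str) -> None:
    # Writes pick a random suffix, spreading load across 100 sub-keys/shards.
    sub_key = f"{hot_key}#{random.randrange(SPLIT_FACTOR)}"
    store.setdefault(sub_key, []).append(comment)

def read_comments(store: dict, hot_key: str) -> list:
    # Reads must fetch every sub-key and merge the results.
    merged = []
    for i in range(SPLIT_FACTOR):
        merged.extend(store.get(f"{hot_key}#{i}", []))
    return merged

store = {}
write_comment(store, "post:celebrity-123", "wow!")
print(len(read_comments(store, "post:celebrity-123")))  # 1
```

The bookkeeping cost from the table shows up here too: the reader must know that this particular key was split, and into how many pieces.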

💡 Insight

Hot key handling is often application-specific. A social media platform might cache celebrity posts differently than regular posts. A voting system might use a different aggregation strategy for popular items. The database can detect hot keys, but the application usually needs to implement the mitigation.

Additional challenges:

  • Load changes over time (viral post → high load for days → calm)
  • Some keys hot for writes, others hot for reads → different strategies
  • Cloud services automate this: Amazon's "heat management" / "adaptive capacity"

9. Operations: Automatic or Manual Rebalancing

In plain English: Should your database automatically split shards and move them around, or should a human operator make those decisions? It's the difference between autopilot and manual flying.

In technical terms: Automatic vs. manual shard management?

Why it matters: Automatic rebalancing risks cascading failures:

  1. One slow node triggers rebalancing
  2. Rebalancing adds load to other nodes
  3. Other nodes trigger more rebalancing
  4. Spiral → cluster-wide outage

Approaches:

  • Fully automatic: no human interaction needed
  • Fully manual: an admin configures everything
  • Hybrid (Couchbase, Riak): auto-suggest, human commits

🤖 Automatic rebalancing:

  • Pro: Less operational work
  • Pro: Can auto-scale with load
  • Con: Unpredictable behavior
  • Con: Risk of cascading failures

👀 Manual rebalancing:

  • Pro: Predictable operations
  • Pro: Human prevents mistakes
  • Con: Slower to respond
  • Con: Requires operator expertise

Automatic rebalancing pros:

  • Less operational work
  • Auto-scale with workload (DynamoDB: add/remove shards in minutes)

Automatic rebalancing cons:

  • Unpredictable: rebalancing is expensive (rerouting + data movement)
  • Can overload network/nodes, harm other requests
  • If near max write throughput, splitting can't keep up with writes

The danger: automation + failure detection:

  1. Node temporarily slow (overloaded)
  2. Other nodes think it's dead
  3. Auto-rebalance to move load away
  4. Additional load makes things worse
  5. Cascading failure!

Recommendation: Human in the loop is slower but prevents surprises.


10. Request Routing

In plain English: When a client wants to read key "user:12345", how does it know which server to contact? This is like calling directory assistance.

In technical terms: How does a client know which node (IP address, port) to connect to?

Why it matters: Request routing is the glue that makes sharding work. Without it, clients must query ALL nodes.

Key difference from service discovery:

  • Stateless services → any instance can handle any request
  • Sharded databases → only the node with that shard can handle the request

[Figure: Three request routing approaches. (1) Client → any node, which forwards to the correct node. (2) Client → routing tier → correct node. (3) Smart client → directly to the correct node.]

Three routing approaches:

  1. Any node forwarding: the client contacts any node; that node forwards to the correct shard if needed
  2. Routing tier: a dedicated shard-aware load balancer determines the correct node
  3. Smart client: the client knows the shard → node mapping and connects directly (see the sketch below)
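
A self-contained sketch of approach 2, with plain dicts standing in for the routing tier's shard map and the storage nodes (all names are hypothetical):

```python
import hashlib

NUM_SHARDS = 16
shard_to_node = {s: f"node-{s % 4}" for s in range(NUM_SHARDS)}  # from the coordinator
nodes = {f"node-{i}": {} for i in range(4)}                      # stand-in storage nodes

def shard_of(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big") % NUM_SHARDS

def route_put(key: str, value) -> None:
    owner = shard_to_node[shard_of(key)]   # routing tier looks up the owning node
    nodes[owner][key] = value              # and forwards the request only there

def route_get(key: str):
    return nodes[shard_to_node[shard_of(key)]].get(key)

route_put("user:12345", {"name": "Alice"})
print(route_get("user:12345"))
```

Approaches 1 and 3 differ only in where this lookup lives: on whichever node first receives the request, or inside the client library.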

Key problems to solve:

Problem                         Challenge
Who decides shard placement?    Need fault-tolerant coordinator (avoid split-brain)
How to learn about changes?     Routing component needs shard → node updates
Migration in progress?          Handle in-flight requests during shard moves

💡 Insight

Coordination services like ZooKeeper solve the routing problem by providing a single source of truth for shard assignments. All routing decisions (whether in clients, nodes, or routing tier) subscribe to ZooKeeper to learn about changes. This is essentially distributed consensus (Chapter 10) in action.

[Figure: ZooKeeper-based routing. A ZooKeeper cluster (consensus-based coordination) holds the shard assignments; the routing tier and the storage nodes all subscribe to it and receive real-time updates when shard assignments change.]

Coordination services for shard routing:

System                       Coordination Method
HBase, SolrCloud             ZooKeeper
Kubernetes                   etcd
MongoDB                      Config servers + mongos daemons
Kafka, YugabyteDB, TiDB      Built-in Raft consensus
Cassandra, ScyllaDB, Riak    Gossip protocol (weaker consistency)

How it works:

  • Nodes register in ZooKeeper/etcd
  • ZooKeeper maintains the authoritative shard → node mapping
  • Clients/routing tier subscribe for updates
  • Changes are pushed in real-time

Gossip vs. consensus:

  • Gossip = weaker consistency, possible split-brain
  • Leaderless DBs can tolerate this (they have weak guarantees anyway)

Client IP discovery:

  • DNS is usually sufficient (IPs change less frequently than shard assignments)

Note: OLTP = single-shard queries. Analytics = parallel queries across all shards (Chapter 11).


11. Sharding and Secondary Indexes

In plain English: Primary key lookups are easy: you know the key, you know the shard. But what about queries like "find all red cars"? The red cars are scattered across all shards.

In technical terms:

  • Partition key = easy routing (one shard)
  • Secondary index = records scattered across all shards

Why it matters: No free lunch!

  • Local indexes → fast writes, slow reads (scatter-gather)
  • Global indexes → fast reads, slow writes (distributed updates)

Secondary indexes:

  • Not unique identifiers: search for values (find all red cars)
  • Common in relational DBs, document DBs, full-text search (Solr, Elasticsearch)
  • Two approaches: local vs. global

11.1. Local Secondary Indexes

In plain English: Local indexes are like each library branch maintaining its own card catalog: it only knows about books in that branch. To find all copies, you call every branch.

In technical terms:

  • Example: Used car website, partition by listing ID
  • Need to search by color and make
  • Each shard maintains indexes only for its own records
  • Postings list: color:red → [12, 87, 234] (IDs of red cars in this shard)

[Figure: Local (document-partitioned) secondary indexes. Shard 0 holds car IDs 0-499 with its own indexes (color:red → [12, 87, 234], make:honda → [12, 156]); Shard 1 holds IDs 500-999 with its own indexes (color:red → [543, 789], make:honda → [612]). A query for "find red cars" must hit both shards and merge the results: [12, 87, 234, 543, 789].]

⚠️ Warning: DIY secondary indexes in application code are dangerous:

  • Race conditions β†’ data out of sync
  • Intermittent write failures β†’ index inconsistency
  • See "The need for multi-object transactions"

How local indexes work:

  • Each shard = completely separate
  • Writes only touch one shard
  • Also called "document-partitioned index"

Reading from local indexes:

Query type            Behavior
Know partition key    Query only that shard ✅
Want all results      Query ALL shards, merge results ❌
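
A sketch of local indexes with two in-memory shards (the data is made up); note that query cost scales with the number of shards, not with the number of matches:

```python
# Each shard keeps postings lists only for its own records (document-partitioned).
shards = [
    {"records": {12: "red honda", 87: "red bmw", 156: "blue honda"},
     "index": {"color:red": [12, 87], "make:honda": [12, 156]}},
    {"records": {543: "red toyota", 612: "silver honda"},
     "index": {"color:red": [543], "make:honda": [612]}},
]

def find_by_term(term: str) -> list:
    # Scatter the lookup to every shard, then gather and merge the postings.
    ids = []
    for shard in shards:
        ids.extend(shard["index"].get(term, []))
    return ids

print(find_by_term("color:red"))   # [12, 87, 543]: one lookup per shard
```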

💡 Insight

Local secondary indexes suffer from tail latency amplification. If you query 100 shards in parallel and 99 respond in 10ms but one takes 500ms, your overall query takes 500ms. This is why adding more shards doesn't improve query throughput for secondary index queries: you still hit every shard.

Scalability limitation:

  • More shards = more storage
  • But NOT more query throughput (every query hits every shard)

Used by: MongoDB, Riak, Cassandra, Elasticsearch, SolrCloud, VoltDB

11.2. Global Secondary Indexes

In plain English: Global indexes are like a central library catalog that knows about books in all branches. The catalog itself is split across multiple computers, but each entry points to ALL matching locations.

In technical terms:

  • Index covers data in ALL shards
  • Index itself must be sharded (too big for one node)
  • Can be sharded differently from primary data (term-partitioned)

Why it matters: Trade write complexity for read efficiency:

  • ✅ A single lookup tells you all matching records
  • ❌ Writes must update index shards on different nodes

[Figure: Global (term-partitioned) secondary indexes. Data shards: Shard 0 (cars 0-499) and Shard 1 (cars 500-999). Index shards: the A-R index shard holds color:red → [12, 87, 234, 543, 789] and make:honda → [12, 156, 612]; the S-Z index shard holds color:silver → [45, 234, 891] and make:toyota → [67, 234]. A query for "find red cars" hits only the A-R index shard and returns all matching IDs immediately.]

How global indexes are partitioned:

  • Index partitioned by term (term-partitioned)
  • Colors A-R → Index Shard 0, Colors S-Z → Index Shard 1
  • Looking for a term? → you know which index shard to query

Query behavior:

Query Type                      Global Index Behavior
Single condition (color=red)    ✅ Query one index shard
Fetch actual records            ❌ Still need to hit multiple data shards
Multiple conditions (AND)       ❌ Terms on different shards, compute intersection
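
A sketch of a term-partitioned index with two index shards, split alphabetically on the indexed value as in the figure (the split rule and data are made up):

```python
index_shards = [{}, {}]   # index shard 0: values a-r, index shard 1: values s-z

def index_shard_for(term: str) -> int:
    value = term.split(":", 1)[1]          # e.g. "color:red" -> "red"
    return 0 if value[0] <= "r" else 1

def index_record(record_id: int, terms: list) -> None:
    # One record write may touch several index shards (one per distinct term).
    for term in terms:
        index_shards[index_shard_for(term)].setdefault(term, []).append(record_id)

def lookup(term: str) -> list:
    # A read touches exactly one index shard and gets the whole postings list.
    return index_shards[index_shard_for(term)].get(term, [])

index_record(12, ["color:red", "make:honda"])     # both terms land on index shard 0
index_record(543, ["color:red", "make:toyota"])   # this write fans out to shards 0 and 1
print(lookup("color:red"))                        # [12, 543] from a single index shard
```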

💡 Insight

Global secondary indexes often lag behind the primary data. When you write a car listing, updating the primary shard is fast (one network hop), but updating the global index shards asynchronously prevents the write from blocking. This means searches may briefly not find newly added items; eventual consistency strikes again.

Write challenges:

  • One record write → may update multiple index shards
  • Hard to keep index in sync
  • Option: distributed transactions (Chapter 8)

Used by: CockroachDB, TiDB, YugabyteDB, DynamoDB (both local and global)

When to use global indexes:

  • Read throughput >> write throughput
  • Postings lists are short

12. Summary

What we covered:

  • Different ways to split large datasets into smaller subsets (shards)
  • Why sharding is necessary when data exceeds single-machine capacity
  • The goal: spread data and query load evenly, avoiding hot spots
  • Key decisions: choosing sharding scheme and rebalancing strategies

Two Main Sharding Approaches

Key Range Sharding:

  • Keys are sorted; each shard owns keys from min to max
  • ✅ Efficient range queries possible
  • ❌ Risk of hot spots with sequential access patterns
  • Rebalancing: split large shards into two subranges

Hash Sharding:

  • Hash function applied to each key; shard owns range of hash values
  • ✅ Distributes load more evenly
  • ❌ Destroys key ordering, range queries inefficient
  • Rebalancing: typically create fixed shards upfront, move entire shards between nodes

📚 Key range sharding:

  • Pro: Efficient range queries
  • Pro: Dynamic shard count
  • Con: Sequential writes create hot spots
  • Con: Expensive to split shards

🎲 Hash sharding:

  • Pro: Even load distribution
  • Pro: Predictable performance
  • Con: Range queries inefficient
  • Con: Fixed shard count (or expensive resharding)

Common pattern: Compound keys

  • Use first part of key as partition key (identifies shard)
  • Sort records within shard by the rest of the key
  • Result: efficient range queries within same partition key

Secondary Indexes

Local Secondary Indexes:

  • Stored in same shard as primary key/value
  • ✅ Single shard update on write
  • ❌ Reads require scatter-gather to all shards

Global Secondary Indexes:

  • Sharded separately by indexed values
  • ❌ One write may update multiple index shards
  • βœ… Read postings list from single shard (but fetching records still hits multiple shards)

Query Routing

  • Routes queries to appropriate shards
  • Coordination service (e.g., ZooKeeper) tracks shard-to-node assignments

💡 Insight

Sharding represents a fundamental tradeoff in distributed systems: you gain scalability at the cost of simplicity. Operations that were simple on a single machine (joins, secondary indexes, transactions) become complex across shards. The key to successful sharding is choosing partition keys that align with your query patterns: shard by tenant if queries are per-tenant, by user if queries are per-user, by timestamp if queries are time-range scans.


The Big Picture

Why sharding works:

  • Each shard operates mostly independently
  • This independence enables horizontal scaling

What gets harder:

  • Cross-shard writes become problematic
  • What if write to one shard succeeds but another fails?
  • Answer: Transactions (Chapter 8)
