
Chapter 7. Sharding

Clearly, we must break away from the sequential and not limit the computers. We must state definitions and provide for priorities and descriptions of data. We must state relationships, not procedures.

Grace Murray Hopper, Management and the Computer of the Future (1962)

Table of Contents

  1. Sharding and Partitioning
  2. Pros and Cons of Sharding
  3. Sharding for Multitenancy
  4. Sharding of Key-Value Data
  5. Sharding by Key Range
  6. Sharding by Hash of Key
  7. Partitioning and Range Queries in Data Warehouses
  8. Skewed Workloads and Relieving Hot Spots
  9. Operations: Automatic or Manual Rebalancing
  10. Request Routing
  11. Sharding and Secondary Indexes
  12. Summary

In plain English: Think of sharding like organizing a massive library. When one building can't hold all the books anymore, you create multiple branches across the city. Each branch holds different books (different shards), and you need a system to know which branch has which books.

In technical terms: Sharding splits a large dataset into smaller partitions, with each shard stored on a different node. Each piece of data belongs to exactly one shard, though shards are typically replicated for fault tolerance.

Why it matters: Sharding enables horizontal scaling; you can grow your database by adding more machines instead of needing a single massive server. This makes it possible to handle petabyte-scale datasets and millions of queries per second.


[Figure: Combining replication and sharding. Shard 1 and Shard 2 each have one leader and two followers (A and B). Each shard has one leader and multiple followers, and each node acts as leader for some shards and follower for others.]

A distributed database typically distributes data across nodes in two ways:

  • Replication: having a copy of the same data on multiple nodes (covered in Chapter 6)
  • Sharding: splitting data into smaller partitions stored on different nodes (this chapter)

Key characteristics of sharding:

  • Each piece of data (record, row, or document) belongs to exactly one shard
  • Each shard is essentially a small database of its own
  • Sharding is usually combined with replication for fault tolerance
  • A node may store multiple shards and be leader for some, follower for others

💡 Insight

The choice of sharding scheme is mostly independent of the choice of replication scheme. Everything from Chapter 6 about replication applies equally to replication of shards.


1. Sharding and Partitioning

The same concept, many names:

Database                     Term Used
Kafka                        partition
CockroachDB                  range
HBase, TiDB                  region
Bigtable, YugabyteDB         tablet
Cassandra, ScyllaDB, Riak    vnode
Couchbase                    vBucket

PostgreSQL distinction:

  • Partitioning: splitting a table into files on the same machine (fast deletes)
  • Sharding: splitting data across multiple machines

In most other systems, partitioning = sharding.

💡 Insight

The terminology is confusing, but the core idea is simple: break a big dataset into smaller pieces. Whether you call them shards, partitions, tablets, or vnodes, the fundamental challenge is the same: how do you split the data evenly while maintaining fast lookups?

Fun fact: The term "shard" may have originated from:

  • The game Ultima Online, where a magic crystal shattered into pieces (shards), each reflecting a game world
  • Or an acronym: System for Highly Available Replicated Data (a 1980s database)

⚠️ Don't confuse: Partitioning β‰  network partitions (netsplits). We cover network faults in Chapter 9.


2. Pros and Cons of Sharding

In plain English: Sharding is like choosing between one massive warehouse or many smaller warehouses distributed across the country. The distributed approach handles more volume, but adds complexity in tracking where everything is stored.

In technical terms: The primary reason for sharding is scalability when:

  • Data volume exceeds single-node capacity
  • Write throughput overwhelms one server

(If read throughput is the problem, use read scaling from Chapter 6 instead.)

Why it matters: Sharding enables horizontal scaling (scale-out):

  • Grow capacity by adding more machines, not bigger machines
  • Each shard handles a share of the load
  • Process data and queries in parallel

✅ When sharding helps:

  • Data volume exceeds single machine capacity
  • Write throughput overwhelms one server
  • Need horizontal scaling for cost efficiency
  • Data naturally partitions (e.g., by tenant)

⚠️ When to avoid sharding:

  • Single machine can handle the load
  • Queries frequently join across shards
  • Distributed transactions required
  • Team lacks operational experience

Why you may not need sharding:

  • Replication works at any scale for fault tolerance
  • Single machines are powerful today; don't shard prematurely
  • Sharding adds significant complexity

The complexity cost:

  • You must choose a partition key upfront
  • All records with the same partition key → same shard
  • Knowing the shard = fast lookup; not knowing = search all shards
  • The sharding scheme is difficult to change later

💡 Insight

The "know your shard" principle is crucial. When you query by partition key (e.g., user_id = 123), the system routes directly to one shard. But when you query without it (e.g., "find all users named Alice"), you must scatter the query to all shards and gather results, potentially 100x slower.

Sharding works best for:

  • Key-value data: shard by key, queries go to one shard
  • Not ideal for relational data: secondary indexes and cross-shard joins are hard

Cross-shard writes are tricky:

  • Writes may need to update records in multiple shards
  • Single-node transactions are common; cross-shard transactions are slow
  • Distributed transactions (Chapter 8) may become a bottleneck

Special case: Single-machine sharding:

  • Some systems shard even on one machine (one process per CPU core)
  • Takes advantage of CPU parallelism and NUMA architecture
  • Examples: Redis, VoltDB, FoundationDB

3. Sharding for Multitenancy

In plain English: Think of a multitenant system like an apartment building. Each tenant (customer) has their own apartment (shard), with privacy from neighbors and independent utilities. If one tenant's plumbing breaks, it doesn't flood everyone else's apartment.

In technical terms: SaaS products are often multitenant:

  • Each tenant = one customer (e.g., a business)
  • Multiple users may log in under one tenant
  • Each tenant's dataset is separate from others

Why it matters: Sharding for multitenancy provides strong isolation: bugs, performance problems, or security issues in one tenant are less likely to affect others.

Implementation options:

  • One shard per tenant
  • Group small tenants into larger shards
  • Shards can be separate databases or portions of a logical database

Advantages of sharding for multitenancy:

3.1. Resource isolation

  • Expensive operations by one tenant don't affect others on different shards
  • Each shard has its own compute resources

3.2. Permission isolation

  • Bugs in access control are less likely to leak data between tenants
  • Physical separation = stronger security boundaries

3.3. Cell-based architecture

In plain English: A cell-based architecture is like having separate mini data centers for different customer groups. If one mini data center has a fire, only customers in that cell are affected.

In technical terms:

  • Shard not just data, but also application services
  • Each cell contains services + storage for a set of tenants
  • Cells run independently from each other

Why it matters: Limits blast radius: a bad deployment or bug affects only one cell, not the entire system.

3.4. Per-tenant backup and restore

  • Backup each tenant's shard separately
  • Restore one tenant without affecting others
  • Useful for accidental data deletion recovery

3.5. Regulatory compliance (GDPR)

  • Each person's data in a separate shard
  • Data export = simple operation on their shard
  • Data deletion = simple operation on their shard

3.6. Data residence

  • Some tenants require data in specific jurisdictions
  • Region-aware databases can assign shards to specific regions

3.7. Gradual schema rollout

  • Roll out migrations one tenant at a time
  • Detect problems before they affect everyone
  • Reduces risk of system-wide failures

Main challenges with multitenancy sharding:

  • Large tenants: if one tenant exceeds a single node, you need within-tenant sharding
  • Many small tenants: per-tenant shards create overhead; grouping complicates rebalancing
  • Cross-tenant features: queries joining multiple tenants require cross-shard operations

4. Sharding of Key-Value Data

The fundamental question: You have a large amount of data; how do you decide which records go on which nodes?

In plain English: Imagine you have 10 million user records to spread across 10 servers. Ideally, each server gets exactly 1 million users, and when you look up a user, you can quickly figure out which server has their data.

In technical terms: Our goal is to spread data and query load evenly:

  • 10 nodes → handle 10x the data and throughput of 1 node
  • Adding/removing nodes → rebalance to maintain even distribution

Why it matters: Uneven distribution (skew) wastes resources. If 90% of your data lands on one shard, that shard becomes the bottleneck.

Key terminology:

  • Skewed: unfair sharding where some shards have more load than others
  • Hot shard / hot spot: a shard with disproportionately high load
  • Hot key: a single key with exceptionally high traffic (e.g., a celebrity account)

We need a sharding algorithm that:

  • Takes a partition key → tells us the shard
  • Partition key = primary key (or first part) in key-value stores
  • Partition key = some column (not necessarily PK) in relational DBs
  • Must support rebalancing to relieve hot spots

5. Sharding by Key Range

In plain English: Key range sharding is like organizing a library by book title. Books starting with A-D go on shelf 1, E-K on shelf 2, and so on. If you want "Harry Potter," you know exactly which shelf to check.

In technical terms:

  • Assign a contiguous range of keys to each shard (min to max)
  • Like encyclopedia volumes: A-B in Vol 1, C-D in Vol 2, etc.
  • Keys are sorted within each shard

Why it matters: Range queries are extremely efficient:

  • "Get all events from March 2025" β†’ touches one shard
  • No need to scan all shards for time-based queries

[Figure: Encyclopedia volumes as key-range sharding. Vol 1 holds A-B, Vol 2 holds C-D, Vol 3 holds E-G, ..., Vol 12 holds T-Z. Ranges are sized to balance data volume, not to be alphabetically even.]

Key range boundaries:

  • Ranges are not evenly spaced; they adapt to data distribution
  • Vol 1 might cover A-B, while Vol 12 covers T-Z (more letters, less data)
  • Boundaries can be set manually (Vitess) or automatically (Bigtable, HBase, MongoDB, CockroachDB)

Within each shard:

  • Keys stored in sorted order (B-tree or SSTables)
  • Enables efficient range scans
  • Concatenated keys = fetch related records in one query (e.g., sensor readings by month)

💡 Insight

Key range sharding has a critical vulnerability: sequential writes create hot spots. If you write logs with timestamp keys, all writes go to the "current time" shard while other shards sit idle. The solution: prefix timestamps with another dimension like sensor_id, spreading writes across shards at the cost of harder range queries.

The timestamp hot spot problem:

  • Timestamp keys → all writes go to "current month" shard
  • Other shards sit idle while one is overloaded

Solution: Prefix with another dimension:

  • Key = sensor_id + timestamp instead of just timestamp
  • Writes spread across shards (different sensors → different shards; see the sketch below)
  • Tradeoff: Multi-sensor range queries now require one query per sensor
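
A small sketch of this compound-key idea, assuming a sensor_id#timestamp string format (the exact key encoding is an assumption, not from the text):

```python
from datetime import datetime, timezone

def reading_key(sensor_id: str, ts: datetime) -> str:
    # sensor_id acts as the partition key; the timestamp keeps one sensor's
    # readings sorted together within its shard.
    return f"{sensor_id}#{ts.strftime('%Y-%m-%dT%H:%M:%SZ')}"

# Writes from different sensors land on different shards:
k = reading_key("sensor-42", datetime(2025, 3, 1, 12, 0, tzinfo=timezone.utc))

# A month of one sensor's readings is a single-shard range scan:
start = reading_key("sensor-42", datetime(2025, 3, 1, tzinfo=timezone.utc))
end = reading_key("sensor-42", datetime(2025, 4, 1, tzinfo=timezone.utc))
# scan(start, end)  -> all March 2025 readings for sensor-42, in key order
```

The tradeoff from the list above shows up directly: fetching one time range across many sensors now needs one such scan per sensor.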

5.1. Rebalancing key-range sharded data

Initial setup options:

  • Start with no key ranges (database grows organically)
  • Pre-splitting (HBase, MongoDB): Configure initial shards on empty DB
    • Requires knowing key distribution upfront

How key-range sharding grows:

  • Data grows → split shard at median key → two smaller shards
  • Distribute resulting shards across nodes
  • Data shrinks → merge adjacent small shards
  • Similar to B-tree splitting/merging

Example split flow:

  1. 📦 Initial shard grows to 10 GB (threshold reached)
  2. ✂️ Split triggered: the system splits at the median key
  3. ✅ Two new shards of 5 GB each, distributed to different nodes

Automatic split triggers:

  • Shard reaches configured size (e.g., 10GB default in HBase)
  • Write throughput exceeds threshold (hot shard → split even if small)

Pros of key-range sharding:

  • ✅ Shard count adapts to data volume
  • ✅ Small data = few shards = low overhead
  • ✅ Large data = many shards with configurable max size

Cons of key-range sharding:

  • ❌ Splitting is expensive (rewrites all data to new files)
  • ❌ Hot shards need splitting while already under load
  • ❌ Risk of overload during split operation

6. Sharding by Hash of Key

In plain English: Hash sharding is like shuffling a deck of cards before dealing them out. Even if the cards were originally ordered, shuffling distributes them randomly.

In technical terms:

  • Don't care about key proximity? Hash first, then map to shard
  • Example: Tenant IDs don't need to be near each other

Why it matters: Hashing destroys key ordering (bad for range queries) but distributes load evenly (good for hot spots).

How hash functions work:

  • 32-bit hash: any string → number between 0 and 2³² − 1
  • Similar inputs → evenly distributed outputs
  • Same input always → same output (deterministic)

Hash functions used for sharding:

Database               Hash Function
MongoDB                MD5
Cassandra, ScyllaDB    Murmur3

⚠️ Caution: Built-in language hash functions (Java's Object.hashCode(), Ruby's Object#hash) may produce different values across processesβ€”not suitable for sharding.

6.1. Hash modulo number of nodes

In plain English: Taking hash(key) % 10 is like rolling a 10-sided die for each key: each key randomly gets assigned a number from 0-9, determining which of 10 nodes stores it.

In technical terms:

  • hash(key) % N → which of N nodes stores the key
  • Seems simple: 10 nodes = hash(key) % 10 = node 0-9

Why it matters: DON'T DO THIS! The mod N approach fails catastrophically:

  • Add or remove one node → nearly every key moves
  • Massive data migration for a simple cluster change

[Figure: Hash modulo rebalancing problem. Before: 3 nodes assigned by hash % 3 = 0, 1, 2. After adding a fourth node, assignment becomes hash % 4 = 0, 1, 2, 3, and most keys move: e.g., hash=6 moves from Node 0 to Node 2, and hash=9 moves from Node 0 to Node 1.]

The core problem:

  • N changes → most keys change shards
  • Easy to compute, but terrible for rebalancing
  • We need: An approach that moves data as little as possible
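
A quick way to see the damage, as a sketch: count how many keys change nodes when a 3-node cluster using hash mod N grows to 4 nodes.

```python
import hashlib

def stable_hash(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")

keys = [f"user-{i}" for i in range(100_000)]
moved = sum(1 for k in keys if stable_hash(k) % 3 != stable_hash(k) % 4)
print(f"{moved / len(keys):.0%} of keys moved")  # about 75% with these moduli
```

Only keys whose hash is congruent to 0, 1, or 2 modulo 12 keep the same assignment, so roughly three quarters of the data would have to migrate for a single added node.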

6.2. Fixed number of shards

In plain English: Instead of assigning keys directly to nodes, create many more "buckets" (shards) than you have nodes, then assign multiple buckets to each node. When you add a node, just reassign some buckets; the keys inside each bucket don't move.

In technical terms:

  • Create many more shards than nodes (e.g., 1,000 shards for 10 nodes)
  • Each node gets 100 shards
  • Key → hash(key) % 1,000 → shard number
  • System tracks shard → node mapping separately

Why it matters: Dramatically reduces rebalancing cost:

  • Move entire shards (cheap bulk copies)
  • Not individual keys (expensive random I/O)

[Figure: Fixed shards with rebalancing. With 10 nodes and 1,000 shards, each node holds 100 shards (Node 1: shards 0-99, Node 2: 100-199, Node 3: 200-299, ...). After adding Node 11, a few shards from each existing node (e.g., 91-99, 191-199, ...) move to it. Only entire shards move; keys stay in the same shard number.]

How rebalancing works:

  • Add node → reassign some shards to the new node
  • Remove node → redistribute its shards to remaining nodes
  • Only the shard → node mapping changes; keys don't move between shards
  • Transfer takes time; old mapping used during migration
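
A sketch of the fixed-shard scheme with hypothetical node names; the point is that rebalancing edits only the shard → node map, never the key → shard mapping:

```python
import hashlib

NUM_SHARDS = 1_000  # chosen at cluster creation, far larger than the node count

def shard_of(key: str) -> int:
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# 10 nodes, 100 shards each (real systems keep this map in a coordination service).
shard_to_node = {s: f"node-{s // 100}" for s in range(NUM_SHARDS)}

def node_for(key: str) -> str:
    return shard_to_node[shard_of(key)]

# Adding an 11th node: hand it one whole shard from each old node; no key re-hashes.
for s in range(90, NUM_SHARDS, 100):
    shard_to_node[s] = "node-10"
```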

💡 Insight

Choosing the right number of shards is an art. Too few shards and you can't scale past a certain number of nodes. Too many shards and overhead becomes significant. A common heuristic: start with 10-100x more shards than initial nodes, allowing room to grow.

Best practices for shard count:

  • Choose a number divisible by many factors (not just powers of 2)
  • Assign more shards to powerful nodes for weighted load distribution
  • Databases using this: Citus, Riak, Elasticsearch, Couchbase

Limitations:

  • Can't have more nodes than shards
  • Resharding is expensive (splits every shard, lots of disk I/O)
  • Some systems require downtime for resharding

The "shard size" dilemma:

  • Fixed fraction of total data per shard
  • Shards too large → expensive rebalancing and recovery
  • Shards too small → overhead becomes significant
  • Hard to get "just right" when data size varies

6.3. Sharding by hash range

In plain English: Combine the best of both worlds: use a hash to scramble keys (avoiding hot spots), then use ranges of hash values instead of individual shards.

In technical terms:

  • Can't predict shard count? → need adaptive sharding
  • Key-range adapts but creates hot spots
  • Solution: Range of hash values instead of range of keys

Why it matters:

  • ✅ Adapts to data growth (like key-range)
  • ✅ Distributes load evenly (like hash)
  • ❌ Loses efficient range queries over original keys

[Figure: Hash range sharding over a 16-bit hash space (0 to 65,535). Shard 0 covers 0-16,383, Shard 1 covers 16,384-32,767, Shard 2 covers 32,768-49,151, Shard 3 covers 49,152-65,535. Keys "2025-01-15" and "2025-01-16" have unrelated hashes and land in different shards.]

How it works:

  • Similar keys (e.g., consecutive timestamps) → uniformly distributed hashes
  • Assign hash ranges to shards: 0-16,383 → Shard 0, etc.
  • Shards split when too big (expensive, but adaptive)

Range queries with hash-range sharding:

  • Range over partition key? ❌ Keys scattered across shards
  • Range over secondary columns? ✅ Works if partition key is constant
  • Example: (user_id=123, timestamp BETWEEN ...) → all in same shard
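
A sketch of the shard lookup for hash-range sharding, mirroring the 16-bit example above (the boundary values are an illustration, not a real system's defaults):

```python
import bisect
import hashlib

# Upper bound (inclusive) of each shard's hash range.
UPPER_BOUNDS = [16_383, 32_767, 49_151, 65_535]

def hash16(partition_key: str) -> int:
    return int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest()[:2], "big")

def shard_for(partition_key: str) -> int:
    # Binary-search the boundary list for the range containing the hash value.
    return bisect.bisect_left(UPPER_BOUNDS, hash16(partition_key))

# Consecutive dates hash to unrelated values, so they land in different shards:
print(shard_for("2025-01-15"), shard_for("2025-01-16"))
```

Splitting a shard just inserts a new boundary into the list, and a query like (user_id=123, timestamp BETWEEN ...) hashes only user_id, so it still touches a single shard.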

7. Partitioning and Range Queries in Data Warehouses

Data warehouse terminology:

System        Partition Concept            Sorting Concept
BigQuery      partition key                cluster columns
Snowflake     micro-partitions (auto)      cluster keys
Delta Lake    manual or auto partitions    cluster keys

Benefits of clustering:

  • Improved range scan performance
  • Better compression
  • More efficient filtering

Hash-range sharding in practice:

  • Used by: YugabyteDB, DynamoDB, MongoDB (option)
  • Cassandra: 8 ranges per node (default)
  • ScyllaDB: 256 ranges per node
  • Random boundaries → some ranges are larger than others, but multiple ranges per node balance this out

[Figure: Cassandra/ScyllaDB hash range distribution. The hash space 0-1023 is split into ranges with random boundaries: Node 1 holds 0-125, 340-502, and 780-890; Node 2 holds 126-339, 503-600, and 891-1023; Node 3 holds 601-779. Multiple ranges per node balance out size differences.]

Adding/removing nodes:

  • Range boundaries added/removed, shards split/merged
  • New node gets fair share of data from existing nodes
  • Minimal data movement required

7.1. Consistent hashing

In plain English: Consistent hashing is like a smart shuffling algorithm that tries to keep cards in the same pile when you add or remove piles.

In technical terms: Two key properties:

  1. Keys mapped roughly equally to each shard
  2. When shard count changes, minimal keys move

Why it matters: "Consistent" means stable assignment, NOT ACID consistency or replica consistency.

💡 Insight

Despite its name, consistent hashing doesn't guarantee consistency in the ACID sense. It simply tries to keep a key in the same shard across cluster changes. Multiple algorithms achieve this goal: Cassandra's token ring, rendezvous hashing, and jump consistent hash are all "consistent hashing" variants with different tradeoffs.

Consistent hashing algorithms:

Algorithm               How it works
Cassandra/ScyllaDB      Token ring; split existing shards for new nodes
Rendezvous (HRW)        New node gets individual keys from all nodes
Jump consistent hash    Mathematically minimal key movement
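
Of these, jump consistent hash is compact enough to show in full. This is a Python rendering of the published algorithm (Lamping and Veach); it expects the key to already be a 64-bit integer hash:

```python
def jump_consistent_hash(key: int, num_buckets: int) -> int:
    # Maps a 64-bit key to a bucket in [0, num_buckets); when the bucket count
    # grows from n to n+1, only about 1/(n+1) of keys change buckets.
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * ((1 << 31) / ((key >> 33) + 1)))
    return b

moved = sum(1 for k in range(100_000)
            if jump_consistent_hash(k, 10) != jump_consistent_hash(k, 11))
print(moved / 100_000)  # roughly 1/11 of keys move, versus most keys under hash mod N
```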

8. Skewed Workloads and Relieving Hot Spots

In plain English: Imagine a celebrity tweets and gets 10 million replies. Even perfect sharding can't help if all those writes go to a single key (the celebrity's tweet ID).

In technical terms:

  • Consistent hashing distributes keys evenly
  • But load can still be skewed (some keys hotter than others)
  • Overloaded shards while others sit idle

Why it matters: Hot keys are a fundamental limit of sharding:

  • Adding more nodes doesn't help if all traffic goes to one key
  • Must split the key itself (adds complexity)

Example: Celebrity post on social media → millions of comments on one key

  1. 🔥 Detect the hot key: a celebrity tweet getting millions of writes
  2. ✂️ Split the key: append a random number 0-99 to the key
  3. ✅ Distribute writes: 100 keys spread across different shards
  4. 🔄 Read all variants: query all 100 keys and merge results

Solutions for hot keys:

  1. Dedicated shard: give the hot key its own shard/machine
  2. Key splitting: append a random suffix (0-99) to split writes across 100 keys

Key splitting tradeoffs:

Aspect               Writes                        Reads
Load distribution    ✅ Split across 100 shards    ❌ Must query all 100 keys and merge
Complexity           ✅ Simple append              ❌ Requires bookkeeping of which keys are split
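
A sketch of key splitting with an in-memory dict standing in for the sharded store (the split factor of 100 and the "#suffix" key format are assumptions for illustration):

```python
import random

SPLIT_FACTOR = 100  # number of sub-keys a hot key is split into

def write_comment(store: dict, hot_key: str, comment: str) -> None:
    # Writes pick a random suffix, spreading load across 100 sub-keys/shards.
    sub_key = f"{hot_key}#{random.randrange(SPLIT_FACTOR)}"
    store.setdefault(sub_key, []).append(comment)

def read_comments(store: dict, hot_key: str) -> list:
    # Reads must fetch every sub-key and merge the results.
    merged = []
    for i in range(SPLIT_FACTOR):
        merged.extend(store.get(f"{hot_key}#{i}", []))
    return merged

store = {}
write_comment(store, "post:celebrity-123", "wow!")
print(len(read_comments(store, "post:celebrity-123")))  # 1
```

The bookkeeping cost from the table shows up here too: the reader must know that this particular key was split, and into how many pieces.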

💡 Insight

Hot key handling is often application-specific. A social media platform might cache celebrity posts differently than regular posts. A voting system might use a different aggregation strategy for popular items. The database can detect hot keys, but the application usually needs to implement the mitigation.

Additional challenges:

  • Load changes over time (viral post → high load for days → calm)
  • Some keys hot for writes, others hot for reads → different strategies
  • Cloud services automate this: Amazon's "heat management" / "adaptive capacity"

9. Operations: Automatic or Manual Rebalancing

In plain English: Should your database automatically split shards and move them around, or should a human operator make those decisions? It's the difference between autopilot and manual flying.

In technical terms: Automatic vs. manual shard management?

Why it matters: Automatic rebalancing risks cascading failures:

  1. One slow node triggers rebalancing
  2. Rebalancing adds load to other nodes
  3. Other nodes trigger more rebalancing
  4. Spiral → cluster-wide outage

Approaches:

  • Fully automatic: no human interaction needed
  • Fully manual: an admin configures everything
  • Hybrid (Couchbase, Riak): auto-suggest, human commits

🤖 Automatic rebalancing:

  • Pro: Less operational work
  • Pro: Can auto-scale with load
  • Con: Unpredictable behavior
  • Con: Risk of cascading failures

👀 Manual rebalancing:

  • Pro: Predictable operations
  • Pro: Human prevents mistakes
  • Con: Slower to respond
  • Con: Requires operator expertise

Automatic rebalancing pros:

  • Less operational work
  • Auto-scale with workload (DynamoDB: add/remove shards in minutes)

Automatic rebalancing cons:

  • Unpredictable: rebalancing is expensive (rerouting + data movement)
  • Can overload network/nodes, harm other requests
  • If near max write throughput, splitting can't keep up with writes

The danger: automation + failure detection:

  1. Node temporarily slow (overloaded)
  2. Other nodes think it's dead
  3. Auto-rebalance to move load away
  4. Additional load makes things worse
  5. Cascading failure!

Recommendation: Human in the loop is slower but prevents surprises.


10. Request Routing

In plain English: When a client wants to read key "user:12345", how does it know which server to contact? This is like calling directory assistance.

In technical terms: How does a client know which node (IP address, port) to connect to?

Why it matters: Request routing is the glue that makes sharding work. Without it, clients must query ALL nodes.

Key difference from service discovery:

  • Stateless services → any instance can handle any request
  • Sharded databases → only the node with that shard can handle the request

[Figure: Three request routing approaches. (1) Client → any node, which forwards to the correct node. (2) Client → routing tier → correct node. (3) Smart client → directly to the correct node.]

Three routing approaches:

  1. Any node forwarding: the client contacts any node; that node forwards to the correct shard if needed
  2. Routing tier: a dedicated shard-aware load balancer determines the correct node
  3. Smart client: the client knows the shard → node mapping and connects directly (see the sketch below)
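
A self-contained sketch of approach 2, with plain dicts standing in for the routing tier's shard map and the storage nodes (all names are hypothetical):

```python
import hashlib

NUM_SHARDS = 16
shard_to_node = {s: f"node-{s % 4}" for s in range(NUM_SHARDS)}  # from the coordinator
nodes = {f"node-{i}": {} for i in range(4)}                      # stand-in storage nodes

def shard_of(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big") % NUM_SHARDS

def route_put(key: str, value) -> None:
    owner = shard_to_node[shard_of(key)]   # routing tier looks up the owning node
    nodes[owner][key] = value              # and forwards the request only there

def route_get(key: str):
    return nodes[shard_to_node[shard_of(key)]].get(key)

route_put("user:12345", {"name": "Alice"})
print(route_get("user:12345"))
```

Approaches 1 and 3 differ only in where this lookup lives: on whichever node first receives the request, or inside the client library.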

Key problems to solve:

Problem                         Challenge
Who decides shard placement?    Need fault-tolerant coordinator (avoid split-brain)
How to learn about changes?     Routing component needs shard → node updates
Migration in progress?          Handle in-flight requests during shard moves

💡 Insight

Coordination services like ZooKeeper solve the routing problem by providing a single source of truth for shard assignments. All routing decisions (whether in clients, nodes, or routing tier) subscribe to ZooKeeper to learn about changes. This is essentially distributed consensus (Chapter 10) in action.

[Figure: ZooKeeper-based routing. A ZooKeeper cluster (consensus-based coordination) holds the shard assignments; the routing tier and the storage nodes all subscribe to it and receive real-time updates when shard assignments change.]

Coordination services for shard routing:

System                       Coordination Method
HBase, SolrCloud             ZooKeeper
Kubernetes                   etcd
MongoDB                      Config servers + mongos daemons
Kafka, YugabyteDB, TiDB      Built-in Raft consensus
Cassandra, ScyllaDB, Riak    Gossip protocol (weaker consistency)

How it works:

  • Nodes register in ZooKeeper/etcd
  • ZooKeeper maintains the authoritative shard → node mapping
  • Clients/routing tier subscribe for updates
  • Changes are pushed in real-time

Gossip vs. consensus:

  • Gossip = weaker consistency, possible split-brain
  • Leaderless DBs can tolerate this (they have weak guarantees anyway)

Client IP discovery:

  • DNS is usually sufficient (IPs change less frequently than shard assignments)

Note: OLTP = single-shard queries. Analytics = parallel queries across all shards (Chapter 11).


11. Sharding and Secondary Indexes

In plain English: Primary key lookups are easy: you know the key, you know the shard. But what about queries like "find all red cars"? The red cars are scattered across all shards.

In technical terms:

  • Partition key = easy routing (one shard)
  • Secondary index = records scattered across all shards

Why it matters: No free lunch!

  • Local indexes → fast writes, slow reads (scatter-gather)
  • Global indexes → fast reads, slow writes (distributed updates)

Secondary indexes:

  • Not unique identifiers: search for values (find all red cars)
  • Common in relational DBs, document DBs, full-text search (Solr, Elasticsearch)
  • Two approaches: local vs. global

11.1. Local Secondary Indexes

In plain English: Local indexes are like each library branch maintaining its own card catalog: it only knows about books in that branch. To find all copies, you call every branch.

In technical terms:

  • Example: Used car website, partition by listing ID
  • Need to search by color and make
  • Each shard maintains indexes only for its own records
  • Postings list: color:red → [12, 87, 234] (IDs of red cars in this shard)

[Figure: Local (document-partitioned) secondary indexes. Shard 0 holds car IDs 0-499 with its own indexes (color:red → [12, 87, 234], make:honda → [12, 156]); Shard 1 holds IDs 500-999 with its own indexes (color:red → [543, 789], make:honda → [612]). A query for "find red cars" must hit both shards and merge the results: [12, 87, 234, 543, 789].]

⚠️ Warning: DIY secondary indexes in application code are dangerous:

  • Race conditions β†’ data out of sync
  • Intermittent write failures β†’ index inconsistency
  • See "The need for multi-object transactions"

How local indexes work:

  • Each shard = completely separate
  • Writes only touch one shard
  • Also called "document-partitioned index"

Reading from local indexes:

Query type            Behavior
Know partition key    Query only that shard ✅
Want all results      Query ALL shards, merge results ❌
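
A sketch of local indexes with two in-memory shards (the data is made up); note that query cost scales with the number of shards, not with the number of matches:

```python
# Each shard keeps postings lists only for its own records (document-partitioned).
shards = [
    {"records": {12: "red honda", 87: "red bmw", 156: "blue honda"},
     "index": {"color:red": [12, 87], "make:honda": [12, 156]}},
    {"records": {543: "red toyota", 612: "silver honda"},
     "index": {"color:red": [543], "make:honda": [612]}},
]

def find_by_term(term: str) -> list:
    # Scatter the lookup to every shard, then gather and merge the postings.
    ids = []
    for shard in shards:
        ids.extend(shard["index"].get(term, []))
    return ids

print(find_by_term("color:red"))   # [12, 87, 543]: one lookup per shard
```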

💡 Insight

Local secondary indexes suffer from tail latency amplification. If you query 100 shards in parallel and 99 respond in 10ms but one takes 500ms, your overall query takes 500ms. This is why adding more shards doesn't improve query throughput for secondary index queries: you still hit every shard.

Scalability limitation:

  • More shards = more storage
  • But NOT more query throughput (every query hits every shard)

Used by: MongoDB, Riak, Cassandra, Elasticsearch, SolrCloud, VoltDB

11.2. Global Secondary Indexes

In plain English: Global indexes are like a central library catalog that knows about books in all branches. The catalog itself is split across multiple computers, but each entry points to ALL matching locations.

In technical terms:

  • Index covers data in ALL shards
  • Index itself must be sharded (too big for one node)
  • Can be sharded differently from primary data (term-partitioned)

Why it matters: Trade write complexity for read efficiency:

  • ✅ A single lookup tells you all matching records
  • ❌ Writes must update index shards on different nodes

[Figure: Global (term-partitioned) secondary indexes. Data shards: Shard 0 (cars 0-499) and Shard 1 (cars 500-999). Index shards: the A-R index shard holds color:red → [12, 87, 234, 543, 789] and make:honda → [12, 156, 612]; the S-Z index shard holds color:silver → [45, 234, 891] and make:toyota → [67, 234]. A query for "find red cars" hits only the A-R index shard and returns all matching IDs immediately.]

How global indexes are partitioned:

  • Index partitioned by term (term-partitioned)
  • Colors A-R → Index Shard 0, Colors S-Z → Index Shard 1
  • Looking for a term? → you know which index shard to query

Query behavior:

Query Type                      Global Index Behavior
Single condition (color=red)    ✅ Query one index shard
Fetch actual records            ❌ Still need to hit multiple data shards
Multiple conditions (AND)       ❌ Terms on different shards, compute intersection
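
A sketch of a term-partitioned index with two index shards, split alphabetically on the indexed value as in the figure (the split rule and data are made up):

```python
index_shards = [{}, {}]   # index shard 0: values a-r, index shard 1: values s-z

def index_shard_for(term: str) -> int:
    value = term.split(":", 1)[1]          # e.g. "color:red" -> "red"
    return 0 if value[0] <= "r" else 1

def index_record(record_id: int, terms: list) -> None:
    # One record write may touch several index shards (one per distinct term).
    for term in terms:
        index_shards[index_shard_for(term)].setdefault(term, []).append(record_id)

def lookup(term: str) -> list:
    # A read touches exactly one index shard and gets the whole postings list.
    return index_shards[index_shard_for(term)].get(term, [])

index_record(12, ["color:red", "make:honda"])     # both terms land on index shard 0
index_record(543, ["color:red", "make:toyota"])   # this write fans out to shards 0 and 1
print(lookup("color:red"))                        # [12, 543] from a single index shard
```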

💡 Insight

Global secondary indexes often lag behind the primary data. When you write a car listing, updating the primary shard is fast (one network hop), but updating the global index shards asynchronously prevents the write from blocking. This means searches may briefly not find newly added items; eventual consistency strikes again.

Write challenges:

  • One record write → may update multiple index shards
  • Hard to keep index in sync
  • Option: distributed transactions (Chapter 8)

Used by: CockroachDB, TiDB, YugabyteDB, DynamoDB (both local and global)

When to use global indexes:

  • Read throughput >> write throughput
  • Postings lists are short

12. Summary

What we covered:

  • Different ways to split large datasets into smaller subsets (shards)
  • Why sharding is necessary when data exceeds single-machine capacity
  • The goal: spread data and query load evenly, avoiding hot spots
  • Key decisions: choosing sharding scheme and rebalancing strategies

Two Main Sharding Approaches

Key Range Sharding:

  • Keys are sorted; each shard owns keys from min to max
  • ✅ Efficient range queries possible
  • ❌ Risk of hot spots with sequential access patterns
  • Rebalancing: split large shards into two subranges

Hash Sharding:

  • Hash function applied to each key; shard owns range of hash values
  • ✅ Distributes load more evenly
  • ❌ Destroys key ordering, range queries inefficient
  • Rebalancing: typically create fixed shards upfront, move entire shards between nodes

📚 Key range sharding:

  • Pro: Efficient range queries
  • Pro: Dynamic shard count
  • Con: Sequential writes create hot spots
  • Con: Expensive to split shards

🎲 Hash sharding:

  • Pro: Even load distribution
  • Pro: Predictable performance
  • Con: Range queries inefficient
  • Con: Fixed shard count (or expensive resharding)

Common pattern: Compound keys

  • Use first part of key as partition key (identifies shard)
  • Sort records within shard by the rest of the key
  • Result: efficient range queries within same partition key

Secondary Indexes

Local Secondary Indexes:

  • Stored in same shard as primary key/value
  • ✅ Single shard update on write
  • ❌ Reads require scatter-gather to all shards

Global Secondary Indexes:

  • Sharded separately by indexed values
  • ❌ One write may update multiple index shards
  • βœ… Read postings list from single shard (but fetching records still hits multiple shards)

Query Routing

  • Routes queries to appropriate shards
  • Coordination service (e.g., ZooKeeper) tracks shard-to-node assignments

💡 Insight

Sharding represents a fundamental tradeoff in distributed systems: you gain scalability at the cost of simplicity. Operations that were simple on a single machine (joins, secondary indexes, transactions) become complex across shards. The key to successful sharding is choosing partition keys that align with your query patterns: shard by tenant if queries are per-tenant, by user if queries are per-user, by timestamp if queries are time-range scans.


The Big Picture

Why sharding works:

  • Each shard operates mostly independently
  • This independence enables horizontal scaling

What gets harder:

  • Cross-shard writes become problematic
  • What if write to one shard succeeds but another fails?
  • Answer: Transactions (Chapter 8)
