Chapter 7. Sharding
Clearly, we must break away from the sequential and not limit the computers. We must state definitions and provide for priorities and descriptions of data. We must state relationships, not procedures.
Grace Murray Hopper, Management and the Computer of the Future (1962)
Table of Contents
- Sharding and Partitioning
- Pros and Cons of Sharding
- Sharding for Multitenancy
- 3.1. Resource isolation
- 3.2. Permission isolation
- 3.3. Cell-based architecture
- 3.4. Per-tenant backup and restore
- 3.5. Regulatory compliance
- 3.6. Data residence
- 3.7. Gradual schema rollout
- Sharding of Key-Value Data
- Sharding by Key Range
- Sharding by Hash of Key
- Partitioning and Range Queries in Data Warehouses
- 7.1. Consistent hashing
- Skewed Workloads and Relieving Hot Spots
- Operations: Automatic or Manual Rebalancing
- Request Routing
- Sharding and Secondary Indexes
- 11.1. Local Secondary Indexes
- 11.2. Global Secondary Indexes
In plain English: Think of sharding like organizing a massive library. When one building can't hold all the books anymore, you create multiple branches across the city. Each branch holds different books (different shards), and you need a system to know which branch has which books.
In technical terms: Sharding splits a large dataset into smaller partitions, with each shard stored on a different node. Each piece of data belongs to exactly one shard, though shards are typically replicated for fault tolerance.
Why it matters: Sharding enables horizontal scaling: you can grow your database by adding more machines instead of needing a single massive server. This makes it possible to handle petabyte-scale datasets and millions of queries per second.
[Figure: each shard has one leader and multiple followers; each node acts as leader for some shards and follower for others.]
A distributed database typically distributes data across nodes in two ways:
- Replication: having a copy of the same data on multiple nodes (covered in Chapter 6)
- Sharding: splitting data into smaller partitions stored on different nodes (this chapter)
Key characteristics of sharding:
- Each piece of data (record, row, or document) belongs to exactly one shard
- Each shard is essentially a small database of its own
- Sharding is usually combined with replication for fault tolerance
- A node may store multiple shards and be leader for some, follower for others
💡 Insight
The choice of sharding scheme is mostly independent of the choice of replication scheme. Everything from Chapter 6 about replication applies equally to replication of shards.
1. Sharding and Partitioning
The same concept, many names:
| Database | Term Used |
|---|---|
| Kafka | partition |
| CockroachDB | range |
| HBase, TiDB | region |
| Bigtable, YugabyteDB | tablet |
| Cassandra, ScyllaDB, Riak | vnode |
| Couchbase | vBucket |
PostgreSQL distinction:
- Partitioning: splitting a table into files on the same machine (fast deletes)
- Sharding: splitting across multiple machines
In most other systems, partitioning = sharding.
💡 Insight
The terminology is confusing, but the core idea is simple: break a big dataset into smaller pieces. Whether you call them shards, partitions, tablets, or vnodes, the fundamental challenge is the same: how do you split the data evenly while maintaining fast lookups?
Fun fact: The term "shard" may have originated from:
- The game Ultima Online, where a magic crystal shattered into pieces (shards), each reflecting a game world
- Or an acronym: System for Highly Available Replicated Data (a 1980s database)
⚠️ Don't confuse: Partitioning ≠ network partitions (netsplits). We cover network faults in Chapter 9.
2. Pros and Cons of Sharding
In plain English: Sharding is like choosing between one massive warehouse or many smaller warehouses distributed across the country. The distributed approach handles more volume, but adds complexity in tracking where everything is stored.
In technical terms: The primary reason for sharding is scalability when:
- Data volume exceeds single-node capacity
- Write throughput overwhelms one server
(If read throughput is the problem, use read scaling from Chapter 6 instead.)
Why it matters: Sharding enables horizontal scaling (scale-out):
- Grow capacity by adding more machines, not bigger machines
- Each shard handles a share of the load
- Process data and queries in parallel
Shard when:
- Data volume exceeds single-machine capacity
- Write throughput overwhelms one server
- You need horizontal scaling for cost efficiency
- Data naturally partitions (e.g., by tenant)
Avoid sharding when:
- A single machine can handle the load
- Queries frequently join across shards
- Distributed transactions are required
- The team lacks operational experience
Reasons not to shard prematurely:
- Replication works at any scale for fault tolerance
- Single machines are powerful today, so don't shard too early
- Sharding adds significant complexity
The complexity cost:
- You must choose a partition key upfront
- All records with the same partition key → same shard
- Knowing the shard = fast lookup; not knowing = search all shards
- The sharding scheme is difficult to change later
💡 Insight
The "know your shard" principle is crucial. When you query by partition key (e.g., user_id = 123), the system routes directly to one shard. But when you query without it (e.g., "find all users named Alice"), you must scatter the query to all shards and gather results, potentially 100x slower.
Sharding works best for:
- Key-value data: shard by key, and queries go to one shard
- Not ideal for relational data: secondary indexes and cross-shard joins are hard
Cross-shard writes are tricky:
- Writes may need to update records in multiple shards
- Single-node transactions are common; cross-shard transactions are slow
- Distributed transactions (Chapter 8) may become a bottleneck
Special case: Single-machine sharding:
- Some systems shard even on one machine (one process per CPU core)
- Takes advantage of CPU parallelism and NUMA architecture
- Examples: Redis, VoltDB, FoundationDB
3. Sharding for Multitenancy
In plain English: Think of a multitenant system like an apartment building. Each tenant (customer) has their own apartment (shard), with privacy from neighbors and independent utilities. If one tenant's plumbing breaks, it doesn't flood everyone else's apartment.
In technical terms: SaaS products are often multitenant:
- Each tenant = one customer (e.g., a business)
- Multiple users may log in under one tenant
- Each tenant's dataset is separate from others
Why it matters: Sharding for multitenancy provides strong isolation: bugs, performance problems, or security issues in one tenant are less likely to affect others.
Implementation options:
- One shard per tenant
- Group small tenants into larger shards
- Shards can be separate databases or portions of a logical database
Advantages of sharding for multitenancy:
3.1. Resource isolation
- Expensive operations by one tenant don't affect others on different shards
- Each shard has its own compute resources
3.2. Permission isolation
- Bugs in access control are less likely to leak data between tenants
- Physical separation = stronger security boundaries
3.3. Cell-based architecture
In plain English: A cell-based architecture is like having separate mini data centers for different customer groups. If one mini data center has a fire, only customers in that cell are affected.
In technical terms:
- Shard not just data, but also application services
- Each cell contains services + storage for a set of tenants
- Cells run independently from each other
Why it matters: Limits blast radius: a bad deployment or bug affects only one cell, not the entire system.
3.4. Per-tenant backup and restore
- Backup each tenant's shard separately
- Restore one tenant without affecting others
- Useful for accidental data deletion recovery
3.5. Regulatory compliance (GDPR)
- Each person's data in a separate shard
- Data export = simple operation on their shard
- Data deletion = simple operation on their shard
3.6. Data residence
- Some tenants require data in specific jurisdictions
- Region-aware databases can assign shards to specific regions
3.7. Gradual schema rollout
- Roll out migrations one tenant at a time
- Detect problems before they affect everyone
- Reduces risk of system-wide failures
Main challenges with multitenancy sharding:
- Large tenants: if one tenant exceeds a single node, you need within-tenant sharding
- Many small tenants: per-tenant shards create overhead; grouping complicates rebalancing
- Cross-tenant features: queries joining multiple tenants require cross-shard operations
4. Sharding of Key-Value Data
The fundamental question: You have a large amount of data: how do you decide which records go on which nodes?
In plain English: Imagine you have 10 million user records to spread across 10 servers. Ideally, each server gets exactly 1 million users, and when you look up a user, you can quickly figure out which server has their data.
In technical terms: Our goal is to spread data and query load evenly:
- 10 nodes → handle 10x the data and throughput of 1 node
- Adding/removing nodes → rebalance to maintain even distribution
Why it matters: Uneven distribution (skew) wastes resources. If 90% of your data lands on one shard, that shard becomes the bottleneck.
Key terminology:
- Skewed: unfair sharding where some shards have more load than others
- Hot shard / hot spot: a shard with disproportionately high load
- Hot key: a single key with exceptionally high traffic (e.g., a celebrity account)
We need a sharding algorithm that:
- Takes a partition key → tells us the shard
- Partition key = primary key (or first part) in key-value stores
- Partition key = some column (not necessarily PK) in relational DBs
- Must support rebalancing to relieve hot spots
5. Sharding by Key Range
In plain English: Key range sharding is like organizing a library by book title. Books starting with A-D go on shelf 1, E-K on shelf 2, and so on. If you want "Harry Potter," you know exactly which shelf to check.
In technical terms:
- Assign a contiguous range of keys to each shard (min to max)
- Like encyclopedia volumes: A-B in Vol 1, C-D in Vol 2, etc.
- Keys are sorted within each shard
Why it matters: Range queries are extremely efficient:
- "Get all events from March 2025" β touches one shard
- No need to scan all shards for time-based queries
[Figure: key ranges A-B, C-D, E-G, ..., T-Z assigned to shards; ranges are sized to balance data volume, not to be alphabetically even.]
Key range boundaries:
- Ranges are not evenly spaced: they adapt to the data distribution
- Vol 1 might cover A-B, while Vol 12 covers T-Z (more letters, less data)
- Boundaries can be set manually (Vitess) or automatically (Bigtable, HBase, MongoDB, CockroachDB)
Within each shard:
- Keys stored in sorted order (B-tree or SSTables)
- Enables efficient range scans
- Concatenated keys = fetch related records in one query (e.g., sensor readings by month)
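To make the range-to-shard mapping concrete, here is a minimal Python sketch (not any particular database's implementation; the boundary keys are made up) that routes a key to a shard with a binary search over sorted range boundaries:

```python
import bisect

# Hypothetical boundary keys: shard 0 holds keys < "c", shard 1 holds "c"-"e",
# shard 2 holds "e"-"t", shard 3 holds everything from "t" onward.
BOUNDARIES = ["c", "e", "t"]

def shard_for_key(key: str) -> int:
    """Return the shard whose key range contains `key` (binary search over boundaries)."""
    return bisect.bisect_right(BOUNDARIES, key)

print(shard_for_key("banana"))  # 0
print(shard_for_key("harry"))   # 2
print(shard_for_key("zebra"))   # 3
```

Because keys within each shard are stored in sorted order, a range query only needs to visit the shards whose boundaries overlap the requested range.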
💡 Insight
Key range sharding has a critical vulnerability: sequential writes create hot spots. If you write logs with timestamp keys, all writes go to the "current time" shard while other shards sit idle. The solution: prefix timestamps with another dimension like sensor_id, spreading writes across shards at the cost of harder range queries.
The timestamp hot spot problem:
- Timestamp keys → all writes go to the "current month" shard
- Other shards sit idle while one is overloaded
Solution: Prefix with another dimension:
- Key = sensor_id + timestamp instead of just timestamp
- Writes spread across shards (different sensors → different shards)
- Tradeoff: multi-sensor range queries now require one query per sensor
5.1. Rebalancing key-range sharded data
Initial setup options:
- Start with no key ranges (database grows organically)
- Pre-splitting (HBase, MongoDB): Configure initial shards on empty DB
- Requires knowing key distribution upfront
How key-range sharding grows:
- Data grows → split shard at the median key → two smaller shards
- Distribute resulting shards across nodes
- Data shrinks → merge adjacent small shards
- Similar to B-tree splitting/merging
Automatic split triggers:
- Shard reaches configured size (e.g., 10GB default in HBase)
- Write throughput exceeds threshold (hot shard → split even if small)
Pros of key-range sharding:
- ✅ Shard count adapts to data volume
- ✅ Small data = few shards = low overhead
- ✅ Large data = many shards with configurable max size
Cons of key-range sharding:
- ❌ Splitting is expensive (rewrites all data to new files)
- ❌ Hot shards need splitting while already under load
- ❌ Risk of overload during the split operation
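As a rough sketch of what a split does (not how any specific system implements it), splitting a shard at the median of its sorted keys yields two half-size shards and a new range boundary:

```python
def split_shard(sorted_keys: list[str]) -> tuple[list[str], list[str], str]:
    """Split a shard at its median key; the first key of the upper half
    becomes the new boundary between the two resulting shards."""
    mid = len(sorted_keys) // 2
    lower, upper = sorted_keys[:mid], sorted_keys[mid:]
    return lower, upper, upper[0]

keys = sorted(["apple", "banana", "cherry", "date", "elderberry", "fig"])
lower, upper, boundary = split_shard(keys)
print(lower)     # ['apple', 'banana', 'cherry']
print(upper)     # ['date', 'elderberry', 'fig']
print(boundary)  # 'date' -- the new range boundary
```

In a real system the expensive part is not finding the median but rewriting the shard's data into two new sets of files while it continues to serve traffic.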
6. Sharding by Hash of Key
In plain English: Hash sharding is like shuffling a deck of cards before dealing them out. Even if the cards were originally ordered, shuffling distributes them randomly.
In technical terms:
- Don't care about key proximity? Hash first, then map to shard
- Example: Tenant IDs don't need to be near each other
Why it matters: Hashing destroys key ordering (bad for range queries) but distributes load evenly (good for hot spots).
How hash functions work:
- 32-bit hash: any string → a number between 0 and 2³² − 1
- Similar inputs → evenly distributed outputs
- Same input always → same output (deterministic)
Hash functions used for sharding:
| Database | Hash Function |
|---|---|
| MongoDB | MD5 |
| Cassandra, ScyllaDB | Murmur3 |
⚠️ Caution: Built-in language hash functions (Java's Object.hashCode(), Ruby's Object#hash) may produce different values across processes, so they are not suitable for sharding.
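For illustration, a stable hash can be derived from a cryptographic digest, as in this sketch (MD5 is used here only because it is deterministic everywhere; the helper name is made up):

```python
import hashlib

def stable_hash32(key: str) -> int:
    """Deterministic 32-bit hash: the same key gives the same value in every
    process and on every machine, unlike Python's randomized built-in hash()."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")  # first 4 bytes -> 0 .. 2**32 - 1

print(stable_hash32("user:12345"))  # always the same number
```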
6.1. Hash modulo number of nodes
In plain English: Taking hash(key) % 10 is like rolling a 10-sided die for each key: each key randomly gets assigned a number from 0-9, determining which of 10 nodes stores it.
In technical terms:
- hash(key) % N → which of N nodes stores the key
- Seems simple: with 10 nodes, hash(key) % 10 gives node 0-9
Why it matters: DON'T DO THIS! The mod N approach fails catastrophically:
- Add or remove one node → nearly every key moves
- Massive data migration for a simple cluster change
[Figure: rebalancing with hash mod N. With 3 nodes, keys map to hash % 3 = 0, 1, 2; with 4 nodes, to hash % 4 = 0-3. ⚠️ Most keys move: hash=6 moves from Node 0 to Node 2, hash=9 moves from Node 0 to Node 1.]
The core problem:
- N changes → most keys change shards (see the simulation below)
- Easy to compute, but terrible for rebalancing
- We need an approach that moves data as little as possible
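A tiny simulation (a sketch over uniformly spread hash values) shows the scale of the problem: growing from 3 to 4 nodes moves roughly three quarters of all keys.

```python
def node_for(key_hash: int, num_nodes: int) -> int:
    return key_hash % num_nodes

# Count how many of 10,000 hash values land on a different node
# when the cluster grows from 3 to 4 nodes.
moved = sum(1 for h in range(10_000) if node_for(h, 3) != node_for(h, 4))
print(f"{moved / 10_000:.0%} of keys moved")  # about 75%
```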
6.2. Fixed number of shards
In plain English: Instead of assigning keys directly to nodes, create many more "buckets" (shards) than you have nodes, then assign multiple buckets to each node. When you add a node, just reassign some buckets; the keys inside each bucket don't move.
In technical terms:
- Create many more shards than nodes (e.g., 1,000 shards for 10 nodes)
- Each node gets 100 shards
- Key → hash(key) % 1,000 → shard number
- System tracks the shard → node mapping separately
Why it matters: Dramatically reduces rebalancing cost:
- Move entire shards (cheap bulk copies)
- Not individual keys (expensive random I/O)
[Figure: rebalancing with a fixed number of shards. Node 0 holds shards 0-99, node 1 holds 100-199, node 2 holds 200-299; a new node takes over a few shards from each existing node (e.g., 91-99 and 191-199). ✅ Only entire shards move; keys stay in the same shard number.]
How rebalancing works:
- Add node → reassign some shards to the new node
- Remove node → redistribute its shards to the remaining nodes
- Only the shard → node mapping changes; keys don't move between shards (see the sketch below)
- Transfer takes time; the old mapping is used during migration
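A minimal sketch of the scheme (the shard count, node names, and hash function are illustrative assumptions):

```python
import hashlib

NUM_SHARDS = 1_000  # fixed up front; many more shards than nodes

def shard_of(key: str) -> int:
    """Key -> shard number. This mapping never changes, even when nodes come and go."""
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")
    return h % NUM_SHARDS

# The shard -> node mapping is tracked separately; rebalancing edits only this table.
shard_to_node = {s: f"node{s % 3}" for s in range(NUM_SHARDS)}  # 3 nodes to start

def node_for(key: str) -> str:
    return shard_to_node[shard_of(key)]

# Adding a fourth node: hand it a quarter of the shards. Keys keep their shard
# numbers; only whole shards (and their files) are copied to the new node.
for s in range(0, NUM_SHARDS, 4):
    shard_to_node[s] = "node3"

print(node_for("user:12345"))
```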
💡 Insight
Choosing the right number of shards is an art. Too few shards and you can't scale past a certain number of nodes. Too many shards and overhead becomes significant. A common heuristic: start with 10-100x more shards than initial nodes, allowing room to grow.
Best practices for shard count:
- Choose a number divisible by many factors (not just powers of 2)
- Assign more shards to powerful nodes for weighted load distribution
- Databases using this: Citus, Riak, Elasticsearch, Couchbase
Limitations:
- Can't have more nodes than shards
- Resharding is expensive (splits every shard, lots of disk I/O)
- Some systems require downtime for resharding
The "shard size" dilemma:
- With a fixed shard count, each shard holds a fixed fraction of the total data, so shards grow as the dataset grows
- Shards too large → expensive rebalancing and recovery
- Shards too small → overhead becomes significant
- Hard to get "just right" when the data size varies
6.3. Sharding by hash range
In plain English: Combine the best of both worlds: use a hash to scramble keys (avoiding hot spots), then use ranges of hash values instead of individual shards.
In technical terms:
- Can't predict the shard count? → need adaptive sharding
- Key-range adapts but creates hot spots
- Solution: Range of hash values instead of range of keys
Why it matters:
- ✅ Adapts to data growth (like key-range)
- ✅ Distributes load evenly (like hash)
- ❌ Loses efficient range queries over the original keys
[Figure: a 16-bit hash space (0 to 65,535) split into ranges 0-16,383, 16,384-32,767, 32,768-49,151, and 49,152-65,535. Adjacent keys such as "2025-01-15" and "2025-01-16" hash to unrelated values and land in different shards.]
How it works:
- Similar keys (e.g., consecutive timestamps) → uniformly distributed hashes
- Assign hash ranges to shards: 0-16,383 → Shard 0, etc.
- Shards split when too big (expensive, but adaptive)
Range queries with hash-range sharding:
- Range over the partition key? → keys are scattered across shards
- Range over secondary columns? → works if the partition key is held constant
- Example: (user_id=123, timestamp BETWEEN ...) → all rows in the same shard
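Putting the two ideas together, a sketch of hash-range routing (the 16-bit hash space and boundaries mirror the ranges described above; the hash function is an assumption):

```python
import bisect
import hashlib

# Exclusive upper bounds of each shard's hash range within a 16-bit hash space.
UPPER_BOUNDS = [16_384, 32_768, 49_152, 65_536]

def hash16(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:2], "big")

def shard_of(key: str) -> int:
    """Hash the key, then find the shard whose hash range contains the value."""
    return bisect.bisect_right(UPPER_BOUNDS, hash16(key))

# Adjacent keys hash to unrelated values, so they scatter across shards:
for day in ["2025-01-15", "2025-01-16", "2025-01-17"]:
    print(day, "-> shard", shard_of(day))
```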
7. Partitioning and Range Queries in Data Warehouses
Data warehouse terminology:
| System | Partition Concept | Sorting Concept |
|---|---|---|
| BigQuery | partition key | cluster columns |
| Snowflake | micro-partitions (auto) | cluster keys |
| Delta Lake | manual or auto partitions | cluster keys |
Benefits of clustering:
- Improved range scan performance
- Better compression
- More efficient filtering
Hash-range sharding in practice:
- Used by: YugabyteDB, DynamoDB, MongoDB (option)
- Cassandra: 8 ranges per node (default)
- ScyllaDB: 256 ranges per node
- Random boundaries → some ranges are larger than others, but multiple ranges per node balance it out
[Figure: hash space 0-1023 split into ranges with random boundaries (e.g., 0-125, 126-339, 340-502, 503-600, 601-779, 780-890, 891-1023). Some ranges are larger than others, but multiple ranges per node balance out the size differences.]
Adding/removing nodes:
- Range boundaries added/removed, shards split/merged
- New node gets fair share of data from existing nodes
- Minimal data movement required
7.1. Consistent hashing
In plain English: Consistent hashing is like a smart shuffling algorithm that tries to keep cards in the same pile when you add or remove piles.
In technical terms: Two key properties:
- Keys mapped roughly equally to each shard
- When shard count changes, minimal keys move
Why it matters: "Consistent" means stable assignment, NOT ACID consistency or replica consistency.
💡 Insight
Despite its name, consistent hashing doesn't guarantee consistency in the ACID sense. It simply tries to keep a key in the same shard across cluster changes. Multiple algorithms achieve this goal: Cassandra's token ring, rendezvous hashing, and jump consistent hash are all "consistent hashing" variants with different tradeoffs.
Consistent hashing algorithms:
| Algorithm | How it works |
|---|---|
| Cassandra/ScyllaDB | Token ring; split existing shards for new nodes |
| Rendezvous (HRW) | New node gets individual keys from all nodes |
| Jump consistent hash | Mathematically minimal key movement |
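As one concrete variant, rendezvous (highest-random-weight) hashing fits in a few lines: each key is owned by whichever node scores highest for it, so adding or removing a node only moves the keys that node gains or loses. This is a sketch, not any particular database's implementation.

```python
import hashlib

def score(node: str, key: str) -> int:
    """Deterministic pseudo-random weight for a (node, key) pair."""
    return int.from_bytes(hashlib.md5(f"{node}/{key}".encode()).digest()[:8], "big")

def owner(key: str, nodes: list[str]) -> str:
    """Rendezvous hashing: the key belongs to its highest-scoring node."""
    return max(nodes, key=lambda n: score(n, key))

nodes = ["node0", "node1", "node2"]
print(owner("user:12345", nodes))
# Remove node1: only keys that node1 owned get a new owner, because the
# relative scores of the remaining nodes are unchanged.
print(owner("user:12345", [n for n in nodes if n != "node1"]))
```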
8. Skewed Workloads and Relieving Hot Spots
In plain English: Imagine a celebrity tweets and gets 10 million replies. Even perfect sharding can't help if all those writes go to a single key (the celebrity's tweet ID).
In technical terms:
- Consistent hashing distributes keys evenly
- But load can still be skewed (some keys hotter than others)
- Overloaded shards while others sit idle
Why it matters: Hot keys are a fundamental limit of sharding:
- Adding more nodes doesn't help if all traffic goes to one key
- Must split the key itself (adds complexity)
Example: Celebrity post on social media → millions of comments on one key
Solutions for hot keys:
- Dedicated shard: give the hot key its own shard or machine
- Key splitting: append a random suffix (0-99) to split writes across 100 keys
Key splitting tradeoffs:
| Aspect | Writes | Reads |
|---|---|---|
| Load distribution | ✅ Split across 100 shards | ❌ Must query all 100 keys and merge |
| Complexity | ✅ Simple append | ❌ Requires bookkeeping of which keys are split |
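A hypothetical sketch of key splitting for a hot counter (the suffix scheme, split factor, and key names are assumptions): each write picks one of 100 sub-keys at random, while a read has to gather and sum all of them.

```python
import random
from collections import defaultdict

store = defaultdict(int)   # stand-in for a sharded key-value store
NUM_SPLITS = 100           # the hot key is split into 100 sub-keys

def increment(hot_key: str) -> None:
    """Writes go to one randomly chosen sub-key, spreading load across shards."""
    store[f"{hot_key}#{random.randrange(NUM_SPLITS)}"] += 1

def read(hot_key: str) -> int:
    """Reads must gather and merge every sub-key -- the price of splitting."""
    return sum(store[f"{hot_key}#{i}"] for i in range(NUM_SPLITS))

for _ in range(10_000):
    increment("post:42:reply_count")
print(read("post:42:reply_count"))  # 10000
```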
💡 Insight
Hot key handling is often application-specific. A social media platform might cache celebrity posts differently than regular posts. A voting system might use a different aggregation strategy for popular items. The database can detect hot keys, but the application usually needs to implement the mitigation.
Additional challenges:
- Load changes over time (viral post → high load for days → calm)
- Some keys are hot for writes, others hot for reads → different strategies
- Cloud services automate this: Amazon's "heat management" / "adaptive capacity"
9. Operations: Automatic or Manual Rebalancing
In plain English: Should your database automatically split shards and move them around, or should a human operator make those decisions? It's the difference between autopilot and manual flying.
In technical terms: Automatic vs. manual shard management?
Why it matters: Automatic rebalancing risks cascading failures:
- One slow node triggers rebalancing
- Rebalancing adds load to other nodes
- Other nodes trigger more rebalancing
- Spiral → cluster-wide outage
Approaches:
- Fully automatic: no human interaction needed
- Fully manual: admin configures everything
- Hybrid (Couchbase, Riak): the system suggests a rebalancing, a human commits it
| | Automatic | Manual |
|---|---|---|
| Pros | Less operational work; can auto-scale with load | Predictable operations; human prevents mistakes |
| Cons | Unpredictable behavior; risk of cascading failures | Slower to respond; requires operator expertise |
Automatic rebalancing pros:
- Less operational work
- Auto-scale with workload (DynamoDB: add/remove shards in minutes)
Automatic rebalancing cons:
- Unpredictable: rebalancing is expensive (rerouting + data movement)
- Can overload network/nodes, harm other requests
- If near max write throughput, splitting can't keep up with writes
The danger: automation + failure detection:
- Node temporarily slow (overloaded)
- Other nodes think it's dead
- Auto-rebalance to move load away
- Additional load makes things worse
- Cascading failure!
Recommendation: Human in the loop is slower but prevents surprises.
10. Request Routing
In plain English: When a client wants to read key "user:12345", how does it know which server to contact? This is like calling directory assistance.
In technical terms: How does a client know which node (IP address, port) to connect to?
Why it matters: Request routing is the glue that makes sharding work. Without it, clients must query ALL nodes.
Key difference from service discovery:
- Stateless services: any instance can handle any request
- Sharded databases: only the node with that shard can handle the request
Three routing approaches:
- Any node forwarding: the client contacts any node; that node forwards to the correct shard if needed
- Routing tier: a dedicated shard-aware load balancer determines the correct node
- Smart client: the client knows the shard → node mapping and connects directly
Key problems to solve:
| Problem | Challenge |
|---|---|
| Who decides shard placement? | Need fault-tolerant coordinator (avoid split-brain) |
| How to learn about changes? | Routing component needs shard → node updates |
| Migration in progress? | Handle in-flight requests during shard moves |
💡 Insight
Coordination services like ZooKeeper solve the routing problem by providing a single source of truth for shard assignments. All routing decisions (whether in clients, nodes, or routing tier) subscribe to ZooKeeper to learn about changes. This is essentially distributed consensus (Chapter 10) in action.
[Figure: a consensus-based coordination service holds the shard assignments; all components get real-time updates when shard assignments change.]
Coordination services for shard routing:
| System | Coordination Method |
|---|---|
| HBase, SolrCloud | ZooKeeper |
| Kubernetes | etcd |
| MongoDB | Config servers + mongos daemons |
| Kafka, YugabyteDB, TiDB | Built-in Raft consensus |
| Cassandra, ScyllaDB, Riak | Gossip protocol (weaker consistency) |
How it works:
- Nodes register in ZooKeeper/etcd
- ZooKeeper maintains the authoritative shard → node mapping
- Clients/routing tier subscribe for updates
- Changes are pushed in real-time
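A minimal sketch of the routing table a smart client or routing tier might hold (the addresses and shard count are made up, and the update callback stands in for a real watch on ZooKeeper or etcd):

```python
import hashlib

# shard -> node address, as published by the coordination service (assumed values)
routing_table = {0: "10.0.0.1:5432", 1: "10.0.0.2:5432",
                 2: "10.0.0.3:5432", 3: "10.0.0.3:5432"}

def shard_of(key: str, num_shards: int = 4) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big") % num_shards

def route(key: str) -> str:
    """Return the address of the node currently responsible for the key's shard."""
    return routing_table[shard_of(key)]

def on_assignment_change(shard: int, new_node: str) -> None:
    """In a real system this would be invoked by a watch/subscription callback."""
    routing_table[shard] = new_node

print(route("user:12345"))
on_assignment_change(2, "10.0.0.4:5432")  # shard 2 migrated to a new node
print(route("user:12345"))                # reflects the new assignment if the key lives on shard 2
```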
Gossip vs. consensus:
- Gossip = weaker consistency, possible split-brain
- Leaderless DBs can tolerate this (they have weak guarantees anyway)
Client IP discovery:
- DNS is usually sufficient (IPs change less frequently than shard assignments)
Note: OLTP = single-shard queries. Analytics = parallel queries across all shards (Chapter 11).
11. Sharding and Secondary Indexes
In plain English: Primary key lookups are easy: you know the key, you know the shard. But what about queries like "find all red cars"? The red cars are scattered across all shards.
In technical terms:
- Partition key = easy routing (one shard)
- Secondary index = records scattered across all shards
Why it matters: No free lunch!
- Local indexes: fast writes, slow reads (scatter-gather)
- Global indexes: fast reads, slow writes (distributed updates)
Secondary indexes:
- Not unique identifiers: used to search for values (find all red cars)
- Common in relational DBs, document DBs, full-text search (Solr, Elasticsearch)
- Two approaches: local vs. global
11.1. Local Secondary Indexes
In plain English: Local indexes are like each library branch maintaining its own card catalog: it only knows about books in that branch. To find all copies, you call every branch.
In technical terms:
- Example: Used car website, partition by listing ID
- Need to search by color and make
- Each shard maintains indexes only for its own records
- Postings list: color:red → [12, 87, 234] (IDs of red cars in this shard)
[Figure: local secondary indexes. Shard 0 (car IDs 0-499) indexes color:red → [12, 87, 234] and make:honda → [12, 156]; shard 1 (car IDs 500-999) indexes color:red → [543, 789] and make:honda → [612]. A query for "find red cars" must hit both shards and merge the results: [12, 87, 234, 543, 789].]
⚠️ Warning: DIY secondary indexes in application code are dangerous:
- Race conditions → data out of sync
- Intermittent write failures → index inconsistency
- See "The need for multi-object transactions"
How local indexes work:
- Each shard = completely separate
- Writes only touch one shard
- Also called "document-partitioned index"
Reading from local indexes:
| Query type | Behavior |
|---|---|
| Know partition key | Query only that shard ✅ |
| Want all results | Query ALL shards and merge results ❌ |
💡 Insight
Local secondary indexes suffer from tail latency amplification. If you query 100 shards in parallel and 99 respond in 10ms but one takes 500ms, your overall query takes 500ms. This is why adding more shards doesn't improve query throughput for secondary index queries: you still hit every shard.
Scalability limitation:
- More shards = more storage
- But NOT more query throughput (every query hits every shard)
Used by: MongoDB, Riak, Cassandra, Elasticsearch, SolrCloud, VoltDB
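A sketch of the scatter/gather read path with local (document-partitioned) indexes; the shard contents mirror the hypothetical car-listing example above:

```python
# Each shard keeps postings lists only for its own records.
shards = [
    {"color:red": [12, 87, 234], "make:honda": [12, 156]},  # shard 0: car IDs 0-499
    {"color:red": [543, 789],    "make:honda": [612]},      # shard 1: car IDs 500-999
]

def find(term: str) -> list[int]:
    """Scatter the query to every shard, then gather and merge the results."""
    results: list[int] = []
    for local_index in shards:
        results.extend(local_index.get(term, []))
    return sorted(results)

print(find("color:red"))  # [12, 87, 234, 543, 789] -- every shard had to be consulted
```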
11.2. Global Secondary Indexes
In plain English: Global indexes are like a central library catalog that knows about books in all branches. The catalog itself is split across multiple computers, but each entry points to ALL matching locations.
In technical terms:
- Index covers data in ALL shards
- Index itself must be sharded (too big for one node)
- Can be sharded differently from primary data (term-partitioned)
Why it matters: Trade write complexity for read efficiency:
- ✅ A single lookup tells you all matching records
- ❌ Writes must update index shards on different nodes
[Figure: global (term-partitioned) secondary indexes. Data shards hold cars 0-499 and 500-999. Index shard A-R holds color:red → [12, 87, 234, 543, 789] and make:honda → [12, 156, 612]; index shard S-Z holds color:silver → [45, 234, 891] and make:toyota → [67, 234]. A query for "find red cars" hits only index shard A-R and returns all matching IDs immediately.]
How global indexes are partitioned:
- Index partitioned by term (term-partitioned)
- Colors A-R → Index Shard 0, Colors S-Z → Index Shard 1
- Looking for a term? → you know which index shard to query
Query behavior:
| Query Type | Global Index Behavior |
|---|---|
| Single condition (color=red) | ✅ Query one index shard |
| Fetch actual records | ❌ Still need to hit multiple data shards |
| Multiple conditions (AND) | ❌ Terms on different shards, compute intersection |
💡 Insight
Global secondary indexes often lag behind the primary data. When you write a car listing, updating the primary shard is fast (one network hop), but updating the global index shards asynchronously prevents the write from blocking. This means searches may briefly not find newly added items; eventual consistency strikes again.
Write challenges:
- One record write → may update multiple index shards
- Hard to keep index in sync
- Option: distributed transactions (Chapter 8)
Used by: CockroachDB, TiDB, YugabyteDB, DynamoDB (both local and global)
When to use global indexes:
- Read throughput >> write throughput
- Postings lists are short
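For contrast, a sketch of a term-partitioned (global) index using the same hypothetical car listings: the term alone determines which index shard to read, but a single write may have to update index shards on different nodes.

```python
# Terms whose indexed value starts with a-r live on index shard 0, s-z on index shard 1.
index_shards = [
    {"color:red": [12, 87, 234, 543, 789], "make:honda": [12, 156, 612]},
    {"color:silver": [45, 234, 891], "make:toyota": [67, 234]},
]

def index_shard_for(term: str) -> int:
    value = term.split(":", 1)[1]           # partition by the indexed value
    return 0 if value[0].lower() <= "r" else 1

def find(term: str) -> list[int]:
    """A single lookup in a single index shard returns the whole postings list."""
    return index_shards[index_shard_for(term)].get(term, [])

def add_listing(car_id: int, color: str, make: str) -> None:
    """One write appends to one postings list per term -- potentially on several nodes."""
    for term in (f"color:{color}", f"make:{make}"):
        index_shards[index_shard_for(term)].setdefault(term, []).append(car_id)

print(find("color:red"))              # [12, 87, 234, 543, 789] from one index shard
add_listing(1001, "silver", "honda")  # touches both index shards
```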
12. Summary
What we covered:
- Different ways to split large datasets into smaller subsets (shards)
- Why sharding is necessary when data exceeds single-machine capacity
- The goal: spread data and query load evenly, avoiding hot spots
- Key decisions: choosing sharding scheme and rebalancing strategies
Two Main Sharding Approaches
Key Range Sharding:
- Keys are sorted; each shard owns keys from min to max
- ✅ Efficient range queries possible
- ❌ Risk of hot spots with sequential access patterns
- Rebalancing: split large shards into two subranges
Hash Sharding:
- Hash function applied to each key; shard owns range of hash values
- ✅ Distributes load more evenly
- ❌ Destroys key ordering, making range queries inefficient
- Rebalancing: typically create fixed shards upfront, move entire shards between nodes
| | Key Range | Hash |
|---|---|---|
| Pros | Efficient range queries; dynamic shard count | Even load distribution; predictable performance |
| Cons | Sequential writes create hot spots; expensive to split shards | Range queries inefficient; fixed shard count (or expensive resharding) |
Common pattern: Compound keys
- Use first part of key as partition key (identifies shard)
- Sort records within shard by the rest of the key
- Result: efficient range queries within same partition key
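A sketch of the compound-key pattern (names, shard count, and hash function are illustrative): the first part of the key picks the shard, and records within the shard stay sorted by the rest of the key, so per-sensor time-range queries hit a single shard.

```python
import bisect
import hashlib

NUM_SHARDS = 8
shards = [[] for _ in range(NUM_SHARDS)]  # each shard keeps (sort_key, value) pairs sorted

def shard_of(partition_key: str) -> int:
    return int.from_bytes(hashlib.md5(partition_key.encode()).digest()[:4], "big") % NUM_SHARDS

def write(partition_key: str, sort_key: str, value: str) -> None:
    bisect.insort(shards[shard_of(partition_key)], (sort_key, value))

def range_query(partition_key: str, start: str, end: str) -> list[tuple[str, str]]:
    """All records for one partition key in a range come from one shard, in sorted order."""
    return [(k, v) for k, v in shards[shard_of(partition_key)] if start <= k <= end]

write("sensor-17", "2025-03-01T00:00", "21.3C")
write("sensor-17", "2025-03-02T00:00", "20.9C")
print(range_query("sensor-17", "2025-03-01", "2025-03-31"))
```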
Secondary Indexes
Local Secondary Indexes:
- Stored in same shard as primary key/value
- ✅ Single shard update on write
- ❌ Reads require scatter-gather to all shards
Global Secondary Indexes:
- Sharded separately by indexed values
- ❌ One write may update multiple index shards
- ✅ Read the postings list from a single shard (but fetching records still hits multiple shards)
Query Routing
- Routes queries to appropriate shards
- Coordination service (e.g., ZooKeeper) tracks shard-to-node assignments
💡 Insight
Sharding represents a fundamental tradeoff in distributed systems: you gain scalability at the cost of simplicity. Operations that were simple on a single machine (joins, secondary indexes, transactions) become complex across shards. The key to successful sharding is choosing partition keys that align with your query patterns: shard by tenant if queries are per-tenant, by user if queries are per-user, or by timestamp if queries are time-range scans.
The Big Picture
Why sharding works:
- Each shard operates mostly independently
- This independence enables horizontal scaling
What gets harder:
- Cross-shard writes become problematic
- What if write to one shard succeeds but another fails?
- Answer: Transactions (Chapter 8)