Chapter 10. Consistency and Consensus
An ancient adage warns, "Never go to sea with two chronometers; take one or three."
Table of Contents
- 1. Introduction
- 2. Linearizability
- 3. ID Generators and Logical Clocks
  - 3.1. Logical Clocks
  - 3.2. Linearizable ID Generators
- 4. Consensus
- 5. Summary
1. Introduction
In plain English: When you replicate data across multiple machines for fault tolerance, those copies can easily get out of sync. There are two philosophical approaches: either expose this complexity to developers (eventual consistency) or hide it behind strong guarantees (strong consistency). This chapter is about the strong consistency approach.
In technical terms: Lots of things can go wrong in distributed systems. If we want a service to continue working correctly despite those things going wrong, we need to find ways of tolerating faults. One of the best tools we have is replication—but having multiple copies of data opens up the risk of inconsistencies.
Why it matters: Strong consistency makes distributed systems behave like single-node systems, dramatically simplifying application logic. However, it comes with performance costs and availability trade-offs that you must understand to make informed architectural decisions.
At a high level, there are two competing philosophies for dealing with replication issues:
Eventual consistency:
- Multi-leader and leaderless replication
- Higher availability and performance
- More complex application logic
- Works with offline-capable apps
Strong consistency:
- Single-leader or consensus-based replication
- Simpler application development
- Performance cost and reduced availability
- Requires reliable network communication
💡 Insight
The choice between eventual and strong consistency is fundamentally about where you place complexity: in the infrastructure (consensus algorithms) or in the application code (conflict resolution). Neither is inherently better—it depends on your requirements, network reliability, and whether your application can tolerate stale reads.
In this chapter we will dive deeper into the strongly consistent approach, looking at three areas:
- Linearizability — A precise definition of "strong consistency" that makes replicated systems behave like single-node systems
- ID Generators and Timestamps — How to generate unique, correctly ordered identifiers in distributed systems
- Consensus Algorithms — How distributed systems can achieve linearizability while remaining fault-tolerant
Along the way, we will see that there are fundamental limits on what is possible in a distributed system.
2. Linearizability
In plain English: Imagine a database that looks like a single magic box, even though it's actually many machines. When one person writes something, everyone else immediately sees that new value. No stale reads, no confusion. That's linearizability.
In technical terms: Linearizability (also known as atomic consistency, strong consistency, immediate consistency, or external consistency) is the idea of making a system appear as if there were only one copy of the data, and all operations on it are atomic. With this guarantee, even though there may be multiple replicas in reality, the application does not need to worry about them.
Why it matters: In a linearizable system, as soon as one client successfully completes a write, all clients reading from the database must be able to see the value just written. This recency guarantee prevents confusing anomalies where users see data go "back in time."
Example: Nonlinearizable Sports Website
Aaliyah and Bryce are in the same room, both checking their phones for the final score of a game. Aaliyah refreshes the page, sees the result, and announces it. Bryce then hits reload on his own phone, but his request is routed to a lagging replica that still shows the game as in progress. If Aaliyah and Bryce had hit reload at the same time, different results would be unsurprising. However, Bryce knows that he hit reload after hearing Aaliyah exclaim the final score, and therefore he expects his query result to be at least as recent as Aaliyah's. The fact that his query returned a stale result is a violation of linearizability.
2.1. What Makes a System Linearizable?
Let's look at three clients concurrently reading and writing the same object x (could be one key in a key-value store, one row in a database, etc.).
Key operations:
- read(x) ⇒ v — Client reads register x, database returns value v
- write(x, v) ⇒ r — Client sets register x to value v, database returns response r (ok or error)
- cas(x, v_old, v_new) ⇒ r — Atomic compare-and-set: only updates x if its current value equals v_old
Linearizability rules:
- Before write completes: Reads return old value (0)
- After write completes: All reads must return new value (1)
- During write (concurrent): Reads may return either old or new value
- Once any read returns new value: All subsequent reads (on any client) must also return new value
💡 Insight
Think of linearizability as requiring an atomic "flip" moment during every write. Even if the write operation takes time to process, there must be some instant when the value switches from old to new. After that instant, no client can ever see the old value again—even if the write operation hasn't finished confirming to the writer.
The final read by Client B (returning 2) is not linearizable because Client A already read value 4 before B's read started. This is the same Aaliyah-and-Bryce situation: if one client sees a newer value, all subsequent reads must see at least that new value.
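To make the register interface above concrete, here is a minimal single-process sketch in Python (the class name and return values are illustrative assumptions, not an API from the book). A lock makes each operation take effect atomically, which is exactly the single-copy behavior that linearizability asks a replicated system to emulate.

```python
import threading

class LinearizableRegister:
    """Toy single-process register with the read/write/cas operations
    described above. A lock makes each operation take effect atomically,
    so there is a single instant at which every write "flips"."""

    def __init__(self, initial=0):
        self._value = initial
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def write(self, v):
        with self._lock:
            self._value = v
            return "ok"

    def cas(self, v_old, v_new):
        # Atomic compare-and-set: update only if the current value matches v_old.
        with self._lock:
            if self._value == v_old:
                self._value = v_new
                return "ok"
            return "error"
```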
2.2. Linearizability Versus Serializability
These are easily confused but quite different!
In plain English: Serializability is about multi-object transactions behaving as if they ran one-at-a-time (even if they actually overlapped). Linearizability is about single-object operations appearing instantaneous and respecting real-time order.
Combining both: A database may provide both serializability and linearizability—this combination is known as strict serializability or strong one-copy serializability (strong-1SR). Single-node databases typically provide this. With distributed databases, it requires expensive coordination.
2.3. Relying on Linearizability
In what circumstances is linearizability useful?
Locking and leader election:
- Must be linearizable—no split brain allowed
- ZooKeeper and etcd use consensus for this
- Implements distributed leases
- Fencing tokens prevent zombie leaders (a sketch follows the example below)
Constraints and uniqueness guarantees:
- Username must be unique
- File path must be unique
- Similar to acquiring a "lock" on the username
- Requires atomic compare-and-set
Cross-channel timing dependencies:
- Web upload → queue → transcoder reads from storage
- If storage is non-linearizable, race condition possible
- Transcoder might fetch old version or nothing
- Push notifications + data fetch have same issue
Example: Video transcoding race condition
This problem arises because there are two different communication channels between the web server and transcoder: the file storage and the message queue. Without linearizability, race conditions between these channels are possible.
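As a sketch of the fencing-token idea mentioned under locking and leader election (the class and method names are assumptions, not a real API), a storage service can simply reject writes that carry a token older than one it has already seen:

```python
class FencedStorage:
    """Toy storage service that rejects writes carrying a stale fencing token.
    Tokens are assumed to be issued by the lock service in increasing order
    each time the lease or lock is granted."""

    def __init__(self):
        self._data = {}
        self._highest_token_seen = {}

    def write(self, key, value, fencing_token):
        # Reject any write whose token is older than one already processed:
        # a "zombie" leader holding an expired lease cannot overwrite newer data.
        if fencing_token < self._highest_token_seen.get(key, 0):
            raise PermissionError("stale fencing token: lease no longer valid")
        self._highest_token_seen[key] = fencing_token
        self._data[key] = value
```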
2.4. Implementing Linearizable Systems
Let's revisit the replication methods from Chapter 6 and compare their linearizability:
- Single-leader replication: potentially linearizable, if you read from the leader (or from synchronously updated followers) and you are sure who the leader really is
- Consensus algorithms: linearizable by design
- Multi-leader replication: not linearizable, since writes are accepted concurrently on multiple nodes
- Leaderless replication: probably not linearizable, as discussed next
Linearizability and Quorums
Intuitively, quorum reads and writes (w + r > n) seem like they should be linearizable. However, they are not:
The quorum condition is met (w + r > n), but this execution is not linearizable: B's request begins after A's completes, but B returns the old value while A returns the new value.
To make Dynamo-style quorums linearizable (at significant performance cost):
- Readers must perform synchronous read repair before returning
- Writers must read latest state from quorum to fetch the latest timestamp
- Even then, compare-and-set cannot be implemented (requires consensus)
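A rough sketch of the synchronous read repair step described above, assuming hypothetical replica objects with get/put methods (a real client would issue these requests in parallel and handle failures):

```python
def quorum_read_with_repair(replicas, key, r):
    """Sketch of a quorum read that performs synchronous read repair before
    returning. Each replica is assumed to expose get(key) -> (value, timestamp)
    and put(key, value, timestamp)."""
    # Contact r replicas (here simply the first r, for brevity).
    responses = [(rep, rep.get(key)) for rep in replicas[:r]]
    # Pick the value with the latest timestamp among the responses.
    latest_value, latest_ts = max((resp for _, resp in responses),
                                  key=lambda resp: resp[1])
    # Synchronously write the newest value back to any stale replica before
    # returning, so that a later read cannot observe an older value.
    for rep, (value, ts) in responses:
        if ts < latest_ts:
            rep.put(key, latest_value, latest_ts)
    return latest_value
```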
💡 Insight
It is safest to assume that a leaderless system with Dynamo-style replication does not provide linearizability, even with quorum reads and writes.
2.5. The Cost of Linearizability
Consider a multi-region deployment with a network partition between regions:
With single-leader replication:
- Clients in the follower region cannot make writes (unavailable)
- Linearizable reads also unavailable (must contact leader)
- Only clients in the leader region can continue working normally
With multi-leader replication:
- Both regions can accept writes (available)
- Writes are queued and exchanged when network recovers
- But this is not linearizable
The CAP Theorem
In plain English: When your network is broken, you must choose: either maintain consistency (linearizability) but become unavailable in some regions, or remain available everywhere but give up consistency.
More precisely: The CAP theorem observes that if your application requires linearizability, and some replicas are disconnected due to a network problem, then those replicas cannot process requests (they become unavailable). This choice is sometimes known as CP (consistent under partitions).
If your application does not require linearizability, then each replica can process requests independently even when disconnected (AP: available under partitions).
Why CAP is unhelpful:
- It considers only one consistency model (linearizability) and one kind of fault (network partitions), and says nothing about network delays, crashed nodes, or other trade-offs
- "Pick 2 out of 3" is misleading: you cannot opt out of partitions; when one happens, the only real choice is between linearizability and availability
💡 Insight
CAP theorem has been superseded by more precise impossibility results. It's of mostly historical interest today. The real trade-off is not about network partitions (which are inevitable) but about choosing between linearizability and low latency.
Linearizability and Network Delays
Surprisingly, even RAM on modern multi-core CPUs is not linearizable! If a thread on one CPU core writes to a memory address, and a thread on another core reads the same address shortly after, it may not see the new value (unless a memory barrier is used).
Why? Every CPU core has its own cache and store buffer. Accessing cache is much faster than main memory, so this is essential for performance. But there are now multiple copies of data, asynchronously updated—linearizability is lost.
Key insight: The reason for dropping linearizability is performance, not fault tolerance.
The same is true of distributed databases that choose not to provide linearizable guarantees: they do so primarily to increase performance, not for fault tolerance.
Attiya and Welch proved: If you want linearizability, the response time of read and write requests is at least proportional to the uncertainty of delays in the network. In a network with highly variable delays (like most computer networks), linearizable reads and writes will inevitably be slow.
A faster algorithm for linearizability does not exist, but weaker consistency models can be much faster.
3. ID Generators and Logical Clocks
In plain English: In a single-node database, you can generate unique IDs by just incrementing a counter (1, 2, 3, ...). This is fast, compact, and the order tells you which records were created first. But in a distributed system, this simple approach doesn't work—you need more sophisticated techniques.
In technical terms: Many applications need to assign unique IDs to database records when they are created. In single-node databases, an auto-incrementing integer is common (64-bit or 32-bit). Another advantage is that the order of IDs reflects the order records were created.
Why it matters: Incorrectly ordered IDs can cause serious bugs, like showing a private photo to unauthorized viewers because its timestamp was assigned out of order relative to an account permission change.
Single-Node ID Generator Problems
A single-node ID generator is linearizable (each fetch-and-add operation atomically increments and returns the counter), but it has issues:
- It is a single point of failure: if the node is down, no new IDs can be generated
- It can become a throughput bottleneck, and clients in other regions must pay a network round trip for every ID
Alternative ID Generation Schemes
Common alternatives include preallocating blocks of IDs to each node, generating random UUIDs, and combining a wall-clock timestamp with a node ID and a counter (as in Snowflake-style schemes).
All these schemes generate unique IDs but have much weaker ordering guarantees than single-node auto-increment. As discussed in Chapter 9, wall-clock timestamps can provide at best approximate ordering due to clock skew and jumps.
3.1. Logical Clocks
In plain English: Instead of measuring time in seconds, a logical clock measures time in events. It doesn't tell you "what time is it?" but it can tell you "which happened first?"
In technical terms: While a physical clock is a hardware device that counts seconds (or milliseconds, microseconds), a logical clock is an algorithm that counts events that have occurred. Timestamps from a logical clock can be compared to determine ordering.
Requirements for a logical clock:
- Compact — A few bytes in size
- Unique — Every timestamp is different
- Totally ordered — Any two timestamps can be compared
- Causally consistent — If operation A happened before B, then A's timestamp < B's timestamp
Lamport Timestamps
How it works:
- Each node has a unique identifier and keeps a counter of the number of operations it has processed
- A Lamport timestamp is the pair (counter, node ID); node IDs break ties, so every timestamp is unique
- Every node and client tracks the maximum counter value it has seen so far and includes it in every request
- When a node receives a counter value greater than its own, it immediately increases its own counter to that maximum
(A minimal sketch follows the insight below.)
Comparing Lamport timestamps:
(2, "bryce") > (1, "aaliyah") // Compare counter first
(1, "caleb") > (1, "aaliyah") // If equal, compare node ID
💡 Insight
Lamport timestamps capture the happens-before relationship: if A happened before B (in the causal sense), then A's timestamp will be less than B's. But the reverse is not true—if A's timestamp is less than B's, A might have happened before B, or they might be concurrent.
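Here is a minimal Lamport clock sketch (illustrative Python, not code from the book), with timestamps as (counter, node ID) tuples so that Python's tuple comparison matches the comparison rules above:

```python
class LamportClock:
    """Minimal Lamport clock: a counter plus a node ID. Timestamps are
    (counter, node_id) tuples, compared lexicographically."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def now(self):
        # Generating a timestamp counts as an event: increment first.
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, remote_timestamp):
        # On receiving a timestamp from another node, move our counter
        # forward so that later local timestamps are greater than it.
        remote_counter, _ = remote_timestamp
        self.counter = max(self.counter, remote_counter)

# (2, "bryce") > (1, "aaliyah") and (1, "caleb") > (1, "aaliyah")
# both hold under Python's tuple comparison, matching the rules above.
```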
Hybrid Logical Clocks
Hybrid logical clocks combine physical time-of-day clocks with Lamport clock ordering (see the sketch at the end of this subsection):
Physical part:
- Counts microseconds like a normal clock
- Can find events by date/time
- Slowly drifts with the underlying physical clock
Logical part:
- Increments on every timestamp generation
- Moves forward if a remote timestamp is greater
- Ensures monotonic progress even if the clock jumps backward
Benefits:
- Treats timestamp almost like conventional time-of-day
- Ordering is consistent with happens-before relation
- Doesn't depend on special hardware (atomic clocks, GPS)
- Only requires roughly synchronized clocks
Used by: CockroachDB, Apache Cassandra (in some modes)
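The following is a sketch of how a hybrid logical clock might be implemented (a simplified rendering of the published HLC algorithm; the field and method names are assumptions):

```python
import time

class HybridLogicalClock:
    """Sketch of a hybrid logical clock: the physical part tracks the local
    time-of-day clock in microseconds, and the logical counter keeps
    timestamps unique and monotonic even if the physical clock stalls,
    jumps backward, or lags behind a remote node."""

    def __init__(self):
        self.physical = 0  # largest physical time used so far (microseconds)
        self.logical = 0   # counter that breaks ties and preserves monotonicity

    def _wall_clock_us(self):
        return int(time.time() * 1_000_000)

    def now(self):
        # Called when generating a timestamp for a local event or message.
        wall = self._wall_clock_us()
        prev = self.physical
        self.physical = max(prev, wall)
        self.logical = self.logical + 1 if self.physical == prev else 0
        return (self.physical, self.logical)

    def observe(self, remote):
        # Called on receiving a remote timestamp: move forward so that our
        # next timestamps are greater than anything seen so far.
        remote_physical, remote_logical = remote
        wall = self._wall_clock_us()
        prev = self.physical
        self.physical = max(prev, remote_physical, wall)
        if self.physical == prev and self.physical == remote_physical:
            self.logical = max(self.logical, remote_logical) + 1
        elif self.physical == prev:
            self.logical += 1
        elif self.physical == remote_physical:
            self.logical = remote_logical + 1
        else:
            self.logical = 0
        return (self.physical, self.logical)
```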
Lamport/Hybrid Clocks vs. Vector Clocks
- Lamport/Hybrid clocks: Provide total order but can't detect concurrency
- Vector clocks: Can detect when events are concurrent but timestamps are much larger (one integer per node)
Use vector clocks when you need to detect concurrent writes (as in conflict resolution). Use Lamport/hybrid clocks for transaction IDs in snapshot isolation.
3.2. Linearizable ID Generators
Although Lamport and hybrid logical clocks provide useful ordering, that ordering is weaker than linearizable:
- Linearizability: If request A completed before request B began, then B must have higher ID (even if A and B never communicated)
- Lamport clocks: Can only ensure greater timestamps than what a node has seen, not what it hasn't seen
Example: Privacy violation with non-linearizable IDs
Suppose a user first restricts who can see their photos and then uploads a new photo. If the ID generator is not linearizable, the photo may be assigned a lower ID than the permission change, making it appear to have been uploaded before the permissions were tightened, so it could be shown to viewers who are no longer authorized.
Implementing a Linearizable ID Generator
Simplest approach: Use a single node
Optimizations:
- Batch allocation: Write blocks of IDs (1-1000, then 1001-2000) to reduce disk writes
- Some IDs will be skipped if the node crashes, but no duplicates or out-of-order IDs
- Single node can handle large throughput since its job is simple
You cannot easily shard the ID generator—multiple shards independently handing out IDs can't guarantee linearizable ordering.
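A sketch of the batch-allocation idea, assuming a hypothetical durable key-value store for persisting the high-water mark:

```python
class BatchedIdGenerator:
    """Sketch of the batch-allocation optimization above. The end of the
    current block (a high-water mark) is durably written before any ID from
    the block is handed out, so a crash may skip some IDs but can never hand
    out a duplicate or out-of-order ID. `storage` is an assumed durable
    key-value store with get(key, default) and put(key, value)."""

    BLOCK_SIZE = 1000

    def __init__(self, storage):
        self.storage = storage
        # After a restart, resume from the last persisted high-water mark,
        # skipping whatever remained of the previous block.
        self.next_id = storage.get("high_water_mark", 0)
        self.block_end = self.next_id

    def next(self):
        if self.next_id >= self.block_end:
            # Reserve a new block of IDs with a single durable write.
            self.block_end = self.next_id + self.BLOCK_SIZE
            self.storage.put("high_water_mark", self.block_end)
        self.next_id += 1
        return self.next_id
```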
Alternative: Google Spanner approach
- Use physical clocks that return uncertainty interval (not single timestamp)
- Wait for uncertainty interval duration to elapse before returning
- Ensures if one request completes before another begins, the later has greater timestamp
- Requires tightly synchronized clocks (atomic clocks, GPS)
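A sketch of the commit-wait idea behind this approach, assuming a hypothetical `truetime_now()` function that returns the current time as an uncertainty interval:

```python
import time

def commit_wait(truetime_now):
    """Sketch of Spanner-style commit wait. `truetime_now` is an assumed
    function returning the current time as an interval (earliest, latest).
    We choose the interval's upper bound as the timestamp, then wait until
    the interval has moved entirely past it, so any request that starts
    after we return is guaranteed to receive a strictly greater timestamp."""
    _, latest = truetime_now()
    chosen_timestamp = latest
    # Block until we are certain the true time has passed the chosen timestamp.
    while truetime_now()[0] <= chosen_timestamp:
        time.sleep(0.001)
    return chosen_timestamp
```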
4. Consensus
In plain English: Consensus is about getting all the nodes in a distributed system to agree on something—who is the leader, what order events happened in, whether to commit a transaction. It's one of the most important and difficult problems in distributed computing.
In technical terms: Consensus is the fundamental problem underlying many distributed systems challenges: leader election, atomic commit, linearizable operations, and more. The best-known consensus algorithms are Viewstamped Replication, Paxos, Raft, and Zab.
Why it matters: Consensus algorithms enable automatic failover, preventing split brain while ensuring no committed data is lost. They're the foundation of coordination services like ZooKeeper and etcd, which in turn are used by countless distributed systems.
We have seen several examples of things that are easy with a single node but hard when you want fault tolerance:
- Linearizable compare-and-set operations and ID generators
- Locks, leases, and leader election without split brain
- Uniqueness constraints
- Atomically committing a transaction that spans multiple nodes
All of these are instances of consensus.
4.1. The Impossibility of Consensus
The FLP result (Fischer, Lynch, Paterson) proves that there is no algorithm that is always able to reach consensus if there is a risk that a node may crash.
How can consensus algorithms exist?
The FLP result doesn't say we can never reach consensus—only that we can't guarantee that a consensus algorithm will always terminate. The proof assumes:
- Deterministic algorithm in the asynchronous system model
- No clocks or timeouts
Ways around the impossibility:
- Use timeouts to suspect crashed nodes (even if suspicion is sometimes wrong)
- Use random numbers (randomized algorithms)
💡 Insight
Although FLP proves theoretical impossibility, distributed systems can usually achieve consensus in practice by using timeouts and failure detectors. The FLP result is important theoretically but doesn't prevent building real consensus systems.
4.2. The Many Faces of Consensus
Consensus can be expressed in several different ways—and surprisingly, these are all equivalent:
- Single-value consensus: several nodes agree on a single value
- Shared logs (total order broadcast): all nodes agree on the order of entries in an append-only log
- Atomic commitment: all participants in a distributed transaction agree on whether to commit or abort
If you have an algorithm that solves one of these problems, you can convert it into a solution for any of the others!
Single-Value Consensus
Formal properties:
- Uniform agreement: No two nodes decide differently
- Integrity: No node decides twice
- Validity: If a node decides value v, then v was proposed by some node
- Termination: Every node that does not crash eventually decides some value
The first three are safety properties (bad things don't happen). Termination is a liveness property (good things eventually happen).
Example uses:
- Multiple nodes race to become leader → consensus decides which wins
- Multiple people book the last airplane seat → consensus decides which succeeds
Shared Logs (Total Order Broadcast)
In plain English: A shared log is like a notebook that everyone can write in, but once something is written, it's permanent and everyone sees the same sequence of entries in the same order.
Properties:
- Once an entry is appended, it is durable: it cannot be changed or removed
- Every node reads the same entries in the same order
- An entry appended at one node is eventually delivered to all nodes (reliable delivery)
Implementing consensus from shared log:
- Every node that wants to propose a value requests it to be added to the log
- Whichever value appears first in the log is the decided value
- Since all nodes read the same order, they all agree
Implementing shared log from consensus:
- For every log slot, run a consensus instance to decide what goes in that slot
- When consensus decides for a slot, append it to the log
- If a value wasn't chosen, retry by proposing it for a later slot
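A toy sketch of the first conversion (single-value consensus on top of a shared log), assuming a hypothetical `shared_log` object:

```python
def decide(shared_log, my_value):
    """Each node appends the value it wants to propose, then decides whichever
    proposal appears first in the log. Because every node reads the log in the
    same order, they all reach the same decision. `shared_log` is an assumed
    object whose read() returns entries in delivery order, blocking until at
    least our own appended entry has been delivered."""
    shared_log.append(("propose", my_value))
    for kind, value in shared_log.read():
        if kind == "propose":
            return value  # the first proposal in the log wins
```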
Atomic Commitment
Properties (similar to consensus, but with a key difference):
- Uniform agreement: No two nodes decide differently (commit vs. abort)
- Integrity: A node cannot change its decision once made
- Validity: A node may only decide commit if all nodes voted to commit; if any node voted to abort, the decision must be abort
- Non-triviality: If all nodes vote to commit and no failures or timeouts occur, the decision must be commit (an algorithm that always aborts is not allowed)
- Termination: Every non-crashed node eventually decides
Key difference from consensus: With consensus, any proposed value is acceptable. With atomic commit, must abort if any participant voted abort.
Implementing from consensus:
- Every node sends its vote (commit/abort) to all others
- Nodes that receive "commit" from all propose "commit" to consensus
- Nodes that receive "abort" or timeout propose "abort"
- Consensus decides; all nodes commit or abort accordingly
4.3. Consensus in Practice
Most consensus systems provide shared logs (total order broadcast):
- Raft, Viewstamped Replication, Zab: Shared logs directly
- Paxos: Single-value consensus, but Multi-Paxos provides shared log
- Used by: ZooKeeper (Zab), etcd (Raft), Consul (Raft)
Using Shared Logs
A shared log is a versatile building block: if every replica processes the same log entries in the same order using deterministic logic, the replicas remain consistent (state machine replication). The same idea can be used for database replication, for serializable transactions (append transactions to the log and execute them in log order), and for locks and leases, where the log position of the entry that acquired a lock can serve as its fencing token.
From Single-Leader Replication to Consensus
Challenge: Single-leader replication is straightforward, but how do we provide fault tolerance through automatic failover without risking split brain?
Solution: Epochs and two rounds of voting
Key properties:
- Within each epoch, leader is unique
- Higher epoch prevails in conflicts
- Before appending, leader checks no higher epoch exists (via quorum vote)
- The two rounds of voting look superficially like 2PC, but they are very different: consensus needs only a quorum (not all nodes), and any node can start an election
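A deliberately simplified sketch of the epoch check before appending, using a hypothetical replica interface (real algorithms such as Raft and Paxos handle many more cases, including log repair and retries):

```python
def append_as_leader(my_epoch, entry, replicas, quorum_size):
    """Before treating an entry as committed, the leader checks with a quorum
    that no higher epoch exists. Each replica is assumed to expose
    promised_epoch() and accept(epoch, entry)."""
    acks = 0
    for replica in replicas:
        if replica.promised_epoch() > my_epoch:
            # Another leader with a higher epoch has been elected: step down.
            raise RuntimeError("deposed: a higher epoch exists")
        if replica.accept(my_epoch, entry):
            acks += 1
    # The entry counts as committed only once a quorum has accepted it.
    return acks >= quorum_size
```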
Subtleties:
- New leader catching up: Raft requires new leader to have up-to-date log; Paxos allows any node but requires catching up first
- Linearizable reads: Must go through quorum vote (or lease-based optimization)
- Reconfiguration: Can add/remove nodes dynamically (useful for multi-region expansion)
💡 Insight
Consensus algorithms are essentially "single-leader replication done right"—with automatic leader election, no split brain possible, and no committed data lost during failover. The complexity comes from handling all the edge cases correctly.
Pros and Cons of Consensus
Advantages:
- Automatic failover without data loss
- Prevents split brain
- Ensures linearizability
- Battle-tested algorithms with formal proofs
Disadvantages:
- Every write (and every linearizable read) requires a quorum vote, which adds latency, especially across regions
- A majority of nodes must be reachable to make any progress (e.g., at least three nodes to tolerate one failure)
- Unreliable networks with variable delays can trigger frequent leader elections, hurting performance or even preventing progress
For systems that want high availability but don't want the cost of consensus, the alternative is weaker consistency models (leaderless or multi-leader replication).
4.4. Coordination Services
In plain English: Coordination services like ZooKeeper and etcd are specialized databases designed to help distributed systems coordinate with each other—not for storing your application data, but for storing tiny amounts of critical coordination metadata.
Examples: ZooKeeper, etcd, Consul
Why not a regular database? Coordination services combine consensus with features specifically useful for distributed coordination:
- Linearizable atomic operations, such as compare-and-set, used to implement locks and leases
- A total order of operations, which provides fencing tokens for leases
- Failure detection via client session heartbeats, with ephemeral entries that are removed when a session expires
- Change notifications, so clients can watch for keys being created, modified, or deleted
Common Use Cases
- Leader election and automatic failover
- Assigning partitions or work to nodes, and rebalancing when nodes join or fail
- Service discovery and cluster membership
Design characteristics:
- Hold small amounts of data (fits in memory)
- Data is replicated to 3-5 nodes using consensus
- Fixed set of nodes regardless of application cluster size
- Slow-changing data (changes on timescale of minutes/hours)
- For fast-changing state, use conventional databases instead
Service discovery note: Using consensus for service discovery is often overkill. DNS-based service discovery with caching is usually sufficient since linearizability isn't needed for finding services.
ZooKeeper observers: Read-only replicas that don't participate in voting; provide high read throughput and availability for non-linearizable reads (useful for service discovery).
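As an illustration, acquiring a distributed lock through ZooKeeper might look like the following sketch using the kazoo Python client (hosts, paths, and identifiers are placeholders; consult the library's documentation for the exact API):

```python
from kazoo.client import KazooClient

def do_critical_work():
    # Placeholder for work that must only run on one node at a time.
    print("holding the lock")

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

lock = zk.Lock("/app/locks/resource-42", identifier="worker-1")
with lock:  # blocks until the lock is acquired
    # Only one client holds the lock at a time; ZooKeeper's consensus layer
    # prevents split brain even if some of its own nodes fail.
    do_critical_work()

zk.stop()
```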
5. Summary
In this chapter we examined strong consistency in fault-tolerant systems: what it is and how to achieve it.
Linearizability:
- Makes replicated data appear as a single copy with atomic operations
- Useful for avoiding race conditions (e.g., uniqueness constraints, leader election)
- Has downsides: slow, especially with large network delays
- Many replication algorithms don't provide it (including Dynamo-style quorums)
ID Generators and Logical Clocks:
- Single-node auto-increment is linearizable but not fault-tolerant
- Distributed ID schemes (UUIDs, timestamp-based) don't guarantee linearizable ordering
- Lamport clocks provide causally consistent ordering without linearizability
- Hybrid logical clocks combine physical time with causal ordering
- Linearizable ID generators require single-node with replication or Spanner-style synchronized clocks
Consensus:
A wide range of problems are equivalent to consensus:
- Linearizable compare-and-set operations and ID generators
- Atomic transaction commit
- Total order broadcast (shared logs)
- Locks, leases, and leader election
- Uniqueness constraints
All are straightforward on a single node, but challenging to make fault-tolerant in a distributed setting.
Consensus algorithms:
- Raft, Paxos, Viewstamped Replication, Zab — Widely used, battle-tested algorithms
- Essentially single-leader replication with automatic leader election and failover
- Ensure no committed writes lost and no split brain
- Every write and linearizable read requires a quorum vote
- ZooKeeper, etcd, Consul built on consensus; provide locks, leases, failure detection, notifications
💡 Insight
Consensus is not always the right tool. For systems that don't need strong consistency, weaker models (leaderless, multi-leader) offer better availability and performance. The decision comes down to your application requirements: do you need linearizability, or can you tolerate eventual consistency?
When not to use consensus:
- Strong consistency not needed
- High availability more important than consistency
- Latency-sensitive applications
- Wide-area networks with variable delays
For such cases, leaderless or multi-leader replication with weaker consistency models is often more appropriate. The logical clocks we discussed (Lamport, hybrid) are helpful in that context.
Consensus algorithms are complicated and subtle, but they are supported by a rich body of theory developed since the 1980s. This theory makes it possible to build systems that tolerate all the faults discussed in Chapter 9 while ensuring data is not corrupted.