
Chapter 10. Consistency and Consensus

An ancient adage warns, "Never go to sea with two chronometers; take one or three."

Table of Contents

  1. Introduction
  2. Linearizability
  3. ID Generators and Logical Clocks
  4. Consensus
  5. Summary

1. Introduction

In plain English: When you replicate data across multiple machines for fault tolerance, those copies can easily get out of sync. There are two philosophical approaches: either expose this complexity to developers (eventual consistency) or hide it behind strong guarantees (strong consistency). This chapter is about the strong consistency approach.

In technical terms: Lots of things can go wrong in distributed systems. If we want a service to continue working correctly despite those things going wrong, we need to find ways of tolerating faults. One of the best tools we have is replication—but having multiple copies of data opens up the risk of inconsistencies.

Why it matters: Strong consistency makes distributed systems behave like single-node systems, dramatically simplifying application logic. However, it comes with performance costs and availability trade-offs that you must understand to make informed architectural decisions.

At a high level, there are two competing philosophies for dealing with replication issues:

🔄
Eventual Consistency
Applications must handle inconsistencies and conflicts explicitly
  • Multi-leader and leaderless replication
  • Higher availability and performance
  • More complex application logic
  • Works with offline-capable apps
🔒
Strong Consistency
System behaves as if it were single-node; hides replication complexity
  • Single-leader or consensus-based
  • Simpler application development
  • Performance cost and reduced availability
  • Requires reliable network communication

💡 Insight

The choice between eventual and strong consistency is fundamentally about where you place complexity: in the infrastructure (consensus algorithms) or in the application code (conflict resolution). Neither is inherently better—it depends on your requirements, network reliability, and whether your application can tolerate stale reads.

In this chapter we will dive deeper into the strongly consistent approach, looking at three areas:

  1. Linearizability — A precise definition of "strong consistency" that makes replicated systems behave like single-node systems
  2. ID Generators and Timestamps — How to generate unique, correctly ordered identifiers in distributed systems
  3. Consensus Algorithms — How distributed systems can achieve linearizability while remaining fault-tolerant

Along the way, we will see that there are fundamental limits on what is possible in a distributed system.


2. Linearizability

In plain English: Imagine a database that looks like a single magic box, even though it's actually many machines. When one person writes something, everyone else immediately sees that new value. No stale reads, no confusion. That's linearizability.

In technical terms: Linearizability (also known as atomic consistency, strong consistency, immediate consistency, or external consistency) is the idea of making a system appear as if there were only one copy of the data, and all operations on it are atomic. With this guarantee, even though there may be multiple replicas in reality, the application does not need to worry about them.

Why it matters: In a linearizable system, as soon as one client successfully completes a write, all clients reading from the database must be able to see the value just written. This recency guarantee prevents confusing anomalies where users see data go "back in time."

Example: Nonlinearizable Sports Website

Nonlinearizable System Causing Confusion
1
Game ends, final score announced
Score becomes official: Warriors 103, Lakers 101
2
Aaliyah refreshes (hits Replica A)
Sees final score, excitedly tells Bryce
3
Bryce refreshes (hits Replica B)
Sees game still ongoing—Replica B is lagging!
4
Linearizability violation
Bryce knows he reloaded AFTER Aaliyah, but sees older data

If Aaliyah and Bryce had hit reload at the same time, different results would be unsurprising. However, Bryce knows that he hit reload after hearing Aaliyah exclaim the final score, and therefore he expects his query result to be at least as recent as Aaliyah's. The fact that his query returned a stale result is a violation of linearizability.

2.1. What Makes a System Linearizable?

Let's look at three clients concurrently reading and writing the same object x (could be one key in a key-value store, one row in a database, etc.).

Linearizability Basics

  • Client A: read(x) ⇒ 0, then later read(x) ⇒ 1
  • Client B: read(x) ⇒ 1, then later read(x) ⇒ 2 ❌
  • Client C: write(x, 1)

(The full version of this example also includes compare-and-set operations that later bring x to 2 and then 4; they are omitted from this simplified view.)

Key operations:

  • read(x) ⇒ v — Client reads register x, database returns value v
  • write(x, v) ⇒ r — Client sets register x to value v, database returns response r (ok or error)
  • cas(x, v_old, v_new) ⇒ r — Atomic compare-and-set: if the current value of x equals v_old, it is set to v_new; otherwise the operation returns an error

Linearizability rules:

  1. Before write completes: Reads return old value (0)
  2. After write completes: All reads must return new value (1)
  3. During write (concurrent): Reads may return either old or new value
  4. Once any read returns new value: All subsequent reads (on any client) must also return new value

💡 Insight

Think of linearizability as requiring an atomic "flip" moment during every write. Even if the write operation takes time to process, there must be some instant when the value switches from old to new. After that instant, no client can ever see the old value again—even if the write operation hasn't finished confirming to the writer.

The final read by Client B (returning 2) is not linearizable because Client A had already read the newer value 4 before B's read started; once a newer value has been returned to any client, no later read may return an older one. This is the same Aaliyah-and-Bryce situation: if one client sees a newer value, all subsequent reads must see at least that value.
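
To make these rules concrete, here is a minimal brute-force sketch that decides whether a small history of reads and writes on a single register is linearizable. The history format and the initial value of 0 are assumptions for illustration, and the exponential search is only practical for tiny histories:

from itertools import permutations

# Each operation: kind is "read" or "write", value is the value read or
# written, and start/end give its real-time interval.

def is_linearizable(history, initial=0):
    n = len(history)
    for order in permutations(range(n)):
        position = {op_index: k for k, op_index in enumerate(order)}
        # Respect real-time order: if operation i finished before operation j
        # started, i must also come before j in the chosen order.
        if any(history[i]["end"] < history[j]["start"] and position[i] > position[j]
               for i in range(n) for j in range(n)):
            continue
        # Replay the candidate order against a single copy of the data.
        value, consistent = initial, True
        for op_index in order:
            op = history[op_index]
            if op["kind"] == "write":
                value = op["value"]
            elif op["value"] != value:
                consistent = False
                break
        if consistent:
            return True    # found at least one valid linearization
    return False

# The sports-website anomaly: Bryce's read starts after Aaliyah's read of the
# new value has completed, yet returns the old value.
history = [
    {"kind": "write", "value": 1, "start": 1, "end": 4},   # final score written
    {"kind": "read",  "value": 1, "start": 5, "end": 6},   # Aaliyah sees it
    {"kind": "read",  "value": 0, "start": 7, "end": 8},   # Bryce sees stale data
]
print(is_linearizable(history))   # False

A history is linearizable if at least one total order that respects real-time behaves like a single copy of the data; here no such order exists.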

2.2. Linearizability Versus Serializability

These are easily confused but quite different!

  • Scope — Serializability: transactions across multiple objects. Linearizability: reads and writes of a single object.
  • Guarantee — Serializability: transactions behave as if they had executed serially. Linearizability: operations appear to take effect instantaneously.
  • Order requirement — Serializability: any serial order is acceptable. Linearizability: must respect real-time order.
  • Prevents — Serializability: write skew, phantoms (multi-object anomalies). Linearizability: stale reads (single-object recency violations).
  • Example system — Serializability: Serializable Snapshot Isolation (SSI). Linearizability: single-leader replication, reading from the leader.

In plain English: Serializability is about multi-object transactions behaving as if they ran one-at-a-time (even if they actually overlapped). Linearizability is about single-object operations appearing instantaneous and respecting real-time order.

Combining both: A database may provide both serializability and linearizability—this combination is known as strict serializability or strong one-copy serializability (strong-1SR). Single-node databases typically provide this. With distributed databases, it requires expensive coordination.

2.3. Relying on Linearizability

In what circumstances is linearizability useful?

🔐
Locking and Leader Election
Ensuring only one node becomes the leader
  • Must be linearizable—no split brain allowed
  • ZooKeeper and etcd use consensus for this
  • Implements distributed leases
  • Fencing tokens prevent zombie leaders
🎯
Constraints and Uniqueness
Enforcing uniqueness constraints
  • Username must be unique
  • File path must be unique
  • Similar to acquiring a "lock" on the username
  • Requires atomic compare-and-set
📡
Cross-Channel Timing
Multiple communication channels between components
  • Web upload → queue → transcoder reads from storage
  • If storage is non-linearizable, race condition possible
  • Transcoder might fetch old version or nothing
  • Push notifications + data fetch have same issue

Example: Video transcoding race condition

Cross-Channel Race Condition
1
User uploads video.mp4
Web server writes to file storage
2
Web server enqueues "transcode video.mp4"
Message goes to queue (faster than storage replication)
3
Transcoder receives message
Fetches video.mp4 from file storage
4
Storage returns old version or 404
Replication not yet complete—inconsistency!

This problem arises because there are two different communication channels between the web server and transcoder: the file storage and the message queue. Without linearizability, race conditions between these channels are possible.

2.4. Implementing Linearizable Systems

Let's revisit replication methods from Chapter 6 and compare their linearizability:

  • Single-leader replication — potentially linearizable: only if all reads and writes go through the leader; reads from asynchronously updated followers are stale, and failover can violate linearizability
  • Consensus algorithms — linearizable: designed to prevent split brain, though reads that skip the leader's quorum check may still be stale
  • Multi-leader replication — not linearizable: concurrent writes on multiple nodes create conflicts, and replication is asynchronous
  • Leaderless (Dynamo-style) — probably not linearizable: even with quorums, last-write-wins based on time-of-day clocks and race conditions under variable network delays can violate it

Linearizability and Quorums

Intuitively, quorum reads and writes (w + r > n) seem like they should be linearizable. However, they are not:

Nonlinearizable Execution Despite Quorum (n = 3, w = 3, r = 2)

  • Writer sends write x = 1 to all three replicas; because of network delays it has reached Replica 1, but Replicas 2 and 3 still hold x = 0
  • Client A reads from the quorum {Replica 1, Replica 2} and returns x = 1 ✓ (Replica 1 already has the new value)
  • Client B then reads from the quorum {Replica 2, Replica 3} and returns x = 0 ❌ (neither has received the write yet)

The quorum condition is met (w + r > n), but this execution is not linearizable: B's request begins after A's completes, but B returns the old value while A returns the new value.

To make Dynamo-style quorums linearizable (at significant performance cost):

  • Readers must perform synchronous read repair before returning (a sketch follows this list)
  • Writers must read latest state from quorum to fetch the latest timestamp
  • Even then, compare-and-set cannot be implemented (requires consensus)
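
As a rough illustration of the first point, here is a sketch of a quorum read with synchronous read repair. The replica API (get and put with timestamps) is hypothetical, and as noted above, even with this change compare-and-set would still require consensus:

def quorum_read(replicas, key, r):
    responses = []
    for replica in replicas:
        try:
            value, timestamp = replica.get(key)
            responses.append((replica, value, timestamp))
        except TimeoutError:
            continue                      # skip unreachable replicas
        if len(responses) >= r:
            break
    if len(responses) < r:
        raise RuntimeError("could not reach a read quorum")

    # The newest version among the responses wins.
    _, newest_value, newest_timestamp = max(responses, key=lambda resp: resp[2])

    # Synchronous read repair: bring stale replicas in the quorum up to date
    # *before* returning, so that a later quorum read cannot go back in time.
    for replica, value, timestamp in responses:
        if timestamp < newest_timestamp:
            replica.put(key, newest_value, newest_timestamp)

    return newest_value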

💡 Insight

It is safest to assume that a leaderless system with Dynamo-style replication does not provide linearizability, even with quorum reads and writes.

2.5. The Cost of Linearizability

Consider a multi-region deployment with a network partition between regions:

Network Partition: Linearizability vs. Availability
West Region
Leader
Clients can write
❌ NETWORK PARTITION ❌
East Region
Follower
Clients cannot write!

With single-leader replication:

  • Clients in the follower region cannot make writes (unavailable)
  • Linearizable reads also unavailable (must contact leader)
  • Only clients in the leader region can continue working normally

With multi-leader replication:

  • Both regions can accept writes (available)
  • Writes are queued and exchanged when network recovers
  • But this is not linearizable

The CAP Theorem

In plain English: When your network is broken, you must choose: either maintain consistency (linearizability) but become unavailable in some regions, or remain available everywhere but give up consistency.

More precisely: The CAP theorem observes that if your application requires linearizability, and some replicas are disconnected due to a network problem, then those replicas cannot process requests (they become unavailable). This choice is sometimes known as CP (consistent under partitions).

If your application does not require linearizability, then each replica can process requests independently even when disconnected (AP: available under partitions).

Why CAP is unhelpful:

  • Network partitions — presented as a "choice", but in reality they are inevitable: they will happen whether you like it or not
  • Better phrasing — not "pick 2 out of 3: C, A, P", but "either consistent or available when partitioned"
  • Scope — considers only linearizability and network partitions; ignores network delays, dead nodes, and other trade-offs
  • Practical value — historically influential (it triggered the NoSQL movement), but of little help for designing real systems

💡 Insight

The CAP theorem has been superseded by more precise impossibility results and is of mostly historical interest today. The real trade-off is not about network partitions (which are inevitable) but about choosing between linearizability and low latency.

Linearizability and Network Delays

Surprisingly, even RAM on modern multi-core CPUs is not linearizable! If a thread on one CPU core writes to a memory address, and a thread on another core reads the same address shortly after, it may not see the new value (unless a memory barrier is used).

Why? Every CPU core has its own cache and store buffer. Accessing cache is much faster than main memory, so this is essential for performance. But there are now multiple copies of data, asynchronously updated—linearizability is lost.

Key insight: The reason for dropping linearizability is performance, not fault tolerance.

The same is true of distributed databases that choose not to provide linearizable guarantees: they do so primarily to increase performance, not for fault tolerance.

Attiya and Welch proved: If you want linearizability, the response time of read and write requests is at least proportional to the uncertainty of delays in the network. In a network with highly variable delays (like most computer networks), linearizable reads and writes will inevitably be slow.

A faster algorithm for linearizability does not exist, but weaker consistency models can be much faster.


3. ID Generators and Logical Clocks

In plain English: In a single-node database, you can generate unique IDs by just incrementing a counter (1, 2, 3, ...). This is fast, compact, and the order tells you which records were created first. But in a distributed system, this simple approach doesn't work—you need more sophisticated techniques.

In technical terms: Many applications need to assign a unique ID to each database record when it is created. In single-node databases, an auto-incrementing integer (32-bit or 64-bit) is common: it is compact, and the order of the IDs reflects the order in which the records were created.

Why it matters: Incorrectly ordered IDs can cause serious bugs, like showing a private photo to unauthorized viewers because its timestamp was assigned out of order relative to an account permission change.

Single-Node ID Generator Problems

A single-node ID generator is linearizable (each fetch-and-add operation atomically increments and returns the counter), but has issues:

Single Point of Failure
If the node fails, no IDs can be generated until recovery
🌍
Geographic Latency
Creating a record in another region requires round-trip to ID generator
🚦
Bottleneck
Single node can become overwhelmed with high write throughput

Alternative ID Generation Schemes

  • Sharded assignment (Node A issues even numbers, Node B odd numbers) — compact, but loses ordering: ID 17 might refer to an earlier record than ID 16
  • Preallocated blocks (Node A: 1–1000, Node B: 1001–2000) — reduces contention, but loses ordering
  • Random UUIDs (version 4: a 128-bit random number) — no coordination needed, but random order and larger size
  • Timestamp-based (wall-clock time + shard number + sequence number; e.g., version 7 UUIDs, Snowflake, MongoDB ObjectId) — approximate ordering, but clock skew causes issues

All these schemes generate unique IDs but have much weaker ordering guarantees than single-node auto-increment. As discussed in Chapter 9, wall-clock timestamps can provide at best approximate ordering due to clock skew and jumps.
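
For illustration, a timestamp-based generator might look roughly like the following sketch. The bit layout is an assumption (Twitter Snowflake, version 7 UUIDs, and MongoDB ObjectIds all differ in the details), and the code simply trusts the wall clock, which is exactly where the ordering guarantee breaks down:

import threading
import time

class TimestampIdGenerator:
    def __init__(self, shard_id):
        self.shard_id = shard_id & 0x3FF   # 10 bits identify the shard/node
        self.sequence = 0                  # 12 bits, reset every millisecond
        self.last_ms = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now_ms = int(time.time() * 1000)
            if now_ms == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF
            else:
                # A backward clock jump would silently break the ordering here;
                # real implementations must detect and handle that case.
                self.last_ms, self.sequence = now_ms, 0
            # Layout: milliseconds since epoch | shard ID (10 bits) | sequence (12 bits)
            return (now_ms << 22) | (self.shard_id << 12) | self.sequence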

3.1. Logical Clocks

In plain English: Instead of measuring time in seconds, a logical clock measures time in events. It doesn't tell you "what time is it?" but it can tell you "which happened first?"

In technical terms: While a physical clock is a hardware device that counts seconds (or milliseconds, microseconds), a logical clock is an algorithm that counts events that have occurred. Timestamps from a logical clock can be compared to determine ordering.

Requirements for a logical clock:

  1. Compact — A few bytes in size
  2. Unique — Every timestamp is different
  3. Totally ordered — Any two timestamps can be compared
  4. Causally consistent — If operation A happened before B, then A's timestamp < B's timestamp

Lamport Timestamps

How it works:

Lamport Clock Example
1
Each node has unique ID and counter
Counter starts at 0, increments on every operation
2
Timestamp = (counter, node_id)
Example: (5, "alice"), (12, "bob")
3
On local operation: increment counter
Attach new counter value to operation
4
On receiving message: max(local, received) + 1
Ensures causal consistency across nodes

Comparing Lamport timestamps:

(2, "bryce") > (1, "aaliyah")   // Compare counter first
(1, "caleb") > (1, "aaliyah") // If equal, compare node ID

💡 Insight

Lamport timestamps capture the happens-before relationship: if A happened before B (in the causal sense), then A's timestamp will be less than B's. But the reverse is not true—if A's timestamp is less than B's, A might have happened before B, or they might be concurrent.

Hybrid Logical Clocks

Hybrid logical clocks combine physical time-of-day clocks with Lamport clock ordering:

Physical Time Component
  • Counts microseconds like a normal clock
  • Can find events by date/time
  • Slowly drifts with underlying physical clock
Logical Counter
  • Increments on every timestamp generation
  • Moves forward if remote timestamp is greater
  • Ensures monotonic progress even if clock jumps backward

Benefits:

  • Treats timestamp almost like conventional time-of-day
  • Ordering is consistent with happens-before relation
  • Doesn't depend on special hardware (atomic clocks, GPS)
  • Only requires roughly synchronized clocks

Used by: CockroachDB, Apache Cassandra (in some modes)
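
A simplified sketch of a hybrid logical clock; the microsecond resolution and the (physical, logical) tuple encoding are assumptions for illustration:

import time

class HybridLogicalClock:
    def __init__(self):
        self.physical = 0   # highest physical time seen so far (microseconds)
        self.logical = 0    # counter that breaks ties within one physical tick

    def _wall_clock(self):
        return int(time.time() * 1_000_000)

    def now(self):
        wall = self._wall_clock()
        if wall > self.physical:
            self.physical, self.logical = wall, 0
        else:
            # Wall clock stalled or jumped backward: keep moving forward anyway.
            self.logical += 1
        return (self.physical, self.logical)

    def receive(self, remote_timestamp):
        remote_physical, remote_logical = remote_timestamp
        wall = self._wall_clock()
        previous = self.physical
        self.physical = max(previous, remote_physical, wall)
        # Advance the counter so the new timestamp is greater than both our
        # previous timestamp and the one we just received.
        if self.physical == previous and self.physical == remote_physical:
            self.logical = max(self.logical, remote_logical) + 1
        elif self.physical == previous:
            self.logical += 1
        elif self.physical == remote_physical:
            self.logical = remote_logical + 1
        else:
            self.logical = 0
        return (self.physical, self.logical)

Timestamps compare like ordinary tuples (physical time first, then the counter), so the ordering stays close to wall-clock time while remaining consistent with happens-before.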

Lamport/Hybrid Clocks vs. Vector Clocks

  • Lamport/Hybrid clocks: Provide total order but can't detect concurrency
  • Vector clocks: Can detect when events are concurrent but timestamps are much larger (one integer per node)

Use vector clocks when you need to detect concurrent writes (as in conflict resolution). Use Lamport/hybrid clocks for transaction IDs in snapshot isolation.

3.2. Linearizable ID Generators

Although Lamport and hybrid logical clocks provide useful ordering, that ordering is weaker than linearizable:

  • Linearizability: If request A completed before request B began, then B must have higher ID (even if A and B never communicated)
  • Lamport clocks: Can only ensure greater timestamps than what a node has seen, not what it hasn't seen

Example: Privacy violation with non-linearizable IDs

Non-Linearizable ID Generator Problem
1
User A: Set account to private (timestamp: 105)
Request goes to accounts database
2
User A: Upload embarrassing photo (timestamp: 104)
Photos database counter was slightly behind—assigns earlier timestamp!
3
Viewer reads with snapshot timestamp 104.5
Sees photo (104 < 104.5) but not privacy setting (105 > 104.5)
4
Result: Unauthorized viewer sees private photo
Linearizable ID generator would have prevented this

Implementing a Linearizable ID Generator

Simplest approach: Use a single node

Single-Node ID Generator with Fault Tolerance

  • ID generator (leader): holds the counter (e.g., 1,247,832), persisted to disk and replicated
  • Replica (follower): ready to take over on failover

Optimizations:

  • Batch allocation: Write blocks of IDs (1-1000, then 1001-2000) to reduce disk writes
  • Some IDs will be skipped if the node crashes, but no duplicates or out-of-order IDs
  • Single node can handle large throughput since its job is simple

You cannot easily shard the ID generator—multiple shards independently handing out IDs can't guarantee linearizable ordering.
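
A sketch of the batch-allocation idea; the durable_store API and key name are hypothetical, and in practice the counter would live on the leader of a small replicated group:

class BlockAllocatingIdGenerator:
    BLOCK_SIZE = 1000

    def __init__(self, durable_store):
        self.store = durable_store
        # After a crash, resume from the last persisted high-water mark: any
        # unused IDs below it are skipped, but none are reused or reordered.
        self.current = self.store.read("id_high_water_mark")
        self.block_end = self.current

    def next_id(self):
        if self.current >= self.block_end:
            # Durably reserve a whole block with a single write, rather than
            # one disk write (and replication round) per generated ID.
            self.block_end = self.current + self.BLOCK_SIZE
            self.store.write("id_high_water_mark", self.block_end)
        self.current += 1
        return self.current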

Alternative: Google Spanner approach

  • Use physical clocks that return uncertainty interval (not single timestamp)
  • Wait for uncertainty interval duration to elapse before returning
  • Ensures if one request completes before another begins, the later has greater timestamp
  • Requires tightly synchronized clocks (atomic clocks, GPS)
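
A sketch of that commit-wait idea, assuming a TrueTime-like clock source tt_now() that returns an (earliest, latest) uncertainty interval in seconds:

import time

def linearizable_timestamp(tt_now):
    _, latest = tt_now()
    timestamp = latest
    # Commit wait: block until the whole uncertainty interval has passed.
    # Any request that starts after we return is then guaranteed to be
    # assigned a strictly greater timestamp.
    while tt_now()[0] <= timestamp:
        time.sleep(0.0001)
    return timestamp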

4. Consensus

In plain English: Consensus is about getting all the nodes in a distributed system to agree on something—who is the leader, what order events happened in, whether to commit a transaction. It's one of the most important and difficult problems in distributed computing.

In technical terms: Consensus is the fundamental problem underlying many distributed systems challenges: leader election, atomic commit, linearizable operations, and more. The best-known consensus algorithms are Viewstamped Replication, Paxos, Raft, and Zab.

Why it matters: Consensus algorithms enable automatic failover, preventing split brain while ensuring no committed data is lost. They're the foundation of coordination services like ZooKeeper and etcd, which in turn are used by countless distributed systems.

We have seen several examples of things that are easy with a single node but hard when you want fault tolerance:

👑
Leader Failover
Single leader is simple, but how to fail over safely without split brain?
🔢
Linearizable ID Generator
Counter with fetch-and-add on one node, but what if it crashes?
⚛️
Compare-and-Set
Useful for locks and leases, trivial on one node, hard to make fault-tolerant

All of these are instances of consensus.

4.1. The Impossibility of Consensus

The FLP result (Fischer, Lynch, Paterson) proves that there is no algorithm that is always able to reach consensus if there is a risk that a node may crash.

How can consensus algorithms exist?

The FLP result doesn't say we can never reach consensus—only that we can't guarantee that a consensus algorithm will always terminate. The proof assumes:

  • Deterministic algorithm in the asynchronous system model
  • No clocks or timeouts

Ways around the impossibility:

  1. Use timeouts to suspect crashed nodes (even if suspicion is sometimes wrong)
  2. Use random numbers (randomized algorithms)

💡 Insight

Although FLP proves theoretical impossibility, distributed systems can usually achieve consensus in practice by using timeouts and failure detectors. The FLP result is important theoretically but doesn't prevent building real consensus systems.

4.2. The Many Faces of Consensus

Consensus can be expressed in several different ways—and surprisingly, these are all equivalent:

  • Single-value consensus — decide on one value (e.g., which node is the leader); similar to an atomic compare-and-set
  • Shared logs (total order broadcast) — agree on the order of entries in a log; used for state machine replication and event sourcing
  • Atomic commitment — get all nodes to either commit or abort a transaction; like two-phase commit, but fault-tolerant

If you have an algorithm that solves one of these problems, you can convert it into a solution for any of the others!

Single-Value Consensus

Formal properties:

  • Uniform agreement — no two nodes decide differently
  • Integrity — once a node has decided, it cannot change its mind
  • Validity — if a node decides value v, then v was proposed by some node
  • Termination — every node that does not crash eventually decides

The first three are safety properties (bad things don't happen). Termination is a liveness property (good things eventually happen).

Example uses:

  • Multiple nodes race to become leader → consensus decides which wins
  • Multiple people book the last airplane seat → consensus decides which succeeds

Shared Logs (Total Order Broadcast)

In plain English: A shared log is like a notebook that everyone can write in, but once something is written, it's permanent and everyone sees the same sequence of entries in the same order.

Properties:

Shared Log Properties
1
Eventual Append
If a node requests value to be added, it eventually reads it back (unless it crashes)
2
Reliable Delivery
If one node reads a log entry, all nodes eventually read it
3
Append-Only
Entries are immutable; new entries only added at the end
4
Agreement
All nodes read the same entries in the same order

Implementing consensus from shared log:

  • Every node that wants to propose a value requests it to be added to the log
  • Whichever value appears first in the log is the decided value
  • Since all nodes read the same order, they all agree
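
As a sketch, with a hypothetical log client exposing append and read_in_order:

def decide(log, my_proposal):
    log.append(my_proposal)             # every contender proposes by appending
    for entry in log.read_in_order():   # all nodes read the same entries in the same order...
        return entry                    # ...so all of them decide on the first entry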

Implementing shared log from consensus:

  • For every log slot, run a consensus instance to decide what goes in that slot
  • When consensus decides for a slot, append it to the log
  • If a value wasn't chosen, retry by proposing it for a later slot

Atomic Commitment

Properties (similar to consensus but with key difference):

  • Uniform agreement: No two nodes decide differently (commit or abort)
  • Integrity: Cannot change decision once made
  • Validity: If decides commit, all nodes voted commit; if any voted abort, must abort
  • Non-triviality: If all vote commit and no timeouts, must commit (can't always abort)
  • Termination: Every non-crashed node eventually decides

Key difference from consensus: With consensus, any proposed value is acceptable. With atomic commit, must abort if any participant voted abort.

Implementing from consensus:

  1. Every node sends its vote (commit/abort) to all others
  2. Nodes that receive "commit" from all propose "commit" to consensus
  3. Nodes that receive "abort" or timeout propose "abort"
  4. Consensus decides; all nodes commit or abort accordingly
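
A sketch of that recipe, assuming hypothetical collect_votes and consensus primitives:

def atomic_commit(collect_votes, consensus):
    try:
        votes = collect_votes()         # one "commit"/"abort" vote from every node
        proposal = "commit" if all(v == "commit" for v in votes) else "abort"
    except TimeoutError:
        proposal = "abort"              # a missing vote is treated as an abort
    # Consensus picks exactly one of the proposed values, and every node
    # then applies that single outcome.
    return consensus.propose(proposal)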

4.3. Consensus in Practice

Most consensus systems provide shared logs (total order broadcast):

  • Raft, Viewstamped Replication, Zab: Shared logs directly
  • Paxos: Single-value consensus, but Multi-Paxos provides shared log
  • Used by: ZooKeeper (Zab), etcd (Raft), Consul (Raft)

Using Shared Logs

🔄
State Machine Replication
Every log entry = write to database; replicas process same writes in same order → consistent state
🔒
Serializable Transactions
Every log entry = transaction as stored procedure; all nodes execute in same order → serializable
🔀
Derive Other Consensus
CAS: the first value appended to the log wins; fetch-and-add: the counter's value is the sum of the increments in the log; fencing tokens: use the log entry's sequence number
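
For example, state machine replication on top of a shared log is conceptually just the following loop (log and database objects assumed); because every replica applies the same deterministic writes in the same order, all replicas converge to the same state:

def run_replica(log, database):
    for entry in log.read_in_order():   # same entries, same order, on every node
        database.apply(entry)           # apply() must be deterministic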

From Single-Leader Replication to Consensus

Challenge: Single-leader is easy but how to provide fault tolerance through automatic failover?

Solution: Epochs and two rounds of voting

Consensus Algorithm Structure
1
Epoch Numbers
Each leader election gets new epoch (term/view/ballot number)
2
Vote 1: Elect Leader
Nodes vote for new leader; requires quorum
3
Vote 2: Append Log Entry
Leader proposes entry; requires quorum vote
4
Quorum Overlap
Same node participates in both votes → ensures no higher-epoch leader exists

Key properties:

  • Within each epoch, leader is unique
  • Higher epoch prevails in conflicts
  • Before appending, leader checks no higher epoch exists (via quorum vote)
  • The two rounds of voting look superficially like 2PC, but are very different: consensus needs only a quorum (not every node) to respond, and any node can start an election

Subtleties:

  • New leader catching up: Raft requires new leader to have up-to-date log; Paxos allows any node but requires catching up first
  • Linearizable reads: Must go through quorum vote (or lease-based optimization)
  • Reconfiguration: Can add/remove nodes dynamically (useful for multi-region expansion)

💡 Insight

Consensus algorithms are essentially "single-leader replication done right"—with automatic leader election, no split brain possible, and no committed data lost during failover. The complexity comes from handling all the edge cases correctly.

Pros and Cons of Consensus

Advantages:

  • Automatic failover without data loss
  • Prevents split brain
  • Ensures linearizability
  • Battle-tested algorithms with formal proofs

Disadvantages:

📊
Requires Strict Majority
3 nodes for 1 failure, 5 nodes for 2 failures; cannot scale throughput by adding nodes
🌐
Quorum Communication
Every operation needs majority response; slow across geographic regions
⏱️
Timeout Sensitivity
Variable network delays make timeouts hard to tune; too large = slow recovery; too small = unnecessary elections
⚠️
Edge Cases
Raft has issues with unreliable network links causing continuous leadership churn

For systems that want high availability but don't want the cost of consensus, the alternative is weaker consistency models (leaderless or multi-leader replication).

4.4. Coordination Services

In plain English: Coordination services like ZooKeeper and etcd are specialized databases designed to help distributed systems coordinate with each other—not for storing your application data, but for storing tiny amounts of critical coordination metadata.

Examples: ZooKeeper, etcd, Consul

Why not a regular database? Coordination services combine consensus with features specifically useful for distributed coordination:

🔐
Locks and Leases
Atomic CAS for distributed locks; only one node acquires lease
🛡️
Fencing Support
Monotonically increasing IDs (zxid, cversion) prevent zombie processes
💓
Failure Detection
Heartbeats and session timeouts; ephemeral nodes disappear when client dies
🔔
Change Notifications
Clients subscribe to key changes; no need to poll

Common Use Cases

  • Managing configuration — store configuration as key-value pairs; processes load it on startup and subscribe to changes
  • Leader election — several instances need to pick one leader; use CAS to acquire a lease, with an ephemeral node for failure detection (see the sketch after this list)
  • Allocating work — decide which shard is assigned to which node; atomic operations plus change notifications enable automatic rebalancing
  • Service discovery — find the IP addresses of services; services register on startup and clients subscribe; often overkill (plain DNS works too)
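
For the leader-election use case, the lease-acquisition step might look roughly like this sketch; the client API is modeled loosely on ZooKeeper/etcd-style recipes and is hypothetical, not a real library:

class KeyAlreadyExistsError(Exception):
    pass

def try_to_become_leader(client, my_node_id, on_leader_change):
    try:
        # Creating the key is atomic, so it succeeds for exactly one contender.
        # An ephemeral key disappears when our session expires, which releases
        # the lease automatically if this process crashes or is partitioned away.
        client.create_ephemeral("/service/leader", my_node_id)
        return True
    except KeyAlreadyExistsError:
        # Another node holds the lease; subscribe to be notified when that
        # changes instead of polling.
        client.watch("/service/leader", callback=on_leader_change)
        return False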

Design characteristics:

  • Hold small amounts of data (fits in memory)
  • Data is replicated to 3-5 nodes using consensus
  • Fixed set of nodes regardless of application cluster size
  • Slow-changing data (changes on timescale of minutes/hours)
  • For fast-changing state, use conventional databases instead

Service discovery note: Using consensus for service discovery is often overkill. DNS-based service discovery with caching is usually sufficient since linearizability isn't needed for finding services.

ZooKeeper observers: Read-only replicas that don't participate in voting; provide high read throughput and availability for non-linearizable reads (useful for service discovery).


5. Summary

In this chapter we examined strong consistency in fault-tolerant systems: what it is and how to achieve it.

Linearizability:

  • Makes replicated data appear as a single copy, with all operations taking effect atomically (a recency guarantee)
  • Useful for avoiding race conditions (e.g., uniqueness constraints, leader election)
  • Has downsides: slow, especially in networks with large or variable delays
  • Many replication algorithms don't provide it (including Dynamo-style quorums)

ID Generators and Logical Clocks:

  • Single-node auto-increment is linearizable but not fault-tolerant
  • Distributed ID schemes (UUIDs, timestamp-based) don't guarantee linearizable ordering
  • Lamport clocks provide causally consistent ordering without linearizability
  • Hybrid logical clocks combine physical time with causal ordering
  • Linearizable ID generators require single-node with replication or Spanner-style synchronized clocks

Consensus:

A wide range of problems are equivalent to consensus:

  • Compare-and-set — atomically decide whether to update a value based on its current value
  • Locks and leases — decide which client acquires the lock
  • Uniqueness constraints — decide which transaction succeeds when several try to create conflicting records
  • Shared logs — decide on the order of log entries (total order broadcast)
  • Atomic commit — all nodes decide on the same outcome for a distributed transaction
  • Fetch-and-add — decide on the order of counter increments (consensus number = 2)

All are straightforward on a single node, but challenging to make fault-tolerant in a distributed setting.

Consensus algorithms:

  • Raft, Paxos, Viewstamped Replication, Zab — Widely used, battle-tested algorithms
  • Essentially single-leader replication with automatic leader election and failover
  • Ensure no committed writes lost and no split brain
  • Every write and linearizable read requires quorum vote
  • ZooKeeper, etcd, Consul built on consensus; provide locks, leases, failure detection, notifications

💡 Insight

Consensus is not always the right tool. For systems that don't need strong consistency, weaker models (leaderless, multi-leader) offer better availability and performance. The decision comes down to your application requirements: do you need linearizability, or can you tolerate eventual consistency?

When not to use consensus:

  • Strong consistency not needed
  • High availability more important than consistency
  • Latency-sensitive applications
  • Wide-area networks with variable delays

For such cases, leaderless or multi-leader replication with weaker consistency models is often more appropriate. The logical clocks we discussed (Lamport, hybrid) are helpful in that context.

Consensus algorithms are complicated and subtle, but they are supported by a rich body of theory developed since the 1980s. This theory makes it possible to build systems that tolerate all the faults discussed in Chapter 9 while ensuring data is not corrupted.

