Chapter 10. Consistency and Consensus
An ancient adage warns, "Never go to sea with two chronometers; take one or three."
Table of Contents
- 1. Introduction
- 2. Linearizability
- 3. ID Generators and Logical Clocks
  - 3.1. Logical Clocks
  - 3.2. Linearizable ID Generators
- 4. Consensus
- 5. Summary
1. Introduction
In plain English: When you replicate data across multiple machines for fault tolerance, those copies can easily get out of sync. There are two philosophical approaches: either expose this complexity to developers (eventual consistency) or hide it behind strong guarantees (strong consistency). This chapter is about the strong consistency approach.
In technical terms: Lots of things can go wrong in distributed systems. If we want a service to continue working correctly despite those things going wrong, we need to find ways of tolerating faults. One of the best tools we have is replication—but having multiple copies of data opens up the risk of inconsistencies.
Why it matters: Strong consistency makes distributed systems behave like single-node systems, dramatically simplifying application logic. However, it comes with performance costs and availability trade-offs that you must understand to make informed architectural decisions.
At a high level, there are two competing philosophies for dealing with replication issues:
Eventual consistency:
- Multi-leader and leaderless replication
- Higher availability and performance
- More complex application logic
- Works with offline-capable apps
Strong consistency:
- Single-leader or consensus-based replication
- Simpler application development
- Performance cost and reduced availability
- Requires reliable network communication
💡 Insight
The choice between eventual and strong consistency is fundamentally about where you place complexity: in the infrastructure (consensus algorithms) or in the application code (conflict resolution). Neither is inherently better—it depends on your requirements, network reliability, and whether your application can tolerate stale reads.
In this chapter we will dive deeper into the strongly consistent approach, looking at three areas:
- Linearizability — A precise definition of "strong consistency" that makes replicated systems behave like single-node systems
- ID Generators and Timestamps — How to generate unique, correctly ordered identifiers in distributed systems
- Consensus Algorithms — How distributed systems can achieve linearizability while remaining fault-tolerant
Along the way, we will see that there are fundamental limits on what is possible in a distributed system.
2. Linearizability
In plain English: Imagine a database that looks like a single magic box, even though it's actually many machines. When one person writes something, everyone else immediately sees that new value. No stale reads, no confusion. That's linearizability.
In technical terms: Linearizability (also known as atomic consistency, strong consistency, immediate consistency, or external consistency) is the idea of making a system appear as if there were only one copy of the data, and all operations on it are atomic. With this guarantee, even though there may be multiple replicas in reality, the application does not need to worry about them.
Why it matters: In a linearizable system, as soon as one client successfully completes a write, all clients reading from the database must be able to see the value just written. This recency guarantee prevents confusing anomalies where users see data go "back in time."
Example: Nonlinearizable Sports Website
Aaliyah and Bryce are in the same room, both checking their phones for the final score of a game. Aaliyah refreshes the page, sees the result, and announces it. Bryce then hits reload on his own phone, but his request is routed to a lagging replica that still shows the game as in progress. If Aaliyah and Bryce had hit reload at the same time, different results would be unsurprising. However, Bryce knows that he hit reload after hearing Aaliyah exclaim the final score, and therefore he expects his query result to be at least as recent as Aaliyah's. The fact that his query returned a stale result is a violation of linearizability.
2.1. What Makes a System Linearizable?
Let's look at three clients concurrently reading and writing the same object x (could be one key in a key-value store, one row in a database, etc.).
Key operations:
- read(x) ⇒ v — Client reads register x, database returns value v
- write(x, v) ⇒ r — Client sets register x to value v, database returns response r (ok or error)
- cas(x, v_old, v_new) ⇒ r — Atomic compare-and-set: only updates x if its current value equals v_old
Linearizability rules:
- Before write completes: Reads return old value (0)
- After write completes: All reads must return new value (1)
- During write (concurrent): Reads may return either old or new value
- Once any read returns new value: All subsequent reads (on any client) must also return new value
💡 Insight
Think of linearizability as requiring an atomic "flip" moment during every write. Even if the write operation takes time to process, there must be some instant when the value switches from old to new. After that instant, no client can ever see the old value again—even if the write operation hasn't finished confirming to the writer.
The final read by Client B (returning 2) is not linearizable because Client A already read value 4 before B's read started. This is the same Aaliyah-and-Bryce situation: if one client sees a newer value, all subsequent reads must see at least that new value.
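To make the register interface above concrete, here is a minimal single-process sketch in Python (the class name and return values are illustrative assumptions, not an API from the book). A lock makes each operation take effect atomically, which is exactly the single-copy behavior that linearizability asks a replicated system to emulate.

```python
import threading

class LinearizableRegister:
    """Toy single-process register with the read/write/cas operations
    described above. A lock makes each operation take effect atomically,
    so there is a single instant at which every write "flips"."""

    def __init__(self, initial=0):
        self._value = initial
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def write(self, v):
        with self._lock:
            self._value = v
            return "ok"

    def cas(self, v_old, v_new):
        # Atomic compare-and-set: update only if the current value matches v_old.
        with self._lock:
            if self._value == v_old:
                self._value = v_new
                return "ok"
            return "error"
```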
2.2. Linearizability Versus Serializability
These are easily confused but quite different!
In plain English: Serializability is about multi-object transactions behaving as if they ran one-at-a-time (even if they actually overlapped). Linearizability is about single-object operations appearing instantaneous and respecting real-time order.
Combining both: A database may provide both serializability and linearizability—this combination is known as strict serializability or strong one-copy serializability (strong-1SR). Single-node databases typically provide this. With distributed databases, it requires expensive coordination.
2.3. Relying on Linearizability
In what circumstances is linearizability useful?
Locking and leader election:
- Must be linearizable—no split brain allowed
- ZooKeeper and etcd use consensus for this
- Implements distributed leases
- Fencing tokens prevent zombie leaders (a sketch follows the example below)
Constraints and uniqueness guarantees:
- Username must be unique
- File path must be unique
- Similar to acquiring a "lock" on the username
- Requires atomic compare-and-set
Cross-channel timing dependencies:
- Web upload → queue → transcoder reads from storage
- If storage is non-linearizable, race condition possible
- Transcoder might fetch old version or nothing
- Push notifications + data fetch have same issue
Example: Video transcoding race condition
This problem arises because there are two different communication channels between the web server and transcoder: the file storage and the message queue. Without linearizability, race conditions between these channels are possible.
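As a sketch of the fencing-token idea mentioned under locking and leader election (the class and method names are assumptions, not a real API), a storage service can simply reject writes that carry a token older than one it has already seen:

```python
class FencedStorage:
    """Toy storage service that rejects writes carrying a stale fencing token.
    Tokens are assumed to be issued by the lock service in increasing order
    each time the lease or lock is granted."""

    def __init__(self):
        self._data = {}
        self._highest_token_seen = {}

    def write(self, key, value, fencing_token):
        # Reject any write whose token is older than one already processed:
        # a "zombie" leader holding an expired lease cannot overwrite newer data.
        if fencing_token < self._highest_token_seen.get(key, 0):
            raise PermissionError("stale fencing token: lease no longer valid")
        self._highest_token_seen[key] = fencing_token
        self._data[key] = value
```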
2.4. Implementing Linearizable Systems
Let's revisit the replication methods from Chapter 6 and compare their linearizability:
- Single-leader replication: potentially linearizable, if you read from the leader (or from synchronously updated followers) and you are sure who the leader really is
- Consensus algorithms: linearizable by design
- Multi-leader replication: not linearizable, since writes are accepted concurrently on multiple nodes
- Leaderless replication: probably not linearizable, as discussed next
Linearizability and Quorums
Intuitively, quorum reads and writes (w + r > n) seem like they should be linearizable. However, they are not:
The quorum condition is met (w + r > n), but this execution is not linearizable: B's request begins after A's completes, but B returns the old value while A returns the new value.
To make Dynamo-style quorums linearizable (at significant performance cost):
- Readers must perform synchronous read repair before returning
- Writers must read latest state from quorum to fetch the latest timestamp
- Even then, compare-and-set cannot be implemented (requires consensus)
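A rough sketch of the synchronous read repair step described above, assuming hypothetical replica objects with get/put methods (a real client would issue these requests in parallel and handle failures):

```python
def quorum_read_with_repair(replicas, key, r):
    """Sketch of a quorum read that performs synchronous read repair before
    returning. Each replica is assumed to expose get(key) -> (value, timestamp)
    and put(key, value, timestamp)."""
    # Contact r replicas (here simply the first r, for brevity).
    responses = [(rep, rep.get(key)) for rep in replicas[:r]]
    # Pick the value with the latest timestamp among the responses.
    latest_value, latest_ts = max((resp for _, resp in responses),
                                  key=lambda resp: resp[1])
    # Synchronously write the newest value back to any stale replica before
    # returning, so that a later read cannot observe an older value.
    for rep, (value, ts) in responses:
        if ts < latest_ts:
            rep.put(key, latest_value, latest_ts)
    return latest_value
```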
💡 Insight
It is safest to assume that a leaderless system with Dynamo-style replication does not provide linearizability, even with quorum reads and writes.
2.5. The Cost of Linearizability
Consider a multi-region deployment with a network partition between regions:
With single-leader replication:
- Clients in the follower region cannot make writes (unavailable)
- Linearizable reads also unavailable (must contact leader)
- Only clients in the leader region can continue working normally
With multi-leader replication:
- Both regions can accept writes (available)
- Writes are queued and exchanged when network recovers
- But this is not linearizable
The CAP Theorem
In plain English: When your network is broken, you must choose: either maintain consistency (linearizability) but become unavailable in some regions, or remain available everywhere but give up consistency.
More precisely: The CAP theorem observes that if your application requires linearizability, and some replicas are disconnected due to a network problem, then those replicas cannot process requests (they become unavailable). This choice is sometimes known as CP (consistent under partitions).
If your application does not require linearizability, then each replica can process requests independently even when disconnected (AP: available under partitions).
Why CAP is unhelpful:
- It considers only one consistency model (linearizability) and one kind of fault (network partitions), and says nothing about network delays, crashed nodes, or other trade-offs
- "Pick 2 out of 3" is misleading: you cannot opt out of partitions; when one happens, the only real choice is between linearizability and availability
💡 Insight
CAP theorem has been superseded by more precise impossibility results. It's of mostly historical interest today. The real trade-off is not about network partitions (which are inevitable) but about choosing between linearizability and low latency.
Linearizability and Network Delays
Surprisingly, even RAM on modern multi-core CPUs is not linearizable! If a thread on one CPU core writes to a memory address, and a thread on another core reads the same address shortly after, it may not see the new value (unless a memory barrier is used).
Why? Every CPU core has its own cache and store buffer. Accessing cache is much faster than main memory, so this is essential for performance. But there are now multiple copies of data, asynchronously updated—linearizability is lost.
Key insight: The reason for dropping linearizability is performance, not fault tolerance.
The same is true of distributed databases that choose not to provide linearizable guarantees: they do so primarily to increase performance, not for fault tolerance.
Attiya and Welch proved: If you want linearizability, the response time of read and write requests is at least proportional to the uncertainty of delays in the network. In a network with highly variable delays (like most computer networks), linearizable reads and writes will inevitably be slow.
A faster algorithm for linearizability does not exist, but weaker consistency models can be much faster.
3. ID Generators and Logical Clocks
In plain English: In a single-node database, you can generate unique IDs by just incrementing a counter (1, 2, 3, ...). This is fast, compact, and the order tells you which records were created first. But in a distributed system, this simple approach doesn't work—you need more sophisticated techniques.
In technical terms: Many applications need to assign unique IDs to database records when they are created. In single-node databases, an auto-incrementing integer is common (64-bit or 32-bit). Another advantage is that the order of IDs reflects the order records were created.
Why it matters: Incorrectly ordered IDs can cause serious bugs, like showing a private photo to unauthorized viewers because its timestamp was assigned out of order relative to an account permission change.
Single-Node ID Generator Problems
A single-node ID generator is linearizable (each fetch-and-add operation atomically increments and returns the counter), but it has issues:
- It is a single point of failure: if the node is down, no new IDs can be generated
- It can become a throughput bottleneck, and clients in other regions must pay a network round trip for every ID
Alternative ID Generation Schemes
Common alternatives include preallocating blocks of IDs to each node, generating random UUIDs, and combining a wall-clock timestamp with a node ID and a counter (as in Snowflake-style schemes).
All these schemes generate unique IDs but have much weaker ordering guarantees than single-node auto-increment. As discussed in Chapter 9, wall-clock timestamps can provide at best approximate ordering due to clock skew and jumps.
3.1. Logical Clocks
In plain English: Instead of measuring time in seconds, a logical clock measures time in events. It doesn't tell you "what time is it?" but it can tell you "which happened first?"
In technical terms: While a physical clock is a hardware device that counts seconds (or milliseconds, microseconds), a logical clock is an algorithm that counts events that have occurred. Timestamps from a logical clock can be compared to determine ordering.
Requirements for a logical clock:
- Compact — A few bytes in size
- Unique — Every timestamp is different
- Totally ordered — Any two timestamps can be compared
- Causally consistent — If operation A happened before B, then A's timestamp < B's timestamp
Lamport Timestamps
How it works:
- Each node has a unique identifier and keeps a counter of the number of operations it has processed
- A Lamport timestamp is the pair (counter, node ID); node IDs break ties, so every timestamp is unique
- Every node and client tracks the maximum counter value it has seen so far and includes it in every request
- When a node receives a counter value greater than its own, it immediately increases its own counter to that maximum
(A minimal sketch follows the insight below.)
Comparing Lamport timestamps:
(2, "bryce") > (1, "aaliyah") // Compare counter first
(1, "caleb") > (1, "aaliyah") // If equal, compare node ID
💡 Insight
Lamport timestamps capture the happens-before relationship: if A happened before B (in the causal sense), then A's timestamp will be less than B's. But the reverse is not true—if A's timestamp is less than B's, A might have happened before B, or they might be concurrent.
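Here is a minimal Lamport clock sketch (illustrative Python, not code from the book), with timestamps as (counter, node ID) tuples so that Python's tuple comparison matches the comparison rules above:

```python
class LamportClock:
    """Minimal Lamport clock: a counter plus a node ID. Timestamps are
    (counter, node_id) tuples, compared lexicographically."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def now(self):
        # Generating a timestamp counts as an event: increment first.
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, remote_timestamp):
        # On receiving a timestamp from another node, move our counter
        # forward so that later local timestamps are greater than it.
        remote_counter, _ = remote_timestamp
        self.counter = max(self.counter, remote_counter)

# (2, "bryce") > (1, "aaliyah") and (1, "caleb") > (1, "aaliyah")
# both hold under Python's tuple comparison, matching the rules above.
```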
Hybrid Logical Clocks
Hybrid logical clocks combine physical time-of-day clocks with Lamport clock ordering (see the sketch at the end of this subsection):
Physical part:
- Counts microseconds like a normal clock
- Can find events by date/time
- Slowly drifts with the underlying physical clock
Logical part:
- Increments on every timestamp generation
- Moves forward if a remote timestamp is greater
- Ensures monotonic progress even if the clock jumps backward
Benefits:
- Treats timestamp almost like conventional time-of-day
- Ordering is consistent with happens-before relation
- Doesn't depend on special hardware (atomic clocks, GPS)
- Only requires roughly synchronized clocks
Used by: CockroachDB, Apache Cassandra (in some modes)
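The following is a sketch of how a hybrid logical clock might be implemented (a simplified rendering of the published HLC algorithm; the field and method names are assumptions):

```python
import time

class HybridLogicalClock:
    """Sketch of a hybrid logical clock: the physical part tracks the local
    time-of-day clock in microseconds, and the logical counter keeps
    timestamps unique and monotonic even if the physical clock stalls,
    jumps backward, or lags behind a remote node."""

    def __init__(self):
        self.physical = 0  # largest physical time used so far (microseconds)
        self.logical = 0   # counter that breaks ties and preserves monotonicity

    def _wall_clock_us(self):
        return int(time.time() * 1_000_000)

    def now(self):
        # Called when generating a timestamp for a local event or message.
        wall = self._wall_clock_us()
        prev = self.physical
        self.physical = max(prev, wall)
        self.logical = self.logical + 1 if self.physical == prev else 0
        return (self.physical, self.logical)

    def observe(self, remote):
        # Called on receiving a remote timestamp: move forward so that our
        # next timestamps are greater than anything seen so far.
        remote_physical, remote_logical = remote
        wall = self._wall_clock_us()
        prev = self.physical
        self.physical = max(prev, remote_physical, wall)
        if self.physical == prev and self.physical == remote_physical:
            self.logical = max(self.logical, remote_logical) + 1
        elif self.physical == prev:
            self.logical += 1
        elif self.physical == remote_physical:
            self.logical = remote_logical + 1
        else:
            self.logical = 0
        return (self.physical, self.logical)
```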
Lamport/Hybrid Clocks vs. Vector Clocks
- Lamport/Hybrid clocks: Provide total order but can't detect concurrency
- Vector clocks: Can detect when events are concurrent but timestamps are much larger (one integer per node)
Use vector clocks when you need to detect concurrent writes (as in conflict resolution). Use Lamport/hybrid clocks for transaction IDs in snapshot isolation.
3.2. Linearizable ID Generators
Although Lamport and hybrid logical clocks provide useful ordering, that ordering is weaker than linearizable:
- Linearizability: If request A completed before request B began, then B must have higher ID (even if A and B never communicated)
- Lamport clocks: Can only ensure greater timestamps than what a node has seen, not what it hasn't seen
Example: Privacy violation with non-linearizable IDs
Suppose a user first restricts who can see their photos and then uploads a new photo. If the ID generator is not linearizable, the photo may be assigned a lower ID than the permission change, making it appear to have been uploaded before the permissions were tightened, so it could be shown to viewers who are no longer authorized.
Implementing a Linearizable ID Generator
Simplest approach: Use a single node
Optimizations:
- Batch allocation: Write blocks of IDs (1-1000, then 1001-2000) to reduce disk writes
- Some IDs will be skipped if the node crashes, but no duplicates or out-of-order IDs
- Single node can handle large throughput since its job is simple
You cannot easily shard the ID generator—multiple shards independently handing out IDs can't guarantee linearizable ordering.
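A sketch of the batch-allocation idea, assuming a hypothetical durable key-value store for persisting the high-water mark:

```python
class BatchedIdGenerator:
    """Sketch of the batch-allocation optimization above. The end of the
    current block (a high-water mark) is durably written before any ID from
    the block is handed out, so a crash may skip some IDs but can never hand
    out a duplicate or out-of-order ID. `storage` is an assumed durable
    key-value store with get(key, default) and put(key, value)."""

    BLOCK_SIZE = 1000

    def __init__(self, storage):
        self.storage = storage
        # After a restart, resume from the last persisted high-water mark,
        # skipping whatever remained of the previous block.
        self.next_id = storage.get("high_water_mark", 0)
        self.block_end = self.next_id

    def next(self):
        if self.next_id >= self.block_end:
            # Reserve a new block of IDs with a single durable write.
            self.block_end = self.next_id + self.BLOCK_SIZE
            self.storage.put("high_water_mark", self.block_end)
        self.next_id += 1
        return self.next_id
```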
Alternative: Google Spanner approach
- Use physical clocks that return uncertainty interval (not single timestamp)
- Wait for uncertainty interval duration to elapse before returning
- Ensures if one request completes before another begins, the later has greater timestamp
- Requires tightly synchronized clocks (atomic clocks, GPS)
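A sketch of the commit-wait idea behind this approach, assuming a hypothetical `truetime_now()` function that returns the current time as an uncertainty interval:

```python
import time

def commit_wait(truetime_now):
    """Sketch of Spanner-style commit wait. `truetime_now` is an assumed
    function returning the current time as an interval (earliest, latest).
    We choose the interval's upper bound as the timestamp, then wait until
    the interval has moved entirely past it, so any request that starts
    after we return is guaranteed to receive a strictly greater timestamp."""
    _, latest = truetime_now()
    chosen_timestamp = latest
    # Block until we are certain the true time has passed the chosen timestamp.
    while truetime_now()[0] <= chosen_timestamp:
        time.sleep(0.001)
    return chosen_timestamp
```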
4. Consensus
In plain English: Consensus is about getting all the nodes in a distributed system to agree on something—who is the leader, what order events happened in, whether to commit a transaction. It's one of the most important and difficult problems in distributed computing.
In technical terms: Consensus is the fundamental problem underlying many distributed systems challenges: leader election, atomic commit, linearizable operations, and more. The best-known consensus algorithms are Viewstamped Replication, Paxos, Raft, and Zab.
Why it matters: Consensus algorithms enable automatic failover, preventing split brain while ensuring no committed data is lost. They're the foundation of coordination services like ZooKeeper and etcd, which in turn are used by countless distributed systems.
We have seen several examples of things that are easy with a single node but hard when you want fault tolerance:
- Linearizable compare-and-set operations and ID generators
- Locks, leases, and leader election without split brain
- Uniqueness constraints
- Atomically committing a transaction that spans multiple nodes
All of these are instances of consensus.
4.1. The Impossibility of Consensus
The FLP result (Fischer, Lynch, Paterson) proves that there is no algorithm that is always able to reach consensus if there is a risk that a node may crash.
How can consensus algorithms exist?
The FLP result doesn't say we can never reach consensus—only that we can't guarantee that a consensus algorithm will always terminate. The proof assumes:
- Deterministic algorithm in the asynchronous system model
- No clocks or timeouts
Ways around the impossibility:
- Use timeouts to suspect crashed nodes (even if suspicion is sometimes wrong)
- Use random numbers (randomized algorithms)
💡 Insight
Although FLP proves theoretical impossibility, distributed systems can usually achieve consensus in practice by using timeouts and failure detectors. The FLP result is important theoretically but doesn't prevent building real consensus systems.
4.2. The Many Faces of Consensus
Consensus can be expressed in several different ways—and surprisingly, these are all equivalent:
- Single-value consensus: several nodes agree on a single value
- Shared logs (total order broadcast): all nodes agree on the order of entries in an append-only log
- Atomic commitment: all participants in a distributed transaction agree on whether to commit or abort
If you have an algorithm that solves one of these problems, you can convert it into a solution for any of the others!
Single-Value Consensus
Formal properties:
- Uniform agreement: No two nodes decide differently
- Integrity: No node decides twice
- Validity: If a node decides value v, then v was proposed by some node
- Termination: Every node that does not crash eventually decides some value
The first three are safety properties (bad things don't happen). Termination is a liveness property (good things eventually happen).
Example uses:
- Multiple nodes race to become leader → consensus decides which wins
- Multiple people book the last airplane seat → consensus decides which succeeds
Shared Logs (Total Order Broadcast)
In plain English: A shared log is like a notebook that everyone can write in, but once something is written, it's permanent and everyone sees the same sequence of entries in the same order.
Properties:
- Once an entry is appended, it is durable: it cannot be changed or removed
- Every node reads the same entries in the same order
- An entry appended at one node is eventually delivered to all nodes (reliable delivery)
Implementing consensus from shared log:
- Every node that wants to propose a value requests it to be added to the log
- Whichever value appears first in the log is the decided value
- Since all nodes read the same order, they all agree
Implementing shared log from consensus:
- For every log slot, run a consensus instance to decide what goes in that slot
- When consensus decides for a slot, append it to the log
- If a value wasn't chosen, retry by proposing it for a later slot
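A toy sketch of the first conversion (single-value consensus on top of a shared log), assuming a hypothetical `shared_log` object:

```python
def decide(shared_log, my_value):
    """Each node appends the value it wants to propose, then decides whichever
    proposal appears first in the log. Because every node reads the log in the
    same order, they all reach the same decision. `shared_log` is an assumed
    object whose read() returns entries in delivery order, blocking until at
    least our own appended entry has been delivered."""
    shared_log.append(("propose", my_value))
    for kind, value in shared_log.read():
        if kind == "propose":
            return value  # the first proposal in the log wins
```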
Atomic Commitment
Properties (similar to consensus, but with a key difference):
- Uniform agreement: No two nodes decide differently (commit vs. abort)
- Integrity: A node cannot change its decision once made
- Validity: A node may only decide commit if all nodes voted to commit; if any node voted to abort, the decision must be abort
- Non-triviality: If all nodes vote to commit and no failures or timeouts occur, the decision must be commit (an algorithm that always aborts is not allowed)
- Termination: Every non-crashed node eventually decides
Key difference from consensus: With consensus, any proposed value is acceptable. With atomic commit, must abort if any participant voted abort.
Implementing from consensus:
- Every node sends its vote (commit/abort) to all others
- Nodes that receive "commit" from all propose "commit" to consensus
- Nodes that receive "abort" or timeout propose "abort"
- Consensus decides; all nodes commit or abort accordingly
4.3. Consensus in Practice
Most consensus systems provide shared logs (total order broadcast):
- Raft, Viewstamped Replication, Zab: Shared logs directly
- Paxos: Single-value consensus, but Multi-Paxos provides shared log
- Used by: ZooKeeper (Zab), etcd (Raft), Consul (Raft)
Using Shared Logs
A shared log is a versatile building block: if every replica processes the same log entries in the same order using deterministic logic, the replicas remain consistent (state machine replication). The same idea can be used for database replication, for serializable transactions (append transactions to the log and execute them in log order), and for locks and leases, where the log position of the entry that acquired a lock can serve as its fencing token.
From Single-Leader Replication to Consensus
Challenge: Single-leader replication is straightforward, but how do we provide fault tolerance through automatic failover without risking split brain?
Solution: Epochs and two rounds of voting
Key properties:
- Within each epoch, leader is unique
- Higher epoch prevails in conflicts
- Before appending, leader checks no higher epoch exists (via quorum vote)
- The two rounds of voting look superficially like 2PC, but they are very different: consensus needs only a quorum (not all nodes), and any node can start an election
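A deliberately simplified sketch of the epoch check before appending, using a hypothetical replica interface (real algorithms such as Raft and Paxos handle many more cases, including log repair and retries):

```python
def append_as_leader(my_epoch, entry, replicas, quorum_size):
    """Before treating an entry as committed, the leader checks with a quorum
    that no higher epoch exists. Each replica is assumed to expose
    promised_epoch() and accept(epoch, entry)."""
    acks = 0
    for replica in replicas:
        if replica.promised_epoch() > my_epoch:
            # Another leader with a higher epoch has been elected: step down.
            raise RuntimeError("deposed: a higher epoch exists")
        if replica.accept(my_epoch, entry):
            acks += 1
    # The entry counts as committed only once a quorum has accepted it.
    return acks >= quorum_size
```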
Subtleties:
- New leader catching up: Raft requires new leader to have up-to-date log; Paxos allows any node but requires catching up first
- Linearizable reads: Must go through quorum vote (or lease-based optimization)
- Reconfiguration: Can add/remove nodes dynamically (useful for multi-region expansion)
💡 Insight
Consensus algorithms are essentially "single-leader replication done right"—with automatic leader election, no split brain possible, and no committed data lost during failover. The complexity comes from handling all the edge cases correctly.
Pros and Cons of Consensus
Advantages:
- Automatic failover without data loss
- Prevents split brain
- Ensures linearizability
- Battle-tested algorithms with formal proofs
Disadvantages:
- Every write (and every linearizable read) requires a quorum vote, which adds latency, especially across regions
- A majority of nodes must be reachable to make any progress (e.g., at least three nodes to tolerate one failure)
- Unreliable networks with variable delays can trigger frequent leader elections, hurting performance or even preventing progress
For systems that want high availability but don't want the cost of consensus, the alternative is weaker consistency models (leaderless or multi-leader replication).
4.4. Coordination Services
In plain English: Coordination services like ZooKeeper and etcd are specialized databases designed to help distributed systems coordinate with each other—not for storing your application data, but for storing tiny amounts of critical coordination metadata.
Examples: ZooKeeper, etcd, Consul
Why not a regular database? Coordination services combine consensus with features specifically useful for distributed coordination:
- Linearizable atomic operations, such as compare-and-set, used to implement locks and leases
- A total order of operations, which provides fencing tokens for leases
- Failure detection via client session heartbeats, with ephemeral entries that are removed when a session expires
- Change notifications, so clients can watch for keys being created, modified, or deleted
Common Use Cases
- Leader election and automatic failover
- Assigning partitions or work to nodes, and rebalancing when nodes join or fail
- Service discovery and cluster membership
Design characteristics:
- Hold small amounts of data (fits in memory)
- Data is replicated to 3-5 nodes using consensus
- Fixed set of nodes regardless of application cluster size
- Slow-changing data (changes on timescale of minutes/hours)
- For fast-changing state, use conventional databases instead
Service discovery note: Using consensus for service discovery is often overkill. DNS-based service discovery with caching is usually sufficient since linearizability isn't needed for finding services.
ZooKeeper observers: Read-only replicas that don't participate in voting; provide high read throughput and availability for non-linearizable reads (useful for service discovery).
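As an illustration, acquiring a distributed lock through ZooKeeper might look like the following sketch using the kazoo Python client (hosts, paths, and identifiers are placeholders; consult the library's documentation for the exact API):

```python
from kazoo.client import KazooClient

def do_critical_work():
    # Placeholder for work that must only run on one node at a time.
    print("holding the lock")

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

lock = zk.Lock("/app/locks/resource-42", identifier="worker-1")
with lock:  # blocks until the lock is acquired
    # Only one client holds the lock at a time; ZooKeeper's consensus layer
    # prevents split brain even if some of its own nodes fail.
    do_critical_work()

zk.stop()
```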
5. Summary
In this chapter we examined strong consistency in fault-tolerant systems: what it is and how to achieve it.
Linearizability:
- Makes replicated data appear as a single copy with atomic operations
- Useful for avoiding race conditions (e.g., uniqueness constraints, leader election)
- Has downsides: slow, especially with large network delays
- Many replication algorithms don't provide it (including Dynamo-style quorums)
ID Generators and Logical Clocks:
- Single-node auto-increment is linearizable but not fault-tolerant
- Distributed ID schemes (UUIDs, timestamp-based) don't guarantee linearizable ordering
- Lamport clocks provide causally consistent ordering without linearizability
- Hybrid logical clocks combine physical time with causal ordering
- Linearizable ID generators require single-node with replication or Spanner-style synchronized clocks
Consensus:
A wide range of problems are equivalent to consensus:
- Linearizable compare-and-set operations and ID generators
- Atomic transaction commit
- Total order broadcast (shared logs)
- Locks, leases, and leader election
- Uniqueness constraints
All are straightforward on a single node, but challenging to make fault-tolerant in a distributed setting.
Consensus algorithms:
- Raft, Paxos, Viewstamped Replication, Zab — Widely used, battle-tested algorithms
- Essentially single-leader replication with automatic leader election and failover
- Ensure no committed writes lost and no split brain
- Every write and linearizable read requires a quorum vote
- ZooKeeper, etcd, Consul built on consensus; provide locks, leases, failure detection, notifications
💡 Insight
Consensus is not always the right tool. For systems that don't need strong consistency, weaker models (leaderless, multi-leader) offer better availability and performance. The decision comes down to your application requirements: do you need linearizability, or can you tolerate eventual consistency?
When not to use consensus:
- Strong consistency not needed
- High availability more important than consistency
- Latency-sensitive applications
- Wide-area networks with variable delays
For such cases, leaderless or multi-leader replication with weaker consistency models is often more appropriate. The logical clocks we discussed (Lamport, hybrid) are helpful in that context.
Consensus algorithms are complicated and subtle, but they are supported by a rich body of theory developed since the 1980s. This theory makes it possible to build systems that tolerate all the faults discussed in Chapter 9 while ensuring data is not corrupted.