
Chapter 9. The Trouble with Distributed Systems

They're funny things, Accidents. You never have them till you're having them.

Table of Contents

  1. Introduction
  2. Unreliable Networks
  3. Unreliable Clocks
  4. Knowledge, Truth, and Lies
  5. Summary

1. Introduction

In plain English:

  • Building on a single computer is like cooking alone in your kitchen
  • You control everything; if something goes wrong, you know what happened
  • Distributed systems are like coordinating a meal across multiple kitchens in different cities
  • Communication takes time, messages might get lost
  • You can never be entirely sure what's happening elsewhere

In technical terms:

  • Distributed systems introduce partial failures—some components fail while others continue
  • No shared memory
  • No global clock
  • No way to know the true state of the system at any given moment

Why it matters:

  • Understanding what can go wrong helps you design reliable software
  • Network failures, clock skew, and process pauses aren't theoretical edge cases
  • They happen regularly in production systems

The mindset shift required:

  • Making a system reliable means it continues working even when things go wrong
  • It's tempting to focus on the happy path—after all, most of the time things work
  • But in a large system, one-in-a-million events happen every day
  • Experienced operators know: anything that can go wrong will go wrong

💡 Insight

The defining characteristic of distributed systems is partial failure:

  • Some parts fail while others continue working
  • On a single machine, failures are typically total (the whole system crashes)
  • In distributed systems, you must design for the messy middle ground
  • You may not even be able to determine what succeeded and what failed

1.1. Faults and Partial Failures

In plain English:

  • When your laptop crashes, everything stops (blue screen or kernel panic)
  • Nothing works until you reboot
  • But in a distributed system, one server might crash while three others keep running
  • Or the network between two servers might fail while everything else works fine
  • These "partial failures" are much harder to reason about

In technical terms:

  • On a single computer, operations are deterministic (same operation → same result)
  • Hardware faults typically cause total failures
  • In distributed systems, nondeterministic partial failures are the norm
  • You try something involving multiple nodes—it may sometimes work, sometimes fail

Why it matters:

  • Traditional debugging techniques don't work well with partial failures
  • You can't just add logging and reproduce the issue
  • Failure might depend on rare timing conditions or network behavior

Diagram: single node vs. distributed system failures. On a single node, a hardware fault typically causes a total failure; in a distributed system, Node 2 may be down while Nodes 1 and 3 keep working (a partial fault).

Single computer vs. distributed system:

  Aspect          | Single Computer                | Distributed System
  Behavior        | Predictable: works or doesn't  | Nondeterministic partial failures
  Failure mode    | Total crash (by design)        | Some parts broken, others fine
  Reproducibility | Same operation → same result   | Same operation → sometimes works, sometimes fails

The trade-off:

  • Nondeterminism makes distributed systems hard to work with
  • But tolerating partial failures opens powerful possibilities:
    • Rolling upgrades (one node at a time)
    • Continuous availability during maintenance

2. Unreliable Networks

In plain English:

  • Think of network communication like sending mail through the postal service
  • You drop a letter in the mailbox, but no guarantee when (or if) it will arrive
  • The postal truck might break down, letter might get lost, or sit in sorting for days
  • The only way to know if it arrived is if the recipient sends a reply
  • But that reply might also get lost!

In technical terms:

  • Most distributed systems use asynchronous packet networks
  • Messages can be lost, delayed arbitrarily, or delivered out of order
  • The sender cannot determine whether a message was received
  • Unless the recipient explicitly responds—and that response might be lost too

Why it matters:

  • Every network operation can fail
  • You can't distinguish "request lost" from "server crashed" from "response delayed"
  • Your code must handle all these scenarios correctly

Key constraint: The network is the only way machines can communicate—there's no shared memory.

2.1. Network Failure Scenarios

Asynchronous packet networks (internet, datacenter networks):

  • One node can send a message (packet) to another node
  • Network gives no guarantees as to when it will arrive
  • Or whether it will arrive at all

Diagram: network failure scenarios. (A) The request is lost before it reaches the server. (B) The server has crashed. (C) The server processed the request, but the response is lost. In all three cases, the client observes the same thing: no response.

If you send a request and expect a response, many things could go wrong:

  1. Your request may have been lost (perhaps someone unplugged a network cable)
  2. Your request may be waiting in a queue and will be delivered later (perhaps the network or the recipient is overloaded)
  3. The remote node may have failed (perhaps it crashed or it was powered down)
  4. The remote node may have temporarily stopped responding (perhaps it is experiencing a long garbage collection pause)
  5. The remote node may have processed your request, but the response has been lost on the network
  6. The remote node may have processed your request, but the response has been delayed and will be delivered later

💡 Insight

From the sender's perspective, all network failures look identical: no response.

  • You cannot distinguish "request lost" from "server crashed" from "response delayed by 5 minutes"
  • This fundamental uncertainty shapes how we must design distributed systems
  • We can't rely on definitive knowledge about remote state

The only option:

  • Recipient must send a response (which itself might be lost or delayed)
  • All failure scenarios are indistinguishable: you just haven't received a response
  • Usual handling: timeout—give up waiting and assume response won't arrive
  • But even with timeout, you don't know if the remote node got your request

2.2. The Limitations of TCP

In plain English:

  • TCP is like a postal service with package tracking and redelivery
  • If a package gets lost, the service automatically sends another one
  • But the catch: tracking only tells you when it reaches the recipient's mailbox
  • Not when they actually open and read it
  • If the postal service gives up after too many failed attempts, you're back to square one

In technical terms:

  • TCP provides reliability at the transport layer:
    • Detects and retransmits dropped packets
    • Reorders out-of-sequence packets
    • Checks data integrity with checksums
  • But TCP can only guarantee delivery to the OS network stack
  • Not to the application itself
  • If TCP times out, you still don't know if your data was processed

Why it matters:

  • Many developers assume TCP makes the network "reliable"
  • But TCP can't solve application-level reliability problems
  • You still need timeouts, retries, and idempotence at the application layer

What TCP does provide:

  • Breaks large messages into packets
  • Reassembles them on receiving side
  • Congestion control, flow control, backpressure

TCP connection lifecycle:

  1. Application writes to the socket (data is buffered by the OS)
  2. TCP sends packets (paced by congestion control)
  3. Packets travel through the network (switches and routers)
  4. The receiving OS acknowledges packets and places data in its receive buffer
  5. The application reads the data (and may crash before processing it)

TCP's limitations:

  • TCP decides a packet was lost if no ACK arrives within timeout
  • But can't tell whether outbound packet or ACK was lost
  • If network cable is unplugged, TCP can't plug it back in
  • If connection closes with error, you don't know how much data was processed
  • Even if TCP ACKed the packet, that only means the OS kernel received it
  • The application may have crashed before handling that data
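
Because TCP alone cannot tell you whether your data was processed, the application layer has to handle timeouts, retries, and duplicates itself. Below is a minimal sketch of a retry loop that reuses the same idempotency key on every attempt, so a retry after a lost response does not apply the same change twice. The endpoint URL, the Idempotency-Key header name, and the request body are illustrative assumptions, not part of any particular system; only the pattern matters.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.UUID;

public class IdempotentRetry {
    // Hypothetical endpoint; the idempotency key lets the server deduplicate
    // retries, so a lost response does not turn into a duplicate payment.
    private static final String ENDPOINT = "https://example.com/payments";

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
        String idempotencyKey = UUID.randomUUID().toString();

        for (int attempt = 1; attempt <= 3; attempt++) {
            HttpRequest request = HttpRequest.newBuilder(URI.create(ENDPOINT))
                    .timeout(Duration.ofSeconds(5))            // application-level timeout
                    .header("Idempotency-Key", idempotencyKey) // same key on every retry
                    .POST(HttpRequest.BodyPublishers.ofString("{\"amount\": 42}"))
                    .build();
            try {
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println("Attempt " + attempt + ": HTTP " + response.statusCode());
                return; // success: stop retrying
            } catch (java.io.IOException e) {
                // Timeout or connection failure: we don't know whether the server
                // processed the request, so retry with the same idempotency key.
                System.out.println("Attempt " + attempt + " failed: " + e);
            }
        }
        System.out.println("Giving up; the outcome of the request is unknown.");
    }
}

The important detail is that the key stays constant across retries, so a server that did process the first attempt can recognize the second one as a duplicate rather than a new request.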

2.3. Network Faults in Practice

In plain English:

  • You might think modern datacenters with redundant connections would be reliable
  • In reality, network problems happen all the time
  • Squirrels chew through cables, construction crews cut fiber lines
  • Misconfigured switches create routing loops
  • Software upgrades temporarily disrupt connectivity

In technical terms:

  • Empirical studies show network faults occur regularly even in well-managed datacenters
  • About 12 faults per month in a medium-sized datacenter
  • Human error (misconfiguration) is a major contributor
  • Redundant hardware doesn't protect against configuration mistakes

Why it matters:

  • Network faults aren't theoretical corner cases—they're operational realities
  • Systems that don't handle network partitions gracefully will experience outages
  • And potentially data corruption in production

Decades of networking, and still unreliable:

  • One study in a medium-sized datacenter found about 12 network faults per month, of which half disconnected a single machine, and half disconnected an entire rack
  • Interruptions of wide-area fiber links have been blamed on cows, beavers, and sharks (though shark bites have become rarer due to better shielding of submarine cables)
  • Across different cloud regions, round-trip times of up to several minutes have been observed at high percentiles
  • Network interfaces can fail in asymmetric ways: sending packets successfully but dropping all incoming packets
  • Even a brief network interruption can have repercussions that last for much longer than the original issue

💡 Insight

Network partitions (netsplit) occur when one part of the network is cut off from the rest:

  • Particularly dangerous: nodes on different sides can both believe they are primary
  • Leads to split brain scenarios
  • The trade-off described by the CAP theorem (consistency versus availability) only arises when such partitions occur

Bottom line:

  • Even if faults are rare, your software must handle them
  • If error handling isn't defined and tested, arbitrarily bad things could happen:
    • Cluster could become deadlocked
    • Permanently unable to serve requests
    • Could even delete all your data

2.4. Detecting Faults

In plain English:

  • Imagine calling a friend
  • Busy signal → you know they're on another call
  • "Number not in service" → you know the number is wrong
  • But if it just rings and rings with no answer, you're stuck
  • Maybe they're ignoring you, phone is off, or no signal
  • Detecting faults in distributed systems has the same ambiguity

In technical terms:

  • Fault detection relies on timeouts (most failures don't generate explicit error signals)
  • Timeouts cannot distinguish between:
    • Network failures
    • Slow nodes
    • Actually crashed nodes
  • Some systems get explicit feedback (TCP RST, ICMP Destination Unreachable)
  • But these are the exception, not the rule

Why it matters:

  • False positives (declaring a live node dead) can be as harmful as false negatives
  • Choosing timeout values requires balancing detection speed vs. risk of premature declarations

Systems that need automatic fault detection:

  • A load balancer needs to stop sending requests to a node that is dead
  • In a distributed database with single-leader replication, if the leader fails, one of the followers needs to be promoted to be the new leader

The challenge: Network uncertainty makes it difficult to tell whether a node is working.

Sometimes you get explicit feedback:

Explicit Signals:
  • TCP RST/FIN packets (port not listening)
  • A node script notifies others of a crash (HBase)
  • Switch reports link down
  • ICMP Destination Unreachable

No Signal (Silent Failures):
  • Process paused (GC, swapping)
  • Network cable unplugged
  • Switch misconfigured
  • Node overloaded but alive

General approach:

  • Rapid feedback is useful but can't be counted on
  • Assume you will get no response at all
  • Retry a few times, wait for timeout to elapse
  • Eventually declare the node dead if no response within timeout
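
As a concrete illustration of "retry, then declare dead," the sketch below probes a node by attempting a TCP connection a few times with a timeout. The host and port are made-up values, and a successful connect only proves that the remote OS accepted the connection, not that the application on that node is healthy.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class NodeProbe {
    // Returns true if we got some response within the deadline; false means only
    // that we heard nothing: the node may still be alive but slow.
    static boolean seemsAlive(String host, int port, int attempts, int timeoutMillis) {
        for (int i = 0; i < attempts; i++) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), timeoutMillis);
                return true; // TCP handshake completed: the remote OS answered
            } catch (IOException e) {
                // Timeout or connection refused/reset: retry before giving up.
            }
        }
        return false; // declare the node dead (possibly a false positive)
    }

    public static void main(String[] args) {
        System.out.println(seemsAlive("10.0.0.5", 5432, 3, 1000)); // hypothetical database node
    }
}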

2.5. Timeouts and Unbounded Delays

In plain English:

  • Setting a timeout is like deciding how long to wait for a friend who's late to lunch
  • Wait too long → waste your time
  • Give up too quickly → they might show up just after you leave
  • In distributed systems, delays are unpredictable (milliseconds to minutes)

In technical terms:

  • Asynchronous networks have unbounded delays
  • No upper limit on how long a packet might take to arrive
  • Impossible to choose a "correct" timeout value
  • Short timeouts: detect faults faster, but risk false positives
  • Long timeouts: fewer false positives, but delay fault detection

Why it matters:

  • Prematurely declaring a node dead can cause cascading failures
  • (Transfer responsibilities to already-overloaded nodes)
  • Long timeouts delay recovery from genuine failures
  • Systems must experimentally determine appropriate timeouts

The timeout dilemma:

  • Long timeout → long wait until node declared dead (users see errors)
  • Short timeout → faster detection, but risk of incorrectly declaring a slow node dead

Timeout trade-offs:

  • Short timeout: fast detection, but false positives can trigger cascading failures
  • Long timeout: fewer false positives, but slow detection and long user-visible waits
  • Adaptive timeout: learns from history and adjusts to jitter (e.g., the Phi Accrual detector)

Why premature declarations are problematic:

  • If the node is actually alive and performing an action (e.g., sending an email)
  • And another node takes over
  • The action may end up being performed twice

Hypothetical bounded delay system:

  • If network guaranteed max delay d (packet delivered or lost within time d)
  • And non-failed node handles requests within time r
  • Then every successful request receives response within 2d + r
  • No response in that time → network or node is down
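
As a tiny worked example of that formula, with assumed (hypothetical) bounds for d and r:

public class BoundedDelayTimeout {
    public static void main(String[] args) {
        long d = 50;  // assumed maximum one-way network delay, in ms (hypothetical)
        long r = 200; // assumed maximum request-handling time, in ms (hypothetical)

        // Request travels at most d, handling takes at most r, response travels at most d.
        long timeout = 2 * d + r;
        System.out.println("No response within " + timeout + " ms implies a fault"); // 300 ms
    }
}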

Reality:

  • Most systems have neither of those guarantees
  • Asynchronous networks have unbounded delays
  • Servers cannot guarantee handling requests within max time

💡 Insight

Systems can use adaptive timeouts that learn from observed response times:

  • The Phi Accrual failure detector (used in Akka and Cassandra)
  • Maintains a sliding window of response times
  • Calculates a suspicion level rather than binary alive/dead
  • Provides more nuanced failure detection than fixed timeouts
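
The sketch below captures the spirit of an adaptive timeout: keep a sliding window of observed response times and set the timeout to the mean plus four standard deviations. It is a simplification for illustration only, not the actual Phi Accrual algorithm, which computes a continuous suspicion value (phi) from the observed distribution of heartbeat arrival times.

import java.util.ArrayDeque;
import java.util.Deque;

public class AdaptiveTimeout {
    private final Deque<Long> window = new ArrayDeque<>();
    private final int windowSize;

    public AdaptiveTimeout(int windowSize) {
        this.windowSize = windowSize;
    }

    // Record an observed round-trip time, keeping only the most recent samples.
    public void recordResponseTime(long millis) {
        window.addLast(millis);
        if (window.size() > windowSize) {
            window.removeFirst();
        }
    }

    // Timeout = mean + 4 standard deviations of the recent samples.
    public long currentTimeoutMillis() {
        if (window.isEmpty()) return 1000; // default before any samples arrive
        double mean = window.stream().mapToLong(Long::longValue).average().orElse(0);
        double variance = window.stream()
                .mapToDouble(t -> (t - mean) * (t - mean))
                .average().orElse(0);
        return (long) Math.ceil(mean + 4 * Math.sqrt(variance));
    }
}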

2.6. Network Congestion and Queueing

In plain English:

  • Think of network delays like traffic congestion on highways
  • Most of the time traffic flows smoothly
  • But during rush hour, a 20-minute drive might take an hour
  • The problem isn't that cars are moving slower
  • It's that they're waiting in queues at every intersection
  • Networks work the same way: packets spend most time in queues, not traveling

In technical terms:

  • Variable network delays are primarily caused by queueing:
    • Switch queues when multiple senders target same destination
    • OS queues when CPU cores are busy
    • Hypervisor queues in virtualized environments
    • TCP send queues due to congestion control
  • These queues can build up quickly under load
  • Delays can vary by orders of magnitude

Why it matters:

  • In a heavily utilized system, queueing delays dominate transmission times
  • A packet that normally takes 1ms might take 100ms+ when queues fill up
  • This variability makes timeout selection especially challenging in multi-tenant environments

Queueing points in the network:

  • TCP send queue: congestion control limits the send rate, so data is buffered before it even enters the network
  • Switch queue: several senders target the same destination; if the queue fills up, packets are dropped
  • OS receive queue: CPU cores are busy, so packets are buffered until the application is ready to handle them
  • Hypervisor queue: the VM is paused while another VM runs, and incoming data is queued during the pause

Specific queueing scenarios:

  1. Switch congestion: If several different nodes simultaneously try to send packets to the same destination, the network switch must queue them up and feed them into the destination network link one by one. If there is so much incoming data that the switch queue fills up, the packet is dropped.

  2. CPU contention: When a packet reaches the destination machine, if all CPU cores are currently busy, the incoming request from the network is queued by the operating system until the application is ready to handle it.

  3. VM pauses: In virtualized environments, a running operating system is often paused for tens of milliseconds while another virtual machine uses a CPU core. During this time, the VM cannot consume any data from the network, so the incoming data is queued.

  4. TCP backpressure: TCP limits the rate at which it sends data to avoid overloading the network, which means additional queueing at the sender before the data even enters the network.

💡 Insight

In public clouds and multitenant datacenters, you share resources with other customers:

  • The "noisy neighbor" problem
  • A customer near you using lots of bandwidth can dramatically increase your delays
  • You have no visibility into or control over this
  • Timeout values that work in a private datacenter might be too aggressive in the cloud

The capacity factor:

  • System with spare capacity can easily drain queues
  • Highly utilized system → long queues build up quickly

How to choose timeouts:

  • Can only be determined experimentally
  • Measure RTT distribution over extended period, across many machines
  • Take into account your application's characteristics
  • Balance failure detection delay vs. risk of premature timeouts

Better approach: Use adaptive timeouts

  • Continually measure response times and their variability (jitter)
  • Automatically adjust timeouts based on observed distribution

2.7. Synchronous vs. Asynchronous Networks

In plain English:

  • Traditional telephone networks are like trains running on a fixed schedule
  • When you make a call, the network reserves a dedicated "track" (bandwidth)
  • Internet networks are like buses that share the road
  • Packets jostle with each other for space—more efficient but less predictable

In technical terms:

  • Telephone networks use circuit switching
    • Fixed bandwidth allocated for duration of connection
    • Provides bounded latency
  • IP networks use packet switching
    • Opportunistically uses whatever bandwidth is available
    • Better utilization but unbounded delays due to queueing

Why it matters:

  • Why don't we build networks with bounded delays like telephone systems?
  • Answer: economics—packet switching provides much better resource utilization
  • For bursty data traffic, circuit switching would waste bandwidth
  • And make data transfers unnecessarily slow

How circuit switching works:

  • Telephone call establishes a circuit
  • Fixed, guaranteed bandwidth allocated along the entire route
  • Circuit remains in place until call ends
  • Example: ISDN runs at 4,000 frames per second; each call is allocated 16 bits within each frame (4,000 × 16 bits = 64 kbit/s)

Circuit switching vs. packet switching:

  • Circuit switching (telephone): fixed bandwidth reserved, no queueing, bounded delay, but wasted capacity
  • Packet switching (internet): bandwidth shared dynamically, queueing occurs, unbounded delay, but better utilization

Synchronous networks:

  • Data passes through routers without suffering from queueing
  • Because bandwidth is already reserved at each hop
  • Maximum end-to-end latency is fixed → bounded delay

Why datacenter/internet use packet switching:

  • Optimized for bursty traffic
  • Audio/video calls need constant bits per second → circuits work well
  • But web pages, emails, file transfers don't have bandwidth requirements
  • We just want them to complete as quickly as possible

The problem with circuits for data:

  • Would have to guess bandwidth allocation
  • Guess too low → transfer unnecessarily slow, wasted capacity
  • Guess too high → circuit cannot be set up (the network cannot guarantee that much bandwidth)
  • TCP dynamically adapts to available capacity → much better for bursty traffic

💡 Insight

Variable delays in networks are not a law of nature:

  • Simply the result of a cost/benefit trade-off
  • We could build networks with bounded delays (circuit switching, QoS)
  • But it would be significantly more expensive and provide lower utilization
  • Most non-safety-critical systems choose cheap and unreliable over expensive and reliable

3. Unreliable Clocks

In plain English:

  • Imagine two friends trying to meet up, but their watches show different times
  • One arrives at 3:00 PM their time, the other at 3:00 PM their time
  • But they're actually 10 minutes apart
  • In distributed systems, every computer has its own clock
  • They're all slightly different—some run faster, some slower
  • This makes it surprisingly difficult to determine the order of events

In technical terms:

  • Clocks serve two purposes:
    • Measuring durations (timeouts, latencies)
    • Determining points in time (timestamps, cache expiry)
  • Physical clocks drift from the true time
  • Synchronization via NTP is imperfect
  • This makes clocks unsuitable for ordering events across nodes without additional mechanisms

Why it matters:

  • Relying on unsynchronized clocks can cause data corruption
  • Including lost updates in databases using last-write-wins
  • Understanding clock behavior is essential for building correct distributed systems

Applications depend on clocks for:

  • Has this request timed out yet?
  • What's the 99th percentile response time of this service?
  • When was this article published?
  • When does this cache entry expire?
  • What is the timestamp on this error message in the log file?

The challenges of time in distributed systems:

  • Communication is not instantaneous (message travel time)
  • Message received is always later than sent, but we don't know how much later
  • Each machine has its own clock (quartz crystal oscillator)
  • These devices are not perfectly accurate
  • Each machine has its own notion of time (may be faster or slower than others)

3.1. Monotonic vs. Time-of-Day Clocks

In plain English:

  • Computers have two types of clocks, like a wall clock and a stopwatch
  • Wall clock (time-of-day): tells you "it's 3:00 PM" but can jump backward
  • Stopwatch (monotonic): only measures duration, always moves forward
  • Never use the wall clock to measure how long something took
  • Never use the stopwatch to know what time it is

In technical terms:

  • Time-of-day clocks: return current date/time (wall-clock time)
    • Can jump backward due to NTP corrections or leap seconds
  • Monotonic clocks: measure elapsed time only
    • Guaranteed to always move forward
    • NTP may adjust their rate (slewing) but won't jump

Why it matters:

  • Using time-of-day for durations can cause negative elapsed times
  • Or timeouts that never expire
  • Use monotonic for elapsed time (timeouts, latencies)
  • Use time-of-day only when you need actual calendar time

Key distinction: Both measure time, but serve different purposes.

🕐 Time-of-Day Clock
Returns current date/time (wall-clock time)
  • clock_gettime(CLOCK_REALTIME)
  • System.currentTimeMillis()
  • Synchronized with NTP
  • Can jump backward
  • Use for: timestamps, calendar time

⏱️ Monotonic Clock
Measures elapsed time (duration)
  • clock_gettime(CLOCK_MONOTONIC)
  • System.nanoTime()
  • Always moves forward
  • NTP can adjust rate (slewing)
  • Use for: timeouts, measuring latency

Time-of-day clocks return the current date and time according to some calendar. For example, clock_gettime(CLOCK_REALTIME) on Linux and System.currentTimeMillis() in Java return the number of seconds (or milliseconds) since the epoch: midnight UTC on January 1, 1970.

Time-of-day clocks are usually synchronized with NTP, which means that a timestamp from one machine (ideally) means the same as a timestamp on another machine. However, time-of-day clocks have various oddities. In particular, if the local clock is too far ahead of the NTP server, it may be forcibly reset and appear to jump back to a previous point in time. These jumps make time-of-day clocks unsuitable for measuring elapsed time.

Monotonic clocks are suitable for measuring a duration (time interval), such as a timeout or a service's response time. The name comes from the fact that they are guaranteed to always move forward (whereas a time-of-day clock may jump back in time).

You can check the value of the monotonic clock at one point in time, do something, and then check the clock again at a later time. The difference between the two values tells you how much time elapsed between the two checks. However, the absolute value of the clock is meaningless: it might be the number of nanoseconds since the computer was booted up. In particular, it makes no sense to compare monotonic clock values from two different computers.

💡 Insight

The distinction between these clock types is crucial but often overlooked. Many bugs arise from using System.currentTimeMillis() to measure timeouts: if NTP steps the clock backward during the timeout period, the timeout may fire much later than intended (or not at all), and if it steps the clock forward, the timeout may fire prematurely. Always use monotonic clocks for measuring elapsed time.
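
A minimal sketch of the right and wrong way to measure a duration, using the two Java APIs mentioned above:

public class ElapsedTime {
    public static void main(String[] args) throws InterruptedException {
        // Wrong: wall-clock time can be stepped by NTP while we are waiting,
        // so the computed "elapsed" time may be too small or even negative.
        long wallStart = System.currentTimeMillis();

        // Right: the monotonic clock only ever moves forward, so the difference
        // between two readings is a valid duration.
        long monoStart = System.nanoTime();

        Thread.sleep(100); // stand-in for the operation being measured

        long wallElapsed = System.currentTimeMillis() - wallStart;
        long monoElapsedMillis = (System.nanoTime() - monoStart) / 1_000_000;

        System.out.println("wall clock says:      " + wallElapsed + " ms (unreliable)");
        System.out.println("monotonic clock says: " + monoElapsedMillis + " ms");
    }
}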

3.2. Clock Synchronization and Accuracy

In plain English: Computer clocks are like cheap wristwatches—they drift over time, running a bit fast or slow. NTP is like periodically checking your watch against an atomic clock and adjusting it. But just like you can't instantly adjust your watch to be perfectly accurate, computer clocks can't be perfectly synchronized either. They might drift by milliseconds or even seconds between synchronizations.

In technical terms: Quartz clock drift can reach 200 ppm (17 seconds per day). NTP can synchronize clocks to within tens of milliseconds over the internet, or better with local GPS receivers. However, NTP's accuracy is limited by network round-trip time, and misconfiguration or network issues can cause large clock errors.

Why it matters: If you assume clocks are accurate to better than they actually are, you'll experience subtle data corruption bugs. Systems like Spanner that rely on clock synchronization must explicitly account for uncertainty bounds and use hardware (GPS, atomic clocks) to minimize them.

Monotonic clocks don't need synchronization, but time-of-day clocks need to be set according to an NTP server or other external time source in order to be useful. Unfortunately, our methods for getting a clock to tell the correct time aren't nearly as reliable or accurate as you might hope—hardware clocks and NTP can be fickle beasts.

⚠️ Clock Accuracy Problems
  • Quartz drift: up to 17 s/day
  • NTP server might be wrong
  • Network delays limit NTP accuracy
  • Leap seconds cause time jumps
  • VM clock can jump on resume
  • Firewalled off from NTP = unbounded drift

High-Accuracy Solutions
  • GPS receivers in the datacenter
  • Atomic clocks (expensive)
  • PTP (Precision Time Protocol)
  • Google TrueTime API
  • AWS ClockBound
  • Regular monitoring of clock skew

To give just a few examples of clock accuracy problems:

  • The quartz clock in a computer drifts (runs faster or slower than it should). Google assumes a clock drift of up to 200 ppm (parts per million) for its servers, which is equivalent to 6 ms drift for a clock that is resynchronized every 30 seconds, or 17 seconds drift for a clock that is resynchronized once a day.

  • If a computer's clock differs too much from an NTP server, it may refuse to synchronize, or the local clock will be forcibly reset. Any applications observing the time before and after this reset may see time go backward or suddenly jump forward.

  • If a node is accidentally firewalled off from NTP servers, the misconfiguration may go unnoticed for some time, during which the drift may add up to large discrepancies between different nodes' clocks.

  • NTP synchronization can only be as good as the network delay. One experiment showed that a minimum error of 35 ms is achievable when synchronizing over the internet, though occasional spikes in network delay lead to errors of around a second.

  • Leap seconds result in a minute that is 59 seconds or 61 seconds long, which messes up timing assumptions. Leap seconds will no longer be used from 2035 onwards.

  • In virtual machines, the hardware clock is virtualized. When a VM is paused for several seconds, the clock may then be several seconds behind the actual time.

💡 Insight

It is possible to achieve very good clock accuracy with GPS receivers, atomic clocks, and the Precision Time Protocol (PTP). For example, MiFID II regulation requires high-frequency trading funds to synchronize their clocks to within 100 microseconds of UTC. However, this requires specialized hardware and careful monitoring—clock accuracy doesn't happen by default.

3.3. Relying on Synchronized Clocks

In plain English: Using clocks to order events is like using postmarks to determine which letter was sent first. It works if all the post offices have perfectly synchronized clocks, but in reality, one post office's clock might be 10 minutes fast. The letter postmarked "3:05 PM" might actually have been sent after the one marked "3:02 PM" from a different office.

In technical terms: Using timestamps from unsynchronized clocks to order events can violate causality. Last-write-wins (LWW) conflict resolution based on timestamps can silently discard writes from nodes with slow clocks. Logical clocks (version vectors, Lamport timestamps) provide correct ordering without relying on physical clocks.

Why it matters: Databases like Cassandra use client timestamps for conflict resolution, which means a client with a fast clock can overwrite more recent data written by a client with a slow clock. Understanding this helps you choose appropriate conflict resolution strategies.

The problem with clocks is that while they seem simple and easy to use, they have a surprising number of pitfalls: a day may not have exactly 86,400 seconds, time-of-day clocks may move backward in time, and the time according to one node's clock may be quite different from another node's clock.

Timestamps for ordering events

Let's consider one particular situation in which it is tempting, but dangerous, to rely on clocks: ordering of events across multiple nodes. For example, if two clients write to a distributed database, who got there first? Which write is the more recent one?

Diagram: clock skew causing incorrect ordering. Node 1 has a fast clock and Node 3 a slow one. Client A writes x=1 on Node 1 (timestamp 42.004); Client B then writes x=2 on Node 3 (timestamp 42.003). Node 2 receives both writes, and last-write-wins keeps x=1, discarding the causally later write x=2.

When a write is replicated to other nodes, it is tagged with a timestamp according to the time-of-day clock on the node where the write originated. Even if clock synchronization is very good (skew less than 3 ms), the following problem can occur:

Client A writes x = 1 on node 1, which assigns timestamp 42.004 seconds. This write is replicated to node 3. Client B then increments x on node 3 (we now have x = 2), which assigns timestamp 42.003 seconds (because node 3's clock is slightly behind node 1). When node 2 receives both writes, it incorrectly concludes that x = 1 is more recent and discards x = 2, so the increment is lost.

This problem can be prevented by ensuring that when a value is overwritten, the new value always has a higher timestamp than the overwritten value. However, some systems (like Cassandra and ScyllaDB) simply use client clock timestamps with last write wins, which has serious problems:

  1. Database writes can mysteriously disappear: A node with a lagging clock is unable to overwrite values previously written by a node with a fast clock until the clock skew has elapsed.

  2. LWW cannot distinguish sequential from concurrent writes: In the example above, client B's increment definitely occurred after client A's write, but LWW cannot detect this causality violation.

  3. Conflicts with same timestamp need a tiebreaker: If clocks only have millisecond resolution, two writes might get the same timestamp, requiring a random tiebreaker that can also violate causality.

💡 Insight

Logical clocks provide a safer alternative for ordering events. They're based on incrementing counters rather than physical time, and they only track relative ordering (whether one event happened before or after another), not absolute time. Version vectors and Lamport timestamps are examples of logical clocks that provide correct causality tracking without relying on synchronized clocks.
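
As a sketch of the idea, here is a minimal Lamport clock. The class and method names are illustrative; real systems embed this logic in their replication or messaging layer.

import java.util.concurrent.atomic.AtomicLong;

// Minimal Lamport clock sketch: each node keeps a counter, increments it for
// every local event, attaches it to outgoing messages, and fast-forwards its
// own counter when it sees a larger one. The resulting timestamps order events
// consistently with causality without relying on synchronized physical clocks.
public class LamportClock {
    private final AtomicLong counter = new AtomicLong(0);

    // Call for a local event (e.g., a write); returns the event's timestamp.
    public long tick() {
        return counter.incrementAndGet();
    }

    // Call when a message carrying a remote timestamp arrives.
    public long onReceive(long remoteTimestamp) {
        return counter.updateAndGet(local -> Math.max(local, remoteTimestamp) + 1);
    }
}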

Clock readings with a confidence interval

You may be able to read a machine's time-of-day clock with microsecond or even nanosecond resolution. But even if you can get such a fine-grained measurement, that doesn't mean the value is actually accurate to such precision. The drift in an imprecise quartz clock can easily be several milliseconds.

Thus, it doesn't make sense to think of a clock reading as a point in time—it is more like a range of times, within a confidence interval: for example, a system may be 95% confident that the time now is between 10.3 and 10.5 seconds past the minute, but it doesn't know any more precisely than that.

The TrueTime API in Google's Spanner and Amazon's ClockBound explicitly report the confidence interval on the local clock. When you ask it for the current time, you get back two values: [earliest, latest], which are the earliest possible and the latest possible timestamp. The width of the interval depends on how long it has been since the local quartz clock was last synchronized with a more accurate clock source.
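
To illustrate how such a reading can be used, here is a sketch with a hypothetical ClockInterval type standing in for the [earliest, latest] pair; this is not the actual TrueTime or ClockBound API.

// Hypothetical clock-with-uncertainty reading, mirroring the [earliest, latest]
// pair described above (not the actual TrueTime or ClockBound API).
public record ClockInterval(long earliestMillis, long latestMillis) {

    // We can only be sure this reading is earlier than another one if the two
    // confidence intervals do not overlap at all.
    public boolean definitelyBefore(ClockInterval other) {
        return this.latestMillis < other.earliestMillis;
    }

    public boolean overlaps(ClockInterval other) {
        return !this.definitelyBefore(other) && !other.definitelyBefore(this);
    }
}

Only non-overlapping intervals give a definite order; overlapping intervals mean the order is unknown, which is exactly the case Spanner waits out, as described next.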

Synchronized clocks for global snapshots

Spanner implements snapshot isolation across datacenters by using the clock's confidence interval and deliberately waiting for the length of the confidence interval before committing a read-write transaction. By doing so, it ensures that any transaction that may read the data is at a sufficiently later time, so their confidence intervals do not overlap.

In order to keep the wait time as short as possible, Spanner needs to keep the clock uncertainty as small as possible; for this purpose, Google deploys a GPS receiver or atomic clock in each datacenter, allowing clocks to be synchronized to within about 7 ms.

3.4. Process Pauses

In plain English: Imagine you're in the middle of a sentence, and suddenly you black out for 15 seconds. When you wake up, you have no idea any time has passed—you just continue your sentence as if nothing happened. Your listeners are confused because you've been standing there silently for 15 seconds. This is what happens to programs when they pause unexpectedly—the program doesn't realize time has passed, but the rest of the world has moved on.

In technical terms: Threads can be paused for arbitrary lengths of time due to garbage collection, virtualization, swapping, I/O waits, or OS scheduling. During this pause, the program makes no progress but has no way to detect the pause until it checks the clock afterward. This can cause leases to expire, timeouts to fire spuriously, and race conditions.

Why it matters: You cannot assume that code executes without interruption. Any operation that depends on timing (like "check if lease is valid, then process request") can be broken by a pause between the check and the action. This requires defensive programming techniques like fencing tokens.

Let's consider another example of dangerous clock use in a distributed system. Say you have a database with a single leader per shard. Only the leader is allowed to accept writes. How does a node know that it is still leader, and that it may safely accept writes?

One option is for the leader to obtain a lease from the other nodes, which is similar to a lock with a timeout. Only one node can hold the lease at any one time. In order to remain leader, the node must periodically renew the lease before it expires. If the node fails, it stops renewing the lease, so another node can take over when it expires.

while (true) {
    request = getIncomingRequest();

    // Ensure that the lease always has at least 10 seconds remaining
    if (lease.expiryTimeMillis - System.currentTimeMillis() < 10000) {
        lease = lease.renew();
    }

    if (lease.isValid()) {
        process(request);
    }
}

What's wrong with this code? The code assumes that very little time passes between checking the time and processing the request. However, what if there is an unexpected pause in the execution of the program?

How a process pause breaks lease safety:

  1. Check lease validity: the lease expires in 15 seconds, so it looks safe to proceed
  2. The thread pauses for 15 seconds (GC pause, VM suspension, disk I/O, etc.)
  3. The thread resumes, with no knowledge that time has passed
  4. The request is processed, but the lease has expired and another node is now leader
  5. Split brain: two nodes both think they are leader

For example, imagine the thread stops for 15 seconds around the line lease.isValid() before finally continuing. In that case, it's likely that the lease will have expired by the time the request is processed, and another node has already taken over as leader. However, there is nothing to tell this thread that it was paused for so long, so this code won't notice that the lease has expired until the next iteration of the loop.

Is it reasonable to assume that a thread might be paused for so long? Unfortunately yes. There are various reasons why this could happen:

🗑️ Garbage Collection Pauses
  • Stop-the-world GC (Java, Go)
  • Can last milliseconds to minutes
  • Improved with modern collectors (G1, ZGC)
  • Can be mitigated with careful tuning

☁️ Virtualization
  • VM suspended and resumed
  • Live migration to another host
  • Steal time from the hypervisor
  • Unpredictable pause duration

💻 Operating System
  • Context switches to other threads
  • Swapping (paging) to disk
  • Synchronous disk I/O
  • SIGSTOP signal (Ctrl-Z)

⚙️ Hardware Issues
  • Laptop lid closed
  • Thermal throttling
  • Driver bugs
  • Page faults

All of these occurrences can preempt the running thread at any point and resume it at some later time, without the thread even noticing. A node in a distributed system must assume that its execution can be paused for a significant length of time at any point, even in the middle of a function.

💡 Insight

When writing multi-threaded code on a single machine, we have good tools for making it thread-safe: mutexes, semaphores, atomic counters, lock-free data structures. Unfortunately, these tools don't directly translate to distributed systems, because a distributed system has no shared memory—only messages sent over an unreliable network. You cannot use a mutex to protect against concurrent writes when both writers might believe they hold the lease.

Response time guarantees

Some software runs in environments where a failure to respond within a specified time can cause serious damage: computers that control aircraft, rockets, robots, cars, and other physical objects must respond quickly and predictably to their sensor inputs. These are so-called hard real-time systems.

Providing real-time guarantees in a system requires support from all levels of the software stack:

  • A real-time operating system (RTOS) that allows processes to be scheduled with guaranteed CPU time
  • Library functions must document their worst-case execution times
  • Dynamic memory allocation may be restricted or disallowed entirely
  • Real-time garbage collectors exist, but require careful memory management
  • Enormous amounts of testing and measurement

For most server-side data processing systems, real-time guarantees are simply not economical or appropriate. Consequently, these systems must suffer the pauses and clock instability that come from operating in a non-real-time environment.

Limiting the impact of garbage collection

One approach is to treat GC pauses like brief planned outages of a node, and to let other nodes handle requests from clients while one node is collecting its garbage. If the runtime can warn the application that a node soon requires a GC pause, the application can stop sending new requests to that node, wait for it to finish processing outstanding requests, and then perform the GC while no requests are in progress.

A variant of this idea is to use the garbage collector only for short-lived objects (which are fast to collect) and to restart processes periodically, before they accumulate enough long-lived objects to require a full GC. One node can be restarted at a time, and traffic can be shifted away from the node before the planned restart.


4. Knowledge, Truth, and Lies

In plain English: In a distributed system, you can never be completely sure about anything happening on other nodes. It's like being in a dark room with other people where you can only communicate by shouting—you might not hear their response, they might be asleep, or they might be shouting back but you can't hear them. The philosophical question becomes: how can you make reliable decisions when you can't trust your observations?

In technical terms: Nodes in a distributed system cannot have perfect knowledge about the system state. They can only make guesses based on messages received (or not received). A node cannot reliably distinguish between "the other node is down," "the network is down," and "the other node is slow." Algorithms must be designed to work correctly despite this fundamental uncertainty.

Why it matters: Many bugs in distributed systems come from implicit assumptions that nodes can know things they cannot actually know. Quorum-based decision making and consensus algorithms provide ways to make progress despite imperfect knowledge.

So far in this chapter we have explored the ways in which distributed systems are different from programs running on a single computer: there is no shared memory, only message passing via an unreliable network with variable delays, and the systems may suffer from partial failures, unreliable clocks, and processing pauses.

The consequences of these issues are profoundly disorienting if you're not used to distributed systems. A node in the network cannot know anything for sure about other nodes—it can only make guesses based on the messages it receives (or doesn't receive). A node can only find out what state another node is in by exchanging messages with it. If a remote node doesn't respond, there is no way of knowing what state it is in.

💡 Insight

Discussions of distributed systems border on the philosophical: What do we know to be true in our system? How sure can we be of that knowledge? The key insight is that we must design systems that provide useful guarantees even when built from unreliable components. This is achieved through careful abstraction—defining a system model that captures the essential properties we need.

4.1. The Majority Rules

In plain English: Imagine three judges on a panel. If one judge falls asleep, the other two can still make a decision. If two judges say "guilty," that's the verdict—even if the third judge wakes up later and disagrees. Distributed systems work the same way: instead of letting one node make critical decisions, we require a majority vote. This way, even if some nodes fail or are slow, the system can still make progress.

In technical terms: A node cannot exclusively rely on its own judgment because it may be experiencing a network partition, a long GC pause, or other issues that make it diverge from the rest of the system. Quorum-based algorithms require a majority (more than half) of nodes to agree before making decisions. Since there can only be one majority, this prevents conflicting decisions (split brain).

Why it matters: Quorum-based consensus is the foundation of reliable distributed systems. Understanding why majorities are special (they're the smallest group that prevents two disjoint groups from both making decisions) helps you design correct distributed algorithms.

Imagine a network with an asymmetric fault: a node is able to receive all messages sent to it, but any outgoing messages from that node are dropped or delayed. Even though that node is working perfectly well and receiving requests, the other nodes cannot hear its responses. After some timeout, the other nodes declare it dead.

The situation unfolds like a nightmare: the semi-disconnected node is dragged to the graveyard, kicking and screaming "I'm not dead!"—but since nobody can hear its screaming, the funeral procession continues with stoic determination.

Why a node cannot trust its own judgment:

  • From the node's perspective: it is healthy, receiving requests, and processing them normally
  • From the other nodes' perspective: no responses arrive, the timeout elapses, and the node is declared dead

The moral of these stories is that a node cannot necessarily trust its own judgment of a situation. A distributed system cannot exclusively rely on a single node, because a node may fail at any time, potentially leaving the system stuck and unable to recover.

Instead, many distributed algorithms rely on a quorum, that is, voting among the nodes: decisions require some minimum number of votes from several nodes in order to reduce the dependence on any one particular node.

Most commonly, the quorum is an absolute majority of more than half the nodes (although other kinds of quorums are possible). A majority quorum allows the system to continue working if a minority of nodes are faulty:

  • With 3 nodes, 1 faulty node can be tolerated (2 out of 3 nodes form a majority)
  • With 5 nodes, 2 faulty nodes can be tolerated (3 out of 5 nodes form a majority)

A quorum's decision may contradict an individual node's view (for example, the quorum may declare a node dead even though that node believes it is alive), but the outcome is still safe, because there can only be one majority in the system: there cannot be two majorities with conflicting decisions at the same time.

💡 Insight

Why is a majority special? Because it's the smallest quorum that prevents split brain. If you required only 2 out of 5 nodes to agree, you could have two disjoint groups of 2 nodes each making conflicting decisions. If you required 4 out of 5, you couldn't tolerate even a single failure. A majority (3 out of 5) is the sweet spot: it tolerates f failures in a system of 2f+1 nodes, while ensuring only one group can reach quorum.
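
A small sketch of that arithmetic:

public class Quorum {
    // Smallest number of votes that guarantees any two quorums intersect.
    static int majority(int totalNodes) {
        return totalNodes / 2 + 1;
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 5, 7}) {
            int q = majority(n);
            System.out.println(n + " nodes: quorum " + q + ", tolerates " + (n - q) + " failures");
        }
    }
}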

4.2. Distributed Locks and Leases

In plain English: A lease is like renting an apartment with a time limit. You have exclusive access until the lease expires, at which point someone else can take over. But there's a catch: if you fall asleep for a long time, you might wake up and think you still have the apartment, but your lease has expired and someone else has moved in. If you both think you own the apartment, chaos ensues.

In technical terms: Leases provide time-limited exclusive access to a resource. However, if the leaseholder experiences a process pause (GC, VM suspension), it may believe its lease is still valid when it has actually expired, and another node has acquired the lease. This creates a split brain scenario where two nodes both believe they hold the lease.

Why it matters: Simply checking lease validity and then acting on it is insufficient because a pause can occur between the check and the action. Correct implementations require additional mechanisms like fencing tokens to prevent zombies (expired leaseholders) from corrupting data.

You can use leases in situations where a system requires there to be only one of some thing. For example:

  • Only one node is allowed to be the leader for a database shard, to avoid split brain
  • Only one transaction or client is allowed to update a particular resource, to prevent concurrent writes
  • Only one node should process a given input file, to avoid wasted effort

How an incorrect lock implementation corrupts data:

  1. Client 1 acquires a lease; the lock service grants it for 30 seconds
  2. Client 1 starts writing to storage
  3. Client 1 pauses (e.g., a stop-the-world GC) and makes no progress for 40 seconds
  4. The lease expires after 30 seconds; Client 2 acquires it and starts writing
  5. Client 1 resumes, unaware that time has passed, and continues writing
  6. Both clients write concurrently and the file is corrupted by conflicting writes

It is worth thinking carefully about what happens if several nodes simultaneously believe that they hold the lease, perhaps due to a process pause. In some cases (like processing the same file twice), the consequence is only wasted computational resources. But in others, the consequence could be lost or corrupted data.

4.3. Fencing Tokens

In plain English: Imagine apartment keys that have serial numbers stamped on them. Each time someone rents the apartment, they get a key with a higher number. Now, even if an old tenant wakes up from a long nap and tries to use their key, the apartment's smart lock can check the number and refuse entry because a newer key has already been issued. This is how fencing tokens work—they let the storage system reject requests from expired leaseholders.

In technical terms: A fencing token is a monotonically increasing number returned by the lock service each time a lease is granted. Clients must include their token with every write request. The storage service tracks the highest token it has seen and rejects writes with lower tokens, preventing zombies and delayed requests from corrupting data.

Why it matters: Fencing tokens provide a robust solution to the distributed lock problem that doesn't rely on perfect timing or the absence of pauses. This pattern is widely applicable: similar mechanisms appear in consensus algorithms (ballot numbers in Paxos, term numbers in Raft).

The term zombie is sometimes used to describe a former leaseholder who has not yet found out that it lost the lease, and who is still acting as if it was the current leaseholder. Since we cannot rule out zombies entirely, we have to instead ensure that they can't do any damage in the form of split brain. This is called fencing off the zombie.

How fencing tokens prevent corruption:

  1. Client 1 gets a lease with fencing token 33
  2. Client 1 pauses (a long GC pause) and its lease expires
  3. Client 2 gets a new lease with fencing token 34
  4. Client 2 writes with token 34; the storage service accepts it as the highest token seen so far
  5. Client 1 resumes and writes with token 33; the storage service rejects it because token 34 has already been seen

Let's assume that every time the lock service grants a lock or lease, it also returns a fencing token, which is a number that increases every time a lock is granted (e.g., incremented by the lock service). We can then require that every time a client sends a write request to the storage service, it must include its current fencing token.

In this way, the storage service can reject requests with old tokens. If ZooKeeper is your lock service, you can use the transaction ID zxid or the node version cversion as fencing token. With etcd, the revision number along with the lease ID serves a similar purpose.
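
A minimal sketch of the storage-side check described above; the class and method names are illustrative, and a real storage service would apply the token check and the write atomically.

import java.util.concurrent.atomic.AtomicLong;

// Remember the highest fencing token seen so far and reject any write that
// carries a smaller one, so a zombie holding an expired lease cannot overwrite
// newer data.
public class FencedStorage {
    private final AtomicLong highestTokenSeen = new AtomicLong(Long.MIN_VALUE);

    public boolean write(long fencingToken, String key, String value) {
        long previous = highestTokenSeen.getAndAccumulate(fencingToken, Math::max);
        if (fencingToken < previous) {
            return false; // stale token: this writer's lease has been superseded
        }
        // NOTE: in a real service the token check and the update would be one
        // atomic operation; this sketch separates them for readability.
        doWrite(key, value);
        return true;
    }

    private void doWrite(String key, String value) {
        System.out.println("wrote " + key + "=" + value);
    }
}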

💡 Insight

Fencing tokens are a specific instance of a general pattern in distributed systems: using sequence numbers to order operations. The same idea appears in:

  • Consensus algorithms: ballot numbers (Paxos), term numbers (Raft)
  • Databases: transaction IDs, LSNs (log sequence numbers)
  • Replication: replication offsets, version vectors

The key property: sequence numbers are monotonically increasing and assigned by a single authority, making them suitable for detecting stale operations.

Fencing with multiple replicas

If your clients need to write to multiple replicas, you can put the writer's fencing token in the most significant bits or digits of the timestamp. You can then be sure that any timestamp generated by the new leaseholder will be greater than any timestamp from the old leaseholder, even if the old leaseholder's writes happened later.

For example, imagine a leaderless replicated key-value store with last-write-wins conflict resolution. Client 2 has fencing token 34, so all of its timestamps start with 34... which are greater than any timestamps starting with 33... from Client 1. Client 2 writes to a quorum of replicas but can't reach Replica 3. When zombie Client 1 later tries to write, its write may succeed at Replica 3 even though it is ignored by replicas 1 and 2. This is not a problem, since a subsequent quorum read will prefer the write from Client 2 with the greater timestamp.
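
A sketch of that encoding, with an illustrative split of a 16-bit fencing token above 48 bits of wall-clock milliseconds (the bit widths are an assumption, not a fixed format):

public class FencedTimestamp {
    // Pack the fencing token into the most significant bits of the timestamp,
    // so every timestamp from lease 34 compares greater than any from lease 33,
    // regardless of the nodes' physical clocks.
    static long fencedTimestamp(int fencingToken, long wallClockMillis) {
        return ((long) fencingToken << 48) | (wallClockMillis & 0xFFFFFFFFFFFFL);
    }

    public static void main(String[] args) {
        long fromOldLeaseholder = fencedTimestamp(33, System.currentTimeMillis() + 5000);
        long fromNewLeaseholder = fencedTimestamp(34, System.currentTimeMillis());
        // The new leaseholder wins under last-write-wins even though the zombie's
        // physical clock reads a later time.
        System.out.println(fromNewLeaseholder > fromOldLeaseholder); // true
    }
}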

4.4. Byzantine Faults

In plain English: So far, we've assumed that when computers fail, they either stop working entirely or they respond slowly—but at least they're trying their best. Byzantine faults are when computers actively lie or try to deceive others, like a node that reports different values to different peers, or pretends to have data it doesn't actually have. It's the difference between a honest witness who might be confused versus one who's actively trying to mislead the court.

In technical terms: A Byzantine fault occurs when a node behaves in arbitrary or malicious ways: sending contradictory messages to different nodes, corrupting data, pretending to have data it doesn't have, or any other behavior that violates the protocol. Byzantine fault-tolerant algorithms can reach consensus even when some nodes are behaving maliciously.

Why it matters: Byzantine fault tolerance is essential in systems where nodes are controlled by mutually distrusting parties (blockchains, cryptocurrencies) or in aerospace systems where radiation might corrupt memory. However, for most datacenter applications, the cost of Byzantine fault tolerance outweighs the benefits, and we assume that nodes are honest even if they fail.

In this book we assume that nodes are unreliable but honest: they may be slow or never respond (due to a fault), and their state may be outdated (due to a GC pause or network delays), but we assume that if a node does respond, it is telling the "truth": to the best of its knowledge, it is playing by the rules of the protocol.

Distributed systems problems become much harder if there is a risk that nodes may "lie" (send arbitrary faulty or corrupted responses). Such behavior is known as a Byzantine fault, and the problem of reaching consensus in this untrusting environment is known as the Byzantine Generals Problem.

⚠️ When Byzantine Fault Tolerance Is Needed
  • Aerospace: radiation can corrupt memory
  • Cryptocurrencies: mutually untrusting parties
  • Peer-to-peer networks: no central authority
  • Multi-party computation

When It's Not Worth It
  • Datacenter: all nodes controlled by your organization
  • Very expensive to implement
  • Requires 3f+1 nodes to tolerate f Byzantine faults
  • Simpler solutions (authentication, encryption) work

A system is Byzantine fault-tolerant if it continues to operate correctly even if some of the nodes are malfunctioning and not obeying the protocol, or if malicious attackers are interfering with the network. This concern is relevant in certain specific circumstances, but in the kinds of systems we discuss in this book, we can usually safely assume that there are no Byzantine faults.

In a datacenter, all the nodes are controlled by your organization (so they can hopefully be trusted), and radiation levels are low enough that memory corruption is not a major problem. Protocols for making systems Byzantine fault-tolerant are quite expensive, and fault-tolerant embedded systems rely on support from the hardware level.

💡 Insight

Web applications do need to expect arbitrary and malicious behavior of clients (web browsers, mobile apps). This is why input validation, sanitization, and output escaping are so important. However, we typically don't use Byzantine fault-tolerant protocols here—we simply make the server the authority on deciding what client behavior is allowed, using traditional security mechanisms like authentication, authorization, and encryption.

Weak forms of lying

Although we assume that nodes are generally honest, it can be worth adding mechanisms to software that guard against weak forms of "lying"—for example, invalid messages due to hardware issues, software bugs, and misconfiguration:

  • Network packets do sometimes get corrupted. Usually caught by TCP/UDP checksums, but sometimes they evade detection. Application-level checksums and TLS provide additional protection.

  • A publicly accessible application must carefully sanitize inputs from users. An internal service behind a firewall may be able to get away with less strict checks, but basic validation is still a good idea.

  • NTP clients can be configured with multiple server addresses. When synchronizing, the client contacts all of them and checks that a majority agree on some time range, detecting misconfigured outliers.

4.5. System Model and Reality

In plain English: When designing a distributed algorithm, you need to specify what kinds of failures can happen (the rules of the game). Just like a chess game has rules about how pieces move, distributed systems have "rules" about what faults can occur. These rules are called the system model, and they help us reason about whether an algorithm will work correctly.

In technical terms: A system model is a formalization of the assumptions an algorithm makes about timing (synchronous, partially synchronous, asynchronous) and faults (crash-stop, crash-recovery, Byzantine). Algorithms are designed and proven correct within a specific system model. The partially synchronous model with crash-recovery faults is the most realistic for practical systems.

Why it matters: Understanding system models helps you evaluate whether a distributed algorithm is suitable for your use case. It also clarifies the gap between theory (where we prove correctness) and practice (where additional edge cases arise).

Algorithms need to be written in a way that does not depend too heavily on the details of the hardware and software configuration on which they are run. We do this by defining a system model, which is an abstraction that describes what things an algorithm may assume.

System Models

Timing models:
  • Synchronous: bounded delays
  • Partially synchronous: usually bounded
  • Asynchronous: no timing assumptions

Fault models:
  • Crash-stop: fails by stopping
  • Crash-recovery: stops, then later restarts
  • Fail-slow: responds, but too slowly
  • Byzantine: arbitrary behavior

Timing assumptions

Three system models are in common use:

Synchronous model: Assumes bounded network delay, bounded process pauses, and bounded clock error. This does not imply zero delay, just that delays never exceed some fixed upper bound. Not realistic for most practical systems.

Partially synchronous model: A system behaves like a synchronous system most of the time, but it sometimes exceeds the bounds for network delay, process pauses, and clock drift. This is a realistic model of many systems—most of the time things are well-behaved, but occasionally timing assumptions are violated.

Asynchronous model: An algorithm is not allowed to make any timing assumptions—in fact, it does not even have a clock. Some algorithms can be designed for the asynchronous model, but it is very restrictive.

Node failure models

Crash-stop faults: An algorithm may assume that a node can fail in only one way, namely by crashing. The node may suddenly stop responding at any moment, and thereafter that node is gone forever.

Crash-recovery faults: We assume that nodes may crash at any moment, and perhaps start responding again after some unknown time. Nodes are assumed to have stable storage (disk) that is preserved across crashes, while in-memory state is lost.

Degraded performance and partial functionality: Nodes may go slow (also called limping nodes, fail-slow, or gray failure): they may still respond to health checks while being too slow to do real work. A gigabit network link might drop to 1 Kb/s of throughput; a process under memory pressure might spend all its time in GC; a worn-out SSD can have erratic performance. Such degraded behavior can be even harder to deal with than a clean failure.

Byzantine faults: Nodes may do absolutely anything, including trying to trick and deceive other nodes.

💡 Insight

For modeling real systems, the partially synchronous model with crash-recovery faults is generally the most useful. It captures the essential challenges (unbounded delays, process pauses, node failures) while remaining tractable for algorithm design. Most consensus algorithms (Raft, Multi-Paxos, Viewstamped Replication) target this model.

Defining correctness

To define what it means for an algorithm to be correct, we can describe its properties. For example, if we are generating fencing tokens for a lock, we may require:

  • Uniqueness: No two requests for a fencing token return the same value
  • Monotonic sequence: If request x returned token t_x, and request y returned token t_y, and x completed before y began, then t_x < t_y
  • Availability: A node that requests a fencing token and does not crash eventually receives a response

An algorithm is correct in some system model if it always satisfies its properties in all situations that we assume may occur in that system model.
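
To make these properties concrete, here is a minimal single-process sketch in Python (all names are hypothetical) of a fencing-token issuer that satisfies uniqueness and the monotonic sequence property by construction. Availability is the hard part: this toy version is available only as long as its one process stays up, which is exactly why real systems delegate token issuance to a fault-tolerant service.

```python
import threading

class FencingTokenIssuer:
    """Toy issuer: a monotonically increasing counter behind a lock.

    Uniqueness and monotonic sequence hold by construction; availability
    holds only while this single process stays up, which is why a real
    system needs a fault-tolerant coordination service instead.
    """

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counter = 0

    def next_token(self) -> int:
        with self._lock:
            self._counter += 1
            return self._counter

issuer = FencingTokenIssuer()
t1 = issuer.next_token()
t2 = issuer.next_token()
assert t1 != t2   # uniqueness
assert t1 < t2    # monotonic sequence (t1's request completed before t2's began)
```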

Safety and liveness

It is worth distinguishing between two different kinds of properties: safety and liveness.

  • Safety: "Nothing bad happens." If a safety property is violated, we can point at a particular point in time at which it was broken. After a safety property has been violated, the violation cannot be undone—the damage is already done.

  • Liveness: "Something good eventually happens." A liveness property may not hold at some point in time, but there is always hope that it may be satisfied in the future (by receiving a response). Liveness properties often include the word "eventually."

Examples:

  • Uniqueness and monotonic sequence are safety properties
  • Availability is a liveness property
  • Eventual consistency is a liveness property

An advantage of distinguishing between safety and liveness is that it helps us deal with difficult system models. For distributed algorithms, it is common to require that safety properties always hold, in all possible situations of a system model. However, with liveness properties we are allowed to make caveats: for example, we could say that a request needs to receive a response only if a majority of nodes have not crashed, and only if the network eventually recovers from an outage.
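
The distinction can be made concrete with a small trace checker, sketched in Python with hypothetical event records: a safety violation is pinned to a specific position in a finite trace, whereas a liveness property such as availability can never be refuted by a finite trace alone, because the missing response might still arrive later.

```python
from typing import Optional

# Each event records a request id and the fencing token it received.
# Assumes the trace lists requests in the order in which they completed.
trace = [
    {"request": "a", "token": 33},
    {"request": "b", "token": 34},
    {"request": "c", "token": 34},  # duplicate token: safety violation here
]

def first_safety_violation(events) -> Optional[int]:
    """Return the index at which uniqueness or monotonicity is first broken."""
    seen_tokens = set()
    seen_max = None
    for i, event in enumerate(events):
        token = event["token"]
        if token in seen_tokens:
            return i                  # uniqueness broken at a specific point
        if seen_max is not None and token < seen_max:
            return i                  # monotonic sequence broken here
        seen_tokens.add(token)
        seen_max = token if seen_max is None else max(seen_max, token)
    return None

print(first_safety_violation(trace))  # -> 2: the damage is done at index 2

# Availability ("every non-crashed requester eventually gets a response")
# cannot be checked this way: a finite trace with a pending request proves
# nothing, since the response might still arrive in the future.
```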

4.6. Formal Methods and Testing

In plain English: How do you know your distributed algorithm works correctly? You can't just test it on your laptop—the bugs only appear under specific timing conditions in production. Two approaches help: (1) mathematically prove the algorithm correct (formal methods), and (2) systematically test it under many different failure scenarios (fault injection, simulation testing).

In technical terms: Formal verification uses mathematical proof techniques to show that an algorithm satisfies its properties in all states allowed by the system model. Complementary techniques include model checking (TLA+, FizzBee), fault injection (Jepsen), and deterministic simulation testing (FoundationDB, TigerBeetle, Antithesis).

Why it matters: Distributed systems have enormous state spaces—manual testing catches only a tiny fraction of possible bugs. Formal methods and advanced testing techniques help find bugs that would otherwise only manifest rarely in production, potentially causing data loss or corruption.

How do we know that an algorithm satisfies the required properties? Due to concurrency, partial failures, and network delays, there is a huge number of potential states. We need to guarantee that the properties hold in every possible state.

📋 Model Checking: write the algorithm's spec in TLA+, Gallina, or FizzBee
  • Systematically explores states
  • Finds subtle bugs
  • Used by AWS, Azure, CockroachDB
  • Caveat: the model may diverge from the code

💥 Fault Injection: inject failures into a running system
  • Network partitions, machine crashes, disk corruption
  • Jepsen framework
  • Netflix Chaos Monkey

🔄 Deterministic Simulation: test the actual code with controlled nondeterminism
  • Mock time, network, and I/O for reproducible failures
  • FoundationDB, TigerBeetle
  • Antithesis hypervisor
  • Can run faster than real time

Model checking

Model checkers are tools that help verify that an algorithm or system behaves as expected. An algorithm specification is written in a purpose-built language such as TLA+, Gallina, or FizzBee. Model checkers then systematically try all the things that could happen, verifying that invariants hold across all states.

Model checking cannot prove correctness outright when the state space is unbounded (it explores only a finite portion of it), but it strikes a good balance between ease of use and the ability to find non-obvious bugs. CockroachDB, TiDB, Kafka, and many other distributed systems have used model specifications to find and fix bugs.
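
To give a feel for what a model checker does, here is a deliberately tiny sketch in Python (not a real specification language, and all names are hypothetical). It exhaustively enumerates every interleaving of two clients that each read a shared counter and then write it back incremented, and it checks the token-uniqueness invariant in each run. Real model checkers perform the same kind of exhaustive exploration, but against a proper specification and with far more sophisticated state-space handling.

```python
import itertools

# Each client performs two steps: (1) read the shared counter, (2) write it
# back incremented and take that value as its token. Because the read and the
# write are separate steps, interleavings can overlap them -- exactly the kind
# of subtle bug exhaustive exploration is good at finding.

def run(schedule):
    """Execute one interleaving. `schedule` is a sequence of client ids,
    each appearing twice (once per step). Returns the tokens issued."""
    counter = 0
    local, tokens = {}, {}
    step = {0: 0, 1: 0}
    for client in schedule:
        if step[client] == 0:
            local[client] = counter          # step 1: read
        else:
            counter = local[client] + 1      # step 2: write back
            tokens[client] = counter
        step[client] += 1
    return tokens

violations = []
for schedule in set(itertools.permutations([0, 0, 1, 1])):
    tokens = run(schedule)
    if tokens[0] == tokens[1]:               # uniqueness invariant broken
        violations.append(schedule)

print(violations)  # e.g. (0, 1, 0, 1): both clients read 0 and issue token 1
```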

Fault injection

Fault injection verifies whether a system's implementation works as expected when things go wrong. The idea is simple: inject faults into a running system's environment and see how it behaves. Faults can be network failures, machine crashes, disk corruption, paused processes—anything you can imagine going wrong.

Jepsen is a popular fault injection framework that has been remarkably effective at finding critical bugs in many widely-used systems. It comes with integrations for various operating systems and many pre-built fault injectors.
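
As a small illustration of the idea (this is not Jepsen, just a hypothetical Python sketch), a fault-injecting wrapper can sit between a client and the operation it calls, randomly turning some calls into timeouts or connection errors so that retry and timeout handling actually gets exercised in tests.

```python
import random

class FaultInjectingClient:
    """Wraps an operation and randomly injects timeouts and connection errors,
    so tests exercise the error-handling paths instead of only the happy path."""

    def __init__(self, operation, fault_rate: float = 0.3, seed: int = 42) -> None:
        self.operation = operation
        self.fault_rate = fault_rate
        self.rng = random.Random(seed)   # seeded, so test runs are repeatable

    def call(self, *args, **kwargs):
        roll = self.rng.random()
        if roll < self.fault_rate / 2:
            raise TimeoutError("injected fault: request timed out")
        if roll < self.fault_rate:
            raise ConnectionError("injected fault: connection reset")
        return self.operation(*args, **kwargs)

# Usage: verify that the caller's retry logic copes with injected failures.
client = FaultInjectingClient(operation=lambda x: x * 2)
successes, failures = 0, 0
for i in range(100):
    try:
        client.call(i)
        successes += 1
    except (TimeoutError, ConnectionError):
        failures += 1
print(successes, failures)
```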

Deterministic simulation testing

Deterministic simulation testing (DST) uses a similar state space exploration process as a model checker, but it tests your actual code, not a model. A simulation automatically runs through a large number of randomized executions of the system. Network communication, I/O, and clock timing are all replaced with mocks that allow the simulator to control the exact order in which things happen.
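
The following sketch in Python (with hypothetical names) shows the core trick: all nondeterminism is funneled through a single seeded random number generator that decides which pending event is delivered next and which messages are dropped, so a run can be replayed exactly by reusing its seed.

```python
import random

class SimulatedNetwork:
    """Toy simulator: messages are queued rather than delivered immediately,
    and a seeded RNG decides delivery order and drops, so every run with the
    same seed replays the exact same sequence of events."""

    def __init__(self, seed: int, drop_probability: float = 0.1) -> None:
        self.rng = random.Random(seed)   # the only source of nondeterminism
        self.pending = []                # (destination_handler, message)
        self.drop_probability = drop_probability

    def send(self, handler, message) -> None:
        self.pending.append((handler, message))

    def run(self) -> None:
        while self.pending:
            # The simulator, not the OS scheduler, picks what happens next.
            index = self.rng.randrange(len(self.pending))
            handler, message = self.pending.pop(index)
            if self.rng.random() < self.drop_probability:
                continue                 # simulate a lost packet
            handler(message)

# Usage: explore many seeds; if seed 1234 triggers a bug, rerun with seed 1234
# and the exact same interleaving (and the bug) reappears.
received = []
net = SimulatedNetwork(seed=1234)
for i in range(5):
    net.send(received.append, f"msg-{i}")
net.run()
print(received)  # same output every time for the same seed
```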

If a test fails, it can be re-run with exactly the same sequence of events, since the simulator knows the exact order of operations that triggered the failure. Three strategies are commonly used:

  1. Application-level: Systems like FoundationDB and TigerBeetle are built with DST in mind, using asynchronous libraries that provide injection points for deterministic simulation.

  2. Runtime-level: Languages with asynchronous runtimes (Go, Rust/Tokio) can be patched to execute sequentially. FrostDB patches Go's runtime; Rust's madsim library provides deterministic implementations of Tokio and common libraries.

  3. Machine-level: Tools like Antithesis build a custom hypervisor that makes an entire machine deterministic by replacing all normally nondeterministic operations with deterministic ones.

💡 Insight

Determinism is a powerful idea that arises repeatedly in distributed systems:

  • Event sourcing: Deterministically replay events to rebuild state
  • Workflow engines: Deterministic workflow definitions enable durable execution
  • State machine replication: Execute the same deterministic operations on each replica
  • Deterministic simulation testing: Make the execution environment deterministic to enable reproducible testing

The common thread: eliminating nondeterminism makes systems easier to reason about, test, and debug.


5. Summary

In this chapter we have discussed a wide range of problems that can occur in distributed systems:

Network problems:

  • Whenever you try to send a packet over the network, it may be lost or arbitrarily delayed
  • If you don't get a reply, you have no idea whether the message got through—the request might have been lost, the remote node might be down, or the response might have been delayed
  • Timeouts are the only way to detect faults, but choosing appropriate timeout values is difficult

Clock problems:

  • A node's clock may be significantly out of sync with other nodes, despite NTP
  • Clocks may suddenly jump forward or back in time
  • Relying on clocks is dangerous because you most likely don't have a good measure of your clock's confidence interval
  • Time-of-day clocks can jump backward; use monotonic clocks for measuring durations

Process pauses:

  • A process may pause for a substantial amount of time at any point (GC, VM suspension, OS scheduling, swapping, I/O)
  • The node may be declared dead by other nodes, then come back to life without realizing it was paused
  • You cannot assume that execution proceeds without interruption

Three Sources of Unreliability
  • Unreliable networks: packets lost, arbitrary delays, partitions
  • Unreliable clocks: drift, backward jumps, skew between nodes
  • Process pauses: GC pauses, VM suspension, OS scheduling

The fact that such partial failures can occur is the defining characteristic of distributed systems. Whenever software tries to do anything involving other nodes, there is the possibility that it may occasionally fail, or randomly go slow, or not respond at all.

To tolerate faults, the first step is to detect them, but even that is hard. Most distributed algorithms rely on timeouts to determine whether a remote node is still available. However, timeouts can't distinguish between network and node failures, and variable network delay sometimes causes a node to be falsely suspected of crashing.

Once a fault is detected, making a system tolerate it is not easy either: there is no global variable, no shared memory, no common knowledge between machines. The only way information can flow from one node to another is by sending it over the unreliable network. Major decisions cannot be safely made by a single node, so we require protocols that enlist help from other nodes and try to get a quorum to agree.

💡 Insight

If you're used to writing software on a single computer, where operations are deterministic, moving to distributed systems can be disorienting. Conversely, distributed systems engineers will often regard a problem as trivial if it can be solved on a single computer. If you can avoid opening Pandora's box and simply keep things on a single machine (using an embedded storage engine, for example), it is generally worth doing so.

However, scalability is not the only reason for distributed systems. Fault tolerance (surviving machine failures) and low latency (placing data geographically close to users) cannot be achieved with a single node. The power of distributed systems is that in principle, they can run forever without service-level interruptions, because all faults and maintenance can be handled at the node level.

We also explored whether the unreliability of networks, clocks, and processes is an inevitable law of nature. We saw that it isn't: it is possible to give hard real-time response guarantees and bounded delays in networks, but doing so is very expensive and results in lower utilization of hardware resources. Most non-safety-critical systems choose cheap and unreliable over expensive and reliable.

In Chapter 10, we will turn our attention to consensus—getting multiple nodes to agree on something, despite all these reliability problems. We will examine algorithms that provide strong guarantees even when built from unreliable components, which is one of the most important and fascinating challenges in distributed systems.

