Chapter 8. Transactions

Some authors have claimed that general two-phase commit is too expensive to support, because of the performance or availability problems that it brings. We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions.

— James C. Corbett et al., Spanner: Google's Globally-Distributed Database (2012)

Table of Contents

  1. Introduction
  2. What Exactly Is a Transaction?
  3. Weak Isolation Levels
  4. Serializability
  5. Distributed Transactions
  6. Summary

1. Introduction

In plain English: Think of a transaction like making multiple bank transfers in one atomic operation. Either all the transfers complete successfully (you get your money and the recipient gets theirs), or none of them happen (everyone's balance stays the same). You never end up in a weird state where your money disappeared but the recipient didn't receive it.

In technical terms: A transaction is a way for an application to group several reads and writes together into a logical unit. Conceptually, all the reads and writes in a transaction are executed as one operation: either the entire transaction succeeds (commit) or it fails (abort, rollback).

Why it matters: Without transactions, you'd need to manually handle partial failures, network interruptions, concurrent writes, and countless edge cases. Transactions simplify error handling dramatically—if something goes wrong, just abort and retry. This reduces an enormous problem space down to simple retry logic.
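
As a concrete illustration, here is a minimal sketch of a money transfer expressed as one transaction. The accounts table and its columns are hypothetical, and the exact syntax varies slightly between databases:

BEGIN TRANSACTION;

UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;

-- If anything fails before this point, issue ROLLBACK (or let the database
-- abort the transaction) and neither balance changes.
COMMIT;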

1.1. What Can Go Wrong

In the harsh reality of data systems, many things can go wrong:

PROBLEMS IN DATA SYSTEMS
🔥
Hardware Failures
  • Database crashes mid-write
  • Disk becomes full
  • Power failure
💥
Software Crashes
  • App crashes halfway through
  • Process terminated
  • Out of memory errors
📡
Network Issues
  • Connection interrupted
  • Timeout while writing
  • Packets lost
🔀
Concurrency Problems
  • Clients overwrite each other
  • Read partially updated data
  • Race conditions

💡 Insight

The Post Office Horizon scandal, in which faulty accounting software caused massive financial losses and wrongful prosecutions, has been attributed in part to bugs and missing transactional safeguards in the Horizon system. This isn't just theoretical: weak transaction guarantees can destroy businesses and lives.

1.2. The Purpose of Transactions

For decades, transactions have been the mechanism of choice for simplifying these issues.

What transactions provide:

  1. All-or-nothing guarantee: Either the entire transaction succeeds or it fails cleanly
  2. Isolation from concurrency: The database handles concurrent access so you don't have to
  3. Simple error handling: If something fails, just retry the whole transaction
  4. Human fault tolerance: Deploy buggy code? Roll back and rerun with the fix

When you might not need transactions:

  • Very simple access patterns (single record reads/writes)
  • Systems where eventual consistency is acceptable
  • Use cases requiring maximum performance/availability

However, figuring out whether you need transactions requires understanding exactly what safety guarantees they provide and what costs they impose. Let's dive deep.


2. What Exactly Is a Transaction?

2.1. The Evolution of Transactions

Almost all relational databases today follow the transaction model introduced in 1975 by IBM System R—the first SQL database. Despite 50 years of evolution, the general idea remains remarkably similar across MySQL, PostgreSQL, Oracle, and SQL Server.

The NoSQL era (late 2000s):

In the late 2000s, nonrelational (NoSQL) databases gained popularity by offering:

  • New data models (documents, graphs, key-value)
  • Built-in replication and sharding
  • Higher performance and availability

The cost: Many abandoned transactions entirely, leading to a popular belief that transactions were fundamentally unscalable.

The NewSQL renaissance (2010s+):

EVOLUTION OF TRANSACTION SUPPORT
1
1975-2005: Traditional ACID
Single-node databases, strong guarantees, limited scale
2
2005-2015: NoSQL Era
Distributed systems, weak guarantees, high scale
3
2015-Present: NewSQL
Distributed + ACID, proving transactions can scale

NewSQL databases like CockroachDB, TiDB, Spanner, FoundationDB, and YugabyteDB proved that transactional systems can scale to massive data volumes and high throughput by combining sharding with consensus protocols.

💡 Insight

The key lesson: Transactions aren't fundamentally unscalable—early NoSQL systems abandoned them prematurely. Modern distributed databases show you can have both ACID guarantees and horizontal scalability, though with careful engineering and trade-offs.

2.2. The Meaning of ACID

The safety guarantees provided by transactions are described by the acronym ACID: Atomicity, Consistency, Isolation, and Durability. Coined in 1983 by Theo Härder and Andreas Reuter, it established precise terminology for fault-tolerance mechanisms.

The problem: In practice, one database's ACID implementation differs significantly from another's. When a system claims to be "ACID compliant," it's often unclear what guarantees you actually get. ACID has unfortunately become mostly a marketing term.

Systems that don't meet ACID criteria are sometimes called BASE (Basically Available, Soft state, Eventual consistency). This is even more vague than ACID—the only sensible definition of BASE is "not ACID."

Let's break down what each letter actually means:

2.2.1. Atomicity

In plain English: Think of atomicity like a light switch—it's either on or off, never halfway. If you're making multiple database changes and something fails partway through, atomicity ensures all changes are rolled back, leaving no partial state.

In technical terms: ACID atomicity describes what happens if a client wants to make several writes, but a fault occurs after some writes have been processed. If the writes are grouped into an atomic transaction and the transaction cannot be completed due to a fault, the transaction is aborted and the database must discard or undo any writes it has made.

Why "abortability" would be a better term:

ATOMICITY IN ACTION
Without Atomicity
Transfer $100
✓ Deduct from Account A
✗ CRASH!
Money disappeared!
With Atomicity
Transfer $100
✓ Deduct from Account A
✗ CRASH!
↻ Automatic rollback
Both accounts unchanged

Key point: Without atomicity, if an error occurs partway through multiple changes, it's difficult to know which changes took effect. The application could retry, but that risks making the same change twice. Atomicity simplifies this: if a transaction was aborted, nothing changed, so it's safe to retry.

Note: In multi-threaded programming, "atomic" means indivisible operations that other threads can't observe partially. In ACID, atomicity is NOT about concurrency—that's covered by Isolation (the "I" in ACID). ACID atomicity is about fault handling.

2.2.2. Consistency

The problem: "Consistency" is terribly overloaded. It means at least five different things:

FIVE MEANINGS OF 'CONSISTENCY'
Replica Consistency
Eventual consistency in async replication (Chapter 6)
Consistent Snapshot
Backup at one moment in time, respecting happens-before
Consistent Hashing
Sharding approach for rebalancing (Chapter 7)
CAP Consistency
Linearizability in CAP theorem (Chapter 10)
ACID Consistency
Application-specific invariants always satisfied

In plain English: ACID consistency means your data follows the business rules you've defined. For example, in an accounting system, credits and debits must always balance. If a transaction starts with valid data and preserves validity during execution, the database stays in a "good state."

In technical terms: ACID consistency refers to application-specific invariants that must always be true. If you declare these as constraints (foreign keys, uniqueness, check constraints), the database enforces them. For more complex invariants, it's the application's responsibility to define transactions correctly.

Why it matters: The "C" in ACID is somewhat misplaced—consistency is actually a property of the application, not just the database. The database provides mechanisms (constraints, triggers), but ultimately the application must use transactions correctly to preserve consistency.
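
Where an invariant can be expressed declaratively, you can ask the database to enforce it. A minimal sketch, using hypothetical accounts and transfers tables:

CREATE TABLE accounts (
    account_id BIGINT PRIMARY KEY,
    balance    NUMERIC NOT NULL CHECK (balance >= 0)  -- invariant: no negative balances
);

CREATE TABLE transfers (
    transfer_id  BIGINT PRIMARY KEY,
    from_account BIGINT NOT NULL REFERENCES accounts (account_id),
    to_account   BIGINT NOT NULL REFERENCES accounts (account_id),
    amount       NUMERIC NOT NULL CHECK (amount > 0)
);

Invariants that span several rows (for example, credits and debits always balancing across accounts) usually cannot be expressed as simple constraints, which is why correct use of transactions in application code remains essential.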

💡 Insight

Consistency is the odd one out in ACID—Atomicity, Isolation, and Durability are properties the database guarantees, but Consistency depends on the application using the database correctly. If you write bad data that violates your invariants but haven't declared those invariants, the database can't stop you.

2.2.3. Isolation

In plain English: Imagine you and a friend are both trying to increment a counter at the exact same time. Without isolation, you might both read "42", both compute "43", and both write "43"—so the counter only increases by 1 instead of 2. Isolation prevents these race conditions.

In technical terms: Isolation means concurrently executing transactions are isolated from each other—they cannot step on each other's toes. The classic formalization is serializability: each transaction can pretend it's the only transaction running. The database ensures that when transactions commit, the result is the same as if they had run serially (one after another).

RACE CONDITION WITHOUT ISOLATION
Counter = 42 (in database)
User 1
Read: 42
Compute: 42 + 1
Write: 43
User 2
Read: 42
Compute: 42 + 1
Write: 43
Counter = 43 (should be 44!)

The reality: Serializability has a performance cost. In practice, many databases use weaker isolation levels that allow concurrent transactions to interfere in limited ways:

  • Oracle's "serializable" actually implements snapshot isolation (weaker than true serializability)
  • MySQL, PostgreSQL offer multiple isolation levels, not all serializable
  • Some race conditions can still occur with weak isolation

We'll explore these isolation levels and their trade-offs in detail in Section 3.
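
Most SQL databases let you choose the isolation level per transaction or per session. A hedged sketch using PostgreSQL-style syntax (keywords and defaults differ between systems):

-- Per transaction:
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- ... reads and writes ...
COMMIT;

-- Per session (PostgreSQL):
SET default_transaction_isolation = 'repeatable read';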

2.2.4. Durability

In plain English: Durability is the promise that once you've successfully saved your data, it won't disappear—even if the power goes out, the server crashes, or the disk fails. Think of it like writing something in permanent ink versus a whiteboard.

In technical terms: Once a transaction has committed successfully, any data it has written will not be forgotten, even if there is a hardware fault or the database crashes.

How it's implemented:

DURABILITY IMPLEMENTATIONS
💾
Single-Node Database
Data written to nonvolatile storage (HDD/SSD)
  • Write-ahead log (WAL)
  • fsync() to force disk write
  • Recovery from log after crash
📦
Replicated Database
Data copied to multiple nodes
  • Replication to N nodes
  • Wait for confirmations
  • Survive node failures
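
As one example of these knobs, PostgreSQL lets you control how much confirmation a COMMIT waits for; a hedged sketch (setting names and values are PostgreSQL-specific):

-- Wait for the write-ahead log to be flushed to local disk before
-- acknowledging COMMIT (the default):
SET synchronous_commit = 'on';

-- In a replicated setup, additionally wait for configured standbys to
-- confirm (requires synchronous_standby_names on the server):
SET synchronous_commit = 'remote_apply';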

The harsh reality—nothing is perfect:

💡 Insight

Perfect durability doesn't exist. There are only risk-reduction techniques: writing to disk, replicating to remote machines, backups. Each has flaws:

  • Disk writes: SSDs can violate fsync guarantees, firmware bugs, bad sectors (30-80% of SSDs develop bad blocks in 4 years)
  • Replication: Correlated faults can take down all replicas, async replication loses recent writes
  • Backups: Gradual corruption might affect all backups

Best practice: Use all techniques together—write to disk AND replicate AND backup. Take theoretical "guarantees" with a grain of salt.

Durability evolution:

Era          | Implementation | Trade-offs
1970s        | Archive tape   | Reliable but slow to restore
1990s-2000s  | Disk/SSD       | Fast but single point of failure
2010s+       | Replication    | Available but complex failure modes

2.3. Single-Object and Multi-Object Operations

In plain English: Sometimes you need to update multiple related pieces of data at once. Imagine an email app that shows "3 unread messages" in a counter. When a new email arrives, you need to: (1) insert the email, and (2) increment the counter. If only one happens, the app shows inconsistent data.

In technical terms: Multi-object transactions are needed when several pieces of data must be kept in sync. ACID's atomicity and isolation describe what happens when a transaction modifies multiple objects (rows, documents, records).

MULTI-OBJECT TRANSACTION EXAMPLE
Transaction Start
emails table
100 rows
unread_count = 3
New email arrives
Transaction End
emails table
101 rows
unread_count = 4

Without proper isolation:

DIRTY READ ANOMALY
1
Transaction 1: Start
Begin inserting new email
2
Transaction 1: Insert Email
Email added, but counter not yet updated
3
Transaction 2: Read Data
Sees new email but old counter (dirty read!)
4
Transaction 1: Update Counter
Finally increments counter
5
Transaction 1: Commit
Now consistent, but Transaction 2 saw inconsistent state

User 2 sees: Unread counter shows 3, but mailbox shows 4 unread emails. This violates isolation.

If an error occurs:

ATOMICITY VIOLATION
Start transaction
✓ Insert email
✗ Network timeout updating counter
Without atomicity: email exists, counter wrong
With atomicity: email rolled back, nothing changed

How transactions are typically grouped:

In relational databases, everything between BEGIN TRANSACTION and COMMIT on a TCP connection belongs to the same transaction. If the TCP connection is interrupted, the transaction must be aborted.

In nonrelational databases: Many don't have a way of grouping operations. Even if there's a multi-put API, it may not have transaction semantics—some keys might succeed while others fail, leaving partial state.

2.3.1. Single-object writes

In plain English: Even writing a single document can fail partway through. What if you're saving a 20 KB JSON file and the network dies after 10 KB? Without atomicity, you'd get corrupted half-written data.

In technical terms: Storage engines almost universally provide atomicity and isolation at the single-object level (one key-value pair on one node):

  • Atomicity: Implemented using a write-ahead log for crash recovery
  • Isolation: Implemented using a lock on each object (one thread accesses at a time)

Useful single-object operations:

SINGLE-OBJECT OPERATIONS
Atomic Increment
UPDATE counters SET value = value + 1
  • No read-modify-write cycle
  • Database handles atomically
  • Prevents lost updates
Compare-and-Set (CAS)
Write only if value unchanged
  • Conditional write operation
  • Detects concurrent modifications
  • Similar to CPU CAS instructions

Important limitation: These single-object operations are useful but are NOT transactions in the traditional sense:

  • Cassandra/ScyllaDB "lightweight transactions": Linearizable reads and conditional writes on single object only
  • Aerospike "strong consistency": Single-object guarantees, no cross-object guarantees

2.3.2. The need for multi-object transactions

When you need multi-object transactions:

USE CASES FOR MULTI-OBJECT TRANSACTIONS
🔗
Foreign Keys
Relational data model
  • Row in table A references row in table B
  • Must keep references valid
  • Inserting related records must be atomic
📄
Denormalization
Document databases without joins
  • Same data stored in multiple places
  • Updates must hit all copies
  • Prevent denormalized data from diverging
📇
Secondary Indexes
Almost all databases
  • Indexes are separate database objects
  • Must update with every value change
  • Without isolation: record in one index, missing from another

💡 Insight

Applications CAN be implemented without multi-object transactions, but error handling becomes vastly more complicated without atomicity, and the lack of isolation causes concurrency problems. Many developers underestimate this complexity until production issues arise.

2.3.3. Handling errors and aborts

In plain English: A key feature of transactions is they can be aborted and safely retried if an error occurs. This is like an "undo" button for database operations.

In technical terms: ACID databases follow the philosophy: if the database is in danger of violating atomicity, isolation, or durability, abandon the transaction entirely rather than allow it to remain half-finished.

Not all systems follow this philosophy:

  • Leaderless replication datastores work on "best effort" basis
  • Won't undo what's already done if an error occurs
  • Application's responsibility to recover

The retry problem: Although retrying is simple and effective, it isn't perfect:

RETRY CHALLENGES
📡
Network Acknowledgment Lost
Transaction succeeded but client didn't know
  • Client times out waiting for ACK
  • Retries, executing twice
  • Need deduplication mechanism
🔥
Overload
Retrying makes problem worse
  • Error due to high contention
  • Retry adds more load
  • Use exponential backoff
⚠️
Transient vs Permanent Errors
Only retry transient errors
  • Deadlock: worth retrying
  • Constraint violation: pointless to retry
  • Distinguish error types
📧
Side Effects
Outside database actions
  • Sending email
  • Calling external API
  • Need two-phase commit for atomicity

ORM framework problem: Popular ORMs like Rails ActiveRecord and Django don't retry aborted transactions—exceptions bubble up and user input is lost. This defeats the whole point of aborts!


3. Weak Isolation Levels

In plain English: Perfect isolation (serializability) means transactions can't interfere with each other at all—like each transaction runs alone on the database. But this is expensive, so most databases use "weaker" isolation that allows some interference. Understanding these levels is crucial because they can cause subtle bugs.

In technical terms: If two transactions don't touch the same data, or both are read-only, they can safely run in parallel. Concurrency issues only arise when one transaction reads data being modified by another, or when transactions modify the same data simultaneously.

Why this matters:

💡 Insight

Concurrency bugs are hard to find by testing because they only trigger with unlucky timing—they're rare and difficult to reproduce. Weak isolation levels have caused substantial financial losses, corrupted customer data, and led to regulatory investigations. Even "ACID" databases often use weak isolation by default!

Concurrency control is relevant for security too: an attacker might deliberately send bursts of concurrent requests to exploit race conditions. Building reliable and secure applications requires systematically preventing these bugs.

3.1. Read Committed

The most basic isolation level. Read committed makes two guarantees:

  1. No dirty reads: You only see committed data when reading
  2. No dirty writes: You only overwrite committed data when writing

Some databases support an even weaker level called read uncommitted (prevents dirty writes but allows dirty reads). Let's explore these guarantees:

3.1.1. No dirty reads

In plain English: Imagine you're transferring money between accounts. Without dirty read prevention, someone else could see the money deducted from your account but not yet added to the recipient's account—it looks like money vanished. Dirty read prevention ensures you only see committed states.

In technical terms: A dirty read occurs when a transaction reads data written by another transaction that hasn't yet committed. Read committed prevents this: writes only become visible to others when the transaction commits (and then all writes become visible at once).

PREVENTING DIRTY READS
Initial: x = 2
User 1
Begin transaction
Set x = 3
Not yet committed
User 2
Begin transaction
Get x
Returns: 2 (old value)
User 1 commits → Now everyone sees x = 3

Why prevent dirty reads:

  1. Partial updates visible: Transaction updates several rows; another transaction sees some but not others (like the email counter example)
  2. Reading data that gets rolled back: If a transaction aborts, any reads of its uncommitted data would also need to be aborted (cascading aborts)
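
A hedged two-session sketch of the same guarantee in SQL, reusing the hypothetical accounts table (comments mark which session issues each statement):

-- Session 1:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
-- not yet committed

-- Session 2 (read committed), meanwhile:
SELECT balance FROM accounts WHERE account_id = 1;
-- returns the old, committed balance: no dirty read

-- Session 1:
COMMIT;

-- Session 2:
SELECT balance FROM accounts WHERE account_id = 1;
-- now returns the new balance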

3.1.2. No dirty writes

In plain English: Imagine two people trying to buy the same car on a website simultaneously. Without dirty write prevention, one person might win the sale but the other person might get the invoice—a nonsensical outcome.

In technical terms: A dirty write occurs when one transaction overwrites an uncommitted value written by another transaction. Read committed prevents this, usually by delaying the second write until the first transaction commits or aborts.

DIRTY WRITE EXAMPLE: CAR PURCHASE
Available: Car #1234
Aaliyah
Start buying car
Update listings
buyer = Aaliyah
Bryce
Start buying car
Update listings
buyer = Bryce
Update invoices
(invoice to Aaliyah)
Update invoices
(invoice to Bryce)
Without dirty write prevention: Sale to Bryce, invoice to Aaliyah!
With dirty write prevention: One person waits, only one completes

Important note: Read committed does NOT prevent all race conditions. The counter increment problem illustrated earlier in the Isolation section can still occur, because the second write happens after the first transaction commits; it's not a dirty write, but it's still incorrect. We'll discuss this in "Preventing Lost Updates."

3.1.3. Implementing read committed

Read committed is very popular—it's the default in Oracle, PostgreSQL, SQL Server, and many others.

Preventing dirty writes:

Most databases use row-level locks:

  • Transaction must acquire lock before modifying a row
  • Hold lock until commit or abort
  • Only one transaction can hold lock per row
  • Other transactions must wait

This locking happens automatically in read committed mode.

Preventing dirty reads—two approaches:

APPROACHES TO PREVENT DIRTY READS

             | Approach 1: Read Locks              | Approach 2: Remember Old Values
Mechanism    | Acquire lock briefly to read        | Store both old and new values
Concurrency  | One long write blocks many reads    | Readers never blocked
Performance  | Poor for read-heavy workloads       | Excellent for read-heavy workloads
Used by      | IBM Db2, MS SQL Server (some modes) | PostgreSQL, Oracle, most modern DBs

Approach 2 (most common): For every row being written:

  • Remember both old committed value and new value
  • Transactions writing hold the write lock
  • Transactions reading get the old value
  • Switch to reading new value after commit

This is related to multi-version concurrency control (MVCC), which we'll explore in depth next.

💡 Insight

The reason most databases prefer Approach 2 (remembering old values) is that with read locks, one long-running write transaction would force many read-only transactions to wait, freezing reads (including analytics queries and backups) for its entire duration. That is unacceptable for production systems.

3.2. Snapshot Isolation and Repeatable Read

In plain English: Imagine checking your bank account balance across two accounts. You see Account A has $500, but by the time you check Account B, a transfer has moved $100 from B to A. Account B now shows $400, so your total appears to be $900 instead of $1,000. The money seems to have vanished!

In technical terms: Read committed prevents dirty reads but allows read skew (also called nonrepeatable read): if you read the same data twice in a transaction, you might get different results. Some situations can't tolerate this temporary inconsistency.

READ SKEW EXAMPLE
1
Initial State
Account 1: $500, Account 2: $500
2
Aaliyah Reads Account 1
Sees $500
3
Transfer Transaction Starts
Moving $100 from Account 2 to Account 1
4
Transfer Commits
Account 1: $600, Account 2: $400
5
Aaliyah Reads Account 2
Sees $400 (new value)
6
Aaliyah's View
Total appears to be $900—$100 vanished!

When read skew is unacceptable:

SITUATIONS REQUIRING SNAPSHOT ISOLATION
💾
Backups
Taking backups may take hours
  • Writes continue during backup
  • Some parts contain old version
  • Other parts contain new version
  • Restoring gives permanent inconsistency
📊
Analytics Queries
Queries scanning large portions of database
  • Common in data warehouses
  • Integrity checks scanning all data
  • Return nonsensical results if observing different time points

The solution: Snapshot Isolation

In plain English: Imagine taking a photograph of your database. That photo captures everything at one specific moment. Even if things change in real life after you take the photo, the photo itself never changes. Snapshot isolation gives each transaction its own "photograph" of the database.

In technical terms: Each transaction reads from a consistent snapshot of the database—the transaction sees all data that was committed at the start of the transaction. Even if data is subsequently changed by other transactions, each transaction sees only the old data from its particular point in time.

Why it matters: Long-running read-only queries (backups, analytics) are much easier to reason about when operating on a consistent snapshot frozen at a particular moment.

Popularity: Supported by PostgreSQL, MySQL (InnoDB), Oracle, SQL Server. Some databases (Oracle, TiDB, Aurora DSQL) even use snapshot isolation as their highest isolation level.
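
In PostgreSQL, for example, a long-running read can request a consistent snapshot explicitly; a minimal sketch using the hypothetical accounts table:

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;

-- Both queries see the database as of the moment the snapshot was taken,
-- even if other transactions commit transfers in between.
SELECT balance FROM accounts WHERE account_id = 1;
SELECT balance FROM accounts WHERE account_id = 2;

COMMIT;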

3.2.1. Multi-version concurrency control (MVCC)

In plain English: Instead of overwriting data in place, the database keeps multiple versions side-by-side. When you read, you see the version that was current when your transaction started. When you write, you create a new version. Old versions stick around for transactions that started earlier.

In technical terms: To implement snapshot isolation, databases keep several different committed versions of each row because various in-progress transactions may need to see the database at different points in time. This technique is called multi-version concurrency control (MVCC).

Key principle: Readers never block writers, and writers never block readers. This allows long-running read queries on a consistent snapshot while processing writes normally, with no lock contention.

MVCC IMPLEMENTATION (PostgreSQL Example)
Each transaction gets unique, increasing transaction ID (txid)
Table Structure
account_id
balance
inserted_by (txid)
deleted_by (txid)
Example Versions
Account 2, $500, txid=12, deleted=13
Account 2, $400, txid=13, deleted=NULL
Update = Delete + Insert (both tagged with same txid)

How MVCC works:

  1. Each transaction gets a unique, always-increasing ID when it starts
  2. Every write is tagged with the transaction ID of the writer
  3. Each row has inserted_by field: ID of transaction that inserted this row
  4. Each row has deleted_by field: Initially empty; set to deleting transaction's ID
  5. Deletes don't remove rows: Just mark for deletion; garbage collection removes later
  6. Updates are delete + insert: Old row marked deleted, new row inserted

MVCC EXAMPLE: TRANSACTION 13 UPDATES ACCOUNT
Before
Row: Account 2
balance = $500
inserted_by = 12
deleted_by = NULL
Transaction 13: Deduct $100
After
Old row (marked deleted)
balance = $500
inserted_by = 12
deleted_by = 13
New row (inserted)
balance = $400
inserted_by = 13
deleted_by = NULL

Storage: All versions stored in same database heap (data structure), regardless of commit status. Versions of same row form a linked list for easy iteration.
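
In PostgreSQL you can peek at this version metadata through the hidden xmin and xmax system columns; a diagnostic sketch (not something application code should rely on):

SELECT xmin, xmax, account_id, balance
FROM accounts
WHERE account_id = 2;

-- xmin: ID of the transaction that inserted this row version
-- xmax: 0 for a live version, otherwise the ID of the transaction that
--       deleted or updated it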

3.2.2. Visibility rules for observing a consistent snapshot

In plain English: When your transaction reads data, the database uses special rules to decide which version of each row you're allowed to see. These rules ensure you see a consistent snapshot—as if you took a photo at the moment your transaction started.

In technical terms: The database uses transaction IDs and visibility rules to present a consistent snapshot:

VISIBILITY RULES
1
Rule 1: Ignore In-Progress Transactions
At transaction start, list all other in-progress transactions. Ignore their writes, even if they later commit.
2
Rule 2: Ignore Future Transactions
Ignore writes from transactions with higher IDs (started after current transaction).
3
Rule 3: Ignore Aborted Transactions
Ignore all writes from aborted transactions, regardless of when they aborted.
4
Rule 4: Everything Else Is Visible
All other writes are visible to the transaction's queries.

Simplified visibility check for a row:

A row is visible if BOTH conditions are true:

  1. At transaction start time: The transaction that inserted the row had already committed
  2. Not deleted: The row is not marked for deletion, or, if it is, the transaction that requested the deletion had not yet committed when the reader's transaction started

Example from our MVCC diagram:

  • Transaction 12 reading Account 2 sees balance of $500
  • Why? Transaction 13's deletion hasn't committed yet (from Transaction 12's perspective)
  • The $400 balance inserted by Transaction 13 is also invisible (Transaction 13 hasn't committed)

💡 Insight

Long-running transactions can continue using old snapshots for extended periods, reading "stale" data that other transactions have long since overwritten. By never updating values in place but creating new versions instead, the database provides consistent snapshots with minimal overhead.

3.2.3. Indexes and snapshot isolation

The challenge: How do indexes work when there are multiple versions of each row?

Most common approach:

INDEX STRATEGIES FOR MVCC
Index entry points to ONE version (oldest or newest)
Row version contains reference to next version
Query follows chain to find visible version
Garbage collection removes old versions and index entries

Optimizations:

  • PostgreSQL: Avoid index updates if different versions fit on same page
  • Some databases: Store only diffs between versions instead of full copies

Alternative approach (CouchDB, Datomic, LMDB):

Use immutable (copy-on-write) B-trees:

  • Don't overwrite pages when updated
  • Create new copy of each modified page
  • Parent pages copied and updated to point to new children
  • Every write creates a new B-tree root
  • Each root is a consistent snapshot at a point in time
  • No need to filter based on transaction IDs
  • Requires background compaction and garbage collection

3.2.4. Snapshot isolation, repeatable read, and naming confusion

The problem: Different databases use different terms for the same thing, and the same terms for different things!

NAMING CONFUSION
PostgreSQL
Calls snapshot isolation 'repeatable read'
  • Meets SQL standard requirements
  • Claims standards compliance
  • Actually implements snapshot isolation
Oracle
Calls snapshot isolation 'serializable'
  • Confusing terminology
  • NOT true serializability
  • Actually snapshot isolation
MySQL
'Repeatable read' = MVCC (weaker than snapshot isolation)
  • Different from PostgreSQL
  • Same term, different meaning
  • Weaker consistency than snapshot isolation
IBM Db2
'Repeatable read' = serializability
  • Completely different meaning
  • Strongest guarantee
  • Same term as PostgreSQL/MySQL!

Why the confusion?

The SQL standard doesn't have the concept of snapshot isolation: its isolation levels are based on System R's 1975 definitions, and snapshot isolation hadn't yet been invented then. The standard instead defines "repeatable read," which looks superficially similar. The standard's definition is flawed: ambiguous, imprecise, and not implementation-independent.

Result: Nobody really knows what "repeatable read" means!

💡 Insight

Don't rely on isolation level names alone. Always check your database's documentation for what guarantees it actually provides. The SQL standard's isolation level definitions are fundamentally broken—they don't match how modern databases actually work.

3.3. Preventing Lost Updates

In plain English: The lost update problem happens when two people edit the same thing at once. Both read the current value, make their changes, and write back. The second write "clobbers" the first one, losing those changes. It's like two people editing a Google Doc without real-time sync—the last one to save overwrites the other's work.

In technical terms: The lost update problem occurs when an application reads a value, modifies it, and writes back the modified value (a read-modify-write cycle). If two transactions do this concurrently, one modification can be lost because the second write doesn't include the first modification.

LOST UPDATE PROBLEM
Transaction 1
Read counter: 42
Compute: 42 + 1 = 43
Write: 43
Transaction 2
Read counter: 42
Compute: 42 + 1 = 43
Write: 43
Result: 43 (should be 44—one update lost!)

Common scenarios:

  1. Incrementing a counter or updating an account balance
  2. Making local changes to complex values (e.g., adding element to JSON array)
  3. Wiki page editing: Two users edit simultaneously, each sends full page content

A variety of solutions have been developed:

3.3.1. Atomic write operations

In plain English: Instead of reading the current value, modifying it in your application code, and writing it back, use a special database command that does the whole operation atomically in one step.

In technical terms: Many databases provide atomic update operations that remove the need to implement read-modify-write cycles in application code. They're usually the best solution if your code can be expressed using them.

ATOMIC OPERATIONS
SQL Databases
Atomic increment/update
  • UPDATE counters SET value = value + 1 WHERE key = 'foo'
  • Entire operation is concurrency-safe
  • Database handles locking internally
Document Databases
Atomic JSON modifications
  • MongoDB: atomic ops on JSON parts
  • Update nested fields atomically
  • No read-modify-write cycle needed
Redis
Atomic data structure operations
  • Priority queues
  • Lists, sets, sorted sets
  • All operations atomic

Implementation: Usually implemented by taking an exclusive lock on the object when read, or forcing all atomic operations on a single thread.

ORM framework danger: ORMs make it easy to accidentally write unsafe read-modify-write cycles instead of using atomic operations—a source of subtle bugs difficult to find by testing.

3.3.2. Explicit locking

In plain English: If atomic operations aren't flexible enough, manually lock the rows you're about to update. Other transactions trying to update the same rows must wait for your lock to be released.

In technical terms: The application explicitly locks objects that will be updated, performs a read-modify-write cycle, and any concurrent transaction trying to access the same object is forced to wait.

Example: Multiplayer game

BEGIN TRANSACTION;

-- Lock the game piece
SELECT * FROM figures
WHERE name = 'robot' AND game_id = 222
FOR UPDATE; -- Tells database to lock these rows

-- Check if move is valid (custom game logic)
-- Then update the position
UPDATE figures SET position = 'c4' WHERE id = 1234;

COMMIT;

EXPLICIT LOCKING EXAMPLE
1
Player 1: SELECT...FOR UPDATE
Acquires lock on robot piece
2
Player 2: Tries to Move Same Piece
Blocked waiting for Player 1's lock
3
Player 1: Validates Move
Checks game rules in application code
4
Player 1: UPDATE and COMMIT
Releases lock
5
Player 2: Acquires Lock
Now can proceed with their move

Challenges:

  1. Easy to forget locks somewhere in code → race condition
  2. Deadlock risk if locking multiple objects (databases auto-detect and abort one transaction)
  3. Must retry aborted transactions at application level

3.3.3. Automatically detecting lost updates

In plain English: Let transactions run in parallel and do their read-modify-write cycles. If the database detects a lost update, automatically abort the transaction and force it to retry. You don't need special code—the database just handles it.

In technical terms: Allow transactions to execute read-modify-write cycles in parallel. If the transaction manager detects a lost update, abort the transaction and force retry. Databases can perform this check efficiently with snapshot isolation.

AUTOMATIC LOST UPDATE DETECTION

Database      | Isolation level     | Detects lost updates?
PostgreSQL    | repeatable read     | ✓ Yes
Oracle        | serializable        | ✓ Yes
SQL Server    | snapshot isolation  | ✓ Yes
MySQL/InnoDB  | repeatable read     | ✗ No

Advantages:

  • No special database features required in application code
  • Less error-prone than forgetting locks
  • Happens automatically

Disadvantage: Must retry aborted transactions at application level
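
A hedged sketch of what detection looks like in PostgreSQL at the repeatable read level, reusing the counters table from earlier (the exact error text may vary):

-- Session 1:
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
UPDATE counters SET value = value + 1 WHERE key = 'foo';

-- Session 2:
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
UPDATE counters SET value = value + 1 WHERE key = 'foo';  -- blocks...

-- Session 1:
COMMIT;

-- Session 2's UPDATE now fails with an error like
--   ERROR: could not serialize access due to concurrent update
-- and the whole transaction must be rolled back and retried.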

💡 Insight

Some argue a database must prevent lost updates to qualify as providing snapshot isolation, so by this definition MySQL doesn't provide true snapshot isolation. Lost update detection is a great feature because it's automatic and doesn't require special application code.

3.3.4. Conditional writes (compare-and-set)

In plain English: Update the value only if it hasn't changed since you last read it. If someone else changed it in the meantime, your update fails and you need to retry. It's like saying "I'll buy this if the price is still $10" instead of blindly buying regardless of current price.

In technical terms: In databases without transactions, a conditional write operation prevents lost updates by allowing an update only if the value hasn't changed since last read. If the current value doesn't match what you previously read, the update has no effect and the read-modify-write cycle must be retried.

Example: Preventing concurrent wiki edits

-- May or may not be safe, depending on database implementation
UPDATE wiki_pages
SET content = 'new content'
WHERE id = 1234
AND content = 'old content'; -- Only update if content unchanged

Better approach with version numbers:

UPDATE wiki_pages
SET content = 'new content',
version = version + 1
WHERE id = 1234
AND version = 42; -- Only update if version still 42

COMPARE-AND-SET OPERATION
1
Read
Get current value and version: (content='old', version=42)
2
Modify
User edits content in application
3
Write with Condition
UPDATE WHERE version=42
4
Success or Failure
Success if version still 42, failure if someone else updated

Sometimes called "optimistic locking" (as opposed to pessimistic locking with SELECT FOR UPDATE).

MVCC caveat: In MVCC databases, there's often an exception to visibility rules for compare-and-set: values written by other transactions may be visible to the WHERE clause of UPDATE/DELETE, even if not otherwise visible in the snapshot.

3.3.5. Conflict resolution and replication

In plain English: In replicated databases where data can be modified on multiple servers simultaneously, preventing lost updates becomes even harder. You can't use locks or conditional writes because there's no single "main copy" of the data—each server has its own copy that can be modified independently.

In technical terms: In replicated databases with multi-leader or leaderless replication, concurrent writes can happen on different nodes and replicate asynchronously. Locks and conditional writes don't apply because there's no guarantee of a single up-to-date copy.

Alternative approaches:

CONFLICT RESOLUTION IN REPLICATION
🔄
Commutative Operations (CRDTs)
Operations that can be applied in any order
  • Incrementing a counter
  • Adding element to a set
  • Can be safely replicated
  • Same result regardless of order
🌿
Multi-version Conflict Resolution
Create conflicting versions (siblings)
  • Allow concurrent writes
  • Create multiple versions
  • Application merges after the fact
  • Complex but flexible
⚠️
Last Write Wins (LWW)
Default in many systems (problematic!)
  • Simple but loses updates
  • Based on timestamp
  • Prone to lost updates
  • Default in many replicated DBs

💡 Insight

Last Write Wins (LWW) is the default in many replicated databases, but it's prone to lost updates. If you're using multi-leader or leaderless replication, check whether your database uses LWW and consider whether that's acceptable for your use case.

3.4. Write Skew and Phantoms

In plain English: Write skew is a subtle race condition where two transactions read the same data, make decisions based on what they read, then each write to different places. Neither transaction sees the other's write, leading to a violated constraint. It's like two doctors both checking "are there 2 doctors on-call?" and both seeing "yes," then both going off-call, leaving zero doctors.

In technical terms: Write skew occurs when two transactions read the same objects, then update some of those objects (different transactions may update different objects). It's a generalization of the lost update problem. In the special case where transactions update the same object, you get a dirty write or lost update; when they update different objects, you get write skew.

3.4.1. Characterizing write skew

Classic example: Hospital on-call scheduling

WRITE SKEW: ON-CALL DOCTORS
Initial: Aaliyah and Bryce both on-call
Aaliyah's Transaction
SELECT count from doctors
WHERE on_call = true
Result: 2
Check: 2 ≥ 1? ✓
UPDATE doctors
SET on_call = false
WHERE name = 'Aaliyah'
Bryce's Transaction
SELECT count from doctors
WHERE on_call = true
Result: 2
Check: 2 ≥ 1? ✓
UPDATE doctors
SET on_call = false
WHERE name = 'Bryce'
Result: 0 doctors on-call (constraint violated!)

Why it happens:

  1. Both transactions check the same condition (two doctors on-call)
  2. Both see the database in the same state (snapshot isolation)
  3. Both make different updates (to their own records)
  4. Constraint requires ≥1 doctor on-call, but result is 0

Why it's subtle:

  • Not a dirty write (different objects updated)
  • Not a lost update (different objects updated)
  • Only obvious in hindsight: if transactions ran sequentially, the second doctor would be prevented from going off-call

How write skew generalizes lost updates:

WRITE-WRITE CONFLICT SPECTRUM
1
Lost Update
Both transactions update SAME object
2
Write Skew
Transactions read same data, update DIFFERENT objects
3
Phantom
Write changes result of search query

Options for preventing write skew are limited:

PREVENTION STRATEGIES

Strategy                | Works? | Notes
Atomic operations       | ✗ No   | Multiple objects are involved
Automatic detection     | ✗ No   | Not provided by PostgreSQL, MySQL, Oracle, or SQL Server
Constraints             | ✗ No   | Most DBs don't support multi-object constraints
Serializable isolation  | ✓ Yes  | True serializability prevents it
Explicit locks          | ✓ Yes  | SELECT...FOR UPDATE on the rows the decision depends on

Using explicit locks to prevent write skew:

BEGIN TRANSACTION;

-- Lock all doctors on-call in this shift
SELECT * FROM doctors
WHERE on_call = true
AND shift_id = 1234
FOR UPDATE;

-- Now safe to update
UPDATE doctors
SET on_call = false
WHERE name = 'Aaliyah'
AND shift_id = 1234;

COMMIT;

3.4.2. More examples of write skew

WRITE SKEW SCENARIOS
🏢
Meeting Room Booking
Preventing double-booking
  • Check for overlapping bookings
  • If none found, create booking
  • Two transactions can both see no conflict
  • Both insert, creating double-booking
🎮
Multiplayer Game
Moving pieces on a board
  • Lock prevents moving same piece twice
  • Doesn't prevent moving to same position
  • Two players move different pieces to same spot
  • Violates game rules
👤
Username Registration
Claiming unique usernames
  • Check if username taken
  • If not, create account
  • Two users can both see name available
  • Solution: unique constraint (see the sketch after this list)
💰
Preventing Double-Spending
Account balance checks
  • Insert spending item
  • Check total balance is positive
  • Two concurrent spends can both pass check
  • Balance goes negative
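
For the username case, a uniqueness constraint lets the database enforce the invariant itself, turning the race into a clean error on the second insert. A minimal sketch, assuming a hypothetical users table:

ALTER TABLE users ADD CONSTRAINT users_username_unique UNIQUE (username);

-- If two registrations race on the same name, the first INSERT succeeds and
-- the second fails with a unique-constraint violation that can be reported
-- to the user or retried with a different name.
INSERT INTO users (username, email) VALUES ('aaliyah', 'a@example.com');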

3.4.3. Phantoms causing write skew

In plain English: A phantom is when one transaction's write changes the result of another transaction's search query. The searched-for data is like a ghost—it wasn't there when you looked, but appears (or disappears) when another transaction writes.

In technical terms: All write skew examples follow a similar pattern:

WRITE SKEW PATTERN
1
Step 1: SELECT
Check if requirement is satisfied (e.g., ≥2 doctors on-call, no room booking conflicts)
2
Step 2: Decide
Based on query result, decide whether to proceed
3
Step 3: Write
INSERT, UPDATE, or DELETE and commit
4
Step 4: Precondition Changed
The write changes the precondition—if you repeated the SELECT, you'd get a different result

The phantom problem:

In the doctor example, we could lock the rows in step 1 (SELECT...FOR UPDATE) to prevent write skew. But in other examples (meeting room booking, username registration, double-spending), the query checks for the absence of rows. SELECT...FOR UPDATE can't lock something that doesn't exist!

Example: Meeting room booking

BEGIN TRANSACTION;

-- Check for conflicting bookings (returns zero rows if room is free)
SELECT COUNT(*) FROM bookings
WHERE room_id = 123
AND end_time > '2025-01-01 12:00'
AND start_time < '2025-01-01 13:00';

-- If zero, proceed to book
INSERT INTO bookings
(room_id, start_time, end_time, user_id)
VALUES (123, '2025-01-01 12:00', '2025-01-01 13:00', 666);

COMMIT;

The problem: SELECT returns zero rows, so FOR UPDATE has nothing to lock. Two concurrent transactions can both see zero conflicts and both insert bookings.

💡 Insight

Phantoms are particularly tricky because they involve data that doesn't exist yet. You can't lock something that isn't there. This is why even snapshot isolation, which prevents phantoms in read-only queries, struggles with phantoms in read-write transactions.

3.4.4. Materializing conflicts

In plain English: If the problem is "there's nothing to lock," create something to lock! For meeting rooms, create a table with rows for every room and time slot (e.g., 15-minute intervals for the next 6 months). Now you can lock the specific rows you need.

In technical terms: Materializing conflicts means taking a phantom and turning it into a lock conflict on a concrete set of rows that exist in the database.

MATERIALIZING CONFLICTS FOR MEETING ROOMS
1
Create Lock Table
Table with all room-time combinations (e.g., Room 123, 12:00-12:15, 12:15-12:30, etc.)
2
Lock Relevant Rows
SELECT...FOR UPDATE on rows for desired room and time period
3
Check Conflicts
After acquiring locks, check for overlapping bookings
4
Insert Booking
Create booking if no conflicts
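
A hedged sketch of these steps for the booking example, assuming a pre-populated room_time_slots table with one row per room and 15-minute slot:

BEGIN TRANSACTION;

-- Steps 1-2: lock the concrete slot rows covering the desired period
SELECT * FROM room_time_slots
WHERE room_id = 123
AND slot_start >= '2025-01-01 12:00'
AND slot_start <  '2025-01-01 13:00'
FOR UPDATE;

-- Step 3: with the locks held, check for overlapping bookings
SELECT COUNT(*) FROM bookings
WHERE room_id = 123
AND end_time > '2025-01-01 12:00'
AND start_time < '2025-01-01 13:00';

-- Step 4: if the count is zero, insert the booking
INSERT INTO bookings (room_id, start_time, end_time, user_id)
VALUES (123, '2025-01-01 12:00', '2025-01-01 13:00', 666);

COMMIT;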

Problems with materializing conflicts:

  1. Hard to figure out how to materialize them correctly
  2. Error-prone to implement
  3. Ugly because concurrency control leaks into application data model
  4. Last resort when no alternative is possible

Better solution: Serializable isolation level (much preferable in most cases)


4. Serializability

In plain English: We've seen transactions are prone to many race conditions—some prevented by read committed and snapshot isolation, but others (like write skew and phantoms) are not. Wouldn't it be great if the database just prevented all race conditions automatically? That's serializability: the strongest isolation level.

In technical terms: Serializable isolation guarantees that even though transactions may execute in parallel, the end result is the same as if they had executed one at a time, serially, without any concurrency. The database prevents all possible race conditions.

The frustrating reality:

CHALLENGES WITH ISOLATION LEVELS
🤔
Hard to Understand
Inconsistently implemented across databases
  • 'Repeatable read' varies significantly
  • Snapshot isolation vs. serializability
  • Marketing terms vs. actual guarantees
🧪
Hard to Test
Difficult to verify safety
  • Can't tell if code is safe at a given level
  • Large apps: unaware of all concurrent access
  • No good tools to detect race conditions
🎲
Nondeterministic
Problems only occur with unlucky timing
  • Testing doesn't reliably catch issues
  • Race conditions are rare in testing
  • Appear in production under load

Since the 1970s, researchers have had a simple answer: Use serializable isolation!

But if serializability is so much better, why isn't everyone using it?

To answer this, we need to examine the options for implementing serializability and how they perform. Three techniques:

SERIALIZABILITY IMPLEMENTATION APPROACHES
➡️
1. Actual Serial Execution
Literally execute transactions one at a time on a single thread
🔒
2. Two-Phase Locking (2PL)
For decades the only viable option; readers block writers and vice versa
3. Serializable Snapshot Isolation (SSI)
Optimistic approach; detect conflicts and abort instead of blocking

4.1. Actual Serial Execution

In plain English: The simplest way to avoid concurrency problems is to eliminate concurrency entirely—just execute one transaction at a time. No parallelism, no race conditions. It sounds naive, but with modern hardware and careful design, it actually works!

In technical terms: Execute only one transaction at a time, in serial order, on a single thread. By removing concurrency entirely, the resulting isolation is by definition serializable.

Why this seemed impossible until the 2000s:

For 30 years, multi-threaded concurrency was considered essential for performance. What changed?

DEVELOPMENTS ENABLING SERIAL EXECUTION
💾
RAM Became Cheap
Keep entire active dataset in memory
  • No waiting for disk I/O
  • Transactions execute much faster
  • Single-threaded throughput sufficient
OLTP Transactions Are Short
Transaction characteristics recognized
  • OLTP: short, small number of reads/writes
  • Analytics: read-only, can use snapshots
  • Long-running queries outside serial loop

Systems using this approach: VoltDB/H-Store, Redis, Datomic

Performance: Can sometimes perform better than multi-threaded systems by avoiding coordination overhead of locking. However, throughput is limited to a single CPU core.

4.1.1. Encapsulating transactions in stored procedures

The traditional approach problem:

TRADITIONAL VS. STORED PROCEDURE TRANSACTIONS

            | Interactive Client/Server                                   | Stored Procedures
Execution   | Query → Result → Think → Query → ...                        | Submit entire transaction code upfront
Network     | Many round-trips                                            | One round-trip
Concurrency | Must process multiple transactions to hide network latency | Can execute serially without performance penalty
Speed       | Slow due to network waits                                   | Fast if all data in memory

Traditional interactive transaction:

  1. Application makes query
  2. Waits for network round-trip
  3. Processes result
  4. Makes another query based on result
  5. Repeat...

Problem for serial execution: Database would spend most time waiting for network I/O. Throughput would be terrible.

Solution: Submit entire transaction code to database as a stored procedure.
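
As an illustration, here is a hedged sketch of the earlier money transfer written as a PostgreSQL stored procedure (CREATE PROCEDURE and CALL exist in PostgreSQL 11+; other systems use their own languages, discussed in the next subsection):

CREATE PROCEDURE transfer(from_id BIGINT, to_id BIGINT, amount NUMERIC)
LANGUAGE plpgsql
AS $$
BEGIN
    UPDATE accounts SET balance = balance - amount WHERE account_id = from_id;
    UPDATE accounts SET balance = balance + amount WHERE account_id = to_id;
END;
$$;

-- The whole transaction now runs inside the database in a single round-trip:
CALL transfer(1, 2, 100);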

INTERACTIVE VS STORED PROCEDURE EXECUTION
Interactive Transaction
Client: SELECT
network
DB: Result
network
Client: UPDATE
network
DB: OK
Multiple network round-trips
Stored Procedure
Client: Submit procedure
network
DB: Execute entirely
SELECT, UPDATE, etc.
network
DB: Final result
One network round-trip

4.1.2. Pros and cons of stored procedures

Stored procedures have a bad reputation:

STORED PROCEDURE PROBLEMS (TRADITIONAL)
🔧
Vendor-Specific Languages
  • Oracle: PL/SQL
  • SQL Server: T-SQL
  • PostgreSQL: PL/pgSQL
  • Archaic, ugly syntax
  • Lack modern language features
😰
Hard to Manage
  • Difficult to debug
  • Awkward version control
  • Tricky to test
  • Hard to integrate with monitoring
🔥
Performance Sensitive
  • Database shared by many app servers
  • Badly written procedure affects everyone
  • More trouble than app server code
🔒
Security Risk
  • In multitenant systems
  • Untrusted code in DB kernel
  • Security vulnerability

Modern implementations overcome these issues:

MODERN STORED PROCEDURE IMPLEMENTATIONS

  • VoltDB: Java or Groovy
  • Datomic: Java or Clojure
  • Redis: Lua scripting
  • MongoDB: JavaScript

Additional benefits:

  • GraphQL deployments: Validation logic can live in database stored procedures when proxy doesn't support complex logic
  • State machine replication: VoltDB executes same stored procedure on each replica instead of copying writes (requires deterministic execution)

4.1.3. Sharding

In plain English: A single CPU core can only process so many transactions per second. To scale beyond that, split your data across multiple CPU cores or machines, giving each its own independent transaction processing thread.

In technical terms: Single-threaded execution limits throughput to one CPU core's speed. To scale, shard your data so each transaction only reads/writes data within a single shard. Each shard has its own transaction processing thread running independently.

SHARDING FOR SERIAL EXECUTION
Single Shard Transactions
Shard A
CPU Core 1
Fast throughput
Single Shard Transactions
Shard B
CPU Core 2
Fast throughput
Cross-Shard Transaction
Shards A + B
Coordination needed
Much slower

Performance characteristics:

Transaction Type | Throughput          | Scalability
Single-shard     | Very high           | Linear with CPU cores
Cross-shard      | ~1,000/sec (VoltDB) | Does not scale with machines

Why cross-shard is slow: Must coordinate across all touched shards in lock-step to ensure serializability. Requires additional coordination overhead.

Whether this works depends on data structure:

  • Simple key-value data: Easily sharded
  • Data with secondary indexes: Requires lots of cross-shard coordination

💡 Insight

The success of serial execution depends heavily on whether your workload can be partitioned to avoid cross-shard transactions. For some applications this works great; for others, cross-shard coordination becomes a bottleneck that negates the benefits.

4.1.4. Summary of serial execution

Serial execution of transactions is viable within certain constraints:

REQUIREMENTS FOR SERIAL EXECUTION
Fast Transactions
Every transaction must be small and fast
  • One slow transaction stalls everything
  • Use stored procedures
  • All data in memory
💾
Active Dataset in Memory
Working set must fit in RAM
  • Rarely accessed data can be on disk
  • But accessing it would be very slow
  • Limits dataset size
✍️
Low Write Throughput
Single CPU core must handle load
  • Or transactions must be shardable
  • Without cross-shard coordination
  • Cross-shard throughput hard to scale

4.2. Two-Phase Locking (2PL)

In plain English: Two-phase locking is like a strict library where readers and writers can never be in the same room at the same time. If someone is reading a book, nobody can write in it. If someone is writing in a book, nobody can read it. Everyone has to wait their turn. This prevents all race conditions but can make things very slow.

In technical terms: For around 30 years, two-phase locking (2PL) was the only widely used algorithm for serializability in databases. It makes lock requirements much stronger than simple dirty write prevention: readers block writers AND writers block readers.

Important note: 2PL is NOT 2PC!

2PL vs 2PC - DIFFERENT THINGS!

          | Two-Phase Locking (2PL)  | Two-Phase Commit (2PC)
Provides  | Serializable isolation   | Atomic commit
Purpose   | Prevent race conditions  | Distributed transactions

2PL locking rules:

TWO-PHASE LOCKING RULES
1
Transaction A Reads
Transaction B wants to write → B must wait for A to commit/abort
2
Transaction A Writes
Transaction B wants to read → B must wait for A to commit/abort

Key difference from snapshot isolation:

Approach            | Mantra
Snapshot isolation  | Readers never block writers, writers never block readers
Two-phase locking   | Readers block writers, writers block readers

Benefit: Protects against all race conditions (lost updates, write skew, phantoms, etc.)

4.2.1. Implementation of two-phase locking

Used by: MySQL (InnoDB) serializable level, SQL Server serializable level, Db2 repeatable read level

Lock mechanism:

SHARED VS EXCLUSIVE LOCKS

             | Shared Mode (Read Lock)              | Exclusive Mode (Write Lock)
Who can hold | Multiple transactions simultaneously | Only one transaction
Acquired for | Reading an object                    | Writing an object
Blocks       | Exclusive locks only                 | All other locks (shared and exclusive)
Upgrade      | Can upgrade to exclusive             | N/A

Locking protocol:

  1. To read: Acquire shared lock (multiple transactions can hold simultaneously, but must wait if exclusive lock exists)
  2. To write: Acquire exclusive lock (no other locks can exist; must wait for all existing locks)
  3. Upgrade: Transaction can upgrade shared → exclusive lock
  4. Hold until end: After acquiring locks, must hold until commit or abort

Where "two-phase" comes from:

TWO PHASES OF 2PL
Phase 1: Growing
While transaction executes
Locks are acquired
Phase 2: Shrinking
At end of transaction
All locks are released

Deadlocks:

With many locks in use, deadlocks happen easily:

  • Transaction A waiting for B's lock
  • Transaction B waiting for A's lock
  • Database auto-detects and aborts one transaction
  • Application must retry aborted transaction

4.2.2. Performance of two-phase locking

Why 2PL has poor performance:

2PL PERFORMANCE PROBLEMS
🔒
Lock Overhead
Acquiring and releasing many locks
  • CPU overhead
  • Memory for lock tracking
  • Coordination between threads
🚦
Reduced Concurrency
Transactions waiting for each other
  • Any potential race condition causes waiting
  • One waits for other to complete
  • Throughput limited
📈
Unstable Latencies
Unpredictable response times
  • Very slow at high percentiles
  • One slow transaction blocks many others
  • Cascading delays
💥
Frequent Deadlocks
Much more common than with weak isolation
  • Aborted transactions must retry
  • Work done all over again
  • Significant wasted effort

Worst-case scenario example:

A transaction reading an entire table (backup, analytics, integrity check):

  1. Takes shared lock on entire table
  2. Must wait for all in-progress writes to complete
  3. While reading (potentially hours), all writes are blocked
  4. Database becomes unavailable for writes

💡 Insight

2PL's fundamental problem is that it's pessimistic—if anything might possibly go wrong, wait. This conservative approach guarantees safety but sacrifices performance. In high-contention workloads, the database can grind to a halt.

4.2.3. Predicate locks

The phantom problem (revisited):

In write skew examples, one transaction changes the results of another's search query. To prevent this, we need to lock not just existing rows, but all objects that match some search condition.

In plain English: Instead of locking specific rows (like "Account #123"), lock a pattern (like "all accounts with balance < $100"). This prevents other transactions from inserting, updating, or deleting anything that matches your pattern.

In technical terms: A predicate lock belongs to all objects that match some search condition, rather than a particular object.

Example: Meeting room booking

SELECT * FROM bookings
WHERE room_id = 123
AND end_time > '2025-01-01 12:00'
AND start_time < '2025-01-01 13:00';

Predicate lock works as follows:

PREDICATE LOCK MECHANISM

  1. Transaction A reads with a condition: it acquires a shared predicate lock on the search condition
  2. Transaction B wants to insert/update/delete: it must check whether the old or new value matches any existing predicate lock
  3. If a match is found: Transaction B must wait until A commits or aborts

Key property: Applies even to objects that don't yet exist in the database (phantoms)!

If two-phase locking includes predicate locks, the database prevents all forms of write skew and other race conditions → isolation becomes serializable.
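
A conceptual sketch of the check a predicate-lock manager has to perform: each predicate here is simply a Python function over a row, and all names are illustrative rather than any real database's internals. Note that a writer has to test both the old and the new version of a row against every active predicate:

```python
class PredicateLockTable:
    """Toy predicate locks: each lock is (txn id, predicate), where the predicate
    is a function row -> bool describing the search condition that was read."""

    def __init__(self):
        self.locks = []  # list of (txn_id, predicate)

    def lock_predicate(self, txn, predicate):
        self.locks.append((txn, predicate))

    def conflicting_readers(self, txn, old_row, new_row):
        """Return the transactions the writer must wait for before modifying a row."""
        conflicts = set()
        for reader, predicate in self.locks:
            if reader == txn:
                continue
            # An insert/update/delete conflicts if either the old or the new
            # version of the row matches someone else's predicate lock.
            if ((old_row is not None and predicate(old_row))
                    or (new_row is not None and predicate(new_row))):
                conflicts.add(reader)
        return conflicts

# Predicate corresponding to the room-booking query above:
overlaps_room_123 = lambda row: (row["room_id"] == 123
                                 and row["end_time"] > "2025-01-01 12:00"
                                 and row["start_time"] < "2025-01-01 13:00")
```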

4.2.4. Index-range locks

The problem with predicate locks: Performance is poor. Checking for matching locks becomes time-consuming with many active transactions.

The solution: Index-range locking (also called next-key locking)—a simplified approximation of predicate locking.

In plain English: Instead of locking exactly what you need, lock a slightly bigger range. It's safe because anything matching your exact condition will definitely be in the bigger range. It's like locking the entire bookshelf instead of just one book.

In technical terms: Safe to simplify a predicate by making it match a greater set of objects. Any write matching the original predicate will also match the approximation.

INDEX-RANGE LOCKING EXAMPLE

Original predicate: room 123 between 12:00 and 13:00

  • Approximation 1: lock all bookings for room 123, at any time
  • Approximation 2: lock all rooms between 12:00 and 13:00

Both approximations are safe—they include the original condition.

How it works with indexes:

Scenario 1: Index on room_id

  • Database finds existing bookings for room 123 using index
  • Attaches shared lock to this index entry
  • Any transaction trying to insert/update room 123 booking must update same index entry
  • Encounters the shared lock and must wait

Scenario 2: Index on time

  • Database finds bookings overlapping 12:00-13:00 using time index
  • Attaches shared lock to range of values in index
  • Any transaction trying to insert/update overlapping booking must touch same index range
  • Encounters the lock and must wait

Fallback: If no suitable index exists, database can use shared lock on entire table (bad for performance but safe).

💡 Insight

Index-range locks aren't as precise as predicate locks (they may lock more than strictly necessary), but they have much lower overhead. This trade-off makes them practical for production systems, whereas pure predicate locks would be too slow.

4.3. Serializable Snapshot Isolation (SSI)

In plain English: SSI is like having everyone proceed optimistically (no blocking), but with a security camera checking if anything went wrong. If two people's actions would have conflicted, one gets caught by the camera and has to redo their work. Most of the time everyone proceeds smoothly; conflicts are rare enough that the occasional retry is worth the better performance.

In technical terms: Serializable Snapshot Isolation provides full serializability with only a small performance penalty compared to snapshot isolation. It's an optimistic concurrency control technique—instead of blocking when something might go wrong, transactions proceed anyway and are checked at commit time.

The breakthrough: SSI was first described in 2008—comparatively new! Prior to this, the choice was:

  • Serializable isolation with poor performance (2PL) or limited scale (serial execution)
  • Weak isolation with good performance but prone to race conditions

SSI bridges this gap.

Used by:

SSI IMPLEMENTATIONS
Single-Node Databases
  • PostgreSQL (serializable level)
  • SQL Server In-Memory OLTP/Hekaton
  • HyPer
Distributed Databases
  • CockroachDB
  • FoundationDB
Embedded Storage
  • BadgerDB

4.3.1. Pessimistic versus optimistic concurrency control

PESSIMISTIC VS OPTIMISTIC CONCURRENCY

| | Pessimistic (2PL, Serial Execution) | Optimistic (SSI) |
|---|---|---|
| Philosophy | If anything might go wrong, wait until safe | Continue anyway, check at commit time |
| Blocking | Transactions wait for locks | Transactions proceed without blocking |
| When conflict detected | Before the operation | At commit (after work done) |
| High contention | Slow but no wasted work | Many aborts, wasted work |
| Spare capacity | Underutilizes resources | Better throughput |

Serial execution is pessimistic in the extreme: it is equivalent to each transaction holding an exclusive lock on the entire database for its duration. It compensates by making each transaction very fast to execute.

SSI's optimistic approach: when a transaction wants to commit, the database checks whether anything bad happened (i.e., whether isolation was violated). If so, the transaction is aborted and must be retried. Only transactions that executed serializably are allowed to commit.

Performance characteristics:

  • High contention: Optimistic performs badly (many aborts, wasted work from retries)
  • Spare capacity + low contention: Optimistic performs better than pessimistic
  • Commutative operations: Reduce contention (e.g., concurrent counter increments don't conflict if not read in same transaction)

💡 Insight

The key insight: SSI is based on snapshot isolation (all reads from consistent snapshot) + an algorithm for detecting serialization conflicts. It combines the performance benefits of snapshot isolation with the safety guarantees of serializability.

4.3.2. Decisions based on an outdated premise

The write skew pattern (revisited):

Recall from write skew: a transaction reads data, makes a decision based on what it saw, then writes. Under snapshot isolation, the data may have changed by commit time—the premise may no longer be true.

In plain English: You made a decision based on old information (like checking "are there 2 doctors on-call?" and seeing "yes"), but by the time you act on that decision (going off-call), the information is outdated (the other doctor already went off-call).

In technical terms: The transaction takes an action based on a premise (a fact true at transaction start). When committing, the premise may no longer be true because other transactions modified the data. To provide serializability, the database must detect when a transaction acted on an outdated premise and abort it.

The challenge: The database doesn't know how application logic uses query results. To be safe, it must assume any change in query results means the transaction's writes may be invalid—there may be a causal dependency between queries and writes.

Two cases to consider:

SSI DETECTION SCENARIOS
👁️
Stale MVCC Reads
Uncommitted write occurred before the read
  • Transaction reads from snapshot
  • Ignores uncommitted writes (MVCC rule)
  • By commit time, those writes have committed
  • Read was based on stale data
✍️
Writes Affecting Prior Reads
Write occurs after the read
  • Transaction reads data
  • Another transaction modifies that data
  • Original transaction's premise invalidated
  • Decision based on now-changed data

4.3.3. Detecting stale MVCC reads

The scenario:

DETECTING STALE MVCC READS
1
Transaction 42 Starts
Modifies Aaliyah's on-call status (uncommitted)
2
Transaction 43 Starts
Reads Aaliyah's status, sees on_call=true (ignoring txn 42 per MVCC)
3
Transaction 42 Commits
Aaliyah's status now committed as on_call=false
4
Transaction 43 Wants to Commit
Database detects: ignored write from txn 42 has now committed
5
Transaction 43 Aborted
Premise no longer valid—must retry

How it works:

  1. Track ignored writes: Database remembers when a transaction ignores another's writes due to MVCC visibility rules
  2. Check at commit: When transaction wants to commit, check if any ignored writes have now committed
  3. Abort if stale: If so, the transaction must be aborted

Why wait until commit? Why not abort immediately when stale read detected?

  • Read-only transactions don't need to abort (no risk of write skew)
  • Database doesn't know yet if transaction will later write
  • Transaction 42 may still abort (read turns out not to be stale after all)
  • Avoiding unnecessary aborts preserves snapshot isolation's support for long-running reads
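
A minimal sketch of that commit-time check, assuming the database has recorded, per transaction, which other transactions' writes were ignored under the MVCC visibility rules; the Txn type and its field names are invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Txn:
    id: int
    writes: set = field(default_factory=set)               # objects this txn wrote
    ignored_writes_from: set = field(default_factory=set)  # txn ids whose writes the snapshot ignored

def can_commit(txn: Txn, committed_txn_ids: set) -> bool:
    """Commit-time check for stale MVCC reads."""
    if not txn.writes:
        return True  # read-only transactions never need to abort here
    # If any write we ignored while reading our snapshot has since committed,
    # the premise may be stale: abort and let the application retry.
    return txn.ignored_writes_from.isdisjoint(committed_txn_ids)
```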

4.3.4. Detecting writes that affect prior reads

The scenario:

DETECTING WRITES AFFECTING PRIOR READS
Transaction 42
Read doctors WHERE shift_id=1234
Sees 2 doctors on-call
UPDATE: Set Aaliyah off-call
Transaction 43
Read doctors WHERE shift_id=1234
Sees 2 doctors on-call
UPDATE: Set Bryce off-call
Both transactions notify each other that prior reads are affected

How it works (similar to index-range locks, but without blocking):

  1. Record reads: When transaction searches (e.g., WHERE shift_id = 1234), use index entry to record which transactions read this data
  2. Check on write: When a transaction writes, look in indexes for other transactions that recently read affected data
  3. Notify conflicts: Instead of blocking, simply notify those transactions that their data may be outdated (acts as a "tripwire")
  4. Check at commit: When notified transaction tries to commit, check if conflicting write has committed

Example from diagram:

  • Transaction 43 notifies Transaction 42: your read is outdated
  • Transaction 42 notifies Transaction 43: your read is outdated
  • Transaction 42 commits first: Successful (Transaction 43's write hasn't committed yet)
  • Transaction 43 tries to commit: Conflicting write from 42 already committed → must abort

Tracking granularity trade-off:

  • Detailed tracking: Precise about which transactions abort, but high bookkeeping overhead
  • Coarse tracking: Fast, but may abort more transactions than strictly necessary
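
A conceptual sketch of the tripwire mechanism, with the same caveat that every name is invented: reads are recorded against index entries, a writer flags affected readers instead of blocking them, and the flag is only acted on when the flagged transaction tries to commit:

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Txn:
    id: int
    premise_invalidated_by: set = field(default_factory=set)  # txn ids that wrote data we read

class ReadTracker:
    """Records which transactions read which index entries (the tripwire)."""

    def __init__(self):
        self.readers = defaultdict(set)  # index entry -> ids of transactions that read it

    def record_read(self, txn: Txn, entry):
        self.readers[entry].add(txn.id)

    def record_write(self, writer: Txn, entry, active: dict):
        # Notify (rather than block) every other active transaction that read this entry.
        for reader_id in self.readers[entry] - {writer.id}:
            if reader_id in active:
                active[reader_id].premise_invalidated_by.add(writer.id)

def can_commit(txn: Txn, committed_txn_ids: set) -> bool:
    # Abort only if a transaction that invalidated our reads actually committed first.
    return txn.premise_invalidated_by.isdisjoint(committed_txn_ids)
```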

4.3.5. Performance of serializable snapshot isolation

Engineering details matter: Many factors affect practical performance. Trade-off example: granularity of tracking reads/writes.

Compared to two-phase locking:

SSI VS 2PL

| | Two-Phase Locking | Serializable Snapshot Isolation |
|---|---|---|
| Blocking | Transactions wait for locks | No blocking (one transaction doesn't wait for another) |
| Query latency | Unpredictable, high variance | More predictable, lower variance |
| Read-only queries | May need locks | Run on a snapshot without any locks |
| Read-heavy workloads | Poor performance | Excellent performance |

Compared to serial execution:

  • Not limited to single CPU core throughput
  • FoundationDB distributes conflict detection across machines
  • Can scale to very high throughput
  • Transactions can read/write data in multiple shards while ensuring serializability

Compared to non-serializable snapshot isolation:

  • Checking for violations introduces overhead
  • Debate about whether it's worth it:
    • Some: "Serializability checking not worth the cost"
    • Others: "Performance now so good, no need for weaker isolation"

Abort rate significantly affects performance:

  • Long-running read-write transactions likely to conflict and abort
  • SSI requires fairly short read-write transactions
  • Long-running read-only transactions are fine
  • Less sensitive to slow transactions than 2PL or serial execution

💡 Insight

SSI represents a major advancement in database research—it's the first practical algorithm offering true serializability with performance approaching snapshot isolation. The key innovation: detect conflicts instead of preventing them, allowing much more concurrency.


5. Distributed Transactions

In plain English: Everything we've discussed so far works whether your database is on one machine or spread across many. But when a single transaction needs to update data on multiple machines, ensuring atomicity becomes much harder. You can't just tell each machine "commit now" because some might succeed while others fail, leaving your data inconsistent across machines.

In technical terms: Concurrency control for isolation (the "I" in ACID) applies similarly to both single-node and distributed databases. Consistency and durability also don't change much. However, atomicity requires special care in distributed transactions.

5.1. The Atomic Commitment Problem

Single-node atomicity:

On a single database node, atomicity is implemented by the storage engine:

SINGLE-NODE ATOMIC COMMIT
1
Client Requests Commit
Application asks database to commit transaction
2
Write Data to Disk
Database makes writes durable (write-ahead log)
3
Write Commit Record
Append commit record to log on disk (the deciding moment!)
4
Result Determined
If commit record written: committed. If crash before: aborted.

Key point: It's a single device (one disk controller, on one machine) that makes the commit atomic. Before the commit record is written, abort is still possible. After it's written, the transaction is committed (even if database crashes).
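
A toy illustration of the commit record as the deciding moment, using a plain append-only file; this is only a sketch and not how any particular storage engine formats its write-ahead log:

```python
import os

def commit(log_path, txn_id, writes):
    """Append the transaction's writes, then the commit record."""
    with open(log_path, "ab") as log:
        for key, value in writes:
            log.write(f"WRITE {txn_id} {key}={value}\n".encode())
        log.flush()
        os.fsync(log.fileno())  # writes are durable, but the transaction is not yet committed
        log.write(f"COMMIT {txn_id}\n".encode())
        log.flush()
        os.fsync(log.fileno())  # once this returns, the transaction is committed

def committed_transactions(log_path):
    """On restart, a transaction counts as committed only if its COMMIT record reached disk."""
    committed = set()
    with open(log_path, "rb") as log:
        for line in log:
            parts = line.decode().split()
            if parts and parts[0] == "COMMIT":
                committed.add(parts[1])
    return committed
```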

Multi-node problem:

What if multiple nodes are involved? Examples:

  • Multi-object transaction in sharded database
  • Global secondary index (index on different node from primary data)

Simply sending commit requests to all nodes doesn't work:

NAIVE MULTI-NODE COMMIT FAILS
Application sends "commit" to all nodes
Possible Failures
Constraint violation on some nodes
Network loses some commit requests
Some nodes crash before commit
Result: Transaction commits on some nodes, aborts on others!

Why this is catastrophic:

INCONSISTENT COMMIT PROBLEM
1
User 1: Transaction Partially Commits
Commits on DB 1, fails on DB 2
2
User 2: Reads from DB 2
Reads data from successful commit on DB 2
3
User 1: Discovers Failure
Learns commit failed on DB 1
4
Impossible to Undo
Can't abort—User 2 already saw committed data on DB 2!

Once data is committed on one node, it becomes visible to other transactions (under read committed or stronger isolation). If it's later aborted on another node, transactions that read the "committed" data would have to be reverted too—impossible!

The solution: Ensure all nodes involved in a transaction either all commit or all abort. This is the atomic commitment problem.

5.2. Two-Phase Commit (2PC)

In plain English: Two-phase commit is like a wedding ceremony. The officiant asks both the bride and groom "Do you take this person?" individually. Only after both say "I do" does the officiant pronounce them married. If either says "no," the ceremony is aborted. Once the officiant makes the pronouncement, the marriage is final—no take-backs!

In technical terms: Two-phase commit is a classic algorithm for achieving atomic transaction commit across multiple nodes. Instead of a single commit request, the commit/abort process is split into two phases.

Used by:

  • Internal to some databases
  • XA transactions (e.g., via the Java Transaction API, JTA)
  • WS-AtomicTransaction (SOAP web services)

TWO-PHASE COMMIT FLOW
1
Phase 0: Application Writes
Application reads/writes data on multiple nodes (participants)
2
Phase 1: Prepare
Coordinator asks each participant: 'Are you ready to commit?'
3
Participants Vote
Each replies 'yes' (ready) or 'no' (abort)
4
Phase 2: Commit or Abort
If all voted 'yes': coordinator sends commit. If any voted 'no': coordinator sends abort.

Key component: The Coordinator (also called transaction manager)

  • Often implemented as a library within application process
  • Can be separate process or service
  • Examples: Narayana, JOTM, BTM, MSDTC

5.2.1. A system of promises

Why does 2PC work when simple commit doesn't?

To understand, let's break down the process in detail:

DETAILED 2PC PROTOCOL
1
1. Request Transaction ID
Application requests globally unique transaction ID from coordinator
2
2. Begin on Each Participant
App begins single-node transaction on each participant, attaching global txn ID
3
3. Reads and Writes
All reads/writes done in single-node transactions. If anything fails, coordinator or participant can abort.
4
4. Coordinator Sends Prepare
When app ready to commit, coordinator sends prepare to all participants
🤝
5. Participants Make Promise
Each participant ensures it CAN commit (write to disk, check constraints, no conflicts)
6. Participants Vote
Reply 'yes' = promise to commit. Reply 'no' = will abort. By saying 'yes', participant surrenders right to abort!
⚖️
7. Coordinator Decides
Write decision to transaction log on disk (commit point!)
♾️
8. Coordinator Sends Decision
Send commit/abort to all participants. If this fails, retry FOREVER.
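
A sketch of the coordinator's side of this protocol under simplifying assumptions: participants are objects exposing prepare/commit/abort methods, and log stands in for a durable append-only decision log. Real coordinators (such as those named above) handle far more, but the shape of the two phases is the same:

```python
import time

def two_phase_commit(txn_id, participants, log):
    # Phase 1 (prepare): collect a vote from every participant.
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare(txn_id))  # True means "yes, I promise I can commit"
        except Exception:
            votes.append(False)              # unreachable or failed counts as a "no" vote

    decision = "commit" if all(votes) else "abort"
    log.append(f"{txn_id} {decision}")       # the commit point: the decision is now irrevocable

    # Phase 2: enforce the decision, retrying forever if a participant is unreachable.
    for p in participants:
        while True:
            try:
                p.commit(txn_id) if decision == "commit" else p.abort(txn_id)
                break
            except Exception:
                time.sleep(1)                # keep retrying until the participant acknowledges
    return decision
```

The decision must reach durable storage before phase 2 starts; that single append is what makes the outcome unambiguous after a coordinator crash.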

Two crucial "points of no return":

2PC PROMISES
🤝
Participant's Promise
When voting 'yes'
  • Promises it will definitely commit if asked
  • Surrenders right to abort
  • Must keep promise even after crash/recovery
  • Coordinator may still abort
⚖️
Coordinator's Decision
Once decision written to log
  • Decision is irrevocable
  • Must be enforced no matter what
  • Even if requires infinite retries
  • No going back

Marriage analogy:

  • Before "I do": Both parties can abort ("No way!")
  • After "I do": Cannot retract, even if you faint
  • Recovering from fainting: Ask officiant for status of your transaction ("Am I married?")
  • Officiant's retries: Continue pronouncing you married until you acknowledge

💡 Insight

The key insight: 2PC separates the "I promise I can commit" moment (prepare phase) from the "actually commit now" moment (commit phase). This separation, combined with durable logging of the coordinator's decision, makes atomic commitment across nodes possible.

5.2.2. Coordinator failure

What happens if participants or network fail?

  • Prepare request fails/times out → Coordinator aborts
  • Commit/abort request fails → Coordinator retries indefinitely

But what if the coordinator crashes?

COORDINATOR FAILURE SCENARIOS

  1. Before sending prepare: the participant can safely abort (it hasn't promised anything)
  2. After the participant votes 'yes': the participant is IN DOUBT and cannot commit or abort unilaterally

The in-doubt (uncertain) state:

IN-DOUBT PARTICIPANT PROBLEM
Scenario: Coordinator decides to commit, crashes before telling all participants
Database 2
Received commit
Transaction committed
Database 1
Didn't receive commit
Coordinator crashed
What should I do?
Database 1 cannot decide!
• Can't commit: other participant may have aborted
• Can't abort: DB 2 has committed (would be inconsistent)
• Timeout doesn't help: still don't know what happened

The only solution: Wait for coordinator to recover.

Coordinator recovery process:

  1. Coordinator must write commit/abort decision to log before sending to participants
  2. When coordinator recovers, reads transaction log
  3. Determines status of all in-doubt transactions from log
  4. Transactions without commit record in log → abort
  5. Transactions with commit record → send commit to participants
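
Continuing the toy coordinator sketch from the previous section, recovery might look like the following; the log format and the participant interface are the same invented ones as before:

```python
def recover_in_doubt(log, prepared_txn_ids, participants_by_txn):
    """Resolve every transaction that participants prepared but got no decision for."""
    decisions = dict(entry.split() for entry in log)  # txn_id -> "commit" or "abort"
    for txn_id in prepared_txn_ids:
        decision = decisions.get(txn_id, "abort")     # no logged commit record means abort
        for p in participants_by_txn[txn_id]:
            if decision == "commit":
                p.commit(txn_id)
            else:
                p.abort(txn_id)
```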

Why 2PC is called a "blocking" protocol: Can become stuck waiting for coordinator to recover.

💡 Insight

The in-doubt state is 2PC's Achilles' heel. Participants hold locks and resources while waiting for the coordinator. If the coordinator takes 20 minutes to recover, those locks are held for 20 minutes, blocking other transactions and potentially making large parts of your application unavailable.

5.2.3. Three-phase commit

Alternative to 2PC: Three-phase commit (3PC) is a nonblocking atomic commit protocol.

However: 3PC assumes:

  • Network with bounded delay
  • Nodes with bounded response times

In practice: Most real systems have:

  • Unbounded network delay
  • Process pauses (see Chapter 9)

Result: 3PC cannot guarantee atomicity in practical systems.

Better solution: Replace single-node coordinator with a fault-tolerant consensus protocol (see Chapter 10).

5.3. XA Transactions

In plain English: XA is a standard that allows completely different technologies (Oracle database, IBM message queue, PostgreSQL, etc.) to participate in the same two-phase commit transaction. It defines a common language for coordinators and participants to communicate, even if they're built by different vendors.

In technical terms: X/Open XA (eXtended Architecture) is a standard for implementing two-phase commit across heterogeneous technologies. Introduced in 1991, widely implemented by traditional relational databases and message brokers.

5.3.1. Distributed transactions across different systems

Two types of distributed transactions (often conflated):

TYPES OF DISTRIBUTED TRANSACTIONS

| | Database-Internal | Heterogeneous (XA) |
|---|---|---|
| Participants | All nodes run the same database software | Two or more different technologies |
| Protocol | Can use any protocol, apply optimizations | Must use standard XA protocol |
| Performance | Can work quite well | More challenging |
| Examples | YugabyteDB, TiDB, FoundationDB, Spanner, VoltDB | PostgreSQL + ActiveMQ, Oracle + IBM MQ |

Heterogeneous transaction example:

HETEROGENEOUS TRANSACTION USE CASE
1
Message Arrives
Message queue delivers message to application
2
Process Message
Application processes message, writes to database
3
Atomic Commit
Database write AND message acknowledgment in single transaction
4
Result
Either both happen or neither happens (exactly-once semantics)

Exactly-once message processing:

  • If message delivery OR database write fails → both aborted
  • Message broker can safely redeliver later
  • By atomically committing message acknowledgment + database writes, message is processed exactly once
  • Partial completion is rolled back

Limitation: Only possible if all systems support same atomic commit protocol (e.g., XA).

Example problem without XA: If email server doesn't support 2PC, sending email can't be part of atomic transaction. Email might be sent twice if processing fails and is retried.

5.3.2. XA transactions

What is XA?

  • Not a network protocol—it's a C API for interfacing with transaction coordinator
  • Bindings exist for other languages (Java Transaction API/JTA in Java EE)

Supported by:

XA SUPPORT
Databases
PostgreSQL
MySQL
Db2
SQL Server
Oracle
Message Brokers
ActiveMQ
HornetQ
MSMQ
IBM MQ

How it works:

XA ARCHITECTURE
Application uses network driver/client library
Driver supports XA → calls XA API
Driver sends necessary info to database/message broker
Driver exposes callbacks for coordinator (prepare, commit, abort)

Transaction coordinator implementation:

  • Often a library loaded into same process as application
  • Not a separate service
  • Keeps track of participants
  • Collects responses after prepare
  • Uses log on local disk to track commit/abort decisions

If application crashes:

  1. Coordinator goes with it (same process)
  2. Participants stuck in doubt
  3. Must restart application server
  4. Coordinator reads log to recover decisions
  5. Uses driver callbacks to tell participants to commit/abort

Important limitation: Database cannot contact coordinator directly—all communication via client library.

5.3.3. Holding locks while in doubt

In plain English: When a participant is in doubt, it's like a person standing in a doorway—they're blocking the way but don't know whether to go forward (commit) or backward (abort). Meanwhile, everyone else who needs to use that doorway (access that data) has to wait.

In technical terms: Why is being stuck in doubt such a problem?

Lock requirements:

  • Transactions take row-level exclusive locks on modified rows (prevent dirty writes)
  • With serializable isolation, also take shared locks on read rows
  • Cannot release locks until commit or abort
LOCKS HELD WHILE IN DOUBT
1
Normal Transaction
Acquire locks → Do work → Commit/Abort → Release locks (seconds)
2
In-Doubt Transaction
Acquire locks → Vote 'yes' → Coordinator crashes → Hold locks forever
3
Impact
Other transactions blocked from reading or writing those rows

Example scenario:

  • 2PC transaction modifies rows, votes "yes"
  • Coordinator crashes, takes 20 minutes to restart
  • Locks held for 20 minutes
  • Other transactions wanting to access those rows: blocked for 20 minutes
  • Depending on isolation level: may block even read-only queries

Result: Large parts of application become unavailable.

💡 Insight

This is why 2PC has such a bad reputation in practice. In theory it provides atomicity; in practice, coordinator failures can cause prolonged outages. Unlike other failure modes that affect only the failed component, 2PC coordinator failures can bring down large parts of your entire application.

5.3.4. Recovering from coordinator failure

In theory: Coordinator should cleanly recover from its log and resolve in-doubt transactions.

In practice: Orphaned in-doubt transactions occur—coordinator cannot decide outcome because:

  • Transaction log lost or corrupted
  • Software bug
  • Other issues

These transactions:

  • Cannot be resolved automatically
  • Sit forever in database
  • Hold locks indefinitely
  • Block other transactions perpetually

Rebooting doesn't fix it: 2PC must preserve locks of in-doubt transactions even across restarts (otherwise violates atomicity guarantee).

The only solution: Manual intervention

Administrator must:

  1. Examine participants of each in-doubt transaction
  2. Determine if any participant has committed or aborted
  3. Manually apply same outcome to other participants

Challenges:

  • Requires manual effort
  • Done under stress during production outage
  • High time pressure
  • Complex and error-prone

Emergency escape hatch: Heuristic decisions

  • Participants can unilaterally decide to commit or abort
  • Without coordinator's definitive decision
  • "Heuristic" = euphemism for "probably breaking atomicity"
  • Violates 2PC's system of promises
  • Intended only for catastrophic situations, not regular use

💡 Insight

The combination of holding locks during in-doubt periods and potential for manual recovery makes XA transactions operationally risky. Many cloud services choose not to implement distributed transactions specifically to avoid these operational problems.

5.3.5. Problems with XA transactions

FUNDAMENTAL XA PROBLEMS
Coordinator Single Point of Failure
For the entire system
  • Part of application server
  • Coordinator logs are crucial
  • As important as databases themselves
  • Hard to make highly available
🚫
No Direct Communication
Fundamental protocol limitation
  • Coordinator and participants can't talk directly
  • Must communicate via application code
  • Application becomes single point of failure
  • Even if coordinator is replicated
⬇️
Lowest Common Denominator
Must work with wide range of systems
  • Can't detect deadlocks across systems
  • Doesn't work with SSI
  • No advanced features
  • Limited to what all systems support

The application code problem:

Even if coordinator were highly available and replicated, application code would still be a single point of failure. Solving this would require totally redesigning how application code runs (similar to durable execution), but no practical tools take this approach.

Limitations across heterogeneous systems:

  • Deadlock detection: Would require standardized protocol for exchanging lock information
  • SSI: Would require protocol for identifying conflicts across systems
  • Both would need to be standardized and implemented by all participants

💡 Insight

These problems are somewhat inherent in performing transactions across heterogeneous technologies. However, keeping heterogeneous systems consistent is still a real problem. The solution isn't XA—we need different approaches (covered in next section and Chapter 12).

5.4. Database-Internal Distributed Transactions

In plain English: When a database is designed from the ground up to be distributed (like modern "NewSQL" databases), it can use much better protocols for distributed transactions than XA. Because all the nodes run the same software and can talk directly to each other, they can use optimized protocols and avoid XA's problems.

In technical terms: Database-internal distributed transactions are those where all participating nodes are shards of the same database running the same software. These avoid XA's limitations because they don't need to interface with other technologies.

Examples: CockroachDB, TiDB, Spanner, FoundationDB, YugabyteDB, Kafka

Key difference: Many use 2PC internally but don't suffer XA problems because they can optimize the protocol.

DATABASE-INTERNAL VS XA TRANSACTIONS

| | XA (Heterogeneous) | Database-Internal |
|---|---|---|
| Coordinator | Single-node, part of application | Replicated with automatic failover |
| Communication | Via application code only | Direct communication between nodes |
| Participant replication | Up to individual systems | All shards replicated |
| Concurrency control | Limited to lowest common denominator | Integrated deadlock detection, SSI support |
| Performance | Poor | Good |

How database-internal distributed transactions fix XA problems:

IMPROVEMENTS OVER XA
🔄
Replicated Coordinator
Automatic failover
  • Coordinator replicated using consensus
  • Fails over automatically if primary crashes
  • No manual intervention needed
  • No prolonged unavailability
↔️
Direct Communication
Between coordinator and shards
  • Don't need to route via application
  • Application not single point of failure
  • More reliable communication
  • Better performance
📦
Replicated Shards
Fault tolerance
  • Each shard is replicated
  • Reduces risk of abort due to faults
  • Better availability
  • Faster recovery
🔗
Integrated Protocols
Advanced features
  • Deadlock detection across shards
  • Snapshot isolation across shards
  • SSI support
  • Optimized for single system

Consensus algorithms: Commonly used to replicate both the coordinator and the database shards. We'll see in Chapter 10 how to implement atomic commitment on top of consensus.

Isolation levels: Snapshot isolation and SSI are both possible across shards.

💡 Insight

The key lesson: Distributed transactions aren't inherently bad—XA transactions are bad. When the entire distributed database is designed as a cohesive system, distributed transactions can work well. This is a major achievement of NewSQL databases.

5.4.1. Exactly-once message processing revisited

Recall: An important use case for distributed transactions is ensuring operations take effect exactly once, even if crashes occur and processing needs retry.

With distributed transactions: Atomically commit across message broker and database—acknowledge message if and only if database writes succeeded.

Alternative approach (without distributed transactions):

Only requires transactions within the database:

EXACTLY-ONCE WITHOUT DISTRIBUTED TRANSACTIONS
1
1. Check Message ID
Begin DB transaction, check if message ID already in processed table
2
2. If Already Processed
Drop message, acknowledge to broker (idempotent retry)
3
3. If Not Processed
Add message ID to table, process message, commit DB transaction
4
4. Acknowledge to Broker
After successful DB commit, acknowledge message
5
5. Cleanup (Optional)
Delete message ID from table (separate transaction)

Crash scenarios:

| Crash Point | What Happens | Result |
|---|---|---|
| Before DB commit | Transaction aborted, broker retries | Message reprocessed; ID check prevents duplicate |
| After DB commit, before ACK | Broker retries | Message ID already in DB, dropped as duplicate |
| After ACK, before cleanup | Old message ID remains | Harmless (just takes storage space) |
| Network interruption | If it occurred before commit, transaction aborted | Uniqueness constraint prevents concurrent duplicate inserts |

Key insight: Recording the message ID in the database makes processing idempotent—can safely retry without duplicating side effects.
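
A sketch of this pattern using psycopg2 (2.8 or later), assuming a processed_messages table with a uniqueness constraint on message_id and a broker message object exposing an ack() method; all table, column, and attribute names here are illustrative:

```python
import psycopg2
from psycopg2 import errors

def handle_message(conn, message):
    """Process a message exactly once: the message-ID insert and the business
    writes commit in one database transaction, so redelivery becomes a no-op."""
    try:
        with conn:  # a single database transaction: commit on success, rollback on error
            with conn.cursor() as cur:
                # The uniqueness constraint on message_id makes this fail on redelivery.
                cur.execute(
                    "INSERT INTO processed_messages (message_id) VALUES (%s)",
                    (message.id,))
                cur.execute(
                    "INSERT INTO orders (id, payload) VALUES (%s, %s)",
                    (message.order_id, message.payload))
    except errors.UniqueViolation:
        pass  # already processed earlier: safe to drop the duplicate
    message.ack()  # acknowledge to the broker only after the commit succeeded
```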

Similar approach: Used by Kafka Streams for exactly-once semantics (Chapter 12).

Internal distributed transactions still useful: Allow message IDs on one shard, main data on other shards, with atomicity across shards.

💡 Insight

You don't always need distributed transactions across different systems to achieve exactly-once semantics. With idempotent operations and message IDs, you can achieve the same guarantee with only single-system transactions. However, distributed transactions within a database (across shards) are still valuable for scalability.


6. Summary

In plain English: Transactions are a lifesaver for application developers. Instead of manually handling crashes, network failures, concurrent users, and countless edge cases, transactions let you pretend these problems don't exist. If something goes wrong, just retry. This simplifies programming enormously.

In technical terms: Transactions are an abstraction layer that allows applications to pretend certain concurrency problems and hardware/software faults don't exist. A large class of errors reduces down to simple transaction abort, requiring only retry logic.

What we covered:

TRANSACTION CONCEPTS COVERED
📖
ACID Meaning
What the letters really mean
  • Atomicity = abortability
  • Consistency = application invariants
  • Isolation = concurrency control
  • Durability = no data loss
⚠️
Weak Isolation Levels
Common but dangerous
  • Read committed
  • Snapshot isolation
  • Still allow many race conditions
  • Naming confusion across databases
🏁
Race Conditions
Problems with weak isolation
  • Dirty reads/writes
  • Read skew
  • Lost updates
  • Write skew and phantoms
Serializability
Strongest isolation
  • Serial execution
  • Two-phase locking (2PL)
  • Serializable snapshot isolation (SSI)
  • Prevents all race conditions

Isolation Levels Summary

We discussed several widely-used isolation levels and characterized them by race conditions they allow or prevent:

ISOLATION LEVELS AND ANOMALIES
| Isolation Level | Dirty Reads | Read Skew | Phantom Reads | Lost Updates | Write Skew |
|---|---|---|---|---|---|
| Read Uncommitted | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible |
| Read Committed | ✓ Prevented | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible |
| Snapshot Isolation | ✓ Prevented | ✓ Prevented | ✓ Prevented | ? Depends | ✗ Possible |
| Serializable | ✓ Prevented | ✓ Prevented | ✓ Prevented | ✓ Prevented | ✓ Prevented |

Anomaly definitions:

  • Dirty reads: Reading another client's uncommitted writes
  • Dirty writes: Overwriting data another client wrote but hasn't committed
  • Read skew: Seeing different parts of database at different points in time (nonrepeatable reads)
  • Lost updates: Two concurrent read-modify-write cycles; one overwrites other's write
  • Write skew: Transaction reads data, makes decision, writes—but premise no longer true at commit time
  • Phantom reads: Transaction reads objects matching search; another transaction's write affects search results

Serializability Implementation Summary

Only serializable isolation protects against all anomalies. We discussed three approaches:

IMPLEMENTING SERIALIZABILITY
➡️
Serial Execution
One transaction at a time
  • Use stored procedures
  • Fast if transactions short
  • Limited to single CPU core
  • Can shard if cross-shard rare
🔒
Two-Phase Locking (2PL)
Decades-old standard
  • Readers block writers, vice versa
  • Poor performance
  • Avoided by many applications
  • Unstable latencies
Serializable Snapshot Isolation (SSI)
Modern optimistic approach
  • Detect conflicts, abort instead of blocking
  • Avoids most downsides of previous approaches
  • Good performance
  • Fairly new (2008)

Distributed Transactions Summary

We examined achieving atomicity when transactions span multiple nodes:

Two-phase commit (2PC):

  • Coordinator + participants
  • Prepare phase (voting)
  • Commit/abort phase (decision)

Database-internal distributed transactions (same database software on all nodes): Can work quite well using 2PC with optimizations.

XA transactions (across different storage technologies):

  • Very sensitive to coordinator and application code faults
  • Interacts poorly with concurrency control
  • Alternative: Idempotence can achieve exactly-once semantics without atomic commit across different systems

💡 Insight

Transactions are a valuable feature regardless of data model. While not every application needs transactions, they hugely reduce the number of error cases you need to think about. Without transactions, complex interacting accesses become very difficult to reason about, and data can easily become inconsistent.

Looking ahead: The examples in this chapter used relational data models, but as discussed in "The need for multi-object transactions," transactions are valuable for any data model.

