Chapter 8. Transactions

Some authors have claimed that general two-phase commit is too expensive to support, because of the performance or availability problems that it brings. We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions.

Introduction
- 1.1. What Can Go Wrong
- 1.2. The Purpose of Transactions
What Exactly Is a Transaction?
- 2.1. The Evolution of Transactions
- 2.2. The Meaning of ACID
- 2.3. Single-Object and Multi-Object Operations
Weak Isolation Levels
- 3.1. Read Committed
- 3.2. Snapshot Isolation and Repeatable Read
- 3.3. Preventing Lost Updates
- 3.4. Write Skew and Phantoms
Serializability
- 4.1. Actual Serial Execution
- 4.2. Two-Phase Locking (2PL)
- 4.3. Serializable Snapshot Isolation (SSI)
Distributed Transactions
- 5.1. The Atomic Commitment Problem
- 5.2. Two-Phase Commit (2PC)
- 5.3. XA Transactions
- 5.4. Database-Internal Distributed Transactions
Summary

1. Introduction

In plain English: Think of a transaction like making multiple bank transfers in one atomic operation. Either all the transfers complete successfully (you get your money and the recipient gets theirs), or none of them happen (everyone's balance stays the same). You never end up in a weird state where your money disappeared but the recipient didn't receive it.

In technical terms: A transaction is a way for an application to group several reads and writes together into a logical unit. Conceptually, all the reads and writes in a transaction are executed as one operation: either the entire transaction succeeds (commit) or it fails (abort, rollback).

Why it matters: Without transactions, you'd need to manually handle partial failures, network interruptions, concurrent writes, and countless edge cases. Transactions simplify error handling dramatically—if something goes wrong, just abort and retry. This reduces an enormous problem space down to simple retry logic.

1.1. What Can Go Wrong

In the harsh reality of data systems, many things can go wrong:

PROBLEMS IN DATA SYSTEMS

🔥

Hardware Failures

Database crashes mid-write
Disk becomes full
Power failure

💥

Software Crashes

App crashes halfway through
Process terminated
Out of memory errors

📡

Network Issues

Connection interrupted
Timeout while writing
Packets lost

🔀

Concurrency Problems

Clients overwrite each other
Read partially updated data
Race conditions

💡 Insight

The Post Office Horizon scandal—where faulty accounting software caused massive financial losses and wrongful prosecutions—was likely caused by lack of proper ACID transactions. This isn't just theoretical: weak transaction guarantees can destroy businesses and lives.

1.2. The Purpose of Transactions

For decades, transactions have been the mechanism of choice for simplifying these issues.

What transactions provide:

All-or-nothing guarantee: Either the entire transaction succeeds or it fails cleanly
Isolation from concurrency: The database handles concurrent access so you don't have to
Simple error handling: If something fails, just retry the whole transaction
Human fault tolerance: Deploy buggy code? Roll back and rerun with the fix

When you might not need transactions:

Very simple access patterns (single record reads/writes)
Systems where eventual consistency is acceptable
Use cases requiring maximum performance/availability

However, figuring out whether you need transactions requires understanding exactly what safety guarantees they provide and what costs they impose. Let's dive deep.

2. What Exactly Is a Transaction?

2.1. The Evolution of Transactions

Almost all relational databases today follow the transaction model introduced in 1975 by IBM System R—the first SQL database. Despite 50 years of evolution, the general idea remains remarkably similar across MySQL, PostgreSQL, Oracle, and SQL Server.

The NoSQL era (late 2000s):

In the late 2000s, nonrelational (NoSQL) databases gained popularity by offering:

New data models (documents, graphs, key-value)
Built-in replication and sharding
Higher performance and availability

The cost: Many abandoned transactions entirely, leading to a popular belief that transactions were fundamentally unscalable.

The NewSQL renaissance (2010s+):

EVOLUTION OF TRANSACTION SUPPORT

1975-2005: Traditional ACID

Single-node databases, strong guarantees, limited scale

→

2005-2015: NoSQL Era

Distributed systems, weak guarantees, high scale

→

2015-Present: NewSQL

Distributed + ACID, proving transactions can scale

NewSQL databases like CockroachDB, TiDB, Spanner, FoundationDB, and YugabyteDB proved that transactional systems can scale to massive data volumes and high throughput by combining sharding with consensus protocols.

💡 Insight

The key lesson: Transactions aren't fundamentally unscalable—early NoSQL systems abandoned them prematurely. Modern distributed databases show you can have both ACID guarantees and horizontal scalability, though with careful engineering and trade-offs.

2.2. The Meaning of ACID

The safety guarantees provided by transactions are described by the acronym ACID: Atomicity, Consistency, Isolation, and Durability. Coined in 1983 by Theo Härder and Andreas Reuter, it established precise terminology for fault-tolerance mechanisms.

The problem: In practice, one database's ACID implementation differs significantly from another's. When a system claims to be "ACID compliant," it's often unclear what guarantees you actually get. ACID has unfortunately become mostly a marketing term.

Systems that don't meet ACID criteria are sometimes called BASE (Basically Available, Soft state, Eventual consistency). This is even more vague than ACID—the only sensible definition of BASE is "not ACID."

Let's break down what each letter actually means:

2.2.1. Atomicity

In plain English: Think of atomicity like a light switch—it's either on or off, never halfway. If you're making multiple database changes and something fails partway through, atomicity ensures all changes are rolled back, leaving no partial state.

In technical terms: ACID atomicity describes what happens if a client wants to make several writes, but a fault occurs after some writes have been processed. If the writes are grouped into an atomic transaction and the transaction cannot be completed due to a fault, the transaction is aborted and the database must discard or undo any writes it has made.

Why "abortability" would be a better term:

ATOMICITY IN ACTION

Without Atomicity

Transfer $100

↓

✓ Deduct from Account A

↓

✗ CRASH!

↓

Money disappeared!

With Atomicity

Transfer $100

↓

✓ Deduct from Account A

↓

✗ CRASH!

↓

↻ Automatic rollback

↓

Both accounts unchanged

Key point: Without atomicity, if an error occurs partway through multiple changes, it's difficult to know which changes took effect. The application could retry, but that risks making the same change twice. Atomicity simplifies this: if a transaction was aborted, nothing changed, so it's safe to retry.

Note: In multi-threaded programming, "atomic" means indivisible operations that other threads can't observe partially. In ACID, atomicity is NOT about concurrency—that's covered by Isolation (the "I" in ACID). ACID atomicity is about fault handling.

2.2.2. Consistency

The problem: "Consistency" is terribly overloaded. It means at least five different things:

FIVE MEANINGS OF 'CONSISTENCY'

Replica Consistency

Eventual consistency in async replication (Chapter 6)

Consistent Snapshot

Backup at one moment in time, respecting happens-before

Consistent Hashing

Sharding approach for rebalancing (Chapter 7)

CAP Consistency

Linearizability in CAP theorem (Chapter 10)

ACID Consistency

Application-specific invariants always satisfied

In plain English: ACID consistency means your data follows the business rules you've defined. For example, in an accounting system, credits and debits must always balance. If a transaction starts with valid data and preserves validity during execution, the database stays in a "good state."

In technical terms: ACID consistency refers to application-specific invariants that must always be true. If you declare these as constraints (foreign keys, uniqueness, check constraints), the database enforces them. For more complex invariants, it's the application's responsibility to define transactions correctly.

Why it matters: The "C" in ACID is somewhat misplaced—consistency is actually a property of the application, not just the database. The database provides mechanisms (constraints, triggers), but ultimately the application must use transactions correctly to preserve consistency.

💡 Insight

Consistency is the odd one out in ACID—Atomicity, Isolation, and Durability are properties the database guarantees, but Consistency depends on the application using the database correctly. If you write bad data that violates your invariants but haven't declared those invariants, the database can't stop you.

2.2.3. Isolation

In plain English: Imagine you and a friend are both trying to increment a counter at the exact same time. Without isolation, you might both read "42", both compute "43", and both write "43"—so the counter only increases by 1 instead of 2. Isolation prevents these race conditions.

In technical terms: Isolation means concurrently executing transactions are isolated from each other—they cannot step on each other's toes. The classic formalization is serializability: each transaction can pretend it's the only transaction running. The database ensures that when transactions commit, the result is the same as if they had run serially (one after another).

RACE CONDITION WITHOUT ISOLATION

Counter = 42 (in database)

↓

User 1

↓

Compute: 42 + 1

↓

Write: 43

User 2

↓

Compute: 42 + 1

↓

Write: 43

↓

Counter = 43 (should be 44!)

The reality: Serializability has a performance cost. In practice, many databases use weaker isolation levels that allow concurrent transactions to interfere in limited ways:

Oracle's "serializable" actually implements snapshot isolation (weaker than true serializability)
MySQL, PostgreSQL offer multiple isolation levels, not all serializable
Some race conditions can still occur with weak isolation

We'll explore these isolation levels and their trade-offs in detail in Section 3.

2.2.4. Durability

In plain English: Durability is the promise that once you've successfully saved your data, it won't disappear—even if the power goes out, the server crashes, or the disk fails. Think of it like writing something in permanent ink versus a whiteboard.

In technical terms: Once a transaction has committed successfully, any data it has written will not be forgotten, even if there is a hardware fault or the database crashes.

How it's implemented:

DURABILITY IMPLEMENTATIONS

💾

Single-Node Database

Data written to nonvolatile storage (HDD/SSD)

Write-ahead log (WAL)
fsync() to force disk write
Recovery from log after crash

📦

Replicated Database

Data copied to multiple nodes

Replication to N nodes
Wait for confirmations
Survive node failures

The harsh reality—nothing is perfect:

💡 Insight

Perfect durability doesn't exist. There are only risk-reduction techniques: writing to disk, replicating to remote machines, backups. Each has flaws:

Disk writes: SSDs can violate fsync guarantees, firmware bugs, bad sectors (30-80% of SSDs develop bad blocks in 4 years)

Replication: Correlated faults can take down all replicas, async replication loses recent writes

Backups: Gradual corruption might affect all backups

Best practice: Use all techniques together—write to disk AND replicate AND backup. Take theoretical "guarantees" with a grain of salt.

Durability evolution:

Era	Implementation	Trade-offs
1970s	Archive tape	Reliable but slow to restore
1990s-2000s	Disk/SSD	Fast but single point of failure
2010s+	Replication	Available but complex failure modes

2.3. Single-Object and Multi-Object Operations

In plain English: Sometimes you need to update multiple related pieces of data at once. Imagine an email app that shows "3 unread messages" in a counter. When a new email arrives, you need to: (1) insert the email, and (2) increment the counter. If only one happens, the app shows inconsistent data.

In technical terms: Multi-object transactions are needed when several pieces of data must be kept in sync. ACID's atomicity and isolation describe what happens when a transaction modifies multiple objects (rows, documents, records).

MULTI-OBJECT TRANSACTION EXAMPLE

Transaction Start

emails table

100 rows

unread_count = 3

New email arrives→

Transaction End

emails table

101 rows

unread_count = 4

Without proper isolation:

DIRTY READ ANOMALY

Transaction 1: Start

Begin inserting new email

↓

Transaction 1: Insert Email

Email added, but counter not yet updated

↓

Transaction 2: Read Data

Sees new email but old counter (dirty read!)

↓

Transaction 1: Update Counter

Finally increments counter

↓

Transaction 1: Commit

Now consistent, but Transaction 2 saw inconsistent state

User 2 sees: Unread counter shows 3, but mailbox shows 4 unread emails. This violates isolation.

If an error occurs:

ATOMICITY VIOLATION

Start transaction

↓

✓ Insert email

↓

✗ Network timeout updating counter

↓

Without atomicity: email exists, counter wrong

With atomicity: email rolled back, nothing changed

How transactions are typically grouped:

In relational databases, everything between BEGIN TRANSACTION and COMMIT on a TCP connection belongs to the same transaction. If the TCP connection is interrupted, the transaction must be aborted.

In nonrelational databases: Many don't have a way of grouping operations. Even if there's a multi-put API, it may not have transaction semantics—some keys might succeed while others fail, leaving partial state.

2.3.1. Single-object writes

In plain English: Even writing a single document can fail partway through. What if you're saving a 20 KB JSON file and the network dies after 10 KB? Without atomicity, you'd get corrupted half-written data.

In technical terms: Storage engines almost universally provide atomicity and isolation at the single-object level (one key-value pair on one node):

Atomicity: Implemented using a write-ahead log for crash recovery
Isolation: Implemented using a lock on each object (one thread accesses at a time)

Useful single-object operations:

SINGLE-OBJECT OPERATIONS

Atomic Increment

UPDATE counters SET value = value + 1

No read-modify-write cycle
Database handles atomically
Prevents lost updates

Compare-and-Set (CAS)

Write only if value unchanged

Conditional write operation
Detects concurrent modifications
Similar to CPU CAS instructions

Important limitation: These single-object operations are useful but are NOT transactions in the traditional sense:

Cassandra/ScyllaDB "lightweight transactions": Linearizable reads and conditional writes on single object only
Aerospike "strong consistency": Single-object guarantees, no cross-object guarantees

2.3.2. The need for multi-object transactions

When you need multi-object transactions:

USE CASES FOR MULTI-OBJECT TRANSACTIONS

🔗

Foreign Keys

Relational data model

Row in table A references row in table B
Must keep references valid
Inserting related records must be atomic

📄

Denormalization

Document databases without joins

Same data stored in multiple places
Updates must hit all copies
Prevent denormalized data from diverging

📇

Secondary Indexes

Almost all databases

Indexes are separate database objects
Must update with every value change
Without isolation: record in one index, missing from another

💡 Insight

Applications CAN be implemented without multi-object transactions, but error handling becomes vastly more complicated without atomicity, and the lack of isolation causes concurrency problems. Many developers underestimate this complexity until production issues arise.

2.3.3. Handling errors and aborts

In plain English: A key feature of transactions is they can be aborted and safely retried if an error occurs. This is like an "undo" button for database operations.

In technical terms: ACID databases follow the philosophy: if the database is in danger of violating atomicity, isolation, or durability, abandon the transaction entirely rather than allow it to remain half-finished.

Not all systems follow this philosophy:

Leaderless replication datastores work on "best effort" basis
Won't undo what's already done if an error occurs
Application's responsibility to recover

The retry problem: Although retrying is simple and effective, it isn't perfect:

RETRY CHALLENGES

📡

Network Acknowledgment Lost

Transaction succeeded but client didn't know

Client times out waiting for ACK
Retries, executing twice
Need deduplication mechanism

🔥

Overload

Retrying makes problem worse

Error due to high contention
Retry adds more load
Use exponential backoff

⚠️

Transient vs Permanent Errors

Only retry transient errors

Deadlock: worth retrying
Constraint violation: pointless to retry
Distinguish error types

📧

Side Effects

Outside database actions

Sending email
Calling external API
Need two-phase commit for atomicity

ORM framework problem: Popular ORMs like Rails ActiveRecord and Django don't retry aborted transactions—exceptions bubble up and user input is lost. This defeats the whole point of aborts!

3. Weak Isolation Levels

In plain English: Perfect isolation (serializability) means transactions can't interfere with each other at all—like each transaction runs alone on the database. But this is expensive, so most databases use "weaker" isolation that allows some interference. Understanding these levels is crucial because they can cause subtle bugs.

In technical terms: If two transactions don't touch the same data, or both are read-only, they can safely run in parallel. Concurrency issues only arise when one transaction reads data being modified by another, or when transactions modify the same data simultaneously.

Why this matters:

💡 Insight

Concurrency bugs are hard to find by testing because they only trigger with unlucky timing—they're rare and difficult to reproduce. Weak isolation levels have caused substantial financial losses, corrupted customer data, and led to regulatory investigations. Even "ACID" databases often use weak isolation by default!

Concurrency control is relevant for security too: an attacker might deliberately send bursts of concurrent requests to exploit race conditions. Building reliable and secure applications requires systematically preventing these bugs.

3.1. Read Committed

The most basic isolation level. Read committed makes two guarantees:

No dirty reads: You only see committed data when reading
No dirty writes: You only overwrite committed data when writing

Some databases support an even weaker level called read uncommitted (prevents dirty writes but allows dirty reads). Let's explore these guarantees:

3.1.1. No dirty reads

In plain English: Imagine you're transferring money between accounts. Without dirty read prevention, someone else could see the money deducted from your account but not yet added to the recipient's account—it looks like money vanished. Dirty read prevention ensures you only see committed states.

In technical terms: A dirty read occurs when a transaction reads data written by another transaction that hasn't yet committed. Read committed prevents this: writes only become visible to others when the transaction commits (and then all writes become visible at once).

PREVENTING DIRTY READS

Initial: x = 2

↓

User 1

Begin transaction

↓

Set x = 3

↓

Not yet committed

User 2

Begin transaction

↓

Get x

↓

Returns: 2 (old value)

↓

User 1 commits → Now everyone sees x = 3

Why prevent dirty reads:

Partial updates visible: Transaction updates several rows; another transaction sees some but not others (like the email counter example)
Reading data that gets rolled back: If a transaction aborts, any reads of its uncommitted data would also need to be aborted (cascading aborts)

3.1.2. No dirty writes

In plain English: Imagine two people trying to buy the same car on a website simultaneously. Without dirty write prevention, one person might win the sale but the other person might get the invoice—a nonsensical outcome.

In technical terms: A dirty write occurs when one transaction overwrites an uncommitted value written by another transaction. Read committed prevents this, usually by delaying the second write until the first transaction commits or aborts.

DIRTY WRITE EXAMPLE: CAR PURCHASE

Available: Car #1234

↓

Aaliyah

Start buying car

↓

Update listings

buyer = Aaliyah

Bryce

Start buying car

↓

Update listings

buyer = Bryce

↓

Update invoices
(invoice to Aaliyah)

Update invoices
(invoice to Bryce)

↓

Without dirty write prevention: Sale to Bryce, invoice to Aaliyah!

With dirty write prevention: One person waits, only one completes

Important note: Read committed does NOT prevent all race conditions. The counter increment problem (Figure 8-1) can still occur because the second write happens after the first transaction commits—it's not a dirty write, but it's still incorrect. We'll discuss this in "Preventing Lost Updates."

3.1.3. Implementing read committed

Read committed is very popular—it's the default in Oracle, PostgreSQL, SQL Server, and many others.

Preventing dirty writes:

Most databases use row-level locks:

Transaction must acquire lock before modifying a row
Hold lock until commit or abort
Only one transaction can hold lock per row
Other transactions must wait

This locking happens automatically in read committed mode.

Preventing dirty reads—two approaches:

APPROACHES TO PREVENT DIRTY READS

Approach 1: Read Locks

Approach 2: Remember Old Values

Mechanism

Acquire lock briefly to read

Store both old and new values

Concurrency

One long write blocks many reads

Readers never blocked

Performance

Poor for read-heavy workloads

Excellent for read-heavy workloads

Used by

IBM Db2, MS SQL Server (some modes)

PostgreSQL, Oracle, most modern DBs

Approach 2 (most common): For every row being written:

Remember both old committed value and new value
Transactions writing hold the write lock
Transactions reading get the old value
Switch to reading new value after commit

This is related to multi-version concurrency control (MVCC), which we'll explore in depth next.

💡 Insight

The reason most databases prefer Approach 2 (remembering old values) is that long-running writes are common (analytics queries, backups). With read locks, one slow write would freeze the entire database for reads—unacceptable for production systems.

3.2. Snapshot Isolation and Repeatable Read

In plain English: Imagine checking your bank account balance across two accounts. You see Account A has $500, but by the time you check Account B, a transfer has moved $100 from B to A. Account B now shows $400, so your total appears to be $900 instead of $1,000. The money seems to have vanished!

In technical terms: Read committed prevents dirty reads but allows read skew (also called nonrepeatable read): if you read the same data twice in a transaction, you might get different results. Some situations can't tolerate this temporary inconsistency.

READ SKEW EXAMPLE

Initial State

Account 1: $500, Account 2: $500

↓

Aaliyah Reads Account 1

Sees $500

↓

Transfer Transaction Starts

Moving $100 from Account 2 to Account 1

↓

Transfer Commits

Account 1: $600, Account 2: $400

↓

Aaliyah Reads Account 2

Sees $400 (new value)

↓

Aaliyah's View

Total appears to be $900—$100 vanished!

When read skew is unacceptable:

SITUATIONS REQUIRING SNAPSHOT ISOLATION

💾

Backups

Taking backups may take hours

Writes continue during backup
Some parts contain old version
Other parts contain new version
Restoring gives permanent inconsistency

📊

Analytics Queries

Queries scanning large portions of database

Common in data warehouses
Integrity checks scanning all data
Return nonsensical results if observing different time points

The solution: Snapshot Isolation

In plain English: Imagine taking a photograph of your database. That photo captures everything at one specific moment. Even if things change in real life after you take the photo, the photo itself never changes. Snapshot isolation gives each transaction its own "photograph" of the database.

In technical terms: Each transaction reads from a consistent snapshot of the database—the transaction sees all data that was committed at the start of the transaction. Even if data is subsequently changed by other transactions, each transaction sees only the old data from its particular point in time.

Why it matters: Long-running read-only queries (backups, analytics) are much easier to reason about when operating on a consistent snapshot frozen at a particular moment.

Popularity: Supported by PostgreSQL, MySQL (InnoDB), Oracle, SQL Server. Some databases (Oracle, TiDB, Aurora DSQL) even use snapshot isolation as their highest isolation level.

3.2.1. Multi-version concurrency control (MVCC)

In plain English: Instead of overwriting data in place, the database keeps multiple versions side-by-side. When you read, you see the version that was current when your transaction started. When you write, you create a new version. Old versions stick around for transactions that started earlier.

In technical terms: To implement snapshot isolation, databases keep several different committed versions of each row because various in-progress transactions may need to see the database at different points in time. This technique is called multi-version concurrency control (MVCC).

Key principle: Readers never block writers, and writers never block readers. This allows long-running read queries on a consistent snapshot while processing writes normally, with no lock contention.

MVCC IMPLEMENTATION (PostgreSQL Example)

Each transaction gets unique, increasing transaction ID (txid)

↓

Table Structure

account_id

balance

inserted_by (txid)

deleted_by (txid)

Example Versions

Account 2, $500, txid=12, deleted=13

Account 2, $400, txid=13, deleted=NULL

↓

Update = Delete + Insert (both tagged with same txid)

How MVCC works:

Each transaction gets a unique, always-increasing ID when it starts
Every write is tagged with the transaction ID of the writer
Each row has inserted_by field: ID of transaction that inserted this row
Each row has deleted_by field: Initially empty; set to deleting transaction's ID
Deletes don't remove rows: Just mark for deletion; garbage collection removes later
Updates are delete + insert: Old row marked deleted, new row inserted

MVCC EXAMPLE: TRANSACTION 13 UPDATES ACCOUNT

Before

Row: Account 2

balance = $500

inserted_by = 12

deleted_by = NULL

Transaction 13: Deduct $100→

After

Old row (marked deleted)

balance = $500

inserted_by = 12

deleted_by = 13

New row (inserted)

balance = $400

inserted_by = 13

deleted_by = NULL

Storage: All versions stored in same database heap (data structure), regardless of commit status. Versions of same row form a linked list for easy iteration.

3.2.2. Visibility rules for observing a consistent snapshot

In plain English: When your transaction reads data, the database uses special rules to decide which version of each row you're allowed to see. These rules ensure you see a consistent snapshot—as if you took a photo at the moment your transaction started.

In technical terms: The database uses transaction IDs and visibility rules to present a consistent snapshot:

VISIBILITY RULES

Rule 1: Ignore In-Progress Transactions

At transaction start, list all other in-progress transactions. Ignore their writes, even if they later commit.

↓

Rule 2: Ignore Future Transactions

Ignore writes from transactions with higher IDs (started after current transaction).

↓

Rule 3: Ignore Aborted Transactions

Ignore all writes from aborted transactions, regardless of when they aborted.

↓

Rule 4: Everything Else Is Visible

All other writes are visible to the transaction's queries.

Simplified visibility check for a row:

A row is visible if BOTH conditions are true:

At transaction start time: The transaction that inserted the row had already committed
Not deleted, or: The transaction that deleted the row hadn't yet committed at transaction start time

Example from our MVCC diagram:

Transaction 12 reading Account 2 sees balance of $500
Why? Transaction 13's deletion hasn't committed yet (from Transaction 12's perspective)
The $400 balance inserted by Transaction 13 is also invisible (Transaction 13 hasn't committed)

💡 Insight

Long-running transactions can continue using old snapshots for extended periods, reading "stale" data that other transactions have long since overwritten. By never updating values in place but creating new versions instead, the database provides consistent snapshots with minimal overhead.

3.2.3. Indexes and snapshot isolation

The challenge: How do indexes work when there are multiple versions of each row?

Most common approach:

INDEX STRATEGIES FOR MVCC

Index entry points to ONE version (oldest or newest)

↓

Row version contains reference to next version

↓

Query follows chain to find visible version

↓

Garbage collection removes old versions and index entries

Optimizations:

PostgreSQL: Avoid index updates if different versions fit on same page
Some databases: Store only diffs between versions instead of full copies

Alternative approach (CouchDB, Datomic, LMDB):

Use immutable (copy-on-write) B-trees:

Don't overwrite pages when updated
Create new copy of each modified page
Parent pages copied and updated to point to new children
Every write creates a new B-tree root
Each root is a consistent snapshot at a point in time
No need to filter based on transaction IDs
Requires background compaction and garbage collection

3.2.4. Snapshot isolation, repeatable read, and naming confusion

The problem: Different databases use different terms for the same thing, and the same terms for different things!

NAMING CONFUSION

PostgreSQL

Calls snapshot isolation 'repeatable read'

Meets SQL standard requirements
Claims standards compliance
Actually implements snapshot isolation

Oracle

Calls snapshot isolation 'serializable'

Confusing terminology
NOT true serializability
Actually snapshot isolation

MySQL

'Repeatable read' = MVCC (weaker than snapshot isolation)

Different from PostgreSQL
Same term, different meaning
Weaker consistency than snapshot isolation

IBM Db2

'Repeatable read' = serializability

Completely different meaning
Strongest guarantee
Same term as PostgreSQL/MySQL!

Why the confusion?

The SQL standard doesn't have the concept of snapshot isolation (it was invented after the standard was written in 1975). The standard defines "repeatable read," which looks superficially similar. The standard's definition is flawed—ambiguous, imprecise, and not implementation-independent.

Result: Nobody really knows what "repeatable read" means!

💡 Insight

Don't rely on isolation level names alone. Always check your database's documentation for what guarantees it actually provides. The SQL standard's isolation level definitions are fundamentally broken—they don't match how modern databases actually work.

3.3. Preventing Lost Updates

In plain English: The lost update problem happens when two people edit the same thing at once. Both read the current value, make their changes, and write back. The second write "clobbers" the first one, losing those changes. It's like two people editing a Google Doc without real-time sync—the last one to save overwrites the other's work.

In technical terms: The lost update problem occurs when an application reads a value, modifies it, and writes back the modified value (a read-modify-write cycle). If two transactions do this concurrently, one modification can be lost because the second write doesn't include the first modification.

LOST UPDATE PROBLEM

Transaction 1

Read counter: 42

↓

Compute: 42 + 1 = 43

↓

Write: 43

Transaction 2

Read counter: 42

↓

Compute: 42 + 1 = 43

↓

Write: 43

↓

Result: 43 (should be 44—one update lost!)

Common scenarios:

Incrementing a counter or updating an account balance
Making local changes to complex values (e.g., adding element to JSON array)
Wiki page editing: Two users edit simultaneously, each sends full page content

A variety of solutions have been developed:

3.3.1. Atomic write operations

In plain English: Instead of reading the current value, modifying it in your application code, and writing it back, use a special database command that does the whole operation atomically in one step.

In technical terms: Many databases provide atomic update operations that remove the need to implement read-modify-write cycles in application code. They're usually the best solution if your code can be expressed using them.

ATOMIC OPERATIONS

SQL Databases

Atomic increment/update

UPDATE counters SET value = value + 1 WHERE key = 'foo'
Entire operation is concurrency-safe
Database handles locking internally

Document Databases

Atomic JSON modifications

MongoDB: atomic ops on JSON parts
Update nested fields atomically
No read-modify-write cycle needed

Redis

Atomic data structure operations

Priority queues
Lists, sets, sorted sets
All operations atomic

Implementation: Usually implemented by taking an exclusive lock on the object when read, or forcing all atomic operations on a single thread.

ORM framework danger: ORMs make it easy to accidentally write unsafe read-modify-write cycles instead of using atomic operations—a source of subtle bugs difficult to find by testing.

3.3.2. Explicit locking

In plain English: If atomic operations aren't flexible enough, manually lock the rows you're about to update. Other transactions trying to update the same rows must wait for your lock to be released.

In technical terms: The application explicitly locks objects that will be updated, performs a read-modify-write cycle, and any concurrent transaction trying to access the same object is forced to wait.

Example: Multiplayer game

BEGIN TRANSACTION;

-- Lock the game piece
SELECT * FROM figures
  WHERE name = 'robot' AND game_id = 222
  FOR UPDATE;  -- Tells database to lock these rows

-- Check if move is valid (custom game logic)
-- Then update the position
UPDATE figures SET position = 'c4' WHERE id = 1234;

COMMIT;

EXPLICIT LOCKING EXAMPLE

Player 1: SELECT...FOR UPDATE

Acquires lock on robot piece

↓

Player 2: Tries to Move Same Piece

Blocked waiting for Player 1's lock

↓

Player 1: Validates Move

Checks game rules in application code

↓

Player 1: UPDATE and COMMIT

Releases lock

↓

Player 2: Acquires Lock

Now can proceed with their move

Challenges:

Easy to forget locks somewhere in code → race condition
Deadlock risk if locking multiple objects (databases auto-detect and abort one transaction)
Must retry aborted transactions at application level

3.3.3. Automatically detecting lost updates

In plain English: Let transactions run in parallel and do their read-modify-write cycles. If the database detects a lost update, automatically abort the transaction and force it to retry. You don't need special code—the database just handles it.

In technical terms: Allow transactions to execute read-modify-write cycles in parallel. If the transaction manager detects a lost update, abort the transaction and force retry. Databases can perform this check efficiently with snapshot isolation.

AUTOMATIC LOST UPDATE DETECTION

Detects Lost Updates

Does NOT Detect

PostgreSQL

✓ repeatable read level

Oracle

✓ serializable level

SQL Server

✓ snapshot isolation

MySQL/InnoDB

✗ repeatable read level

Advantages:

No special database features required in application code
Less error-prone than forgetting locks
Happens automatically

Disadvantage: Must retry aborted transactions at application level

💡 Insight

Some argue a database must prevent lost updates to qualify as providing snapshot isolation, so by this definition MySQL doesn't provide true snapshot isolation. Lost update detection is a great feature because it's automatic and doesn't require special application code.

3.3.4. Conditional writes (compare-and-set)

In plain English: Update the value only if it hasn't changed since you last read it. If someone else changed it in the meantime, your update fails and you need to retry. It's like saying "I'll buy this if the price is still $10" instead of blindly buying regardless of current price.

In technical terms: In databases without transactions, a conditional write operation prevents lost updates by allowing an update only if the value hasn't changed since last read. If the current value doesn't match what you previously read, the update has no effect and the read-modify-write cycle must be retried.

Example: Preventing concurrent wiki edits

-- May or may not be safe, depending on database implementation
UPDATE wiki_pages
SET content = 'new content'
WHERE id = 1234
  AND content = 'old content';  -- Only update if content unchanged

Better approach with version numbers:

UPDATE wiki_pages
SET content = 'new content',
    version = version + 1
WHERE id = 1234
  AND version = 42;  -- Only update if version still 42

COMPARE-AND-SET OPERATION

Read

Get current value and version: (content='old', version=42)

↓

Modify

User edits content in application

↓

Write with Condition

UPDATE WHERE version=42

↓

Success or Failure

Success if version still 42, failure if someone else updated

Sometimes called "optimistic locking" (as opposed to pessimistic locking with SELECT FOR UPDATE).

MVCC caveat: In MVCC databases, there's often an exception to visibility rules for compare-and-set: values written by other transactions may be visible to the WHERE clause of UPDATE/DELETE, even if not otherwise visible in the snapshot.

3.3.5. Conflict resolution and replication

In plain English: In replicated databases where data can be modified on multiple servers simultaneously, preventing lost updates becomes even harder. You can't use locks or conditional writes because there's no single "main copy" of the data—each server has its own copy that can be modified independently.

In technical terms: In replicated databases with multi-leader or leaderless replication, concurrent writes can happen on different nodes and replicate asynchronously. Locks and conditional writes don't apply because there's no guarantee of a single up-to-date copy.

Alternative approaches:

CONFLICT RESOLUTION IN REPLICATION

🔄

Commutative Operations (CRDTs)

Operations that can be applied in any order

Incrementing a counter
Adding element to a set
Can be safely replicated
Same result regardless of order

🌿

Multi-version Conflict Resolution

Create conflicting versions (siblings)

Allow concurrent writes
Create multiple versions
Application merges after the fact
Complex but flexible

⚠️

Last Write Wins (LWW)

Default in many systems (problematic!)

Simple but loses updates
Based on timestamp
Prone to lost updates
Default in many replicated DBs

💡 Insight

Last Write Wins (LWW) is the default in many replicated databases, but it's prone to lost updates. If you're using multi-leader or leaderless replication, check whether your database uses LWW and consider whether that's acceptable for your use case.

3.4. Write Skew and Phantoms

In plain English: Write skew is a subtle race condition where two transactions read the same data, make decisions based on what they read, then each write to different places. Neither transaction sees the other's write, leading to a violated constraint. It's like two doctors both checking "are there 2 doctors on-call?" and both seeing "yes," then both going off-call, leaving zero doctors.

In technical terms: Write skew occurs when two transactions read the same objects, then update some of those objects (different transactions may update different objects). It's a generalization of the lost update problem. In the special case where transactions update the same object, you get a dirty write or lost update; when they update different objects, you get write skew.

3.4.1. Characterizing write skew

Classic example: Hospital on-call scheduling

WRITE SKEW: ON-CALL DOCTORS

Initial: Aaliyah and Bryce both on-call

↓

Aaliyah's Transaction

SELECT count from doctors
WHERE on_call = true

Result: 2

↓

Check: 2 ≥ 1? ✓

↓

UPDATE doctors
SET on_call = false
WHERE name = 'Aaliyah'

Bryce's Transaction

SELECT count from doctors
WHERE on_call = true

Result: 2

↓

Check: 2 ≥ 1? ✓

↓

UPDATE doctors
SET on_call = false
WHERE name = 'Bryce'

↓

Result: 0 doctors on-call (constraint violated!)

Why it happens:

Both transactions check the same condition (two doctors on-call)
Both see the database in the same state (snapshot isolation)
Both make different updates (to their own records)
Constraint requires ≥1 doctor on-call, but result is 0

Why it's subtle:

Not a dirty write (different objects updated)
Not a lost update (different objects updated)
Only obvious in hindsight: if transactions ran sequentially, the second doctor would be prevented from going off-call

How write skew generalizes lost updates:

WRITE-WRITE CONFLICT SPECTRUM

Lost Update

Both transactions update SAME object

→

Write Skew

Transactions read same data, update DIFFERENT objects

→

Phantom

Write changes result of search query

Options for preventing write skew are limited:

PREVENTION STRATEGIES

Does NOT Work

Works

Atomic operations

✗ Multiple objects involved

Automatic detection

✗ PostgreSQL, MySQL, Oracle, SQL Server

Constraints

✗ Most DBs don't support multi-object constraints

Serializable isolation

✓ True serializability prevents it

Explicit locks

✓ SELECT...FOR UPDATE on dependent rows

Using explicit locks to prevent write skew:

BEGIN TRANSACTION;

-- Lock all doctors on-call in this shift
SELECT * FROM doctors
  WHERE on_call = true
    AND shift_id = 1234
  FOR UPDATE;

-- Now safe to update
UPDATE doctors
  SET on_call = false
  WHERE name = 'Aaliyah'
    AND shift_id = 1234;

COMMIT;

3.4.2. More examples of write skew

WRITE SKEW SCENARIOS

🏢

Meeting Room Booking

Preventing double-booking

Check for overlapping bookings
If none found, create booking
Two transactions can both see no conflict
Both insert, creating double-booking

🎮

Multiplayer Game

Moving pieces on a board

Lock prevents moving same piece twice
Doesn't prevent moving to same position
Two players move different pieces to same spot
Violates game rules

👤

Username Registration

Claiming unique usernames

Check if username taken
If not, create account
Two users can both see name available
Solution: unique constraint

💰

Preventing Double-Spending

Account balance checks

Insert spending item
Check total balance is positive
Two concurrent spends can both pass check
Balance goes negative

3.4.3. Phantoms causing write skew

In plain English: A phantom is when one transaction's write changes the result of another transaction's search query. The searched-for data is like a ghost—it wasn't there when you looked, but appears (or disappears) when another transaction writes.

In technical terms: All write skew examples follow a similar pattern:

WRITE SKEW PATTERN

Step 1: SELECT

Check if requirement is satisfied (e.g., ≥2 doctors on-call, no room booking conflicts)

↓

Step 2: Decide

Based on query result, decide whether to proceed

↓

Step 3: Write

INSERT, UPDATE, or DELETE and commit

↓

Step 4: Precondition Changed

The write changes the precondition—if you repeated the SELECT, you'd get a different result

The phantom problem:

In the doctor example, we could lock the rows in step 1 (SELECT...FOR UPDATE) to prevent write skew. But in other examples (meeting room booking, username registration, double-spending), the query checks for the absence of rows. SELECT...FOR UPDATE can't lock something that doesn't exist!

Example: Meeting room booking

BEGIN TRANSACTION;

-- Check for conflicting bookings (returns zero rows if room is free)
SELECT COUNT(*) FROM bookings
  WHERE room_id = 123
    AND end_time   > '2025-01-01 12:00'
    AND start_time < '2025-01-01 13:00';

-- If zero, proceed to book
INSERT INTO bookings
  (room_id, start_time, end_time, user_id)
  VALUES (123, '2025-01-01 12:00', '2025-01-01 13:00', 666);

COMMIT;

The problem: SELECT returns zero rows, so FOR UPDATE has nothing to lock. Two concurrent transactions can both see zero conflicts and both insert bookings.

💡 Insight

Phantoms are particularly tricky because they involve data that doesn't exist yet. You can't lock something that isn't there. This is why even snapshot isolation, which prevents phantoms in read-only queries, struggles with phantoms in read-write transactions.

3.4.4. Materializing conflicts

In plain English: If the problem is "there's nothing to lock," create something to lock! For meeting rooms, create a table with rows for every room and time slot (e.g., 15-minute intervals for the next 6 months). Now you can lock the specific rows you need.

In technical terms: Materializing conflicts means taking a phantom and turning it into a lock conflict on a concrete set of rows that exist in the database.

MATERIALIZING CONFLICTS FOR MEETING ROOMS

Create Lock Table

Table with all room-time combinations (e.g., Room 123, 12:00-12:15, 12:15-12:30, etc.)

↓

Lock Relevant Rows

SELECT...FOR UPDATE on rows for desired room and time period

↓

Check Conflicts

After acquiring locks, check for overlapping bookings

↓

Insert Booking

Create booking if no conflicts

Problems with materializing conflicts:

Hard to figure out how to materialize them correctly
Error-prone to implement
Ugly because concurrency control leaks into application data model
Last resort when no alternative is possible

Better solution: Serializable isolation level (much preferable in most cases)

4. Serializability

In plain English: We've seen transactions are prone to many race conditions—some prevented by read committed and snapshot isolation, but others (like write skew and phantoms) are not. Wouldn't it be great if the database just prevented all race conditions automatically? That's serializability: the strongest isolation level.

In technical terms: Serializable isolation guarantees that even though transactions may execute in parallel, the end result is the same as if they had executed one at a time, serially, without any concurrency. The database prevents all possible race conditions.

The frustrating reality:

CHALLENGES WITH ISOLATION LEVELS

🤔

Hard to Understand

Inconsistently implemented across databases

'Repeatable read' varies significantly
Snapshot isolation vs. serializability
Marketing terms vs. actual guarantees

🧪

Hard to Test

Difficult to verify safety

Can't tell if code is safe at a given level
Large apps: unaware of all concurrent access
No good tools to detect race conditions

🎲

Nondeterministic

Problems only occur with unlucky timing

Testing doesn't reliably catch issues
Race conditions are rare in testing
Appear in production under load

Since the 1970s, researchers have had a simple answer: Use serializable isolation!

But if serializability is so much better, why isn't everyone using it?

To answer this, we need to examine the options for implementing serializability and how they perform. Three techniques:

SERIALIZABILITY IMPLEMENTATION APPROACHES

➡️

1. Actual Serial Execution

Literally execute transactions one at a time on a single thread

↓

🔒

2. Two-Phase Locking (2PL)

For decades the only viable option; readers block writers and vice versa

↓

✨

3. Serializable Snapshot Isolation (SSI)

Optimistic approach; detect conflicts and abort instead of blocking

4.1. Actual Serial Execution

In plain English: The simplest way to avoid concurrency problems is to eliminate concurrency entirely—just execute one transaction at a time. No parallelism, no race conditions. It sounds naive, but with modern hardware and careful design, it actually works!

In technical terms: Execute only one transaction at a time, in serial order, on a single thread. By removing concurrency entirely, the resulting isolation is by definition serializable.

Why this seemed impossible until the 2000s:

For 30 years, multi-threaded concurrency was considered essential for performance. What changed?

DEVELOPMENTS ENABLING SERIAL EXECUTION

💾

RAM Became Cheap

Keep entire active dataset in memory

No waiting for disk I/O
Transactions execute much faster
Single-threaded throughput sufficient

⚡

OLTP Are Short

Transaction characteristics recognized

OLTP: short, small number of reads/writes
Analytics: read-only, can use snapshots
Long-running queries outside serial loop

Systems using this approach: VoltDB/H-Store, Redis, Datomic

Performance: Can sometimes perform better than multi-threaded systems by avoiding coordination overhead of locking. However, throughput is limited to a single CPU core.

4.1.1. Encapsulating transactions in stored procedures

The traditional approach problem:

TRADITIONAL VS. STORED PROCEDURE TRANSACTIONS

Interactive Client/Server

Stored Procedures

Execution

Query → Result → Think → Query → ...

Submit entire transaction code upfront

Network

Many round-trips

One round-trip

Concurrency

Must process multiple transactions to hide network latency

Can execute serially without performance penalty

Speed

Slow due to network waits

Fast if all data in memory

Traditional interactive transaction:

Application makes query
Waits for network round-trip
Processes result
Makes another query based on result
Repeat...

Problem for serial execution: Database would spend most time waiting for network I/O. Throughput would be terrible.

Solution: Submit entire transaction code to database as a stored procedure.

INTERACTIVE VS STORED PROCEDURE EXECUTION

Interactive Transaction

Client: SELECT

network↓

DB: Result

network↓

Client: UPDATE

network↓

DB: OK

Multiple network round-trips

Stored Procedure

Client: Submit procedure

network↓

DB: Execute entirely

SELECT, UPDATE, etc.

network↓

DB: Final result

One network round-trip

4.1.2. Pros and cons of stored procedures

Stored procedures have a bad reputation:

STORED PROCEDURE PROBLEMS (TRADITIONAL)

🔧

Vendor-Specific Languages

Oracle: PL/SQL
SQL Server: T-SQL
PostgreSQL: PL/pgSQL
Archaic, ugly syntax
Lack modern language features

😰

Hard to Manage

Difficult to debug
Awkward version control
Tricky to test
Hard to integrate with monitoring

🔥

Performance Sensitive

Database shared by many app servers
Badly written procedure affects everyone
More trouble than app server code

🔒

Security Risk

In multitenant systems
Untrusted code in DB kernel
Security vulnerability

Modern implementations overcome these issues:

MODERN STORED PROCEDURE IMPLEMENTATIONS

VoltDB

Java or Groovy

Datomic

Java or Clojure

Redis

Lua scripting

MongoDB

JavaScript

Additional benefits:

GraphQL deployments: Validation logic can live in database stored procedures when proxy doesn't support complex logic
State machine replication: VoltDB executes same stored procedure on each replica instead of copying writes (requires deterministic execution)

4.1.3. Sharding

In plain English: A single CPU core can only process so many transactions per second. To scale beyond that, split your data across multiple CPU cores or machines, giving each its own independent transaction processing thread.

In technical terms: Single-threaded execution limits throughput to one CPU core's speed. To scale, shard your data so each transaction only reads/writes data within a single shard. Each shard has its own transaction processing thread running independently.

SHARDING FOR SERIAL EXECUTION

Single Shard Transactions

Shard A

CPU Core 1

Fast throughput

Single Shard Transactions

Shard B

CPU Core 2

Fast throughput

Cross-Shard Transaction

Shards A + B

Coordination needed

Much slower

Performance characteristics:

Transaction Type	Throughput	Scalability
Single-shard	Very high	Linear with CPU cores
Cross-shard	~1,000/sec (VoltDB)	Does not scale with machines

Why cross-shard is slow: Must coordinate across all touched shards in lock-step to ensure serializability. Requires additional coordination overhead.

Whether this works depends on data structure:

Simple key-value data: Easily sharded
Data with secondary indexes: Requires lots of cross-shard coordination

💡 Insight

The success of serial execution depends heavily on whether your workload can be partitioned to avoid cross-shard transactions. For some applications this works great; for others, cross-shard coordination becomes a bottleneck that negates the benefits.

4.1.4. Summary of serial execution

Serial execution of transactions is viable within certain constraints:

REQUIREMENTS FOR SERIAL EXECUTION

⚡

Fast Transactions

Every transaction must be small and fast

One slow transaction stalls everything
Use stored procedures
All data in memory

💾

Active Dataset in Memory

Working set must fit in RAM

Rarely accessed data can be on disk
But accessing it would be very slow
Limits dataset size

✍️

Low Write Throughput

Single CPU core must handle load

Or transactions must be shardable
Without cross-shard coordination
Cross-shard throughput hard to scale

4.2. Two-Phase Locking (2PL)

In plain English: Two-phase locking is like a strict library where readers and writers can never be in the same room at the same time. If someone is reading a book, nobody can write in it. If someone is writing in a book, nobody can read it. Everyone has to wait their turn. This prevents all race conditions but can make things very slow.

In technical terms: For around 30 years, two-phase locking (2PL) was the only widely used algorithm for serializability in databases. It makes lock requirements much stronger than simple dirty write prevention: readers block writers AND writers block readers.

Important note: 2PL is NOT 2PC!

2PL vs 2PC - DIFFERENT THINGS!

Two-Phase Locking (2PL)

Provides:

Serializable isolation

Purpose:

Prevent race conditions

Two-Phase Commit (2PC)

Provides:

Atomic commit

Purpose:

Distributed transactions

2PL locking rules:

TWO-PHASE LOCKING RULES

Transaction A Reads

Transaction B wants to write → B must wait for A to commit/abort

↓

Transaction A Writes

Transaction B wants to read → B must wait for A to commit/abort

Key difference from snapshot isolation:

Approach	Mantra
Snapshot isolation	Readers never block writers, writers never block readers
Two-phase locking	Readers block writers, writers block readers

Benefit: Protects against all race conditions (lost updates, write skew, phantoms, etc.)

4.2.1. Implementation of two-phase locking

Used by: MySQL (InnoDB) serializable level, SQL Server serializable level, Db2 repeatable read level

Lock mechanism:

SHARED VS EXCLUSIVE LOCKS

Shared Mode (Read Lock)

Exclusive Mode (Write Lock)

Who can hold

Multiple transactions simultaneously

Only one transaction

Acquired for

Reading an object

Writing an object

Blocks

Exclusive locks only

All other locks (shared and exclusive)

Upgrade

Can upgrade to exclusive

N/A

Locking protocol:

To read: Acquire shared lock (multiple transactions can hold simultaneously, but must wait if exclusive lock exists)
To write: Acquire exclusive lock (no other locks can exist; must wait for all existing locks)
Upgrade: Transaction can upgrade shared → exclusive lock
Hold until end: After acquiring locks, must hold until commit or abort

Where "two-phase" comes from:

TWO PHASES OF 2PL

Phase 1: Growing

While transaction executes

↓

Locks are acquired

→

Phase 2: Shrinking

At end of transaction

↓

All locks are released

Deadlocks:

With many locks in use, deadlocks happen easily:

Transaction A waiting for B's lock
Transaction B waiting for A's lock
Database auto-detects and aborts one transaction
Application must retry aborted transaction

4.2.2. Performance of two-phase locking

Why 2PL has poor performance:

2PL PERFORMANCE PROBLEMS

🔒

Lock Overhead

Acquiring and releasing many locks

CPU overhead
Memory for lock tracking
Coordination between threads

🚦

Reduced Concurrency

Transactions waiting for each other

Any potential race condition causes waiting
One waits for other to complete
Throughput limited

📈

Unstable Latencies

Unpredictable response times

Very slow at high percentiles
One slow transaction blocks many others
Cascading delays

💥

Frequent Deadlocks

Much more common than with weak isolation

Aborted transactions must retry
Work done all over again
Significant wasted effort

Worst-case scenario example:

A transaction reading an entire table (backup, analytics, integrity check):

Takes shared lock on entire table
Must wait for all in-progress writes to complete
While reading (potentially hours), all writes are blocked
Database becomes unavailable for writes

💡 Insight

2PL's fundamental problem is that it's pessimistic—if anything might possibly go wrong, wait. This conservative approach guarantees safety but sacrifices performance. In high-contention workloads, the database can grind to a halt.

4.2.3. Predicate locks

The phantom problem (revisited):

In write skew examples, one transaction changes the results of another's search query. To prevent this, we need to lock not just existing rows, but all objects that match some search condition.

In plain English: Instead of locking specific rows (like "Account #123"), lock a pattern (like "all accounts with balance < $100"). This prevents other transactions from inserting, updating, or deleting anything that matches your pattern.

In technical terms: A predicate lock belongs to all objects that match some search condition, rather than a particular object.

Example: Meeting room booking

SELECT * FROM bookings
  WHERE room_id = 123
    AND end_time   > '2025-01-01 12:00'
    AND start_time < '2025-01-01 13:00';

Predicate lock works as follows:

PREDICATE LOCK MECHANISM

Transaction A: Read with Condition

Acquires shared predicate lock on search condition

↓

Transaction B: Wants to Insert/Update

Must check if new/old value matches any predicate lock

↓

If Match Found

Transaction B must wait until A commits/aborts

Key property: Applies even to objects that don't yet exist in the database (phantoms)!

If two-phase locking includes predicate locks, the database prevents all forms of write skew and other race conditions → isolation becomes serializable.

4.2.4. Index-range locks

The problem with predicate locks: Performance is poor. Checking for matching locks becomes time-consuming with many active transactions.

The solution: Index-range locking (also called next-key locking)—a simplified approximation of predicate locking.

In plain English: Instead of locking exactly what you need, lock a slightly bigger range. It's safe because anything matching your exact condition will definitely be in the bigger range. It's like locking the entire bookshelf instead of just one book.

In technical terms: Safe to simplify a predicate by making it match a greater set of objects. Any write matching the original predicate will also match the approximation.

INDEX-RANGE LOCKING EXAMPLE

Original predicate: Room 123 between 12:00-13:00

↓

Approximation 1

Lock all bookings

for Room 123

at any time

Approximation 2

Lock all rooms

between

12:00-13:00

↓

Both approximations are safe—they include the original condition

How it works with indexes:

Scenario 1: Index on room_id

Database finds existing bookings for room 123 using index
Attaches shared lock to this index entry
Any transaction trying to insert/update room 123 booking must update same index entry
Encounters the shared lock and must wait

Scenario 2: Index on time

Database finds bookings overlapping 12:00-13:00 using time index
Attaches shared lock to range of values in index
Any transaction trying to insert/update overlapping booking must touch same index range
Encounters the lock and must wait

Fallback: If no suitable index exists, database can use shared lock on entire table (bad for performance but safe).

💡 Insight

Index-range locks aren't as precise as predicate locks (they may lock more than strictly necessary), but they have much lower overhead. This trade-off makes them practical for production systems, whereas pure predicate locks would be too slow.

4.3. Serializable Snapshot Isolation (SSI)

In plain English: SSI is like having everyone proceed optimistically (no blocking), but with a security camera checking if anything went wrong. If two people's actions would have conflicted, one gets caught by the camera and has to redo their work. Most of the time everyone proceeds smoothly; conflicts are rare enough that the occasional retry is worth the better performance.

In technical terms: Serializable Snapshot Isolation provides full serializability with only a small performance penalty compared to snapshot isolation. It's an optimistic concurrency control technique—instead of blocking when something might go wrong, transactions proceed anyway and are checked at commit time.

The breakthrough: SSI was first described in 2008—comparatively new! Prior to this, the choice was:

Serializable isolation with poor performance (2PL) or limited scale (serial execution)
Weak isolation with good performance but prone to race conditions

SSI bridges this gap.

Used by:

SSI IMPLEMENTATIONS

Single-Node Databases

PostgreSQL (serializable level)
SQL Server In-Memory OLTP/Hekaton
HyPer

Distributed Databases

CockroachDB
FoundationDB

Embedded Storage

BadgerDB

4.3.1. Pessimistic versus optimistic concurrency control

PESSIMISTIC VS OPTIMISTIC CONCURRENCY

Pessimistic (2PL, Serial Execution)

Optimistic (SSI)

Philosophy

If anything might go wrong, wait until safe

Continue anyway, check at commit time

Blocking

Transactions wait for locks

Transactions proceed without blocking

When conflict detected

Before operation

At commit (after work done)

High contention

Slow but no wasted work

Many aborts, wasted work

Spare capacity

Underutilizes resources

Better throughput

Serial execution is pessimistic to the extreme: equivalent to an exclusive lock on entire database for each transaction's duration. Compensate by making transactions very fast.

SSI optimistic approach: When transaction wants to commit, database checks whether anything bad happened (isolation violated). If so, abort and retry. Only serially-executed transactions allowed to commit.

Performance characteristics:

High contention: Optimistic performs badly (many aborts, wasted work from retries)
Spare capacity + low contention: Optimistic performs better than pessimistic
Commutative operations: Reduce contention (e.g., concurrent counter increments don't conflict if not read in same transaction)

💡 Insight

The key insight: SSI is based on snapshot isolation (all reads from consistent snapshot) + an algorithm for detecting serialization conflicts. It combines the performance benefits of snapshot isolation with the safety guarantees of serializability.

4.3.2. Decisions based on an outdated premise

The write skew pattern (revisited):

Recall from write skew: a transaction reads data, makes a decision based on what it saw, then writes. Under snapshot isolation, the data may have changed by commit time—the premise may no longer be true.

In plain English: You made a decision based on old information (like checking "are there 2 doctors on-call?" and seeing "yes"), but by the time you act on that decision (going off-call), the information is outdated (the other doctor already went off-call).

In technical terms: The transaction takes an action based on a premise (a fact true at transaction start). When committing, the premise may no longer be true because other transactions modified the data. To provide serializability, the database must detect when a transaction acted on an outdated premise and abort it.

The challenge: The database doesn't know how application logic uses query results. To be safe, it must assume any change in query results means the transaction's writes may be invalid—there may be a causal dependency between queries and writes.

Two cases to consider:

SSI DETECTION SCENARIOS

👁️

Stale MVCC Reads

Uncommitted write occurred before the read

Transaction reads from snapshot
Ignores uncommitted writes (MVCC rule)
By commit time, those writes have committed
Read was based on stale data

✍️

Writes Affecting Prior Reads

Write occurs after the read

Transaction reads data
Another transaction modifies that data
Original transaction's premise invalidated
Decision based on now-changed data

4.3.3. Detecting stale MVCC reads

The scenario:

DETECTING STALE MVCC READS

Transaction 42 Starts

Modifies Aaliyah's on-call status (uncommitted)

↓

Transaction 43 Starts

Reads Aaliyah's status, sees on_call=true (ignoring txn 42 per MVCC)

↓

Transaction 42 Commits

Aaliyah's status now committed as on_call=false

↓

Transaction 43 Wants to Commit

Database detects: ignored write from txn 42 has now committed

↓

Transaction 43 Aborted

Premise no longer valid—must retry

How it works:

Track ignored writes: Database remembers when a transaction ignores another's writes due to MVCC visibility rules
Check at commit: When transaction wants to commit, check if any ignored writes have now committed
Abort if stale: If so, the transaction must be aborted

Why wait until commit? Why not abort immediately when stale read detected?

Read-only transactions don't need to abort (no risk of write skew)
Database doesn't know yet if transaction will later write
Transaction 42 may still abort (read turns out not to be stale after all)
Avoiding unnecessary aborts preserves snapshot isolation's support for long-running reads

4.3.4. Detecting writes that affect prior reads

The scenario:

DETECTING WRITES AFFECTING PRIOR READS

Transaction 42

Read doctors WHERE shift_id=1234

↓

Sees 2 doctors on-call

↓

UPDATE: Set Aaliyah off-call

Transaction 43

Read doctors WHERE shift_id=1234

↓

Sees 2 doctors on-call

↓

UPDATE: Set Bryce off-call

↓

Both transactions notify each other that prior reads are affected

How it works (similar to index-range locks, but without blocking):

Record reads: When transaction searches (e.g., WHERE shift_id = 1234), use index entry to record which transactions read this data
Check on write: When a transaction writes, look in indexes for other transactions that recently read affected data
Notify conflicts: Instead of blocking, simply notify those transactions that their data may be outdated (acts as a "tripwire")
Check at commit: When notified transaction tries to commit, check if conflicting write has committed

Example from diagram:

Transaction 43 notifies Transaction 42: your read is outdated
Transaction 42 notifies Transaction 43: your read is outdated
Transaction 42 commits first: Successful (Transaction 43's write hasn't committed yet)
Transaction 43 tries to commit: Conflicting write from 42 already committed → must abort

Tracking granularity trade-off:

Detailed tracking: Precise about which transactions abort, but high bookkeeping overhead
Coarse tracking: Fast, but may abort more transactions than strictly necessary

4.3.5. Performance of serializable snapshot isolation

Engineering details matter: Many factors affect practical performance. Trade-off example: granularity of tracking reads/writes.

Compared to two-phase locking:

SSI VS 2PL

Two-Phase Locking

Serializable Snapshot Isolation

Blocking

Transactions wait for locks

No blocking (one doesn't wait for another)

Query latency

Unpredictable, high variance

More predictable, lower variance

Read-only queries

May need locks

Run on snapshot without any locks

Read-heavy workloads

Poor performance

Excellent performance

Compared to serial execution:

Not limited to single CPU core throughput
FoundationDB distributes conflict detection across machines
Can scale to very high throughput
Transactions can read/write data in multiple shards while ensuring serializability

Compared to non-serializable snapshot isolation:

Checking for violations introduces overhead
Debate about whether it's worth it:
- Some: "Serializability checking not worth the cost"
- Others: "Performance now so good, no need for weaker isolation"

Abort rate significantly affects performance:

Long-running read-write transactions likely to conflict and abort
SSI requires fairly short read-write transactions
Long-running read-only transactions are fine
Less sensitive to slow transactions than 2PL or serial execution

💡 Insight

SSI represents a major advancement in database research—it's the first practical algorithm offering true serializability with performance approaching snapshot isolation. The key innovation: detect conflicts instead of preventing them, allowing much more concurrency.

5. Distributed Transactions

In plain English: Everything we've discussed so far works whether your database is on one machine or spread across many. But when a single transaction needs to update data on multiple machines, ensuring atomicity becomes much harder. You can't just tell each machine "commit now" because some might succeed while others fail, leaving your data inconsistent across machines.

In technical terms: Concurrency control for isolation (the "I" in ACID) applies similarly to both single-node and distributed databases. Consistency and durability also don't change much. However, atomicity requires special care in distributed transactions.

5.1. The Atomic Commitment Problem

Single-node atomicity:

On a single database node, atomicity is implemented by the storage engine:

SINGLE-NODE ATOMIC COMMIT

Client Requests Commit

Application asks database to commit transaction

↓

Write Data to Disk

Database makes writes durable (write-ahead log)

↓

Write Commit Record

Append commit record to log on disk (the deciding moment!)

↓

Result Determined

If commit record written: committed. If crash before: aborted.

Key point: It's a single device (one disk controller, on one machine) that makes the commit atomic. Before the commit record is written, abort is still possible. After it's written, the transaction is committed (even if database crashes).

Multi-node problem:

What if multiple nodes are involved? Examples:

Multi-object transaction in sharded database
Global secondary index (index on different node from primary data)

Simply sending commit requests to all nodes doesn't work:

NAIVE MULTI-NODE COMMIT FAILS

Application sends "commit" to all nodes

↓

Possible Failures

Constraint violation on some nodes

Network loses some commit requests

Some nodes crash before commit

↓

Result: Transaction commits on some nodes, aborts on others!

Why this is catastrophic:

INCONSISTENT COMMIT PROBLEM

User 1: Transaction Partially Commits

Commits on DB 1, fails on DB 2

↓

User 2: Reads from DB 2

Reads data from successful commit on DB 2

↓

User 1: Discovers Failure

Learns commit failed on DB 1

↓

Impossible to Undo

Can't abort—User 2 already saw committed data on DB 2!

Once data is committed on one node, it becomes visible to other transactions (under read committed or stronger isolation). If it's later aborted on another node, transactions that read the "committed" data would have to be reverted too—impossible!

The solution: Ensure all nodes involved in a transaction either all commit or all abort. This is the atomic commitment problem.

5.2. Two-Phase Commit (2PC)

In plain English: Two-phase commit is like a wedding ceremony. The officiant asks both the bride and groom "Do you take this person?" individually. Only after both say "I do" does the officiant pronounce them married. If either says "no," the ceremony is aborted. Once the officiant makes the pronouncement, the marriage is final—no take-backs!

In technical terms: Two-phase commit is a classic algorithm for achieving atomic transaction commit across multiple nodes. Instead of a single commit request, the commit/abort process is split into two phases.

Used by:

Internal to some databases
XA transactions (Java Transaction API)
WS-AtomicTransaction (SOAP web services)

TWO-PHASE COMMIT FLOW

Phase 0: Application Writes

Application reads/writes data on multiple nodes (participants)

↓

Phase 1: Prepare

Coordinator asks each participant: 'Are you ready to commit?'

↓

Participants Vote

Each replies 'yes' (ready) or 'no' (abort)

↓

Phase 2: Commit or Abort

If all voted 'yes': coordinator sends commit. If any voted 'no': coordinator sends abort.

Key component: The Coordinator (also called transaction manager)

Often implemented as a library within application process
Can be separate process or service
Examples: Narayana, JOTM, BTM, MSDTC

5.2.1. A system of promises

Why does 2PC work when simple commit doesn't?

To understand, let's break down the process in detail:

DETAILED 2PC PROTOCOL

1. Request Transaction ID

Application requests globally unique transaction ID from coordinator

↓

2. Begin on Each Participant

App begins single-node transaction on each participant, attaching global txn ID

↓

3. Reads and Writes

All reads/writes done in single-node transactions. If anything fails, coordinator or participant can abort.

↓

4. Coordinator Sends Prepare

When app ready to commit, coordinator sends prepare to all participants

↓

🤝

5. Participants Make Promise

Each participant ensures it CAN commit (write to disk, check constraints, no conflicts)

↓

✋

6. Participants Vote

Reply 'yes' = promise to commit. Reply 'no' = will abort. By saying 'yes', participant surrenders right to abort!

↓

⚖️

7. Coordinator Decides

Write decision to transaction log on disk (commit point!)

↓

♾️

8. Coordinator Sends Decision

Send commit/abort to all participants. If this fails, retry FOREVER.

Two crucial "points of no return":

2PC PROMISES

🤝

Participant's Promise

When voting 'yes'

Promises it will definitely commit if asked
Surrenders right to abort
Must keep promise even after crash/recovery
Coordinator may still abort

⚖️

Coordinator's Decision

Once decision written to log

Decision is irrevocable
Must be enforced no matter what
Even if requires infinite retries
No going back

Marriage analogy:

Before "I do": Both parties can abort ("No way!")
After "I do": Cannot retract, even if you faint
Recovering from fainting: Ask officiant for status of your transaction ("Am I married?")
Officiant's retries: Continue pronouncing you married until you acknowledge

💡 Insight

The key insight: 2PC separates the "I promise I can commit" moment (prepare phase) from the "actually commit now" moment (commit phase). This separation, combined with durable logging of the coordinator's decision, makes atomic commitment across nodes possible.

5.2.2. Coordinator failure

What happens if participants or network fail?

Prepare request fails/times out → Coordinator aborts
Commit/abort request fails → Coordinator retries indefinitely

But what if the coordinator crashes?

COORDINATOR FAILURE SCENARIOS

Before Sending Prepare

Participant can safely abort (hasn't promised anything)

→

❓

After Participant Votes 'Yes'

Participant is IN DOUBT—cannot commit or abort unilaterally

The in-doubt (uncertain) state:

IN-DOUBT PARTICIPANT PROBLEM

Scenario: Coordinator decides to commit, crashes before telling all participants

↓

Database 2

Received commit

Transaction committed

Database 1

Didn't receive commit

Coordinator crashed

What should I do?

↓

Database 1 cannot decide!

• Can't commit: other participant may have aborted

• Can't abort: DB 2 has committed (would be inconsistent)

• Timeout doesn't help: still don't know what happened

The only solution: Wait for coordinator to recover.

Coordinator recovery process:

Coordinator must write commit/abort decision to log before sending to participants
When coordinator recovers, reads transaction log
Determines status of all in-doubt transactions from log
Transactions without commit record in log → abort
Transactions with commit record → send commit to participants

Why 2PC is called a "blocking" protocol: Can become stuck waiting for coordinator to recover.

💡 Insight

The in-doubt state is 2PC's Achilles' heel. Participants hold locks and resources while waiting for the coordinator. If the coordinator takes 20 minutes to recover, those locks are held for 20 minutes, blocking other transactions and potentially making large parts of your application unavailable.

5.2.3. Three-phase commit

Alternative to 2PC: Three-phase commit (3PC) is a nonblocking atomic commit protocol.

However: 3PC assumes:

Network with bounded delay
Nodes with bounded response times

In practice: Most real systems have:

Unbounded network delay
Process pauses (see Chapter 9)

Result: 3PC cannot guarantee atomicity in practical systems.

Better solution: Replace single-node coordinator with a fault-tolerant consensus protocol (see Chapter 10).

5.3. XA Transactions

In plain English: XA is a standard that allows completely different technologies (Oracle database, IBM message queue, PostgreSQL, etc.) to participate in the same two-phase commit transaction. It defines a common language for coordinators and participants to communicate, even if they're built by different vendors.

In technical terms: X/Open XA (eXtended Architecture) is a standard for implementing two-phase commit across heterogeneous technologies. Introduced in 1991, widely implemented by traditional relational databases and message brokers.

5.3.1. Distributed transactions across different systems

Two types of distributed transactions (often conflated):

TYPES OF DISTRIBUTED TRANSACTIONS

Database-Internal

Heterogeneous (XA)

Participants

All nodes run same database software

Two or more different technologies

Protocol

Can use any protocol, apply optimizations

Must use standard XA protocol

Performance

Can work quite well

More challenging

Examples

YugabyteDB, TiDB, FoundationDB, Spanner, VoltDB

PostgreSQL + ActiveMQ, Oracle + IBM MQ

Heterogeneous transaction example:

HETEROGENEOUS TRANSACTION USE CASE

Message Arrives

Message queue delivers message to application

↓

Process Message

Application processes message, writes to database

↓

Atomic Commit

Database write AND message acknowledgment in single transaction

↓

Result

Either both happen or neither happens (exactly-once semantics)

Exactly-once message processing:

If message delivery OR database write fails → both aborted
Message broker can safely redeliver later
By atomically committing message acknowledgment + database writes, message is processed exactly once
Partial completion is rolled back

Limitation: Only possible if all systems support same atomic commit protocol (e.g., XA).

Example problem without XA: If email server doesn't support 2PC, sending email can't be part of atomic transaction. Email might be sent twice if processing fails and is retried.

5.3.2. XA transactions

What is XA?

Not a network protocol—it's a C API for interfacing with transaction coordinator
Bindings exist for other languages (Java Transaction API/JTA in Java EE)

Supported by:

XA SUPPORT

Databases

PostgreSQL

MySQL

Db2

SQL Server

Oracle

Message Brokers

ActiveMQ

HornetQ

MSMQ

IBM MQ

How it works:

XA ARCHITECTURE

Application uses network driver/client library

↓

Driver supports XA → calls XA API

↓

Driver sends necessary info to database/message broker

↓

Driver exposes callbacks for coordinator (prepare, commit, abort)

Transaction coordinator implementation:

Often a library loaded into same process as application
Not a separate service
Keeps track of participants
Collects responses after prepare
Uses log on local disk to track commit/abort decisions

If application crashes:

Coordinator goes with it (same process)
Participants stuck in doubt
Must restart application server
Coordinator reads log to recover decisions
Uses driver callbacks to tell participants to commit/abort

Important limitation: Database cannot contact coordinator directly—all communication via client library.

5.3.3. Holding locks while in doubt

In plain English: When a participant is in doubt, it's like a person standing in a doorway—they're blocking the way but don't know whether to go forward (commit) or backward (abort). Meanwhile, everyone else who needs to use that doorway (access that data) has to wait.

In technical terms: Why is being stuck in doubt such a problem?

Lock requirements:

Transactions take row-level exclusive locks on modified rows (prevent dirty writes)
With serializable isolation, also take shared locks on read rows
Cannot release locks until commit or abort

LOCKS HELD WHILE IN DOUBT

Normal Transaction

Acquire locks → Do work → Commit/Abort → Release locks (seconds)

↓

In-Doubt Transaction

Acquire locks → Vote 'yes' → Coordinator crashes → Hold locks forever

↓

Impact

Other transactions blocked from reading or writing those rows

Example scenario:

2PC transaction modifies rows, votes "yes"
Coordinator crashes, takes 20 minutes to restart
Locks held for 20 minutes
Other transactions wanting to access those rows: blocked for 20 minutes
Depending on isolation level: may block even read-only queries

Result: Large parts of application become unavailable.

💡 Insight

This is why 2PC has such a bad reputation in practice. In theory it provides atomicity; in practice, coordinator failures can cause prolonged outages. Unlike other failure modes that affect only the failed component, 2PC coordinator failures can bring down large parts of your entire application.

5.3.4. Recovering from coordinator failure

In theory: Coordinator should cleanly recover from its log and resolve in-doubt transactions.

In practice: Orphaned in-doubt transactions occur—coordinator cannot decide outcome because:

Transaction log lost or corrupted
Software bug
Other issues

These transactions:

Cannot be resolved automatically
Sit forever in database
Hold locks indefinitely
Block other transactions perpetually

Rebooting doesn't fix it: 2PC must preserve locks of in-doubt transactions even across restarts (otherwise violates atomicity guarantee).

The only solution: Manual intervention

Administrator must:

Examine participants of each in-doubt transaction
Determine if any participant has committed or aborted
Manually apply same outcome to other participants

Challenges:

Requires manual effort
Done under stress during production outage
High time pressure
Complex and error-prone

Emergency escape hatch: Heuristic decisions

Participants can unilaterally decide to commit or abort
Without coordinator's definitive decision
"Heuristic" = euphemism for "probably breaking atomicity"
Violates 2PC's system of promises
Intended only for catastrophic situations, not regular use

💡 Insight

The combination of holding locks during in-doubt periods and potential for manual recovery makes XA transactions operationally risky. Many cloud services choose not to implement distributed transactions specifically to avoid these operational problems.

5.3.5. Problems with XA transactions

FUNDAMENTAL XA PROBLEMS

❌

Coordinator Single Point of Failure

For the entire system

Part of application server
Coordinator logs are crucial
As important as databases themselves
Hard to make highly available

🚫

No Direct Communication

Fundamental protocol limitation

Coordinator and participants can't talk directly
Must communicate via application code
Application becomes single point of failure
Even if coordinator is replicated

⬇️

Lowest Common Denominator

Must work with wide range of systems

Can't detect deadlocks across systems
Doesn't work with SSI
No advanced features
Limited to what all systems support

The application code problem:

Even if coordinator were highly available and replicated, application code would still be a single point of failure. Solving this would require totally redesigning how application code runs (similar to durable execution), but no practical tools take this approach.

Limitations across heterogeneous systems:

Deadlock detection: Would require standardized protocol for exchanging lock information
SSI: Would require protocol for identifying conflicts across systems
Both would need to be standardized and implemented by all participants

💡 Insight

These problems are somewhat inherent in performing transactions across heterogeneous technologies. However, keeping heterogeneous systems consistent is still a real problem. The solution isn't XA—we need different approaches (covered in next section and Chapter 12).

5.4. Database-Internal Distributed Transactions

In plain English: When a database is designed from the ground up to be distributed (like modern "NewSQL" databases), it can use much better protocols for distributed transactions than XA. Because all the nodes run the same software and can talk directly to each other, they can use optimized protocols and avoid XA's problems.

In technical terms: Database-internal distributed transactions are those where all participating nodes are shards of the same database running the same software. These avoid XA's limitations because they don't need to interface with other technologies.

Examples: CockroachDB, TiDB, Spanner, FoundationDB, YugabyteDB, Kafka

Key difference: Many use 2PC internally but don't suffer XA problems because they can optimize the protocol.

DATABASE-INTERNAL VS XA TRANSACTIONS

XA (Heterogeneous)

Database-Internal

Coordinator

Single-node, part of application

Replicated with automatic failover

Communication

Via application code only

Direct communication between nodes

Participant replication

Up to individual systems

All shards replicated

Concurrency control

Limited to lowest common denominator

Integrated deadlock detection, SSI support

Performance

Poor

Good

How database-internal distributed transactions fix XA problems:

IMPROVEMENTS OVER XA

🔄

Replicated Coordinator

Automatic failover

Coordinator replicated using consensus
Fails over automatically if primary crashes
No manual intervention needed
No prolonged unavailability

↔️

Direct Communication

Between coordinator and shards

Don't need to route via application
Application not single point of failure
More reliable communication
Better performance

📦

Replicated Shards

Fault tolerance

Each shard is replicated
Reduces risk of abort due to faults
Better availability
Faster recovery

🔗

Integrated Protocols

Advanced features

Deadlock detection across shards
Snapshot isolation across shards
SSI support
Optimized for single system

Consensus algorithms: Commonly used to replicate coordinator and database shards. We'll see in Chapter 10 how to implement atomic commitment using consensus (Chapter 10).

Isolation levels: Snapshot isolation and SSI are both possible across shards.

💡 Insight

The key lesson: Distributed transactions aren't inherently bad—XA transactions are bad. When the entire distributed database is designed as a cohesive system, distributed transactions can work well. This is a major achievement of NewSQL databases.

5.4.1. Exactly-once message processing revisited

Recall: An important use case for distributed transactions is ensuring operations take effect exactly once, even if crashes occur and processing needs retry.

With distributed transactions: Atomically commit across message broker and database—acknowledge message if and only if database writes succeeded.

Alternative approach (without distributed transactions):

Only requires transactions within the database:

EXACTLY-ONCE WITHOUT DISTRIBUTED TRANSACTIONS

1. Check Message ID

Begin DB transaction, check if message ID already in processed table

↓

2. If Already Processed

Drop message, acknowledge to broker (idempotent retry)

↓

3. If Not Processed

Add message ID to table, process message, commit DB transaction

↓

4. Acknowledge to Broker

After successful DB commit, acknowledge message

↓

5. Cleanup (Optional)

Delete message ID from table (separate transaction)

Crash scenarios:

Crash Point	What Happens	Result
Before DB commit	Transaction aborted, broker retries	Message reprocessed, ID check prevents duplicate
After DB commit, before ACK	Broker retries	Message ID in DB, dropped as duplicate
After ACK, before cleanup	Old message ID remains	Harmless (just takes storage space)
Network interruption	If before commit aborted	Uniqueness constraint prevents concurrent inserts

Key insight: Recording the message ID in the database makes processing idempotent—can safely retry without duplicating side effects.

Similar approach: Used by Kafka Streams for exactly-once semantics (Chapter 12).

Internal distributed transactions still useful: Allow message IDs on one shard, main data on other shards, with atomicity across shards.

💡 Insight

You don't always need distributed transactions across different systems to achieve exactly-once semantics. With idempotent operations and message IDs, you can achieve the same guarantee with only single-system transactions. However, distributed transactions within a database (across shards) are still valuable for scalability.

6. Summary

In plain English: Transactions are a lifesaver for application developers. Instead of manually handling crashes, network failures, concurrent users, and countless edge cases, transactions let you pretend these problems don't exist. If something goes wrong, just retry. This simplifies programming enormously.

In technical terms: Transactions are an abstraction layer that allows applications to pretend certain concurrency problems and hardware/software faults don't exist. A large class of errors reduces down to simple transaction abort, requiring only retry logic.

What we covered:

TRANSACTION CONCEPTS COVERED

📖

ACID Meaning

What the letters really mean

Atomicity = abortability
Consistency = application invariants
Isolation = concurrency control
Durability = no data loss

⚠️

Weak Isolation Levels

Common but dangerous

Read committed
Snapshot isolation
Still allow many race conditions
Naming confusion across databases

🏁

Race Conditions

Problems with weak isolation

Dirty reads/writes
Read skew
Lost updates
Write skew and phantoms

✅

Serializability

Strongest isolation

Serial execution
Two-phase locking (2PL)
Serializable snapshot isolation (SSI)
Prevents all race conditions

Isolation Levels Summary

We discussed several widely-used isolation levels and characterized them by race conditions they allow or prevent:

ISOLATION LEVELS AND ANOMALIES

Isolation Level	Dirty Reads	Read Skew	Phantom Reads	Lost Updates	Write Skew
Read Uncommitted	✗ Possible	✗ Possible	✗ Possible	✗ Possible	✗ Possible
Read Committed	✓ Prevented	✗ Possible	✗ Possible	✗ Possible	✗ Possible
Snapshot Isolation	✓ Prevented	✓ Prevented	✓ Prevented	? Depends	✗ Possible
Serializable	✓ Prevented	✓ Prevented	✓ Prevented	✓ Prevented	✓ Prevented

Anomaly definitions:

Dirty reads: Reading another client's uncommitted writes
Dirty writes: Overwriting data another client wrote but hasn't committed
Read skew: Seeing different parts of database at different points in time (nonrepeatable reads)
Lost updates: Two concurrent read-modify-write cycles; one overwrites other's write
Write skew: Transaction reads data, makes decision, writes—but premise no longer true at commit time
Phantom reads: Transaction reads objects matching search; another transaction's write affects search results

Serializability Implementation Summary

Only serializable isolation protects against all anomalies. We discussed three approaches:

IMPLEMENTING SERIALIZABILITY

➡️

Serial Execution

One transaction at a time

Use stored procedures
Fast if transactions short
Limited to single CPU core
Can shard if cross-shard rare

🔒

Two-Phase Locking (2PL)

Decades-old standard

Readers block writers, vice versa
Poor performance
Avoided by many applications
Unstable latencies

✨

Serializable Snapshot Isolation (SSI)

Modern optimistic approach

Detect conflicts, abort instead of blocking
Avoids most downsides of previous approaches
Good performance
Fairly new (2008)

Distributed Transactions Summary

We examined achieving atomicity when transactions span multiple nodes:

Two-phase commit (2PC):

Coordinator + participants
Prepare phase (voting)
Commit/abort phase (decision)

Database-internal distributed transactions (same database software on all nodes): Can work quite well using 2PC with optimizations.

XA transactions (across different storage technologies):

Very sensitive to coordinator and application code faults
Interacts poorly with concurrency control
Alternative: Idempotence can achieve exactly-once semantics without atomic commit across different systems

💡 Insight

Transactions are a valuable feature regardless of data model. While not every application needs transactions, they hugely reduce the number of error cases you need to think about. Without transactions, complex interacting accesses become very difficult to reason about, and data can easily become inconsistent.

Looking ahead: The examples in this chapter used relational data models, but as discussed in "The need for multi-object transactions," transactions are valuable for any data model.

Previous: Chapter 7 | Next: Chapter 9

Table of Contents​

1. Introduction​

1.1. What Can Go Wrong​

1.2. The Purpose of Transactions​

2. What Exactly Is a Transaction?​

2.1. The Evolution of Transactions​

2.2. The Meaning of ACID​

2.2.1. Atomicity​

2.2.2. Consistency​

2.2.3. Isolation​

2.2.4. Durability​

2.3. Single-Object and Multi-Object Operations​

2.3.1. Single-object writes​

2.3.2. The need for multi-object transactions​

2.3.3. Handling errors and aborts​

3. Weak Isolation Levels​

3.1. Read Committed​

3.1.1. No dirty reads​

3.1.2. No dirty writes​

3.1.3. Implementing read committed​

3.2. Snapshot Isolation and Repeatable Read​

3.2.1. Multi-version concurrency control (MVCC)​

3.2.2. Visibility rules for observing a consistent snapshot​

3.2.3. Indexes and snapshot isolation​

3.2.4. Snapshot isolation, repeatable read, and naming confusion​

3.3. Preventing Lost Updates​

3.3.1. Atomic write operations​

3.3.2. Explicit locking​

3.3.3. Automatically detecting lost updates​

3.3.4. Conditional writes (compare-and-set)​

3.3.5. Conflict resolution and replication​

3.4. Write Skew and Phantoms​

3.4.1. Characterizing write skew​

3.4.2. More examples of write skew​

3.4.3. Phantoms causing write skew​

3.4.4. Materializing conflicts​

4. Serializability​

4.1. Actual Serial Execution​

4.1.1. Encapsulating transactions in stored procedures​

4.1.2. Pros and cons of stored procedures​

4.1.3. Sharding​

4.1.4. Summary of serial execution​

4.2. Two-Phase Locking (2PL)​

4.2.1. Implementation of two-phase locking​

4.2.2. Performance of two-phase locking​

4.2.3. Predicate locks​

4.2.4. Index-range locks​

4.3. Serializable Snapshot Isolation (SSI)​

4.3.1. Pessimistic versus optimistic concurrency control​

4.3.2. Decisions based on an outdated premise​

4.3.3. Detecting stale MVCC reads​

4.3.4. Detecting writes that affect prior reads​

4.3.5. Performance of serializable snapshot isolation​

5. Distributed Transactions​

5.1. The Atomic Commitment Problem​

5.2. Two-Phase Commit (2PC)​

5.2.1. A system of promises​

5.2.2. Coordinator failure​

5.2.3. Three-phase commit​

5.3. XA Transactions​

5.3.1. Distributed transactions across different systems​

5.3.2. XA transactions​

5.3.3. Holding locks while in doubt​

5.3.4. Recovering from coordinator failure​

5.3.5. Problems with XA transactions​

5.4. Database-Internal Distributed Transactions​

5.4.1. Exactly-once message processing revisited​

6. Summary​

Isolation Levels Summary​

Serializability Implementation Summary​

Distributed Transactions Summary​

Table of Contents

1. Introduction

1.1. What Can Go Wrong

1.2. The Purpose of Transactions

2. What Exactly Is a Transaction?

2.1. The Evolution of Transactions

2.2. The Meaning of ACID

2.2.1. Atomicity

2.2.2. Consistency

2.2.3. Isolation

2.2.4. Durability

2.3. Single-Object and Multi-Object Operations

2.3.1. Single-object writes

2.3.2. The need for multi-object transactions

2.3.3. Handling errors and aborts

3. Weak Isolation Levels

3.1. Read Committed

3.1.1. No dirty reads

3.1.2. No dirty writes

3.1.3. Implementing read committed

3.2. Snapshot Isolation and Repeatable Read

3.2.1. Multi-version concurrency control (MVCC)

3.2.2. Visibility rules for observing a consistent snapshot

3.2.3. Indexes and snapshot isolation

3.2.4. Snapshot isolation, repeatable read, and naming confusion

3.3. Preventing Lost Updates

3.3.1. Atomic write operations

3.3.2. Explicit locking

3.3.3. Automatically detecting lost updates

3.3.4. Conditional writes (compare-and-set)

3.3.5. Conflict resolution and replication

3.4. Write Skew and Phantoms

3.4.1. Characterizing write skew

3.4.2. More examples of write skew

3.4.3. Phantoms causing write skew

3.4.4. Materializing conflicts

4. Serializability

4.1. Actual Serial Execution

4.1.1. Encapsulating transactions in stored procedures

4.1.2. Pros and cons of stored procedures

4.1.3. Sharding

4.1.4. Summary of serial execution

4.2. Two-Phase Locking (2PL)

4.2.1. Implementation of two-phase locking

4.2.2. Performance of two-phase locking

4.2.3. Predicate locks

4.2.4. Index-range locks

4.3. Serializable Snapshot Isolation (SSI)

4.3.1. Pessimistic versus optimistic concurrency control

4.3.2. Decisions based on an outdated premise

4.3.3. Detecting stale MVCC reads

4.3.4. Detecting writes that affect prior reads

4.3.5. Performance of serializable snapshot isolation

5. Distributed Transactions

5.1. The Atomic Commitment Problem

5.2. Two-Phase Commit (2PC)

5.2.1. A system of promises

5.2.2. Coordinator failure

5.2.3. Three-phase commit

5.3. XA Transactions

5.3.1. Distributed transactions across different systems

5.3.2. XA transactions

5.3.3. Holding locks while in doubt

5.3.4. Recovering from coordinator failure

5.3.5. Problems with XA transactions

5.4. Database-Internal Distributed Transactions

5.4.1. Exactly-once message processing revisited

6. Summary

Isolation Levels Summary

Serializability Implementation Summary

Distributed Transactions Summary