Chapter 13. A Philosophy of Streaming Systems
If a thing be ordained to another as to its end, its last end cannot consist in the preservation of its being. Hence a captain does not intend as a last end, the preservation of the ship entrusted to him, since a ship is ordained to something else as its end, viz. to navigation.
(Often quoted as: If the highest aim of a captain was to preserve his ship, he would keep it in port forever.)
St. Thomas Aquinas, Summa Theologica (1265–1274)
Table of Contents
- Introduction
- Data Integration
- Batch and Stream Processing
- Unbundling Databases
- Designing Applications Around Dataflow
- Observing Derived State
- Aiming for Correctness
- Enforcing Constraints
- Timeliness and Integrity
- Trust, But Verify
- Summary
1. Introduction
In plain English: Imagine trying to fix a ship while it's sailing across the ocean. You can't just keep it in port forever to make it perfect—you need to maintain and improve it while it's in motion. The same is true for data systems: you can't freeze your application to make it correct; you need to build systems that can evolve, recover from errors, and stay reliable while continuously serving users.
In technical terms: This chapter synthesizes the themes of reliability, scalability, and maintainability from throughout this book, focusing on building applications around streaming and event-driven architectures. It presents a philosophy of application development that maintains correctness without sacrificing performance or operational flexibility.
Why it matters: Traditional approaches like distributed transactions and strong consistency are expensive and brittle at scale. By understanding how to build systems around dataflow principles, you can create applications that are both more reliable and more scalable than those built with conventional approaches.
In Chapter 2, we discussed creating applications that are reliable, scalable, and maintainable. In this chapter, we bring all these ideas together, building on the streaming and event-driven architecture concepts from Chapter 12 to develop a comprehensive philosophy for application development.
💡 Insight
This chapter is more opinionated than previous ones. Rather than comparing multiple approaches, it presents a deep dive into one particular philosophy: using dataflow and stream processing as the foundation for building correct, scalable systems. This represents where the industry is heading, not just academic theory.
2. Data Integration
2.1. Combining Specialized Tools by Deriving Data
In plain English: Just as you wouldn't use a hammer for every job in construction, you shouldn't use a single database for every data need. An e-commerce site might use PostgreSQL for transactions, Elasticsearch for product search, Redis for caching, and Snowflake for analytics. Each tool excels at its specific job, but now you need to keep them synchronized.
In technical terms: Complex applications use data in multiple ways, and no single piece of software suits all usage patterns. You inevitably need to integrate several different storage systems—OLTP databases, full-text search indexes, caches, data warehouses, and machine learning systems—each serving different access patterns.
Why it matters: The real challenge isn't picking the right tool for one job—it's integrating multiple specialized tools so they work together coherently. Poor integration leads to inconsistent data, manual synchronization work, and systems that break in subtle ways.
All derived systems stay in sync through event streams
Common integration patterns include:
- OLTP + Full-text search: PostgreSQL with built-in full-text search works for simple cases, but sophisticated search requires specialized tools like Elasticsearch
- Database + Cache: Keeping cached views in sync with the database
- Transactional + Analytics: Moving data from OLTP systems to data warehouses
- Raw data + ML systems: Feeding data through feature engineering pipelines
2.2. Reasoning About Dataflows
In plain English: Think of your data systems like a factory assembly line. Raw materials (writes to your database) enter at one point, get processed through various stations (derived systems), and emerge as finished products (search results, analytics, recommendations). To keep everything working correctly, you need to know: Where does data enter? What transformations happen? In what order?
In technical terms: When maintaining copies of the same data in multiple storage systems, you must be explicit about inputs and outputs: where is data written first, and which representations are derived from which sources? Understanding the dataflow makes it possible to reason about consistency guarantees.
Why it matters: Without clear dataflow, you get race conditions where different systems see writes in different orders, leading to permanent inconsistencies. With clear dataflow through a single ordered source, you can make strong guarantees about system behavior.
With a single ordered source of writes, the search index consistently reflects the database. When clients write to both systems directly, a race condition means the two systems may apply the writes in different orders and diverge permanently.
Key principle: Funnel all user input through a single system that decides on an ordering for all writes. This is an application of state machine replication—whether you use change data capture or event sourcing is less important than simply deciding on a total order.
💡 Insight
Allowing applications to write directly to multiple systems introduces the problem shown in Figure 12-4 from Chapter 12: two clients can send conflicting writes, and the storage systems may process them in different orders. When you funnel all writes through a single ordered log, derived systems can process events in the same order, guaranteeing consistency.
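To make this concrete, here is a minimal sketch (plain Python, with in-memory data structures standing in for a real log and real stores) of two derived views consuming the same ordered log. Because both apply events in log order, they converge to the same state even when clients issue conflicting writes concurrently.

```python
# Minimal sketch, not production code: two derived stores consuming the same ordered log.
from dataclasses import dataclass

@dataclass
class WriteEvent:
    key: str
    value: str

class OrderedLog:
    """Single ordered source of truth: appends decide the total order of writes."""
    def __init__(self):
        self.events: list[WriteEvent] = []

    def append(self, event: WriteEvent) -> int:
        self.events.append(event)
        return len(self.events) - 1          # offset of the appended event

class DerivedStore:
    """A derived view (e.g. a cache or search index) that replays the log in order."""
    def __init__(self):
        self.state: dict[str, str] = {}
        self.offset = 0                       # how far into the log we have read

    def catch_up(self, log: OrderedLog) -> None:
        while self.offset < len(log.events):
            event = log.events[self.offset]
            self.state[event.key] = event.value   # deterministic apply, in log order
            self.offset += 1

log = OrderedLog()
cache, search_index = DerivedStore(), DerivedStore()

# Two clients write conflicting values; the log decides which one wins.
log.append(WriteEvent("user:42:name", "Alice"))
log.append(WriteEvent("user:42:name", "Alicia"))

cache.catch_up(log)
search_index.catch_up(log)
assert cache.state == search_index.state == {"user:42:name": "Alicia"}
```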
2.3. Derived Data Versus Distributed Transactions
In plain English: There are two ways to keep multiple systems in sync: the "ask permission" approach (distributed transactions) and the "process the log" approach (derived data). Distributed transactions are like getting everyone's signature before taking action—safe but slow and prone to deadlock. Derived data is like having a to-do list that each system works through at its own pace—eventually everyone gets in sync.
In technical terms: Both distributed transactions and derived data systems achieve similar goals through different means. Distributed transactions use locks for mutual exclusion and atomic commit for exactly-once semantics. Derived data systems use logs for ordering and deterministic retry with idempotence for the same guarantees.
Why it matters: Distributed transactions (especially XA) have poor fault tolerance and performance characteristics. Log-based derived data offers a more promising path to integrating different data systems, though it requires accepting asynchronous updates rather than immediate read-your-writes consistency.
Trade-offs:
- Immediate consistency: Distributed transactions guarantee you can immediately read what you wrote, while derived data systems update asynchronously
- Fault tolerance: XA transactions have poor fault tolerance, while log-based systems contain failures locally
- Industry adoption: XA is rarely used due to practical problems, while log-based derived data has become the standard
The challenge is building stronger guarantees (like reading your own writes) on top of asynchronously derived systems—a topic we'll explore later in this chapter.
2.4. The Limits of Total Ordering
In plain English: Imagine trying to keep a diary where every event in your life must have an exact timestamp and order. Easy when you're alone in one place, but what if you're collaborating with people on different continents, each with their own diary? Now reconciling who did what when becomes complex. The same problem occurs with distributed event logs.
In technical terms: Constructing a totally ordered event log works well for systems small enough to funnel everything through a single leader node. As systems scale, limitations emerge: sharding requires coordination between shards, geo-distribution creates ordering ambiguities between datacenters, microservices have no shared state, and client-side state means different devices see events in different orders.
Why it matters: Total ordering (equivalent to consensus) doesn't scale indefinitely. Understanding these limits helps you design systems that work correctly even when total ordering isn't feasible.
- Single-leader throughput: a single node has a throughput ceiling; sharding the log makes ordering across shards ambiguous; there is no general mechanism to parallelize the ordering work
- Geo-distribution: network latency between regions; separate leaders per datacenter; cross-region ordering is undefined
- Microservices: no shared durable state; events originate in different services; service autonomy is at odds with coordination
- Client-side state: updates are applied without waiting for server confirmation; devices work offline and sync later; different clients see events in different orders
💡 Insight
Deciding on a total order of events is formally known as total order broadcast, which is equivalent to consensus. Most consensus algorithms assume a single node can handle the entire event stream, and don't provide mechanisms for parallelizing the ordering work across multiple nodes.
2.5. Ordering Events to Capture Causality
In plain English: Imagine posting "I got the job!" on social media, then immediately removing your ex-partner as a friend before they see it. You expect the unfriend to happen before they see the post. But if these actions update different systems (social graph and notifications), the notification system might alert your ex before the unfriend takes effect—an awkward mistake!
In technical terms: When events originate in different services or shards, there's no defined order. This becomes problematic when there are causal dependencies between events. For example, one user removes another as a friend, then sends a message intended only for remaining friends. If the systems processing these events see them in different orders, the unfriend event might be processed after the message-send event.
Why it matters: Causal dependencies can lead to violations of user expectations and application invariants. Systems need ways to capture and preserve causality even when total ordering isn't feasible.
Approaches to handling causality:
- Logical timestamps: Provide total ordering without coordination (see Lamport clocks; a minimal sketch follows this list), but require recipients to handle out-of-order events and carry additional metadata
- Event identifiers: Log an event recording the system state the user saw before making a decision, then reference that event ID in later events to record causal dependencies
- Conflict resolution algorithms: Help with processing out-of-order events when maintaining state, but don't help with external side effects (like sending notifications)
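As an illustration of the logical-timestamps option, here is a minimal Lamport-clock sketch (in-memory Python; the node IDs and event names are invented for the example). The timestamps yield a total order consistent with causality, but recipients must still be prepared to see events arrive out of order.

```python
# Illustrative sketch of Lamport timestamps, the "logical timestamps" option above.
class LamportClock:
    def __init__(self, node_id: int):
        self.node_id = node_id
        self.counter = 0

    def local_event(self) -> tuple[int, int]:
        self.counter += 1
        return (self.counter, self.node_id)       # (counter, node_id) breaks ties

    def receive(self, remote_timestamp: tuple[int, int]) -> tuple[int, int]:
        # Advance our counter past anything we have seen, preserving causal order.
        self.counter = max(self.counter, remote_timestamp[0]) + 1
        return (self.counter, self.node_id)

social_graph, notifications = LamportClock(node_id=1), LamportClock(node_id=2)

unfriend = social_graph.local_event()       # user removes ex-partner as a friend
post = notifications.receive(unfriend)      # the notification causally follows the unfriend

assert unfriend < post                      # causal order is reflected in the timestamps
```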
💡 Insight
Causal dependencies often arise in subtle ways. The notification example is effectively a join between messages and the friend list, making it related to the time-dependence of joins discussed in Chapter 12. Future application development patterns may emerge that capture causal dependencies efficiently without forcing all events through total order broadcast.
3. Batch and Stream Processing
3.1. Maintaining Derived State
In plain English: Think of batch and stream processing like maintaining a restaurant's inventory and menu. The inventory (raw ingredients) is your source data. The menu items (prepared dishes) are derived data. Stream processing is like cooking dishes to order as customers arrive. Batch processing is like preparing ingredients in bulk every morning. Both keep your menu offerings in sync with available ingredients.
In technical terms: The goal of data integration is ensuring data reaches the right form in all the right places. Batch and stream processors achieve this by consuming inputs, transforming, joining, filtering, aggregating, training models, and writing to appropriate outputs. Their outputs are derived datasets: search indexes, materialized views, recommendations, and aggregate metrics.
Why it matters: Both batch and stream processing have functional programming principles—deterministic functions with well-defined inputs and outputs. This makes reasoning about dataflows straightforward and enables fault tolerance through recomputation.
Dataflow pattern: system of record (immutable inputs) → deterministic transformations → queryable derived outputs
Key principles:
- Deterministic functions: Output depends only on input, no side effects beyond explicit outputs
- Immutable inputs: Treat inputs as read-only
- Append-only outputs: New outputs don't modify previous results
- Asynchronous processing: Fault in one part doesn't spread to others (unlike distributed transactions)
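A tiny sketch of these principles in Python (the event shapes are invented for the example): a pure derivation function over an immutable event log produces the same output every time it runs, which is what makes recovery by recomputation possible.

```python
# A deterministic derivation function over an immutable input.
from collections import Counter

def derive_purchase_counts(events: tuple[dict, ...]) -> dict[str, int]:
    """Pure function: output depends only on the input events, no side effects."""
    return dict(Counter(e["user_id"] for e in events if e["type"] == "purchase"))

# The system of record: an immutable, append-only sequence of events.
event_log = (
    {"type": "purchase", "user_id": "alice"},
    {"type": "page_view", "user_id": "bob"},
    {"type": "purchase", "user_id": "alice"},
)

first_run = derive_purchase_counts(event_log)
rerun_after_crash = derive_purchase_counts(event_log)   # recovery by recomputation
assert first_run == rerun_after_crash == {"alice": 2}
```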
💡 Insight
Derived data systems could be maintained synchronously (like relational databases update secondary indexes within the same transaction), but asynchrony is what makes event log-based systems robust. It allows faults to be contained locally, whereas distributed transactions amplify failures by spreading them across the system.
3.2. Reprocessing Data for Application Evolution
In plain English: Imagine renovating a house while people live in it. You can't shut it down for months, so you build the new version alongside the old, gradually move people over, and keep the old version as a safety net. Reprocessing data enables the same approach for evolving data systems: build new derived views alongside old ones, test thoroughly, and switch over gradually with the ability to roll back.
In technical terms: Stream processing reflects input changes in derived views with low delay, while batch processing reprocesses large amounts of historical data to derive new views. Reprocessing enables significant schema evolution—not just adding fields, but completely restructuring datasets into different models.
Why it matters: Without reprocessing, you're limited to simple schema changes. With reprocessing, you can fundamentally restructure your data model as requirements change, all while maintaining a working production system.
Schema Migrations in the Real World:
The railway industry provides a useful analogy. In 19th-century England, competing track gauges (the distance between rails) prevented trains built for one gauge from running on tracks of another. After standardizing in 1846, existing non-standard tracks had to be converted.
The solution: convert tracks to dual gauge by adding a third rail. During the transition, trains of both gauges could operate. Eventually, once all trains converted to standard gauge, the redundant rail could be removed.
Applying this to data systems: derive the new view alongside the old one and run both during the transition (the dual-gauge phase), gradually shift readers from the old view to the new one, and remove the old view once nothing depends on it.
Benefits of gradual migration:
- Reversible at every stage: Always have a working system to fall back to
- Reduced risk: Test with small user percentages before full rollout
- Faster iteration: Confidence to make changes enables moving faster
- Production validation: Discover issues under real load
💡 Insight
The beauty of gradual migration is that every stage is easily reversible if something goes wrong. By reducing the risk of irreversible damage, you can be more confident about making changes, and thus move faster to improve your system. This is the same philosophy as feature flags and blue-green deployments.
3.3. Unifying Batch and Stream Processing
In plain English: Early attempts to combine batch and stream processing (like Lambda architecture) maintained separate batch and stream codebases that did the same computation differently—a maintenance nightmare. Modern systems (Kappa architecture) let you write processing logic once and run it over both historical batch data and live streaming data.
In technical terms: Unifying batch and stream processing in one system requires: (1) replaying historical events through the same processing engine that handles recent events, (2) exactly-once semantics even with faults, and (3) windowing by event time rather than processing time.
Why it matters: Maintaining separate batch and stream codebases doubles your implementation and testing burden. Unified systems let you use the same logic for both reprocessing historical data (batch) and processing live events (stream).
Requirements for unified batch/stream processing:
- Replay capability: Log-based message brokers can replay historical messages, and stream processors can read from distributed filesystems or object storage
- Exactly-once semantics: Ensure output is the same as if no faults occurred, requiring discarding partial output of failed tasks (like batch processing)
- Event-time windowing: When reprocessing historical data, processing time is meaningless—you need to window by the timestamp in the event itself
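To illustrate the event-time windowing requirement, here is a simple Python sketch (assuming tumbling one-minute windows and ignoring watermarks and late data). Because it only looks at each event's own timestamp, the same function gives the same answer whether the events arrive live or are replayed from history.

```python
# Event-time tumbling windows: counts depend only on event timestamps, not arrival order.
from collections import defaultdict

WINDOW_SECONDS = 60

def window_counts(events: list[dict]) -> dict[int, int]:
    """Count events per one-minute window, keyed by the window's start timestamp."""
    counts: dict[int, int] = defaultdict(int)
    for event in events:
        window_start = (event["event_time"] // WINDOW_SECONDS) * WINDOW_SECONDS
        counts[window_start] += 1
    return dict(counts)

# Events may arrive out of order or long after the fact; the derived result is the same.
live_stream = [{"event_time": 5}, {"event_time": 50}, {"event_time": 65}]
historical_replay = list(reversed(live_stream))
assert window_counts(live_stream) == window_counts(historical_replay) == {0: 2, 60: 1}
```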
Example frameworks:
- Apache Beam: API for expressing windowed computations over event time
- Apache Flink: Unified batch and stream processing with exactly-once guarantees
- Google Cloud Dataflow: Managed service based on Beam model
4. Unbundling Databases
In plain English: Traditional databases are like all-in-one kitchen appliances—they do many things, but none perfectly. Unbundling is like having specialized tools: a blender for smoothies, a food processor for chopping, a stand mixer for baking. Each excels at its job, and you can upgrade one without replacing everything. Similarly, unbundling databases means using specialized systems for storage, search, caching, and analytics, connected by event logs.
In technical terms: At an abstract level, databases, batch/stream processors, and operating systems all perform the same functions: store data and allow processing/querying. Databases and operating systems approached information management with different philosophies—Unix with low-level abstractions (files, pipes), relational databases with high-level abstractions (SQL, transactions). Unbundling reconciles these philosophies by composing specialized systems.
Why it matters: Rather than forcing all workloads into a single database, unbundling lets you use the best tool for each job while maintaining consistency through log-based integration. This provides both breadth (handling diverse workloads) and maintainability (evolving each component independently).
Integrated database: one system, many features. Unbundled systems: the best tool for each job, connected by event logs.
Historical context:
- Unix philosophy (1970s): Low-level abstractions (files, pipes) that are simple and composable
- Relational databases (1970s): High-level abstractions (SQL, transactions) that hide complexity
- NoSQL movement (2000s): Attempted to apply Unix-like simplicity to distributed storage
- Modern unbundling (2010s+): Compose specialized systems using event logs, getting the best of both approaches
4.1. Composing Data Storage Technologies
In plain English: Modern databases have many built-in features: secondary indexes for fast lookups, materialized views for precomputed queries, replication logs to keep replicas in sync, and full-text search for keyword queries. With batch and stream processing, you can build these same features as separate systems, connected by event logs.
In technical terms: Database features like secondary indexes, materialized views, replication logs, and full-text search have parallels in batch and stream processing. Creating an index in a database involves scanning a table snapshot, picking out field values, sorting them, writing the index, processing the write backlog, and continuously maintaining the index—remarkably similar to setting up a new stream processor or follower replica.
Why it matters: Recognizing the parallels between database internals and stream processing reveals that you can compose specialized systems to achieve the same functionality as an integrated database, but with more flexibility in choosing and evolving each component.
Example: Creating an Index
When you run CREATE INDEX, the database snapshots the table, extracts and sorts the indexed field values, writes out the index, processes the backlog of writes made since the snapshot, and then keeps the index up to date as writes continue to arrive.
This is the same as:
- Setting up a new follower replica (Chapter 6)
- Bootstrapping change data capture (Chapter 12)
- Stream processing with catch-up from log
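A rough Python sketch of that bootstrap pattern (insert-only, with invented record shapes, so updates and deletes are ignored): build the derived index from a consistent snapshot, then consume the change log starting from the offset that the snapshot corresponds to.

```python
# Snapshot + catch-up: the common shape of CREATE INDEX, new follower replicas,
# and bootstrapping a new stream consumer.
def bootstrap_index(snapshot: dict[str, dict], snapshot_offset: int,
                    change_log: list[dict]) -> dict[str, set[str]]:
    # 1. Build the index from a consistent snapshot of the table.
    index: dict[str, set[str]] = {}
    for row_id, row in snapshot.items():
        index.setdefault(row["email"], set()).add(row_id)

    # 2. Process the backlog of changes made since the snapshot, then keep consuming.
    for change in change_log[snapshot_offset:]:
        index.setdefault(change["row"]["email"], set()).add(change["row_id"])
    return index

snapshot = {"r1": {"email": "a@example.com"}}
log = [{"row_id": "r1", "row": {"email": "a@example.com"}},   # already in the snapshot
       {"row_id": "r2", "row": {"email": "b@example.com"}}]   # arrived after the snapshot
index = bootstrap_index(snapshot, snapshot_offset=1, change_log=log)
assert index == {"a@example.com": {"r1"}, "b@example.com": {"r2"}}
```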
The Meta-Database of Everything:
From this perspective, the dataflow across an entire organization looks like one huge database: every batch, stream, or ETL process that transports data from one place and form to another acts like a database subsystem keeping an index or materialized view up to date.
💡 Insight
Batch and stream processors are like elaborate implementations of database triggers, stored procedures, and materialized view maintenance. The derived systems they maintain are like different index types. Instead of implementing these as features of a single database, they're provided by various pieces of software, running on different machines, administered by different teams.
4.2. Federated Databases: Unifying Reads
In plain English: Imagine having files stored in Dropbox, Google Drive, and OneDrive, but using a single search interface to find documents across all of them. Federated databases work similarly—they provide a unified query interface to underlying storage engines while each storage engine remains independent.
In technical terms: Federated databases (also called polystores) provide a unified query interface to diverse underlying storage engines and processing methods. Applications needing specialized data models can still access storage engines directly, while users combining data from disparate sources can do so through the federated interface.
Why it matters: Federation addresses read-only queries across different systems, letting data teams query without understanding the internals of each storage system. It follows the relational tradition of elegant high-level interfaces, but with support for heterogeneous backends.
Examples:
- PostgreSQL Foreign Data Wrappers: Query external databases and files through SQL
- Trino (formerly Presto): Federated SQL queries across data lakes, databases, and cloud storage
- Hoptimator: Query optimizer for heterogeneous data sources
- Xorq: Federated query engine
Limitation: Federation solves reads but doesn't address keeping writes synchronized across these systems—that requires unbundling.
4.3. Unbundled Databases: Unifying Writes
In plain English: While federation lets you query across systems, it doesn't solve the harder problem: ensuring writes reach all systems in the right order. Unbundling is like having a to-do list (event log) that each system works through independently. Everyone processes the same events in the same order, so they all converge to the same state, even if they work at different speeds.
In technical terms: Within a single database, creating a consistent index is built-in. When composing several storage systems, you need to ensure all data changes reach all the right places, even with faults. Unbundling databases means reliably connecting storage systems through change data capture and event logs, similar to how databases internally maintain indexes.
Why it matters: Synchronizing writes is the hard engineering problem in data integration. Event logs with idempotent consumers provide a simpler, more robust solution than distributed transactions, especially across heterogeneous systems.
Approach:
Instead of distributed transactions across systems, unbundling uses:
- Ordered event log: Provides total order within each shard/partition
- Change data capture: Extracts changes from source systems into the log
- Idempotent consumers: Process events with at-least-once delivery, handling duplicates gracefully
- Deterministic processing: Same inputs always produce same outputs
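Here is a minimal sketch of the idempotent-consumer idea (plain Python, assuming each event carries a unique event_id; a real system would persist the set of applied IDs durably, alongside the derived state):

```python
# Idempotent consumer under at-least-once delivery: duplicates become no-ops.
class IdempotentConsumer:
    def __init__(self):
        self.state: dict[str, int] = {}
        self.applied_event_ids: set[str] = set()   # in practice, stored with the state

    def handle(self, event: dict) -> None:
        if event["event_id"] in self.applied_event_ids:
            return                                  # duplicate delivery: ignore
        account = event["account"]
        self.state[account] = self.state.get(account, 0) + event["amount"]
        self.applied_event_ids.add(event["event_id"])

consumer = IdempotentConsumer()
deposit = {"event_id": "evt-1", "account": "alice", "amount": 100}
consumer.handle(deposit)
consumer.handle(deposit)                            # redelivered after a retry
assert consumer.state["alice"] == 100               # applied exactly once
```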
💡 Insight
The unbundled approach follows the Unix tradition of small tools that do one thing well, communicate through a uniform low-level API (pipes → logs), and compose using a higher-level language (shell → stream processors). Federation follows the relational tradition of a single integrated system with high-level query semantics.
4.4. Making Unbundling Work
In plain English: Think of unbundling like modular furniture. Individual pieces are simple and robust, but you need good connectors (the event log) to make them work together. The advantage is that if one piece breaks, you can fix or replace it without affecting others. The trade-off is managing more moving parts.
In technical terms: Synchronizing writes requires coordination between heterogeneous systems. Traditional distributed transactions handle this but suffer from poor adoption and fault tolerance. An ordered event log with idempotent consumers provides a simpler abstraction that works across diverse technologies without standardized transaction protocols.
Why it matters: Log-based integration provides loose coupling that manifests at both system and human levels, making the overall system more robust and enabling teams to work independently.
System-level loose coupling:
- Asynchronous event streams buffer slowdowns
- Failures are contained locally, not amplified
- Consumers can catch up after recovery
- No distributed transaction overhead
Team-level loose coupling:
- Different teams develop independently
- Each component specializes in doing one thing well
- Well-defined interfaces between systems
- Components deploy and evolve at different paces
When to use integrated vs. unbundled systems:
Use an integrated system when:
- Single technology satisfies all requirements
- Simpler operations (fewer moving parts)
- Better predictable performance for target workload
- Team expertise in one system
Use unbundled systems when:
- No single technology fits all needs
- Need diverse workloads (OLTP + analytics + search + ML)
- Different teams with different specialized tools
- Requirement to evolve components independently
Tools enabling composition:
- Debezium: Extract change streams from many databases
- Kafka Connect: Standardized connectors for data sources/sinks
- Incremental view maintenance engines: Precompute and update complex query caches
- Exactly-once stream processors: Flink, Kafka Streams provide reliable event processing
💡 Insight
The goal of unbundling is not to compete with individual databases on performance for particular workloads—it's to combine several different databases to achieve good performance across a much wider range of workloads than possible with a single piece of software. It's about breadth, not depth.
5. Designing Applications Around Dataflow
5.1. Application Code as a Derivation Function
In plain English: Think of how spreadsheets work: you write a formula in one cell (like summing a column), and whenever the input cells change, the result automatically recalculates. You don't manually refresh it. Data systems should work the same way: when a record changes, indexes should automatically update, caches should refresh, and derived views should recalculate—all without manual intervention.
In technical terms: Creating derived datasets requires transformation functions. For secondary indexes, the function extracts and sorts specific field values. For full-text search, it applies NLP processing (language detection, stemming, etc.) and builds inverted indexes. For machine learning, it applies feature extraction and statistical analysis. For caches, it aggregates data in the format needed by the UI.
Why it matters: Most data systems today require manual work to keep derived data synchronized. By thinking in terms of derivation functions and dataflow, you can build systems that automatically maintain derived state—like spreadsheets, but at data system scale with durability and fault tolerance.
- Secondary index: extract field values from records, sort by those values, build a B-tree or SSTable; built into most databases
- Full-text search index: language detection, word segmentation and stemming, spelling correction, building an inverted index (a minimal sketch follows this list)
- Machine learning model: application-specific feature extraction and statistical analysis of training data; the model is derived from its inputs and then applied to new data
- Cache: aggregate data for UI display and precompute expensive queries; the cache must match the fields the UI references, so UI changes require cache updates
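As promised above, a minimal sketch of the full-text case (plain Python, skipping language detection, stemming, and spelling correction): a deterministic derivation function that turns documents into an inverted index.

```python
# A derivation function: documents in, inverted index out. Same input, same output.
def build_inverted_index(documents: dict[int, str]) -> dict[str, set[int]]:
    index: dict[str, set[int]] = {}
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

docs = {1: "streaming systems", 2: "batch and streaming"}
index = build_inverted_index(docs)
assert index["streaming"] == {1, 2}
assert index["batch"] == {2}
```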
Challenge with custom derivation functions:
While databases have built-in support for secondary indexes (CREATE INDEX), custom derivations require application code. Database features for running custom code (triggers, stored procedures, user-defined functions) have been afterthoughts and don't integrate well with modern development practices:
- Package and dependency management
- Version control and code review
- Rolling upgrades and deployment
- Monitoring, metrics, and observability
- Integration with external services
💡 Insight
When the derivation function is not a standard cookie-cutter operation like creating a secondary index, custom code is required to handle application-specific aspects. This is where many databases struggle—their execution environments don't fit well with modern application development requirements.
5.2. Separation of Application Code and State
In plain English: Imagine if every restaurant kept all its ingredients inside the chef's personal backpack. That would be absurd—ingredients belong in a centralized kitchen (database), while chefs (application servers) can be interchangeable. This is exactly how modern web applications work: stateless servers can handle any request, while state lives in a database.
In technical terms: Most web applications today deploy as stateless services where any request can route to any server, and servers forget everything after sending responses. State lives in databases. This separation keeps stateless application logic separate from state management—not putting application logic in the database and not putting persistent state in the application.
Why it matters: This architecture enables easy scaling (add or remove servers at will), but it treats databases as passive mutable variables. Unlike spreadsheets where cells automatically recalculate when dependencies change, you must manually poll databases to detect changes.
Any server can handle any request (good for scaling), but servers must poll the database to detect state changes (a passive model).
The passive data problem:
In most programming languages, you cannot subscribe to changes in a mutable variable—you can only read it periodically. Databases inherited this passive approach:
- To detect changes, you must poll (repeat queries periodically)
- No automatic notifications when data changes
- Observer pattern must be implemented manually
- Change subscriptions are only beginning to emerge
Joke from functional programming community:
"We believe in the separation of Church and state"
Explanation: Church refers to Alonzo Church, creator of lambda calculus (basis for functional programming), which has no mutable state. The joke plays on keeping mutable state separate from pure functional computation.
5.3. Dataflow: Interplay Between State Changes and Application Code
In plain English: Instead of treating your database as a passive storage bucket that you manually read and write, imagine it as a living system where state changes automatically trigger computations. When a user clicks "buy," that event automatically flows through inventory updates, payment processing, notification systems, and analytics—each component reacting to relevant state changes like a carefully choreographed dance.
In technical terms: Thinking in terms of dataflow means renegotiating the relationship between application code and state management. Rather than treating databases as passive variables manipulated by applications, focus on the interplay between state, state changes, and code that processes them. Application code responds to state changes in one place by triggering state changes in another place.
Why it matters: This pattern (already seen in change data capture, actor model, triggers, and incremental view maintenance) can be extended to create derived datasets outside the primary database—caches, search indexes, ML systems, analytics—using stream processing and messaging systems.
State changes flow through the system, triggering computations
Requirements for maintaining derived data:
Log-based message brokers can provide these essential properties:
- Ordering: When maintaining derived data, the order of state changes is often important. If several views derive from an event log, they need to process events in the same order to remain consistent
- Fault tolerance: Losing a single message causes the derived dataset to go permanently out of sync. Both message delivery and derived state updates must be reliable
- Deterministic processing: Same inputs should produce same outputs, enabling recovery through reprocessing
Benefits vs. distributed transactions:
- Stable message ordering and fault-tolerant processing are less expensive than distributed transactions
- More operationally robust (failures don't cascade)
- Can achieve comparable correctness guarantees
- Allows arbitrary application code as stream operators
💡 Insight
Like Unix tools chained by pipes, stream operators can be composed to build large systems around dataflow. Each operator takes streams of state changes as input and produces other streams of state changes as output. This composability is powerful for building complex systems from simple components.
5.4. Stream Processors and Services
In plain English: Traditional microservices make synchronous network calls—Service A asks Service B for data, waits for a response, then continues. This is like calling someone on the phone: if they don't answer, you're blocked. Dataflow systems use asynchronous messages—services subscribe to relevant data streams and maintain local copies. This is like getting email updates: you process them when convenient, and the sender doesn't wait for you.
In technical terms: Service-oriented architecture breaks functionality into services communicating via synchronous network requests (REST APIs). This provides organizational scalability through loose coupling. Dataflow systems use asynchronous message streams instead of synchronous request/response, composing stream operators similarly to microservices but with different communication mechanisms.
Why it matters: Dataflow systems can achieve better performance and fault tolerance than traditional REST APIs by eliminating synchronous network requests. The fastest and most reliable network request is no network request at all!
Example: Currency Conversion
A customer purchases an item priced in one currency, paid in another. You need the current exchange rate.
Synchronous service call: slow and brittle, because it depends on the exchange-rate service being available. Dataflow approach: fast and robust, with no dependency on a remote service at purchase time.
Benefits of the dataflow approach:
- Performance: Local database query instead of network request
- Reliability: No dependency on another service being available
- Consistency: Explicit handling of time-dependent joins (exchange rates change over time)
Trade-off:
The dataflow approach has replaced a synchronous network call with a local query. The microservices approach could also cache exchange rates locally and refresh periodically—which is essentially subscribing to a change stream!
Time dependency consideration:
If you reprocess purchase events later, exchange rates will have changed. To reconstruct original outputs, you need historical exchange rates at the time of each purchase. Both approaches must handle this time dependence, but the dataflow approach makes it explicit through stream joins.
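A small Python sketch of the dataflow version of this example (the event shapes are invented for illustration): the processor keeps a local table of the latest exchange rates, maintained from a rate stream, and enriches each purchase with a local lookup instead of a synchronous remote call.

```python
# Stream-table join: a rate stream maintains a local table; purchases join against it.
class CurrencyConverter:
    def __init__(self):
        self.rates: dict[str, float] = {}          # local materialized view of rates

    def on_rate_update(self, currency: str, rate_to_usd: float) -> None:
        self.rates[currency] = rate_to_usd

    def on_purchase(self, purchase: dict) -> dict:
        rate = self.rates[purchase["currency"]]    # local read: no network request
        return {**purchase, "amount_usd": round(purchase["amount"] * rate, 2)}

converter = CurrencyConverter()
converter.on_rate_update("EUR", 1.10)              # the rate stream keeps the table current
enriched = converter.on_purchase({"order_id": 1, "amount": 20.0, "currency": "EUR"})
assert enriched["amount_usd"] == 22.0
```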
💡 Insight
Subscribing to a stream of changes, rather than querying current state when needed, brings us closer to a spreadsheet-like model of computation: when some piece of data changes, any derived data that depends on it can swiftly be updated. This is a very promising direction for building responsive applications.
6. Observing Derived State
6.1. Materialized Views and Caching
In plain English: Imagine running a restaurant. You could cook every dish from scratch when ordered (slow but fresh), or pre-cook everything and just reheat (fast but wasteful if dishes go uneaten). Real restaurants do something in between: prep common ingredients in advance, and finish dishes when ordered. Data systems work the same way: indexes are like prepped ingredients, caches are like pre-cooked popular dishes.
In technical terms: The write path is the portion of the data journey that is precomputed eagerly as soon as data arrives. The read path is the portion that only happens when someone requests it. Derived datasets represent the trade-off between work done at write time versus read time—similar to eager vs. lazy evaluation in functional programming.
Why it matters: Understanding the write/read path boundary helps you optimize for different workloads. No single position for this boundary is optimal for all use cases—different applications need different trade-offs.
Write path: precomputation done eagerly before any query arrives. Derived dataset (index, cache, materialized view): where the write path and the read path meet. Read path: processing done on demand when a query arrives.
Full-text search example:
The index represents a middle ground between extremes:
| Approach | Write Path | Read Path | Trade-off |
|---|---|---|---|
| No index (grep) | None | Scan all documents | Fast writes, slow reads |
| Inverted index | Update index entries | Search index + Boolean logic | Balanced |
| Precompute all queries | Compute all possible results | Just lookup | Infeasible (exponential queries) |
| Cache common queries | Precompute popular queries | Check cache, fallback to index | Good for predictable workloads |
💡 Insight
Caches, indexes, and materialized views all serve the same role: they shift the boundary between read and write paths. They allow us to do more work on the write path, by precomputing results, in order to save effort on the read path. After 500 pages, we've come full circle to the social network timeline example from Chapter 1!
6.2. Stateful, Offline-Capable Clients
In plain English: Old-school web browsers were like dumb terminals—they could only work with an internet connection, and every action required asking the server. Modern web and mobile apps are more like smartphones—they can do useful work offline, store data locally, and sync with servers in the background. This shift dramatically improves user experience, especially on slow or unreliable networks.
In technical terms: Traditional web browsers were stateless clients requiring internet connections for any functionality. Modern single-page JavaScript web apps and mobile apps maintain significant stateful capabilities: client-side UI interaction, persistent local storage, and background synchronization. Users can work offline and sync with remote servers when network connections are available.
Why it matters: Maintaining state on end-user devices enables applications to work offline and provide responsive UIs that don't wait for network requests. This is especially valuable for mobile devices with slow or unreliable cellular connections.
Client state as a cache:
When you move away from stateless clients and central databases toward state maintained on end-user devices, new opportunities emerge:
- Screen pixels: Materialized view onto model objects in client app
- Model objects: Local replica of state in remote datacenter
- Sync mechanism: Background process keeps replicas consistent
This is the same principle as derived data systems, but extended all the way to the edge (user devices).
💡 Insight
By thinking of on-device state as a cache of server state, we can apply the same principles we've developed for maintaining derived data systems. The client becomes another consumer of the event stream, maintaining a local materialized view that updates as events arrive.
6.3. Pushing State Changes to Clients
In plain English: Traditional web pages are like reading a newspaper—you see what was printed this morning, and to see updates, you must buy the afternoon edition. Modern apps are like watching live TV—updates appear automatically as they happen. Instead of clients repeatedly asking "any updates yet?", the server pushes changes as they occur.
In technical terms: Traditional HTTP-based web pages load data once and don't learn about server-side changes until the page reloads. Users see a stale snapshot unless they explicitly poll. Modern protocols (server-sent events via EventSource API, WebSockets) allow servers to actively push messages to browsers over persistent TCP connections, keeping client-side state current.
Why it matters: Actively pushing state changes all the way to client devices extends the write path to end users. This reduces staleness, provides better user experience, and aligns with the dataflow model of reactive programming.
Polling: wasteful, since most polls find nothing new. Push: efficient, since updates arrive as they occur.
Technologies enabling push:
- Server-Sent Events (SSE): EventSource API for one-way server-to-client messages
- WebSockets: Full bidirectional communication channel
- HTTP/2 Server Push: Proactively send resources before client requests
Handling offline devices:
Devices are offline part of the time and can't receive notifications during those periods. But this problem is already solved: in "Consumer offsets" (Chapter 12), we discussed how log-based message brokers let consumers reconnect after disconnection and catch up on missed messages. The same technique works for individual user devices—each device is a subscriber to its own stream of relevant events.
6.4. End-to-End Event Streams
In plain English: Imagine a live sports scoreboard. When a goal is scored, the change flows instantly from the referee's device → league servers → TV networks → your screen. Modern web apps can work the same way: user actions on one device trigger state changes that flow through backend services and emerge as UI updates on other users' devices, all in under a second.
In technical terms: Modern UI frameworks like React and Elm already have the ability to update rendered interfaces in response to state changes. Extending this programming model to allow servers to push state-change events into the client-side event pipeline creates end-to-end event streams: from user interaction on one device, through event logs and stream processors, to UI updates on other devices.
Why it matters: This architecture provides low-latency state propagation (under one second end-to-end) and enables real-time collaborative applications. Some apps (instant messaging, online games) already work this way—why not build all applications like this?
Under 1 second from action to UI update across devices
The challenge:
Request/response interactions are deeply ingrained in databases, libraries, frameworks, and protocols:
- Most datastores support read/write operations returning a single response
- Few provide ability to subscribe to changes (request returning a stream of responses)
- Client frameworks assume stateless request/response model
- HTTP is designed around single request → single response
The opportunity:
Moving toward publish/subscribe dataflow would require rethinking many systems, but would provide significant benefits:
- More responsive user interfaces
- Better offline support
- Real-time collaboration features
- Reduced server load (push instead of polling)
💡 Insight
To extend the write path all the way to end users, we need to fundamentally rethink how we build many systems: moving away from request/response interaction toward publish/subscribe dataflow. This requires effort, but would make user interfaces more responsive and provide better offline support.
6.5. Reads Are Events Too
In plain English: What if every time someone looked at a product on your e-commerce site, that "read" was treated as an event, just like "write" events (purchases, updates)? You could track what users saw before they made decisions, reconstruct exactly what information led to purchases, and analyze how UI changes affect user behavior. This event log of reads enables powerful analytics and debugging.
In technical terms: So far, we've treated writes as events going through event logs while reads are transient network requests directly querying storage. But reads can also be represented as events and sent through stream processors. When both writes and reads are events routed to the same operator, you're performing a stream-table join between the stream of read queries and the database.
Why it matters: Recording read events enables tracking causal dependencies across systems, reconstructing what users saw before making decisions, and powerful debugging capabilities (time-travel debugging). It also enables distributed multi-shard query processing.
Both reads and writes flow through same event processing infrastructure
Applications:
- Causal dependency tracking: Reconstruct what the user saw before making decisions
- E-commerce example: Record the predicted shipping date and inventory status shown to the customer, then analyze how these affect purchase decisions
- Multi-shard query processing: Distributed queries requiring data from multiple shards can be expressed as read events, with stream processors joining data from different shards
- Fraud prevention: Assess risk by joining a purchase event with reputation scores from multiple sharded datasets (IP address, email, billing address, etc.)
Trade-offs:
- Additional storage and I/O: Logging reads incurs overhead
- Optimization research: Reducing overhead is an open research problem
- Operational logging: Many systems already log reads for operations; making the log the source of requests isn't a huge change
Correspondence to joins:
When a read request passes through the stream processor, it's a one-off join that's immediately forgotten. A subscribe request is a persistent join with past and future events on the other side of the join. This correspondence between serving requests and performing joins is fundamental to understanding dataflow systems.
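A minimal sketch of the idea (plain Python, a single shard, with responses collected in a list rather than published to an output stream): writes and read requests flow through the same log, so each read is answered from exactly the state produced by the writes that precede it.

```python
# Reads as events: a one-off stream-table join inside the shard's processor.
def process_shard(log: list[dict]) -> list[dict]:
    table: dict[str, dict] = {}
    responses = []
    for event in log:
        if event["type"] == "write":
            table[event["key"]] = event["value"]
        elif event["type"] == "read":               # join the read against current state
            responses.append({"request_id": event["request_id"],
                              "result": table.get(event["key"])})
    return responses

shard_log = [
    {"type": "write", "key": "item:7", "value": {"stock": 3}},
    {"type": "read", "request_id": "q1", "key": "item:7"},
    {"type": "write", "key": "item:7", "value": {"stock": 2}},
]
assert process_shard(shard_log) == [{"request_id": "q1", "result": {"stock": 3}}]
```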
💡 Insight
This idea opens the possibility of distributed execution of complex queries across multiple shards, taking advantage of the infrastructure for message routing, sharding, and joining already provided by stream processors. Storm's distributed RPC feature and various fraud prevention systems use this pattern successfully.
7. Aiming for Correctness
7.1. The End-to-End Argument for Databases
In plain English: Just because you use a high-security vault doesn't mean your valuables are safe if you leave the door open. Similarly, just because you use a database with serializable transactions doesn't mean your application is free from data corruption if your application code has bugs that write incorrect data. You need end-to-end guarantees, not just individual component guarantees.
In technical terms: Using a data system with strong safety properties (like serializable transactions) doesn't guarantee the application is free from data loss or corruption. If application bugs write incorrect data or delete data incorrectly, serializable transactions won't save you. Correctness requires end-to-end measures throughout the entire system.
Why it matters: Low-level reliability mechanisms (like serializability) are necessary but not sufficient. You still need application-level fault tolerance mechanisms, and these are hard to get right—most application-level mechanisms don't work correctly.
Why this is challenging:
- Stateful systems are harder: With stateless services, bugs can be fixed by restarting. With databases that remember forever, mistakes last forever
- Weak transaction properties: Weak isolation levels (see Chapter 8) have confusing semantics
- Scalability pressures: Many systems abandon transactions for better performance/scalability but get messier semantics
- Configuration complexity: Hard to determine if it's safe to run at particular isolation level or replication configuration
Real-world problems:
- Jepsen experiments: Revealed stark discrepancies between claimed safety guarantees and actual behavior during network problems and crashes
- Application misuse: Even when infrastructure is correct, applications often misuse database features (wrong isolation level, incorrect quorum configuration)
- Testing challenges: Simple solutions appear correct under low concurrency without faults, but have subtle bugs under demanding circumstances
7.2. Exactly-Once Execution of an Operation
In plain English: Imagine paying for coffee, but your card reader freezes. Did the payment go through? If you retry, you might get charged twice. If you don't retry, you might not pay at all. The system needs to guarantee that whether you press "pay" once or ten times due to glitches, you're only charged once—this is exactly-once execution.
In technical terms: When message processing fails, you can either give up (data loss) or retry. Retrying risks processing the same message twice if the first attempt actually succeeded but you didn't receive confirmation. Exactly-once (or effectively-once) semantics ensure the final effect is the same as if no faults occurred, even if operations were retried.
Why it matters: Processing messages twice is a form of data corruption. Charging customers twice, incrementing counters twice, or sending duplicate notifications all violate correctness. Making operations naturally idempotent or implementing duplicate suppression is essential.
Approaches:
- Natural idempotence: Some operations are inherently idempotent (setting a value, deleting by ID)
- Maintaining metadata: Track operation IDs that have updated values, ensuring duplicates are recognized
- Fencing tokens: Ensure proper ordering when failing over between nodes (a minimal sketch follows this list)
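As referenced above, here is a minimal fencing-token sketch (in-memory Python; a real lock service and storage system would issue and persist the tokens): the store rejects any write carrying a token older than the highest it has seen, so a paused or superseded node cannot overwrite its successor's work.

```python
# Fencing tokens: monotonically increasing tokens let storage reject stale writers.
class FencedStore:
    def __init__(self):
        self.value = None
        self.highest_token_seen = -1

    def write(self, value: str, token: int) -> bool:
        if token < self.highest_token_seen:
            return False                      # stale writer: reject the write
        self.highest_token_seen = token
        self.value = value
        return True

store = FencedStore()
assert store.write("from new leader", token=34) is True
assert store.write("from old, paused leader", token=33) is False   # fenced off
assert store.value == "from new leader"
```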
Challenges:
- Requires careful design and implementation
- Must maintain additional metadata
- Needs to handle failover scenarios correctly
- Application-specific logic often needed
💡 Insight
One of the most effective approaches for exactly-once semantics is making operations idempotent—ensuring they have the same effect whether executed once or multiple times. However, making a non-naturally-idempotent operation idempotent requires effort: maintaining operation ID metadata and proper fencing during failover.
7.3. Duplicate Suppression
In plain English: TCP automatically removes duplicate packets on a single connection, but what happens when your database connection drops and reconnects? TCP's duplicate detection doesn't span connections. Similarly, what if the user's browser times out and they retry? Now you have duplicate requests at the application level that neither TCP nor the database can detect.
In technical terms: Duplicate suppression occurs at many levels (TCP packet sequence numbers, transaction IDs, application request IDs), but each level has limited scope. TCP duplicate suppression only works within a single connection. Database transactions are tied to client connections. When connections fail and reconnect, you're outside the scope of these duplicate suppression mechanisms.
Why it matters: Even with database transactions, you can't prevent duplicate execution without application-level duplicate detection. The standard bank transfer example (debit one account, credit another) is actually not correct without additional safeguards.
Example: Non-Idempotent Money Transfer
BEGIN TRANSACTION;
UPDATE accounts SET balance = balance + 11.00 WHERE account_id = 1234;
UPDATE accounts SET balance = balance - 11.00 WHERE account_id = 4321;
COMMIT;
The problem: if the connection fails after the transaction commits but before the client receives confirmation, the client cannot tell whether the commit happened. If it retries, and the first attempt did in fact commit, the transfer executes twice. Result: $22 transferred instead of $11. Real banks don't work like this!
Even 2PC doesn't fully solve this:
Two-phase commit protocols allow transaction coordinators to reconnect after network faults to complete in-doubt transactions. But this still doesn't prevent duplicates from:
- End-user device retries (browser "Submit form again?" warning)
- Network timeouts between client and application server
- Application server retries to database
💡 Insight
Multiple layers of duplicate suppression exist (TCP, database transactions, stream processors), but none by itself provides end-to-end guarantees. Each layer helps reduce the probability of duplicates reaching higher levels, but the application must implement end-to-end duplicate suppression to guarantee correctness.
7.4. The End-to-End Argument
In plain English: It's like having locked doors throughout a building (good!) but still needing to secure the front entrance. Inner doors help, but only the front door provides end-to-end security. Similarly, TCP prevents duplicate packets, databases prevent duplicate transactions within a connection, but you still need application-level duplicate prevention for true end-to-end correctness.
In technical terms: The end-to-end argument, articulated by Saltzer, Reed, and Clark in 1984, states: "The function in question can completely and correctly be implemented only with the knowledge and help of the application standing at the endpoints of the communication system." Lower-level mechanisms help but aren't sufficient by themselves.
Why it matters: You cannot rely solely on infrastructure-level features for correctness. Applications must implement end-to-end measures like duplicate suppression, checksums, and encryption to ensure correctness across the entire system path.
Each layer provides partial guarantees; only end-to-end measures ensure correctness
Examples of end-to-end argument:
- Duplicate suppression: TCP suppresses duplicate packets, databases deduplicate within transactions, but only end-to-end request IDs guarantee no duplicates from the user
- Data integrity: Ethernet, TCP, and TLS have checksums detecting network corruption, but not corruption from software bugs or disk errors. Only end-to-end checksums catch all sources of corruption
- Encryption: A WiFi password protects against WiFi snooping, TLS/SSL protects against network attackers, but only end-to-end encryption protects against compromised servers
Making application code reliable:
Just because an application uses serializable transactions doesn't guarantee freedom from data loss or corruption. The application itself needs end-to-end measures like duplicate suppression.
The challenge:
- Fault-tolerance mechanisms are hard to implement correctly
- Low-level reliability (like TCP) works well, so high-level faults are rare (but still occur)
- We haven't found the right abstraction to wrap high-level fault tolerance
- Transactions help but aren't sufficient by themselves
Uniquely Identifying Requests:
To make requests idempotent through several network hops, you need end-to-end request identifiers:
-- Generate unique ID (UUID) and include in all request processing
ALTER TABLE requests ADD UNIQUE (request_id);
BEGIN TRANSACTION;
INSERT INTO requests
(request_id, from_account, to_account, amount)
VALUES('0286FDB8-D7E1-423F-B40B-792B3608036C', 4321, 1234, 11.00);
UPDATE accounts SET balance = balance + 11.00 WHERE account_id = 1234;
UPDATE accounts SET balance = balance - 11.00 WHERE account_id = 4321;
COMMIT;
How it works:
- Generate a UUID in the client application (e.g., a hidden form field or a hash of the relevant form fields)
- Pass the request ID through the entire system to the database
- The uniqueness constraint on request_id prevents duplicate execution
- If a duplicate is attempted, the INSERT fails and the transaction aborts
- The requests table doubles as an event log, suitable for event sourcing or change data capture (a client-side sketch follows below)
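A client-side sketch of this scheme, using Python's sqlite3 module purely for illustration (the table names and amounts follow the example above): the request ID is generated once on the client, and a blind retry after an apparent timeout becomes a harmless uniqueness violation rather than a second transfer.

```python
# End-to-end request ID: the UNIQUE constraint makes a retried request a no-op.
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, balance REAL);
    CREATE TABLE requests (request_id TEXT UNIQUE, from_account INTEGER,
                           to_account INTEGER, amount REAL);
    INSERT INTO accounts VALUES (1234, 0.0), (4321, 100.0);
""")

def transfer(request_id: str, from_acct: int, to_acct: int, amount: float) -> None:
    try:
        with db:                                    # one transaction per attempt
            db.execute("INSERT INTO requests VALUES (?, ?, ?, ?)",
                       (request_id, from_acct, to_acct, amount))
            db.execute("UPDATE accounts SET balance = balance + ? WHERE account_id = ?",
                       (amount, to_acct))
            db.execute("UPDATE accounts SET balance = balance - ? WHERE account_id = ?",
                       (amount, from_acct))
    except sqlite3.IntegrityError:
        pass                                        # duplicate request: already applied

request_id = str(uuid.uuid4())                      # generated once, on the client
transfer(request_id, 4321, 1234, 11.00)
transfer(request_id, 4321, 1234, 11.00)             # retry after an apparent timeout
assert db.execute("SELECT balance FROM accounts WHERE account_id = 1234").fetchone()[0] == 11.00
```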
💡 Insight
Low-level reliability features (TCP duplicate suppression, Ethernet checksums, WiFi encryption) cannot provide end-to-end features by themselves, but they're still useful since they reduce the probability of problems at higher levels. We just need to remember that low-level reliability is not sufficient to ensure end-to-end correctness.
8. Enforcing Constraints
8.1. Uniqueness Constraints Require Consensus
In plain English: Imagine two people trying to book the last seat on a flight at the same time. Someone needs to decide who got there first—you need consensus. The same applies to usernames, email addresses, account balances, or any resource where uniqueness matters. Without consensus, two conflicting operations might both succeed, violating the constraint.
In technical terms: In a distributed setting, enforcing uniqueness constraints requires consensus: when several concurrent requests have the same value, the system must decide which operation is accepted and reject others as violations. This is typically achieved by making a single node the leader that decides on all requests for a particular value.
Why it matters: Uniqueness constraints are fundamental to many applications (user registration, booking systems, inventory management), but they require coordination that can become a bottleneck. Understanding how to scale uniqueness checking is essential for building high-performance systems.
Without consensus: both might succeed!
Approaches:
- Single leader: Make one node responsible for all decisions for a value
  - Works fine if throughput is acceptable
  - Funnels all requests through one node
  - Subject to leader election if the leader fails
- Consensus algorithms: Raft and Paxos handle leader election safely
  - Prevent split-brain scenarios
  - Recover from leader failures
  - Maintain consistency guarantees
- Sharding by value: Scale out by partitioning
  - Route requests by hash of the value (username, request ID, etc.)
  - Each shard handles its own subset independently
  - Increases throughput while maintaining per-shard uniqueness
What doesn't work:
- Asynchronous multi-leader replication: Different leaders might concurrently accept conflicting writes, violating uniqueness
- Eventually consistent systems: Can't immediately reject violations without coordination
Other similar constraints:
The same techniques apply to:
- Account balances never going negative
- Not selling more items than warehouse stock
- Preventing overlapping meeting room bookings
- Ensuring sequential ID generation
💡 Insight
If you want to immediately reject writes that would violate a constraint, synchronous coordination is unavoidable. However, as we'll see later, many applications can actually tolerate temporary constraint violations and fix them up later with compensating transactions.
8.2. Uniqueness in Log-Based Messaging
In plain English: Think of a log as a line of people at a ticket counter—everyone stands in order, and the ticket agent serves them one by one. The agent (stream processor) can definitively say who asked for tickets first and who gets them. Even if there's a dispute later about who was first, the written log record provides the indisputable truth.
In technical terms: A shared log ensures all consumers see messages in the same order—a guarantee known as total order broadcast, which is formally equivalent to consensus. A stream processor consuming a log shard sequentially on a single thread can unambiguously and deterministically decide which of several conflicting operations came first.
Why it matters: Log-based messaging provides the same consensus guarantees as traditional approaches but in a way that scales easily with throughput by increasing the number of shards, as each shard can be processed independently.
Example: Username Registration
Algorithm (a minimal processor sketch follows this list):
- Every username request is encoded as a message and appended to a shard determined by hash(username)
- A stream processor sequentially reads the requests in that log shard, using a local database to track which usernames are taken
- For every request for an available username, it records the name as taken and emits a success message to an output stream
- For every request for a taken username, it emits a rejection message to the output stream
- The client requesting the username watches the output stream and waits for success or rejection
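A minimal Python sketch of the per-shard processor, assuming the log has already been sharded by hash(username); the in-memory set stands in for the processor's local database, and the message shapes are illustrative rather than taken from the text.

```python
from dataclasses import dataclass

@dataclass
class UsernameRequest:
    request_id: str
    username: str

def process_shard(log, claimed: set):
    """Sequentially consume one log shard; log order decides who was first.
    `claimed` stands in for the processor's local database of taken names."""
    for req in log:
        if req.username in claimed:
            yield (req.request_id, "rejected")   # emitted to the output stream
        else:
            claimed.add(req.username)
            yield (req.request_id, "granted")

# Two clients race for the same name; the log order decides the winner.
shard_log = [UsernameRequest("r1", "alice"), UsernameRequest("r2", "alice")]
print(list(process_shard(shard_log, set())))
# [('r1', 'granted'), ('r2', 'rejected')]
```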
Advantages:
- Same as consensus: This algorithm is the same as achieving consensus using a shared log (Chapter 10)
- Scales with throughput: Increase number of shards to handle more requests
- Independent processing: Each shard processed independently
- General applicability: Works for many constraint types, not just uniqueness
Conflict definition:
The fundamental principle: any writes that may conflict are routed to the same shard and processed sequentially. The stream processor can use arbitrary logic to validate requests—the definition of a conflict depends on the application.
💡 Insight
This approach works not only for uniqueness constraints but also for many other kinds of constraints. The key principle is routing potentially conflicting writes to the same shard for sequential processing. The stream processor can use arbitrary application-specific logic to validate each request.
8.3. Multi-Shard Request Processing
In plain English: Imagine transferring money from your checking account to your savings account, while the bank also deducts a fee. That's three accounts potentially on three different servers. Traditional databases would lock all three accounts during the transaction. Log-based systems can achieve the same correctness without locking, by carefully routing events through shards in the right order.
In technical terms: Executing operations atomically across multiple shards while satisfying constraints is challenging. Traditional databases use atomic commit (2PC) across shards, forcing total ordering with respect to all other transactions. However, equivalent correctness can be achieved without cross-shard transactions using sharded logs and stream processors.
Why it matters: Cross-shard coordination limits throughput since shards can no longer be processed independently. Log-based approaches can achieve the same correctness without traditional distributed transactions, enabling better performance and fault tolerance.
All updates deduplicated by request ID, achieving exactly-once semantics
How it works (a simplified processor sketch follows this list):
- Request to log: A payment request with a unique ID is appended to a log shard chosen by the source account ID
- Process at source: A stream processor reads the log, maintaining the source account's state and the set of processed request IDs. When it encounters a new request:
  - Check whether the source account has sufficient balance
  - If yes, reserve the payment amount locally
  - Emit events to three log shards:
    - Outgoing payment → source account shard (its own input)
    - Incoming payment → destination account shard
    - Incoming payment → fee account shard
  - Include the original request ID in all emitted events
- Complete at source: When the outgoing payment event comes back to the source processor:
  - Recognize it by its request ID
  - Execute the payment (deduct the reserved amount)
  - Ignore duplicates based on the request ID
- Process at destination/fee: Independent processors for the destination and fee accounts:
  - Receive the incoming payment events
  - Update local state
  - Deduplicate based on the request ID
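The following Python sketch, under simplifying assumptions, shows the core of such a processor: it collapses the reserve/complete dance into a single debit, omits the actual log plumbing, and uses illustrative class and field names rather than an API from the text. The point it demonstrates is that dedup by end-to-end request ID makes at-least-once redelivery harmless.

```python
class AccountShardProcessor:
    """Sketch of the single-threaded processor for one log shard of accounts.
    Deduplicating by request ID makes at-least-once redelivery harmless."""

    def __init__(self, balances):
        self.balances = dict(balances)  # local state: account -> balance
        self.applied = set()            # dedup keys, checkpointed with the state
        self.outbox = []                # events to append to the other shards' logs

    def on_payment_request(self, req):
        """The payment request arrives on the source account's shard."""
        key = ("debit", req["request_id"])
        if key in self.applied:
            return  # duplicate redelivery after a crash: ignore
        total = req["amount"] + req["fee"]
        if self.balances[req["source"]] >= total:
            self.balances[req["source"]] -= total
            # Emit credits for the destination and fee accounts, carrying the
            # same end-to-end request ID so downstream dedup works too.
            self.outbox.append({"request_id": req["request_id"],
                                "account": req["dest"], "credit": req["amount"]})
            self.outbox.append({"request_id": req["request_id"],
                                "account": req["fee_account"], "credit": req["fee"]})
        self.applied.add(key)

    def on_incoming_credit(self, event):
        """A credit arrives on the destination or fee account's shard."""
        key = ("credit", event["request_id"], event["account"])
        if key in self.applied:
            return  # duplicate redelivery: ignore
        self.balances[event["account"]] += event["credit"]
        self.applied.add(key)

src = AccountShardProcessor({4321: 100.0})
payment = {"request_id": "r1", "source": 4321, "dest": 1234,
           "fee_account": 999, "amount": 11.0, "fee": 1.0}
src.on_payment_request(payment)
src.on_payment_request(payment)  # redelivered duplicate
print(src.balances[4321])        # 88.0 -- debited exactly once
```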
Key properties:
- Three separate shards: Accounts can be in same or different shards—doesn't matter
- Strict log order: Events for any given account processed in log order
- At-least-once semantics: Tolerate crashes with reprocessing
- Deterministic processing: Same inputs always produce same decisions
Fault tolerance:
If source account processor crashes while processing payment:
- Output messages may or may not have been emitted
- After recovery, reprocesses same request (at-least-once)
- Makes same decision (deterministic)
- Emits same messages with same request ID
- Downstream consumers ignore duplicates via request ID
Atomicity:
Atomicity comes not from a transaction protocol but from two facts:
- Writing the initial request event to the source account's log is a single atomic append
- Once that one event is in the log, all downstream events will eventually be derived from it, possibly after crashes and with duplicate deliveries, which the request IDs suppress
💡 Insight
By breaking down the multi-shard transaction into several differently sharded stages and using end-to-end request IDs, we achieve the same correctness property (every request applied exactly once to all accounts) even with faults, and without using an atomic commit protocol. This is more scalable and robust than traditional 2PC.
9. Timeliness and Integrity
In plain English: There are two types of "being wrong" in data systems. Type 1: "Your Amazon order will arrive in 3 days" but the page takes 30 seconds to load, so you don't see that estimate for half a minute. That's annoying but temporary. Type 2: Your bank shows your account balance as $1,000 but a bug summed transactions incorrectly—you actually have $1,500. That's permanent corruption. The first is a timeliness problem, the second is an integrity problem.
In technical terms: The term consistency conflates two different requirements: (1) Timeliness means users observe the system in an up-to-date state—weak or eventual consistency is a timeliness issue. (2) Integrity means absence of corruption—no data loss, no contradictory or false data. ACID transactions provide both, but dataflow systems can provide strong integrity guarantees without timeliness.
Why it matters: Most applications can tolerate delays in seeing updates (timeliness violations) much better than data corruption (integrity violations). Understanding this distinction lets you build systems with strong integrity guarantees even when immediate consistency isn't feasible.
Example: Credit Card Statement
- Timeliness problem: a transaction from the last 24 hours doesn't appear on the statement yet
  - ✓ Expected: systems have lag ✓ Will appear eventually ✓ Annoying but not catastrophic
- Integrity problem: the statement balance doesn't match the sum of the transactions, or money disappears
  - ✗ Corruption of system integrity ✗ Won't fix itself by waiting ✗ Catastrophic for trust and legal reasons
💡 Insight
In slogan form: violations of timeliness are "eventual consistency," whereas violations of integrity are "perpetual inconsistency." In most applications, integrity is much more important than timeliness. Violations of timeliness can be annoying and confusing, but violations of integrity can be catastrophic.
9.1. Correctness of Dataflow Systems
In plain English: Traditional databases are like package deals—they give you both immediate updates (timeliness) and guaranteed correctness (integrity) bundled together. Dataflow systems unbundle these: you get strong integrity guarantees without paying for immediate timeliness. Users might need to wait a moment to see updates, but when they do see them, they're guaranteed to be correct.
In technical terms: ACID transactions usually provide both timeliness (linearizability) and integrity (atomic commit). Event-based dataflow systems decouple these: they don't guarantee timeliness unless you explicitly build consumers that wait for messages, but they can maintain strong integrity through exactly-once semantics and deterministic processing.
Why it matters: By decoupling timeliness and integrity, dataflow systems can achieve comparable correctness to ACID transactions with much better performance and operational robustness.
How dataflow systems maintain integrity (a toy deterministic derivation is sketched after this list):
- Exactly-once/effectively-once semantics: prevents data loss and duplication, preserving integrity
- Deterministic derivation functions: the same inputs always produce the same outputs
- Client-generated request IDs: enable end-to-end duplicate suppression and idempotence
- Immutable messages: make it easier to recover from bugs by reprocessing
- Single atomic write: each write operation is represented as a single message that can be appended atomically (event sourcing fits this model well)
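As a toy illustration of what "deterministic derivation" means in code, here is a pure fold over an event log; the event shape is invented for the example.

```python
from functools import reduce

def apply_event(balances: dict, event: dict) -> dict:
    """One pure derivation step: no randomness, no clocks, no external calls."""
    updated = dict(balances)
    updated[event["account"]] = updated.get(event["account"], 0) + event["delta"]
    return updated

def derive_balances(events) -> dict:
    # Rerunning this over the same immutable log always yields the same state,
    # which is what makes reprocessing after a bug fix (or during an audit) safe.
    return reduce(apply_event, events, {})

events = [{"account": 1234, "delta": +11}, {"account": 4321, "delta": -11}]
print(derive_balances(events))  # {1234: 11, 4321: -11}
```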
Integrity without distributed transactions:
Dataflow systems achieve integrity through:
- Fault-tolerant message delivery: Reliable log-based messaging
- Duplicate suppression: Idempotent operations with request IDs
- Deterministic processing: Pure functions with no side effects
- Reprocessable data: Can rerun derivations from scratch
These mechanisms are less expensive and more operationally robust than distributed transactions.
💡 Insight
An interesting property of event-based dataflow systems is that they decouple timeliness and integrity. When processing event streams asynchronously, there's no guarantee of timeliness unless you explicitly wait for messages. However, integrity is central—exactly-once semantics is a mechanism for preserving integrity even without timeliness guarantees.
9.2. Loosely Interpreted Constraints
In plain English: Real businesses often break their own rules temporarily and fix them later. Airlines overbook flights expecting some no-shows. Warehouses oversell stock and apologize with discounts. Banks allow overdrafts and charge fees. Strict enforcement of constraints ("never overbook!") is often more restrictive than necessary. Many businesses prefer optimistically allowing violations and having processes to handle apologies and compensation.
In technical terms: Many real applications have business requirements to allow violations of what seem like hard constraints. Enforcing uniqueness strictly requires consensus/coordination, but many business contexts find it acceptable to temporarily violate constraints and apply compensating transactions later (apologizing, refunding, providing alternatives).
Why it matters: If compensating transactions are acceptable for your business, you can avoid the coordination bottleneck of strict constraint enforcement. Write operations can proceed optimistically, with validation happening after the fact but before expensive-to-undo actions.
Examples of loosely enforced constraints:
- Warehouse overselling:
  - Customers order more items than are in stock
  - Order more stock from the supplier
  - Apologize for the delay and offer a discount
  - Same recovery process as a forklift accident destroying stock
- Airline overbooking:
  - Deliberately violate "one person per seat"
  - Expect some passengers to no-show
  - Compensate when demand exceeds supply
  - Refunds, upgrades, alternative flights
- Bank overdrafts:
  - Allow withdrawing more than the account balance
  - Charge an overdraft fee
  - Request that the customer pay it back
  - Limit daily withdrawals to bound the risk
- General pattern (e.g., payment settlement between banks):
  - Inconsistencies inevitably arise
  - Correction mechanisms are necessary
  - Reconciliation is a normal part of business
Compensating transactions:
A compensating transaction is a change that corrects a previous mistake. Examples:
- Unsending email: Can't undo, but can send follow-up correction
- Double charge: Can refund one charge (cost is just processing fees and customer complaint)
- ATM cash: Can't reclaim directly, but can send debt collectors for overdraft
- Flight oversold: Offer refund, upgrade, or hotel for bumped passengers
Cost of apologies:
Whether the cost of apologies is acceptable is a business decision. If acceptable:
- No need to check all constraints before writing data
- Can proceed optimistically with writes
- Validate constraints after the fact
- Ensure validation happens before expensive-to-recover actions
Integrity vs. Timeliness:
These applications require integrity (don't lose reservations, don't lose money due to mismatched credits/debits), but they don't require timeliness on constraint enforcement. If you've sold more items than available, you can fix it later by apologizing.
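A minimal sketch of this optimistic pattern, with invented event shapes: orders are appended to the log immediately, and a later validation step either confirms them or emits a compensating "apology" event before anything expensive to undo (such as shipping) happens.

```python
def place_order(order, log):
    """Accept the write optimistically; the constraint is checked later."""
    log.append({"type": "order_placed", **order})

def validate_stock(log, stock):
    """Asynchronous check: runs after the fact, but before shipping."""
    for event in [e for e in log if e["type"] == "order_placed"]:
        if stock.get(event["item"], 0) >= event["qty"]:
            stock[event["item"]] -= event["qty"]
            log.append({"type": "order_confirmed", "order_id": event["order_id"]})
        else:
            # Compensating action: apologize, refund, offer a discount.
            log.append({"type": "order_apologized", "order_id": event["order_id"],
                        "reason": "out_of_stock"})

log, stock = [], {"widget": 1}
place_order({"order_id": "o1", "item": "widget", "qty": 1}, log)
place_order({"order_id": "o2", "item": "widget", "qty": 1}, log)  # oversold
validate_stock(log, stock)
print([e["type"] for e in log])
# ['order_placed', 'order_placed', 'order_confirmed', 'order_apologized']
```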
💡 Insight
In many business contexts, it's acceptable to temporarily violate a constraint and fix it up later by apologizing. This is similar to the conflict resolution approaches for multi-leader replication. The traditional model of checking all constraints before writing data is unnecessarily restrictive for many real-world applications.
9.3. Coordination-Avoiding Data Systems
In plain English: Imagine a large company where every decision requires a meeting with all departments. Progress would be glacial, and if one department is unavailable, everything stops. Now imagine departments work mostly independently, synchronizing only when truly necessary. That's the difference between coordination-heavy systems (distributed transactions) and coordination-avoiding systems (dataflow with loose constraints).
In technical terms: Two key observations: (1) Dataflow systems can maintain integrity guarantees on derived data without atomic commit, linearizability, or synchronous cross-shard coordination, and (2) many applications are fine with loose constraints that may be temporarily violated and fixed later, as long as integrity is preserved. Together, these enable coordination-avoiding data systems.
Why it matters: Systems that avoid coordination can achieve better performance and fault tolerance than those requiring synchronous coordination, while still maintaining strong integrity guarantees. This is especially valuable for geographically distributed deployments.
Example architecture:
A coordination-avoiding system can:
- Operate across multiple datacenters: Each datacenter in multi-leader configuration
- Asynchronously replicate between regions: No synchronous cross-region coordination required
- Continue during regional outages: Any datacenter can operate independently
- Weak timeliness guarantees: Not linearizable without coordination
- Strong integrity guarantees: Through deterministic processing and idempotence
Where coordination is still needed:
- Strict constraints: When immediate rejection of violations is required
- Unrecoverable operations: Before actions from which recovery is impossible
- Small scope: Use serializable transactions where they work well
- Specific features: Only parts of the application that truly need it
💡 Insight
Coordination-avoiding data systems have a lot of appeal: they can achieve better performance and fault tolerance than systems requiring synchronous coordination. Serializable transactions are still useful for maintaining derived state, but can be run at a small scope where they work well. Heterogeneous distributed transactions (XA) are not required.
Trade-off perspective:
Coordination and constraints reduce the number of apologies due to inconsistencies, but potentially reduce performance and availability, thus potentially increasing apologies due to outages. You cannot reduce apologies to zero, but you can find the best trade-off: the sweet spot where there are neither too many inconsistencies nor too many availability problems.
10. Trust, But Verify
10.1. Maintaining Integrity in the Face of Software Bugs
In plain English: Imagine trusting a calculator to always give correct answers, then discovering it sometimes makes rounding errors. Even well-tested databases have bugs: MySQL has failed to maintain uniqueness constraints, and PostgreSQL's serializable isolation level has exhibited write skew anomalies. If mature, battle-tested databases have bugs, applications built on them certainly do. You need ways to detect corruption, not just assume it never happens.
In technical terms: Despite hardware failures being rare and software being carefully designed and tested, bugs still exist. Past versions of major databases (MySQL, PostgreSQL) have had bugs violating fundamental guarantees. Application code has even more bugs, and most applications don't correctly use the database features that preserve integrity (foreign keys, uniqueness constraints).
Why it matters: Data corruption is inevitable over time at large enough scale. You need ways to detect corruption so you can fix it and track down error sources. Auditing—checking data integrity—shouldn't be just for financial applications; it's important for all systems that care about correctness.
Real-world database bugs:
- MySQL: Past versions failed to correctly maintain uniqueness constraints
- PostgreSQL: Serializable isolation level exhibited write skew anomalies in some versions
- Both are robust, well-regarded databases with extensive testing and review
- Less mature software likely has more issues
Application code is worse:
- Most applications receive far less review and testing than databases
- Many don't use available integrity features (foreign key constraints, uniqueness constraints)
- Often use database features incorrectly (unsafe isolation levels, improper transaction boundaries)
ACID consistency assumption:
ACID consistency assumes:
- Database starts in a consistent state
- Transaction transforms one consistent state to another
- Relies on transactions being bug-free
But if applications use databases incorrectly (weak isolation levels unsafely, improper constraints), integrity cannot be guaranteed.
💡 Insight
Consistency in the ACID sense is based on the idea that the database starts in a consistent state and transactions transform it from one consistent state to another. However, this notion only makes sense if you assume transactions are free from bugs. If applications use databases incorrectly, integrity cannot be guaranteed.
10.2. Don't Just Blindly Trust What They Promise
In plain English: Don't assume your backup works—actually test restoring from it before you desperately need it during a disaster. Don't assume your data is uncorrupted—periodically check it. Large-scale storage systems don't fully trust disks; they constantly read files, compare replicas, and move data between disks to catch corruption early.
In technical terms: With both hardware and software not always living up to ideals, data corruption is inevitable eventually. Checking data integrity (auditing) is essential. Mature systems actively verify correctness rather than blindly trusting components. Storage systems like HDFS and S3 run background processes continually reading files, comparing replicas, and moving data to mitigate silent corruption.
Why it matters: If you want to ensure your data is still there, you have to actually read it and check. Most of the time it will be fine, but if it isn't, you want to find out sooner rather than later. Don't blindly trust—verify.
Examples of verification in practice (a toy replica scrubber is sketched after these lists):
Large-scale storage systems:
- HDFS: Background processes continually read files, compare replicas, move files between disks
- Amazon S3: Similar verification processes
- Philosophy: Don't fully trust disks, but assume they work correctly most of the time
- Proactive: Find corruption before it causes problems
Backup systems:
- Test restores regularly: Don't wait for disaster to discover backup is broken
- Verification processes: Automated testing of backup integrity
- Recovery drills: Practice restore procedures
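For flavor, here is a toy version of such a background scrubber, assuming each replica is just a local directory of files; real systems such as HDFS use block-level checksums and repair pipelines rather than whole-file hashes, so this is only a sketch of the idea.

```python
import hashlib
from pathlib import Path

def scrub(replica_dirs):
    """Read every file on every replica and compare content hashes.
    A mismatch means silent corruption that should be repaired from a good copy."""
    names = {p.name for d in replica_dirs for p in Path(d).iterdir() if p.is_file()}
    for name in sorted(names):
        digests = {}
        for d in replica_dirs:
            path = Path(d) / name
            if path.exists():
                digests[d] = hashlib.sha256(path.read_bytes()).hexdigest()
        if len(set(digests.values())) > 1:
            print(f"corruption detected in {name}: replicas disagree: {digests}")

# Example: scrub(["/data/replica1", "/data/replica2", "/data/replica3"])
```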
System model assumptions:
Traditional system models (Chapter 10) take a binary approach:
- Some things can happen (processes crash, network delays)
- Other things can never happen (disk data not lost after fsync, memory not corrupted, CPU instructions correct)
Reality is more nuanced—it's about probabilities:
- Some things are more likely
- Other things are less likely
- Question: Do violations happen often enough that we encounter them in practice?
At scale, rare events happen:
If you operate at large enough scale, even very unlikely things do occur. Data can become corrupted:
- In memory (hardware faults)
- On disk (bit rot, firmware bugs)
- On the network (cosmic rays, electromagnetic interference)
- Through software bugs (the most common source)
💡 Insight
Systems like HDFS and S3 still assume that disks work correctly most of the time—which is reasonable—but not the same as assuming they always work correctly. However, not many systems currently have this "trust, but verify" approach of continually auditing themselves. In the future, we may see more self-validating systems that continually check their own integrity rather than relying on blind trust.
10.3. Designing for Auditability
In plain English: Imagine trying to figure out why your bank account is wrong by looking at the current balance—impossible! But if you have a ledger showing every transaction with timestamps and reasons, you can trace exactly what happened. Event-based systems provide this ledger: every state change is an immutable event, making it possible to audit, replay, and verify that the system did what it was supposed to do.
In technical terms: If transactions mutate objects in databases, it's difficult to determine after the fact what the transaction meant. Even with transaction logs, the insertions/updates/deletions don't necessarily show why mutations were performed. Event-based systems provide better auditability: user input is represented as immutable events, and state updates are deterministic derivations from those events.
Why it matters: Being explicit about dataflow makes data provenance much clearer, enabling integrity checking, debugging, and compliance. You can rerun derivations to verify correctness or even time-travel debug to understand why the system made specific decisions.
Comparison: Mutable Transactions vs. Immutable Events
Benefits of event-based auditability:
- Deterministic derivations: running the same events through the same code produces the same output
- Hash verification: cryptographic hashes can check that the stored event log hasn't been corrupted
- Redundant derivation: batch/stream processors can be rerun to verify that derived state matches expectations
- Parallel verification: a redundant derivation can run alongside production to detect discrepancies
- Time-travel debugging: the exact circumstances leading to an unexpected event can be reproduced
Example audit workflow:
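A sketch of what such a workflow can look like, assuming a deterministic derivation function like the toy one shown earlier in this chapter; the hash-chaining scheme here is an illustration, not a standard format.

```python
import hashlib
import json

def log_checksum(events) -> str:
    """Hash-chain the event log so later corruption or tampering is detectable."""
    chained = "0" * 64
    for event in events:
        payload = chained + json.dumps(event, sort_keys=True)
        chained = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return chained

def audit(events, expected_checksum, serving_state, derive):
    """1. Verify the log itself; 2. rerun the derivation and compare it with
    the state the production system is actually serving."""
    if log_checksum(events) != expected_checksum:
        raise AssertionError("event log corrupted or tampered with")
    if derive(events) != serving_state:
        raise AssertionError("derived state does not match the event log")
```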
Better than traditional databases:
Traditional transaction logs show low-level mutations (insert, update, delete) but not:
- Why the application decided to perform those mutations
- What user action triggered the transaction
- What the application state was when the decision was made
Event-based systems preserve this context, making debugging and auditing much more feasible.
💡 Insight
Being explicit about dataflow makes the provenance of data much clearer, which makes integrity checking much more feasible. A deterministic and well-defined dataflow also makes it easier to debug and trace execution to determine why the system did something—a kind of time-travel debugging capability.
10.4. Tools for Auditable Data Systems
In plain English: Blockchains like Bitcoin got one thing very right: they provide tamper-proof logs where everyone can verify that the log hasn't been altered. You don't need cryptocurrency to benefit from these ideas—cryptographic tools like Merkle trees and signed logs can make regular applications more auditable, helping detect bugs and corruption before they cause serious damage.
In technical terms: Currently, few data systems make auditability a top-level concern. Some applications implement custom audit mechanisms, but guaranteeing integrity of both audit logs and database state is difficult. Cryptographic techniques from blockchains (Merkle trees, signed logs, Byzantine fault tolerance) can be adapted for general-purpose data systems to provide stronger auditability.
Why it matters: Auditing tools increase confidence about system correctness, allowing you to move faster. Like automated testing, auditing increases the chances that bugs are found quickly, reducing the risk that changes or new technologies cause damage.
- Blockchains and distributed ledgers:
  - Shared append-only logs
  - Cryptographic consistency checks
  - Consensus despite corrupted nodes
  - Too much overhead for most applications
- Merkle trees:
  - Trees of hashes
  - Efficiently prove that a record appears in a dataset
  - Used in Git, Bitcoin, Certificate Transparency
  - Lower overhead than a full blockchain
- Certificate Transparency:
  - Append-only logs plus Merkle trees
  - Check the validity of TLS/SSL certificates
  - Single leader per log (no consensus needed)
  - Detects certificate authority misbehavior
- Signed transaction logs:
  - Periodically sign the transaction log
  - Makes the log tamper-proof
  - Doesn't guarantee the right transactions were logged in the first place
  - Prevents after-the-fact alteration
Blockchain technologies relevant to general systems (a toy Merkle root computation is sketched after this list):
- Merkle trees: trees of cryptographic hashes that efficiently prove a record appears in a dataset
- Append-only logs: immutable event logs that can't be altered without detection
- Cryptographic signatures: sign log segments to prove they haven't been tampered with
- Byzantine fault tolerance: continue working correctly even if some nodes hold corrupted data
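To give a flavor of the Merkle-tree idea, here is a toy root computation; it omits the domain separation, odd-node handling, and inclusion proofs that production designs such as RFC 6962 (Certificate Transparency) specify, so treat it only as a sketch.

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(records) -> bytes:
    """Hash each record, then repeatedly hash adjacent pairs until one root
    remains. Changing any record (or its position) changes the root."""
    level = [_h(r) for r in records]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            nxt.append(_h(pair[0] + pair[1]) if len(pair) == 2 else pair[0])
        level = nxt
    return level[0]

records = [b"cert-A", b"cert-B", b"cert-C"]
print(merkle_root(records).hex())  # changes if any record is altered
```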
Example: Certificate Transparency
Certificate Transparency uses cryptographic techniques to audit TLS/SSL certificates:
- Maintain cryptographically verified append-only logs
- Use Merkle trees to efficiently prove certificate issuance
- Avoid consensus protocol by having single leader per log
- Detect and prevent misissued certificates from certificate authorities
Future direction:
Integrity-checking and auditing algorithms might become more widely used in general data systems:
- Some work needed to make them scalable
- Keep performance penalty low
- Adapt techniques from distributed ledgers and certificate transparency
- Apply to general-purpose applications, not just financial systems
💡 Insight
At present, not many data systems make auditability a top-level concern. However, cryptographic tools from blockchains can be used in lightweight contexts. Integrity-checking algorithms like those in certificate transparency and distributed ledgers might become more widely used in data systems in the future, though work is needed to make them equally scalable.
11. Summary
In this chapter, we explored new approaches to designing data systems based on stream processing ideas, bringing together the themes of reliability, scalability, and maintainability:
Data Integration (Section 2):
- No single tool efficiently serves all use cases
- Applications must compose specialized software to accomplish goals
- Solve integration through batch processing and event streams
- Designate systems of record; derive other data through transformations
- Asynchronous, loosely coupled derivations prevent faults from spreading
Dataflow Philosophy:
- Express data processing as transformations from one dataset to another
- When processing changes, rerun transformation code on whole input to rederive output
- Similar to what databases do internally, but unbundled into separate components
- Easier to evolve: fix code and reprocess data to recover from errors
Observing Derived State (Section 6):
- Derived state can be observed by downstream consumers
- Can extend dataflow to end-user devices
- Build UIs that dynamically update to reflect data changes
- Continue working offline with eventual synchronization
Correctness Without Coordination (Sections 7-9):
- Strong integrity guarantees achievable with asynchronous event processing
- Use end-to-end request IDs to make operations idempotent
- Check constraints asynchronously
- Clients can either wait for the checks or proceed optimistically, accepting the risk of having to apologize later
- Much more scalable and robust than distributed transactions
- Fits with how many business processes work in practice
Coordination-Avoiding Systems:
- Maintain integrity without atomic commit, linearizability, or cross-shard coordination
- Many applications fine with loose constraints temporarily violated and fixed later
- Achieve better performance and fault tolerance
- Work in geographically distributed scenarios and with faults
Trust, But Verify (Section 10):
- Auditing verifies integrity and detects corruption
- Cryptographic techniques from blockchains applicable to general systems
- Event-based systems provide better auditability than traditional mutations
- Increases confidence, enabling faster iteration
💡 Insight
The key philosophical shift is treating data systems as dataflow pipelines where state changes flow through transformations, rather than as databases that are passively read and written. This enables building systems that maintain strong integrity guarantees while being more scalable, fault-tolerant, and evolvable than traditional approaches.
The future of data systems:
By structuring applications around dataflow and checking constraints asynchronously, we can create systems that:
- Maintain integrity without coordination
- Perform well even geographically distributed
- Remain correct in the presence of faults
- Can be audited and verified end-to-end
- Enable faster, safer evolution
This philosophy represents the convergence of ideas from databases, stream processing, distributed systems, and functional programming, pointing toward how we'll build reliable systems in the coming decades.