Designing Data-Intensive Applications

"There are no solutions, there are only trade-offs. But you try to get the best trade-off you can get, and that's all you can hope for."

— Thomas Sowell

Welcome to this study guide for Designing Data-Intensive Applications (2nd Edition) by Martin Kleppmann and Chris Riccomini. Use it to understand how to build reliable, scalable, and maintainable data systems.


Table of Contents

  1. What You'll Learn
  2. Who This Guide Is For
  3. How to Use This Guide
  4. Book Structure
  5. Prerequisites

1. What You'll Learn

In plain English: How to design systems that handle large amounts of data reliably and efficiently.

In technical terms: You'll master the fundamental principles behind databases, distributed systems, data processing pipelines, and the trade-offs involved in building real-world applications.

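Why it matters: Many modern applications are data-intensive rather than compute-intensive: the limiting factor is usually the volume, complexity, and rate of change of the data, not raw CPU power. Understanding these principles is essential for building systems that scale, remain reliable under failure, and evolve gracefully over time.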


2. Who This Guide Is For

This guide is designed for:

Role                What You'll Gain
Software Engineers  Deep understanding of data systems internals and how to choose the right tools
System Architects   Framework for making informed trade-off decisions in system design
Tech Leads          Vocabulary and concepts to guide your team's technical decisions
Students            Solid foundation in distributed systems and database fundamentals

3. How to Use This Guide

┌─────────────────────┐     ┌─────────────────────┐     ┌─────────────────────┐
│       Part I        │     │       Part II       │     │      Part III       │
│  ─────────────────  │ ──▶ │  ─────────────────  │ ──▶ │  ─────────────────  │
│   Foundations of    │     │     Distributed     │     │       Derived       │
│    Data Systems     │     │        Data         │     │        Data         │
└─────────────────────┘     └─────────────────────┘     └─────────────────────┘

If you're new to data systems: Start from Part I to build a solid foundation in data models, storage engines, and encoding formats.

If you're familiar with databases: Jump to Part II to explore replication, partitioning, and the challenges of distributed systems.

If you're interested in data pipelines: Part III covers batch and stream processing, helping you understand how to derive value from data.


4. Book Structure

Part I: Foundations of Data Systems

Chapter  Topic                                    Key Concepts
1        Trade-offs in Data Systems Architecture  OLTP vs OLAP, cloud vs self-hosting, distributed vs single-node
2        Nonfunctional Requirements               Performance, reliability, scalability, maintainability
3        Data Models & Query Languages            Relational, document, graph models, SQL, MapReduce
4        Storage and Retrieval                    LSM-trees, B-trees, column storage, data warehouses
5        Encoding and Evolution                   JSON, Protocol Buffers, Avro, schema evolution

Part II: Distributed Data

Chapter  Topic                    Key Concepts
6        Replication              Leader-follower, multi-leader, leaderless replication
7        Sharding                 Partitioning strategies, rebalancing, request routing
8        Transactions             ACID, isolation levels, serializability, distributed transactions
9        Distributed Systems      Network issues, clocks, truth in distributed systems
10       Consistency & Consensus  Linearizability, ordering, leader election, total order broadcast

Part III: Derived Data

Chapter  Topic                    Key Concepts
11       Batch Processing         Unix philosophy, MapReduce, Spark, dataflow engines
12       Stream Processing        Message brokers, event sourcing, stream joins
13       Philosophy of Streaming  Unbundling databases, dataflow, end-to-end arguments
14       Ethics                   Privacy, bias, accountability in data systems

5. Prerequisites

To get the most from this guide:

  • Basic programming experience (any language)
  • Familiarity with SQL and relational databases
  • Understanding of client-server architecture
  • Curiosity about how systems work under the hood

💡 Insight

You don't need to be an expert—this guide explains concepts from first principles. The most important prerequisite is a genuine interest in understanding why systems are designed the way they are, not just how to use them.


Ready to begin? Start with Chapter 1: Trade-offs in Data Systems Architecture!