Chapter 2. Defining Nonfunctional Requirements
The invisible requirements that make or break your system
"The Internet was done so well that most people think of it as a natural resource like the Pacific Ocean, rather than something that was man-made. When was the last time a technology with a scale like that was so error-free?"
— Alan Kay, in an interview with Dr. Dobb's Journal (2012)
Table of Contents
- Introduction
- Case Study: Social Network Home Timelines
- Describing Performance
- Reliability and Fault Tolerance
- Scalability
- Maintainability
- Summary
1. Introduction
In plain English: You wouldn't build a house focusing only on the floor plan while ignoring whether it can withstand storms, stay warm in winter, or be maintained. Similarly, software needs more than just functionality—it must be fast, reliable, and maintainable.
In technical terms: Nonfunctional requirements define system qualities like performance, reliability, scalability, and maintainability. Unlike functional requirements (what the system does), nonfunctional requirements describe how well it does it.
Why it matters: An app that works perfectly in theory but crashes constantly, responds slowly, or becomes unmaintainable is worthless. These "invisible" requirements often determine whether a system succeeds or fails in production.
💡 Insight
Nonfunctional requirements are often unstated because they seem "obvious," but they're just as critical as features. A slow, unreliable app might as well not exist—users will abandon it regardless of its features.
2. Case Study: Social Network Home Timelines
In plain English: Imagine building a Twitter-like feed where millions of people post and read updates every second. Should you build the feed when someone opens the app, or prepare it ahead of time? This simple question reveals fundamental trade-offs in system design.
In technical terms: Social network timelines demonstrate the classic read vs. write optimization trade-off. Computing timelines on-demand optimizes writes but slows reads. Precomputing timelines optimizes reads but increases write complexity.
Why it matters: This pattern—choosing when to do work—appears everywhere in data systems. Understanding it helps you make similar decisions across different domains.
[Figure: posting rate for the case study: about 5,700 posts/second on average, with a substantially higher peak load.]
2.1. Representing Users, Posts, and Follows
In plain English: You could store everything in a database and build each person's feed by searching for their friends' posts every time they open the app. Simple, but slow.
In technical terms: A relational schema with users, posts, and follows tables supports on-demand timeline generation via joins. This approach computes timelines at read time.
Let's say the main read operation is the home timeline, displaying recent posts by people you follow. The SQL query:
SELECT posts.*, users.* FROM posts
JOIN follows ON posts.sender_id = follows.followee_id
JOIN users ON posts.sender_id = users.id
WHERE follows.follower_id = current_user
ORDER BY posts.timestamp DESC
LIMIT 1000
The problem with polling:
- 10 million online users
- Polling every 5 seconds
- = 2 million queries/second
- Each query fetches posts from ~200 followed users
- = 400 million post lookups/second
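As a sanity check, here is the same back-of-the-envelope arithmetic as a runnable sketch; the user count, polling interval, and follow count are the assumptions listed above, not measurements:

```python
# Back-of-the-envelope cost of computing home timelines at read time.
online_users = 10_000_000      # users polling for new posts
poll_interval_s = 5            # each client refreshes every 5 seconds
followed_per_user = 200        # average number of accounts each user follows

timeline_queries_per_s = online_users / poll_interval_s
post_lookups_per_s = timeline_queries_per_s * followed_per_user

print(f"{timeline_queries_per_s:,.0f} timeline queries/second")  # 2,000,000
print(f"{post_lookups_per_s:,.0f} post lookups/second")          # 400,000,000
```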
💡 Insight
The query is expensive because it fetches and merges posts from 200 users per request. At scale, this read-time computation becomes a bottleneck—400 million lookups/second is prohibitive for most databases.
2.2. Materializing and Updating Timelines
In plain English: Instead of searching for your friends' posts when you open the app, what if we prepared your feed ahead of time—like delivering mail to your mailbox instead of making you go to every friend's house to check for letters?
In technical terms: Materialization precomputes query results and stores them. When a user posts, we fan out that post to all followers' timeline caches. This shifts work from read time to write time.
1 write → 200 writes (fan-out factor = 200)
The fan-out calculation:
- 5,700 posts/second (average)
- × 200 followers per post (fan-out factor)
- = 1.14 million timeline writes/second
Trade-off analysis:

| Approach | Read cost | Write cost |
|---|---|---|
| Compute timeline at read time | High (~400 million post lookups/sec in this example) | Low (one insert per post, ~5,700/sec) |
| Materialize timeline at write time | Low (one cached timeline read per request) | High (~1.14 million timeline writes/sec due to fan-out) |
💡 Insight
Materialization is a read-write trade-off: precomputing views shifts work from read time to write time. This is beneficial when reads vastly outnumber writes—a common pattern in user-facing applications.
The celebrity problem: Users with 100M followers create extreme fan-out. A single post triggers 100M timeline updates—impractical. Real implementations use hybrid approaches: materialize for normal users, compute on-demand for celebrities.
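Here is a minimal sketch of that hybrid approach. An in-memory dict stands in for the timeline cache, and the follower threshold that marks an account as a "celebrity" is an arbitrary illustrative value, not what any real social network uses:

```python
from collections import defaultdict, deque

TIMELINE_LEN = 1000            # keep only the most recent entries per user
CELEBRITY_THRESHOLD = 100_000  # skip fan-out above this follower count (arbitrary)

timelines = defaultdict(lambda: deque(maxlen=TIMELINE_LEN))  # follower_id -> recent post ids
followers = defaultdict(set)         # sender_id -> set of follower ids
celebrity_posts = defaultdict(list)  # sender_id -> post ids, merged at read time

def publish(sender_id, post_id):
    """Write path: fan the post out to every follower's cached timeline,
    unless the sender has so many followers that fan-out is impractical."""
    if len(followers[sender_id]) >= CELEBRITY_THRESHOLD:
        celebrity_posts[sender_id].append(post_id)   # defer to read time
        return
    for follower_id in followers[sender_id]:
        timelines[follower_id].appendleft(post_id)   # 1 write becomes N writes

def read_timeline(user_id, followed_celebrities):
    """Read path: the precomputed cache plus an on-demand merge of celebrity posts.
    A real system would merge by timestamp; this sketch just concatenates."""
    posts = list(timelines[user_id])
    for celeb_id in followed_celebrities:
        posts.extend(celebrity_posts[celeb_id][-10:])  # most recent few
    return posts
```

The write path does the fan-out for ordinary accounts, while the read path merges in celebrity posts on demand, which keeps the worst-case fan-out bounded.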
3. Describing Performance
In plain English: "Fast" is subjective. Does it mean average speed, worst-case speed, or something else? To build reliable systems, we need precise ways to measure and discuss performance.
In technical terms: Performance metrics fall into two categories: response time (latency from request to response) and throughput (requests or data volume per second). Both are critical but measure different aspects of system behavior.
| Metric | Description | Unit |
|---|---|---|
| Response time | Elapsed time from request to answer | Seconds (ms, μs) |
| Throughput | Number of requests or data volume per second | "per second" |
The relationship: These metrics are interconnected. When throughput increases (more concurrent requests), response time often degrades due to queueing—requests waiting for resources like CPU or network.
3.1. Latency and Response Time
In plain English: Response time is what the user experiences—the total wait. Latency is the waiting time when nothing is actively happening (like waiting in line).
In technical terms: Response time encompasses all delays; latency specifically measures idle waiting time when a request isn't being actively processed.
| Term | Definition |
|---|---|
| Response time | What the client sees; includes all delays anywhere in the system |
| Service time | Duration the service is actively processing the request |
| Queueing delays | Time waiting for resources (CPU, network, disk) |
| Latency | Time during which a request is not being actively processed (latent, waiting) |
Variability: Response time varies dramatically between identical requests due to:
- Context switches (OS scheduling)
- Network packet loss and TCP retransmission
- Garbage collection pauses
- Page faults (RAM → disk swapping)
- Cache misses
💡 Insight
Head-of-line blocking causes queueing delays to amplify variability. Since servers process limited concurrent requests, even a few slow requests block subsequent ones, creating cascading delays.
3.2. Average, Median, and Percentiles
In plain English: If nine requests take 100ms and one takes 10 seconds, the average is 1090ms—but that's misleading. Most users experienced 100ms, not 1090ms. We need better ways to describe distributions.
In technical terms: Response time is a distribution, not a single number. Percentiles describe this distribution more accurately than averages.
[Figure: response-time percentiles. Box size represents the percentage of users experiencing that latency or better: 50%, 5%, 1%, and 0.1%.]
| Percentile | Meaning |
|---|---|
| p50 (median) | Half of requests are faster, half are slower |
| p95 | 95% of requests are faster than this threshold |
| p99 | 99% of requests are faster than this threshold |
| p999 | 99.9% of requests are faster than this threshold |
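Computing these from a sample of measured response times is straightforward. A minimal sketch using the nearest-rank method; the simulated latencies are made up for illustration:

```python
import math
import random

def percentile(samples, p):
    """Nearest-rank percentile: the value below which roughly p% of samples fall."""
    ranked = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ranked)))
    return ranked[rank - 1]

# Simulated response times in milliseconds: mostly fast, with a slow tail.
random.seed(0)
response_times = [random.gauss(100, 20) for _ in range(990)]
response_times += [random.uniform(500, 2000) for _ in range(10)]

for p in (50, 95, 99, 99.9):
    print(f"p{p}: {percentile(response_times, p):.0f} ms")
```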
Why high percentiles matter:
- Valuable customers: Users with slow requests often have the most data (e.g., power users, large accounts)
- User experience: Consistently slow experiences drive users away
- Amazon example: Uses p999 for internal services because slowest users are often most valuable
💡 Insight
Tail latencies (high percentiles) directly impact user experience. A p99 of 2 seconds means 1 in 100 users waits 2+ seconds—enough to notice and complain. For high-traffic services, that's thousands of frustrated users daily.
3.3. Use of Response Time Metrics
In plain English: When one page load requires calling 10 backend services, even if each service is fast 99% of the time, there's a good chance at least one will be slow—making the whole page slow.
In technical terms: Tail latency amplification occurs when a user request requires multiple backend calls. The probability of encountering at least one slow call increases with the number of calls.
Example: If each service has p99 = 100ms and a page load calls 10 services in parallel, the probability that all of them respond within 100ms is 0.99^10 ≈ 90%. In other words, roughly 1 in 10 page loads is slower than at least one service's p99, so the end-to-end p99 is much worse than 100ms.
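The same arithmetic generalizes to any number of backend calls, under the simplifying assumption that the calls are independent (in practice slow calls are often correlated, which can make things even worse):

```python
# Probability that a request touching n backends sees no call slower than
# each backend's own p99, assuming the calls are independent.
def prob_all_fast(n_backends, per_call_quantile=0.99):
    return per_call_quantile ** n_backends

for n in (1, 5, 10, 50, 100):
    p = prob_all_fast(n)
    print(f"{n:3d} backend calls: {p:.1%} of requests have no slow call")
# 10 calls -> ~90.4%; 100 calls -> ~36.6%
```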
Service Level Objectives (SLOs) and Agreements (SLAs):
Percentiles define expected performance in contracts:
- SLO: "p95 response time will be under 200ms"
- SLA: "We guarantee p99 under 1 second, or you get a refund"
💡 Insight
High-percentile guarantees become exponentially harder in distributed systems. Each additional service call compounds tail latency risk, which is why microservices often struggle with consistent performance.
4. Reliability and Fault Tolerance
In plain English: Reliability means the system keeps working correctly even when things go wrong—hard drives fail, networks glitch, and humans make mistakes. A reliable system handles these gracefully instead of crashing.
In technical terms: Reliability is the system's ability to continue providing required services despite faults. This requires fault tolerance mechanisms that prevent faults from escalating into failures.
Why it matters: Hardware fails constantly at scale. In a datacenter with 10,000 hard drives, if each has a 1% annual failure rate, you'll have 100 drive failures per year—nearly 2 per week. Systems must be designed to handle this.
For software, typical reliability expectations include:
- The application performs the function the user expected
- It tolerates users making mistakes or using the software in unexpected ways
- Its performance is good enough for the required use case, under the expected load and data volume
- The system prevents unauthorized access and abuse
Key distinction:
| Term | Definition |
|---|---|
| Fault | A component stops working correctly (e.g., disk fails, network drops) |
| Failure | The system as a whole stops providing required service to users |
💡 Insight
Reliability = continuing to work correctly, even when things go wrong. The goal isn't to eliminate all faults (impossible), but to prevent faults from becoming failures.
4.1. Fault Tolerance
In plain English: Fault tolerance means your system can lose a specific part and keep working. Like having a spare tire—you can drive even with a flat.
In technical terms: A system is fault-tolerant if it continues providing required services despite certain faults occurring. Components that, if failed, cause system failure are called single points of failure (SPOFs).
Scope limits: Fault tolerance is always bounded:
- "Tolerates up to 2 concurrent disk failures"
- "Survives single datacenter outage"
- "Handles 3 node failures in a 5-node cluster"
Chaos Engineering: Deliberately injecting faults to test tolerance mechanisms:
- Randomly kill processes
- Introduce network delays
- Fill up disks
- Simulate datacenter outages
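As a toy illustration of the idea, a wrapper can randomly inject latency or errors into calls so that timeouts and fallbacks are exercised continuously. The failure rate, delay range, and exception type below are arbitrary placeholders, not the API of any real chaos-engineering tool:

```python
import functools
import random
import time

def chaos(failure_rate=0.01, max_extra_delay_s=0.5):
    """Randomly fail or slow down the wrapped call to exercise fault handling."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise ConnectionError("chaos: injected fault")
            time.sleep(random.uniform(0, max_extra_delay_s))  # injected latency
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@chaos(failure_rate=0.05)
def fetch_profile(user_id):
    return {"id": user_id, "name": "example"}   # stand-in for a real backend call
```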
Common fault-tolerance mechanisms include:
- Circuit breakers, graceful degradation, fallbacks
- Data redundancy, automatic failover
- Retry logic, timeouts, load balancing
- RAID arrays, dual power supplies, redundant nodes
- Multiple data centers, geographic distribution
💡 Insight
Counter-intuitively, increasing fault rates can improve reliability. By deliberately triggering faults (chaos engineering), you discover weaknesses before they cause real outages. Netflix's Chaos Monkey randomly terminates production instances to ensure systems handle failures gracefully.
4.2. Hardware and Software Faults
In plain English: Hardware breaks predictably—drives fail at known rates, you can plan for it. Software bugs are trickier because they often affect all instances simultaneously.
In technical terms: Hardware faults are typically independent and random. Software faults are correlated—the same bug exists on every node running that code, causing simultaneous failures.
Hardware faults (failing disks, faulty RAM, power outages) happen constantly in a large fleet. The traditional response is to add redundancy:
- RAID for disk failures
- Dual power supplies
- Backup generators
- Hot-swappable components
Software faults are more insidious:
| Fault Type | Example | Impact |
|---|---|---|
| Cascading failures | One overloaded service causes others to fail | Widespread outage |
| Resource exhaustion | Memory leak consumes all RAM | All nodes fail simultaneously |
| Dependency failure | External API goes down | All dependent services affected |
| Retry storms | Failed requests retry, increasing load | System collapse under load |
Mitigation strategies:
- Careful testing (unit, integration, chaos)
- Process isolation (containers, VMs)
- Crash and restart (let it fail fast)
- Avoid feedback loops (exponential backoff)
- Production monitoring and alerting
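The "avoid feedback loops" point in the list above usually means retrying with exponential backoff and jitter, so that clients back off rather than piling more load onto an already struggling service. A minimal sketch; the delay constants and exception types are placeholders:

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay_s=0.1, max_delay_s=10.0):
    """Retry fn(), doubling the delay after each failure and adding jitter
    so that many clients don't retry in lockstep (a retry storm)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise                                 # give up, surface the fault
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            time.sleep(random.uniform(0, delay))      # "full jitter" variant
```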
💡 Insight
Hardware faults are independent; software faults are correlated. This is why redundancy alone doesn't guarantee reliability—three servers running buggy code will all fail the same way. You need defense in depth: testing, isolation, monitoring, and graceful degradation.
4.3. Humans and Reliability
In plain English: Humans make mistakes—it's inevitable. Blaming people doesn't help. Instead, design systems that make mistakes harder to make and easier to recover from.
In technical terms: Human error is the leading cause of outages, but it's a symptom of poor system design, not a root cause. Sociotechnical system design can minimize human-induced failures.
The data: One study of large internet services found that:
- Configuration changes by operators were the leading cause of outages
- Hardware faults contributed to only 10–25% of outages
Why "human error" is misleading: Blaming individuals ignores systemic issues. When humans make mistakes, it usually indicates:
- Unclear interfaces
- Inadequate training
- Time pressure
- Poor tooling
- Complex systems
Technical measures to minimize human mistakes:
| Measure | How It Helps |
|---|---|
| Testing | Unit, integration, and end-to-end tests catch bugs before production |
| Rollback mechanisms | Quickly revert bad changes |
| Gradual rollouts | Deploy to small percentage first, catch issues early |
| Monitoring | Detect anomalies and alert operators |
| Good interfaces | Make correct actions obvious, dangerous actions difficult |
| Documentation | Clear runbooks for common operations |
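As a sketch of the gradual-rollouts row above: hash a stable identifier into a bucket and compare it against the current rollout percentage, so the same users stay enrolled as the percentage grows. The hash choice, bucket count, and feature name are illustrative:

```python
import hashlib

def in_rollout(user_id: str, feature: str, rollout_percent: float) -> bool:
    """Deterministically place a user in the first `rollout_percent` of buckets."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000          # 0..9999, stable per user+feature
    return bucket < rollout_percent * 100      # e.g. 5% -> buckets 0..499

# Roll "new_checkout" out to 5% of users first, watch dashboards, then increase.
print(in_rollout("user-42", "new_checkout", 5.0))
```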
Blameless postmortems: After incidents, teams share what happened without fear of punishment. This encourages honesty and systemic learning rather than hiding mistakes.
💡 Insight
Blame is counterproductive. When incidents happen, ask "What about our system allowed this to occur?" instead of "Who did this?" Culture and tooling that treat mistakes as learning opportunities build more reliable systems than punishment-based approaches.
4.4. How Important Is Reliability?
In plain English: Even "boring" business apps need reliability. Bugs don't just annoy users—they can destroy lives and businesses.
In technical terms: Reliability failures have cascading impacts: lost revenue, damaged reputation, legal liability, and in severe cases, ruined lives. The cost of unreliability far exceeds the cost of building reliable systems.
Real-world consequences:
Case Study: Post Office Horizon Scandal
Between 1999 and 2019, hundreds of Post Office branch managers in Britain were convicted of theft or fraud because accounting software (Horizon) showed shortfalls in their accounts. Many were imprisoned, went bankrupt, or died before vindication.
Eventually discovered: Many shortfalls were due to software bugs, not theft. The system was unreliable, but management trusted it over people.
💡 Insight
Unreliable software has real human costs. The Horizon scandal shows how software bugs can destroy lives when systems are trusted blindly. Reliability isn't just about uptime—it's about responsibility to the people who depend on your systems.
5. Scalability
In plain English: Scalability means your system can handle growth without falling over. It's not a binary property—you don't "have" scalability. Instead, you plan for specific growth patterns and know when you'll hit limits.
In technical terms: Scalability describes a system's ability to maintain performance as load increases. It requires understanding current load, predicting growth, and having a plan to add capacity when needed.
Why it matters: Even reliable systems degrade under increased load. Without scalability planning, success can kill your system—viral growth crashes your service just as users discover it.
Scalability is not:
- ❌ "This system is scalable"
- ❌ "We built for infinite scale"
- ❌ "It scales horizontally"
Scalability is:
- ✅ "If daily users grow 10x, we'll need 5 more DB replicas"
- ✅ "We'll hit limits at 50k concurrent users; we're at 20k now"
- ✅ "Adding 10 nodes doubles our write capacity"
5.1. Describing Load
In plain English: Before you can scale, you need to measure what "load" means for your system. Is it requests per second? Users online? Data volume? The answer shapes your scaling strategy.
In technical terms: Load parameters quantify current system stress. Common metrics include throughput (requests/sec), concurrency (active users), and data volume (GB/day).
Common load parameters:
| Metric | Example | Use Case |
|---|---|---|
| Requests/second | 10,000 API calls/sec | Web services |
| Data volume | 500 GB new data/day | Data pipelines |
| Concurrent users | 100,000 simultaneous users | Gaming, streaming |
| Transactions | 5,000 checkouts/hour | E-commerce |
Linear scalability: If you can double resources to handle double the load with the same performance, you have linear scalability—the holy grail.
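Put concretely, describing load lets you do simple capacity math. The sketch below assumes linear scalability, which is optimistic, and all of the numbers are placeholders:

```python
# How many nodes would we need if load grows, assuming linear scalability?
current_nodes = 4
current_load_rps = 10_000        # requests/second today
capacity_per_node_rps = 3_000    # measured safe throughput per node

for growth in (1, 2, 5, 10):
    projected = current_load_rps * growth
    needed = -(-projected // capacity_per_node_rps)   # ceiling division
    print(f"{growth:2d}x load ({projected:>7,} rps): "
          f"need {needed} nodes (have {current_nodes})")
```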
💡 Insight
Load description is application-specific. For a social network, it might be "posts/second" and "timeline reads/second." For an analytics system, it's "queries/hour" and "data ingestion rate." Understanding your specific load parameters is the first step toward effective scaling.
5.2. Shared-Memory, Shared-Disk, and Shared-Nothing Architecture
In plain English: There are three ways to add capacity: buy a bigger computer (vertical), connect multiple computers to the same storage (shared-disk), or give each computer its own everything (horizontal).
In technical terms: Scaling architectures differ in how they share resources. Each has distinct cost, complexity, and scalability trade-offs.
[Figure: shared-nothing architecture: several independent nodes, each with its own CPU and disk, communicating over a network.]
Shared-Nothing advantages:
- Linear scalability potential
- Cost-effective commodity hardware
- Fault tolerance across nodes
- Elastic—add/remove nodes dynamically
- Geographic distribution
Shared-Nothing challenges:
- Complex distributed system logic
- Data sharding required
- Network latency
- Partial failures
- Eventual consistency
💡 Insight
There's no universal winner. Vertical scaling is simpler but limited. Horizontal scaling can scale infinitely but adds massive complexity. Modern systems often use both: vertically scale nodes to delay the complexity of distribution, then horizontally scale when necessary.
5.3. Principles for Scalability
In plain English: Scalability isn't one-size-fits-all. A system that handles millions of tiny requests needs different architecture than one processing huge analytics queries. The key is breaking work into independent pieces.
In technical terms: Scalable architectures decompose systems into loosely-coupled components that can operate independently. This enables parallel processing and localized failures.
Core principles:
- Describe load with concrete, application-specific parameters
- Break the system into smaller components that can operate independently
- Spread work so it can proceed in parallel and failures stay localized
- Don't add distribution or complexity before the load actually requires it
Examples of decomposition:
| Pattern | Description | Benefit |
|---|---|---|
| Microservices | Split application into independent services | Teams work independently |
| Sharding | Partition data across nodes by key | Parallel data processing |
| Stream processing | Break large jobs into small, independent tasks | Continuous, incremental processing |
| Caching | Store precomputed results | Reduce load on primary system |
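The sharding row above ultimately comes down to a deterministic mapping from key to node. A minimal hash-mod sketch; real systems usually use consistent hashing or range partitioning so that adding nodes doesn't reshuffle most of the data:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]

def node_for_key(key: str) -> str:
    """Route each key to one node so reads and writes for that key stay local."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Each user's data lands on exactly one node; different users spread out.
for user in ("alice", "bob", "carol", "dave"):
    print(user, "->", node_for_key(user))
```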
💡 Insight
Scalability is not automatic. There's no "scalable architecture" you can copy. Systems that scale well at social networks (many tiny writes) fail at analytics (few huge queries). Design for your specific load patterns, and don't make things more complicated than necessary—a single-machine database is often better than a distributed mess.
6. Maintainability
In plain English: Software doesn't wear out like machines, but it does "age" as requirements change, platforms evolve, and knowledge decays. Maintainability means designing systems that stay easy to work with over time.
In technical terms: Maintainability encompasses operability (ease of running), simplicity (ease of understanding), and evolvability (ease of changing). These determine the long-term cost and viability of systems.
Why it matters: Most software cost isn't initial development—it's ongoing maintenance. A system used for years will spend far more on bug fixes, feature additions, and operational costs than its original build.
[Figure: software lifecycle cost: roughly 20% initial development versus 80% ongoing maintenance.]
Maintenance activities:
- Fixing bugs
- Keeping systems operational
- Investigating failures
- Adapting to new platforms
- Modifying for new use cases
- Repaying technical debt
- Adding new features
6.1. Operability: Making Life Easy for Operations
In plain English: Operations teams keep systems running—deploying updates, handling incidents, scaling resources. Good operability means routine tasks are easy, freeing operators to focus on high-value work.
In technical terms: Operability is the ease with which operators can maintain a system's health. Well-designed systems provide good observability, automation, and sensible defaults.
"Good operations can often work around the limitations of bad software, but good software cannot run reliably with bad operations."
What good operability provides:
- Visibility into the system's runtime behavior (metrics, logs, traces)
- Support for automating routine tasks
- Good default behavior, with the freedom to override it when needed
- Predictable behavior that minimizes surprises
Operations responsibilities:
| Task | Good Operability | Poor Operability |
|---|---|---|
| Monitoring | Dashboards show key metrics, alerts fire before users notice | Logs scattered, no alerts, discover issues from user complaints |
| Deployment | Automated, gradual rollout, easy rollback | Manual steps, all-or-nothing, no rollback |
| Scaling | Auto-scaling based on metrics | Manual server provisioning |
| Incidents | Clear runbooks, automatic diagnostics | Guesswork, tribal knowledge |
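Much of the monitoring row above boils down to comparing measured metrics against thresholds and alerting when they are exceeded. A toy sketch; the metric names and thresholds are invented for illustration:

```python
# A toy alert rule: compare a metrics snapshot against thresholds
# so problems are noticed before users complain.
THRESHOLDS = {"p99_latency_ms": 500, "error_rate": 0.01, "disk_used_fraction": 0.9}

def check_alerts(metrics: dict) -> list[str]:
    """Return a human-readable alert for every metric over its threshold."""
    return [
        f"{name}={value} exceeds threshold {THRESHOLDS[name]}"
        for name, value in metrics.items()
        if name in THRESHOLDS and value > THRESHOLDS[name]
    ]

snapshot = {"p99_latency_ms": 730, "error_rate": 0.002, "disk_used_fraction": 0.95}
for alert in check_alerts(snapshot):
    print("ALERT:", alert)
```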
💡 Insight
Operability enables reliability. Even the best-designed system will fail if operators can't understand, monitor, or repair it. Invest in observability and automation—they're not luxuries, they're requirements for reliable production systems.
6.2. Simplicity: Managing Complexity
In plain English: Simple code is easy to understand. Complex code is a tangled mess where changing one thing breaks three others. As projects grow, fighting complexity becomes critical.
In technical terms: Complexity is the enemy of maintainability. Systems mired in complexity—"big balls of mud"—resist change and harbor bugs. Simplicity through abstraction manages this complexity.
Symptoms of excessive complexity:
- Tight coupling: changing one module forces changes in many others
- Tangled dependencies and inconsistent naming or terminology
- Special cases and workarounds ("hacks") accumulated over time
- An explosion of states and configurations nobody fully understands
Abstraction: The tool for managing complexity
Examples of good abstractions:
- High-level languages hide machine code details
- SQL hides on-disk data structures and query optimization
- HTTP hides TCP packet management
- React hides DOM manipulation
Building for simplicity:
| Principle | How It Helps |
|---|---|
| Clear interfaces | Hide implementation details |
| Consistent conventions | Reduce cognitive load |
| Remove accidental complexity | Keep only essential complexity |
| Avoid premature optimization | Simpler code is better than "clever" code |
💡 Insight
Simplicity is not simplistic. A good abstraction hides enormous complexity behind a clean interface (like SQL hiding B-trees and query planners). The goal isn't to avoid complexity entirely—it's to manage it through layers of abstraction so each layer is simple.
6.3. Evolvability: Making Change Easy
In plain English: Requirements never stop changing. Features get added, platforms evolve, business needs shift. Evolvable systems adapt to change easily instead of resisting it.
In technical terms: Evolvability (also called extensibility or modifiability) is the ease of making changes to a system. It's closely linked to simplicity—loosely-coupled, simple systems are easier to modify.
Why requirements change:
- New features and use cases are requested
- Business priorities and markets shift
- Legal and regulatory requirements evolve
- Platforms and dependencies change or are deprecated
- Growth forces architectural changes
Factors that enable evolvability:
| Factor | Description |
|---|---|
| Loose coupling | Components can change independently |
| Good abstractions | Implementation changes don't affect interface |
| Comprehensive tests | Confidence that changes don't break things |
| Reversibility | Easy to undo changes if needed |
| Clear documentation | Understand system to change it safely |
Reversible decisions are far easier to live with than irreversible ones:

| Irreversible decisions | Reversible decisions |
|---|---|
| Careful planning required | Experiment freely |
| Risk of wrong choice | Learn from mistakes |
| Paralysis by analysis | Iterate quickly |
💡 Insight
Irreversibility kills evolvability. When changes can't be undone, teams become paralyzed—every decision feels permanent. Minimize irreversibility through feature flags, database migrations (not destructive changes), and architectures that support gradual transitions. The easier it is to reverse a decision, the faster you can evolve.
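Feature flags are one of the cheapest ways to keep a change reversible: the new code path ships dark and can be switched off without a redeploy. A minimal sketch with an in-memory flag store; a real system would back this with a configuration service, and the flag and field names are illustrative:

```python
# An in-memory flag store; in production this would be backed by a config
# service so flags can be flipped without redeploying.
FLAGS = {"new_ranking_algorithm": False}

def ranked_timeline(posts):
    if FLAGS.get("new_ranking_algorithm", False):
        return sorted(posts, key=lambda p: p["score"], reverse=True)      # new path
    return sorted(posts, key=lambda p: p["timestamp"], reverse=True)      # old path

# Flipping the flag back off instantly reverts behavior if the new ranking misbehaves.
FLAGS["new_ranking_algorithm"] = True
```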
7. Summary
In plain English: This chapter taught you how to think about the qualities that make systems good beyond just functionality: how fast they respond, whether they stay up when things break, if they can grow with demand, and whether they'll become a maintenance nightmare.
In technical terms: Nonfunctional requirements—performance, reliability, scalability, and maintainability—are as critical as functional requirements. Understanding how to measure and optimize these qualities is essential for building production-grade systems.
Key Takeaways
Performance:
- Use percentiles (p50, p95, p99) instead of averages to describe response time
- High percentiles matter—tail latencies affect valuable users
- Tail latency amplification occurs when requests require multiple backend calls
Reliability:
- Distinguish faults (component failures) from failures (system-wide breakdown)
- Fault tolerance prevents faults from becoming failures
- Hardware faults are independent; software faults are correlated
- Human error is a symptom, not a cause—design systems to minimize it
Scalability:
- Scalability is not binary—describe specific growth patterns and limits
- Vertical scaling (bigger machines) is simple but limited
- Horizontal scaling (more machines) scales linearly but adds complexity
- Break systems into independent components for better scalability
Maintainability:
- Most cost is ongoing maintenance, not initial development
- Operability: Make it easy to run
- Simplicity: Use abstraction to manage complexity
- Evolvability: Minimize irreversibility to enable change
💡 Insight
These four qualities are interconnected. A system that's hard to maintain will eventually become unreliable as tech debt accumulates. A system that can't scale will have poor performance under load. A system that's too complex resists both scaling and evolution. Good architecture considers all dimensions together, making conscious trade-offs rather than ignoring any one aspect.