
Chapter 2. Defining Nonfunctional Requirements

The invisible requirements that make or break your system

"The Internet was done so well that most people think of it as a natural resource like the Pacific Ocean, rather than something that was man-made. When was the last time a technology with a scale like that was so error-free?"

— Alan Kay, in interview with Dr Dobb's Journal (2012)


Table of Contents

  1. Introduction
  2. Case Study: Social Network Home Timelines
  3. Describing Performance
  4. Reliability and Fault Tolerance
  5. Scalability
  6. Maintainability
  7. Summary

1. Introduction

In plain English: You wouldn't build a house focusing only on the floor plan while ignoring whether it can withstand storms, stay warm in winter, or be maintained. Similarly, software needs more than just functionality—it must be fast, reliable, and maintainable.

In technical terms: Nonfunctional requirements define system qualities like performance, reliability, scalability, and maintainability. Unlike functional requirements (what the system does), nonfunctional requirements describe how well it does it.

Why it matters: An app that works perfectly in theory but crashes constantly, responds slowly, or becomes unmaintainable is worthless. These "invisible" requirements often determine whether a system succeeds or fails in production.

The four qualities at a glance:

  • Performance: how fast does it respond?
  • Reliability: does it keep working when things fail?
  • Scalability: can it handle growth?
  • Maintainability: can we evolve it over time?

💡 Insight

Nonfunctional requirements are often unstated because they seem "obvious," but they're just as critical as features. A slow, unreliable app might as well not exist—users will abandon it regardless of its features.


2. Case Study: Social Network Home Timelines

In plain English: Imagine building a Twitter-like feed where millions of people post and read updates every second. Should you build the feed when someone opens the app, or prepare it ahead of time? This simple question reveals fundamental trade-offs in system design.

In technical terms: Social network timelines demonstrate the classic read vs. write optimization trade-off. Computing timelines on-demand optimizes writes but slows reads. Precomputing timelines optimizes reads but increases write complexity.

Why it matters: This pattern—choosing when to do work—appears everywhere in data systems. Understanding it helps you make similar decisions across different domains.

Social network scale (illustrative numbers):

  • 500 million posts/day (5,700 posts/sec on average)
  • 150,000 posts/sec at peak load
  • 200 follows per user on average

2.1. Representing Users, Posts, and Follows

In plain English: You could store everything in a database and build each person's feed by searching for their friends' posts every time they open the app. Simple, but slow.

In technical terms: A relational schema with users, posts, and follows tables supports on-demand timeline generation via joins. This approach computes timelines at read time.

Let's say the main read operation is the home timeline, displaying recent posts by people you follow. The SQL query:

SELECT posts.*, users.* FROM posts
JOIN follows ON posts.sender_id = follows.followee_id
JOIN users ON posts.sender_id = users.id
WHERE follows.follower_id = current_user
ORDER BY posts.timestamp DESC
LIMIT 1000

The problem with polling:

  • 10 million online users
  • Polling every 5 seconds
  • = 2 million queries/second
  • Each query fetches posts from ~200 followed users
  • = 400 million post lookups/second

On-demand timeline generation:

  1. User opens app: requests the home timeline
  2. Find follows: query the follows table
  3. Fetch posts: get recent posts from each followed user
  4. Merge and sort: combine and order by time

💡 Insight

The query is expensive because it fetches and merges posts from 200 users per request. At scale, this read-time computation becomes a bottleneck—400 million lookups/second is prohibitive for most databases.

2.2. Materializing and Updating Timelines

In plain English: Instead of searching for your friends' posts when you open the app, what if we prepared your feed ahead of time—like delivering mail to your mailbox instead of making you go to every friend's house to check for letters?

In technical terms: Materialization precomputes query results and stores them. When a user posts, we fan out that post to all followers' timeline caches. This shifts work from read time to write time.

Fan-out pattern (write amplification): when a user makes one post, the system copies it into the materialized timeline of each of their followers, so 1 write becomes roughly 200 writes (fan-out factor = 200).

The fan-out calculation:

  • 5,700 posts/second (average)
  • × 200 followers per post (fan-out factor)
  • = 1.14 million timeline writes/second

Trade-off analysis:

                    On-demand                 Materialized
Read performance    Slow (expensive joins)    Fast (cache lookup)
Write performance   Fast (single insert)      Slower (fan-out writes)
Write volume        5,700 writes/sec          1.14M timeline writes/sec
Read volume         400M post lookups/sec     Simple cache reads

💡 Insight

Materialization is a read-write trade-off: precomputing views shifts work from read time to write time. This is beneficial when reads vastly outnumber writes—a common pattern in user-facing applications.

The celebrity problem: Users with 100M followers create extreme fan-out. A single post triggers 100M timeline updates—impractical. Real implementations use hybrid approaches: materialize for normal users, compute on-demand for celebrities.
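
To make the write-time fan-out and the celebrity exception concrete, here is a minimal Python sketch. The in-memory dictionaries, function names, and the 100,000-follower threshold are illustrative assumptions, not a description of any real system:

from collections import defaultdict

followers_of = defaultdict(set)      # user_id -> set of follower ids (stand-in for the follows table)
timeline_cache = defaultdict(list)   # user_id -> list of (timestamp, post_id)

CELEBRITY_THRESHOLD = 100_000        # above this, skip fan-out and merge at read time instead

def publish_post(sender_id, post_id, timestamp):
    """Write path: copy the post into each follower's timeline, unless the sender is a celebrity."""
    followers = followers_of[sender_id]
    if len(followers) > CELEBRITY_THRESHOLD:
        return                       # celebrity posts are fetched on demand at read time
    for follower_id in followers:    # 1 post becomes len(followers) timeline writes
        timeline_cache[follower_id].append((timestamp, post_id))

def read_timeline(user_id, celebrity_posts=(), limit=1000):
    """Read path: cheap cache lookup, merged with any on-demand celebrity posts."""
    merged = list(timeline_cache[user_id]) + list(celebrity_posts)
    merged.sort(reverse=True)        # newest first, by timestamp
    return merged[:limit]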


3. Describing Performance

In plain English: "Fast" is subjective. Does it mean average speed, worst-case speed, or something else? To build reliable systems, we need precise ways to measure and discuss performance.

In technical terms: Performance metrics fall into two categories: response time (latency from request to response) and throughput (requests or data volume per second). Both are critical but measure different aspects of system behavior.

Metric          Description                                     Unit
Response time   Elapsed time from request to answer             Seconds (ms, μs)
Throughput      Number of requests or data volume per second    Requests (or bytes) per second

The relationship: These metrics are interconnected. When throughput increases (more concurrent requests), response time often degrades due to queueing—requests waiting for resources like CPU or network.

3.1. Latency and Response Time

In plain English: Response time is what the user experiences—the total wait. Latency is the waiting time when nothing is actively happening (like waiting in line).

In technical terms: Response time encompasses all delays; latency specifically measures idle waiting time when a request isn't being actively processed.

Term              Definition
Response time     What the client sees; includes all delays anywhere in the system
Service time      Duration the service is actively processing the request
Queueing delays   Time spent waiting for resources (CPU, network, disk)
Latency           Time during which a request is not being actively processed (latent, waiting)

Anatomy of a response time, end to end:

  1. Network latency (client → server)
  2. Queueing delay (waiting for CPU)
  3. Service time (active processing)
  4. Queueing delay (waiting for the database)
  5. Network latency (server → client)

Variability: Response time varies dramatically between identical requests due to:

  • Context switches (OS scheduling)
  • Network packet loss and TCP retransmission
  • Garbage collection pauses
  • Page faults (RAM → disk swapping)
  • Cache misses

💡 Insight

Head-of-line blocking causes queueing delays to amplify variability. Because a server can process only a limited number of requests concurrently, even a few slow requests hold up the ones queued behind them, creating cascading delays.

3.2. Average, Median, and Percentiles

In plain English: If nine requests take 100ms and one takes 10 seconds, the average is 1090ms—but that's misleading. Most users experienced 100ms, not 1090ms. We need better ways to describe distributions.

In technical terms: Response time is a distribution, not a single number. Percentiles describe this distribution more accurately than averages.

In an example distribution: p50 = 100 ms, p95 = 200 ms, p99 = 500 ms, p999 = 2 s. Half of all requests complete within 100 ms, 5% take longer than 200 ms, 1% take longer than 500 ms, and 0.1% take 2 s or more.

Percentile       Meaning
p50 (median)     Half of requests are faster, half are slower
p95              95% of requests are faster than this threshold
p99              99% of requests are faster than this threshold
p999             99.9% of requests are faster than this threshold
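
As a minimal sketch, these values can be computed directly from measured response times using the nearest-rank method; the sample data below mirrors the nine-fast-one-slow example above:

import math

def percentile(times_ms, p):
    """Return the response time at the p-th percentile (nearest-rank method)."""
    ranked = sorted(times_ms)
    rank = max(1, math.ceil(p / 100 * len(ranked)))  # 1-based rank
    return ranked[rank - 1]

# Nine requests around 100 ms plus one 10-second outlier.
times_ms = [95, 97, 98, 100, 100, 100, 101, 102, 103, 10_000]
print(sum(times_ms) / len(times_ms))   # mean ~1090 ms, skewed by the outlier
print(percentile(times_ms, 50))        # 100 ms: what a typical user sees
print(percentile(times_ms, 95))        # 10000 ms: with only 10 samples, the tail is the outlier
print(percentile(times_ms, 99))        # 10000 ms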

Why high percentiles matter:

  1. Valuable customers: Users with slow requests often have the most data (e.g., power users, large accounts)
  2. User experience: Consistently slow experiences drive users away
  3. Amazon example: Uses p999 for internal services because slowest users are often most valuable

💡 Insight

Tail latencies (high percentiles) directly impact user experience. A p99 of 2 seconds means 1 in 100 users waits 2+ seconds—enough to notice and complain. For high-traffic services, that's thousands of frustrated users daily.

3.3. Use of Response Time Metrics

In plain English: When one page load requires calling 10 backend services, even if each service is fast 99% of the time, there's a good chance at least one will be slow—making the whole page slow.

In technical terms: Tail latency amplification occurs when a user request requires multiple backend calls. The probability of encountering at least one slow call increases with the number of calls.

Tail latency amplification: one user request fans out to several backend services in parallel (A, B, C, D), and the slowest call determines the total time before the response can be sent to the user.

Example: If each service has p99 = 100ms, and you call 10 services in parallel, the probability that all respond within 100ms is only 90%—meaning p99 for the overall request is likely much worse.
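
This back-of-the-envelope check (assuming the backend calls are independent, which real systems often violate) shows how quickly the odds degrade:

# If each backend call meets its p99 independently with probability 0.99,
# the chance that every call in a fan-out does so is 0.99 ** n.
p_fast = 0.99
for n in (1, 10, 100):
    print(f"{n:>3} parallel calls: P(all within their p99) = {p_fast ** n:.3f}")
# 1 call    -> 0.990
# 10 calls  -> 0.904 (roughly 1 in 10 user requests hits at least one slow call)
# 100 calls -> 0.366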

Service Level Objectives (SLOs) and Agreements (SLAs):

Percentiles define expected performance in contracts:

  • SLO: "p95 response time will be under 200ms"
  • SLA: "We guarantee p99 under 1 second, or you get a refund"

💡 Insight

High-percentile guarantees become exponentially harder in distributed systems. Each additional service call compounds tail latency risk, which is why microservices often struggle with consistent performance.


4. Reliability and Fault Tolerance

In plain English: Reliability means the system keeps working correctly even when things go wrong—hard drives fail, networks glitch, and humans make mistakes. A reliable system handles these gracefully instead of crashing.

In technical terms: Reliability is the system's ability to continue providing required services despite faults. This requires fault tolerance mechanisms that prevent faults from escalating into failures.

Why it matters: Hardware fails constantly at scale. In a datacenter with 10,000 hard drives, if each has a 1% annual failure rate, you'll have 100 drive failures per year—nearly 2 per week. Systems must be designed to handle this.

For software, typical reliability expectations include:

  • Correct function: performs what users expect
  • Error tolerance: handles mistakes gracefully
  • Good performance: fast enough for the use case
  • Security: prevents unauthorized access

Key distinction:

Term      Definition
Fault     A component stops working correctly (e.g., a disk fails, the network drops packets)
Failure   The system as a whole stops providing the required service to users

Fault-tolerance mechanisms such as redundancy, retry logic, and monitoring sit between the two: they prevent component faults (a failed disk, a dropped network link, a crashed process) from turning into system-level failures (data loss, downtime, corruption).

💡 Insight

Reliability = continuing to work correctly, even when things go wrong. The goal isn't to eliminate all faults (impossible), but to prevent faults from becoming failures.

4.1. Fault Tolerance

In plain English: Fault tolerance means your system can lose a specific part and keep working. Like having a spare tire—you can drive even with a flat.

In technical terms: A system is fault-tolerant if it continues providing required services despite certain faults occurring. Components that, if failed, cause system failure are called single points of failure (SPOFs).

Scope limits: Fault tolerance is always bounded:

  • "Tolerates up to 2 concurrent disk failures"
  • "Survives single datacenter outage"
  • "Handles 3 node failures in a 5-node cluster"

Compare a SPOF architecture (a load balancer in front of a single database: if the database fails, the whole system fails) with a fault-tolerant one (a load balancer in front of a primary database with replicas: the system survives the failure of one database node).

Chaos Engineering: Deliberately injecting faults to test tolerance mechanisms (a minimal sketch follows the list below):

  • Randomly kill processes
  • Introduce network delays
  • Fill up disks
  • Simulate datacenter outages
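
Here is a minimal sketch of the idea: wrap a dependency call so that it randomly fails or slows down, then verify that callers retry, time out, or degrade gracefully. The failure rate and the wrapped function are illustrative assumptions, not taken from any particular chaos tool:

import random
import time

def inject_faults(func, failure_rate=0.05, max_delay_s=2.0):
    """Wrap a callable so it sometimes fails or responds slowly, like a flaky dependency."""
    def chaotic(*args, **kwargs):
        roll = random.random()
        if roll < failure_rate:
            raise ConnectionError("chaos: simulated dependency failure")
        if roll < 2 * failure_rate:
            time.sleep(random.uniform(0.0, max_delay_s))  # simulate a slow response
        return func(*args, **kwargs)
    return chaotic

# Example with a hypothetical dependency call:
# flaky_profile_lookup = inject_faults(lookup_user_profile, failure_rate=0.10)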

Fault tolerance as defense in depth:

  • Application layer: circuit breakers, graceful degradation, fallbacks
  • Replication layer: data redundancy, automatic failover
  • Network layer: retry logic, timeouts, load balancing
  • Infrastructure layer: RAID arrays, dual power supplies, redundant nodes
  • Physical layer: multiple data centers, geographic distribution

💡 Insight

Counter-intuitively, increasing fault rates can improve reliability. By deliberately triggering faults (chaos engineering), you discover weaknesses before they cause real outages. Netflix's Chaos Monkey randomly terminates production instances to ensure systems handle failures gracefully.

4.2. Hardware and Software Faults

In plain English: Hardware breaks predictably—drives fail at known rates, you can plan for it. Software bugs are trickier because they often affect all instances simultaneously.

In technical terms: Hardware faults are typically independent and random. Software faults are correlated—the same bug exists on every node running that code, causing simultaneous failures.

Hardware fault rates:

  • Hard drives: 2–5% fail per year
  • SSDs: 0.5–1% fail per year
  • CPU cores: roughly 1 in 1,000 occasionally computes wrong results
  • RAM: bits occasionally corrupted by cosmic rays

Traditional response: Add redundancy:

  • RAID for disk failures
  • Dual power supplies
  • Backup generators
  • Hot-swappable components

Software faults are more insidious:

Fault type            Example                                          Impact
Cascading failures    One overloaded service causes others to fail     Widespread outage
Resource exhaustion   A memory leak consumes all RAM                   All nodes fail simultaneously
Dependency failure    An external API goes down                        All dependent services affected
Retry storms          Failed requests are retried, increasing load     System collapse under load

How a software fault propagates:

  1. Bug deployed: the same bug runs on every node
  2. Triggered: a specific input causes a crash
  3. Simultaneous failure: all nodes crash together
  4. System down: redundancy does not help

Mitigation strategies:

  • Careful testing (unit, integration, chaos)
  • Process isolation (containers, VMs)
  • Crash and restart (let it fail fast)
  • Avoid feedback loops (exponential backoff; see the sketch after this list)
  • Production monitoring and alerting
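
As a minimal sketch of the backoff idea (the exception types, limits, and the retried operation are illustrative assumptions):

import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay_s=0.1, max_delay_s=5.0):
    """Retry a failing operation, doubling the wait after each failure and adding jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # give up instead of retrying forever
            delay = min(max_delay_s, base_delay_s * (2 ** attempt))
            time.sleep(random.uniform(0.0, delay))  # "full jitter" spreads retries out

# Example with a hypothetical flaky call:
# result = call_with_backoff(lambda: fetch_order("order-123"))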

💡 Insight

Hardware faults are independent; software faults are correlated. This is why redundancy alone doesn't guarantee reliability—three servers running buggy code will all fail the same way. You need defense in depth: testing, isolation, monitoring, and graceful degradation.

4.3. Humans and Reliability

In plain English: Humans make mistakes—it's inevitable. Blaming people doesn't help. Instead, design systems that make mistakes harder to make and easier to recover from.

In technical terms: Human error is the leading cause of outages, but it's a symptom of poor system design, not a root cause. Sociotechnical system design can minimize human-induced failures.

The data: One study found:

  • Configuration changes by operators: leading cause of outages
  • Hardware faults: only 10–25% of outages

Why "human error" is misleading: Blaming individuals ignores systemic issues. When humans make mistakes, it usually indicates:

  • Unclear interfaces
  • Inadequate training
  • Time pressure
  • Poor tooling
  • Complex systems

Defense in depth against human error:

  1. Design: well-designed interfaces encourage correct use
  2. Test: thorough testing catches mistakes early
  3. Deploy: gradual rollouts limit the blast radius
  4. Monitor: observability detects issues quickly
  5. Recover: easy rollback mechanisms undo bad changes

Technical measures to minimize human mistakes:

Measure               How it helps
Testing               Unit, integration, and end-to-end tests catch bugs before production
Rollback mechanisms   Quickly revert bad changes
Gradual rollouts      Deploy to a small percentage of users first and catch issues early
Monitoring            Detect anomalies and alert operators
Good interfaces       Make correct actions obvious and dangerous actions difficult
Documentation         Clear runbooks for common operations

Blameless postmortems: After incidents, teams share what happened without fear of punishment. This encourages honesty and systemic learning rather than hiding mistakes.

💡 Insight

Blame is counterproductive. When incidents happen, ask "What about our system allowed this to occur?" instead of "Who did this?" Culture and tooling that treat mistakes as learning opportunities build more reliable systems than punishment-based approaches.

4.4. How Important Is Reliability?

In plain English: Even "boring" business apps need reliability. Bugs don't just annoy users—they can destroy lives and businesses.

In technical terms: Reliability failures have cascading impacts: lost revenue, damaged reputation, legal liability, and in severe cases, ruined lives. The cost of unreliability far exceeds the cost of building reliable systems.

Real-world consequences:

  • E-commerce: outages mean lost revenue and reputation damage
  • Business apps: lost productivity across the organization
  • Healthcare: patient safety is at risk
  • Financial: regulatory fines and customer lawsuits

Case Study: Post Office Horizon Scandal

Between 1999 and 2019, hundreds of Post Office branch managers in Britain were convicted of theft or fraud because accounting software (Horizon) showed shortfalls in their accounts. Many were imprisoned, went bankrupt, or died before vindication.

Horizon scandal timeline:

  1. 1999–2019: buggy software shows false shortfalls
  2. Prosecutions: hundreds convicted based on software "evidence"
  3. Lives ruined: prison, bankruptcy, suicides
  4. 2019 onward: convictions overturned, scandal revealed

Eventually discovered: Many shortfalls were due to software bugs, not theft. The system was unreliable, but management trusted it over people.

💡 Insight

Unreliable software has real human costs. The Horizon scandal shows how software bugs can destroy lives when systems are trusted blindly. Reliability isn't just about uptime—it's about responsibility to the people who depend on your systems.


5. Scalability

In plain English: Scalability means your system can handle growth without falling over. It's not a binary property—you don't "have" scalability. Instead, you plan for specific growth patterns and know when you'll hit limits.

In technical terms: Scalability describes a system's ability to maintain performance as load increases. It requires understanding current load, predicting growth, and having a plan to add capacity when needed.

Why it matters: Even reliable systems degrade under increased load. Without scalability planning, success can kill your system—viral growth crashes your service just as users discover it.

Scalability is not:

  • ❌ "This system is scalable"
  • ❌ "We built for infinite scale"
  • ❌ "It scales horizontally"

Scalability is:

  • ✅ "If daily users grow 10x, we'll need 5 more DB replicas"
  • ✅ "We'll hit limits at 50k concurrent users; we're at 20k now"
  • ✅ "Adding 10 nodes doubles our write capacity"

Key scalability questions:

  • Growth pattern: how will the system grow?
  • Resource impact: how does load affect performance?
  • Capacity planning: when do we add resources?
  • Cost analysis: what does scaling cost?

5.1. Describing Load

In plain English: Before you can scale, you need to measure what "load" means for your system. Is it requests per second? Users online? Data volume? The answer shapes your scaling strategy.

In technical terms: Load parameters quantify current system stress. Common metrics include throughput (requests/sec), concurrency (active users), and data volume (GB/day).

Common load parameters:

Metric             Example                        Use case
Requests/second    10,000 API calls/sec           Web services
Data volume        500 GB of new data/day         Data pipelines
Concurrent users   100,000 simultaneous users     Gaming, streaming
Transactions       5,000 checkouts/hour           E-commerce

Two key scalability questions:

  1. Fixed resources: if load increases and resources stay the same, how does performance change?
  2. Fixed performance: if load increases, how much must resources grow to keep performance the same?

Linear scalability: If you can double resources to handle double the load with the same performance, you have linear scalability—the holy grail.

💡 Insight

Load description is application-specific. For a social network, it might be "posts/second" and "timeline reads/second." For an analytics system, it's "queries/hour" and "data ingestion rate." Understanding your specific load parameters is the first step toward effective scaling.

5.2. Shared-Memory, Shared-Disk, and Shared-Nothing Architecture

In plain English: There are three ways to add capacity: buy a bigger computer (vertical), connect multiple computers to the same storage (shared-disk), or give each computer its own everything (horizontal).

In technical terms: Scaling architectures differ in how they share resources. Each has distinct cost, complexity, and scalability trade-offs.

Scaling architectures:

  • Shared-memory (vertical): a single machine with more CPU, more RAM, and more disk
  • Shared-disk: multiple machines, each with its own CPU and RAM, attached to shared storage
  • Shared-nothing (horizontal): independent nodes, each with its own CPU, RAM, and disk

                    Vertical scaling                           Horizontal scaling
Scalability limit   CPU/RAM bottlenecks                        Potentially linear
Cost curve          Superlinear (big machines are expensive)   Linear (commodity hardware)
Fault tolerance     Single point of failure                    Survives node failures
Complexity          Simple (one machine)                       Complex (distributed)
Coordination        None needed                                Consensus and sharding required

Shared-Nothing advantages:

  • Linear scalability potential
  • Cost-effective commodity hardware
  • Fault tolerance across nodes
  • Elastic—add/remove nodes dynamically
  • Geographic distribution

Shared-Nothing challenges:

  • Complex distributed system logic
  • Data sharding required (see the sketch after this list)
  • Network latency
  • Partial failures
  • Eventual consistency
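
To make the sharding requirement concrete, here is a minimal sketch of key-based partitioning across independent nodes. The node list and modulo placement are illustrative; real systems usually prefer consistent hashing or range partitioning so that adding a node does not reshuffle most of the data:

import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]  # independent shared-nothing nodes

def node_for_key(key: str) -> str:
    """Route a record to a node by hashing its key (simple modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Each node owns a disjoint subset of keys, so reads and writes for different
# keys can proceed in parallel on different machines.
print(node_for_key("user:42"), node_for_key("user:43"))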

💡 Insight

There's no universal winner. Vertical scaling is simpler but limited. Horizontal scaling can scale infinitely but adds massive complexity. Modern systems often use both: vertically scale nodes to delay the complexity of distribution, then horizontally scale when necessary.

5.3. Principles for Scalability

In plain English: Scalability isn't one-size-fits-all. A system that handles millions of tiny requests needs different architecture than one processing huge analytics queries. The key is breaking work into independent pieces.

In technical terms: Scalable architectures decompose systems into loosely-coupled components that can operate independently. This enables parallel processing and localized failures.

Core principles:

  • Independent components: break the system into parts that don't depend on each other
  • Simplicity first: don't distribute until necessary
  • Measure everything: know your bottlenecks before scaling
  • Plan for growth: understand your limits and growth trajectory

Examples of decomposition:

Pattern             Description                                       Benefit
Microservices       Split the application into independent services   Teams work independently
Sharding            Partition data across nodes by key                Parallel data processing
Stream processing   Break large jobs into small, independent tasks    Continuous, incremental processing
Caching             Store precomputed results                         Reduced load on the primary system

💡 Insight

Scalability is not automatic. There's no "scalable architecture" you can copy. Systems that scale well at social networks (many tiny writes) fail at analytics (few huge queries). Design for your specific load patterns, and don't make things more complicated than necessary—a single-machine database is often better than a distributed mess.


6. Maintainability

In plain English: Software doesn't wear out like machines, but it does "age" as requirements change, platforms evolve, and knowledge decays. Maintainability means designing systems that stay easy to work with over time.

In technical terms: Maintainability encompasses operability (ease of running), simplicity (ease of understanding), and evolvability (ease of changing). These determine the long-term cost and viability of systems.

Why it matters: Most software cost isn't initial development—it's ongoing maintenance. A system used for years will spend far more on bug fixes, feature additions, and operational costs than its original build.

Software lifecycle costs are roughly 20% initial development and 80% ongoing maintenance.

Maintenance activities:

  • Fixing bugs
  • Keeping systems operational
  • Investigating failures
  • Adapting to new platforms
  • Modifying for new use cases
  • Repaying technical debt
  • Adding new features

The three pillars of maintainability:

  • Operability: easy to keep running smoothly
  • Simplicity: easy for new engineers to understand
  • Evolvability: easy to change in the future

6.1. Operability: Making Life Easy for Operations

In plain English: Operations teams keep systems running—deploying updates, handling incidents, scaling resources. Good operability means routine tasks are easy, freeing operators to focus on high-value work.

In technical terms: Operability is the ease with which operators can maintain a system's health. Well-designed systems provide good observability, automation, and sensible defaults.

"Good operations can often work around the limitations of bad software, but good software cannot run reliably with bad operations."

What good operability provides:

  • Monitoring: visibility into system health
  • Automation: routine tasks don't need humans
  • Documentation: a clear operational model
  • Predictability: minimal surprises
  • Self-healing: automatic recovery where safe
  • Good defaults: works well out of the box

Operations responsibilities:

Task         Good operability                                               Poor operability
Monitoring   Dashboards show key metrics; alerts fire before users notice   Logs scattered, no alerts; issues discovered from user complaints
Deployment   Automated, gradual rollout, easy rollback                      Manual steps, all-or-nothing, no rollback
Scaling      Auto-scaling based on metrics                                  Manual server provisioning
Incidents    Clear runbooks, automatic diagnostics                          Guesswork, tribal knowledge

💡 Insight

Operability enables reliability. Even the best-designed system will fail if operators can't understand, monitor, or repair it. Invest in observability and automation—they're not luxuries, they're requirements for reliable production systems.

6.2. Simplicity: Managing Complexity

In plain English: Simple code is easy to understand. Complex code is a tangled mess where changing one thing breaks three others. As projects grow, fighting complexity becomes critical.

In technical terms: Complexity is the enemy of maintainability. Systems mired in complexity—"big balls of mud"—resist change and harbor bugs. Simplicity through abstraction manages this complexity.

Symptoms of excessive complexity:

Signs of a big ball of mud:

  • Explosion of state space: too many possible states
  • Tight coupling: changing A breaks B and C
  • Tangled dependencies: circular, unclear relationships
  • Inconsistent naming: the same thing called different names in different places
  • Special cases: hacks and workarounds everywhere
  • No onboarding path: it takes months to understand the system

Abstraction: The tool for managing complexity

Without abstraction, every detail is exposed: machine code, memory management, disk I/O, network protocols. With abstraction, you program against a high-level language or a SQL database, and the implementation details stay hidden.

Examples of good abstractions:

  • High-level languages hide machine code details
  • SQL hides on-disk data structures and query optimization
  • HTTP hides TCP packet management
  • React hides DOM manipulation

Building for simplicity:

Principle                      How it helps
Clear interfaces               Hide implementation details
Consistent conventions         Reduce cognitive load
Remove accidental complexity   Keep only the essential complexity
Avoid premature optimization   Simpler code beats "clever" code

💡 Insight

Simplicity is not simplistic. A good abstraction hides enormous complexity behind a clean interface (like SQL hiding B-trees and query planners). The goal isn't to avoid complexity entirely—it's to manage it through layers of abstraction so each layer is simple.

6.3. Evolvability: Making Change Easy

In plain English: Requirements never stop changing. Features get added, platforms evolve, business needs shift. Evolvable systems adapt to change easily instead of resisting it.

In technical terms: Evolvability (also called extensibility or modifiability) is the ease of making changes to a system. It's closely linked to simplicity—loosely-coupled, simple systems are easier to modify.

Why requirements change:

  • User needs: the features users want evolve
  • Business goals: company strategy shifts
  • Technology: new platforms emerge
  • Scale: growth demands different solutions

Factors that enable evolvability:

Factor                Description
Loose coupling        Components can change independently
Good abstractions     Implementation changes don't affect the interface
Comprehensive tests   Confidence that changes don't break things
Reversibility         Easy to undo changes if needed
Clear documentation   Understand the system well enough to change it safely

Reversibility enables flexibility: irreversible changes demand careful planning, carry the risk of a wrong choice, and invite paralysis by analysis, whereas reversible changes let you experiment freely, learn from mistakes, and iterate quickly.

💡 Insight

Irreversibility kills evolvability. When changes can't be undone, teams become paralyzed—every decision feels permanent. Minimize irreversibility through feature flags, database migrations (not destructive changes), and architectures that support gradual transitions. The easier it is to reverse a decision, the faster you can evolve.
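
As one illustration of keeping changes reversible, a percentage-based feature flag lets you roll a change out gradually and roll it back with a configuration change rather than a redeploy. The flag name, percentage, and storage below are hypothetical:

import hashlib

ROLLOUT_PERCENT = {"new_timeline_ranking": 5}  # hypothetical flag, enabled for 5% of users

def is_enabled(flag: str, user_id: str) -> bool:
    """Deterministically bucket each user into 0-99 so the same users stay in the rollout."""
    bucket = int(hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < ROLLOUT_PERCENT.get(flag, 0)

if is_enabled("new_timeline_ranking", "user:42"):
    pass  # new code path, still behind the flag
else:
    pass  # old, known-good behavior; rollback is just setting the percentage to 0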


7. Summary

In plain English: This chapter taught you how to think about the qualities that make systems good beyond just functionality: how fast they respond, whether they stay up when things break, if they can grow with demand, and whether they'll become a maintenance nightmare.

In technical terms: Nonfunctional requirements—performance, reliability, scalability, and maintainability—are as critical as functional requirements. Understanding how to measure and optimize these qualities is essential for building production-grade systems.

The Four Pillars

Performance: measured via percentiles (p50, p95, p99) and throughput
  • Response time distributions
  • Tail latency amplification
  • SLAs based on percentiles

Reliability: continues working correctly when things go wrong
  • Fault tolerance mechanisms
  • Hardware and software faults
  • Human error mitigation

Scalability: maintains performance as load increases
  • Describe load parameters
  • Vertical vs. horizontal scaling
  • Independent components

Maintainability: easy to operate, understand, and evolve
  • Operability via automation
  • Simplicity via abstraction
  • Evolvability via loose coupling

Key Takeaways

Performance:

  • Use percentiles (p50, p95, p99) instead of averages to describe response time
  • High percentiles matter—tail latencies affect valuable users
  • Tail latency amplification occurs when requests require multiple backend calls

Reliability:

  • Distinguish faults (component failures) from failures (system-wide breakdown)
  • Fault tolerance prevents faults from becoming failures
  • Hardware faults are independent; software faults are correlated
  • Human error is a symptom, not a cause—design systems to minimize it

Scalability:

  • Scalability is not binary—describe specific growth patterns and limits
  • Vertical scaling (bigger machines) is simple but limited
  • Horizontal scaling (more machines) scales linearly but adds complexity
  • Break systems into independent components for better scalability

Maintainability:

  • Most cost is ongoing maintenance, not initial development
  • Operability: Make it easy to run
  • Simplicity: Use abstraction to manage complexity
  • Evolvability: Minimize irreversibility to enable change

💡 Insight

These four qualities are interconnected. A system that's hard to maintain will eventually become unreliable as tech debt accumulates. A system that can't scale will have poor performance under load. A system that's too complex resists both scaling and evolution. Good architecture considers all dimensions together, making conscious trade-offs rather than ignoring any one aspect.

