Chapter 2. Defining Nonfunctional Requirements
The invisible requirements that make or break your system
"The Internet was done so well that most people think of it as a natural resource like the Pacific Ocean, rather than something that was man-made. When was the last time a technology with a scale like that was so error-free?"
— Alan Kay, in an interview with Dr. Dobb's Journal (2012)
Table of Contents
- Introduction
- Case Study: Social Network Home Timelines
- Describing Performance
- Reliability and Fault Tolerance
- Scalability
- Maintainability
- Summary
1. Introduction
In plain English: You wouldn't build a house focusing only on the floor plan while ignoring whether it can withstand storms, stay warm in winter, or be maintained. Similarly, software needs more than just functionality—it must be fast, reliable, and maintainable.
In technical terms: Nonfunctional requirements define system qualities like performance, reliability, scalability, and maintainability. Unlike functional requirements (what the system does), nonfunctional requirements describe how well it does it.
Why it matters: An app that works perfectly in theory but crashes constantly, responds slowly, or becomes unmaintainable is worthless. These "invisible" requirements often determine whether a system succeeds or fails in production.
💡 Insight
Nonfunctional requirements are often unstated because they seem "obvious," but they're just as critical as features. A slow, unreliable app might as well not exist—users will abandon it regardless of its features.
2. Case Study: Social Network Home Timelines
In plain English: Imagine building a Twitter-like feed where millions of people post and read updates every second. Should you build the feed when someone opens the app, or prepare it ahead of time? This simple question reveals fundamental trade-offs in system design.
In technical terms: Social network timelines demonstrate the classic read vs. write optimization trade-off. Computing timelines on-demand optimizes writes but slows reads. Precomputing timelines optimizes reads but increases write complexity.
Why it matters: This pattern—choosing when to do work—appears everywhere in data systems. Understanding it helps you make similar decisions across different domains.
[Figure: posting rate for the case study: about 5,700 posts/second on average, with a substantially higher peak load.]
2.1. Representing Users, Posts, and Follows
In plain English: You could store everything in a database and build each person's feed by searching for their friends' posts every time they open the app. Simple, but slow.
In technical terms: A relational schema with users, posts, and follows tables supports on-demand timeline generation via joins. This approach computes timelines at read time.
Let's say the main read operation is the home timeline, displaying recent posts by people you follow. The SQL query:
SELECT posts.*, users.* FROM posts
JOIN follows ON posts.sender_id = follows.followee_id
JOIN users ON posts.sender_id = users.id
WHERE follows.follower_id = current_user
ORDER BY posts.timestamp DESC
LIMIT 1000
The problem with polling:
- 10 million online users
- Polling every 5 seconds
- = 2 million queries/second
- Each query fetches posts from ~200 followed users
- = 400 million post lookups/second
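As a sanity check, here is the same back-of-the-envelope arithmetic as a runnable sketch; the user count, polling interval, and follow count are the assumptions listed above, not measurements:

```python
# Back-of-the-envelope cost of computing home timelines at read time.
online_users = 10_000_000      # users polling for new posts
poll_interval_s = 5            # each client refreshes every 5 seconds
followed_per_user = 200        # average number of accounts each user follows

timeline_queries_per_s = online_users / poll_interval_s
post_lookups_per_s = timeline_queries_per_s * followed_per_user

print(f"{timeline_queries_per_s:,.0f} timeline queries/second")  # 2,000,000
print(f"{post_lookups_per_s:,.0f} post lookups/second")          # 400,000,000
```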
💡 Insight
The query is expensive because it fetches and merges posts from 200 users per request. At scale, this read-time computation becomes a bottleneck—400 million lookups/second is prohibitive for most databases.
2.2. Materializing and Updating Timelines
In plain English: Instead of searching for your friends' posts when you open the app, what if we prepared your feed ahead of time—like delivering mail to your mailbox instead of making you go to every friend's house to check for letters?
In technical terms: Materialization precomputes query results and stores them. When a user posts, we fan out that post to all followers' timeline caches. This shifts work from read time to write time.
1 write → 200 writes (fan-out factor = 200)
The fan-out calculation:
- 5,700 posts/second (average)
- × 200 followers per post (fan-out factor)
- = 1.14 million timeline writes/second
Trade-off analysis:

| Approach | Read cost | Write cost |
|---|---|---|
| Compute timeline at read time | High (~400 million post lookups/sec in this example) | Low (one insert per post, ~5,700/sec) |
| Materialize timeline at write time | Low (one cached timeline read per request) | High (~1.14 million timeline writes/sec due to fan-out) |
💡 Insight
Materialization is a read-write trade-off: precomputing views shifts work from read time to write time. This is beneficial when reads vastly outnumber writes—a common pattern in user-facing applications.
The celebrity problem: Users with 100M followers create extreme fan-out. A single post triggers 100M timeline updates—impractical. Real implementations use hybrid approaches: materialize for normal users, compute on-demand for celebrities.
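Here is a minimal sketch of that hybrid approach. An in-memory dict stands in for the timeline cache, and the follower threshold that marks an account as a "celebrity" is an arbitrary illustrative value, not what any real social network uses:

```python
from collections import defaultdict, deque

TIMELINE_LEN = 1000            # keep only the most recent entries per user
CELEBRITY_THRESHOLD = 100_000  # skip fan-out above this follower count (arbitrary)

timelines = defaultdict(lambda: deque(maxlen=TIMELINE_LEN))  # follower_id -> recent post ids
followers = defaultdict(set)         # sender_id -> set of follower ids
celebrity_posts = defaultdict(list)  # sender_id -> post ids, merged at read time

def publish(sender_id, post_id):
    """Write path: fan the post out to every follower's cached timeline,
    unless the sender has so many followers that fan-out is impractical."""
    if len(followers[sender_id]) >= CELEBRITY_THRESHOLD:
        celebrity_posts[sender_id].append(post_id)   # defer to read time
        return
    for follower_id in followers[sender_id]:
        timelines[follower_id].appendleft(post_id)   # 1 write becomes N writes

def read_timeline(user_id, followed_celebrities):
    """Read path: the precomputed cache plus an on-demand merge of celebrity posts.
    A real system would merge by timestamp; this sketch just concatenates."""
    posts = list(timelines[user_id])
    for celeb_id in followed_celebrities:
        posts.extend(celebrity_posts[celeb_id][-10:])  # most recent few
    return posts
```

The write path does the fan-out for ordinary accounts, while the read path merges in celebrity posts on demand, which keeps the worst-case fan-out bounded.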
3. Describing Performance
In plain English: "Fast" is subjective. Does it mean average speed, worst-case speed, or something else? To build reliable systems, we need precise ways to measure and discuss performance.
In technical terms: Performance metrics fall into two categories: response time (latency from request to response) and throughput (requests or data volume per second). Both are critical but measure different aspects of system behavior.
| Metric | Description | Unit |
|---|---|---|
| Response time | Elapsed time from request to answer | Seconds (ms, μs) |
| Throughput | Number of requests or data volume per second | "per second" |
The relationship: These metrics are interconnected. When throughput increases (more concurrent requests), response time often degrades due to queueing—requests waiting for resources like CPU or network.
3.1. Latency and Response Time
In plain English: Response time is what the user experiences—the total wait. Latency is the waiting time when nothing is actively happening (like waiting in line).
In technical terms: Response time encompasses all delays; latency specifically measures idle waiting time when a request isn't being actively processed.
| Term | Definition |
|---|---|
| Response time | What the client sees; includes all delays anywhere in the system |
| Service time | Duration the service is actively processing the request |
| Queueing delays | Time waiting for resources (CPU, network, disk) |
| Latency | Time during which a request is not being actively processed (latent, waiting) |
Variability: Response time varies dramatically between identical requests due to:
- Context switches (OS scheduling)
- Network packet loss and TCP retransmission
- Garbage collection pauses
- Page faults (RAM → disk swapping)
- Cache misses
💡 Insight
Head-of-line blocking causes queueing delays to amplify variability. Since servers process limited concurrent requests, even a few slow requests block subsequent ones, creating cascading delays.
3.2. Average, Median, and Percentiles
In plain English: If nine requests take 100ms and one takes 10 seconds, the average is 1090ms—but that's misleading. Most users experienced 100ms, not 1090ms. We need better ways to describe distributions.
In technical terms: Response time is a distribution, not a single number. Percentiles describe this distribution more accurately than averages.
[Figure: response-time percentiles. Box size represents the percentage of users experiencing that latency or better: 50%, 5%, 1%, and 0.1%.]
| Percentile | Meaning |
|---|---|
| p50 (median) | Half of requests are faster, half are slower |
| p95 | 95% of requests are faster than this threshold |
| p99 | 99% of requests are faster than this threshold |
| p999 | 99.9% of requests are faster than this threshold |
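Computing these from a sample of measured response times is straightforward. A minimal sketch using the nearest-rank method; the simulated latencies are made up for illustration:

```python
import math
import random

def percentile(samples, p):
    """Nearest-rank percentile: the value below which roughly p% of samples fall."""
    ranked = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ranked)))
    return ranked[rank - 1]

# Simulated response times in milliseconds: mostly fast, with a slow tail.
random.seed(0)
response_times = [random.gauss(100, 20) for _ in range(990)]
response_times += [random.uniform(500, 2000) for _ in range(10)]

for p in (50, 95, 99, 99.9):
    print(f"p{p}: {percentile(response_times, p):.0f} ms")
```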
Why high percentiles matter:
- Valuable customers: Users with slow requests often have the most data (e.g., power users, large accounts)
- User experience: Consistently slow experiences drive users away
- Amazon example: Uses p999 for internal services because slowest users are often most valuable
💡 Insight
Tail latencies (high percentiles) directly impact user experience. A p99 of 2 seconds means 1 in 100 users waits 2+ seconds—enough to notice and complain. For high-traffic services, that's thousands of frustrated users daily.
3.3. Use of Response Time Metrics
In plain English: When one page load requires calling 10 backend services, even if each service is fast 99% of the time, there's a good chance at least one will be slow—making the whole page slow.
In technical terms: Tail latency amplification occurs when a user request requires multiple backend calls. The probability of encountering at least one slow call increases with the number of calls.
Example: If each service has p99 = 100ms and a page load calls 10 services in parallel, the probability that all of them respond within 100ms is 0.99^10 ≈ 90%. In other words, roughly 1 in 10 page loads is slower than at least one service's p99, so the end-to-end p99 is much worse than 100ms.
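The same arithmetic generalizes to any number of backend calls, under the simplifying assumption that the calls are independent (in practice slow calls are often correlated, which can make things even worse):

```python
# Probability that a request touching n backends sees no call slower than
# each backend's own p99, assuming the calls are independent.
def prob_all_fast(n_backends, per_call_quantile=0.99):
    return per_call_quantile ** n_backends

for n in (1, 5, 10, 50, 100):
    p = prob_all_fast(n)
    print(f"{n:3d} backend calls: {p:.1%} of requests have no slow call")
# 10 calls -> ~90.4%; 100 calls -> ~36.6%
```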
Service Level Objectives (SLOs) and Agreements (SLAs):
Percentiles define expected performance in contracts:
- SLO: "p95 response time will be under 200ms"
- SLA: "We guarantee p99 under 1 second, or you get a refund"
💡 Insight
High-percentile guarantees become exponentially harder in distributed systems. Each additional service call compounds tail latency risk, which is why microservices often struggle with consistent performance.
4. Reliability and Fault Tolerance
In plain English: Reliability means the system keeps working correctly even when things go wrong—hard drives fail, networks glitch, and humans make mistakes. A reliable system handles these gracefully instead of crashing.
In technical terms: Reliability is the system's ability to continue providing required services despite faults. This requires fault tolerance mechanisms that prevent faults from escalating into failures.
Why it matters: Hardware fails constantly at scale. In a datacenter with 10,000 hard drives, if each has a 1% annual failure rate, you'll have 100 drive failures per year—nearly 2 per week. Systems must be designed to handle this.
For software, typical reliability expectations include:
- The application performs the function the user expected
- It tolerates users making mistakes or using the software in unexpected ways
- Its performance is good enough for the required use case, under the expected load and data volume
- The system prevents unauthorized access and abuse
Key distinction:
| Term | Definition |
|---|---|
| Fault | A component stops working correctly (e.g., disk fails, network drops) |
| Failure | The system as a whole stops providing required service to users |
💡 Insight
Reliability = continuing to work correctly, even when things go wrong. The goal isn't to eliminate all faults (impossible), but to prevent faults from becoming failures.
4.1. Fault Tolerance
In plain English: Fault tolerance means your system can lose a specific part and keep working. Like having a spare tire—you can drive even with a flat.
In technical terms: A system is fault-tolerant if it continues providing required services despite certain faults occurring. Components that, if failed, cause system failure are called single points of failure (SPOFs).
Scope limits: Fault tolerance is always bounded:
- "Tolerates up to 2 concurrent disk failures"
- "Survives single datacenter outage"
- "Handles 3 node failures in a 5-node cluster"
Chaos Engineering: Deliberately injecting faults to test tolerance mechanisms:
- Randomly kill processes
- Introduce network delays
- Fill up disks
- Simulate datacenter outages
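As a toy illustration of the idea, a wrapper can randomly inject latency or errors into calls so that timeouts and fallbacks are exercised continuously. The failure rate, delay range, and exception type below are arbitrary placeholders, not the API of any real chaos-engineering tool:

```python
import functools
import random
import time

def chaos(failure_rate=0.01, max_extra_delay_s=0.5):
    """Randomly fail or slow down the wrapped call to exercise fault handling."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise ConnectionError("chaos: injected fault")
            time.sleep(random.uniform(0, max_extra_delay_s))  # injected latency
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@chaos(failure_rate=0.05)
def fetch_profile(user_id):
    return {"id": user_id, "name": "example"}   # stand-in for a real backend call
```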
Common fault-tolerance mechanisms include:
- Circuit breakers, graceful degradation, fallbacks
- Data redundancy, automatic failover
- Retry logic, timeouts, load balancing
- RAID arrays, dual power supplies, redundant nodes
- Multiple data centers, geographic distribution
💡 Insight
Counter-intuitively, increasing fault rates can improve reliability. By deliberately triggering faults (chaos engineering), you discover weaknesses before they cause real outages. Netflix's Chaos Monkey randomly terminates production instances to ensure systems handle failures gracefully.
4.2. Hardware and Software Faults
In plain English: Hardware breaks predictably—drives fail at known rates, you can plan for it. Software bugs are trickier because they often affect all instances simultaneously.
In technical terms: Hardware faults are typically independent and random. Software faults are correlated—the same bug exists on every node running that code, causing simultaneous failures.
Hardware faults (failing disks, faulty RAM, power outages) happen constantly in a large fleet. The traditional response is to add redundancy:
- RAID for disk failures
- Dual power supplies
- Backup generators
- Hot-swappable components
Software faults are more insidious:
| Fault Type | Example | Impact |
|---|---|---|
| Cascading failures | One overloaded service causes others to fail | Widespread outage |
| Resource exhaustion | Memory leak consumes all RAM | All nodes fail simultaneously |
| Dependency failure | External API goes down | All dependent services affected |
| Retry storms | Failed requests retry, increasing load | System collapse under load |
Mitigation strategies:
- Careful testing (unit, integration, chaos)
- Process isolation (containers, VMs)
- Crash and restart (let it fail fast)
- Avoid feedback loops (exponential backoff)
- Production monitoring and alerting
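The "avoid feedback loops" point in the list above usually means retrying with exponential backoff and jitter, so that clients back off rather than piling more load onto an already struggling service. A minimal sketch; the delay constants and exception types are placeholders:

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay_s=0.1, max_delay_s=10.0):
    """Retry fn(), doubling the delay after each failure and adding jitter
    so that many clients don't retry in lockstep (a retry storm)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise                                 # give up, surface the fault
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            time.sleep(random.uniform(0, delay))      # "full jitter" variant
```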
💡 Insight
Hardware faults are independent; software faults are correlated. This is why redundancy alone doesn't guarantee reliability—three servers running buggy code will all fail the same way. You need defense in depth: testing, isolation, monitoring, and graceful degradation.
4.3. Humans and Reliability
In plain English: Humans make mistakes—it's inevitable. Blaming people doesn't help. Instead, design systems that make mistakes harder to make and easier to recover from.
In technical terms: Human error is the leading cause of outages, but it's a symptom of poor system design, not a root cause. Sociotechnical system design can minimize human-induced failures.
The data: One study of large internet services found that:
- Configuration changes by operators were the leading cause of outages
- Hardware faults contributed to only 10–25% of outages
Why "human error" is misleading: Blaming individuals ignores systemic issues. When humans make mistakes, it usually indicates:
- Unclear interfaces
- Inadequate training
- Time pressure
- Poor tooling
- Complex systems
Technical measures to minimize human mistakes:
| Measure | How It Helps |
|---|---|
| Testing | Unit, integration, and end-to-end tests catch bugs before production |
| Rollback mechanisms | Quickly revert bad changes |
| Gradual rollouts | Deploy to small percentage first, catch issues early |
| Monitoring | Detect anomalies and alert operators |
| Good interfaces | Make correct actions obvious, dangerous actions difficult |
| Documentation | Clear runbooks for common operations |
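As a sketch of the gradual-rollouts row above: hash a stable identifier into a bucket and compare it against the current rollout percentage, so the same users stay enrolled as the percentage grows. The hash choice, bucket count, and feature name are illustrative:

```python
import hashlib

def in_rollout(user_id: str, feature: str, rollout_percent: float) -> bool:
    """Deterministically place a user in the first `rollout_percent` of buckets."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000          # 0..9999, stable per user+feature
    return bucket < rollout_percent * 100      # e.g. 5% -> buckets 0..499

# Roll "new_checkout" out to 5% of users first, watch dashboards, then increase.
print(in_rollout("user-42", "new_checkout", 5.0))
```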
Blameless postmortems: After incidents, teams share what happened without fear of punishment. This encourages honesty and systemic learning rather than hiding mistakes.
💡 Insight
Blame is counterproductive. When incidents happen, ask "What about our system allowed this to occur?" instead of "Who did this?" Culture and tooling that treat mistakes as learning opportunities build more reliable systems than punishment-based approaches.
4.4. How Important Is Reliability?
In plain English: Even "boring" business apps need reliability. Bugs don't just annoy users—they can destroy lives and businesses.
In technical terms: Reliability failures have cascading impacts: lost revenue, damaged reputation, legal liability, and in severe cases, ruined lives. The cost of unreliability far exceeds the cost of building reliable systems.
Real-world consequences:
Case Study: Post Office Horizon Scandal
Between 1999 and 2019, hundreds of Post Office branch managers in Britain were convicted of theft or fraud because accounting software (Horizon) showed shortfalls in their accounts. Many were imprisoned, went bankrupt, or died before vindication.
Eventually discovered: Many shortfalls were due to software bugs, not theft. The system was unreliable, but management trusted it over people.
💡 Insight
Unreliable software has real human costs. The Horizon scandal shows how software bugs can destroy lives when systems are trusted blindly. Reliability isn't just about uptime—it's about responsibility to the people who depend on your systems.
5. Scalability
In plain English: Scalability means your system can handle growth without falling over. It's not a binary property—you don't "have" scalability. Instead, you plan for specific growth patterns and know when you'll hit limits.
In technical terms: Scalability describes a system's ability to maintain performance as load increases. It requires understanding current load, predicting growth, and having a plan to add capacity when needed.
Why it matters: Even reliable systems degrade under increased load. Without scalability planning, success can kill your system—viral growth crashes your service just as users discover it.
Scalability is not:
- ❌ "This system is scalable"
- ❌ "We built for infinite scale"
- ❌ "It scales horizontally"
Scalability is:
- ✅ "If daily users grow 10x, we'll need 5 more DB replicas"
- ✅ "We'll hit limits at 50k concurrent users; we're at 20k now"
- ✅ "Adding 10 nodes doubles our write capacity"
5.1. Describing Load
In plain English: Before you can scale, you need to measure what "load" means for your system. Is it requests per second? Users online? Data volume? The answer shapes your scaling strategy.
In technical terms: Load parameters quantify current system stress. Common metrics include throughput (requests/sec), concurrency (active users), and data volume (GB/day).
Common load parameters:
| Metric | Example | Use Case |
|---|---|---|
| Requests/second | 10,000 API calls/sec | Web services |
| Data volume | 500 GB new data/day | Data pipelines |
| Concurrent users | 100,000 simultaneous users | Gaming, streaming |
| Transactions | 5,000 checkouts/hour | E-commerce |
Linear scalability: If you can double resources to handle double the load with the same performance, you have linear scalability—the holy grail.
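Put concretely, describing load lets you do simple capacity math. The sketch below assumes linear scalability, which is optimistic, and all of the numbers are placeholders:

```python
# How many nodes would we need if load grows, assuming linear scalability?
current_nodes = 4
current_load_rps = 10_000        # requests/second today
capacity_per_node_rps = 3_000    # measured safe throughput per node

for growth in (1, 2, 5, 10):
    projected = current_load_rps * growth
    needed = -(-projected // capacity_per_node_rps)   # ceiling division
    print(f"{growth:2d}x load ({projected:>7,} rps): "
          f"need {needed} nodes (have {current_nodes})")
```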
💡 Insight
Load description is application-specific. For a social network, it might be "posts/second" and "timeline reads/second." For an analytics system, it's "queries/hour" and "data ingestion rate." Understanding your specific load parameters is the first step toward effective scaling.
5.2. Shared-Memory, Shared-Disk, and Shared-Nothing Architecture
In plain English: There are three ways to add capacity: buy a bigger computer (vertical), connect multiple computers to the same storage (shared-disk), or give each computer its own everything (horizontal).
In technical terms: Scaling architectures differ in how they share resources. Each has distinct cost, complexity, and scalability trade-offs.
[Figure: shared-nothing architecture: several independent nodes, each with its own CPU and disk, communicating over a network.]
Shared-Nothing advantages:
- Linear scalability potential
- Cost-effective commodity hardware
- Fault tolerance across nodes
- Elastic—add/remove nodes dynamically
- Geographic distribution
Shared-Nothing challenges:
- Complex distributed system logic
- Data sharding required
- Network latency
- Partial failures
- Eventual consistency
💡 Insight
There's no universal winner. Vertical scaling is simpler but limited. Horizontal scaling can scale infinitely but adds massive complexity. Modern systems often use both: vertically scale nodes to delay the complexity of distribution, then horizontally scale when necessary.
5.3. Principles for Scalability
In plain English: Scalability isn't one-size-fits-all. A system that handles millions of tiny requests needs different architecture than one processing huge analytics queries. The key is breaking work into independent pieces.
In technical terms: Scalable architectures decompose systems into loosely-coupled components that can operate independently. This enables parallel processing and localized failures.
Core principles:
- Describe load with concrete, application-specific parameters
- Break the system into smaller components that can operate independently
- Spread work so it can proceed in parallel and failures stay localized
- Don't add distribution or complexity before the load actually requires it
Examples of decomposition:
| Pattern | Description | Benefit |
|---|---|---|
| Microservices | Split application into independent services | Teams work independently |
| Sharding | Partition data across nodes by key | Parallel data processing |
| Stream processing | Break large jobs into small, independent tasks | Continuous, incremental processing |
| Caching | Store precomputed results | Reduce load on primary system |
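The sharding row above ultimately comes down to a deterministic mapping from key to node. A minimal hash-mod sketch; real systems usually use consistent hashing or range partitioning so that adding nodes doesn't reshuffle most of the data:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]

def node_for_key(key: str) -> str:
    """Route each key to one node so reads and writes for that key stay local."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Each user's data lands on exactly one node; different users spread out.
for user in ("alice", "bob", "carol", "dave"):
    print(user, "->", node_for_key(user))
```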
💡 Insight
Scalability is not automatic. There's no "scalable architecture" you can copy. Systems that scale well at social networks (many tiny writes) fail at analytics (few huge queries). Design for your specific load patterns, and don't make things more complicated than necessary—a single-machine database is often better than a distributed mess.
6. Maintainability
In plain English: Software doesn't wear out like machines, but it does "age" as requirements change, platforms evolve, and knowledge decays. Maintainability means designing systems that stay easy to work with over time.
In technical terms: Maintainability encompasses operability (ease of running), simplicity (ease of understanding), and evolvability (ease of changing). These determine the long-term cost and viability of systems.
Why it matters: Most software cost isn't initial development—it's ongoing maintenance. A system used for years will spend far more on bug fixes, feature additions, and operational costs than its original build.
[Figure: software lifecycle cost: roughly 20% initial development versus 80% ongoing maintenance.]
Maintenance activities:
- Fixing bugs
- Keeping systems operational
- Investigating failures
- Adapting to new platforms
- Modifying for new use cases
- Repaying technical debt
- Adding new features
6.1. Operability: Making Life Easy for Operations
In plain English: Operations teams keep systems running—deploying updates, handling incidents, scaling resources. Good operability means routine tasks are easy, freeing operators to focus on high-value work.
In technical terms: Operability is the ease with which operators can maintain a system's health. Well-designed systems provide good observability, automation, and sensible defaults.
"Good operations can often work around the limitations of bad software, but good software cannot run reliably with bad operations."
What good operability provides:
- Visibility into the system's runtime behavior (metrics, logs, traces)
- Support for automating routine tasks
- Good default behavior, with the freedom to override it when needed
- Predictable behavior that minimizes surprises
Operations responsibilities:
| Task | Good Operability | Poor Operability |
|---|---|---|
| Monitoring | Dashboards show key metrics, alerts fire before users notice | Logs scattered, no alerts, discover issues from user complaints |
| Deployment | Automated, gradual rollout, easy rollback | Manual steps, all-or-nothing, no rollback |
| Scaling | Auto-scaling based on metrics | Manual server provisioning |
| Incidents | Clear runbooks, automatic diagnostics | Guesswork, tribal knowledge |
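Much of the monitoring row above boils down to comparing measured metrics against thresholds and alerting when they are exceeded. A toy sketch; the metric names and thresholds are invented for illustration:

```python
# A toy alert rule: compare a metrics snapshot against thresholds
# so problems are noticed before users complain.
THRESHOLDS = {"p99_latency_ms": 500, "error_rate": 0.01, "disk_used_fraction": 0.9}

def check_alerts(metrics: dict) -> list[str]:
    """Return a human-readable alert for every metric over its threshold."""
    return [
        f"{name}={value} exceeds threshold {THRESHOLDS[name]}"
        for name, value in metrics.items()
        if name in THRESHOLDS and value > THRESHOLDS[name]
    ]

snapshot = {"p99_latency_ms": 730, "error_rate": 0.002, "disk_used_fraction": 0.95}
for alert in check_alerts(snapshot):
    print("ALERT:", alert)
```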
💡 Insight
Operability enables reliability. Even the best-designed system will fail if operators can't understand, monitor, or repair it. Invest in observability and automation—they're not luxuries, they're requirements for reliable production systems.
6.2. Simplicity: Managing Complexity
In plain English: Simple code is easy to understand. Complex code is a tangled mess where changing one thing breaks three others. As projects grow, fighting complexity becomes critical.
In technical terms: Complexity is the enemy of maintainability. Systems mired in complexity—"big balls of mud"—resist change and harbor bugs. Simplicity through abstraction manages this complexity.
Symptoms of excessive complexity:
- Tight coupling: changing one module forces changes in many others
- Tangled dependencies and inconsistent naming or terminology
- Special cases and workarounds ("hacks") accumulated over time
- An explosion of states and configurations nobody fully understands
Abstraction: The tool for managing complexity
Examples of good abstractions:
- High-level languages hide machine code details
- SQL hides on-disk data structures and query optimization
- HTTP hides TCP packet management
- React hides DOM manipulation
Building for simplicity:
| Principle | How It Helps |
|---|---|
| Clear interfaces | Hide implementation details |
| Consistent conventions | Reduce cognitive load |
| Remove accidental complexity | Keep only essential complexity |
| Avoid premature optimization | Simpler code is better than "clever" code |
💡 Insight
Simplicity is not simplistic. A good abstraction hides enormous complexity behind a clean interface (like SQL hiding B-trees and query planners). The goal isn't to avoid complexity entirely—it's to manage it through layers of abstraction so each layer is simple.
6.3. Evolvability: Making Change Easy
In plain English: Requirements never stop changing. Features get added, platforms evolve, business needs shift. Evolvable systems adapt to change easily instead of resisting it.
In technical terms: Evolvability (also called extensibility or modifiability) is the ease of making changes to a system. It's closely linked to simplicity—loosely-coupled, simple systems are easier to modify.
Why requirements change:
- New features and use cases are requested
- Business priorities and markets shift
- Legal and regulatory requirements evolve
- Platforms and dependencies change or are deprecated
- Growth forces architectural changes
Factors that enable evolvability:
| Factor | Description |
|---|---|
| Loose coupling | Components can change independently |
| Good abstractions | Implementation changes don't affect interface |
| Comprehensive tests | Confidence that changes don't break things |
| Reversibility | Easy to undo changes if needed |
| Clear documentation | Understand system to change it safely |
Reversible decisions are far easier to live with than irreversible ones:

| Irreversible decisions | Reversible decisions |
|---|---|
| Careful planning required | Experiment freely |
| Risk of wrong choice | Learn from mistakes |
| Paralysis by analysis | Iterate quickly |
💡 Insight
Irreversibility kills evolvability. When changes can't be undone, teams become paralyzed—every decision feels permanent. Minimize irreversibility through feature flags, database migrations (not destructive changes), and architectures that support gradual transitions. The easier it is to reverse a decision, the faster you can evolve.
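Feature flags are one of the cheapest ways to keep a change reversible: the new code path ships dark and can be switched off without a redeploy. A minimal sketch with an in-memory flag store; a real system would back this with a configuration service, and the flag and field names are illustrative:

```python
# An in-memory flag store; in production this would be backed by a config
# service so flags can be flipped without redeploying.
FLAGS = {"new_ranking_algorithm": False}

def ranked_timeline(posts):
    if FLAGS.get("new_ranking_algorithm", False):
        return sorted(posts, key=lambda p: p["score"], reverse=True)      # new path
    return sorted(posts, key=lambda p: p["timestamp"], reverse=True)      # old path

# Flipping the flag back off instantly reverts behavior if the new ranking misbehaves.
FLAGS["new_ranking_algorithm"] = True
```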
7. Summary
In plain English: This chapter taught you how to think about the qualities that make systems good beyond just functionality: how fast they respond, whether they stay up when things break, if they can grow with demand, and whether they'll become a maintenance nightmare.
In technical terms: Nonfunctional requirements—performance, reliability, scalability, and maintainability—are as critical as functional requirements. Understanding how to measure and optimize these qualities is essential for building production-grade systems.
Key Takeaways
Performance:
- Use percentiles (p50, p95, p99) instead of averages to describe response time
- High percentiles matter—tail latencies affect valuable users
- Tail latency amplification occurs when requests require multiple backend calls
Reliability:
- Distinguish faults (component failures) from failures (system-wide breakdown)
- Fault tolerance prevents faults from becoming failures
- Hardware faults are independent; software faults are correlated
- Human error is a symptom, not a cause—design systems to minimize it
Scalability:
- Scalability is not binary—describe specific growth patterns and limits
- Vertical scaling (bigger machines) is simple but limited
- Horizontal scaling (more machines) scales linearly but adds complexity
- Break systems into independent components for better scalability
Maintainability:
- Most cost is ongoing maintenance, not initial development
- Operability: Make it easy to run
- Simplicity: Use abstraction to manage complexity
- Evolvability: Minimize irreversibility to enable change
💡 Insight
These four qualities are interconnected. A system that's hard to maintain will eventually become unreliable as tech debt accumulates. A system that can't scale will have poor performance under load. A system that's too complex resists both scaling and evolution. Good architecture considers all dimensions together, making conscious trade-offs rather than ignoring any one aspect.