
Chapter 14. Doing the Right Thing

Feeding AI systems on the world's beauty, ugliness, and cruelty, but expecting it to reflect only the beauty is a fantasy.

Vinay Uday Prabhu and Abeba Birhane

Table of Contents

  1. Introduction
  2. Predictive Analytics
  3. Feedback Loops
  4. Privacy and Surveillance
  5. Data as Power
  6. What We Can Do
  7. Summary

1. Introduction

In plain English: We've spent this entire book talking about how to build powerful data systems—how to make them fast, reliable, and scalable. But we've left out the most important question: Should we build them? And if we do, how should we use them? Technology isn't magic—it's made by people, used by people, and affects people. We have a responsibility to think about those effects.

In technical terms: Every data system is built for a purpose and has both intended and unintended consequences. As engineers building these systems, we have a responsibility to carefully consider those consequences and to consciously decide what kind of world we want to live in.

Why it matters: Software development increasingly involves making important ethical choices. It's not sufficient for software engineers to focus exclusively on the technology and ignore its consequences—the ethical responsibility is ours to bear also.

1.1. Data is About People

How we abstract away humanity:

  • "Rows in a database"
  • "Events in a stream"
  • "Files in object storage"

What it actually is:

  • People's behavior
  • People's interests
  • People's identity

Implication: We must treat such data with humanity and respect.

WHAT DATA REALLY REPRESENTS

What we call it | What it actually is
"User metrics" | People's daily lives
"Data points" | Individual choices
"Training dataset" | Personal experiences
"Behavioral signals" | Human behavior
"Conversion funnel" | Life decisions

💡 Insight

Users are humans too, and human dignity is paramount. Every row in your database represents a person with hopes, fears, and rights. The moment we forget this and treat people as mere data points, we've lost our way.

1.2. Technology is Not Neutral

In plain English: Imagine someone invents a new kind of knife. The inventor might say "I just make knives—how people use them is not my problem." But if you specifically design that knife to be easily concealed and perfect for stabbing, you can't claim neutrality. The same applies to software systems.

In technical terms: A technology is not good or bad in itself—what matters is how it is used and how it affects people. This is true for a software system like a search engine in much the same way as it is for a weapon like a gun.

Why it matters: It is not sufficient for software engineers to focus exclusively on the technology and ignore its consequences. Reasoning about ethics is difficult, but it is too important to ignore.

Perspective | Claims | Reality
Tech Company | "We just build tools; we're platform-neutral" | Design choices encode values and priorities
Engineer | "I just write code; business decides how to use it" | Technical decisions have ethical implications
Product Manager | "Users consented via terms of service" | Consent requires understanding; complexity obscures intent

1.3. Ethics is Not a Checklist

In plain English: You can't just run through a compliance checklist, tick all the boxes, and call yourself ethical. Real ethics requires continuous reflection, dialogue with affected people, and accountability for outcomes. It's more like being a good person than like passing an exam.

In technical terms: What makes something "good" or "bad" is not well-defined, and most people in computing don't even discuss that question. The concepts at the heart of ethics are not fixed or determinate in their precise meaning, and they require interpretation, which may be subjective.

Why it matters: Ethics is not going through some checklist to confirm you comply; it's a participatory and iterative process of reflection, in dialog with the people involved, with accountability for the results.

TWO APPROACHES TO ETHICS

The checklist approach:
  ✓ Legal team approved
  ✓ Privacy policy posted
  ✓ Cookie banner displayed
  ✓ Terms of service accepted
  ✓ Data encrypted
  "Done! We're ethical!"

The reflective approach:
  → Who does this affect?
  → What are the consequences?
  → Are we treating people with dignity?
  → What power dynamics exist?
  → How can we be accountable?
  An ongoing conversation

2. Predictive Analytics

In plain English: Predictive analytics means using historical data to predict the future. Predicting tomorrow's weather is one thing—but predicting whether a person will commit a crime, default on a loan, or get sick? That's very different because those predictions directly affect people's lives, often in ways they can't control or appeal.

In technical terms: Predictive analytics is a major part of why people are excited about big data and AI. Using data analysis to predict the weather or the spread of diseases is one thing; it is another matter to predict whether a convict is likely to reoffend, whether an applicant for a loan is likely to default, or whether an insurance customer is likely to make expensive claims.

Why it matters: These predictions have direct effects on individual people's lives—whether they can get a job, buy a house, or access financial services. The stakes are incredibly high.

2.1. Algorithmic Prison

THE ALGORITHMIC PRISON CYCLE

Person flagged as "risky" → algorithm says no →
  • Denied a job → no income
  • Denied a loan → no capital
  • Denied housing → no stability
  • Denied insurance → no safety net
→ Excluded from society

In plain English:

  • Rejected for jobs, denied loans, refused housing
  • Not because of anything you did
  • Because an algorithm labeled you "risky"
  • No way to appeal, no way to prove yourself
  • Researchers call this: "algorithmic prison"

In technical terms:

  • Algorithmic decision-making → more "no" decisions for those labeled risky
  • Systematically excluded from:
    • Jobs
    • Air travel
    • Insurance coverage
    • Property rental
    • Financial services
  • Constraint so severe → termed "algorithmic prison"

Why it matters:

  • Criminal justice: presumes innocence until proven guilty
  • Automated systems: exclude people without proof of guilt
  • Little to no chance of appeal

💡 Insight

Organizations naturally want to be cautious—the cost of a bad hire or bad loan is higher than the cost of a missed opportunity. So when in doubt, they say no. But for the individual on the receiving end of dozens of algorithmic "no" decisions, this caution becomes a cage.

2.2. Bias and Discrimination

HOW ALGORITHMS AMPLIFY BIAS

In plain English:

  • Myth: algorithms are objective, humans are biased
  • Reality: algorithms learn from biased human data
  • Racist data → racist algorithm
  • "Machine learning is like money laundering for bias" — dirty input looks clean and mathematical

In technical terms:

  • AI systems infer rules from data (not explicitly programmed)
  • Patterns learned are opaque — correlation without understanding why
  • Systematic bias in input → amplified bias in output

Why it matters:

  • Anti-discrimination laws exist for: ethnicity, age, gender, sexuality, disability, beliefs
  • Problem: features that correlate with protected traits
  • Example: in segregated neighborhoods, postal code predicts race

Real-world example:

Feature Used | Legal Status | What It Actually Predicts
Race | Illegal | Race (obviously)
Postal Code | Legal | Race (in segregated areas)
First Name | Legal | Race and ethnicity
IP Address | Legal | Race and socioeconomic status
Shopping Patterns | Legal | Race, gender, pregnancy, health
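
The proxy problem can be checked empirically. Below is a minimal, hypothetical sketch (pandas assumed, invented column names such as postal_code and race) that measures how well an apparently neutral feature predicts a protected attribute in historical data; treat it as a quick audit heuristic, not a full fairness analysis.

```python
# Minimal proxy-feature audit: how well does an apparently "neutral"
# feature (e.g. postal code) predict a protected attribute (e.g. race)?
# Hypothetical column names; assumes a pandas DataFrame of historical data.
import pandas as pd

def proxy_strength(df: pd.DataFrame, feature: str, protected: str) -> float:
    """Accuracy of guessing the protected attribute from the feature alone,
    by predicting the most common protected value within each feature group."""
    majority = df.groupby(feature)[protected].agg(lambda s: s.mode().iloc[0])
    predicted = df[feature].map(majority)
    return float((predicted == df[protected]).mean())

# Example usage (hypothetical data):
# df = pd.read_csv("loan_applications.csv")
# print(proxy_strength(df, "postal_code", "race"))
```

If the feature lets you guess the protected attribute far more often than the base rate of the most common group, a model can learn to discriminate through it even though the protected attribute itself is never used.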

💡 Insight

Predictive analytics systems merely extrapolate from the past; if the past is discriminatory, they codify and amplify that discrimination. If we want the future to be better than the past, moral imagination is required, and that's something only humans can provide. Data and models should be our tools, not our masters.

2.3. Responsibility and Accountability

TRADITIONAL VS. ALGORITHMIC DECISION-MAKING

A human decision-maker:
  • Who decides: a named person
  • Process: explainable
  • Accountability: clear
  • Appeal: possible
  • Errors: can be corrected

  "I reviewed your application and here's why I said no. Here's what would change my decision."

An algorithmic decision:
  • Who decides: "the algorithm"
  • Process: opaque black box
  • Accountability: unclear
  • Appeal: nearly impossible
  • Errors: hard to identify and fix

  "The system says no. No, we can't tell you why. No, there's no one to appeal to."

In plain English:

  • Human mistake → hold them accountable
  • Algorithm mistake → who to blame?
    • Engineer who wrote code?
    • Data scientist who trained model?
    • Product manager who approved it?
    • Company that deployed it?
  • Everyone points fingers: "not my fault"

In technical terms:

  • Automated decision-making → accountability problem
  • Human mistakes: can be appealed
  • Algorithm mistakes: who is responsible?
    • Self-driving car accident → who's liable?
    • Discriminatory credit scoring → any recourse?

Why it matters:

  • People should not evade responsibility by blaming algorithms
  • Key question: Can you explain to a judge how the algorithm made its decision?

The problem with opacity:

Old-School Credit Scores | Modern Algorithmic Scoring
Based on borrowing history | Hundreds of factors
Did you pay bills on time? | What you buy, where you live, who your friends are
Errors can be identified | Errors? Good luck figuring out what went wrong
Can be corrected | Much more opaque

The stereotyping problem:

INDIVIDUAL VS. STEREOTYPE-BASED DECISIONS

Individual-based: "How did YOU behave in the past?" Based on your actual history.

Stereotype-based: "Who is similar to you, and how did people like you behave?" Based on people in your neighborhood, demographic group, etc.

💡 Insight

A credit score summarizes "How did you behave in the past?" whereas predictive analytics usually work on the basis of "Who is similar to you, and how did people like you behave in the past?" Drawing parallels to others' behavior implies stereotyping people, for example based on where they live (a close proxy for race and socioeconomic class). What about people who get put in the wrong bucket?

The probabilistic problem:

Statistics Work For | Statistics Don't Work For
Populations | Individuals
Averages | Your specific case
"80% accurate overall" | Whether it's wrong about YOU

The gap:

  • Average life expectancy = 80 years
  • Doesn't mean you die on your 80th birthday
  • Algorithm 80% accurate → still wrong about 20% of people
  • Your life affected by that wrong prediction
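
A tiny illustration of that gap, using made-up numbers: an impressive-sounding aggregate accuracy still translates into a large absolute number of individuals who receive a wrong, potentially life-changing prediction.

```python
# Toy calculation with invented numbers: population-level accuracy
# versus individual outcomes.
applicants = 10_000   # hypothetical number of people scored by the model
accuracy = 0.80       # "80% accurate overall"

wrongly_judged = int(applicants * (1 - accuracy))
print(f"{wrongly_judged:,} of {applicants:,} people get a wrong prediction")
# 2,000 of 10,000 people get a wrong prediction, and none of them can
# tell in advance that they are in the mistaken 20%.
```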

3. Feedback Loops

In plain English:

  • The danger isn't the initial prediction
  • It's what happens next:
    • Predictions change reality
    • Changed reality generates new data
    • New data feeds back into algorithm
    • Vicious cycle
  • Small initial bias → snowballs into massive inequality

In technical terms:

  • Predictive analytics → self-reinforcing feedback loops
  • Feedback loops amplify existing inequalities
  • Create downward spirals that are difficult/impossible to escape

Why it matters:

  • Can't always predict when feedback loops happen
  • Solution: systems thinking — analyze the entire system

3.1. Self-Reinforcing Cycles

THE CREDIT SCORE DEATH SPIRAL

Good worker, good credit → unexpected misfortune (medical emergency, etc.) → missed bill payments → credit score drops → algorithm says "too risky" → can't get hired (employers check credit) → jobless, poverty → more missed payments → score gets worse → even harder to get hired → can't escape

Example 1: The employment trap

Step | What Happens
1 | Employers use credit scores to evaluate candidates
2 | You lose your job due to bad luck
3 | Missed bill payments → credit score drops
4 | Lower credit score → harder to get a new job
5 | No job → more missed payments → worse credit
6 | Trapped in a downward spiral

"A downward spiral due to poisonous assumptions, hidden behind a camouflage of mathematical rigor and data."


Example 2: Algorithmic collusion

  • Gas stations in Germany → replaced human pricing with algorithms
  • Expected: more competition, lower prices
  • Reality: algorithms learned to avoid price wars
  • Result: prices went UP (algorithms colluded)
  • No human explicitly programmed this

Example 3: Echo chambers

  • Recommendation systems learn what you like
  • Show you more of the same
  • Result: you only see opinions you already agree with
  • Creates echo chambers where:
    • Stereotypes breed
    • Misinformation spreads
    • Polarization grows
  • Already impacting election campaigns

3.2. Systems Thinking

NARROW VS. SYSTEMS THINKING

Narrow thinking focuses on the algorithm: "Our model is 85% accurate!" Ship it!

Systems thinking considers the entire system:
  • How do people respond to predictions?
  • Do predictions change behavior?
  • What feedback loops exist?
  • Does it amplify inequality?
  • Can people game the system?
  • What are the unintended consequences?

💡 Insight

We can't always predict when feedback loops will happen. However, many consequences can be predicted by thinking about the entire system (not just the computerized parts, but also the people interacting with it). We can try to understand how a data analysis system responds to different behaviors, structures, or characteristics. Does the system reinforce and amplify existing differences between people (e.g., making the rich richer or the poor poorer), or does it try to combat injustice?

In plain English:

  • Don't think about your algorithm in isolation
  • Think about the whole system:
    • How will real people interact with it?
    • How will it change their behavior?
    • What incentives does it create?
    • What feedback loops might emerge?
  • Even with best intentions → beware unintended consequences

In technical terms:

  • Systems thinking = analyzing how data systems respond to behaviors, structures, characteristics
  • Must understand:
    • Technical components
    • Humans interacting with system
    • How outputs feed back into inputs

4. Privacy and Surveillance

In plain English:

Helpful Tracking | Surveillance
Netflix recommends shows you like | Health insurance requires a fitness tracker
Improves your experience | Adjusts premiums based on your movements
You're the customer | You're the product

The line between "helpful personalization" and "creepy surveillance" has gotten very blurry.

In technical terms:

  • User explicitly enters data → service for user
  • User activity tracked as side effect → relationship unclear
  • Service takes on its own interests → may conflict with user's interests

Why it matters:

  • Ad-funded services: advertisers are the actual customers
  • Users' interests take second place
  • "If you're not paying, you're the product"

4.1. The Surveillance Infrastructure

The Data Surveillance Thought Experiment: Replace "Data" with "Surveillance"

Original Phrase | With "Surveillance" | Still Sound Good?
Data-driven organization | Surveillance-driven organization | 🚩
Real-time data streams | Real-time surveillance streams | 🚩
Data warehouse | Surveillance warehouse | 🚩
Data scientists | Surveillance scientists | 🚩
Advanced data analytics | Advanced surveillance processing | 🚩
Derive new insights | Derive new surveillance insights | 🚩

In plain English: Try this thought experiment: Replace the word "data" with "surveillance" in your company's marketing materials. How does it sound now? "We're a surveillance-driven organization that collects real-time surveillance streams and stores them in our surveillance warehouse." Not so appealing, is it? But that's exactly what we've built.

In technical terms: In our attempts to make software "eat the world," we have built the greatest mass surveillance infrastructure the world has ever seen. We are rapidly approaching a world in which every inhabited space contains at least one internet-connected microphone, in the form of smartphones, smart TVs, voice-controlled assistant devices, baby monitors, and even children's toys that use cloud-based speech recognition.

The scale of surveillance:

In plain English: The most totalitarian regimes of the past could only dream of putting a microphone in every room and forcing people to carry tracking devices. Yet we now voluntarily accept this world of total surveillance. The only difference is that it's corporations collecting the data to sell us stuff, rather than governments seeking control. Is that really such a comforting difference?

In technical terms: What is new compared to the past is that digitization has made it easy to collect large amounts of data about people. Surveillance of our location and movements, our social relationships and communications, our purchases and payments, and data about our health have become almost unavoidable. Even the most totalitarian and repressive regimes of the past could only dream of putting a microphone in every room and forcing every person to constantly carry a device capable of tracking their location and movements. Yet the benefits that we get from digital technology are so great that we now voluntarily accept this world of total surveillance.

💡 Insight

Not all data collection necessarily qualifies as surveillance, but examining it as such can help us understand our relationship with the data collector. When surveillance is used to determine things that hold sway over important aspects of life, such as insurance coverage or employment, it starts to appear less benign.

Real-world examples:

Example | Description
Car Insurance | Cars track driving behavior without explicit consent and share data with insurers, who adjust premiums based on how you drive—even if you never had an accident.
Health Insurance | Some plans require wearing fitness trackers. Not exercising enough? Higher premiums. The tracker knows more about your health than you do.
Smartwatch Sensors | Movement sensors can infer what you're typing (including passwords) with good accuracy. That fitness tracker is also a keylogger.
Smart Home Devices | Microphones in every room, cameras watching, tracking when you're home. Many have terrible security. Who has access to this data?
4.2. Consent and Freedom of Choice

THE CONSENT PROBLEM

In plain English: Companies say "users voluntarily agreed to our terms of service, so they consented to data collection." But did they really? Have you read all 50 pages of legalese in a privacy policy? Do you understand what "we may share anonymized data with partners" actually means? Can you afford to not use essential services like email or maps? That's not meaningful consent—that's coercion dressed up as choice.

In technical terms: We might assert that users voluntarily choose to use a service that tracks their activity, and they have agreed to the terms of service and privacy policy, so they consent to data collection. However, there are several problems with this argument:

Problems with the "consent" argument:

  1. Lack of understanding:

In plain English: Users have no idea what data they're feeding into databases, or how it's being used. Privacy policies do more to obscure than to illuminate. Without understanding what happens to their data, users cannot give meaningful consent.

In technical terms: Users have little knowledge of what data they are feeding into our databases, or how it is retained and processed—and most privacy policies do more to obscure than to illuminate. Without understanding what happens to their data, users cannot give any meaningful consent. Often, data from one user also says things about other people who are not users of the service and who have not agreed to any terms.

  2. Asymmetric relationship:

In plain English: You don't get to negotiate. The company sets the terms, and you either accept everything or get nothing. It's like someone saying "I'll give you food, but only if you give me your house keys and agree to let me rummage through your belongings whenever I want." That's not a fair exchange.

In technical terms: Data is extracted from users through a one-way process, not a relationship with true reciprocity, and not a fair value exchange. There is no dialog, no option for users to negotiate how much data they provide and what service they receive in return: the relationship between the service and the user is very asymmetric and one-sided. The terms are set by the service, not by the user.

  3. No real choice:

In plain English: Try declining to use Google, Facebook, email, or smartphones and see how that goes. For most people in modern society, these services are essential for social participation and professional opportunities. "Just don't use it" is not a realistic option—it's like saying "if you don't like traffic laws, just don't drive." Well, I need to get to work somehow.

In technical terms: If a service is so popular that it is "regarded by most people as essential for basic social participation," then it is not reasonable to expect people to opt out of this service—using it is de facto mandatory. Especially when a service has network effects, there is a social cost to people choosing not to use it.

The European GDPR perspective:

The European Union's General Data Protection Regulation (GDPR) requires that consent must be:

  • Freely given — Can refuse/withdraw without detriment
  • Specific — Clear what you're consenting to
  • Informed — Actually understand what's happening
  • Unambiguous — Active choice, not silence or inactivity

The GDPR recognizes that if you can't say no without losing access to essential services, consent is not "freely given."

💡 Insight

Declining to use a service due to its tracking policies is easier said than done. These platforms are designed to engage users, often using game mechanics and tactics common in gambling. Even if a user gets past this, declining to engage is only an option for the small number of people who are privileged enough to have the time and knowledge to understand its privacy policy, and who can afford to potentially miss out on social participation or professional opportunities. For people in a less privileged position, there is no meaningful freedom of choice: surveillance becomes inescapable.

4.3. Privacy as a Decision Right

WHAT PRIVACY REALLY MEANS

"Keep everything secret"

The freedom to choose:

  • What to reveal to whom
  • What to make public
  • What to keep secret
  • Who can access your data
  • For what purpose

In plain English: People sometimes say "privacy is dead" because users post personal stuff on social media. But that's a misunderstanding of what privacy means. Privacy isn't about keeping secrets—it's about having control over your information. You might happily share photos with friends but not want strangers accessing them. You might share health data with your doctor but not your employer. Privacy means you get to decide.

In technical terms: Having privacy does not mean keeping everything secret; it means having the freedom to choose which things to reveal to whom, what to make public, and what to keep secret. The right to privacy is a decision right: it enables each person to decide where they want to be on the spectrum between secrecy and transparency in each situation. It is an important aspect of a person's freedom and autonomy.

Example: Medical data

In plain English: Someone with a rare disease might eagerly share their medical data with researchers if it might help find a cure. But they'd want to make sure that data doesn't get to insurance companies (who might deny coverage) or employers (who might not hire them). The key is that the person gets to choose who sees the data and why.

In technical terms: Someone who suffers from a rare medical condition might be very happy to provide their private medical data to researchers if there is a chance that it might help the development of treatments for their condition. However, the important thing is that this person has a choice over who may access this data, and for what purpose. If there was a risk that information about their medical condition would harm their access to medical insurance or employment, this person would probably be much more cautious about sharing their data.

The transfer of power:

WHO CONTROLS YOUR PRIVACY?

When you control your own data: you decide what to reveal.

When data is extracted through surveillance: the company decides what to reveal.

In plain English: When data is extracted through surveillance, you lose control over your privacy. The company now decides what to reveal and to whom. They keep the really creepy stuff secret (because it would harm their business), but they use intimate details about you for targeted advertising. You've lost your agency—the company exercises your privacy right with the goal of maximizing profit.

In technical terms: When data is extracted from people through surveillance infrastructure, privacy rights are not necessarily eroded, but rather transferred to the data collector. Companies that acquire data essentially say "trust us to do the right thing with your data," which means that the right to decide what to reveal and what to keep secret is transferred from the individual to the company. The companies in turn choose to keep much of the outcome of this surveillance secret, because to reveal it would be perceived as creepy. It is not the user who decides what is revealed to whom—it is the company that exercises the privacy right with the goal of maximizing its profit.

💡 Insight

This kind of large-scale transfer of privacy rights from individuals to corporations is historically unprecedented. Surveillance has always existed, but it used to be expensive and manual, not scalable and automated. Trust relationships have always existed (between patient and doctor, defendant and attorney), but in these cases the use of data has been strictly governed by ethical, legal, and regulatory constraints. Internet services have made it much easier to amass huge amounts of sensitive information without meaningful consent, and to use it at massive scale without users understanding what is happening to their private data.


5. Data as Power

In plain English: Data isn't just information—it's power. The more data you have about people, the more power you have over their lives. And like any form of power, it can be used for good or abused for profit and control.

In technical terms: "Knowledge is power," as the old adage goes. And furthermore, "to scrutinize others while avoiding scrutiny oneself is one of the most important forms of power." This is why totalitarian governments want surveillance: it gives them the power to control the population.

Why it matters: Although today's technology companies are not overtly seeking political power, the data and knowledge they have accumulated nevertheless gives them a lot of power over our lives, much of which is surreptitious, outside of public oversight.

5.1. Data as Assets

THE ECONOMICS OF PERSONAL DATA

The user gets: a "free" service.

The company gets: a valuable asset worth billions.

In plain English: Companies call user activity "data exhaust"—suggesting it's just worthless waste. But if it's worthless, why are startups valued by their "eyeballs" (i.e., how many people they can surveil)? Why does a shady industry of data brokers exist, buying and selling personal information? The truth is data is extremely valuable, and you're the one generating it without getting paid.

In technical terms: Since behavioral data is a byproduct of users interacting with a service, it is sometimes called "data exhaust"—suggesting that the data is worthless waste material. Viewed this way, behavioral and predictive analytics can be seen as a form of recycling. More correct would be to view it the other way round: from an economic point of view, if targeted advertising is what pays for a service, then the user activity that generates behavioral data could be regarded as a form of labor. Personal data is a valuable asset, as evidenced by the existence of data brokers, a shady industry operating in secrecy, purchasing, aggregating, analyzing, inferring, and reselling intrusive personal data about people.

Who wants your data:

Who | Why They Want It
Companies | To maximize advertising revenue, optimize pricing, build competitive moats
Governments | Through secret deals, legal compulsion, or simply stealing it
Criminals | Data breaches happen constantly; your data is difficult to secure
Future Buyers | When a company goes bankrupt, personal data is an asset that gets sold

5.2. Data as Hazardous Material

DATA IS...

The new gold? Precious but inert.
The new oil? A valuable energy source.
The new uranium? Powerful but dangerous.

In plain English: People like to say "data is the new oil" or "data is the new gold." But maybe data is more like uranium—incredibly powerful, but also hazardous material that needs careful handling. Even if you think you can prevent abuse, you need to consider: What if your systems get hacked? What if a future government compels you to hand it over? What if your company gets acquired by someone with different values?

In technical terms: These observations have led critics to say that data is not just an asset, but a "toxic asset," or at least "hazardous material." Maybe data is not the new gold, nor the new oil, but rather the new uranium. Even if we think that we are capable of preventing abuse of data, whenever we collect data, we need to balance the benefits with the risk of it falling into the wrong hands.

Consider all possible futures:

In plain English: When you build a surveillance system, you can't just think about how it will be used today. You have to think about all possible future governments. What happens if a regime comes to power that doesn't respect human rights? Your well-intentioned data collection system becomes the foundation for a police state.

In technical terms: When collecting data, we need to consider not just today's political environment, but all possible future governments. There is no guarantee that every government elected in future will respect human rights and civil liberties, so "it is poor civic hygiene to install technologies that could someday facilitate a police state."

💡 Insight

Because the data is valuable, many people want it: companies (that's why they collect it), governments (by means of secret deals, coercion, legal compulsion, or simply stealing it), criminals (data breaches happen often), and future buyers (when a company goes bankrupt, data is an asset that gets sold). The data is difficult to secure, so we must consider not just how we intend to use it, but all the ways it might be misused.

5.3. Remembering the Industrial Revolution

HISTORY REPEATING

The industrial age:
  Problem: pollution
    • Air pollution from factories
    • Water pollution from waste
    • Exploitation of workers
    • Child labor
  Solution: regulation
    • Environmental protections
    • Workplace safety laws
    • Child labor bans
    • Food safety inspections

The information age:
  Problem: data
    • Privacy violations
    • Surveillance infrastructure
    • Algorithmic discrimination
    • Power concentration
  Solution: ??? (we're at this stage now)

In plain English: The Industrial Revolution brought tremendous benefits—economic growth, improved living standards, amazing innovations. But it also brought terrible pollution, exploitation of workers, and child labor. It took a long time before society established safeguards: environmental regulations, workplace safety rules, child labor laws. Those regulations cost businesses money, but society as a whole benefited hugely. We're at a similar moment with the Information Age.

In technical terms: The Industrial Revolution came about through major technological and agricultural advances, and it brought sustained economic growth and significantly improved living standards in the long run. Yet it also came with major problems: pollution of the air and water was dreadful, factory owners lived in splendor while urban workers often lived in very poor housing and worked long hours in harsh conditions, and child labor was common. It took a long time before safeguards were established. Undoubtedly the cost of doing business increased when factories were no longer allowed to dump their waste into rivers, sell tainted foods, or exploit workers. But society as a whole benefited hugely from these regulations.

Bruce Schneier's perspective:

Data is the pollution problem of the information age, and protecting privacy is the environmental challenge. Almost all computers produce information. It stays around, festering. How we deal with it—how we contain it and how we dispose of it—is central to the health of our information economy. Just as we look back today at the early decades of the industrial age and wonder how our ancestors could have ignored pollution in their rush to build an industrial world, our grandchildren will look back at us during these early decades of the information age and judge us on how we addressed the challenge of data collection and misuse.

We should try to make them proud.


6. What We Can Do

In plain English: Okay, this all sounds terrible. What can we actually do about it? The answer is a combination of regulation (laws that protect people) and self-regulation (choosing to do the right thing even when we're not legally required to). But most importantly, we need a culture change in how we think about users and their data.

In technical terms: We need both legislative protections for individuals' rights and a culture shift in the tech industry with regard to personal data. We should stop regarding users as metrics to be optimized, and remember that they are humans who deserve respect, dignity, and agency.

Why it matters: As people working in technology, if we don't consider the societal impact of our work, we're not doing our job.

6.1. Legislation and Self-Regulation

THE REGULATION TENSION

The Big Data philosophy:
  • Collect everything
  • Combine datasets
  • Explore and experiment
  • Find unforeseen insights

The data-protection philosophy:
  • Collect only what's needed
  • For specific, explicit purposes
  • Don't use it for other purposes
  • Data minimization

These philosophies are fundamentally in conflict.

In plain English: Laws like Europe's GDPR say you should only collect data for specific purposes and minimize what you collect. But the whole Big Data philosophy is "collect everything and see what we can learn from it." These are fundamentally opposed ideas. GDPR has helped a bit, but enforcement is weak, and it hasn't really changed tech industry culture.

In technical terms: The European GDPR states that personal data must be "collected for specified, explicit and legitimate purposes and not further processed in a manner that is incompatible with those purposes", and furthermore that data must be "adequate, relevant and limited to what is necessary." However, this principle of data minimization runs directly counter to the philosophy of Big Data, which is to maximize data collection, to combine it with other datasets, to experiment and to explore in order to generate new insights. While the GDPR has had some effect on the online advertising industry, the regulation has been weakly enforced, and it does not seem to have led to much of a change in culture and practices across the wider tech industry.

The regulation dilemma:

In plain English: Companies oppose regulation because it limits what they can do (and makes less money). Sometimes that opposition is justified—over-regulation could prevent beneficial innovations. For example, medical data could help develop better treatments and save lives, but strict privacy rules might prevent that research. It's hard to balance potential benefits with real risks.

In technical terms: Companies that collect lots of data about people oppose regulation as being a burden and a hindrance to innovation. To some extent that opposition is justified. For example, when sharing medical data, there are clear risks to privacy, but there are also potential opportunities: how many deaths could be prevented if data analysis was able to help us achieve better diagnostics or find better treatments? Over-regulation may prevent such breakthroughs. It is difficult to balance such potential opportunities with the risks.

6.2. Culture Change

FROM METRICS TO HUMANS

The old mindset:
  • Users are metrics to optimize
  • Maximize engagement
  • Extract maximum value
  • Keep them in the dark
  • Exploit asymmetric power
  • "Move fast and break things"

The needed mindset:
  • Users are humans with dignity
  • They deserve respect and agency
  • They are entitled to understand their data
  • They should maintain control
  • Build trust, not just compliance
  • Consider societal impact

In plain English: We need to fundamentally change how we think about users. They're not "daily active users" or "conversion rates"—they're people. People who deserve respect, dignity, and control over their own information. This means going beyond legal compliance to actually caring about the impact of our work.

In technical terms: We need a culture shift in the tech industry with regard to personal data. We should stop regarding users as metrics to be optimized, and remember that they are humans who deserve respect, dignity, and agency. We should self-regulate our data collection and processing practices in order to establish and maintain the trust of the people who depend on our software. And we should take it upon ourselves to educate end users about how their data is used, rather than keeping them in the dark.

Practical steps we can take:

💡 Insight

As a first step, we should not retain data forever, but purge it as soon as it is no longer needed, and minimize what we collect in the first place. Data you don't have is data that can't be leaked, stolen, or compelled by governments to be handed over.

Step | Action
1. Data Minimization | Only collect what you actually need. If you don't need it, don't collect it.
2. Data Retention Limits | Don't keep data forever. Delete it when it's no longer needed. Data you don't have can't be leaked.
3. Transparency | Educate users about how their data is used in plain language, not legalese.
4. User Control | Give users meaningful control over their data, not just privacy theater.
5. Consider Consequences | Think about the societal impact of your work, not just technical metrics.
6. Protect Privacy | Allow individuals to maintain control over their own data. Don't steal that control through surveillance.
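
As one concrete example of steps 1 and 2, a retention purge can be a small scheduled job. The sketch below uses a hypothetical user_events table with a created_at column and an assumed 90-day policy; it is a minimal starting point, not a complete retention system (real deployments also need to cover backups, replicas, and derived datasets).

```python
# Minimal sketch of a data-retention purge job (hypothetical schema and
# retention period; adapt to your own tables and legal requirements).
# Runs as a periodic task and deletes event data older than the retention window.
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 90  # assumed policy: keep raw events for 90 days only

def purge_old_events(db_path: str) -> int:
    cutoff = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)
    with sqlite3.connect(db_path) as conn:
        cur = conn.execute(
            "DELETE FROM user_events WHERE created_at < ?",
            (cutoff.isoformat(),),
        )
        return cur.rowcount  # number of rows purged

# Example: schedule this daily (cron, Airflow, etc.)
# purged = purge_old_events("analytics.db")
# print(f"Purged {purged} expired event rows")
```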

The bottom line:

In plain English: Our individual right to control our data is like the natural environment of a national park—if we don't explicitly protect and care for it, it will be destroyed. Ubiquitous surveillance is not inevitable. We are still able to stop it. But only if we choose to.

In technical terms: We should allow each individual to maintain their privacy—i.e., their control over own data—and not steal that control from them through surveillance. Our individual right to control our data is like the natural environment of a national park: if we don't explicitly protect and care for it, it will be destroyed. It will be the tragedy of the commons, and we will all be worse off for it. Ubiquitous surveillance is not inevitable—we are still able to stop it.

💡 Insight

Overall, culture and attitude changes will be necessary. As people working in technology, if we don't consider the societal impact of our work, we're not doing our job.


7. Summary

This brings us to the end of the book. Let's reflect on the journey we've taken and the ethical implications of our work:

What we've learned throughout this book:

  • Chapter 1: Contrasted analytical and operational systems, compared cloud to self-hosting, and discussed balancing business needs with user needs
  • Chapter 2: Defined nonfunctional requirements like performance, reliability, scalability, and maintainability
  • Chapter 3: Explored data models (relational, document, graph, event sourcing, DataFrames) and query languages
  • Chapter 4: Discussed storage engines for OLTP, analytics, and various indexing strategies
  • Chapter 5: Examined data encoding, evolution, and flow between processes
  • Chapter 6: Studied replication strategies and consistency models
  • Chapter 7: Went into sharding, rebalancing, and secondary indexing
  • Chapter 8: Covered transactions, isolation levels, and distributed transaction atomicity
  • Chapter 9: Surveyed fundamental distributed systems problems (faults, delays, crashes)
  • Chapter 10: Deep-dived into consensus and linearizability
  • Chapter 11: Built up from Unix tools to large-scale batch processing
  • Chapter 12: Generalized to stream processing with message brokers and CDC
  • Chapter 13: Explored a philosophy of streaming systems for integration and evolution

What we've learned in this chapter:

Key Ethical Lessons:

Lesson | Description
Data is About People | Every row in your database represents a human being who deserves dignity and respect
Algorithms Can Harm | Predictive systems can create algorithmic prisons, amplify bias, and trap people in feedback loops
Surveillance ≠ Service | There's a difference between helping users and exploiting them for profit
Privacy is Control | Privacy means individuals control their own data, not corporations
Data is Power | Data gives enormous power over people's lives and must be handled responsibly
We Have Responsibility | As engineers, we must consider the societal impact of what we build

In plain English: We've spent 13 chapters learning how to build powerful, reliable, scalable data systems. But all that technical knowledge means nothing if we use it to harm people. Data can be used for good—to improve healthcare, understand the world, connect people, solve problems. But it can also cause serious harm: making unfair decisions about people's lives, enabling discrimination, normalizing mass surveillance, and concentrating power in the hands of corporations and governments.

In technical terms: Although data can be used to do good, it can also do significant harm: making decisions that seriously affect people's lives and are difficult to appeal against, leading to discrimination and exploitation, normalizing surveillance, and exposing intimate information. We also run the risk of data breaches, and we may find that a well-intentioned use of data has unintended consequences.

Why it matters: As software and data are having such a large impact on the world, we as engineers must remember that we carry a responsibility to work toward the kind of world that we want to live in: a world that treats people with humanity and respect.

THE CHOICE WE FACE

One path:
  • Maximize data collection
  • Optimize for profit
  • Treat people as metrics
  • Build surveillance infrastructure
  • Concentrate power

The other path:
  • Minimize data collection
  • Respect human dignity
  • Treat people as people
  • Protect privacy and autonomy
  • Distribute power

💡 Final Insight

The technical skills you've gained from this book are powerful tools. Like any powerful tool, they can be used to build or to destroy, to help or to harm. The choice is yours. As software and data are having such a large impact on the world, we as engineers must remember that we carry a responsibility to work toward the kind of world that we want to live in: a world that treats people with humanity and respect.

Let's work together towards that goal.

