Chapter 4. Storage and Retrieval
One of the miseries of life is that everybody names things a little bit wrong. And so it makes everything a little harder to understand in the world than it would be if it were named differently. A computer does not primarily compute in the sense of doing arithmetic. […] They primarily are filing systems.
Richard Feynman, Idiosyncratic Thinking seminar (1985)
Table of Contents
- Introduction
- Storage and Indexing for OLTP
- Log-Structured Storage
- B-Trees
- 4.1. B-Tree Structure
- 4.2. Making B-Trees Reliable
- 4.3. B-Tree Variants
- Comparing B-Trees and LSM-Trees
- 5.1. Read Performance
- 5.2. Sequential vs. Random Writes
- 5.3. Write Amplification
- 5.4. Disk Space Usage
- Multi-Column and Secondary Indexes
- Data Storage for Analytics
- Multidimensional and Full-Text Indexes
- 8.1. Multidimensional Indexes
- 8.2. Full-Text Search
- 8.3. Vector Embeddings
- Summary
1. Introduction
In Chapter 3 we discussed data models and query languages—i.e., the format in which you give the database your data, and the interface through which you can ask for it again later. In this chapter we discuss the same from the database's point of view: how the database can store the data that you give it, and how it can find the data again when you ask for it.
In plain English: Think of a database like a massive filing cabinet. You can stuff papers into drawers (writing data), but if you just throw them in randomly, you'll spend hours searching when you need something specific. Storage engines are the organizational systems—some use file folders (B-trees), others use chronological notebooks (log-structured storage).
In technical terms: A storage engine is the component of a database that handles how data is stored on disk and retrieved when queried. Different engines make different trade-offs between read performance, write performance, and space efficiency.
Why it matters: Selecting the right storage engine is critical for application performance. A poor choice can make your database 10x slower or consume 10x more resources. Understanding storage internals helps you tune databases effectively and diagnose performance issues.
💡 Insight
There is a fundamental tension in storage systems: optimizing for reads versus writes. Structures that make writes fast (append-only logs) make reads slow. Structures that make reads fast (sorted indexes) make writes slow. Every storage engine is a carefully chosen compromise along this spectrum.
In particular, there is a big difference between storage engines that are optimized for transactional workloads (OLTP) and those that are optimized for analytics (OLAP):
- OLTP systems handle many small reads and writes (user requests)
- OLAP systems scan massive datasets to answer analytical queries
This chapter examines two families of OLTP storage engines:
- Log-structured storage (LSM-trees, SSTables) - immutable files that are merged
- Update-in-place storage (B-trees) - fixed pages that are overwritten
Later, we'll explore column-oriented storage optimized for analytics, and specialized indexes for geospatial and full-text search.
2. Storage and Indexing for OLTP
2.1. The World's Simplest Database
Consider the world's simplest database, implemented as two Bash functions:
#!/bin/bash
db_set () {
echo "$1,$2" >> database
}
db_get () {
grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
These two functions implement a key-value store. You can call db_set key value, which will store key and value in the database. You can then call db_get key, which looks up the most recent value associated with that particular key and returns it.
Example usage:
$ db_set 12 '{"name":"London","attractions":["Big Ben","London Eye"]}'
$ db_set 42 '{"name":"San Francisco","attractions":["Golden Gate Bridge"]}'
$ db_get 42
{"name":"San Francisco","attractions":["Golden Gate Bridge"]}
The storage format is very simple: a text file where each line contains a key-value pair, separated by a comma. Every call to db_set appends to the end of the file. If you update a key several times, old versions are not overwritten—you need to look at the last occurrence:
$ db_set 42 '{"name":"San Francisco","attractions":["Exploratorium"]}'
$ db_get 42
{"name":"San Francisco","attractions":["Exploratorium"]}
$ cat database
12,{"name":"London","attractions":["Big Ben","London Eye"]}
42,{"name":"San Francisco","attractions":["Golden Gate Bridge"]}
42,{"name":"San Francisco","attractions":["Exploratorium"]}
The db_set function has pretty good performance because appending to a file is generally very efficient. Similarly to what db_set does, many databases internally use a log, which is an append-only data file. Real databases have more issues to deal with (concurrent writes, reclaiming disk space, crash recovery), but the basic principle is the same.
Note
The word log is often used to refer to application logs (human-readable text). In this book, log means an append-only sequence of records on disk. It doesn't have to be human-readable; it might be binary and intended only for internal use.
On the other hand, the db_get function has terrible performance if you have many records. Every lookup scans the entire database file from beginning to end. In algorithmic terms, the cost is O(n): if you double the number of records, a lookup takes twice as long.
2.2. Indexes and the Read-Write Trade-off
In order to efficiently find the value for a particular key, we need a different data structure: an index.
In plain English: An index is like the index at the back of a textbook. Instead of flipping through every page to find "B-tree", you look it up in the index, which tells you exactly which pages to read. The trade-off is that maintaining the index requires effort every time the book is updated.
In technical terms: An index is an additional data structure derived from the primary data. It doesn't affect the contents of the database; it only affects query performance. Maintaining indexes incurs overhead, especially on writes.
Why it matters: Well-chosen indexes can speed up read queries by orders of magnitude, but every index consumes disk space and slows down writes. You must balance these trade-offs based on your workload.
Key principle: Well-chosen indexes speed up reads but slow writes. Databases don't index everything by default—you choose based on your application's query patterns.
💡 Insight
For writes, it's hard to beat the performance of simply appending to a file, because that's the simplest possible write operation. Any kind of index usually slows down writes, because the index also needs to be updated every time data is written. This is an important trade-off in storage systems: well-chosen indexes speed up reads, but introduce overhead on writes.
3. Log-Structured Storage
3.1. Hash Index in Memory
Let's continue with the append-only log, but add an index to speed up reads. The simplest approach: keep a hash map in memory, where every key is mapped to the byte offset in the file where the most recent value can be found.
| Key | Offset |
|---|---|
| 12 | 0 |
| 42 | 40 |
| 99 | 60 |
Whenever you append a new key-value pair to the file, you also update the hash map to reflect the offset of the data you just wrote. When you want to look up a value, you use the hash map to find the offset in the log file, seek to that location, and read the value.
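A minimal sketch of this idea in Python (the class name and file name are made up for illustration; a real engine would also handle crash recovery, concurrency, and file rotation):

```python
import os

class HashIndexedLog:
    """Append-only log with an in-memory hash index of byte offsets."""

    def __init__(self, path="database.log"):
        self.index = {}                      # key -> byte offset of most recent value
        self.f = open(path, "a+b")

    def set(self, key, value):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()               # remember where this record starts
        self.f.write(f"{key},{value}\n".encode())
        self.f.flush()
        self.index[key] = offset             # point the index at the newest value

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.f.seek(offset)                  # one seek instead of a full file scan
        line = self.f.readline().decode().rstrip("\n")
        _, _, value = line.partition(",")
        return value

db = HashIndexedLog()
db.set("42", '{"name":"San Francisco"}')
print(db.get("42"))
```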
This approach is much faster, but it still suffers from several problems:
- Disk space grows forever — You never free up space occupied by old values that have been overwritten; eventually you run out of disk space.
- Hash map must fit in memory — If you have billions of keys, the hash map becomes impractically large. On-disk hash maps are difficult to make performant due to random I/O.
- No crash recovery — The hash map is not persisted, so you have to rebuild it when you restart by scanning the whole log file (slow).
- Range queries are inefficient — You cannot easily scan keys between 10000 and 19999; you'd have to look up each key individually.
3.2. SSTables (Sorted String Tables)
Instead of an append-only log with arbitrary key order, we can do better by keeping the data sorted by key. This format is called a Sorted String Table, or SSTable.
In plain English: Imagine organizing a filing cabinet. Instead of putting documents in the order they arrive (append-only), you keep them alphabetically sorted. Now, to find a document, you don't need to check every drawer—you can jump to the right section and scan a small range.
In technical terms: An SSTable stores key-value pairs sorted by key. Each key appears only once per SSTable. This enables efficient lookups using a sparse index.
| Key | Offset |
|---|---|
| handbag | 0 KB |
| handsome | 4 KB |
| hazard | 8 KB |
Advantages of SSTables:
- Sparse index — You don't need to keep all keys in memory; just store the first key of each block (e.g., every 4 KB). This dramatically reduces memory requirements.
- Efficient range queries — Because keys are sorted, scanning a range is fast: jump to the start key and read sequentially.
- Compression — Blocks can be compressed, saving disk space and I/O bandwidth at the cost of a bit more CPU time.
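To illustrate the sparse index from the list above, here is a small Python sketch (the keys, offsets, and block size are hypothetical): binary-search the in-memory sparse index to find the block that could contain the key, then scan only that one block on disk.

```python
import bisect

# Sparse index: the first key of each block and that block's starting offset.
# (Hypothetical values; a real engine builds this while writing the SSTable.)
sparse_index = [("handbag", 0), ("handsome", 4096), ("hazard", 8192)]

def find_block(key):
    """Return the offset of the block that could contain `key`."""
    first_keys = [k for k, _ in sparse_index]
    # Rightmost block whose first key is <= the search key
    i = bisect.bisect_right(first_keys, key) - 1
    return sparse_index[max(i, 0)][1]

# To find "handiwork", jump to offset 0 and scan at most one block (a few KB):
print(find_block("handiwork"))   # -> 0
```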
3.3. Constructing and Merging SSTables
SSTables are better for reading than an append-only log, but they make writes more difficult—you can't just append at the end because the file must remain sorted.
The solution is a log-structured approach that combines an append-only log with sorted files:
How it works:
- Write path:
  - Add new writes to an in-memory sorted tree (memtable) — red-black tree or skip list
  - When the memtable reaches a threshold (e.g., 4 MB), write it to disk as a new SSTable
  - Continue writing to a new memtable instance
- Read path:
  - First try to find the key in the memtable
  - If not found, check the most recent SSTable on disk
  - Continue searching older SSTables until the key is found or all segments are exhausted
- Background compaction:
  - Periodically merge multiple SSTables together
  - Use mergesort: read the files side by side, copy the lowest key to the output
  - When the same key appears in multiple files, keep only the most recent value
  - Result: fewer, larger SSTables with one value per key
Crash recovery: To ensure the memtable isn't lost on crash, keep a separate append-only log on disk. Every write is immediately appended to this log. When the memtable is written to an SSTable, the corresponding log can be discarded.
Deletions: To delete a key, append a special tombstone marker. During compaction, the tombstone tells the merge process to discard previous values for the deleted key.
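A minimal sketch of the compaction merge in Python, assuming each SSTable is a list of (key, value) pairs sorted by key and the segments are passed from newest to oldest (the TOMBSTONE marker and function name are illustrative):

```python
import heapq

TOMBSTONE = object()   # marker meaning "this key was deleted"

def merge_sstables(sstables):
    """Merge sorted (key, value) runs; sstables[0] is the newest segment.

    Keeps only the most recent value for each key and drops tombstones,
    like a compaction pass over LSM segments.
    """
    # Tag each entry with the segment's age so newer values sort first for equal keys.
    tagged = ([(key, age, value) for key, value in table]
              for age, table in enumerate(sstables))
    merged, last_key = [], None
    for key, age, value in heapq.merge(*tagged):
        if key == last_key:
            continue                      # older duplicate of a key we already handled
        last_key = key
        if value is not TOMBSTONE:
            merged.append((key, value))
    return merged

new = [("apple", "3"), ("orange", TOMBSTONE)]             # newest segment
old = [("apple", "1"), ("banana", "2"), ("orange", "5")]  # older segment
print(merge_sstables([new, old]))   # [('apple', '3'), ('banana', '2')]
```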
💡 Insight
This algorithm is essentially what is used in RocksDB, Cassandra, Scylla, and HBase, all inspired by Google's Bigtable. It was originally published in 1996 as the Log-Structured Merge-Tree (LSM-Tree). The key insight: immutable SSTable files are written once and never modified. Merging and compaction happen in the background while serving reads from old segments.
3.4. Bloom Filters
With LSM storage, reading a key that doesn't exist can be slow—the storage engine must check several SSTables before concluding the key is absent. Bloom filters solve this problem.
In plain English: A Bloom filter is like a bouncer at a club who has a near-perfect memory. If you ask "Is Alice inside?", the bouncer can definitively say "No" or "Probably yes, check inside." Bloom filters never give false negatives (if it says no, the key is definitely absent), but can give false positives (it might say yes when the key is absent).
In technical terms: A Bloom filter is a probabilistic data structure that provides a fast, space-efficient way to test whether an element is in a set. It uses a bit array and multiple hash functions.
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
How Bloom filters help LSM-trees:
- If the Bloom filter says a key is not present, safely skip that SSTable (100% certain)
- If the Bloom filter says the key is present, check the SSTable (might be false positive)
- False positives waste a bit of I/O, but true negatives avoid it entirely
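A toy Bloom filter in Python to make the idea concrete (the bit-array size and number of hash functions are arbitrary here; real implementations size them from the expected number of keys and the target false-positive rate):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions setting bits in a bit array."""

    def __init__(self, size_bits=16, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [0] * size_bits

    def _positions(self, key):
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("handbag")
print(bf.might_contain("handbag"))   # True
print(bf.might_contain("unicorn"))   # usually False; a True here would be a false positive
```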
3.5. Compaction Strategies
An important detail is when to perform compaction and which SSTables to merge. Common strategies:
Size-tiered compaction:
- Newer, smaller SSTables are merged into older, larger SSTables
- SSTables containing older data can get very large
- Advantage: Can handle very high write throughput
- Disadvantage: Requires a lot of temporary disk space during compaction
Leveled compaction:
- Key range is split into smaller SSTables organized into "levels"
- Older data is moved into separate levels
- Compaction proceeds more incrementally and uses less disk space
- Advantage: More efficient for reads (fewer SSTables to check)
- Disadvantage: More overhead on writes
Rule of thumb:
- Size-tiered compaction: better for mostly writes and few reads
- Leveled compaction: better for read-dominated workloads
3.6. Embedded Storage Engines
Many databases run as network services, but embedded databases are libraries that run in the same process as your application. Examples: RocksDB, SQLite, LMDB, DuckDB, KùzuDB.
Use cases for embedded storage engines:
- Mobile apps — Store local user data
- Single-machine backends — When data fits on one machine with few concurrent transactions
- Multitenant systems — Separate embedded database instance per tenant (if tenants are small and isolated)
💡 Insight
The storage techniques we discuss (LSM-trees, B-trees, column storage) are used in both embedded and client-server databases. The principles are the same; only the deployment model differs.
4. B-Trees
4.1. B-Tree Structure
The log-structured approach is popular, but the most widely used structure for reading and writing database records by key is the B-tree.
In plain English: A B-tree is like a well-organized library with a card catalog. The catalog directs you to the right section, then the right shelf, then the right book. Each level narrows down the search, so you never have to scan the entire library.
In technical terms: B-trees break the database into fixed-size pages (typically 4-16 KB) organized into a tree structure. Each page contains keys and pointers to child pages, enabling logarithmic-time lookups.
Key properties:
- Sorted by key — Like SSTables, B-trees keep key-value pairs sorted, enabling efficient lookups and range queries
- Fixed-size pages — Unlike LSM-trees (variable-size segments), B-trees use fixed pages that can be overwritten in place
- Balanced tree — All leaf pages are at the same depth, ensuring predictable performance
- Logarithmic depth — A four-level tree with 4 KB pages and branching factor of 500 can store up to 250 TB
Operations:
Lookup: Start at root, follow pointers down until you reach a leaf page containing the key.
Update: Search for the leaf page, overwrite it with the new value.
Insert: Find the appropriate leaf page. If there's room, add the key. If the page is full, split it into two half-full pages and update the parent.
If the parent page is also full, it splits too; splits can propagate all the way up to the root.
This algorithm ensures the tree remains balanced: a B-tree with n keys always has a depth of O(log n). Most databases can fit into a B-tree that is three or four levels deep.
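A simplified sketch of a B-tree lookup in Python, using in-memory objects as stand-ins for fixed-size disk pages (the page layout and names are illustrative; a real B-tree reads each page from disk or the buffer pool):

```python
import bisect

class Page:
    """A B-tree page: `keys` are boundaries; `children` holds child pages
    (interior page) or is None (leaf page); `values` holds a leaf's key-value pairs."""
    def __init__(self, keys, children=None, values=None):
        self.keys = keys
        self.children = children   # len(keys) + 1 child pages, or None for a leaf
        self.values = values or {}

def lookup(page, key):
    # Walk down one page per level until we reach a leaf.
    while page.children is not None:
        i = bisect.bisect_right(page.keys, key)   # child whose key range covers `key`
        page = page.children[i]
    return page.values.get(key)

leaf1 = Page(keys=[], values={100: "a", 200: "b"})
leaf2 = Page(keys=[], values={300: "c", 400: "d"})
root = Page(keys=[300], children=[leaf1, leaf2])
print(lookup(root, 200))   # "b"
print(lookup(root, 300))   # "c"
```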
4.2. Making B-Trees Reliable
The basic write operation of a B-tree is to overwrite a page on disk with new data. This is fundamentally different from LSM-trees, which only append to files and never modify them in place.
In plain English: Imagine editing a page in a bound book versus adding pages to a loose-leaf binder. Editing in place (B-trees) is riskier—if you spill coffee mid-edit, the page is corrupted. Adding new pages (LSM-trees) is safer—if something goes wrong, the old pages are still intact.
Challenges:
- Torn pages — If the database crashes while writing a page, you may end up with a partially written page (corrupted data).
- Multi-page updates — A page split requires writing multiple pages. If the crash happens after some pages are written but not others, the tree becomes corrupted (orphan pages, broken pointers).
Solution: Write-Ahead Log (WAL)
B-tree implementations include a write-ahead log (WAL), an append-only file to which every modification must be written before it's applied to the tree pages. This is equivalent to journaling in filesystems (ext4, XFS).
Performance optimization: B-tree implementations typically don't immediately write every modified page to disk. They buffer pages in memory for a while. The WAL ensures that data isn't lost if the system crashes—as long as the WAL entry has been flushed to disk (using fsync()), the database can recover.
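A sketch of the WAL discipline in Python: append and fsync the log record before touching the buffered page, so that a crash after the fsync can be recovered by replaying the log. The file name, record format, and function name are made up for illustration:

```python
import json, os

def apply_with_wal(wal_path, pages, page_id, key, value):
    """Log the change, force it to disk, then modify the in-memory page.

    If we crash after the fsync, recovery can replay the WAL record;
    the page itself can be flushed to disk lazily later.
    """
    record = json.dumps({"page": page_id, "key": key, "value": value})
    with open(wal_path, "a") as wal:
        wal.write(record + "\n")
        wal.flush()
        os.fsync(wal.fileno())      # durability point: the change is now recoverable
    pages.setdefault(page_id, {})[key] = value   # buffered page update, written out later

pages = {}
apply_with_wal("wal.log", pages, page_id=7, key="user:42", value="Alice")
print(pages)
```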
4.3. B-Tree Variants
B-trees have been around since 1970, and many variants have been developed:
- Copy-on-write B-trees — Instead of overwriting pages and maintaining a WAL, write modified pages to a new location and create new parent pages pointing to the new location. Used by LMDB. Also useful for snapshot isolation.
- Key abbreviation — Store abbreviated keys in interior nodes (they only need enough information to act as boundaries). This increases branching factor and reduces tree depth.
- Sequential leaf layout — Try to lay out leaf pages in sequential order on disk to speed up range scans. Difficult to maintain as the tree grows.
- Sibling pointers — Each leaf page has pointers to its left and right siblings, allowing scanning keys in order without jumping back to parent pages.
5. Comparing B-Trees and LSM-Trees
5.1. Read Performance
B-trees:
- Looking up a key involves reading one page at each level (typically 3-4 levels)
- Predictable performance with low response times
- Range queries are simple and fast using the sorted tree structure
LSM-trees:
- Must check memtable and potentially several SSTables at different compaction stages
- Bloom filters reduce disk I/O by eliminating SSTables that don't contain the key
- Range queries must scan all segments in parallel and merge results
💡 Insight
Both approaches can perform well for reads—which is faster depends on the workload and implementation details. Benchmarks are often sensitive to specifics, so you must test with your particular workload to make a valid comparison.
Read throughput: Modern SSDs (especially NVMe) can perform many independent read requests in parallel. Both LSM-trees and B-trees can achieve high read throughput, but storage engines must be carefully designed to exploit this parallelism.
5.2. Sequential vs. Random Writes
B-trees: When the application writes keys scattered across the key space, the resulting disk operations are also scattered randomly. Each page to update could be anywhere on disk.
LSM-trees: Write entire segment files at a time (memtable flush or compaction), which are much larger than B-tree pages.
Sequential vs. Random Write Performance:
- On spinning-disk HDDs: Sequential writes are much faster than random writes (no mechanical head movement). LSM-trees have a big advantage.
- On SSDs: Sequential writes are still faster than random writes due to garbage collection (GC) overhead, but the difference is smaller than on HDDs.
💡 Insight
Even on SSDs, sequential writes outperform random writes. Flash memory is written page-by-page (4 KB) but erased block-by-block (512 KB). Random writes scatter pages across blocks, forcing the SSD controller to perform more garbage collection before erasing blocks. Sequential writes fill entire blocks, which can be erased without GC overhead.
5.3. Write Amplification
In plain English: Write amplification is like making photocopies. If you need to make one copy of a document, but the copier forces you to copy the entire binder it's in, that's write amplification. The database writes more bytes to disk than the application requested.
In technical terms: Write amplification is the ratio of bytes written to disk divided by bytes written by the application. Higher write amplification reduces throughput and wears out SSDs faster.
LSM-trees:
- A value is written to the log (durability)
- Written again when memtable is flushed
- Written again every time it's part of a compaction
- Mitigated by storing values separately from keys (if values are large)
B-trees:
- Write to WAL (durability)
- Write to tree page (may need to write entire page even if only a few bytes changed)
For typical workloads, LSM-trees tend to have lower write amplification because they compress data and don't write entire pages for small updates. This makes LSM storage engines well suited for write-heavy workloads.
Impact of write amplification:
- Throughput — Higher amplification means fewer effective writes per second within available disk bandwidth
- SSD wear — SSDs have limited write cycles; lower amplification extends SSD lifespan
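A back-of-the-envelope illustration in Python with entirely hypothetical numbers (real amplification factors depend heavily on the engine, configuration, and workload):

```python
# Hypothetical example: the application writes 1 GiB of key-value data.
app_bytes = 1 * 1024**3

# Suppose an LSM engine writes each byte once to the log, once when the
# memtable is flushed, and rewrites the data in 4 compaction passes.
lsm_disk_bytes = app_bytes * (1 + 1 + 4)

# Suppose a B-tree engine writes a full 8 KiB page (plus a WAL record)
# for every 128-byte row that it updates.
row_bytes, page_bytes = 128, 8 * 1024
btree_disk_bytes = (app_bytes // row_bytes) * (page_bytes + row_bytes)

print("LSM write amplification:   ", lsm_disk_bytes / app_bytes)    # 6.0
print("B-tree write amplification:", btree_disk_bytes / app_bytes)  # ~65
```

These numbers are only illustrative: B-trees absorb many updates to hot pages in the buffer pool before writing them out, and LSM amplification varies with the compaction strategy and number of levels.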
5.4. Disk Space Usage
B-trees:
- Can become fragmented over time (deleted keys leave empty pages)
- Free pages can be reused, but can't easily be returned to the OS (they're in the middle of the file)
- Databases need background processes to defragment (e.g., PostgreSQL's VACUUM)
LSM-trees:
- Compaction periodically rewrites data files, eliminating fragmentation
- SSTables don't have unused space within pages
- Better compression due to larger blocks
- Deleted keys (tombstones) continue to consume space until compacted out
Snapshots: LSM-trees make snapshots easy—just record which segment files existed at a point in time (no need to copy them, since they're immutable). B-trees with in-place updates need more complex snapshot mechanisms.
6. Multi-Column and Secondary Indexes
6.1. Storing Values Within the Index
So far we've discussed primary key indexes. In relational databases, you can also create secondary indexes on other columns using CREATE INDEX.
In plain English: A primary key is like your home address—it uniquely identifies you. Secondary indexes are like phone books sorted by last name or by street name—different ways to look up the same information.
In technical terms: A secondary index maps non-unique values to rows. This can be implemented as:
- A list of matching row IDs (like a postings list)
- Unique entries by appending a row ID to each key
Where to store the actual row data: The rows themselves can live in a heap file (an unordered file of rows), with each index entry pointing to the row's location; this avoids duplicating data when there are multiple secondary indexes. Alternatively, a clustered index stores the row data directly within the index (for example, the primary key index in MySQL's InnoDB), and a covering index is a middle ground that stores some of a table's columns inside the index.
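A tiny Python sketch of a secondary index as postings lists of row IDs, with the rows themselves kept in a heap-file-like structure (the data is made up):

```python
from collections import defaultdict

# Rows stored by row ID, as in a heap file.
rows = {
    1: {"name": "Alice", "city": "London"},
    2: {"name": "Bob",   "city": "London"},
    3: {"name": "Carol", "city": "Paris"},
}

# Secondary index on "city": each value maps to a postings list of row IDs.
city_index = defaultdict(list)
for row_id, row in rows.items():
    city_index[row["city"]].append(row_id)

print(city_index["London"])                     # [1, 2]
print([rows[r] for r in city_index["London"]])  # follow the postings list to the rows
```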
6.2. Keeping Everything in Memory
The data structures we've discussed (LSM-trees, B-trees) are designed around disk limitations. As RAM becomes cheaper, in-memory databases store everything in memory.
In plain English: It's like the difference between storing files on your desk (RAM) versus in a filing cabinet (disk). Accessing your desk is instant; the filing cabinet requires walking over and searching through drawers.
Common misconception: In-memory databases are fast because they avoid disk reads.
Reality: Even disk-based databases use the OS page cache, so frequently accessed data is already in memory. The real advantage is avoiding the overhead of encoding data for disk storage.
In-memory database durability:
While the dataset lives in memory, durability can be achieved through:
- Battery-powered RAM
- Append-only log written to disk
- Periodic snapshots written to disk
- Replicating state to other machines
Examples:
- VoltDB, SingleStore, Oracle TimesTen — In-memory relational databases
- RAMCloud — In-memory key-value store with durability
- Redis, Couchbase — Weak durability (asynchronous disk writes)
💡 Insight
In-memory databases can also provide data models that are difficult with disk-based indexes. For example, Redis offers priority queues, sets, and other data structures with a simple implementation because everything is in memory.
7. Data Storage for Analytics
The data model of a data warehouse is typically relational, and SQL is a good fit for analytic queries. However, the internals differ drastically from OLTP databases because they're optimized for different query patterns.
7.1. Cloud Data Warehouses
Modern cloud data warehouses separate storage and compute:
Traditional data warehouses: Teradata, Vertica, SAP HANA — tightly coupled storage and compute, often on specialized hardware
Cloud data warehouses: BigQuery, Snowflake, Redshift — leverage cloud infrastructure (object storage, serverless compute)
Advantages of cloud data warehouses:
- Elasticity — Scale storage and compute independently
- Integration — Easy integration with cloud services (automatic log ingestion, data processing frameworks)
- Cost — Pay for what you use
Open source evolution:
Components that were previously integrated in a single system (e.g., Apache Hive) are now often separate, composable layers:
- Query engine: parse SQL, optimize, execute
- File format: how rows are encoded as bytes (e.g., Parquet)
- Table format: manage files, support inserts/deletes, time travel
- Catalog: which tables comprise a database, metadata management
- Metadata management
7.2. Column-Oriented Storage
Data warehouses often use a star schema with a massive fact table (trillions of rows, 100+ columns) and smaller dimension tables. A typical query accesses only 4-5 columns but many rows.
Problem with row-oriented storage:
SELECT
dim_date.weekday, dim_product.category,
SUM(fact_sales.quantity) AS quantity_sold
FROM fact_sales
JOIN dim_date ON fact_sales.date_key = dim_date.date_key
JOIN dim_product ON fact_sales.product_sk = dim_product.product_sk
WHERE
dim_date.year = 2024 AND
dim_product.category IN ('Fresh fruit', 'Candy')
GROUP BY
dim_date.weekday, dim_product.category;
This query needs only 3 columns (date_key, product_sk, quantity) but row-oriented storage forces you to load all 100+ columns for each row.
Solution: Column-oriented storage
In plain English: Instead of storing all information about one person together (row-oriented), store all names together, all ages together, all addresses together (column-oriented). When you need just names and ages, you read only those two lists.
In technical terms: Store all values from each column together. A query only reads and parses the columns it needs.
Used in: Snowflake, BigQuery, DuckDB, Parquet, Apache Arrow
How it works:
- Each column is stored separately
- Columns store rows in the same order (23rd entry in each column belongs to the 23rd row)
- Queries only load the columns they need
- In practice, columns are broken into blocks (thousands/millions of rows per block)
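A toy Python illustration of the same three rows stored row-wise versus column-wise (the values are made up):

```python
# Row-oriented: one record per row, all columns stored together.
rows = [
    {"date_key": 140102, "product_sk": 69, "quantity": 1},
    {"date_key": 140102, "product_sk": 74, "quantity": 3},
    {"date_key": 140103, "product_sk": 69, "quantity": 2},
]

# Column-oriented: one list per column, all in the same row order.
columns = {
    "date_key":   [140102, 140102, 140103],
    "product_sk": [69, 74, 69],
    "quantity":   [1, 3, 2],
}

# A query that only needs `quantity` touches a single column, not whole rows.
print(sum(columns["quantity"]))   # 6
```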
7.3. Column Compression
Column-oriented storage enables excellent compression because:
- Values in a column are often repetitive (e.g., a retailer has billions of transactions but only 100,000 products)
- Compression reduces disk throughput and network bandwidth requirements
Bitmap encoding: Take a column with a small number of distinct values compared to the number of rows (e.g., product_sk in a fact table). Create one bitmap per distinct value, with one bit per row: the bit is 1 if the row has that value, and 0 if not. Sparse bitmaps can additionally be run-length encoded, making the representation very compact. Bitmaps also speed up common filters: a condition like product_sk IN (30, 68, 69) becomes the bitwise OR of three bitmaps.
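A small Python sketch of bitmap encoding and of answering an IN filter with bitwise OR (the product_sk values are made up):

```python
# Build one bitmap per distinct product_sk value (one bit per row).
product_sk = [69, 69, 74, 31, 31, 29, 31, 68, 69]

bitmaps = {}
for row, value in enumerate(product_sk):
    bitmaps.setdefault(value, [0] * len(product_sk))[row] = 1

print(bitmaps[31])   # [0, 0, 0, 1, 1, 0, 1, 0, 0]

# WHERE product_sk IN (68, 69): bitwise OR of two bitmaps, then read off the row numbers.
matches = [a | b for a, b in zip(bitmaps[68], bitmaps[69])]
print([row for row, bit in enumerate(matches) if bit])   # [0, 1, 7, 8]
```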
7.4. Sort Order in Column Storage
Although columns are stored separately, we can impose a sort order on rows (similar to SSTables).
In plain English: Think of organizing a filing cabinet. You might sort files by date as the primary key, then by customer name as the secondary key. All files for the same date are grouped together, and within each date, they're sorted by customer.
Example: Sort by date_key first, then product_sk:
- Queries targeting date ranges (e.g., last month) can scan only the relevant rows
- Sales for the same product on the same day are grouped together in storage
Compression benefit: Sorting improves compression. If the primary sort column has long runs of the same value, run-length encoding compresses it dramatically (even billions of rows down to kilobytes).
7.5. Writing to Column-Oriented Storage
Column-oriented storage, compression, and sorting make reads fast, but writes are more complex.
Challenge: Inserting a single row in the middle of a sorted columnar table would require rewriting all compressed columns from the insertion point onwards.
Solution: LSM-tree approach. Writes first go to an in-memory sorted structure; when enough writes have accumulated, they are merged with the column files on disk and written out in bulk as new files.
Systems using this approach: Snowflake, Vertica, Apache Pinot, Apache Druid, DuckDB
7.6. Query Execution: Compilation and Vectorization
For analytics queries scanning millions of rows, CPU efficiency matters as much as I/O.
Two approaches have emerged:
- Query compilation: the database generates code tailored to the specific query (for example, a tight loop like the one below) and compiles it to machine code
- Vectorized processing: the query is interpreted, but each operator processes a whole batch of column values at a time instead of one row at a time

A compiled filter that selects quantities for one product in one store might boil down to a loop like this (the arrays product_sk, store_sk, and quantity hold one column each):
for (int i = 0; i < num_rows; i++) {
if (product_sk[i] == 30 && store_sk[i] == 3) {
output[output_count++] = quantity[i];
}
}

Both approaches achieve good performance by:
- Preferring sequential memory access (reduce cache misses)
- Tight inner loops (keep CPU pipeline busy, avoid branch mispredictions)
- Parallelism (multiple threads, SIMD instructions)
- Operating directly on compressed data (saves memory allocation and copying)
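For contrast with the compiled loop above, here is an illustrative sketch of the vectorized style using NumPy (NumPy is just a convenient stand-in; real engines implement their own batch operators, often working directly on compressed column data):

```python
import numpy as np

# One array per column (a "batch" of column values).
product_sk = np.array([30, 68, 30, 31, 30])
store_sk   = np.array([ 3,  3,  5,  3,  3])
quantity   = np.array([ 1,  4,  2,  9,  5])

# Vectorized filter: each comparison produces a whole boolean vector, and the
# combination and selection run in tight native loops rather than row by row.
mask = (product_sk == 30) & (store_sk == 3)
print(quantity[mask])        # [1 5]
print(quantity[mask].sum())  # 6
```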
7.7. Materialized Views and Data Cubes
Materialized view: A table-like object whose contents are the results of a query, written to disk.
In plain English: Instead of computing a complex report every time someone asks for it, compute it once, save the result, and update it when the underlying data changes. It's like pre-cooking meals instead of cooking from scratch every time.
Data cube (OLAP cube): A special type of materialized view that creates a grid of aggregates grouped by different dimensions.
| Dates ↓ / Products → | Apples | Bananas | Cherries | Total |
|---|---|---|---|---|
| Jan 1 | $120 | $80 | $50 | $250 |
| Jan 2 | $100 | $90 | $60 | $250 |
| Jan 3 | $130 | $70 | $40 | $240 |
| Total | $350 | $240 | $150 | $740 |
Trade-off:
- Advantage: Certain queries become very fast (effectively pre-computed)
- Disadvantage: No flexibility—can't answer ad-hoc questions that don't align with cube dimensions
Most data warehouses keep raw data and use cubes only as a performance boost for specific queries.
8. Multidimensional and Full-Text Indexes
8.1. Multidimensional Indexes
B-trees and LSM-trees allow range queries over a single attribute. But sometimes you need to query multiple columns simultaneously.
Example: Geospatial queries
A restaurant-search website needs to find all restaurants within a rectangular map area:
SELECT * FROM restaurants
WHERE latitude > 51.4946 AND latitude < 51.5079
AND longitude > -0.1162 AND longitude < -0.1004;
A concatenated index over (latitude, longitude) can't answer this efficiently—it can give you either all restaurants in a range of latitudes (at any longitude), or all in a range of longitudes (at any latitude), but not both.
Solutions:
- Space-filling curves — Translate 2D location into a single number using a curve, then use a regular B-tree
- R-trees — Divide space so nearby points tend to be grouped in the same subtree, which allows efficient pruning during a query
- Grid-based approaches — Use regularly spaced grids (triangles, squares, hexagons)
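To illustrate the space-filling-curve idea from the list above, here is a small Python sketch of the Z-order (Morton) curve: interleave the bits of the two coordinates into a single integer, which an ordinary B-tree can then index (coordinates are assumed to have already been mapped to an integer grid, e.g., by fixed-point scaling of latitude and longitude):

```python
def interleave_bits(x, y, bits=16):
    """Morton (Z-order) code: interleave the bits of x and y into one integer."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # even bit positions come from x
        z |= ((y >> i) & 1) << (2 * i + 1)   # odd bit positions come from y
    return z

# Nearby grid cells tend to get numerically similar Z-values,
# so a B-tree range scan over Z-values approximates a 2D bounding-box query.
print(interleave_bits(3, 5))   # 39
```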
Other uses: Not just for geography—can be used for any multi-dimensional data:
- Colors: 3D index on (red, green, blue) to search for products in a color range
- Weather data: 2D index on (date, temperature) to find observations in a time-temperature range
8.2. Full-Text Search
Full-text search allows searching text documents by keywords appearing anywhere in the text. This is a specialized domain with language-specific processing (tokenization, stemming, typos, synonyms).
Core idea: Think of full-text search as a multidimensional query where each possible word (term) is a dimension.
Inverted index: The primary data structure for full-text search.
| Term | Postings List |
|---|---|
| red | [1, 3] |
| green | [2] |
| apples | [1, 2] |
| roses | [3] |
| delicious | [1] |
| healthy | [2] |
| beautiful | [3] |
Postings lists can be stored as sparse bitmaps or in run-length-encoded form.
Implementation: Lucene (used by Elasticsearch and Solr) stores term-to-postings-list mappings in SSTable-like sorted files, merged in the background using LSM-tree approach. PostgreSQL's GIN index also uses postings lists.
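A minimal Python sketch of building and querying an inverted index; the documents are made up to match the postings table above:

```python
from collections import defaultdict

docs = {
    1: "red apples are delicious",
    2: "green apples are healthy",
    3: "red roses are beautiful",
}

# Build the inverted index: term -> postings list of document IDs.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(*terms):
    """Return documents containing all of the given terms (AND query)."""
    postings = [index.get(t, set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(sorted(index["apples"]))    # [1, 2]
print(search("red", "apples"))    # [1]
```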
Advanced features:
- N-grams — Break text into substrings of length n (e.g., trigrams of "hello": "hel", "ell", "llo"). Enables substring search and regular expressions.
- Fuzzy search — Levenshtein automaton allows searching for words within a given edit distance (typos).
8.3. Vector Embeddings
Semantic search goes beyond keyword matching to understand document concepts and user intent.
In plain English: Traditional search looks for exact words. If your help page says "cancelling your subscription" but the user searches for "close my account," traditional search fails. Semantic search understands they mean the same thing.
In technical terms: Embedding models translate documents into vector embeddings—lists of floating-point numbers representing a point in multi-dimensional space. Semantically similar documents have vectors that are close together.
Flat index:
- Store all vectors as-is
- Query: compare with every vector
- Accurate but slow (O(n))

IVF (inverted file) index:
- Cluster vectors into partitions (centroids)
- Query: check only the nearest partitions
- Fast but approximate (may miss vectors in other partitions)

HNSW (Hierarchical Navigable Small World) index:
- Multiple layers of graphs
- Top layer: sparse, few nodes
- Bottom layer: dense, contains all vectors
- Query: start at the top layer and navigate down, following nearest neighbors
- Fast with good accuracy, though still approximate
Used in: pgvector, Faiss, Pinecone, Weaviate, Milvus
How semantic search works:
- Indexing: Generate vector embeddings for all documents using an embedding model (Word2Vec, BERT, GPT, multimodal models)
- Querying: User enters query → embedding model generates query vector
- Retrieval: Vector index returns documents with closest vectors (using cosine similarity or Euclidean distance)
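A toy Python sketch of the retrieval step using a flat (exact) index and cosine similarity; the 3-dimensional embeddings and the query vector are made up (real embeddings have hundreds or thousands of dimensions and come from an embedding model):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical document embeddings.
documents = {
    "cancelling your subscription": [0.9, 0.1, 0.2],
    "weekly team meeting notes":    [0.1, 0.8, 0.3],
}
query = [0.85, 0.15, 0.25]   # hypothetical embedding for "close my account"

# Flat (exact) search: compare the query against every stored vector.
best = max(documents, key=lambda d: cosine_similarity(query, documents[d]))
print(best)   # "cancelling your subscription"
```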
Vector indexes:
- Flat indexes — Accurate but slow (compare with every vector)
- IVF indexes — Partition space into clusters; only check nearest partitions (approximate)
- HNSW indexes — Multi-layer graphs; navigate from coarse to fine (approximate)
💡 Insight
R-trees work poorly for high-dimensional vectors (100-1000 dimensions) due to the "curse of dimensionality." Specialized indexes like IVF and HNSW are designed for high-dimensional spaces and accept approximate results for speed.
9. Summary
In this chapter we tried to get to the bottom of how databases perform storage and retrieval. What happens when you store data in a database, and what does the database do when you query for the data again later?
We saw that storage engines optimized for OLTP look very different from those optimized for analytics (OLAP):
Key takeaways:
- OLTP systems are optimized for high-volume small reads/writes accessed via primary or secondary indexes (ordered mappings from key to record).
- Data warehouses and analytics systems are optimized for complex queries scanning millions of records using column-oriented storage with compression, minimizing disk I/O and CPU time.
- Log-structured storage (LSM-trees) appends to files and never updates in place. Provides high write throughput. Examples: RocksDB, Cassandra, HBase.
- Update-in-place storage (B-trees) treats disk as fixed-size pages that can be overwritten. Faster for reads with predictable performance. Used in all major relational databases.
- The fundamental trade-off: Structures that make writes fast (append-only logs) make reads slow. Structures that make reads fast (sorted indexes) make writes slow. Every storage engine is a carefully chosen compromise.
- Indexes speed up reads but slow down writes. You must choose indexes based on your application's query patterns, balancing read performance against write overhead.
- Column-oriented storage excels at analytics by reading only the columns needed for a query. Combined with compression (bitmap encoding), sorting, and vectorization, it enables scanning billions of rows in seconds.
- Specialized indexes exist for specific use cases: R-trees for geospatial queries, inverted indexes for full-text search, and vector indexes (IVF, HNSW) for semantic search.
💡 Insight
As an application developer, understanding storage engine internals helps you choose the right database for your workload, tune configuration parameters effectively, and diagnose performance issues. While you won't become an expert in any single engine from this chapter, you now have the vocabulary and mental models to make sense of database documentation and make informed decisions.
In Chapter 5, we'll turn to encoding and evolution—how data formats change over time while maintaining backward and forward compatibility.