Chapter 2: Data Ingestion Patterns
Feeding the machine learning beast - efficiently and at scale
Table of Contents
- Introduction
- What is Data Ingestion?
- The Fashion-MNIST Dataset
- Batching Pattern
- Sharding Pattern
- Caching Pattern
- Answers to Exercises
- 7.1. Section 4.4
- 7.2. Section 5.4
- 7.3. Section 6.4
- Summary
1. Introduction
Restaurant Analogy: Imagine you're running a busy restaurant kitchen. Raw ingredients (data) come in from suppliers, and you need to prep, organize, and serve them to hungry customers (your ML model). But what happens when:
- You get a massive food truck delivery that won't fit in your prep area?
- Orders are coming in faster than you can prep ingredients sequentially?
- You're making the same popular dish over and over?
That's exactly the challenge of data ingestion in machine learning systems.
Data ingestion is the critical first step that determines whether your ML pipeline will feast or famine. Get it wrong, and your expensive GPUs sit idle waiting for data. Get it right, and you unlock the full potential of distributed machine learning.
This chapter explores three fundamental patterns that solve the most common data ingestion challenges: batching, sharding, and caching. These aren't just theoretical concepts - they're battle-tested solutions used by every major ML platform.
2. What is Data Ingestion?
Definition: Data ingestion is the process that monitors data sources, consumes data (either all at once or streaming), and performs preprocessing to prepare for machine learning model training.
Think of it as the supply chain for your ML factory:
Data sources → Ingestion → Prepared data → Training
Two Main Approaches:
- Batch ingestion: consume the available data all at once
- Streaming ingestion: consume data continuously as it arrives
Insight
Data ingestion is often the hidden bottleneck in ML systems. A 10x improvement in model architecture means nothing if your data pipeline can only feed it at 1/10th the speed.
Key Responsibilities:
- Monitor data sources for new information
- Extract data in appropriate formats
- Transform raw data into ML-ready formats
- Load data efficiently for training processes
- Handle failures and inconsistencies gracefully
3. The Fashion-MNIST Dataset
Why Fashion-MNIST over classic MNIST?
The classic MNIST dataset (handwritten digits) has become too easy - most simple models achieve 95%+ accuracy. It's like using a children's puzzle to test engineering skills. Fashion-MNIST provides the same structure but with more realistic complexity.
Dataset Characteristics:
- Training: 60,000 examples
- Testing: 10,000 examples
- Image Size: 28×28 grayscale
- Classes: 10 fashion categories
- File Size: ~30 MB compressed
The 10 Fashion Classes (each with ~6,000 training examples):
- T-shirt/top
- Trouser
- Pullover
- Dress
- Coat
- Sandal
- Shirt
- Sneaker
- Bag
- Ankle boot
Loading Fashion-MNIST with TensorFlow:
# Approach 1: Direct download and load
import tensorflow as tf
train, test = tf.keras.datasets.fashion_mnist.load_data()

# Approach 2: Load the NumPy arrays into a tf.data.Dataset
images, labels = train
images = images / 255.0  # Normalize pixel values to [0, 1]
dataset = tf.data.Dataset.from_tensor_slices((images, labels))
print(dataset)
# <TensorSliceDataset shapes: ((28, 28), ()), types: (tf.float64, tf.uint8)>
The Memory Problem: from_tensor_slices() embeds the NumPy arrays in the graph as tf.constant() operations, which can copy the array contents multiple times and run into TensorFlow's 2 GB limit for the tf.GraphDef protocol buffer. Note that this limit applies during graph serialization (saving models, @tf.function tracing, distribution), not during basic eager execution.
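One common way around this is to stream the arrays through a Python generator instead of embedding them as graph constants. A minimal sketch, reusing the images and labels arrays from the snippet above:
# Sketch: stream NumPy data through a generator so it is not embedded
# as tf.constant nodes in a serialized graph (reuses `images`/`labels` above).
import tensorflow as tf

def example_generator():
    for image, label in zip(images, labels):
        yield image, label

dataset = tf.data.Dataset.from_generator(
    example_generator,
    output_signature=(
        tf.TensorSpec(shape=(28, 28), dtype=tf.float64),
        tf.TensorSpec(shape=(), dtype=tf.uint8),
    ),
)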
The Scale Challenge:
- Fashion-MNIST: 30 MB (manageable)
- Real-world datasets: 30 GB - 30 TB (problematic)
- Production systems: Continuous growth (challenging)
This small dataset serves as our learning foundation, but the patterns we'll explore scale to datasets thousands of times larger.
4. Batching Pattern
4.1. The Problem: Memory Constraints with Large Datasets
The Real Challenge: GPU/RAM Limits
Modern ML training hits hardware memory walls. Your GPU has finite VRAM (often 16GB), but datasets can be massive:
- Fashion-MNIST: 30 MB (fits easily)
- ImageNet: ~150 GB (won't fit in GPU memory)
- Real datasets: Often 100GB-10TB
Processing Overhead:
Beyond storage, operations multiply memory usage:
- Forward pass: Activations for each layer
- Backward pass: Gradients for each parameter
- Optimizer states: Adam keeps two moment estimates, roughly 2x extra parameter memory
- Intermediate computations: Attention matrices, etc.
Note on GraphDef: TensorFlow's tf.constant() has a 2GB limit for graph serialization, but this is separate from the batching pattern. Proper data loading uses tf.data pipelines that stream data without embedding it in the graph.
Additional Processing Costs:
Beyond memory, the image-processing operations themselves are expensive (a sketch of a typical preprocessing function follows this list):
- Resizing: CPU/GPU intensive
- Normalization: Mathematical operations on every pixel
- Augmentation: Rotations, flips, color changes
- Convolutions: Complex mathematical transformations
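As a rough illustration of where that cost comes from, a per-image preprocessing function might look like the following; the resize target and augmentation choices are illustrative assumptions, not requirements of the pattern:
# Sketch: typical per-image preprocessing with TensorFlow ops
# (resize target and augmentations are illustrative choices).
import tensorflow as tf

def preprocess(image, label):
    image = tf.cast(image, tf.float32) / 255.0      # normalization
    image = tf.expand_dims(image, axis=-1)          # add a channel dim: (28, 28, 1)
    image = tf.image.resize(image, [32, 32])        # resizing
    image = tf.image.random_flip_left_right(image)  # augmentation
    return image, label

# Typically applied lazily over a tf.data pipeline:
# dataset = dataset.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)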
4.2. The Solution: Divide and Consume
Restaurant Kitchen Analogy:
Instead of trying to prep all ingredients at once and overwhelming your kitchen, smart chefs prep mise en place (everything in place) in small batches throughout service.
The Batching Solution:
Instead of loading the entire dataset into GPU memory, we divide it into mini-batches that fit within hardware constraints:
Large dataset: 1M examples (50 GB total)
GPU memory: 16 GB available
Mini-batches: 32 images per batch (~2 GB of GPU memory per step), processed one batch at a time
Implementation Pattern:
# Pseudocode for batching
batch = read_next_batch(dataset)
while batch is not None:
    model.train(batch)
    batch = read_next_batch(dataset)
TensorFlow Implementation:
# Efficient batching with TensorFlow I/O
import tensorflow_io as tfio

# Stream the dataset directly from a URL instead of downloading everything first
d_train = tfio.IODataset.from_mnist(
    'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',
    'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz',
)
d_train = d_train.batch(32)  # consume the stream in mini-batches

# Or stream from a database (PostgreSQL example)
import os

endpoint = "postgresql://{}:{}@{}?port={}&dbname={}".format(
    os.environ['TFIO_DEMO_DATABASE_USER'],
    os.environ['TFIO_DEMO_DATABASE_PASS'],
    os.environ['TFIO_DEMO_DATABASE_HOST'],
    os.environ['TFIO_DEMO_DATABASE_PORT'],
    os.environ['TFIO_DEMO_DATABASE_NAME'],
)
dataset = tfio.experimental.IODataset.from_sql(
    query="SELECT co, pt08s1 FROM AirQualityUCI;",
    endpoint=endpoint,
)
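The same pattern works for the in-memory Fashion-MNIST arrays from Section 3; a minimal tf.data sketch, where the batch size of 32 and the shuffle buffer are illustrative choices:
# Sketch: batching the in-memory Fashion-MNIST arrays with tf.data
# (batch size and shuffle buffer are illustrative choices).
import tensorflow as tf

(train_images, train_labels), _ = tf.keras.datasets.fashion_mnist.load_data()
train_images = train_images / 255.0

dataset = (
    tf.data.Dataset.from_tensor_slices((train_images, train_labels))
    .shuffle(buffer_size=10_000)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)

for image_batch, label_batch in dataset.take(1):
    print(image_batch.shape)  # (32, 28, 28)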
4.3. Discussion: Trade-offs and Considerations
When Batching Works Well:
- ✅ Operations can be performed on data subsets
- ✅ Model training is iterative
- ✅ Memory is limited
- ✅ Framework has batch support
When Batching Doesn't Work:
- ❌ Algorithm needs global statistics (mean of entire dataset)
- ❌ Model requires minimum examples per class per batch
- ❌ Batch size becomes too critical for model performance
The Batch Size Balancing Act: smaller batches fit easily in memory but give noisier gradient estimates and lower hardware utilization, while larger batches give smoother gradients and better throughput but use more memory and can hurt generalization.
Sweet spot: often 64-512 for most applications
Insight
Modern frameworks like AdaptDL automatically tune batch sizes by measuring system throughput and gradient noise, enabling efficient distributed training without manual batch-size tuning.
The original research compares the effects of automatically and manually tuned batch sizes on overall training time, demonstrating the efficiency benefits of automatic approaches.
4.4. Exercises
1. Are we training the model using batches in parallel or sequentially?
2. If the ML framework cannot handle large datasets, can we use the batching pattern?
3. If a model requires knowing the mean of a feature across the entire dataset, can we still use batching?
5. Sharding Pattern
5.1. The Problem: Sequential Processing Bottleneck
Scaling Beyond Single Machine Limits:
The batching pattern works great until you hit the sequential processing wall. Let's scale our Fashion-MNIST example:
Scale-Up Scenario:
- Original Fashion-MNIST: 30 MB, 60,000 examples
- Scaled dataset: 30 GB, 60,000,000 examples (1000x larger)
- Batch size: 5 GB each (6 total batches)
- Processing time per batch: 1 hour
Sequential batch processing: 6 batches × 1 hour each = 6 hours total
The Utilization Problem: with three worker machines available but only one busy at a time, the other two sit idle.
Resource utilization: 33% (wasteful!)
The Real Cost:
- Time: Sequential processing instead of parallel
- Money: Paying for multiple machines but using only one
- Opportunity: Delayed model deployment, inefficient resource usage
5.2. The Solution: Horizontal Data Distribution
Encyclopedia Analogy:
Imagine three friends need to read a 30-volume encyclopedia for an exam. Instead of each reading all 30 volumes, they split the set: each friend reads 10 volumes, and together they cover the whole encyclopedia in a third of the time.
Sharding Architecture:
Hypothetical dataset: 60 million examples, split into three shards of 20 million examples each. Each worker trains on its own shard, so the shards are processed in parallel rather than sequentially.
Horizontal vs Vertical Partitioning:
| ID | Name | Age | Job | Salary |
|----|------|-----|-----|--------|
| 1  | Ann  | 25  | Dev | 50k    |
| 2  | Bob  | 30  | Mgr | 60k    |
| 3  | Cal  | 35  | Dev | 55k    |
| 4  | Dan  | 40  | Dir | 80k    |
Horizontal partitioning (sharding): Shard A = rows 1-2 (Ann, Bob); Shard B = rows 3-4 (Cal, Dan)
Vertical partitioning: Part A = columns ID, Name, Age; Part B = columns ID, Job, Salary
Sharding Implementation:
# Pseudocode for sharding
if get_worker_rank() == 0:  # Coordinator worker creates and distributes the shards
    create_and_send_shards(dataset)

shard = read_next_shard_locally()
while shard is not None:
    model.train(shard)
    shard = read_next_shard_locally()
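TensorFlow exposes the same idea directly through tf.data's shard transformation. A minimal sketch, assuming the worker count and index come from your cluster configuration:
# Sketch: shard a tf.data pipeline so each worker sees a distinct slice of the data
# (num_workers and worker_index would come from the cluster configuration).
import tensorflow as tf

num_workers = 3
worker_index = 0  # different on each worker

dataset = tf.data.Dataset.from_tensor_slices(list(range(60)))
shard = dataset.shard(num_shards=num_workers, index=worker_index)

print(list(shard.as_numpy_iterator()))  # worker 0 gets elements 0, 3, 6, ...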
5.3. Discussion: Balancing and Automation
The Load Balancing Challenge:
Manual sharding can produce imbalanced shards, where some workers receive far more data than others and finish much later.
Hash Sharding Solution:
Instead of splitting the data manually, hash a stable key for each example and let the hash value decide which shard it belongs to; a good hash function spreads examples roughly evenly, as in the sketch below.
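A minimal sketch of the idea, assuming each record has a stable identifier to hash:
# Sketch: hash-based shard assignment, assuming each record has a stable ID.
import hashlib

NUM_SHARDS = 3

def shard_for(record_id: str) -> int:
    # A stable hash (unlike Python's built-in hash()) gives the same
    # assignment across processes and runs.
    digest = hashlib.md5(record_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for("example-00042"))  # always maps to the same shard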
Dynamic Resharding:
For datasets that keep growing, automatic resharding periodically redistributes examples (or adds new shards) so the load stays balanced as data arrives.
Insight
Production ML systems use automatic sharding with consistent hashing. This ensures even distribution and allows adding/removing workers without reshuffling all data.
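To make the consistent-hashing idea concrete, here is a toy hash ring in Python; it is only an illustration (production systems add virtual nodes and replication):
# Toy consistent-hash ring: adding or removing a worker only remaps the keys
# that fall between its ring position and its neighbor's, instead of
# reshuffling every key as plain modulo hashing would.
import bisect
import hashlib

def _position(key: str) -> int:
    # Map a string key to a stable position on the ring.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class HashRing:
    def __init__(self, workers):
        self._ring = sorted((_position(w), w) for w in workers)

    def worker_for(self, record_id: str) -> str:
        # Walk clockwise to the first worker at or after the key's position.
        positions = [p for p, _ in self._ring]
        idx = bisect.bisect(positions, _position(record_id)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["worker-0", "worker-1", "worker-2"])
print(ring.worker_for("example-00042"))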
Production Considerations:
- Backup strategy: Must backup multiple shards
- Schema coordination: All shards need consistency
- Failure handling: One worker doesn't stop others
- Monitoring: Track performance across all shards
- Rebalancing frequency: How often to reshard?
- Data freshness: New data distribution strategy
- Migration costs: Moving data between workers
5.4. Exercises
1. Does the sharding pattern use horizontal or vertical partitioning?
2. Where does the model read each shard from?
3. What's an alternative to manual sharding?
6. Caching Pattern
6.1. The Problem: Redundant Multi-Epoch Processing
The Multi-Epoch Necessity:
Modern ML algorithms, especially deep learning, require multiple epochs for convergence:
1 epoch = 1 complete pass through entire training dataset
Typical requirements: 10-100 epochs for good performance
Why Multiple Epochs Are Needed:
Gradient Descent Learning Analogy: Think of learning to ride a bike. You don't master it in one attempt - you need many practice sessions (epochs) to:
- Adjust balance (model parameters)
- Build muscle memory (stable convergence)
- Handle different terrains (various data patterns)
The Time Multiplication Problem:
At 3 hours per epoch, 5 epochs take 15 hours in total, and most of that time goes into repeating the same data loading and preprocessing in every epoch.
The Redundancy Waste: roughly 95% of the repeated pipeline work is redundant; only about 5% is actual learning.
6.2. The Solution: Smart Data Reuse
Browser Cache Analogy:
Your web browser doesn't re-download images every time you visit a website. It caches them locally for instant reuse. We can apply the same principle to ML data:
Caching Architecture:
First epoch: load data (90 min) → preprocess (60 min) → train (90 min)
Later epochs: read preprocessed data from cache (5 min) → train (90 min)
Time savings per epoch after the first: 145 minutes!
Performance comparison: over 5 epochs, that is roughly 20 hours without caching versus about 10.3 hours with caching.
Basic Caching Implementation:
# Pseudocode for caching: the first epoch fills the cache
cache = initialize_cache()
batch = read_next_batch(dataset)
while batch is not None:
    model.train(batch)
    cache.append(batch)
    batch = read_next_batch(dataset)

# Remaining epochs read batches from the cache instead of the source
while current_epoch() <= total_epochs:
    batch = cache.read_next_batch()
    model.train(batch)
Caching with Preprocessing:
# Cache the preprocessed batches so expensive transformations run only once
cache = initialize_cache()
batch = read_next_batch(dataset)
while batch is not None:
    batch = preprocess(batch)  # expensive operations
    model.train(batch)
    cache.append(batch)        # cache the processed version
    batch = read_next_batch(dataset)

# Later epochs skip preprocessing entirely
while current_epoch() <= total_epochs:
    processed_batch = cache.read_next_batch()  # already preprocessed
    model.train(processed_batch)
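tf.data has this pattern built in: placing cache() after the expensive map() step means the preprocessing only runs during the first epoch. A minimal sketch, where the batch size and shuffle buffer are illustrative choices:
# Sketch: cache preprocessed elements with tf.data so map() runs only in epoch 1
# (batch size and shuffle buffer are illustrative choices).
import tensorflow as tf

(train_images, train_labels), _ = tf.keras.datasets.fashion_mnist.load_data()

def preprocess(image, label):
    # Stand-in for expensive per-example work (decode, resize, augment, ...)
    return tf.cast(image, tf.float32) / 255.0, label

dataset = (
    tf.data.Dataset.from_tensor_slices((train_images, train_labels))
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .cache()          # keep preprocessed elements for later epochs
    .shuffle(10_000)  # shuffle after cache so each epoch is reordered
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)

# model.fit(dataset, epochs=10)  # epochs 2..10 read from the cache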
6.3. Discussion: Memory vs Disk Trade-offs
The Storage Decision:
Hybrid approach: keep hot (frequently used) data in memory and cold data on disk to get the best of both worlds.
Memory vs Disk Performance:
Reading from or writing to memory is roughly 6 times faster than disk for sequential access and around 100,000 times faster for random access: RAM access is measured in nanoseconds (on the order of 100 ns), while a hard-drive seek is measured in milliseconds (on the order of 10 ms).
When to Use Each Strategy (a tf.data sketch follows this list):
In-memory cache:
- ✅ Dataset fits in available RAM
- ✅ Frequent random access patterns
- ✅ System stability is high
- ✅ Cost of data reload is very high
- ✅ Short training sessions
- Use when: Speed > Reliability
On-disk cache:
- ✅ Dataset larger than available RAM
- ✅ System crashes are possible
- ✅ Long training sessions (days/weeks)
- ✅ Cost is a major concern
- ✅ Multiple experiments on same data
- Use when: Reliability > Speed
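In tf.data, the choice between the two is a one-argument difference; this sketch assumes preprocessed is an uncached pipeline like the map() stage shown earlier, and the cache file path is illustrative:
# Sketch: in-memory vs on-disk caching in tf.data
# (`preprocessed` is assumed to be a tf.data pipeline after map(); the path is illustrative).
in_memory_cache = preprocessed.cache()                          # fastest, limited by RAM
on_disk_cache = preprocessed.cache("/tmp/fashion_mnist.cache")  # for data that does not fit in RAM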
Cache Invalidation Strategies:
- Time-based: refresh the cache on a schedule (e.g., every 24 hours)
- Event-based: refresh when new data arrives
- Version-based: track the data version and invalidate on change
- Manual: explicit cache-clear commands
Best practice: combine multiple strategies
Insight
Production systems often use tiered caching: frequently accessed data in memory, recent data on local SSD, older data on network storage. This balances speed, cost, and reliability.
As noted in the original research, if training one epoch on the entire training dataset takes 3 hours, every additional epoch adds roughly another 3 hours without caching, which quickly becomes impractical for real-world systems that routinely train for many epochs.
6.4. Exercises
1. Is caching useful for training on the same dataset or different datasets across epochs?
2. What should we store in the cache if the dataset needs preprocessing?
3. Is an on-disk cache faster to access than an in-memory cache?
7. Answers to Exercises
7.1. Section 4.4
1. Sequentially - batches are processed one after another, not in parallel.
2. Yes - that is one of the primary use cases for batching.
3. No - if you need global statistics (like the mean across the entire dataset), you cannot compute them from individual batches alone.
7.2. Section 5.4
1. Horizontal partitioning - we split data by rows (examples), not by columns (features).
2. Locally on each worker machine - each worker reads from its local data shard.
3. Automatic sharding, such as hash sharding - algorithms distribute data evenly without manual intervention.
7.3. Section 6.4
1. The same dataset - caching is beneficial when reusing the same data across multiple epochs.
2. The preprocessed batches - store the processed versions to avoid recomputing expensive transformations.
3. No - an in-memory cache is much faster to access than an on-disk cache (nanoseconds vs milliseconds).
8. Summary
What We Learned:
- Data ingestion: the critical first step that determines ML pipeline success
- Batching: handle memory constraints by processing data in manageable chunks
- Sharding: distribute large datasets across multiple machines for parallel processing
- Caching: eliminate redundant work in multi-epoch training through smart data reuse
Pattern Selection Guide - Key Benefits:
- Batching: train on datasets larger than memory
- Sharding: parallel processing across machines
- Caching: reduce redundant data processing
Insight
Real ML systems combine all three patterns: shard data across workers, batch within each worker, and cache preprocessed results. This creates a robust, scalable data ingestion pipeline.
Next Steps:
In Chapter 3, we'll explore how multiple workers coordinate model training using distributed training patterns. You'll learn how the individual data shards we created actually contribute to a single, coherent model.
Ready to distribute the training itself? → Chapter 3: Distributed Training Patterns
Remember: Data ingestion patterns are the foundation of scalable ML. Master these, and you can handle datasets of any size efficiently.
Previous: Chapter 1: Introduction to Distributed ML | Next: Chapter 3: Distributed Training Patterns