
Chapter 2: Data Ingestion Patterns

Feeding the machine learning beast - efficiently and at scale


Table of Contents

  1. Introduction
  2. What is Data Ingestion?
  3. The Fashion-MNIST Dataset
  4. Batching Pattern
  5. Sharding Pattern
  6. Caching Pattern
  7. Answers to Exercises
  8. Summary

1. Introduction

Restaurant Analogy: Imagine you're running a busy restaurant kitchen. Raw ingredients (data) come in from suppliers, and you need to prep, organize, and serve them to hungry customers (your ML model). But what happens when:

  • You get a massive food truck delivery that won't fit in your prep area?
  • Orders are coming in faster than you can prep ingredients sequentially?
  • You're making the same popular dish over and over?

That's exactly the challenge of data ingestion in machine learning systems.

Data ingestion is the critical first step that determines whether your ML pipeline feasts or starves. Get it wrong, and your expensive GPUs sit idle waiting for data. Get it right, and you unlock the full potential of distributed machine learning.

This chapter explores three fundamental patterns that solve the most common data ingestion challenges: batching, sharding, and caching. These aren't just theoretical concepts - they're battle-tested solutions used by every major ML platform.


2. What is Data Ingestion?

Definition: Data ingestion is the process that monitors data sources, consumes data (either all at once or streaming), and performs preprocessing to prepare for machine learning model training.

Think of it as the supply chain for your ML factory:

Data Ingestion Pipeline: Raw Data Sources → Data Ingestion → Processed Data → Model Training

Two Main Approaches:

                  Streaming Data Ingestion        Non-streaming Data Ingestion
Dataset Size      Grows over time                 Fixed size
Infrastructure    Long-running processes          Batch jobs on demand
Use Case          Real-time feeds, sensor data    Historical analysis, research
Example           Twitter stream, IoT sensors     Static datasets, archives

Insight

Data ingestion is often the hidden bottleneck in ML systems. A 10x improvement in model architecture means nothing if your data pipeline can only feed it at 1/10th the speed.

Key Responsibilities:

  • Monitor data sources for new information
  • Extract data in appropriate formats
  • Transform raw data into ML-ready formats
  • Load data efficiently for training processes
  • Handle failures and inconsistencies gracefully

3. The Fashion-MNIST Dataset

Why Fashion-MNIST over classic MNIST?

The classic MNIST dataset (handwritten digits) has become too easy - most simple models achieve 95%+ accuracy. It's like using a children's puzzle to test engineering skills. Fashion-MNIST provides the same structure but with more realistic complexity.

Dataset Characteristics:

  • Training: 60,000 examples
  • Testing: 10,000 examples
  • Image Size: 28×28 grayscale
  • Classes: 10 fashion categories
  • File Size: ~30 MB compressed

The 10 Fashion Classes (each with ~6,000 training examples):

  • 0: T-shirt/top
  • 1: Trouser
  • 2: Pullover
  • 3: Dress
  • 4: Coat
  • 5: Sandal
  • 6: Shirt
  • 7: Sneaker
  • 8: Bag
  • 9: Ankle boot

Loading Fashion-MNIST with TensorFlow:

# Approach 1: Direct download and load
import tensorflow as tf
train, test = tf.keras.datasets.fashion_mnist.load_data()

# Approach 2: Load from NumPy arrays into tf.data.Dataset
from tensorflow.data import Dataset
images, labels = train
images = images / 255.0 # Normalize pixel values
dataset = Dataset.from_tensor_slices((images, labels))
print(dataset)
# <TensorSliceDataset shapes: ((28, 28), ()), types: (tf.float64, tf.uint8)>

The Memory Problem: When the dataset is wrapped in tf.constant() operations (which from_tensor_slices effectively does once the graph is serialized), the NumPy array contents are copied multiple times and can hit TensorFlow's 2GB limit for the tf.GraphDef protocol buffer. Note: this limit applies during graph serialization (saving models, @tf.function, distribution), not during basic eager execution.
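
One way to sidestep this, assuming the data already sits in NumPy arrays, is to stream it through tf.data.Dataset.from_generator so the arrays are never serialized into the graph as constants. A minimal sketch:

# Minimal sketch: stream NumPy arrays through a generator instead of
# embedding them in the graph via tf.constant / from_tensor_slices.
import tensorflow as tf

(images, labels), _ = tf.keras.datasets.fashion_mnist.load_data()
images = images / 255.0

def example_generator():
    for image, label in zip(images, labels):
        yield image, label  # arrays are read lazily, never baked into the graph

dataset = tf.data.Dataset.from_generator(
    example_generator,
    output_signature=(
        tf.TensorSpec(shape=(28, 28), dtype=tf.float64),
        tf.TensorSpec(shape=(), dtype=tf.uint8),
    ),
)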

The Scale Challenge:

  • Fashion-MNIST: 30 MB (manageable)
  • Real-world datasets: 30 GB - 30 TB (problematic)
  • Production systems: Continuous growth (challenging)

This small dataset serves as our learning foundation, but the patterns we'll explore scale to datasets thousands of times larger.


4. Batching Pattern

4.1. The Problem: Memory Constraints with Large Datasets

The Real Challenge: GPU/RAM Limits

Modern ML training hits hardware memory walls. Your GPU has finite VRAM (often 16GB), but datasets can be massive:

  • Fashion-MNIST: 30 MB (fits easily)
  • ImageNet: ~150 GB (won't fit in GPU memory)
  • Real datasets: Often 100GB-10TB

The Memory Math:

Memory Constraint Problem
Your Hardware: 16GB GPU memory
Large Dataset: 150GB
Problem: 150GB > 16GB (impossible to load at once)
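
To make the mismatch concrete, here is a rough back-of-the-envelope calculation (assuming decoded float32 images at 224×224×3 and an ImageNet-like example count; both numbers are illustrative, not from the original text):

# Back-of-the-envelope memory arithmetic (illustrative assumptions:
# decoded float32 images at 224 x 224 x 3, ImageNet-scale example count).
bytes_per_image = 224 * 224 * 3 * 4                   # ~0.6 MB per decoded image
num_images = 1_300_000
full_dataset_gb = bytes_per_image * num_images / 1e9  # ~780 GB decoded - far beyond 16 GB
batch_of_256_gb = bytes_per_image * 256 / 1e9         # ~0.15 GB - fits easily

print(f"full dataset: {full_dataset_gb:.0f} GB, one batch: {batch_of_256_gb:.2f} GB")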

Processing Overhead:

Beyond storage, operations multiply memory usage:

  • Forward pass: Activations for each layer
  • Backward pass: Gradients for each parameter
  • Optimizer states: Adam requires 2x parameter memory
  • Intermediate computations: Attention matrices, etc.

Note on GraphDef: TensorFlow's tf.constant() has a 2GB limit for graph serialization, but this is separate from the batching pattern. Proper data loading uses tf.data pipelines that stream data without embedding it in the graph.

Additional Processing Costs:

Beyond memory, image processing operations are expensive:

  • Resizing: CPU/GPU intensive
  • Normalization: Mathematical operations on every pixel
  • Augmentation: Rotations, flips, color changes
  • Convolutions: Complex mathematical transformations

4.2. The Solution: Divide and Consume

Restaurant Kitchen Analogy:

Instead of trying to prep all ingredients at once and overwhelming your kitchen, smart chefs prep mise en place (everything in place) in small batches throughout service.

            Bad Kitchen                  Smart Kitchen
Strategy    Prep ALL food at once        Prep Course 1, serve Course 1
Process     Overwhelmed!                 Prep Course 2, serve Course 2
Result      Kitchen fails                Continuous service

The Batching Solution:

Instead of loading the entire dataset into GPU memory, we divide it into mini-batches that fit within hardware constraints:

Batching Strategy

Large Dataset: 1M examples (50GB total)
GPU Memory: 16GB available

Divide the dataset into mini-batches (Batch 1, Batch 2, Batch 3, ... of 32 examples each) and process them sequentially, so only one batch plus its activations, gradients, and optimizer state needs to reside in GPU memory at any time.

Benefits
✅ Fits within GPU memory limits (16GB)
✅ Enables training on datasets larger than hardware
✅ Constant memory usage (only one batch at a time)
✅ Allows gradient updates with realistic batch sizes

Implementation Pattern:

# Pseudocode for batching
batch = read_next_batch(dataset)
while batch is not None:
    model.train(batch)
    batch = read_next_batch(dataset)

TensorFlow Implementation:

# Efficient batching with TensorFlow I/O
import tensorflow_io as tfio

# Stream the dataset from a URL instead of downloading everything up front
d_train = tfio.IODataset.from_mnist(
    'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',
    'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz'
)

# Or stream from a database (PostgreSQL example)
import os
endpoint = "postgresql://{}:{}@{}?port={}&dbname={}".format(
    os.environ['TFIO_DEMO_DATABASE_USER'],
    os.environ['TFIO_DEMO_DATABASE_PASS'],
    os.environ['TFIO_DEMO_DATABASE_HOST'],
    os.environ['TFIO_DEMO_DATABASE_PORT'],
    os.environ['TFIO_DEMO_DATABASE_NAME'],
)

dataset = tfio.experimental.IODataset.from_sql(
    query="SELECT co, pt08s1 FROM AirQualityUCI;",
    endpoint=endpoint
)
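
Neither snippet above actually forms training batches yet. A minimal sketch of how that typically looks with tf.data, using the streamed MNIST dataset from above (the shuffle buffer and batch size are illustrative choices, not values from the original text):

# Minimal sketch: apply the batching pattern to the streamed dataset above.
import tensorflow as tf

batched_train = (
    d_train
    .shuffle(buffer_size=1024)      # decorrelate neighboring examples
    .batch(32)                      # 32 examples per training step
    .prefetch(tf.data.AUTOTUNE)     # overlap data loading with model training
)

for images, labels in batched_train.take(1):
    print(images.shape)             # e.g. (32, 28, 28) for the MNIST stream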

4.3. Discussion: Trade-offs and Considerations

When Batching Works Well:

  • ✅ Operations can be performed on data subsets
  • ✅ Model training is iterative
  • ✅ Memory is limited
  • ✅ Framework has batch support

When Batching Doesn't Work:

  • ❌ Algorithm needs global statistics (mean of entire dataset)
  • ❌ Model requires minimum examples per class per batch
  • ❌ Batch size becomes too critical for model performance

The Batch Size Balancing Act:

            Small Batches (32-128)       Large Batches (1024-8192)
Memory      ✅ Low usage                 ❌ High requirements
Updates     ✅ More frequent             ❌ Slower convergence
Gradients   ❌ Noisy estimates           ✅ Stable estimates
GPU Usage   ❌ Underutilization          ✅ Better utilization

Sweet Spot: Often 64-512 for most applications

Insight

Modern frameworks like AdaptDL automatically tune the batch size by measuring system throughput and gradient noise, enabling efficient distributed training without manual tuning.

The original research compares automatically and manually tuned batch sizes on overall training time, demonstrating the efficiency benefits of the automatic approach.

4.4. Exercises

  1. Are we training the model using batches in parallel or sequentially?

  2. If the ML framework cannot handle large datasets, can we use the batching pattern?

  3. If a model requires knowing the mean of a feature across the entire dataset, can we still use batching?


5. Sharding Pattern

5.1. The Problem: Sequential Processing Bottleneck

Scaling Beyond Single Machine Limits:

The batching pattern works great until you hit the sequential processing wall. Let's scale our Fashion-MNIST example:

Scale-Up Scenario:

  • Original Fashion-MNIST: 30 MB, 60,000 examples
  • Scaled dataset: 30 GB, 60,000,000 examples (1000x larger)
  • Batch size: 5 GB each (6 total batches)
  • Processing time per batch: 1 hour

Sequential Processing Problem

Sequential batch processing: B1 (1h) → B2 (1h) → B3 (1h) → B4 (1h) → B5 (1h) → B6 (1h) = 6 hours total.
Problem: only one machine is working at a time!

The Utilization Problem:

Resource Waste
  • Machine A: [BUSY]
  • Machine B: [IDLE]
  • Machine C: [IDLE]

Resource Utilization: 33% (wasteful!)

The Real Cost:

  • Time: Sequential processing instead of parallel
  • Money: Paying for multiple machines but using only one
  • Opportunity: Delayed model deployment, inefficient resource usage

5.2. The Solution: Horizontal Data Distribution

Encyclopedia Analogy:

Imagine three friends need to read a 30-volume encyclopedia for an exam:

            Bad Strategy (Sequential)    Smart Strategy (Sharding)
Friend A    Reads Vol 1-30               Reads Vol 1-10
Friend B    Reads Vol 1-30               Reads Vol 11-20
Friend C    Reads Vol 1-30               Reads Vol 21-30
Result      Sequential, inefficient      Parallel, optimal

Sharding Architecture:

Sharding Strategy

Hypothetical Dataset: 60 million examples

Split the dataset into three shards of 20M examples each and distribute them: Shard 1 → Worker A, Shard 2 → Worker B, Shard 3 → Worker C. All three workers then train in parallel, so processing time drops to roughly a third of the sequential time (minus coordination overhead).

Horizontal vs Vertical Partitioning:

Database Partitioning Comparison

Original Table

ID | Name | Age | Job | Salary
 1 | Ann  | 25  | Dev | 50k
 2 | Bob  | 30  | Mgr | 60k
 3 | Cal  | 35  | Dev | 55k
 4 | Dan  | 40  | Dir | 80k

Horizontal Partitioning (Sharding)
  • Shard A: Rows 1-2 (Ann, Bob)
  • Shard B: Rows 3-4 (Cal, Dan)

Vertical Partitioning
  • Part A: ID, Name, Age
  • Part B: ID, Job, Salary

Sharding Implementation:

# Pseudocode for sharding
if get_worker_rank() == 0:  # Coordinator worker
    create_and_send_shards(dataset)

shard = read_next_shard_locally()
while shard is not None:
    model.train(shard)
    shard = read_next_shard_locally()
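
With tf.data, the per-worker read can be expressed with Dataset.shard. The file pattern and worker count below are hypothetical; this is a sketch of the idea, not a complete distributed setup:

# Minimal sketch of per-worker sharding with tf.data, assuming each worker
# knows its rank and the total worker count (e.g. from the cluster environment).
import tensorflow as tf

def make_worker_dataset(file_pattern, worker_index, num_workers, batch_size=32):
    files = tf.data.Dataset.list_files(file_pattern, shuffle=False)
    # Each worker keeps every num_workers-th file, starting at its own index,
    # so the shards are disjoint and together cover the whole dataset.
    files = files.shard(num_shards=num_workers, index=worker_index)
    dataset = tf.data.TFRecordDataset(files)
    return dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)

# Hypothetical usage on worker 1 of 3:
# ds = make_worker_dataset("gs://my-bucket/train-*.tfrecord", worker_index=1, num_workers=3)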

5.3. Discussion: Balancing and Automation

The Load Balancing Challenge:

Manual sharding can lead to imbalanced shards:

            Bad Sharding (Manual)            Good Sharding (Balanced)
Worker A    Large shard - long time          Equal shard - same time
Worker B    Small shard - short time         Equal shard - same time
Worker C    Small shard - short time         Equal shard - same time
Result      B & C wait! No improvement       All finish together - optimal

Hash Sharding Solution:

Instead of manual splitting, use hash-based distribution:

Hash Sharding

  1. Data Record: ID: 12345, image: [...], label: 'T-shirt'
  2. Hash Function: hash(12345) = 8947362
  3. Shard Assignment: 8947362 % 3 = 0
  Result: Record goes to Worker A (Shard 0)

Benefits
✅ Automatic load balancing
✅ Even distribution
✅ Scales with worker count
✅ No manual intervention needed
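
A minimal sketch of hash-based shard assignment (the record IDs and worker count are illustrative; a stable hash such as MD5 is used instead of Python's built-in hash(), which is randomized across runs for strings):

# Minimal sketch of hash sharding: a stable hash of the record ID, modulo
# the number of workers, decides which shard a record lands in.
import hashlib

NUM_WORKERS = 3

def assign_shard(record_id: str, num_workers: int = NUM_WORKERS) -> int:
    digest = hashlib.md5(record_id.encode()).hexdigest()
    return int(digest, 16) % num_workers

shards = {i: [] for i in range(NUM_WORKERS)}
for record_id in ("12340", "12341", "12342", "12343", "12344", "12345"):
    shards[assign_shard(record_id)].append(record_id)

print({worker: len(ids) for worker, ids in shards.items()})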

Dynamic Resharding:

For growing datasets, automatic resharding maintains balance:

Resharding Process
  • Initial State: 2 workers, 10M examples each
  • After Growth: 2 workers, 15M examples each
  • After Resharding: 3 workers, 10M examples each

Insight

Production ML systems use automatic sharding with consistent hashing. This ensures even distribution and allows adding/removing workers without reshuffling all data.
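
To make the consistent-hashing idea concrete, here is a minimal, illustrative hash-ring sketch (the worker names and virtual-node count are hypothetical; production systems use many more virtual nodes and a real membership protocol):

# Minimal sketch of consistent hashing for shard assignment.
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, workers, vnodes=100):
        # Place several virtual nodes per worker around the ring for balance.
        self._ring = sorted(
            (_hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def assign(self, record_id: str) -> str:
        # Walk clockwise to the first virtual node at or after the record's hash.
        idx = bisect.bisect(self._keys, _hash(record_id)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["worker-a", "worker-b", "worker-c"])
print(ring.assign("record-12345"))

# Adding a worker only remaps the records that fall into its ring segments,
# instead of reshuffling every record:
ring_with_d = HashRing(["worker-a", "worker-b", "worker-c", "worker-d"])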

Production Considerations:

Operational Complexity
  • Backup strategy: Must backup multiple shards
  • Schema coordination: All shards need consistency
  • Failure handling: One worker doesn't stop others
  • Monitoring: Track performance across all shards
Growing Dataset Challenges
  • Rebalancing frequency: How often to reshard?
  • Data freshness: New data distribution strategy
  • Migration costs: Moving data between workers

5.4. Exercises

  1. Does the sharding pattern use horizontal or vertical partitioning?

  2. Where does the model read each shard from?

  3. What's an alternative to manual sharding?


6. Caching Pattern

6.1. The Problem: Redundant Multi-Epoch Processing

The Multi-Epoch Necessity:

Modern ML algorithms, especially deep learning, require multiple epochs for convergence:

Multi-Epoch Training

1 epoch = 1 complete pass through entire training dataset

Fashion-MNIST Training Example
Epoch 1: Process all 60,000 examples → Initial learning
Epoch 2: Process all 60,000 examples → Improved performance
Epoch 3: Process all 60,000 examples → Better convergence
Epoch 4: Process all 60,000 examples → Further refinement
Epoch 5: Process all 60,000 examples → Stable performance

Typical requirements: 10-100 epochs for good performance
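
For concreteness, a minimal Keras training run on Fashion-MNIST is sketched below; the architecture and epochs=5 are illustrative choices, not prescriptions. Every epoch iterates over all 60,000 examples, which is exactly why the redundant per-epoch work discussed next matters:

# Minimal sketch of multi-epoch training on Fashion-MNIST.
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# Each epoch is one complete pass over all 60,000 examples; without caching,
# any input-pipeline work is repeated on every one of these passes.
model.fit(x_train, y_train, epochs=5)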

Why Multiple Epochs Are Needed:

Gradient Descent Learning Analogy: Think of learning to ride a bike. You don't master it in one attempt - you need many practice sessions (epochs) to:

  • Adjust balance (model parameters)
  • Build muscle memory (stable convergence)
  • Handle different terrains (various data patterns)

The Time Multiplication Problem:

Training Time Without Caching
Single Epoch: 3 hours (Load → Process → Train)
Traditional Multi-Epoch Training
Epoch 1: Load→Process→Train (3 hours)
Epoch 2: Load→Process→Train (3 hours)
Epoch 3: Load→Process→Train (3 hours)
Epoch 4: Load→Process→Train (3 hours)
Epoch 5: Load→Process→Train (3 hours)

Total: 15 hours for 5 epochs
Problem: Repeating data loading and processing

The Redundancy Waste:

Redundant Operations Per Epoch
What's Being Repeated?
❌ Download dataset from source (again)
❌ Parse file formats (again)
❌ Apply preprocessing transforms (again)
❌ Normalize pixel values (again)
❌ Reshape tensors (again)
✅ Train model (only this changes!)

The bulk of each epoch repeats work whose output never changes; only the training step advances learning.

6.2. The Solution: Smart Data Reuse

Browser Cache Analogy:

Your web browser doesn't re-download images every time you visit a website. It caches them locally for instant reuse. We can apply the same principle to ML data:

                      Web Browsing                       ML Training
First Visit/Epoch     Download image, store in cache     Load & process, store in cache
Later Visits/Epochs   Load from cache (instant!)         Load from cache (much faster!)

Caching Architecture:

Caching Strategy

Epoch 1 (cache building): Load data (90 min) → Process data (60 min) → Cache & train (90 min) = 240 minutes
Epochs 2-N (cache usage): Load from cache (5 min) → Train model (90 min) = 95 minutes

Time Savings Per Epoch: 145 minutes!

Performance Comparison:

           Without Caching     With Caching
Epoch 1    185 min             185 min (build cache)
Epoch 2    185 min             95 min (use cache)
Epoch 3    185 min             95 min (use cache)
Epoch 4    185 min             95 min (use cache)
Epoch 5    185 min             95 min (use cache)
Total      925 minutes         565 minutes (39% savings!)

Basic Caching Implementation:

# Pseudocode for caching
cache = initialize_cache()
batch = read_next_batch(dataset)
while batch is not None:            # Epoch 1: train and fill the cache
    model.train(batch)
    cache.append(batch)
    batch = read_next_batch(dataset)

# For the remaining epochs, read batches from the cache instead of the source
while current_epoch() <= total_epochs:
    batch = cache.read_next_batch()
    model.train(batch)

Caching with Preprocessing:

# Cache preprocessed data to avoid reprocessing
cache = initialize_cache()
batch = read_next_batch(dataset)
while batch is not None:            # Epoch 1: preprocess, train, and cache
    batch = preprocess(batch)       # Expensive transforms run only once
    model.train(batch)
    cache.append(batch)             # Cache the processed version
    batch = read_next_batch(dataset)

# Later epochs skip preprocessing entirely
while current_epoch() <= total_epochs:
    processed_batch = cache.read_next_batch()  # Already preprocessed!
    model.train(processed_batch)
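
In TensorFlow, the same idea is built into tf.data. A minimal sketch (the normalization step and the cache file path are illustrative; cache() with no argument keeps data in memory, while cache(filename) spills it to disk):

# Minimal sketch of the caching pattern with tf.data.
import tensorflow as tf

(train_images, train_labels), _ = tf.keras.datasets.fashion_mnist.load_data()

def preprocess(image, label):
    # Expensive per-example work we only want to pay for once
    return tf.cast(image, tf.float32) / 255.0, label

dataset = (
    tf.data.Dataset.from_tensor_slices((train_images, train_labels))
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .cache()                                # in-memory cache of processed data
    # .cache("/tmp/fashion_mnist_cache")    # disk-backed alternative
    .shuffle(1024)                          # shuffle after cache so order varies per epoch
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)
)

# The first epoch fills the cache; epochs 2+ read preprocessed data from it.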

6.3. Discussion: Memory vs Disk Trade-offs

The Storage Decision:

              In-Memory Cache                     Disk Cache
Speed         ✅ Ultra-fast (nanoseconds)         ❌ Slower (milliseconds)
Reliability   ❌ Lost on crash                    ✅ Persistent across failures
Capacity      ❌ Limited by RAM                   ✅ Handles datasets larger than RAM
Cost          ❌ Expensive for large datasets     ✅ Cheap storage
I/O           ✅ No bottlenecks                   ❌ Can become a bottleneck

Hybrid Approach: Hot data in memory, cold on disk - Best of both worlds

Memory vs Disk Performance:

Reading from or writing to memory is about 6 times faster for sequential access and about 100,000 times faster for random access compared to disk. RAM access is measured in nanoseconds, while hard drive access is measured in milliseconds.

When to Use Each Strategy:

In-Memory Caching
  • ✅ Dataset fits in available RAM
  • ✅ Frequent random access patterns
  • ✅ System stability is high
  • ✅ Cost of data reload is very high
  • ✅ Short training sessions
  • Use when: Speed > Reliability
Disk Caching
  • ✅ Dataset larger than available RAM
  • ✅ System crashes are possible
  • ✅ Long training sessions (days/weeks)
  • ✅ Cost is a major concern
  • ✅ Multiple experiments on same data
  • Use when: Reliability > Speed

Cache Invalidation Strategies:

Time-Based Expiration
  • Refresh cache every 24 hours
Event-Based Invalidation
  • Refresh when new data arrives
Version-Based Tracking
  • Track data version, invalidate on change
Manual Control
  • Explicit cache clear commands

Best Practice: Combine multiple strategies

Insight

Production systems often use tiered caching: frequently accessed data in memory, recent data on local SSD, older data on network storage. This balances speed, cost, and reliability.

As noted in the original research, if one epoch over the full training dataset takes 3 hours, training without caching scales that cost linearly with the number of epochs, which quickly becomes impractical for real-world systems that often require many epochs.

6.4. Exercises

  1. Is caching useful for training on the same dataset or different datasets across epochs?

  2. What should we store in the cache if the dataset needs preprocessing?

  3. Is an on-disk cache faster to access than an in-memory cache?


7. Answers to Exercises

7.1. Section 4.4

  1. Sequentially - Batches are processed one after another, not in parallel.

  2. Yes - That's one of the primary use cases for batching.

  3. No - If you need global statistics (like the mean across the entire dataset), you can't compute it from individual batches alone.

7.2. Section 5.4

  1. Horizontal partitioning - We split data by rows (examples), not by columns (features).

  2. Locally on each worker machine - Each worker reads from its local data shard.

  3. Automatic sharding, such as hash sharding - Uses algorithms to distribute data evenly without manual intervention.

7.3. Section 6.4

  1. Same dataset - Caching is beneficial when reusing the same data across multiple epochs.

  2. The preprocessed batches - Store the processed versions to avoid recomputing expensive transformations.

  3. No - In-memory cache is much faster to access than disk cache (nanoseconds vs milliseconds).


8. Summary

What We Learned:

Data Ingestion Fundamentals
  • The critical first step that determines ML pipeline success
Batching Pattern
  • Handle memory constraints by processing data in manageable chunks
Sharding Pattern
  • Distribute large datasets across multiple machines for parallel processing
Caching Pattern
  • Eliminate redundant work in multi-epoch training through smart data reuse

Pattern Selection Guide:

Challenge                          Without the pattern        Solution & When to Use
Dataset too large for memory       ❌ Cannot load it all       ✅ Batching - single machine, memory constraints
Training too slow sequentially     ❌ One worker, rest idle    ✅ Sharding - multiple machines available
Multi-epoch training wasteful      ❌ Repeating work           ✅ Caching - repeated data access needed

Key Benefits:

Pattern Benefits

  • Batching: Train on datasets larger than memory
  • Sharding: Parallel processing across machines
  • Caching: Reduce redundant data processing

Insight

Real ML systems combine all three patterns: shard data across workers, batch within each worker, and cache preprocessed results. This creates a robust, scalable data ingestion pipeline.

Next Steps:

In Chapter 3, we'll explore how multiple workers coordinate model training using distributed training patterns. You'll learn how the individual data shards we created actually contribute to a single, coherent model.

Ready to distribute the training itself? Continue with Chapter 3: Distributed Training Patterns.


Remember: Data ingestion patterns are the foundation of scalable ML. Master these, and you can handle datasets of any size efficiently.

