Chapter 2: Data Ingestion Patterns
Feeding the machine learning beast - efficiently and at scale
Table of Contents
- Introduction
- What is Data Ingestion?
- The Fashion-MNIST Dataset
- Batching Pattern
- Sharding Pattern
- Caching Pattern
- Answers to Exercises
- 7.1. Section 4.4
- 7.2. Section 5.4
- 7.3. Section 6.4
- Summary
1. Introduction
Restaurant Analogy: Imagine you're running a busy restaurant kitchen. Raw ingredients (data) come in from suppliers, and you need to prep, organize, and serve them to hungry customers (your ML model). But what happens when:
- You get a massive food truck delivery that won't fit in your prep area?
- Orders are coming in faster than you can prep ingredients sequentially?
- You're making the same popular dish over and over?
That's exactly the challenge of data ingestion in machine learning systems.
Data ingestion is the critical first step that determines whether your ML pipeline will feast or famine. Get it wrong, and your expensive GPUs sit idle waiting for data. Get it right, and you unlock the full potential of distributed machine learning.
This chapter explores three fundamental patterns that solve the most common data ingestion challenges: batching, sharding, and caching. These aren't just theoretical concepts - they're battle-tested solutions used by every major ML platform.
2. What is Data Ingestion?
Definition: Data ingestion is the process that monitors data sources, consumes data (either all at once or streaming), and performs preprocessing to prepare for machine learning model training.
Think of it as the supply chain for your ML factory:
Data sources → Ingestion → Prepared data → Training
Two Main Approaches:
- Batch ingestion: consume the available data all at once
- Streaming ingestion: consume data continuously as it arrives
Insight
Data ingestion is often the hidden bottleneck in ML systems. A 10x improvement in model architecture means nothing if your data pipeline can only feed it at 1/10th the speed.
Key Responsibilities:
- Monitor data sources for new information
- Extract data in appropriate formats
- Transform raw data into ML-ready formats
- Load data efficiently for training processes
- Handle failures and inconsistencies gracefully
3. The Fashion-MNIST Dataset
Why Fashion-MNIST over classic MNIST?
The classic MNIST dataset (handwritten digits) has become too easy - most simple models achieve 95%+ accuracy. It's like using a children's puzzle to test engineering skills. Fashion-MNIST provides the same structure but with more realistic complexity.
Dataset Characteristics:
- Training: 60,000 examples
- Testing: 10,000 examples
- Image Size: 28×28 grayscale
- Classes: 10 fashion categories
- File Size: ~30 MB compressed
The 10 Fashion Classes (each with ~6,000 training examples):
- T-shirt/top
- Trouser
- Pullover
- Dress
- Coat
- Sandal
- Shirt
- Sneaker
- Bag
- Ankle boot
Loading Fashion-MNIST with TensorFlow:
# Approach 1: Direct download and load
import tensorflow as tf
train, test = tf.keras.datasets.fashion_mnist.load_data()

# Approach 2: Load the NumPy arrays into a tf.data.Dataset
images, labels = train
images = images / 255.0  # Normalize pixel values to [0, 1]
dataset = tf.data.Dataset.from_tensor_slices((images, labels))
print(dataset)
# <TensorSliceDataset shapes: ((28, 28), ()), types: (tf.float64, tf.uint8)>
The Memory Problem: from_tensor_slices() embeds the NumPy arrays in the graph as tf.constant() operations, which can copy the array contents multiple times and run into TensorFlow's 2 GB limit for the tf.GraphDef protocol buffer. Note that this limit applies during graph serialization (saving models, @tf.function tracing, distribution), not during basic eager execution.
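One common way around this is to stream the arrays through a Python generator instead of embedding them as graph constants. A minimal sketch, reusing the images and labels arrays from the snippet above:
# Sketch: stream NumPy data through a generator so it is not embedded
# as tf.constant nodes in a serialized graph (reuses `images`/`labels` above).
import tensorflow as tf

def example_generator():
    for image, label in zip(images, labels):
        yield image, label

dataset = tf.data.Dataset.from_generator(
    example_generator,
    output_signature=(
        tf.TensorSpec(shape=(28, 28), dtype=tf.float64),
        tf.TensorSpec(shape=(), dtype=tf.uint8),
    ),
)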
The Scale Challenge:
- Fashion-MNIST: 30 MB (manageable)
- Real-world datasets: 30 GB - 30 TB (problematic)
- Production systems: Continuous growth (challenging)
This small dataset serves as our learning foundation, but the patterns we'll explore scale to datasets thousands of times larger.
4. Batching Pattern
4.1. The Problem: Memory Constraints with Large Datasets
The Real Challenge: GPU/RAM Limits
Modern ML training hits hardware memory walls. Your GPU has finite VRAM (often 16GB), but datasets can be massive:
- Fashion-MNIST: 30 MB (fits easily)
- ImageNet: ~150 GB (won't fit in GPU memory)
- Real datasets: Often 100GB-10TB
Processing Overhead:
Beyond storage, operations multiply memory usage:
- Forward pass: Activations for each layer
- Backward pass: Gradients for each parameter
- Optimizer states: Adam keeps two moment estimates, roughly 2x extra parameter memory
- Intermediate computations: Attention matrices, etc.
Note on GraphDef: TensorFlow's tf.constant() has a 2GB limit for graph serialization, but this is separate from the batching pattern. Proper data loading uses tf.data pipelines that stream data without embedding it in the graph.
Additional Processing Costs:
Beyond memory, the image-processing operations themselves are expensive (a sketch of a typical preprocessing function follows this list):
- Resizing: CPU/GPU intensive
- Normalization: Mathematical operations on every pixel
- Augmentation: Rotations, flips, color changes
- Convolutions: Complex mathematical transformations
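As a rough illustration of where that cost comes from, a per-image preprocessing function might look like the following; the resize target and augmentation choices are illustrative assumptions, not requirements of the pattern:
# Sketch: typical per-image preprocessing with TensorFlow ops
# (resize target and augmentations are illustrative choices).
import tensorflow as tf

def preprocess(image, label):
    image = tf.cast(image, tf.float32) / 255.0      # normalization
    image = tf.expand_dims(image, axis=-1)          # add a channel dim: (28, 28, 1)
    image = tf.image.resize(image, [32, 32])        # resizing
    image = tf.image.random_flip_left_right(image)  # augmentation
    return image, label

# Typically applied lazily over a tf.data pipeline:
# dataset = dataset.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)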
4.2. The Solution: Divide and Consume
Restaurant Kitchen Analogy:
Instead of trying to prep all ingredients at once and overwhelming your kitchen, smart chefs prep mise en place (everything in place) in small batches throughout service.
The Batching Solution:
Instead of loading the entire dataset into GPU memory, we divide it into mini-batches that fit within hardware constraints:
Large dataset: 1M examples (50 GB total)
GPU memory: 16 GB available
Mini-batches: 32 images per batch (~2 GB of GPU memory per step), processed one batch at a time
Implementation Pattern:
# Pseudocode for batching
batch = read_next_batch(dataset)
while batch is not None:
    model.train(batch)
    batch = read_next_batch(dataset)
TensorFlow Implementation:
# Efficient batching with TensorFlow I/O
import tensorflow_io as tfio

# Stream the dataset directly from a URL instead of downloading everything first
d_train = tfio.IODataset.from_mnist(
    'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',
    'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz',
)
d_train = d_train.batch(32)  # consume the stream in mini-batches

# Or stream from a database (PostgreSQL example)
import os

endpoint = "postgresql://{}:{}@{}?port={}&dbname={}".format(
    os.environ['TFIO_DEMO_DATABASE_USER'],
    os.environ['TFIO_DEMO_DATABASE_PASS'],
    os.environ['TFIO_DEMO_DATABASE_HOST'],
    os.environ['TFIO_DEMO_DATABASE_PORT'],
    os.environ['TFIO_DEMO_DATABASE_NAME'],
)
dataset = tfio.experimental.IODataset.from_sql(
    query="SELECT co, pt08s1 FROM AirQualityUCI;",
    endpoint=endpoint,
)
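The same pattern works for the in-memory Fashion-MNIST arrays from Section 3; a minimal tf.data sketch, where the batch size of 32 and the shuffle buffer are illustrative choices:
# Sketch: batching the in-memory Fashion-MNIST arrays with tf.data
# (batch size and shuffle buffer are illustrative choices).
import tensorflow as tf

(train_images, train_labels), _ = tf.keras.datasets.fashion_mnist.load_data()
train_images = train_images / 255.0

dataset = (
    tf.data.Dataset.from_tensor_slices((train_images, train_labels))
    .shuffle(buffer_size=10_000)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)

for image_batch, label_batch in dataset.take(1):
    print(image_batch.shape)  # (32, 28, 28)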
4.3. Discussion: Trade-offs and Considerations
When Batching Works Well:
- ✅ Operations can be performed on data subsets
- ✅ Model training is iterative
- ✅ Memory is limited
- ✅ Framework has batch support
When Batching Doesn't Work:
- ❌ Algorithm needs global statistics (mean of entire dataset)
- ❌ Model requires minimum examples per class per batch
- ❌ Batch size becomes too critical for model performance
The Batch Size Balancing Act: smaller batches fit easily in memory but give noisier gradient estimates and lower hardware utilization, while larger batches give smoother gradients and better throughput but use more memory and can hurt generalization.
Sweet spot: often 64-512 for most applications
Insight
Modern frameworks like AdaptDL automatically tune batch sizes by measuring system throughput and gradient noise, enabling efficient distributed training without manual batch-size tuning.
The original research compares the effects of automatically and manually tuned batch sizes on overall training time, demonstrating the efficiency benefits of automatic approaches.
4.4. Exercises
1. Are we training the model using batches in parallel or sequentially?
2. If the ML framework cannot handle large datasets, can we use the batching pattern?
3. If a model requires knowing the mean of a feature across the entire dataset, can we still use batching?
5. Sharding Pattern
5.1. The Problem: Sequential Processing Bottleneck
Scaling Beyond Single Machine Limits:
The batching pattern works great until you hit the sequential processing wall. Let's scale our Fashion-MNIST example:
Scale-Up Scenario:
- Original Fashion-MNIST: 30 MB, 60,000 examples
- Scaled dataset: 30 GB, 60,000,000 examples (1000x larger)
- Batch size: 5 GB each (6 total batches)
- Processing time per batch: 1 hour
Sequential batch processing: 6 batches × 1 hour each = 6 hours total
The Utilization Problem: with three worker machines available but only one busy at a time, the other two sit idle.
Resource utilization: 33% (wasteful!)
The Real Cost:
- Time: Sequential processing instead of parallel
- Money: Paying for multiple machines but using only one
- Opportunity: Delayed model deployment, inefficient resource usage
5.2. The Solution: Horizontal Data Distribution
Encyclopedia Analogy:
Imagine three friends need to read a 30-volume encyclopedia for an exam. Instead of each reading all 30 volumes, they split the set: each friend reads 10 volumes, and together they cover the whole encyclopedia in a third of the time.
Sharding Architecture:
Hypothetical dataset: 60 million examples, split into three shards of 20 million examples each. Each worker trains on its own shard, so the shards are processed in parallel rather than sequentially.
Horizontal vs Vertical Partitioning:
| ID | Name | Age | Job | Salary |
|----|------|-----|-----|--------|
| 1  | Ann  | 25  | Dev | 50k    |
| 2  | Bob  | 30  | Mgr | 60k    |
| 3  | Cal  | 35  | Dev | 55k    |
| 4  | Dan  | 40  | Dir | 80k    |
Horizontal partitioning (sharding): Shard A = rows 1-2 (Ann, Bob); Shard B = rows 3-4 (Cal, Dan)
Vertical partitioning: Part A = columns ID, Name, Age; Part B = columns ID, Job, Salary
Sharding Implementation:
# Pseudocode for sharding
if get_worker_rank() == 0:  # Coordinator worker creates and distributes the shards
    create_and_send_shards(dataset)

shard = read_next_shard_locally()
while shard is not None:
    model.train(shard)
    shard = read_next_shard_locally()
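TensorFlow exposes the same idea directly through tf.data's shard transformation. A minimal sketch, assuming the worker count and index come from your cluster configuration:
# Sketch: shard a tf.data pipeline so each worker sees a distinct slice of the data
# (num_workers and worker_index would come from the cluster configuration).
import tensorflow as tf

num_workers = 3
worker_index = 0  # different on each worker

dataset = tf.data.Dataset.from_tensor_slices(list(range(60)))
shard = dataset.shard(num_shards=num_workers, index=worker_index)

print(list(shard.as_numpy_iterator()))  # worker 0 gets elements 0, 3, 6, ...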
5.3. Discussion: Balancing and Automation
The Load Balancing Challenge:
Manual sharding can produce imbalanced shards, where some workers receive far more data than others and finish much later.
Hash Sharding Solution:
Instead of splitting the data manually, hash a stable key for each example and let the hash value decide which shard it belongs to; a good hash function spreads examples roughly evenly, as in the sketch below.
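A minimal sketch of the idea, assuming each record has a stable identifier to hash:
# Sketch: hash-based shard assignment, assuming each record has a stable ID.
import hashlib

NUM_SHARDS = 3

def shard_for(record_id: str) -> int:
    # A stable hash (unlike Python's built-in hash()) gives the same
    # assignment across processes and runs.
    digest = hashlib.md5(record_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for("example-00042"))  # always maps to the same shard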
Dynamic Resharding:
For datasets that keep growing, automatic resharding periodically redistributes examples (or adds new shards) so the load stays balanced as data arrives.
Insight
Production ML systems use automatic sharding with consistent hashing. This ensures even distribution and allows adding/removing workers without reshuffling all data.
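To make the consistent-hashing idea concrete, here is a toy hash ring in Python; it is only an illustration (production systems add virtual nodes and replication):
# Toy consistent-hash ring: adding or removing a worker only remaps the keys
# that fall between its ring position and its neighbor's, instead of
# reshuffling every key as plain modulo hashing would.
import bisect
import hashlib

def _position(key: str) -> int:
    # Map a string key to a stable position on the ring.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class HashRing:
    def __init__(self, workers):
        self._ring = sorted((_position(w), w) for w in workers)

    def worker_for(self, record_id: str) -> str:
        # Walk clockwise to the first worker at or after the key's position.
        positions = [p for p, _ in self._ring]
        idx = bisect.bisect(positions, _position(record_id)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["worker-0", "worker-1", "worker-2"])
print(ring.worker_for("example-00042"))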
Production Considerations:
- Backup strategy: Must backup multiple shards
- Schema coordination: All shards need consistency
- Failure handling: One worker doesn't stop others
- Monitoring: Track performance across all shards
- Rebalancing frequency: How often to reshard?
- Data freshness: New data distribution strategy
- Migration costs: Moving data between workers
5.4. Exercises
1. Does the sharding pattern use horizontal or vertical partitioning?
2. Where does the model read each shard from?
3. What's an alternative to manual sharding?
6. Caching Pattern
6.1. The Problem: Redundant Multi-Epoch Processing
The Multi-Epoch Necessity:
Modern ML algorithms, especially deep learning, require multiple epochs for convergence:
1 epoch = 1 complete pass through entire training dataset
Typical requirements: 10-100 epochs for good performance
Why Multiple Epochs Are Needed:
Gradient Descent Learning Analogy: Think of learning to ride a bike. You don't master it in one attempt - you need many practice sessions (epochs) to:
- Adjust balance (model parameters)
- Build muscle memory (stable convergence)
- Handle different terrains (various data patterns)
The Time Multiplication Problem:
At 3 hours per epoch, 5 epochs take 15 hours in total, and most of that time goes into repeating the same data loading and preprocessing in every epoch.
The Redundancy Waste: roughly 95% of the repeated pipeline work is redundant; only about 5% is actual learning.
6.2. The Solution: Smart Data Reuse
Browser Cache Analogy:
Your web browser doesn't re-download images every time you visit a website. It caches them locally for instant reuse. We can apply the same principle to ML data:
Caching Architecture:
First epoch: load data (90 min) → preprocess (60 min) → train (90 min)
Later epochs: read preprocessed data from cache (5 min) → train (90 min)
Time savings per epoch after the first: 145 minutes!
Performance comparison: over 5 epochs, that is roughly 20 hours without caching versus about 10.3 hours with caching.
Basic Caching Implementation:
# Pseudocode for caching: the first epoch fills the cache
cache = initialize_cache()
batch = read_next_batch(dataset)
while batch is not None:
    model.train(batch)
    cache.append(batch)
    batch = read_next_batch(dataset)

# Remaining epochs read batches from the cache instead of the source
while current_epoch() <= total_epochs:
    batch = cache.read_next_batch()
    model.train(batch)
Caching with Preprocessing:
# Cache the preprocessed batches so expensive transformations run only once
cache = initialize_cache()
batch = read_next_batch(dataset)
while batch is not None:
    batch = preprocess(batch)  # expensive operations
    model.train(batch)
    cache.append(batch)        # cache the processed version
    batch = read_next_batch(dataset)

# Later epochs skip preprocessing entirely
while current_epoch() <= total_epochs:
    processed_batch = cache.read_next_batch()  # already preprocessed
    model.train(processed_batch)
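tf.data has this pattern built in: placing cache() after the expensive map() step means the preprocessing only runs during the first epoch. A minimal sketch, where the batch size and shuffle buffer are illustrative choices:
# Sketch: cache preprocessed elements with tf.data so map() runs only in epoch 1
# (batch size and shuffle buffer are illustrative choices).
import tensorflow as tf

(train_images, train_labels), _ = tf.keras.datasets.fashion_mnist.load_data()

def preprocess(image, label):
    # Stand-in for expensive per-example work (decode, resize, augment, ...)
    return tf.cast(image, tf.float32) / 255.0, label

dataset = (
    tf.data.Dataset.from_tensor_slices((train_images, train_labels))
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .cache()          # keep preprocessed elements for later epochs
    .shuffle(10_000)  # shuffle after cache so each epoch is reordered
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)

# model.fit(dataset, epochs=10)  # epochs 2..10 read from the cache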
6.3. Discussion: Memory vs Disk Trade-offs
The Storage Decision:
Hybrid approach: keep hot (frequently used) data in memory and cold data on disk to get the best of both worlds.
Memory vs Disk Performance:
Reading from or writing to memory is roughly 6 times faster than disk for sequential access and around 100,000 times faster for random access: RAM access is measured in nanoseconds (on the order of 100 ns), while a hard-drive seek is measured in milliseconds (on the order of 10 ms).
When to Use Each Strategy (a tf.data sketch follows this list):
In-memory cache:
- ✅ Dataset fits in available RAM
- ✅ Frequent random access patterns
- ✅ System stability is high
- ✅ Cost of data reload is very high
- ✅ Short training sessions
- Use when: Speed > Reliability
On-disk cache:
- ✅ Dataset larger than available RAM
- ✅ System crashes are possible
- ✅ Long training sessions (days/weeks)
- ✅ Cost is a major concern
- ✅ Multiple experiments on same data
- Use when: Reliability > Speed
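In tf.data, the choice between the two is a one-argument difference; this sketch assumes preprocessed is an uncached pipeline like the map() stage shown earlier, and the cache file path is illustrative:
# Sketch: in-memory vs on-disk caching in tf.data
# (`preprocessed` is assumed to be a tf.data pipeline after map(); the path is illustrative).
in_memory_cache = preprocessed.cache()                          # fastest, limited by RAM
on_disk_cache = preprocessed.cache("/tmp/fashion_mnist.cache")  # for data that does not fit in RAM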
Cache Invalidation Strategies:
- Time-based: refresh the cache on a schedule (e.g., every 24 hours)
- Event-based: refresh when new data arrives
- Version-based: track the data version and invalidate on change
- Manual: explicit cache-clear commands
Best practice: combine multiple strategies
Insight
Production systems often use tiered caching: frequently accessed data in memory, recent data on local SSD, older data on network storage. This balances speed, cost, and reliability.
As noted in the original research, if training one epoch on the entire training dataset takes 3 hours, every additional epoch adds roughly another 3 hours without caching, which quickly becomes impractical for real-world systems that routinely train for many epochs.
6.4. Exercises
1. Is caching useful for training on the same dataset or different datasets across epochs?
2. What should we store in the cache if the dataset needs preprocessing?
3. Is an on-disk cache faster to access than an in-memory cache?
7. Answers to Exercises
7.1. Section 4.4
1. Sequentially - batches are processed one after another, not in parallel.
2. Yes - that is one of the primary use cases for batching.
3. No - if you need global statistics (like the mean across the entire dataset), you cannot compute them from individual batches alone.
7.2. Section 5.4
1. Horizontal partitioning - we split data by rows (examples), not by columns (features).
2. Locally on each worker machine - each worker reads from its local data shard.
3. Automatic sharding, such as hash sharding - algorithms distribute data evenly without manual intervention.
7.3. Section 6.4
1. The same dataset - caching is beneficial when reusing the same data across multiple epochs.
2. The preprocessed batches - store the processed versions to avoid recomputing expensive transformations.
3. No - an in-memory cache is much faster to access than an on-disk cache (nanoseconds vs milliseconds).
8. Summary
What We Learned:
- Data ingestion: the critical first step that determines ML pipeline success
- Batching: handle memory constraints by processing data in manageable chunks
- Sharding: distribute large datasets across multiple machines for parallel processing
- Caching: eliminate redundant work in multi-epoch training through smart data reuse
Pattern Selection Guide - Key Benefits:
- Batching: train on datasets larger than memory
- Sharding: parallel processing across machines
- Caching: reduce redundant data processing
Insight
Real ML systems combine all three patterns: shard data across workers, batch within each worker, and cache preprocessed results. This creates a robust, scalable data ingestion pipeline.
Next Steps:
In Chapter 3, we'll explore how multiple workers coordinate model training using distributed training patterns. You'll learn how the individual data shards we created actually contribute to a single, coherent model.
Ready to distribute the training itself? → Chapter 3: Distributed Training Patterns
Remember: Data ingestion patterns are the foundation of scalable ML. Master these, and you can handle datasets of any size efficiently.
Previous: Chapter 1: Introduction to Distributed ML | Next: Chapter 3: Distributed Training Patterns