
Chapter 7: Project Overview and System Architecture

Designing and building a complete distributed machine learning system


Table of Contents

  1. Project Overview
  2. Data Ingestion Component
  3. Model Training Component
  4. Model Serving Component
  5. End-to-End Workflow Integration
  6. Summary and Exercises

Think of this project like building a modern photo processing lab - you need efficient input stations (data ingestion), multiple specialized processing units (distributed training), high-speed output machines (replicated serving), and smart workflow management to coordinate everything seamlessly.


1. Project Overview

1.1. Project Background

We'll build a complete image classification system that demonstrates all the patterns learned in previous chapters. This isn't just a toy example - it's a production-ready system architecture that scales to handle real-world demands.

Project Scope:

  • Dataset: Fashion-MNIST (28×28 grayscale images, 10 clothing categories)
  • Task: Multi-class image classification
  • Scale: Distributed training and serving on Kubernetes clusters
  • Technologies: TensorFlow, Kubernetes, Kubeflow, Docker, Argo Workflows

Why Fashion-MNIST?

Fashion-MNIST Dataset Overview

  • Training set: 60,000 images, 28×28 grayscale
  • Test set: 10,000 images across 10 classes
  • Size: ~30MB compressed - small enough to learn with, yet realistic enough to exercise every pattern

Categories:

  • 0: T-shirt/top
  • 1: Trouser
  • 2: Pullover
  • 3: Dress
  • 4: Coat
  • 5: Sandal
  • 6: Shirt
  • 7: Sneaker
  • 8: Bag
  • 9: Ankle boot
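
As a point of reference, the dataset is one function call away in TensorFlow. The minimal loading sketch below is illustrative, not the chapter's pipeline code; the CLASS_NAMES list simply mirrors the category table above.

import tensorflow as tf

# Download Fashion-MNIST (cached locally under ~/.keras/datasets after the first call).
(train_images, train_labels), (test_images, test_labels) = \
    tf.keras.datasets.fashion_mnist.load_data()

CLASS_NAMES = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

print(train_images.shape)            # (60000, 28, 28)
print(test_images.shape)             # (10000, 28, 28)
print(CLASS_NAMES[train_labels[0]])  # human-readable label of the first training image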

1.2. System Components and Architecture

Complete System Architecture:

End-to-End ML System Architecture

  • Data Pipeline: data download → batching & processing → cached dataset, which feeds into model training.
  • Model Training: three parallel training runs (Conv Net, Dense Net, ResNet), followed by model selection of the top model, which deploys to model serving.
  • Model Serving: a load balancer routes requests across model server replicas and aggregates the predictions.

Key System Components:

📊 Data Ingestion
  • Downloads dataset
  • Batches for memory efficiency
  • Caches for multi-epoch training

🤖 Model Training
  • Three parallel training processes
  • Collective communication
  • Automatic model selection

🚀 Model Serving
  • Replicated services
  • Load balancing
  • High availability

🔄 Workflow Orchestration
  • Argo Workflows
  • Step memoization
  • Async execution

Insight

This architecture demonstrates how all the patterns from previous chapters work together in practice. Each component uses specific patterns optimized for its role, while the overall system uses workflow patterns to coordinate everything efficiently.


2. Data Ingestion Component

2.1. The Problem: Memory Constraints and Multi-Epoch Training

Challenge 1: Memory Limitations

Even though Fashion-MNIST is only 30MB, real ML systems need to handle additional processing:

  • Image transformations (resize, normalize, augment)
  • Complex mathematical operations (convolutions, feature extraction)
  • Memory-intensive preprocessing pipelines

Challenge 2: Multi-Epoch Training Inefficiency

Traditional Sequential Training

Epoch 1: 3 hours (reading from disk)

Epoch 2: 3 hours (re-reading same data!)

Epoch 3: 3 hours (re-reading same data!)

Total: 9 hours with repeated I/O overhead

2.2. The Solution: Batching and Caching Patterns

Batching Pattern Implementation

In plain English: Instead of trying to eat an entire pizza at once, you eat it slice by slice. Batching breaks the dataset into manageable chunks that fit in memory.

In technical terms: Batching partitions the dataset into fixed-size chunks processed sequentially, keeping memory usage constant regardless of dataset size.

Why it matters: Enables processing of datasets larger than available RAM, prevents out-of-memory errors, and improves processing flexibility.

Batching Strategy (batch_size=1000)

The original dataset of 60,000 images is split into 60 batches:

  • Batch 1: images 1-1000 → process & train
  • Batch 2: images 1001-2000 → process & train
  • Batch 3: images 2001-3000 → process & train
  • ...
  • Batch 60: images 59,001-60,000 → process & train

Benefits:

  • Memory Efficiency: Only 1/60th of dataset in memory at once
  • Scalability: Can handle datasets larger than available RAM
  • Processing Flexibility: Apply expensive operations per batch
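
To make the batching pattern concrete, here is a minimal tf.data sketch, assuming TensorFlow is installed; batch_size is the only tunable shown. For Fashion-MNIST the arrays fit in memory anyway, but the same pipeline structure applies when the source is files or TFRecords.

import tensorflow as tf

# Load Fashion-MNIST (60,000 training images, 28x28 grayscale).
(train_images, train_labels), _ = tf.keras.datasets.fashion_mnist.load_data()

# Build a pipeline that yields fixed-size batches instead of one giant tensor.
batch_size = 1000
dataset = (
    tf.data.Dataset.from_tensor_slices((train_images, train_labels))
    .map(lambda img, label: (tf.cast(img, tf.float32) / 255.0, label))  # normalize per element
    .batch(batch_size)  # 60,000 / 1,000 = 60 batches per epoch
)

for batch_images, batch_labels in dataset.take(1):
    print(batch_images.shape)  # (1000, 28, 28)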

Caching Pattern Implementation

In plain English: Like a chef's mise en place - prepare ingredients once, then cook multiple dishes faster. Cache processed batches after first use.

In technical terms: Caching stores processed batches in memory after initial disk read and processing, eliminating redundant I/O and computation for subsequent epochs.

Why it matters: The initial prep time pays off exponentially when training for multiple epochs, reducing total training time by 40-50%.

Caching Performance

Epoch 1 (cold start): 💽 read from disk (slow I/O) → ⚙️ process images (expensive) → 💾 train and cache the processed batches (fast memory).
Epoch 2+ (cached): read from the in-memory cache → 🚀 train directly, with no re-reading or re-processing.

            Without Caching                     With Caching
Epoch 1     3 hours (Disk + Process + Train)    3.2 hours (Disk + Process + Train + Cache)
Epoch 2     3 hours (Disk + Process + Train)    45 minutes (Memory + Train)
Epoch 3     3 hours (Disk + Process + Train)    45 minutes (Memory + Train)
Total       9 hours                             5 hours (44% improvement!)

Cache Management Strategy:

class DataCache:
    """Keeps processed batches in memory so later epochs skip disk I/O."""

    def __init__(self, memory_limit_gb=8):
        self.cache = {}                                # batch_id -> processed batch
        self.memory_limit = memory_limit_gb * 1024**3  # cache budget in bytes
        self.current_usage = 0                         # bytes currently cached

    def get_batch(self, batch_id):
        if batch_id in self.cache:
            return self.cache[batch_id]                # Cache hit: no I/O, no re-processing
        batch = self.load_and_process_batch(batch_id)  # Slow path: disk read + preprocessing
        self.cache[batch_id] = batch                   # Cache for future epochs
        return batch
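
One common way to get the same behavior in a TensorFlow input pipeline is tf.data's built-in cache transformation. The sketch below is illustrative rather than the chapter's exact code: epoch 1 pays the I/O and preprocessing cost, and later epochs iterate from memory.

import tensorflow as tf

(train_images, train_labels), _ = tf.keras.datasets.fashion_mnist.load_data()

dataset = (
    tf.data.Dataset.from_tensor_slices((train_images, train_labels))
    .map(lambda img, label: (tf.cast(img, tf.float32) / 255.0, label))
    .batch(1000)
    .cache()   # in-memory cache; pass a filename to spill to disk instead
)

for epoch in range(3):
    for batch_images, batch_labels in dataset:
        pass   # train_step(batch_images, batch_labels) would go here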

Insight

Think of caching like a chef's mise en place: you spend time prepping ingredients once, then cooking multiple dishes becomes much faster. The initial prep time pays off exponentially when serving multiple customers (epochs).

2.3. Exercises

  1. Q: Where do we store the cache?

    A: In memory.

  2. Q: Can we use the batching pattern when the Fashion-MNIST dataset gets large?

    A: Yes - we can reduce batch size to handle larger datasets.


3. Model Training Component

3.1. The Problem: Choosing the Right Distributed Pattern

We have two main distributed training patterns available:

  • Parameter Server Pattern: Good for very large models that don't fit on single machines
  • Collective Communication Pattern: Good for medium-sized models with communication overhead concerns

Fashion-MNIST Model Characteristics:

  • Model Size: ~30MB (fits easily on single machine)
  • Training Data: 60,000 images (moderate size)
  • Complexity: CNN architecture (medium complexity)

3.2. The Solution: Collective Communication for Medium Models

Why Not Parameter Servers?

In plain English: Using parameter servers for small models is like hiring a team of messengers to pass notes between desks in the same room - the coordination overhead is worse than just talking directly.

In technical terms: Parameter servers introduce communication bottlenecks and synchronization delays that exceed computation time for models that fit on single machines.

Why it matters: Choosing the wrong pattern can make training slower, not faster. Match the pattern to your model size and communication requirements.

Parameter Server Challenges

Each worker computes its local gradients (∂L/∂w), pushes them to a central parameter server, and then pulls the updated weights back. The parameter server becomes the bottleneck: for a model this small, the network overhead of pushing and pulling exceeds the computation time itself.

Collective Communication Advantage

In plain English: Like a synchronized swimming team - everyone has the same routine (model), performs their part simultaneously (forward pass), coordinates together (AllReduce), and maintains perfect synchronization (model updates).

In technical terms: All workers maintain full model copies, process different data shards in parallel, and synchronize gradients via efficient AllReduce operations.

Why it matters: Eliminates central bottlenecks, enables peer-to-peer communication, and ensures all workers stay perfectly synchronized.

Collective Communication Setup

  • Workers 1, 2, and 3 each hold a full copy of the model and process their own data shard (shards 1, 2, and 3).
  • AllReduce 🔄 aggregates the gradients across all workers using ring-based communication.
  • After the update, every worker holds the same updated model.

AllReduce Process Detail:

  1. Forward pass (parallel): Worker 1 processes Data[0:20k] → gradients_1, Worker 2 processes Data[20k:40k] → gradients_2, Worker 3 processes Data[40k:60k] → gradients_3.
  2. AllReduce communication: a ring operation aggregates all gradients, so every worker ends up with sum(gradients_1 + gradients_2 + gradients_3).
  3. Model update (synchronized): every worker applies model = model - learning_rate * avg(gradients).

A TensorFlow sketch of this setup follows.
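
In TensorFlow, tf.distribute.MultiWorkerMirroredStrategy implements exactly this pattern: a full model replica per worker, data sharding, and AllReduce gradient aggregation. The sketch below is one plausible setup, not the chapter's exact configuration; the TF_CONFIG shown in the comment is illustrative.

import tensorflow as tf

# In the cluster, each worker pod receives a TF_CONFIG environment variable such as:
#   {"cluster": {"worker": ["worker-0:2222", "worker-1:2222", "worker-2:2222"]},
#    "task": {"type": "worker", "index": 0}}
# (Kubeflow's TFJob typically injects this.) Without TF_CONFIG the strategy
# falls back to a single local worker, which is handy for testing.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Every worker builds the same full model replica.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# model.fit(train_dataset, epochs=3) then shards the data across workers and
# synchronizes gradients each step via AllReduce.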

Three-Model Training Architecture:

  • Training 1 (CNN): workers W1-W3 + AllReduce → 🏆 Model A, 92% accuracy
  • Training 2 (Dense): workers W1-W3 + AllReduce → 📊 Model B, 89% accuracy
  • Training 3 (ResNet): workers W1-W3 + AllReduce → 📈 Model C, 85% accuracy
  • Model selection: choose the best → Model A (a minimal selection sketch follows this list)
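
The selection step itself can be as simple as comparing validation accuracy. The sketch below uses a hypothetical TrainedModel record, not the book's API.

from dataclasses import dataclass

@dataclass
class TrainedModel:
    name: str
    accuracy: float
    artifact_path: str

# Hypothetical results from the three parallel trainings above.
candidates = [
    TrainedModel("cnn", 0.92, "/models/cnn"),
    TrainedModel("dense", 0.89, "/models/dense"),
    TrainedModel("resnet", 0.85, "/models/resnet"),
]

best = max(candidates, key=lambda m: m.accuracy)
print(f"Deploying {best.name} ({best.accuracy:.0%}) from {best.artifact_path}")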

Insight

Collective communication is like a synchronized swimming team: everyone has the same routine (model), performs their part simultaneously (forward pass), coordinates together (AllReduce), and maintains perfect synchronization (model updates).

3.3. Exercises

  1. Q: Why isn't the parameter server pattern a good fit for our model?

    A: There are blocking communications between workers and parameter servers, creating unnecessary overhead for a model that fits on single machines.

  2. Q: Does each worker store different parts of the model when using the collective communication pattern?

    A: No, each worker stores exactly the same copy of the model.


4. Model Serving Component

4.1. The Problem: Single Server Bottlenecks

In plain English: Using a single server for all requests is like having only one checkout line at a busy grocery store - everyone waits while the line gets longer.

In technical terms: Single-server architectures create performance bottlenecks under load, exhibit poor latency characteristics, and represent single points of failure.

Why it matters: Production systems need high availability, low latency, and the ability to handle traffic spikes without degradation.

Single Server Bottleneck

Fifty users (User 1 through User 50) all send requests to a single model server. The server becomes the bottleneck: 47 requests sit waiting in the queue, and response time climbs past 30 seconds.

Performance Issues:

  • Queue Buildup: Requests pile up during peak hours
  • Poor Latency: Users wait 30+ seconds for simple predictions
  • Single Point of Failure: Server down = entire service down
  • Resource Waste: Can't utilize multiple machines

4.2. The Solution: Replicated Services with Load Balancing

In plain English: Multiple checkout lines with someone directing customers to the shortest one. Everyone gets served faster, and if one line closes, others keep working.

In technical terms: Deploy multiple identical service replicas behind a load balancer that distributes incoming requests across replicas using routing algorithms.

Why it matters: Achieves linear scalability (3x servers = 3x throughput), improves availability through redundancy, and reduces latency by distributing load.

Load Balanced Model Serving

The same 50 user requests now hit a round-robin load balancer, which distributes them evenly across three model server replicas (queues of 17, 17, and 16). Response time drops to ~3 seconds per request.

Load Balancing Strategies

🔄 Round Robin (a round-robin dispatch sketch follows this list)
  • ✅ Simple implementation
  • ✅ Even distribution
  • ✅ No state tracking
  • ❌ Doesn't consider load
  • ❌ Assumes equal servers

📊 Least Connections
  • ✅ Considers actual load
  • ✅ Better for varying complexity
  • ✅ Automatic balancing
  • ❌ More complex tracking
  • ❌ Connection monitoring overhead

#️⃣ Hash-Based
  • ✅ Session affinity
  • ✅ Caching benefits per user
  • ✅ Predictable routing
  • ❌ Uneven distribution possible
  • ❌ Failures affect specific users
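
To make the round-robin option concrete, here is a toy dispatcher sketch with hypothetical replica URLs; real deployments would rely on a Kubernetes Service or ingress load balancer rather than application code.

import itertools

# Hypothetical replica endpoints behind the load balancer.
REPLICAS = [
    "http://model-server-0:8501",
    "http://model-server-1:8501",
    "http://model-server-2:8501",
]

# Round robin: cycle through replicas in order, ignoring their current load.
_round_robin = itertools.cycle(REPLICAS)

def pick_replica() -> str:
    return next(_round_robin)

for request_id in range(6):
    print(request_id, "->", pick_replica())  # replicas repeat in 0, 1, 2 order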

Complete Model Serving Flow

  1. 📤 User upload: client sends an image for classification
  2. ⚖️ Load balancer: routes the request to the optimal replica
  3. 🤖 Model server: processes the image and runs inference
  4. 📊 Prediction: returns class, confidence, and metadata

Prediction Result:

{
  "image_id": "user_image_123",
  "predicted_class": "T-shirt/top",
  "confidence": 0.94,
  "processing_time_ms": 45,
  "server_replica": "replica-2"
}
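
For reference, a client call against a TensorFlow Serving replica's REST API might look like the sketch below. The host and model name are hypothetical, and the response fields above are this chapter's illustrative schema rather than TF Serving's default output.

import json
import urllib.request

import numpy as np

# Hypothetical endpoint exposed by the load balancer in front of TF Serving.
URL = "http://model-serving.example.com/v1/models/fashion_mnist:predict"

# One 28x28 grayscale image, normalized to [0, 1].
image = np.random.rand(28, 28, 1).tolist()
payload = json.dumps({"instances": [image]}).encode("utf-8")

request = urllib.request.Request(URL, data=payload,
                                 headers={"Content-Type": "application/json"})
with urllib.request.urlopen(request) as response:
    predictions = json.loads(response.read())["predictions"]
    print("Predicted class index:", int(np.argmax(predictions[0])))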

                 Single Server                       3 Replicated Servers
Throughput       20 requests/second                  60 requests/second (3x)
Latency          15-30 seconds (variable)            2-5 seconds (6x improvement)
Availability     95% (single point of failure)       99.9% (redundancy)
Cost-Efficiency  1x cost for 1x throughput           3x cost for 3x throughput (linear)

Insight

Replicated services are like having multiple checkout lines at a grocery store instead of one. The load balancer is the person directing customers to the shortest line, ensuring everyone gets served quickly and no single cashier gets overwhelmed.

4.3. Exercises

  1. Q: What happens when we don't have a load balancer as part of the model serving system?

    A: We cannot balance or distribute the model serving requests among the replicas.


5. End-to-End Workflow Integration

5.1. The Problems: Redundant Work and Sequential Blocking

Problem 1: Redundant Data Processing

In plain English: Like washing the same dishes every day even though they're already clean. If the dataset hasn't changed, why re-process it?

In technical terms: Without memoization, workflow systems re-execute unchanged data processing steps on every run, wasting computational resources.

Why it matters: For datasets that update monthly but experiments that run daily, you're doing 29x more work than necessary.

Current Workflow (Inefficient)
Day 1: Dataset updated → Full pipeline (6 hours)
Day 2: New model test → Full pipeline (6 hours) ← Same data!
Day 3: Hyperparameter tuning → Full pipeline (6 hours) ← Same data!
...
Day 30: Next dataset update
⚠️ Total wasted time: 29 days × 3 hours of data ingestion = 87 hours of redundant processing

Problem 2: Sequential Execution Blocking

In plain English: Like waiting for the slowest runner to finish a marathon before announcing that the fastest runner won. You knew the winner 6 hours ago!

In technical terms: Sequential workflow execution blocks deployment of ready models while waiting for slower training jobs to complete, delaying time-to-service.

Why it matters: Users could be served by the best model hours earlier if we didn't wait for all training to finish.

Sequential Execution Bottleneck

Training 1: 2 hours → 92% accuracy ✅ Best model!
Training 2: 2 hours → 89% accuracy
Training 3: 6 hours → 85% accuracy ← Blocks everything!

Model Selection: waits for all training to finish → chooses Training 1
Model Serving: waits for model selection → deploys the best model

Problem: The best model is ready at 2 hours, but users wait 10 hours in total!

5.2. The Solutions: Step Memoization and Async Execution

Solution 1: Step Memoization Pattern

In plain English: Smart caching that remembers what work has been done and skips it if nothing changed. Like checking if dishes are clean before washing them.

In technical terms: Step memoization stores metadata about completed workflow steps and conditionally skips re-execution based on input staleness and change thresholds.

Why it matters: Reduces redundant computation by 80-90% for workflows with infrequent source data updates.

Step Memoization Decision Tree

  1. 🔄 Workflow triggered.
  2. Check the step cache (example: last update 15 days ago, freshness threshold 30 days).
  3. Is the cached data still fresh (< 30 days old)?
     • NO → run data ingestion and update the cache.
     • YES → skip data ingestion and use the cached dataset.
  4. Continue with model training.

Cache Implementation Strategy:

from datetime import datetime

class WorkflowCache:
    def __init__(self):
        self.data_cache = {
            'dataset_location': '/cache/fashion_mnist',
            'last_updated': datetime(2024, 1, 15),  # stored as a datetime for date arithmetic
            'record_count': 60000,
            'data_hash': 'abc123...'
        }

    def should_skip_data_ingestion(self):
        # Time-based check: skip if the cached dataset is fresher than 30 days.
        days_since_update = (datetime.now() - self.data_cache['last_updated']).days
        if days_since_update < 30:
            return True

        # Content-based check: skip if the source changed by less than 10%.
        current_source_count = self.get_source_record_count()
        cached_count = self.data_cache['record_count']
        change_percentage = abs(current_source_count - cached_count) / cached_count

        return change_percentage < 0.1  # Skip if < 10% change

Solution 2: Asynchronous Execution Pattern

In plain English: Like a restaurant kitchen working on multiple orders simultaneously and serving dishes as they're ready, not waiting for the entire table's order to complete.

In technical terms: Asynchronous workflow execution runs training jobs in parallel and progressively deploys models as they complete, enabling immediate value delivery.

Why it matters: Reduces time-to-service from 10 hours to 2 hours - users get a good model immediately and better models as they become available.

Async Workflow (Optimized)

⏱️ Time 0: all training starts in parallel (Training 1: 2h, Training 2: 2h, Training 3: 6h).

Time 2h: Training 1 complete → 🏆 92% accuracy ✅ Best model so far!
  • Deploy Model 1 immediately; users get service with 92% accuracy.
  • Training 2 and 3 continue running.

Time 4h: Training 2 complete → 📊 89% accuracy
  • Compare: 92% vs 89% → keep Model 1 deployed (better).
  • Training 3 continues.

Time 8h: Training 3 complete → 📈 85% accuracy
  • Final comparison: 92% vs 89% vs 85% → confirm Model 1 is best.
  • System optimization complete.

User Experience:
  • Hour 0-2: no service (training)
  • Hour 2-8: good service (92% model)
  • Hour 8+: optimal service (confirmed best)

Async Implementation Flow:

from concurrent.futures import as_completed

class AsyncWorkflowManager:
    def execute_training_pipeline(self):
        # Start all training jobs in parallel.
        futures = []
        for model_config in self.model_configs:
            future = self.executor.submit(self.train_model, model_config)
            futures.append(future)

        # Deploy models as soon as they complete, not when all of them finish.
        deployed_models = []
        for future in as_completed(futures):
            model = future.result()

            if self.is_better_than_current(model):
                self.deploy_model(model)
                deployed_models.append(model)
                self.notify_users(f"New model deployed: {model.accuracy}")

        # Final optimization once every training job has completed.
        best_model = max(deployed_models, key=lambda m: m.accuracy)
        self.finalize_deployment(best_model)

                  Sequential Approach       With Async + Memoization
Data Ingestion    3 hours (always runs)     0 hours (cached, skip)
Training Time     10 hours (sequential)     2 hours (first model)
Time-to-Service   13 hours total            2 hours (85% reduction!)
User Experience   No service until 13h      Good service at 2h, progressive improvement

Insight

Async workflow execution is like a restaurant kitchen: instead of completing one dish entirely before starting the next, skilled chefs work on multiple dishes simultaneously, serving completed items immediately while others continue cooking. Customers get fed faster, and the kitchen operates more efficiently.

5.3. Exercises

  1. Q: Which component can benefit the most from step memoization?

    A: The data ingestion component.

  2. Q: How do we tell whether a step's execution can be skipped if its workflow has been triggered to run again?

    A: Using the metadata in the step cache.


6. Summary and Exercises

Key Concepts Mastered

Complete ML System Implementation

  • Workflow Orchestration: Step Memoization (skip redundant work), Async Execution (progressive deployment), Workflow Management (Argo integration)
  • Data Ingestion: Batching Pattern, Caching Pattern, Memory Optimization
  • Model Training: Collective Communication, Parallel Training, Model Selection
  • Model Serving: Replicated Services, Load Balancing, High Availability

Design Decisions Summary

📊 Data Strategy
  • Batching for memory efficiency
  • Caching for multi-epoch optimization
  • 44% training efficiency gain

🤖 Training Strategy
  • Collective communication
  • Medium-sized Fashion-MNIST models
  • Linear scalability with workers

🚀 Serving Strategy
  • Replicated services
  • Load balancing
  • 3x throughput improvement

🔄 Workflow Strategy
  • Step memoization
  • Async execution
  • 85% faster time-to-service

Real-World Production Benefits

  • 85% faster time-to-service through async deployment
  • 44% training efficiency gain through intelligent caching
  • 3x throughput improvement via replicated serving
  • Linear scalability with additional compute resources

Technology Integration Preview

Next Chapter Technologies:

  • TensorFlow: Model building and training
  • Kubernetes: Container orchestration and scaling
  • Kubeflow: ML-specific Kubernetes workflows
  • Docker: Containerization and deployment
  • Argo Workflows: End-to-end pipeline orchestration

Insight

This chapter bridges theory and practice by showing exactly how to combine all distributed ML patterns into a cohesive, production-ready system. Each design choice optimizes for real-world constraints while maintaining the flexibility to scale and evolve.

Practice Exercises

System Design:

  1. Calculate memory requirements for different batch sizes with Fashion-MNIST
  2. Design cache eviction strategy for long-running production systems
  3. Plan load balancing strategy for globally distributed model serving

Performance Optimization:

  4. Estimate training time improvements with different numbers of workers
  5. Calculate optimal cache size vs. processing speed trade-offs
  6. Design a monitoring strategy for async workflow execution

Scalability Planning:

  7. Plan a system architecture for 1M+ daily image classification requests
  8. Design a data pipeline for streaming real-time fashion image updates
  9. Create a disaster recovery strategy for multi-region model serving

