Chapter 7: Project Overview and System Architecture
Designing and building a complete distributed machine learning system
Table of Contents
- Project Overview
- Data Ingestion Component
- Model Training Component
- Model Serving Component
- End-to-End Workflow Integration
- Summary and Exercises
Think of this project like building a modern photo processing lab - you need efficient input stations (data ingestion), multiple specialized processing units (distributed training), high-speed output machines (replicated serving), and smart workflow management to coordinate everything seamlessly.
1. Project Overview
1.1. Project Background
We'll build a complete image classification system that demonstrates all the patterns learned in previous chapters. This isn't just a toy example - it's a production-ready system architecture that scales to handle real-world demands.
Project Scope:
- Dataset: Fashion-MNIST (28×28 grayscale images, 10 clothing categories)
- Task: Multi-class image classification
- Scale: Distributed training and serving on Kubernetes clusters
- Technologies: TensorFlow, Kubernetes, Kubeflow, Docker, Argo Workflows
Why Fashion-MNIST?
- Training set: 60,000 images, 28×28 grayscale
- Test set: 10,000 images
- Classes: 10
- Size: ~30MB compressed - perfect for learning
Categories:
- 0: T-shirt/top
- 1: Trouser
- 2: Pullover
- 3: Dress
- 4: Coat
- 5: Sandal
- 6: Shirt
- 7: Sneaker
- 8: Bag
- 9: Ankle boot
1.2. System Components and Architecture
Complete System Architecture:
[Architecture diagram: the data ingestion component feeds three parallel training jobs - Model Training 1 (Conv Net), Model Training 2 (Dense Net), and Model Training 3 (ResNet) - and a model selection step promotes the top model to serving.]
Key System Components:
- Data ingestion: downloads the dataset, batches it for memory efficiency, and caches batches for multi-epoch training
- Model training: three parallel training processes synchronized via collective communication, with automatic model selection
- Model serving: replicated services with load balancing for high availability
- Workflow orchestration: Argo Workflows with step memoization and asynchronous execution
Insight
This architecture demonstrates how all the patterns from previous chapters work together in practice. Each component uses specific patterns optimized for its role, while the overall system uses workflow patterns to coordinate everything efficiently.
2. Data Ingestion Component
2.1. The Problem: Memory Constraints and Multi-Epoch Training
Challenge 1: Memory Limitations
Even though Fashion-MNIST is only 30MB, real ML systems need to handle additional processing:
- Image transformations (resize, normalize, augment)
- Complex mathematical operations (convolutions, feature extraction)
- Memory-intensive preprocessing pipelines
Challenge 2: Multi-Epoch Training Inefficiency
Epoch 1: 3 hours (reading from disk)
Epoch 2: 3 hours (re-reading same data!)
Epoch 3: 3 hours (re-reading same data!)
Total: 9 hours with repeated I/O overhead
2.2. The Solution: Batching and Caching Patterns
Batching Pattern Implementation
In plain English: Instead of trying to eat an entire pizza at once, you eat it slice by slice. Batching breaks the dataset into manageable chunks that fit in memory.
In technical terms: Batching partitions the dataset into fixed-size chunks processed sequentially, keeping memory usage constant regardless of dataset size.
Why it matters: Enables processing of datasets larger than available RAM, prevents out-of-memory errors, and improves processing flexibility.
Original dataset: 60,000 images → 60 batches of 1,000 images each
Batch 1: images 1-1,000 | Batch 2: images 1,001-2,000 | Batch 3: images 2,001-3,000 | ... | Batch 60
Benefits:
- Memory Efficiency: Only 1/60th of dataset in memory at once
- Scalability: Can handle datasets larger than available RAM
- Processing Flexibility: Apply expensive operations per batch
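Here is a minimal sketch of the batching pattern using tf.data (TensorFlow is part of the project's technology stack; the batch size of 1,000 matches the diagram above). For data larger than RAM you would stream from files such as TFRecords rather than an in-memory array, but the batching call is the same:

```python
import tensorflow as tf

# Load Fashion-MNIST: 60,000 training images of shape 28x28 (uint8).
(x_train, y_train), _ = tf.keras.datasets.fashion_mnist.load_data()

# Build a pipeline that normalizes images lazily and yields batches of 1,000,
# so only one batch at a time needs to be materialized downstream.
dataset = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .map(lambda image, label: (tf.cast(image, tf.float32) / 255.0, label))
    .batch(1000)
)

for images, labels in dataset.take(1):
    print(images.shape)  # (1000, 28, 28)
```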
Caching Pattern Implementation
In plain English: Like a chef's mise en place - prepare ingredients once, then cook multiple dishes faster. Cache processed batches after first use.
In technical terms: Caching stores processed batches in memory after initial disk read and processing, eliminating redundant I/O and computation for subsequent epochs.
Why it matters: The initial prep time pays off exponentially when training for multiple epochs, reducing total training time by 40-50%.
Epoch 1: read from disk (slow I/O) → preprocess (expensive) → store in cache (fast memory)
Epochs 2+: read from cache (fast memory, no reprocessing!)
Cache Management Strategy:
```python
class DataCache:
    def __init__(self, memory_limit_gb=8):
        self.cache = {}                                # batch_id -> processed batch
        self.memory_limit = memory_limit_gb * 1024**3  # cache budget in bytes
        self.current_usage = 0                         # bytes currently cached

    def get_batch(self, batch_id):
        if batch_id in self.cache:
            return self.cache[batch_id]                # Cache hit!
        # Cache miss: pay the disk + preprocessing cost once,
        # then keep the result for future epochs.
        batch = self.load_and_process_batch(batch_id)
        self.cache[batch_id] = batch                   # Cache for future epochs
        return batch

    # load_and_process_batch(batch_id) and eviction once current_usage exceeds
    # memory_limit are left to the full implementation.
```
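A usage sketch, assuming load_and_process_batch is implemented to read and preprocess one batch from disk: the training loop simply asks the cache for each batch, so the first epoch pays the I/O cost and later epochs hit memory:

```python
cache = DataCache(memory_limit_gb=8)

for epoch in range(3):
    for batch_id in range(60):             # 60 batches of 1,000 images each
        batch = cache.get_batch(batch_id)  # disk + preprocessing only in epoch 0
        # train_step(batch)                # hypothetical training step
```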
Insight
Think of caching like a chef's mise en place: you spend time prepping ingredients once, then cooking multiple dishes becomes much faster. The initial prep time pays off exponentially when serving multiple customers (epochs).
2.3. Exercises
- Q: Where do we store the cache?
  A: In memory.
- Q: Can we use the batching pattern when the Fashion-MNIST dataset gets large?
  A: Yes - we can reduce the batch size to handle larger datasets.
3. Model Training Component
3.1. The Problem: Choosing the Right Distributed Pattern
We have two main distributed training patterns available:
- Parameter Server Pattern: Good for very large models that don't fit on single machines
- Collective Communication Pattern: Good for medium-sized models with communication overhead concerns
Fashion-MNIST Model Characteristics:
- Model Size: ~30MB (fits easily on a single machine)
- Training Data: 60,000 images (moderate size)
- Complexity: CNN architecture (medium complexity)
3.2. The Solution: Collective Communication for Medium Models
Why Not Parameter Servers?
In plain English: Using parameter servers for small models is like hiring a team of messengers to pass notes between desks in the same room - the coordination overhead is worse than just talking directly.
In technical terms: Parameter servers introduce communication bottlenecks and synchronization delays that exceed computation time for models that fit on single machines.
Why it matters: Choosing the wrong pattern can make training slower, not faster. Match the pattern to your model size and communication requirements.
[Diagram: every worker sends its gradients (∂L/∂w₁, ∂L/∂w₂, ...) to a central parameter server, which becomes a bottleneck.]
Problem: network overhead exceeds computation time for small models.
Collective Communication Advantage
In plain English: Like a synchronized swimming team - everyone has the same routine (model), performs their part simultaneously (forward pass), coordinates together (AllReduce), and maintains perfect synchronization (model updates).
In technical terms: All workers maintain full model copies, process different data shards in parallel, and synchronize gradients via efficient AllReduce operations.
Why it matters: Eliminates central bottlenecks, enables peer-to-peer communication, and ensures all workers stay perfectly synchronized.
Before synchronization:
- Worker 1: full model copy, data shard 1
- Worker 2: full model copy, data shard 2
- Worker 3: full model copy, data shard 3
Gradient aggregation via ring-based AllReduce communication
After synchronization:
- Worker 1: updated model (identical)
- Worker 2: updated model (identical)
- Worker 3: updated model (identical)
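To make the synchronization step concrete, here is a minimal NumPy sketch of what AllReduce computes: every worker contributes its local gradients, and every worker receives the same averaged result. A real system would use a ring-based collective library (for example NCCL, or TensorFlow's MultiWorkerMirroredStrategy); the gradient values below are made up for illustration:

```python
import numpy as np

# Each worker computes gradients on its own data shard (simulated here).
worker_grads = [
    np.array([0.10, -0.20, 0.05]),  # worker 1
    np.array([0.08, -0.18, 0.07]),  # worker 2
    np.array([0.12, -0.22, 0.03]),  # worker 3
]

# AllReduce: every worker ends up with the same reduced (here: averaged) gradient.
avg_grad = np.mean(worker_grads, axis=0)

# All workers apply the identical update, so their model copies stay in sync.
learning_rate = 0.1
weights = np.zeros(3)
weights -= learning_rate * avg_grad
print(avg_grad, weights)
```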
AllReduce Process Detail:
[Figure: step-by-step ring-based gradient exchange among the workers.]
Three-Model Training Architecture:
[Figure: each of the three models trains with AllReduce synchronization, reaching 92%, 89%, and 85% accuracy respectively.]
Model selection → choose the best → Model A (92% accuracy)
Insight
Collective communication is like a synchronized swimming team: everyone has the same routine (model), performs their part simultaneously (forward pass), coordinates together (AllReduce), and maintains perfect synchronization (model updates).
3.3. Exercises
-
Q: Why isn't the parameter server pattern a good fit for our model?
A: There are blocking communications between workers and parameter servers, creating unnecessary overhead for a model that fits on single machines.
-
Q: Does each worker store different parts of the model when using the collective communication pattern?
A: No, each worker stores exactly the same copy of the model.
4. Model Serving Component
4.1. The Problem: Single Server Bottlenecks
In plain English: Using a single server for all requests is like having only one checkout line at a busy grocery store - everyone waits while the line gets longer.
In technical terms: Single-server architectures create performance bottlenecks under load, exhibit poor latency characteristics, and represent single points of failure.
Why it matters: Production systems need high availability, low latency, and the ability to handle traffic spikes without degradation.
[Diagram: all user requests funnel into a single model server - ⚠️ bottleneck! Queue: 47 requests waiting; response time: 30+ seconds.]
Performance Issues:
- Queue Buildup: Requests pile up during peak hours
- Poor Latency: Users wait 30+ seconds for simple predictions
- Single Point of Failure: Server down = entire service down
- Resource Waste: Can't utilize multiple machines
4.2. The Solution: Replicated Services with Load Balancing
In plain English: Multiple checkout lines with someone directing customers to the shortest one. Everyone gets served faster, and if one line closes, others keep working.
In technical terms: Deploy multiple identical service replicas behind a load balancer that distributes incoming requests across replicas using routing algorithms.
Why it matters: Achieves linear scalability (3x servers = 3x throughput), improves availability through redundancy, and reduces latency by distributing load.
[Diagram: 50 user requests → load balancer (round robin) → three model server replicas with queues of 17, 17, and 16 requests.]
Response time: ~3 seconds per request
Load Balancing Strategies
Round robin:
- ✅ Simple implementation
- ✅ Even distribution
- ✅ No state tracking
- ❌ Doesn't consider load
- ❌ Assumes equal servers
Least connections:
- ✅ Considers actual load
- ✅ Better for requests of varying complexity
- ✅ Automatic balancing
- ❌ More complex tracking
- ❌ Connection monitoring overhead
Hash-based routing (sticky sessions):
- ✅ Session affinity
- ✅ Caching benefits per user
- ✅ Predictable routing
- ❌ Uneven distribution possible
- ❌ Failures affect specific users
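As an illustration of two of these strategies, here is a small Python sketch of round robin and least connections selection over three replicas. In the actual deployment this logic lives in the Kubernetes Service or ingress load balancer rather than in application code; the replica names and counters below are illustrative:

```python
import itertools

replicas = ["replica-1", "replica-2", "replica-3"]
active_requests = {r: 0 for r in replicas}   # in-flight requests per replica

# Round robin: rotate through replicas regardless of current load.
_rotation = itertools.cycle(replicas)

def pick_round_robin():
    return next(_rotation)

# Least connections: route to the replica with the fewest in-flight requests.
def pick_least_connections():
    return min(replicas, key=lambda r: active_requests[r])

for i in range(5):
    target = pick_least_connections()
    active_requests[target] += 1             # request dispatched
    print(f"request {i} -> {target}")
    # active_requests[target] -= 1 would run when the request completes
```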
Complete Model Serving Flow
Prediction Result:
```json
{
  "image_id": "user_image_123",
  "predicted_class": "T-shirt/top",
  "confidence": 0.94,
  "processing_time_ms": 45,
  "server_replica": "replica-2"
}
```
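A sketch of how a serving replica could assemble that payload after running inference. The model_fn argument stands in for whatever inference call the replica actually uses (for example, a Keras model's prediction on a single image); it is an assumption for illustration, not part of the chapter's code:

```python
import time

# Fashion-MNIST class names in label order (see Section 1.1).
CLASS_NAMES = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

def predict(image, model_fn, image_id, replica_name):
    """Run model_fn on one image and build the response payload shown above."""
    start = time.perf_counter()
    probabilities = model_fn(image)                      # 10 class probabilities
    elapsed_ms = int((time.perf_counter() - start) * 1000)
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return {
        "image_id": image_id,
        "predicted_class": CLASS_NAMES[best],
        "confidence": round(float(probabilities[best]), 2),
        "processing_time_ms": elapsed_ms,
        "server_replica": replica_name,
    }
```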
Insight
Replicated services are like having multiple checkout lines at a grocery store instead of one. The load balancer is the person directing customers to the shortest line, ensuring everyone gets served quickly and no single cashier gets overwhelmed.
4.3. Exercises
-
Q: What happens when we don't have a load balancer as part of the model serving system?
A: We cannot balance or distribute the model serving requests among the replicas.
5. End-to-End Workflow Integration
5.1. The Problems: Redundant Work and Sequential Blocking
Problem 1: Redundant Data Processing
In plain English: Like washing the same dishes every day even though they're already clean. If the dataset hasn't changed, why re-process it?
In technical terms: Without memoization, workflow systems re-execute unchanged data processing steps on every run, wasting computational resources.
Why it matters: For datasets that update monthly but experiments that run daily, you're doing 29x more work than necessary.
Total wasted time: 29 × 3 hours = 87 hours of redundant processing
Problem 2: Sequential Execution Blocking
In plain English: Like waiting for the slowest runner to finish a marathon before announcing that the fastest runner won. You knew the winner 6 hours ago!
In technical terms: Sequential workflow execution blocks deployment of ready models while waiting for slower training jobs to complete, delaying time-to-service.
Why it matters: Users could be served by the best model hours earlier if we didn't wait for all training to finish.
Training 3: 6 hours → 85% accuracy ← Blocks everything!
Problem: Best model ready at 2 hours, but users wait 10 hours total!
5.2. The Solutions: Step Memoization and Async Execution
Solution 1: Step Memoization Pattern
In plain English: Smart caching that remembers what work has been done and skips it if nothing changed. Like checking if dishes are clean before washing them.
In technical terms: Step memoization stores metadata about completed workflow steps and conditionally skips re-execution based on input staleness and change thresholds.
Why it matters: Reduces redundant computation by 80-90% for workflows with infrequent source data updates.
Check cache: dataset last updated 15 days ago; staleness threshold is 30 days
- If the cached data is stale (≥ 30 days old): run data ingestion and update the cache
- If the cached data is fresh (< 30 days old): skip data ingestion and use the cached dataset
Cache Implementation Strategy:
```python
from datetime import datetime

class WorkflowCache:
    def __init__(self):
        # Metadata recorded when the data ingestion step last ran.
        self.data_cache = {
            'dataset_location': '/cache/fashion_mnist',
            'last_updated': '2024-01-15',
            'record_count': 60000,
            'data_hash': 'abc123...'
        }

    def get_source_record_count(self):
        # Placeholder: in a real workflow this would query the upstream data source.
        return 60000

    def should_skip_data_ingestion(self):
        # Time-based check: skip if the cached data is recent enough.
        last_updated = datetime.strptime(self.data_cache['last_updated'], '%Y-%m-%d')
        days_since_update = (datetime.now() - last_updated).days
        if days_since_update < 30:
            return True

        # Content-based check: skip if the source changed by less than 10%.
        current_source_count = self.get_source_record_count()
        cached_count = self.data_cache['record_count']
        change_percentage = abs(current_source_count - cached_count) / cached_count
        return change_percentage < 0.1  # Skip if < 10% change
```
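A usage sketch, assuming the workflow engine runs this check each time the pipeline is triggered:

```python
cache = WorkflowCache()

if cache.should_skip_data_ingestion():
    print("Skipping ingestion; reusing", cache.data_cache['dataset_location'])
else:
    print("Running data ingestion and refreshing the cache metadata")
    # run_data_ingestion() and updating cache.data_cache would go here
```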
Solution 2: Asynchronous Execution Pattern
In plain English: Like a restaurant kitchen working on multiple orders simultaneously and serving dishes as they're ready, not waiting for the entire table's order to complete.
In technical terms: Asynchronous workflow execution runs training jobs in parallel and progressively deploys models as they complete, enabling immediate value delivery.
Why it matters: Reduces time-to-service from 10 hours to 2 hours - users get a good model immediately and better models as they become available.
Time 0: All training starts in parallel
User Experience:
Hour 0-2: No service (training)
Hour 2-8: Good service (92% model)
Hour 8+: Optimal service (confirmed best)
Async Implementation Flow:
```python
from concurrent.futures import ThreadPoolExecutor, as_completed

class AsyncWorkflowManager:
    def __init__(self, model_configs):
        self.model_configs = model_configs
        self.executor = ThreadPoolExecutor(max_workers=len(model_configs))

    def execute_training_pipeline(self):
        # Start all training jobs in parallel
        futures = []
        for model_config in self.model_configs:
            future = self.executor.submit(self.train_model, model_config)
            futures.append(future)

        # Deploy models as they complete
        deployed_models = []
        for future in as_completed(futures):
            model = future.result()
            if self.is_better_than_current(model):
                self.deploy_model(model)
                deployed_models.append(model)
                self.notify_users(f"New model deployed: {model.accuracy}")

        # Final optimization when all complete
        best_model = max(deployed_models, key=lambda m: m.accuracy)
        self.finalize_deployment(best_model)

    # train_model, is_better_than_current, deploy_model, notify_users, and
    # finalize_deployment are assumed to be implemented elsewhere.
```
Insight
Async workflow execution is like a restaurant kitchen: instead of completing one dish entirely before starting the next, skilled chefs work on multiple dishes simultaneously, serving completed items immediately while others continue cooking. Customers get fed faster, and the kitchen operates more efficiently.
5.3. Exercises
-
Q: Which component can benefit the most from step memoization?
A: The data ingestion component.
-
Q: How do we tell whether a step's execution can be skipped if its workflow has been triggered to run again?
A: Using the metadata in the step cache.
6. Summary and Exercises
Key Concepts Mastered
Design Decisions Summary
- Data ingestion: batching for memory efficiency and caching for multi-epoch optimization - 44% training efficiency gain
- Model training: collective communication for the medium-sized Fashion-MNIST models - linear scalability with workers
- Model serving: replicated services with load balancing - 3x throughput improvement
- Workflow orchestration: step memoization and async execution - 85% faster time-to-service
Real-World Production Benefits
- 85% faster time-to-service through async deployment
- 44% training efficiency gain through intelligent caching
- 3x throughput improvement via replicated serving
- Linear scalability with additional compute resources
Technology Integration Preview
Next Chapter Technologies:
- TensorFlow: Model building and training
- Kubernetes: Container orchestration and scaling
- Kubeflow: ML-specific Kubernetes workflows
- Docker: Containerization and deployment
- Argo Workflows: End-to-end pipeline orchestration
Insight
This chapter bridges theory and practice by showing exactly how to combine all distributed ML patterns into a cohesive, production-ready system. Each design choice optimizes for real-world constraints while maintaining the flexibility to scale and evolve.
Practice Exercises
System Design:
1. Calculate memory requirements for different batch sizes with Fashion-MNIST
2. Design a cache eviction strategy for long-running production systems
3. Plan a load balancing strategy for globally distributed model serving
Performance Optimization:
4. Estimate training time improvements with different numbers of workers
5. Calculate optimal cache size vs. processing speed trade-offs
6. Design a monitoring strategy for async workflow execution
Scalability Planning:
7. Plan a system architecture for 1M+ daily image classification requests
8. Design a data pipeline for streaming real-time fashion image updates
9. Create a disaster recovery strategy for multi-region model serving