
Chapter 7: Project Overview and System Architecture

Designing and building a complete distributed machine learning system


Table of Contents

  1. Project Overview
  2. Data Ingestion Component
  3. Model Training Component
  4. Model Serving Component
  5. End-to-End Workflow Integration
  6. Summary and Exercises

Think of this project like building a modern photo processing lab - you need efficient input stations (data ingestion), multiple specialized processing units (distributed training), high-speed output machines (replicated serving), and smart workflow management to coordinate everything seamlessly.


1. Project Overview

1.1. Project Background

We'll build a complete image classification system that demonstrates all the patterns learned in previous chapters. This isn't just a toy example - it's a production-ready system architecture that scales to handle real-world demands.

Project Scope:

  • Dataset: Fashion-MNIST (28×28 grayscale images, 10 clothing categories)
  • Task: Multi-class image classification
  • Scale: Distributed training and serving on Kubernetes clusters
  • Technologies: TensorFlow, Kubernetes, Kubeflow, Docker, Argo Workflows

Why Fashion-MNIST?

Fashion-MNIST Dataset Overview

  • Training set: 60,000 images, 28×28 grayscale
  • Test set: 10,000 images across 10 classes
  • Size: ~30MB compressed - small enough to learn with, yet realistic enough to exercise every pattern

Categories:

  • 0: T-shirt/top
  • 1: Trouser
  • 2: Pullover
  • 3: Dress
  • 4: Coat
  • 5: Sandal
  • 6: Shirt
  • 7: Sneaker
  • 8: Bag
  • 9: Ankle boot
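
As a point of reference, the dataset is one function call away in TensorFlow. The minimal loading sketch below is illustrative, not the chapter's pipeline code; the CLASS_NAMES list simply mirrors the category table above.

import tensorflow as tf

# Download Fashion-MNIST (cached locally under ~/.keras/datasets after the first call).
(train_images, train_labels), (test_images, test_labels) = \
    tf.keras.datasets.fashion_mnist.load_data()

CLASS_NAMES = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

print(train_images.shape)            # (60000, 28, 28)
print(test_images.shape)             # (10000, 28, 28)
print(CLASS_NAMES[train_labels[0]])  # human-readable label of the first training image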

1.2. System Components and Architecture

Complete System Architecture:

End-to-End ML System Architecture

  • Data Pipeline: data download → batching & processing → cached dataset, which feeds into model training.
  • Model Training: three parallel training runs (Conv Net, Dense Net, ResNet), followed by model selection of the top model, which deploys to model serving.
  • Model Serving: a load balancer routes requests across model server replicas and aggregates the predictions.

Key System Components:

📊 Data Ingestion
  • Downloads dataset
  • Batches for memory efficiency
  • Caches for multi-epoch training

🤖 Model Training
  • Three parallel training processes
  • Collective communication
  • Automatic model selection

🚀 Model Serving
  • Replicated services
  • Load balancing
  • High availability

🔄 Workflow Orchestration
  • Argo Workflows
  • Step memoization
  • Async execution

Insight

This architecture demonstrates how all the patterns from previous chapters work together in practice. Each component uses specific patterns optimized for its role, while the overall system uses workflow patterns to coordinate everything efficiently.


2. Data Ingestion Component

2.1. The Problem: Memory Constraints and Multi-Epoch Training

Challenge 1: Memory Limitations

Even though Fashion-MNIST is only 30MB, real ML systems need to handle additional processing:

  • Image transformations (resize, normalize, augment)
  • Complex mathematical operations (convolutions, feature extraction)
  • Memory-intensive preprocessing pipelines

Challenge 2: Multi-Epoch Training Inefficiency

Traditional Sequential Training

Epoch 1: 3 hours (reading from disk)

Epoch 2: 3 hours (re-reading same data!)

Epoch 3: 3 hours (re-reading same data!)

Total: 9 hours with repeated I/O overhead

2.2. The Solution: Batching and Caching Patterns

Batching Pattern Implementation

In plain English: Instead of trying to eat an entire pizza at once, you eat it slice by slice. Batching breaks the dataset into manageable chunks that fit in memory.

In technical terms: Batching partitions the dataset into fixed-size chunks processed sequentially, keeping memory usage constant regardless of dataset size.

Why it matters: Enables processing of datasets larger than available RAM, prevents out-of-memory errors, and improves processing flexibility.

Batching Strategy (batch_size=1000)

The original dataset of 60,000 images is split into 60 batches:

  • Batch 1: images 1-1000 → process & train
  • Batch 2: images 1001-2000 → process & train
  • Batch 3: images 2001-3000 → process & train
  • ...
  • Batch 60: images 59,001-60,000 → process & train

Benefits:

  • Memory Efficiency: Only 1/60th of dataset in memory at once
  • Scalability: Can handle datasets larger than available RAM
  • Processing Flexibility: Apply expensive operations per batch
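
To make the batching pattern concrete, here is a minimal tf.data sketch, assuming TensorFlow is installed; batch_size is the only tunable shown. For Fashion-MNIST the arrays fit in memory anyway, but the same pipeline structure applies when the source is files or TFRecords.

import tensorflow as tf

# Load Fashion-MNIST (60,000 training images, 28x28 grayscale).
(train_images, train_labels), _ = tf.keras.datasets.fashion_mnist.load_data()

# Build a pipeline that yields fixed-size batches instead of one giant tensor.
batch_size = 1000
dataset = (
    tf.data.Dataset.from_tensor_slices((train_images, train_labels))
    .map(lambda img, label: (tf.cast(img, tf.float32) / 255.0, label))  # normalize per element
    .batch(batch_size)  # 60,000 / 1,000 = 60 batches per epoch
)

for batch_images, batch_labels in dataset.take(1):
    print(batch_images.shape)  # (1000, 28, 28)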

Caching Pattern Implementation

In plain English: Like a chef's mise en place - prepare ingredients once, then cook multiple dishes faster. Cache processed batches after first use.

In technical terms: Caching stores processed batches in memory after initial disk read and processing, eliminating redundant I/O and computation for subsequent epochs.

Why it matters: The initial prep time pays off exponentially when training for multiple epochs, reducing total training time by 40-50%.

Caching Performance

Epoch 1 (cold start): 💽 read from disk (slow I/O) → ⚙️ process images (expensive) → 💾 train and cache the processed batches (fast memory).
Epoch 2+ (cached): read from the in-memory cache → 🚀 train directly, with no re-reading or re-processing.

            Without Caching                     With Caching
Epoch 1     3 hours (Disk + Process + Train)    3.2 hours (Disk + Process + Train + Cache)
Epoch 2     3 hours (Disk + Process + Train)    45 minutes (Memory + Train)
Epoch 3     3 hours (Disk + Process + Train)    45 minutes (Memory + Train)
Total       9 hours                             5 hours (44% improvement!)

Cache Management Strategy:

class DataCache:
    """Keeps processed batches in memory so later epochs skip disk I/O."""

    def __init__(self, memory_limit_gb=8):
        self.cache = {}                                # batch_id -> processed batch
        self.memory_limit = memory_limit_gb * 1024**3  # cache budget in bytes
        self.current_usage = 0                         # bytes currently cached

    def get_batch(self, batch_id):
        if batch_id in self.cache:
            return self.cache[batch_id]                # Cache hit: no I/O, no re-processing
        batch = self.load_and_process_batch(batch_id)  # Slow path: disk read + preprocessing
        self.cache[batch_id] = batch                   # Cache for future epochs
        return batch
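
One common way to get the same behavior in a TensorFlow input pipeline is tf.data's built-in cache transformation. The sketch below is illustrative rather than the chapter's exact code: epoch 1 pays the I/O and preprocessing cost, and later epochs iterate from memory.

import tensorflow as tf

(train_images, train_labels), _ = tf.keras.datasets.fashion_mnist.load_data()

dataset = (
    tf.data.Dataset.from_tensor_slices((train_images, train_labels))
    .map(lambda img, label: (tf.cast(img, tf.float32) / 255.0, label))
    .batch(1000)
    .cache()   # in-memory cache; pass a filename to spill to disk instead
)

for epoch in range(3):
    for batch_images, batch_labels in dataset:
        pass   # train_step(batch_images, batch_labels) would go here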

Insight

Think of caching like a chef's mise en place: you spend time prepping ingredients once, then cooking multiple dishes becomes much faster. The initial prep time pays off exponentially when serving multiple customers (epochs).

2.3. Exercises

  1. Q: Where do we store the cache?

    A: In memory.

  2. Q: Can we use the batching pattern when the Fashion-MNIST dataset gets large?

    A: Yes - we can reduce batch size to handle larger datasets.


3. Model Training Component

3.1. The Problem: Choosing the Right Distributed Pattern

We have two main distributed training patterns available:

  • Parameter Server Pattern: Good for very large models that don't fit on single machines
  • Collective Communication Pattern: Good for medium-sized models with communication overhead concerns

Fashion-MNIST Model Characteristics:

  • Model Size: ~30MB (fits easily on single machine)
  • Training Data: 60,000 images (moderate size)
  • Complexity: CNN architecture (medium complexity)

3.2. The Solution: Collective Communication for Medium Models

Why Not Parameter Servers?

In plain English: Using parameter servers for small models is like hiring a team of messengers to pass notes between desks in the same room - the coordination overhead is worse than just talking directly.

In technical terms: Parameter servers introduce communication bottlenecks and synchronization delays that exceed computation time for models that fit on single machines.

Why it matters: Choosing the wrong pattern can make training slower, not faster. Match the pattern to your model size and communication requirements.

Parameter Server Challenges

Each worker computes its local gradients (∂L/∂w), pushes them to a central parameter server, and then pulls the updated weights back. The parameter server becomes the bottleneck: for a model this small, the network overhead of pushing and pulling exceeds the computation time itself.

Collective Communication Advantage

In plain English: Like a synchronized swimming team - everyone has the same routine (model), performs their part simultaneously (forward pass), coordinates together (AllReduce), and maintains perfect synchronization (model updates).

In technical terms: All workers maintain full model copies, process different data shards in parallel, and synchronize gradients via efficient AllReduce operations.

Why it matters: Eliminates central bottlenecks, enables peer-to-peer communication, and ensures all workers stay perfectly synchronized.

Collective Communication Setup

  • Workers 1, 2, and 3 each hold a full copy of the model and process their own data shard (shards 1, 2, and 3).
  • AllReduce 🔄 aggregates the gradients across all workers using ring-based communication.
  • After the update, every worker holds the same updated model.

AllReduce Process Detail:

  1. Forward pass (parallel): Worker 1 processes Data[0:20k] → gradients_1, Worker 2 processes Data[20k:40k] → gradients_2, Worker 3 processes Data[40k:60k] → gradients_3.
  2. AllReduce communication: a ring operation aggregates all gradients, so every worker ends up with sum(gradients_1 + gradients_2 + gradients_3).
  3. Model update (synchronized): every worker applies model = model - learning_rate * avg(gradients).

A TensorFlow sketch of this setup follows.
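
In TensorFlow, tf.distribute.MultiWorkerMirroredStrategy implements exactly this pattern: a full model replica per worker, data sharding, and AllReduce gradient aggregation. The sketch below is one plausible setup, not the chapter's exact configuration; the TF_CONFIG shown in the comment is illustrative.

import tensorflow as tf

# In the cluster, each worker pod receives a TF_CONFIG environment variable such as:
#   {"cluster": {"worker": ["worker-0:2222", "worker-1:2222", "worker-2:2222"]},
#    "task": {"type": "worker", "index": 0}}
# (Kubeflow's TFJob typically injects this.) Without TF_CONFIG the strategy
# falls back to a single local worker, which is handy for testing.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Every worker builds the same full model replica.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# model.fit(train_dataset, epochs=3) then shards the data across workers and
# synchronizes gradients each step via AllReduce.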

Three-Model Training Architecture:

  • Training 1 (CNN): workers W1-W3 + AllReduce → 🏆 Model A, 92% accuracy
  • Training 2 (Dense): workers W1-W3 + AllReduce → 📊 Model B, 89% accuracy
  • Training 3 (ResNet): workers W1-W3 + AllReduce → 📈 Model C, 85% accuracy
  • Model selection: choose the best → Model A (a minimal selection sketch follows this list)
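
The selection step itself can be as simple as comparing validation accuracy. The sketch below uses a hypothetical TrainedModel record, not the book's API.

from dataclasses import dataclass

@dataclass
class TrainedModel:
    name: str
    accuracy: float
    artifact_path: str

# Hypothetical results from the three parallel trainings above.
candidates = [
    TrainedModel("cnn", 0.92, "/models/cnn"),
    TrainedModel("dense", 0.89, "/models/dense"),
    TrainedModel("resnet", 0.85, "/models/resnet"),
]

best = max(candidates, key=lambda m: m.accuracy)
print(f"Deploying {best.name} ({best.accuracy:.0%}) from {best.artifact_path}")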

Insight

Collective communication is like a synchronized swimming team: everyone has the same routine (model), performs their part simultaneously (forward pass), coordinates together (AllReduce), and maintains perfect synchronization (model updates).

3.3. Exercises

  1. Q: Why isn't the parameter server pattern a good fit for our model?

    A: There are blocking communications between workers and parameter servers, creating unnecessary overhead for a model that fits on single machines.

  2. Q: Does each worker store different parts of the model when using the collective communication pattern?

    A: No, each worker stores exactly the same copy of the model.


4. Model Serving Component

4.1. The Problem: Single Server Bottlenecks

In plain English: Using a single server for all requests is like having only one checkout line at a busy grocery store - everyone waits while the line gets longer.

In technical terms: Single-server architectures create performance bottlenecks under load, exhibit poor latency characteristics, and represent single points of failure.

Why it matters: Production systems need high availability, low latency, and the ability to handle traffic spikes without degradation.

Single Server Bottleneck

Fifty users (User 1 through User 50) all send requests to a single model server. The server becomes the bottleneck: 47 requests sit waiting in the queue, and response time climbs past 30 seconds.

Performance Issues:

  • Queue Buildup: Requests pile up during peak hours
  • Poor Latency: Users wait 30+ seconds for simple predictions
  • Single Point of Failure: Server down = entire service down
  • Resource Waste: Can't utilize multiple machines

4.2. The Solution: Replicated Services with Load Balancing

In plain English: Multiple checkout lines with someone directing customers to the shortest one. Everyone gets served faster, and if one line closes, others keep working.

In technical terms: Deploy multiple identical service replicas behind a load balancer that distributes incoming requests across replicas using routing algorithms.

Why it matters: Achieves linear scalability (3x servers = 3x throughput), improves availability through redundancy, and reduces latency by distributing load.

Load Balanced Model Serving

The same 50 user requests now hit a round-robin load balancer, which distributes them evenly across three model server replicas (queues of 17, 17, and 16). Response time drops to ~3 seconds per request.

Load Balancing Strategies

🔄 Round Robin (a round-robin dispatch sketch follows this list)
  • ✅ Simple implementation
  • ✅ Even distribution
  • ✅ No state tracking
  • ❌ Doesn't consider load
  • ❌ Assumes equal servers

📊 Least Connections
  • ✅ Considers actual load
  • ✅ Better for varying complexity
  • ✅ Automatic balancing
  • ❌ More complex tracking
  • ❌ Connection monitoring overhead

#️⃣ Hash-Based
  • ✅ Session affinity
  • ✅ Caching benefits per user
  • ✅ Predictable routing
  • ❌ Uneven distribution possible
  • ❌ Failures affect specific users
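
To make the round-robin option concrete, here is a toy dispatcher sketch with hypothetical replica URLs; real deployments would rely on a Kubernetes Service or ingress load balancer rather than application code.

import itertools

# Hypothetical replica endpoints behind the load balancer.
REPLICAS = [
    "http://model-server-0:8501",
    "http://model-server-1:8501",
    "http://model-server-2:8501",
]

# Round robin: cycle through replicas in order, ignoring their current load.
_round_robin = itertools.cycle(REPLICAS)

def pick_replica() -> str:
    return next(_round_robin)

for request_id in range(6):
    print(request_id, "->", pick_replica())  # replicas repeat in 0, 1, 2 order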

Complete Model Serving Flow

  1. 📤 User upload: client sends an image for classification
  2. ⚖️ Load balancer: routes the request to the optimal replica
  3. 🤖 Model server: processes the image and runs inference
  4. 📊 Prediction: returns class, confidence, and metadata

Prediction Result:

{
  "image_id": "user_image_123",
  "predicted_class": "T-shirt/top",
  "confidence": 0.94,
  "processing_time_ms": 45,
  "server_replica": "replica-2"
}
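
For reference, a client call against a TensorFlow Serving replica's REST API might look like the sketch below. The host and model name are hypothetical, and the response fields above are this chapter's illustrative schema rather than TF Serving's default output.

import json
import urllib.request

import numpy as np

# Hypothetical endpoint exposed by the load balancer in front of TF Serving.
URL = "http://model-serving.example.com/v1/models/fashion_mnist:predict"

# One 28x28 grayscale image, normalized to [0, 1].
image = np.random.rand(28, 28, 1).tolist()
payload = json.dumps({"instances": [image]}).encode("utf-8")

request = urllib.request.Request(URL, data=payload,
                                 headers={"Content-Type": "application/json"})
with urllib.request.urlopen(request) as response:
    predictions = json.loads(response.read())["predictions"]
    print("Predicted class index:", int(np.argmax(predictions[0])))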

                 Single Server                       3 Replicated Servers
Throughput       20 requests/second                  60 requests/second (3x)
Latency          15-30 seconds (variable)            2-5 seconds (6x improvement)
Availability     95% (single point of failure)       99.9% (redundancy)
Cost-Efficiency  1x cost for 1x throughput           3x cost for 3x throughput (linear)

Insight

Replicated services are like having multiple checkout lines at a grocery store instead of one. The load balancer is the person directing customers to the shortest line, ensuring everyone gets served quickly and no single cashier gets overwhelmed.

4.3. Exercises

  1. Q: What happens when we don't have a load balancer as part of the model serving system?

    A: We cannot balance or distribute the model serving requests among the replicas.


5. End-to-End Workflow Integration

5.1. The Problems: Redundant Work and Sequential Blocking

Problem 1: Redundant Data Processing

In plain English: Like washing the same dishes every day even though they're already clean. If the dataset hasn't changed, why re-process it?

In technical terms: Without memoization, workflow systems re-execute unchanged data processing steps on every run, wasting computational resources.

Why it matters: For datasets that update monthly but experiments that run daily, you're doing 29x more work than necessary.

Current Workflow (Inefficient)
Day 1: Dataset updated → Full pipeline (6 hours)
Day 2: New model test → Full pipeline (6 hours) ← Same data!
Day 3: Hyperparameter tuning → Full pipeline (6 hours) ← Same data!
...
Day 30: Next dataset update
⚠️ Total wasted time: 29 days × 3 hours of data ingestion = 87 hours of redundant processing

Problem 2: Sequential Execution Blocking

In plain English: Like waiting for the slowest runner to finish a marathon before announcing that the fastest runner won. You knew the winner 6 hours ago!

In technical terms: Sequential workflow execution blocks deployment of ready models while waiting for slower training jobs to complete, delaying time-to-service.

Why it matters: Users could be served by the best model hours earlier if we didn't wait for all training to finish.

Sequential Execution Bottleneck

Training 1: 2 hours → 92% accuracy ✅ Best model!
Training 2: 2 hours → 89% accuracy
Training 3: 6 hours → 85% accuracy ← Blocks everything!

Model Selection: waits for all training to finish → chooses Training 1
Model Serving: waits for model selection → deploys the best model

Problem: The best model is ready at 2 hours, but users wait 10 hours in total!

5.2. The Solutions: Step Memoization and Async Execution

Solution 1: Step Memoization Pattern

In plain English: Smart caching that remembers what work has been done and skips it if nothing changed. Like checking if dishes are clean before washing them.

In technical terms: Step memoization stores metadata about completed workflow steps and conditionally skips re-execution based on input staleness and change thresholds.

Why it matters: Reduces redundant computation by 80-90% for workflows with infrequent source data updates.

Step Memoization Decision Tree

  1. 🔄 Workflow triggered.
  2. Check the step cache (example: last update 15 days ago, freshness threshold 30 days).
  3. Is the cached data still fresh (< 30 days old)?
     • NO → run data ingestion and update the cache.
     • YES → skip data ingestion and use the cached dataset.
  4. Continue with model training.

Cache Implementation Strategy:

from datetime import datetime

class WorkflowCache:
    def __init__(self):
        self.data_cache = {
            'dataset_location': '/cache/fashion_mnist',
            'last_updated': datetime(2024, 1, 15),  # stored as a datetime for date arithmetic
            'record_count': 60000,
            'data_hash': 'abc123...'
        }

    def should_skip_data_ingestion(self):
        # Time-based check: skip if the cached dataset is fresher than 30 days.
        days_since_update = (datetime.now() - self.data_cache['last_updated']).days
        if days_since_update < 30:
            return True

        # Content-based check: skip if the source changed by less than 10%.
        current_source_count = self.get_source_record_count()
        cached_count = self.data_cache['record_count']
        change_percentage = abs(current_source_count - cached_count) / cached_count

        return change_percentage < 0.1  # Skip if < 10% change

Solution 2: Asynchronous Execution Pattern

In plain English: Like a restaurant kitchen working on multiple orders simultaneously and serving dishes as they're ready, not waiting for the entire table's order to complete.

In technical terms: Asynchronous workflow execution runs training jobs in parallel and progressively deploys models as they complete, enabling immediate value delivery.

Why it matters: Reduces time-to-service from 10 hours to 2 hours - users get a good model immediately and better models as they become available.

Async Workflow (Optimized)

⏱️ Time 0: all training starts in parallel (Training 1: 2h, Training 2: 2h, Training 3: 6h).

Time 2h: Training 1 complete → 🏆 92% accuracy ✅ Best model so far!
  • Deploy Model 1 immediately; users get service with 92% accuracy.
  • Training 2 and 3 continue running.

Time 4h: Training 2 complete → 📊 89% accuracy
  • Compare: 92% vs 89% → keep Model 1 deployed (better).
  • Training 3 continues.

Time 8h: Training 3 complete → 📈 85% accuracy
  • Final comparison: 92% vs 89% vs 85% → confirm Model 1 is best.
  • System optimization complete.

User Experience:
  • Hour 0-2: no service (training)
  • Hour 2-8: good service (92% model)
  • Hour 8+: optimal service (confirmed best)

Async Implementation Flow:

from concurrent.futures import as_completed

class AsyncWorkflowManager:
    def execute_training_pipeline(self):
        # Start all training jobs in parallel.
        futures = []
        for model_config in self.model_configs:
            future = self.executor.submit(self.train_model, model_config)
            futures.append(future)

        # Deploy models as soon as they complete, not when all of them finish.
        deployed_models = []
        for future in as_completed(futures):
            model = future.result()

            if self.is_better_than_current(model):
                self.deploy_model(model)
                deployed_models.append(model)
                self.notify_users(f"New model deployed: {model.accuracy}")

        # Final optimization once every training job has completed.
        best_model = max(deployed_models, key=lambda m: m.accuracy)
        self.finalize_deployment(best_model)

                  Sequential Approach       With Async + Memoization
Data Ingestion    3 hours (always runs)     0 hours (cached, skip)
Training Time     10 hours (sequential)     2 hours (first model)
Time-to-Service   13 hours total            2 hours (85% reduction!)
User Experience   No service until 13h      Good service at 2h, progressive improvement

Insight

Async workflow execution is like a restaurant kitchen: instead of completing one dish entirely before starting the next, skilled chefs work on multiple dishes simultaneously, serving completed items immediately while others continue cooking. Customers get fed faster, and the kitchen operates more efficiently.

5.3. Exercises

  1. Q: Which component can benefit the most from step memoization?

    A: The data ingestion component.

  2. Q: How do we tell whether a step's execution can be skipped if its workflow has been triggered to run again?

    A: Using the metadata in the step cache.


6. Summary and Exercises

Key Concepts Mastered

Complete ML System Implementation

  • Workflow Orchestration: Step Memoization (skip redundant work), Async Execution (progressive deployment), Workflow Management (Argo integration)
  • Data Ingestion: Batching Pattern, Caching Pattern, Memory Optimization
  • Model Training: Collective Communication, Parallel Training, Model Selection
  • Model Serving: Replicated Services, Load Balancing, High Availability

Design Decisions Summary

📊 Data Strategy
  • Batching for memory efficiency
  • Caching for multi-epoch optimization
  • 44% training efficiency gain

🤖 Training Strategy
  • Collective communication
  • Medium-sized Fashion-MNIST models
  • Linear scalability with workers

🚀 Serving Strategy
  • Replicated services
  • Load balancing
  • 3x throughput improvement

🔄 Workflow Strategy
  • Step memoization
  • Async execution
  • 85% faster time-to-service

Real-World Production Benefits

  • 85% faster time-to-service through async deployment
  • 44% training efficiency gain through intelligent caching
  • 3x throughput improvement via replicated serving
  • Linear scalability with additional compute resources

Technology Integration Preview

Next Chapter Technologies:

  • TensorFlow: Model building and training
  • Kubernetes: Container orchestration and scaling
  • Kubeflow: ML-specific Kubernetes workflows
  • Docker: Containerization and deployment
  • Argo Workflows: End-to-end pipeline orchestration

Insight

This chapter bridges theory and practice by showing exactly how to combine all distributed ML patterns into a cohesive, production-ready system. Each design choice optimizes for real-world constraints while maintaining the flexibility to scale and evolve.

Practice Exercises

System Design:

  1. Calculate memory requirements for different batch sizes with Fashion-MNIST
  2. Design cache eviction strategy for long-running production systems
  3. Plan load balancing strategy for globally distributed model serving

Performance Optimization:

  4. Estimate training time improvements with different numbers of workers
  5. Calculate optimal cache size vs. processing speed trade-offs
  6. Design a monitoring strategy for async workflow execution

Scalability Planning:

  7. Plan a system architecture for 1M+ daily image classification requests
  8. Design a data pipeline for streaming real-time fashion image updates
  9. Create a disaster recovery strategy for multi-region model serving

