Chapter 6: Operation Patterns
Managing and optimizing distributed machine learning systems in production
Table of Contents
- What are Operations in Machine Learning Systems?
- Scheduling Patterns: Assigning Resources Effectively
- Metadata Pattern: Handling Failures Appropriately
- Summary and Exercises
Think of ML operations like managing a busy airport — you need intelligent scheduling to manage runway traffic (resource allocation), comprehensive monitoring to track all flights (metadata collection), and smart protocols to handle delays and cancellations (failure recovery) while keeping passengers satisfied.
1. What are Operations in Machine Learning Systems?
Operations in ML systems go beyond managing individual components like data ingestion or model training. They focus on cross-cutting concerns that affect the entire system: resource management, failure handling, monitoring, and optimization.
1.1. Beyond Individual Components
Unlike patterns specific to individual components, operational patterns address challenges that span multiple steps in ML workflows. Think of it as the difference between optimizing individual instruments in an orchestra versus conducting the entire performance.
1.2. The Challenge of Black Box Systems
Problem Scenario: [Figure: a multi-step ML workflow in which some steps succeed and others fail, with no indication of why.]
The Challenge: When failures occur, we can only see that something failed, not why it failed or how to fix it efficiently.
Insight
Operations patterns are like installing a comprehensive monitoring and control system in your production facility. Instead of just knowing "something broke," you get detailed diagnostics about what, why, and how to fix it optimally.
2. Scheduling Patterns: Assigning Resources Effectively
2.1. The Problem: Resource Competition and Blocking
Scenario: Multiple data scientists sharing a cluster for model training.
User Stories:
- User A: Training fraud detection models
- User B: Building condition monitoring for industrial equipment
- User C: Developing recommendation systems
Problem with Simple First-Come-First-Served Scheduling: jobs run strictly in submission order, so a long-running job at the front of the queue blocks everything behind it, regardless of job size or urgency.
Additional Problems:
- Resource Competition: Users wake up at 3 AM to submit jobs when fewer people are online
- Distributed Training Blocking: If only 3 out of 4 required workers are available, the job can't start
- Resource Waste: Allocated but unused workers sit idle
2.2. The Solution: Intelligent Scheduling Strategies
Fair-Share Scheduling
Core Principle: Divide computational resources equally among users or groups.
Simple Fair-Share Example (4 Users): each user receives 25% of the cluster.
What if User B starts 2 processes?
User A: 25%  └─ Process A1: 25%
User B: 25%  ├─ Process B1: 12.5%
             └─ Process B2: 12.5%
User C: 25%  └─ Process C1: 25%
User D: 25%  └─ Process D1: 25%
Hierarchical Fair-Share (Groups + Users): each of the three groups receives 33.3% of the cluster, which is then split equally among that group's users:
Group 1: 33.3% ÷ 3 users = 11.1% per user
Group 2: 33.3% ÷ 2 users = 16.7% per user
Group 3: 33.3% ÷ 4 users = 8.3% per user
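A minimal sketch of how these hierarchical quotas could be computed; the group and user names are illustrative and not tied to any particular scheduler.

def hierarchical_fair_share(groups):
    """Split the cluster equally among groups, then equally among each group's users.

    `groups` maps a group name to a list of user names (illustrative structure).
    Returns a dict mapping each user to their fraction of total cluster resources.
    """
    shares = {}
    group_share = 1.0 / len(groups)            # each group gets an equal slice
    for group, users in groups.items():
        user_share = group_share / len(users)  # split the group's slice equally
        for user in users:
            shares[user] = user_share
    return shares

# Example matching the numbers above: groups of 3, 2, and 4 users.
groups = {
    "group1": ["u1", "u2", "u3"],
    "group2": ["u4", "u5"],
    "group3": ["u6", "u7", "u8", "u9"],
}
for user, share in hierarchical_fair_share(groups).items():
    print(f"{user}: {share:.1%}")  # 11.1%, 16.7%, or 8.3%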
Insight
Fair-share scheduling is like a restaurant with multiple dining rooms: each room gets equal space, and within each room, tables are shared equally among diners. This prevents any single group from monopolizing the entire restaurant.
Priority Scheduling
The Deadlock Problem:
- Admin (User 1), 11.1% CPU: a normal job is running; the admin's maintenance job is WAITING behind it.
- User 2, 11.1% CPU: Job 1 is RUNNING but depends on Job 2.
- User 3, 11.1% CPU: Job 2 is STUCK on a database issue and needs a restart.
DEADLOCK: Job 1 waits for Job 2 → Job 2 is stuck → the admin job that could restart the database waits behind Job 1 → no progress!
Priority Scheduling Solution: assign each job a priority (e.g., HIGH, MEDIUM, LOW) and run higher-priority jobs first, preempting lower-priority jobs when necessary. [Figure: a job queue reordered by priority: HIGH, HIGH, MEDIUM, LOW.]
Priority Scheduling with Preemption (example timeline, time units 0–20):
- Job A (HIGH): runs from time 5 to 8
- Job B (LOW): runs 0–5, 8–12, and 16 onward; preempted twice
- Job C (HIGH): never submitted in this window
- Job D (MEDIUM): runs from time 12 to 16
Time 0: Job B (LOW) starts running
Time 5: Job A (HIGH) submitted → Job B preempted → Job A starts
Time 8: Job A completes → Job B resumes
Time 12: Job D (MEDIUM) submitted → Job B preempted → Job D starts
Time 16: Job D completes → Job B resumes
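A minimal simulation of preemptive priority scheduling that reproduces the timeline above; the job names, priority mapping, and durations are illustrative assumptions.

import heapq

# Lower number = higher priority (illustrative mapping for this sketch).
PRIORITY = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}

def simulate(submissions, total_time):
    """Simulate preemptive priority scheduling, one job running per time unit.

    `submissions` maps a submission time to (job_name, priority, duration).
    Returns a string showing which job ran at each time step.
    """
    ready = []  # heap of (priority, submit_time, name, remaining_work)
    timeline = []
    for t in range(total_time):
        if t in submissions:
            name, prio, duration = submissions[t]
            heapq.heappush(ready, (PRIORITY[prio], t, name, duration))
        if ready:
            prio, submitted, name, remaining = heapq.heappop(ready)
            timeline.append(name)
            if remaining > 1:  # unfinished: requeue with one less unit of work
                heapq.heappush(ready, (prio, submitted, name, remaining - 1))
        else:
            timeline.append("-")
    return "".join(timeline)

# Matches the timeline above: B (LOW) at time 0, A (HIGH) at 5, D (MEDIUM) at 12.
submissions = {0: ("B", "LOW", 13), 5: ("A", "HIGH", 3), 12: ("D", "MEDIUM", 4)}
print(simulate(submissions, 20))  # BBBBBAAABBBBDDDDBBBB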
Gang Scheduling
The Problem: Distributed Training Coordination
Without Gang Scheduling (Resource Waste): some workers are READY (e.g., a worker that already has gradient v1) while others are BLOCKED waiting on missing gradients (v0, v2). The ready workers sit idle, wasting their allocated resources, because the collective communication cannot proceed until every worker participates.
With Gang Scheduling (Coordinated Start): no worker starts until every worker in the group is ready; once all of them report READY, they all start together.
Gang Scheduling Decision Process (a code sketch follows this list):
- Check if ALL workers are ready
- If not, don't start ANY workers
- Wait until network stabilizes
- Start all workers simultaneously
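A minimal sketch of this all-or-nothing start decision, assuming a simple in-process barrier; the worker IDs and gang size are illustrative.

import threading

class GangBarrier:
    """Either every worker in the gang starts, or none of them do."""

    def __init__(self, gang_size):
        self.gang_size = gang_size
        self.ready = set()
        self.lock = threading.Lock()
        self.all_ready = threading.Event()

    def report_ready(self, worker_id):
        # A worker calls this once it is scheduled and its network links are up.
        with self.lock:
            self.ready.add(worker_id)
            if len(self.ready) == self.gang_size:
                self.all_ready.set()  # the last worker releases the whole gang

    def wait_for_gang(self, timeout=None):
        # Block until ALL workers are ready; do no training work before this returns.
        return self.all_ready.wait(timeout)

# Usage: each of 4 workers calls report_ready(i) and then wait_for_gang() before training.
barrier = GangBarrier(gang_size=4)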
Insight
Gang scheduling is like coordinating a synchronized swimming team: you don't start the routine until all swimmers are ready and in position. Starting with missing teammates would waste everyone's effort and ruin the performance.
2.3. Discussion: Production Considerations
Scheduling Pattern Summary:
- Fair-Share: divides resources equally among users or groups, preventing monopolization
- Priority: runs important jobs first, preempting lower-priority work when needed
- Gang: starts all workers of a distributed job together, or not at all
- Elastic: starts with whatever workers are available and scales the worker count dynamically
Security Considerations:
- Priority Abuse: Malicious users creating only high-priority jobs
- Resource Limits: Administrators enforce priority quotas
- Fair Access: Prevent any user from starving others
Elastic Scheduling Alternative:
- Advantages: jobs start sooner and overall resource utilization improves
- Challenges: the effective batch size changes as workers join or leave, so hyperparameters such as the learning rate must be adjusted on the fly
2.4. Exercises
- Q: Can we only apply fair-share scheduling at the user level?
  A: No. Fair-share scheduling can be applied at each level of abstraction, such as processes, users, and groups.
- Q: Is gang scheduling suitable for all distributed model training jobs?
  A: No. Some machine learning frameworks support elastic scheduling, which lets a distributed training job start with however many workers are currently available instead of waiting for all requested workers to be ready for communication. In that case, gang scheduling is not suitable.
3. Metadata Pattern: Handling Failures Appropriately
3.1. The Problem: Complex Failure Scenarios
Simple Workflow Failure (Easy to Handle): step 1 succeeds, step 2 fails, and step 3 is never reached; recovery is simply retrying from the failed step.
Complex Workflow Failure (Challenging): an upstream data ingestion step completes, and one training step reaches its expected 85% accuracy, but the other two training steps (expected to reach 80% and 75%) fail. Only one model is available, so the downstream step that needs multiple models cannot proceed.
Challenge: Two training steps failed. Should we:
- Restart both from scratch? (Slow)
- Skip failed steps? (Lower quality)
- Something smarter?
3.2. The Solution: Intelligent Failure Recovery
Root Cause Analysis Through Metadata
Failure Types and Recovery Strategies:
- Transient failures (a simple retry usually succeeds): network hiccups, brief resource contention, worker preemption
- Permanent failures (retrying alone keeps failing; fix the root cause first): out of disk space, data source deleted, configuration errors
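A minimal sketch of a retry policy built on this transient/permanent distinction; the failure categories and retry limit are illustrative assumptions.

# Illustrative mapping from failure category to recovery action.
TRANSIENT = {"network_hiccup", "resource_contention", "worker_preempted"}
PERMANENT = {"out_of_disk", "data_source_deleted", "config_error"}

def recovery_action(failure_type, attempts, max_retries=3):
    if failure_type in TRANSIENT and attempts < max_retries:
        return "retry"                      # a simple retry usually succeeds
    if failure_type in PERMANENT:
        return "fix_root_cause_then_retry"  # e.g. rerun data ingestion, fix config
    return "escalate"                       # unknown failure or retries exhausted

print(recovery_action("network_hiccup", attempts=1))       # retry
print(recovery_action("data_source_deleted", attempts=0))  # fix_root_cause_then_retry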
Metadata Collection Example:
Metadata Collected Every 5 Minutes:
• Memory usage: 20MB → 25MB → 23MB → 27MB → 30MB → 28MB → 200MB → CRASH
• Data availability: YES → YES → YES → YES → YES → YES → YES → NO → NO
• CPU usage: 45% → 50% → 48% → 52% → 55% → 53% → 95% → CRASH
Diagnosis: Out-of-memory error. Solution: Increase memory allocation.
Advanced Failure Handling Strategies
Strategy 1: Progress-Based Recovery
- Training 1: 95% complete, 92% accuracy → ✅ KEEP
- Training 2: 90% complete, 89% accuracy → ❌ RETRY/RESUME
- Training 3: 20% complete, 27% accuracy → ❌ SKIP/CANCEL
Model Accuracy Trends:
Training 1: 30% → 45% → 60% → 75% → 85% → 90% → 92% (converging)
Training 2: 25% → 40% → 55% → 70% → 82% → 87% → 89% (converging)
Training 3: 35% → 40% → 42% → 38% → 35% → 30% → 27% (declining!)
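A minimal sketch of a progress-based recovery decision using the accuracy trends above; the convergence heuristic is an illustrative assumption, not a fixed rule.

def recovery_decision(failed, accuracy_history):
    """Pick a recovery action from a run's recorded accuracy metadata."""
    recent = accuracy_history[-4:]
    trend = (recent[-1] - recent[0]) / (len(recent) - 1)  # avg change per epoch
    if trend <= 0:
        return "skip/cancel"   # accuracy flat or declining: not converging
    if not failed:
        return "keep"          # healthy and still improving: let it finish
    return "retry/resume"      # converging but interrupted: resume it

# The three training runs shown above.
print(recovery_decision(False, [30, 45, 60, 75, 85, 90, 92]))  # keep
print(recovery_decision(True,  [25, 40, 55, 70, 82, 87, 89]))  # retry/resume
print(recovery_decision(True,  [35, 40, 42, 38, 35, 30, 27]))  # skip/cancel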
Strategy 2: Resource-Based Recovery
Training 3 Performance:
Progress: 1% every 30 minutes (4% every 2 hours) → roughly 50 hours to finish
Allocated Resources: 2 CPUs, 4 GB RAM
Option A: Continue Training 3
• Time: 50 hours
• Outcome: Likely poor model
• Cost: Block other experiments
Option B: Reallocate Resources
• Action: Cancel Training 3
• Benefit: Speed up Training 1 & 2
• Result: 2 good models in 25 hours
Decision: Cancel Training 3, reallocate resources
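A minimal sketch of this resource-based decision: extrapolate the remaining time from observed progress and reallocate when a run is both slow and not improving. The budget threshold is an illustrative assumption.

def estimated_hours_remaining(progress_fraction, hours_elapsed):
    """Rough linear extrapolation of time to completion from observed progress."""
    rate = progress_fraction / hours_elapsed           # fraction completed per hour
    return (1.0 - progress_fraction) / rate

def should_reallocate(progress_fraction, hours_elapsed, accuracy_trend, budget_hours):
    """Cancel a run and reallocate its resources if it is both too slow and not improving."""
    too_slow = estimated_hours_remaining(progress_fraction, hours_elapsed) > budget_hours
    not_improving = accuracy_trend <= 0
    return too_slow and not_improving

# Training 3: about 4% done after 2 hours, accuracy declining, 25-hour budget.
print(should_reallocate(0.04, 2.0, accuracy_trend=-3.0, budget_hours=25))  # True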
Insight
Metadata-driven failure recovery is like having a detailed medical record during emergency treatment. Instead of guessing what's wrong, doctors can see exactly what happened, when it happened, and what treatments will be most effective.
Implementation Example
import time

import psutil  # third-party library used here for resource metrics


class MetadataCollector:
    def __init__(self, step_name):
        self.step_name = step_name
        self.metadata = {
            'start_time': time.time(),
            'resource_usage': [],
            'performance_metrics': [],
            'dependencies': [],
            'checkpoints': []
        }

    def record_resource_usage(self):
        # Snapshot system-level resource metrics for later failure analysis.
        self.metadata['resource_usage'].append({
            'timestamp': time.time(),
            'memory_mb': psutil.virtual_memory().used / 1024 / 1024,
            'cpu_percent': psutil.cpu_percent(),
            'disk_io': psutil.disk_io_counters()
        })

    def record_training_progress(self, epoch, accuracy, loss):
        # Record per-epoch training metrics alongside a timestamp.
        self.metadata['performance_metrics'].append({
            'epoch': epoch,
            'accuracy': accuracy,
            'loss': loss,
            'timestamp': time.time()
        })

    def analyze_failure(self, error):
        # Analyze patterns in metadata to determine failure type
        recent_memory = [r['memory_mb'] for r in self.metadata['resource_usage'][-5:]]
        memory_spike = len(recent_memory) >= 2 and max(recent_memory) > 2 * min(recent_memory)
        if memory_spike:
            return {
                'failure_type': 'out_of_memory',
                'recommendation': 'increase_memory_allocation',
                'retry': True
            }
        # ... other analysis logic
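A brief usage sketch of the collector above; the step name and the simulated error are illustrative.

collector = MetadataCollector('train_fraud_model')
collector.record_resource_usage()  # call this periodically, e.g. every 5 minutes
collector.record_training_progress(epoch=1, accuracy=0.45, loss=1.2)

# On failure, ask the collector what the metadata suggests.
diagnosis = collector.analyze_failure(RuntimeError('worker crashed'))
print(diagnosis)  # an out_of_memory diagnosis if a memory spike was recorded, otherwise None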
3.3. Discussion: Advanced Metadata Applications
Network Performance Monitoring
- Worker 1: latency 50 ms, bandwidth 1 Gbps
- Worker 2: latency 45 ms, bandwidth 1 Gbps
- Worker 3: latency 500 ms (!), bandwidth 100 Mbps
Bottleneck: Worker 3 is slow.
Action: Replace Worker 3 via elastic scaling.
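A minimal sketch of spotting such a bottleneck from per-worker network metadata; the worker names and the 5× median-latency threshold are illustrative assumptions.

import statistics

workers = {
    'worker-1': {'latency_ms': 50,  'bandwidth_gbps': 1.0},
    'worker-2': {'latency_ms': 45,  'bandwidth_gbps': 1.0},
    'worker-3': {'latency_ms': 500, 'bandwidth_gbps': 0.1},
}

median_latency = statistics.median(w['latency_ms'] for w in workers.values())
slow = [name for name, w in workers.items() if w['latency_ms'] > 5 * median_latency]
print(slow)  # ['worker-3'] -> candidate for replacement via elastic scaling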
Model Lineage and Provenance
[Figure: lineage graph linking the training data, preprocessing steps, and the resulting model (92% accuracy, trained 2024-01-15).]
Metadata Enables:
- Which data contributed to this model?
- What preprocessing was applied?
- Can we reproduce this exact result?
- What happens if we update the dataset?
Governance Benefits:
- Reproducibility: Exact recreation of training conditions
- Compliance: Audit trail for regulated industries
- Debugging: Trace issues back to root causes
- Optimization: Identify which factors improve performance
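A minimal sketch of what a lineage record answering these questions could look like; the field names and values are illustrative, not a standard schema.

lineage_record = {
    'model_id': 'fraud-detector-v3',            # illustrative identifier
    'trained_at': '2024-01-15',
    'accuracy': 0.92,
    'dataset': {'name': 'transactions', 'version': '2024-01-10'},
    'preprocessing': ['deduplicate', 'normalize_amounts', 'one_hot_encode_merchant'],
    'training_config': {'epochs': 20, 'learning_rate': 1e-3},
    'code_commit': 'abc1234',                    # ties the result to exact code
}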
3.4. Exercises
- Q: If a training step failed because the training data source was lost, what should we do?
  A: Rerun data ingestion before retrying the model training step. This failure is permanent, so simply retrying would lead to repeated failures.
- Q: What type of metadata can be collected if we look at individual workers or parameter servers?
  A: Network performance metrics recorded while the model is being trained (e.g., bandwidth, throughput, and latency). This information is very useful for detecting workers whose poor network performance blocks the entire training process.
4. Summary and Exercises
Key Concepts Mastered
Scheduling patterns:
- Fair-Share: equal resource distribution
- Priority: importance-based execution
- Gang: coordinated distributed starts
- Elastic: dynamic worker scaling
Metadata patterns:
- Root cause analysis
- Progress-based recovery
- Resource-based decisions
- Lineage tracking
Core Principles
- Resource Management: Intelligent scheduling prevents conflicts and maximizes utilization
- Failure Recovery: Metadata-driven decisions minimize downtime and optimize outcomes
- System Observability: Comprehensive monitoring enables proactive optimization
- Operational Efficiency: Patterns reduce manual intervention and improve reliability
Real-World Applications
- Multi-tenant ML platforms with fair resource sharing
- Production training pipelines with intelligent failure recovery
- High-availability inference systems with priority scheduling
- Regulatory compliance with comprehensive metadata tracking
Insight
Operations patterns transform ML systems from fragile, hard-to-debug monoliths into resilient, observable, and efficiently managed platforms. Master these patterns to build systems that scale gracefully and fail intelligently.
Practice Exercises
Scheduling Scenarios:
1. Design fair-share allocation for 3 teams with 2, 4, and 6 members respectively
2. Plan priority levels for: user experiments, production retraining, system maintenance
3. Determine when to use gang vs. elastic scheduling for different workload types
Failure Recovery:
4. Create a metadata collection strategy for a 24/7 recommendation system
5. Design recovery procedures for: data corruption, worker failures, resource exhaustion
6. Plan rollback strategies for failed model deployments
System Design:
7. Architect a scheduling system for mixed workloads: batch training, real-time inference, experimentation
8. Design a metadata schema for regulatory compliance in financial ML systems
9. Create a monitoring dashboard for a multi-team ML platform