
Chapter 6: Operation Patterns

Managing and optimizing distributed machine learning systems in production


Table of Contents

  1. What are Operations in Machine Learning Systems?
  2. Scheduling Patterns: Assigning Resources Effectively
  3. Metadata Pattern: Handling Failures Appropriately
  4. Summary and Exercises

Think of ML operations like managing a busy airport — you need intelligent scheduling to manage runway traffic (resource allocation), comprehensive monitoring to track all flights (metadata collection), and smart protocols to handle delays and cancellations (failure recovery) while keeping passengers satisfied.


1. What are Operations in Machine Learning Systems?

Operations in ML systems go beyond managing individual components like data ingestion or model training. They focus on cross-cutting concerns that affect the entire system: resource management, failure handling, monitoring, and optimization.

1.1. Beyond Individual Components

Unlike patterns specific to individual components, operational patterns address challenges that span multiple steps in ML workflows. Think of it as the difference between optimizing individual instruments in an orchestra versus conducting the entire performance.

1.2. The Challenge of Black Box Systems

Problem Scenario:

Complex Workflow with Multiple Failures:

  • Data Ingestion
  • Model Training 1: ❌ Failed
  • Model Training 2: ✅ Success
  • Model Training 3: ❌ Failed
  • Model Serving A: ❌ Failed
  • Model Serving B: ✅ Success

The Challenge: When failures occur, we can only see that something failed, not why it failed or how to fix it efficiently.

Insight

Operations patterns are like installing a comprehensive monitoring and control system in your production facility. Instead of just knowing "something broke," you get detailed diagnostics about what, why, and how to fix it optimally.


2. Scheduling Patterns: Assigning Resources Effectively

2.1. The Problem: Resource Competition and Blocking

Scenario: Multiple data scientists sharing a cluster for model training.

User Stories:

  • User A: Training fraud detection models
  • User B: Building condition monitoring for industrial equipment
  • User C: Developing recommendation systems

Problem with Simple First-Come-First-Served Scheduling:

FIFO Scheduling Creates Long Wait Times:

  • User A: starts immediately and occupies the cluster for a long stretch
  • User B: waits until User A finishes before starting
  • User C: waits until both User A and User B finish before starting

Additional Problems:

  1. Resource Competition: Users wake up at 3 AM to submit jobs when fewer people are online
  2. Distributed Training Blocking: If only 3 out of 4 required workers are available, the job can't start
  3. Resource Waste: Allocated but unused workers sit idle

2.2. The Solution: Intelligent Scheduling Strategies

Fair-Share Scheduling

Core Principle: Divide computational resources equally among users or groups.

Simple Fair-Share Example (4 Users):

  • User A: 25%
  • User B: 25%
  • User C: 25%
  • User D: 25%

What if User B starts 2 processes?

  • User A: 25% → Process A1: 25%
  • User B: 25% → Process B1: 12.5%, Process B2: 12.5%
  • User C: 25% → Process C1: 25%
  • User D: 25% → Process D1: 25%

Hierarchical Fair-Share (Groups + Users):

Hierarchical Resource Allocation (Total: 100%)

  • Group 1: 33.3% ÷ 3 users (U1, U2, U3) = 11.1% per user
  • Group 2: 33.3% ÷ 2 users (U4, U5) = 16.7% per user
  • Group 3: 33.3% ÷ 4 users (U6, U7, U8, U9) = 8.3% per user
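
To make the arithmetic concrete, the sketch below computes the same allocation in code. It is a minimal illustration, assuming the group membership shown above; nothing here is tied to a particular scheduler.

# Minimal sketch of hierarchical fair-share allocation.
# Group membership mirrors the example above (assumed for illustration).
groups = {
    "Group 1": ["U1", "U2", "U3"],
    "Group 2": ["U4", "U5"],
    "Group 3": ["U6", "U7", "U8", "U9"],
}

def hierarchical_fair_share(groups, total=100.0):
    """Split `total` equally among groups, then equally among each group's users."""
    shares = {}
    group_share = total / len(groups)
    for group, users in groups.items():
        per_user = group_share / len(users)
        for user in users:
            shares[user] = per_user
    return shares

for user, share in hierarchical_fair_share(groups).items():
    print(f"{user}: {share:.1f}%")
# U1-U3 get ~11.1%, U4-U5 get ~16.7%, U6-U9 get ~8.3%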

Insight

Fair-share scheduling is like a restaurant with multiple dining rooms: each room gets equal space, and within each room, tables are shared equally among diners. This prevents any single group from monopolizing the entire restaurant.

Priority Scheduling

The Deadlock Problem:

Deadlock Scenario Requiring Priority Scheduling (Group 1, 3 users):

  • Admin (User 1), 11.1% CPU: a normal job is running; the admin job is WAITING
  • User 2, 11.1% CPU: Job 1 is RUNNING but depends on Job 2
  • User 3, 11.1% CPU: Job 2 is STUCK on a database issue and needs a restart

DEADLOCK: Job 1 waits for Job 2 → Job 2 is stuck → the admin job waits behind Job 1 → no progress!

Priority Scheduling Solution:

Priority-Based Execution Order:

  1. Job A (HIGH): runs first
  2. Job C (HIGH): runs second
  3. Job D (MEDIUM): runs third
  4. Job B (LOW): runs last

Priority Scheduling with Preemption:

Timeline (Job C, HIGH priority, is not submitted in this example):

Time 0: Job B (LOW) starts running
Time 5: Job A (HIGH) submitted → Job B preempted → Job A starts
Time 8: Job A completes → Job B resumes
Time 12: Job D (MEDIUM) submitted → Job B preempted → Job D starts
Time 16: Job D completes → Job B resumes
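
The preemption behavior can be reproduced with a small simulation. The sketch below is a toy scheduler, not a production one; the job durations (A: 3, D: 4, B: 12 time units) are assumptions chosen so the output matches the timeline above.

import heapq

# Toy preemptive priority scheduler. Lower number = higher priority.
HIGH, MEDIUM, LOW = 0, 1, 2

# (arrival_time, name, priority, duration); durations are assumptions for illustration.
jobs = [(0, "B", LOW, 12), (5, "A", HIGH, 3), (12, "D", MEDIUM, 4)]

def simulate(jobs):
    events = sorted(jobs)             # pending arrivals, in time order
    ready = []                        # heap of (priority, arrival, name, remaining)
    time, running, log = 0, None, []
    while events or ready or running:
        # Admit every job that has arrived by `time`.
        while events and events[0][0] <= time:
            arrival, name, prio, dur = events.pop(0)
            heapq.heappush(ready, (prio, arrival, name, dur))
        # Preempt the running job if a strictly higher-priority job is waiting.
        if running and ready and ready[0][0] < running[0]:
            heapq.heappush(ready, running)
            running = None
        if running is None and ready:
            running = heapq.heappop(ready)
            log.append((time, running[2]))        # record each start/resume
        # Advance time to the next arrival or to job completion, whichever is sooner.
        next_arrival = events[0][0] if events else float("inf")
        if running:
            prio, arrival, name, remaining = running
            step = min(remaining, max(next_arrival - time, 1))
            time += step
            remaining -= step
            running = None if remaining == 0 else (prio, arrival, name, remaining)
        else:
            time = next_arrival
    return log

print(simulate(jobs))  # [(0, 'B'), (5, 'A'), (8, 'B'), (12, 'D'), (16, 'B')]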

Gang Scheduling

The Problem: Distributed Training Coordination

Without Gang Scheduling (Resource Waste):

Inefficient Resource Usage Without Gang Scheduling:

  • Worker 1: BLOCKED, ❌ missing v0 and v2 (network link down)
  • Worker 2: READY, ✅ has v1, sitting idle (wasting resources)
  • Worker 3: BLOCKED, ❌ missing v0 and v2

Network instability prevents gradient exchange. Worker 2 wastes resources waiting.

With Gang Scheduling (Coordinated Start):

Gang Scheduling Ensures Coordinated Execution:

Before network stability:
  • Worker 1: NOT READY, waiting
  • Worker 2: READY, waiting for the others
  • Worker 3: NOT READY, waiting

After the network stabilizes:
  • Workers 1, 2, and 3: all READY, all start together

Successful AllReduce: v0, v1, and v2 are exchanged across all workers.

Gang Scheduling Decision Process:

  1. Check if ALL workers are ready
  2. If not, don't start ANY workers
  3. Wait until network stabilizes
  4. Start all workers simultaneously
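
A minimal sketch of that decision loop is shown below. It assumes each worker object exposes a hypothetical readiness probe and a start() method; real clusters would use framework- or Kubernetes-level readiness checks instead.

import time

def gang_start(workers, worker_is_ready, poll_interval=5.0):
    """Block until every worker is ready, then start them all together."""
    while True:
        ready = [w for w in workers if worker_is_ready(w)]
        if len(ready) == len(workers):
            break                      # all workers ready: safe to start the gang
        print(f"{len(ready)}/{len(workers)} workers ready; starting NONE of them yet")
        time.sleep(poll_interval)      # wait for the network/cluster to stabilize
    for w in workers:
        w.start()                      # start all workers simultaneously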

Insight

Gang scheduling is like coordinating a synchronized swimming team: you don't start the routine until all swimmers are ready and in position. Starting with missing teammates would waste everyone's effort and ruin the performance.

2.3. Discussion: Production Considerations

Scheduling Pattern Summary:

  • Fair-Share: equal resource distribution across users/groups
    ✅ Prevents monopolization ❌ No urgency handling
  • Priority: importance-based execution with preemption
    ✅ Critical jobs run first ❌ Can be gamed
  • Gang: coordinated group starts for distributed training
    ✅ Optimizes communication ❌ Delays if one worker is slow

Security Considerations:

  • Priority Abuse: Malicious users creating only high-priority jobs
  • Resource Limits: Administrators enforce priority quotas
  • Fair Access: Prevent any user from starving others

Elastic Scheduling Alternative:

Traditional Gang vs Elastic Scheduling:

  • Traditional gang scheduling: wait for ALL 4 workers → start training
  • Elastic scheduling: start with Workers 1 and 2 → add Worker 3 when it becomes available → add Worker 4 when it becomes available

Advantages: Faster start, better resource utilization
Challenges: the effective batch size changes as workers join or leave, so the learning rate must be adjusted dynamically
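
The batch-size and learning-rate bookkeeping can be sketched as follows. The linear learning-rate scaling used here is a common heuristic, not something prescribed by this chapter, and the numbers are illustrative.

# Sketch of elastic-scheduling bookkeeping: when the worker count changes, the
# global batch size changes with it; linear learning-rate scaling (an assumption
# here, not a requirement) rescales the learning rate proportionally.
def adjust_for_workers(num_workers, per_worker_batch=32, base_lr=0.1, base_workers=1):
    global_batch = num_workers * per_worker_batch
    lr = base_lr * (num_workers / base_workers)
    return global_batch, lr

for n in (2, 3, 4):                    # start with 2 workers, then add one at a time
    batch, lr = adjust_for_workers(n)
    print(f"{n} workers -> global batch {batch}, learning rate {lr:.2f}")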

2.4. Exercises

  1. Q: Can we only apply fair-share scheduling at the user level?

    A: No, we can apply this scheduling strategy at each level of abstraction, such as processes, users, groups, etc.

  2. Q: Is gang scheduling suitable for all distributed model training jobs?

    A: No, some machine learning frameworks support elastic scheduling, which allows distributed model training jobs to start with any number of workers available without waiting for all the requested workers to be ready for communication. In this case, gang scheduling is not suitable.


3. Metadata Pattern: Handling Failures Appropriately

3.1. The Problem: Complex Failure Scenarios

Simple Workflow Failure (Easy to Handle):

Simple Linear Failure:

  • Data Ingestion → Model Training (❌ Failed) → Model Serving (not reached)

Simple fix: retry the failed step and continue from there.

Complex Workflow Failure (Challenging):

Complex Failure Scenario Requiring Intelligent Decisions:

  • Data Ingestion: ✅ Complete
  • Model Training 1: ✅ 85% accuracy (expected: 85%)
  • Model Training 2: ❌ Failed (expected: 80%)
  • Model Training 3: ❌ Failed (expected: 75%)
  • Model Selection (top 2): ⚠️ only 1 model available
  • Result Aggregation: cannot proceed

Challenge: Two training steps failed. Should we:

  1. Restart both from scratch? (Slow)
  2. Skip failed steps? (Lower quality)
  3. Something smarter?

3.2. The Solution: Intelligent Failure Recovery

Root Cause Analysis Through Metadata

Failure Types and Recovery Strategies:

Failure Classification:

TEMPORARY FAILURES
  • Network hiccups
  • Brief resource contention
  • Worker preemption
  Recovery: RETRY

PERMANENT FAILURES
  • Out of disk space
  • Data source deleted
  • Configuration errors
  Recovery: FIX ROOT CAUSE
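
A sketch of how this classification might drive a recovery decision, with illustrative failure labels rather than a real framework's error types:

# Map a failure to a recovery action, following the classification above.
# The labels are illustrative; a real system would inspect typed exceptions or exit codes.
TEMPORARY = {"network_timeout", "resource_contention", "worker_preempted"}
PERMANENT = {"out_of_disk", "data_source_deleted", "bad_configuration"}

def recovery_action(failure_kind):
    if failure_kind in TEMPORARY:
        return "retry"
    if failure_kind in PERMANENT:
        return "fix_root_cause_first"
    return "investigate"              # unknown failures need a human (or more metadata)

print(recovery_action("network_timeout"))       # retry
print(recovery_action("data_source_deleted"))   # fix_root_cause_first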

Metadata Collection Example:

Training Step Failure Analysis:

Metadata Collected Every 5 Minutes:
• Memory usage: 20MB → 25MB → 23MB → 27MB → 30MB → 28MB → 200MB → CRASH
• Data availability: YES → YES → YES → YES → YES → YES → YES → NO → NO
• CPU usage: 45% → 50% → 48% → 52% → 55% → 53% → 95% → CRASH

Diagnosis: Out-of-memory error. Solution: Increase memory allocation.

Advanced Failure Handling Strategies

Strategy 1: Progress-Based Recovery

Progress-Based Recovery Decisions:

  • Training 1: 95% complete, 92% accuracy → ✅ KEEP
  • Training 2: 90% complete, 89% accuracy → ❌ RETRY/RESUME
  • Training 3: 20% complete, 27% accuracy → ❌ SKIP/CANCEL

Model Accuracy Trends:
Training 1: 30% → 45% → 60% → 75% → 85% → 90% → 92% (converging)
Training 2: 25% → 40% → 55% → 70% → 82% → 87% → 89% (converging)
Training 3: 35% → 40% → 42% → 38% → 35% → 30% → 27% (declining!)
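
A possible encoding of this decision rule, with thresholds that are assumptions for illustration (95% completion, 80% target accuracy, and a crude three-point trend check):

# Sketch of a progress-based recovery rule: inspect the recent accuracy trend of each
# training run and decide whether to keep, resume, or cancel it.
def recovery_decision(accuracies, completion, target=0.8):
    improving = accuracies[-1] > accuracies[-3]        # crude trend check over last 3 points
    if completion >= 0.95 and accuracies[-1] >= target:
        return "KEEP"
    if improving:
        return "RETRY/RESUME"                          # converging: resume from checkpoint
    return "SKIP/CANCEL"                               # declining: stop wasting resources

print(recovery_decision([0.85, 0.90, 0.92], completion=0.95))   # KEEP
print(recovery_decision([0.82, 0.87, 0.89], completion=0.90))   # RETRY/RESUME
print(recovery_decision([0.35, 0.30, 0.27], completion=0.20))   # SKIP/CANCEL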

Strategy 2: Resource-Based Recovery

Cost-Benefit Analysis

Training 3 Performance:
Progress: 1% every 30 minutes (2% per hour) → roughly 50 hours for the full run!
Allocated Resources: 2 CPUs, 4GB RAM

Option A: Continue Training 3
• Time: 50 hours
• Outcome: Likely poor model
• Cost: Block other experiments

Option B: Reallocate Resources
• Action: Cancel Training 3
• Benefit: Speed up Training 1 & 2
• Result: 2 good models in 25 hours

Decision: Cancel Training 3, reallocate resources
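
The back-of-the-envelope arithmetic behind that decision, using the numbers from the example above:

# Training 3 advances 1% every 30 minutes and is currently 20% complete.
progress_per_hour = 1.0 / 0.5          # 1% per 30 minutes = 2% per hour
remaining_percent = 100 - 20
hours_remaining = remaining_percent / progress_per_hour
hours_total = 100 / progress_per_hour
print(f"~{hours_remaining:.0f} hours remaining, ~{hours_total:.0f} hours total")
# ~40 hours remaining, ~50 hours total: cancelling and reallocating is the better trade.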

Insight

Metadata-driven failure recovery is like having a detailed medical record during emergency treatment. Instead of guessing what's wrong, doctors can see exactly what happened, when it happened, and what treatments will be most effective.

Implementation Example

import time

import psutil  # third-party package for system resource metrics


class MetadataCollector:
    def __init__(self, step_name):
        self.step_name = step_name
        self.metadata = {
            'start_time': time.time(),
            'resource_usage': [],
            'performance_metrics': [],
            'dependencies': [],
            'checkpoints': []
        }

    def record_resource_usage(self):
        # Snapshot system-level resource metrics for later failure analysis.
        self.metadata['resource_usage'].append({
            'timestamp': time.time(),
            'memory_mb': psutil.virtual_memory().used / 1024 / 1024,
            'cpu_percent': psutil.cpu_percent(),
            'disk_io': psutil.disk_io_counters()
        })

    def record_training_progress(self, epoch, accuracy, loss):
        # Track per-epoch model quality alongside resource usage.
        self.metadata['performance_metrics'].append({
            'epoch': epoch,
            'accuracy': accuracy,
            'loss': loss,
            'timestamp': time.time()
        })

    def analyze_failure(self, error):
        # Analyze patterns in metadata to determine failure type
        recent_memory = [r['memory_mb'] for r in self.metadata['resource_usage'][-5:]]
        memory_spike = max(recent_memory) > 2 * min(recent_memory)

        if memory_spike:
            return {
                'failure_type': 'out_of_memory',
                'recommendation': 'increase_memory_allocation',
                'retry': True
            }
        # ... other analysis logic
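
A hypothetical usage sketch for the collector above; the training loop and metrics are placeholders, and the failure analysis call simply shows the intended pattern:

# Hypothetical usage of MetadataCollector; metrics are stand-ins, not real training output.
collector = MetadataCollector("model_training_2")
for epoch in range(5):
    collector.record_resource_usage()
    accuracy, loss = 0.5 + 0.05 * epoch, 1.0 / (epoch + 1)   # dummy metrics
    collector.record_training_progress(epoch, accuracy, loss)

# On failure, the collected metadata drives the recovery decision:
plan = collector.analyze_failure(error=RuntimeError("worker crashed"))
if plan and plan.get("retry"):
    print("Retry with:", plan["recommendation"])
else:
    print("No memory spike detected; inspect other metadata before retrying.")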

3.3. Discussion: Advanced Metadata Applications

Network Performance Monitoring

Network Performance Metadata Identifies Bottlenecks:

  • Worker 1: latency 50 ms, bandwidth 1 Gbps
  • Worker 2: latency 45 ms, bandwidth 1 Gbps
  • Worker 3: latency 500 ms(!), bandwidth 100 Mbps

All three workers communicate with a Parameter Server.

Bottleneck: Worker 3 is slow. Action: replace Worker 3 via elastic scaling.
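
A sketch of how that detection could work on the collected metadata; the thresholds and worker names are illustrative:

# Flag stragglers from network metadata. Thresholds are illustrative; a real system
# would compare against a rolling baseline rather than fixed limits.
workers = {
    "worker-1": {"latency_ms": 50, "bandwidth_gbps": 1.0},
    "worker-2": {"latency_ms": 45, "bandwidth_gbps": 1.0},
    "worker-3": {"latency_ms": 500, "bandwidth_gbps": 0.1},
}

def find_stragglers(workers, max_latency_ms=200, min_bandwidth_gbps=0.5):
    return [name for name, m in workers.items()
            if m["latency_ms"] > max_latency_ms or m["bandwidth_gbps"] < min_bandwidth_gbps]

print(find_stragglers(workers))   # ['worker-3'] -> candidate for replacement via elastic scaling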

Model Lineage and Provenance

Model Lineage Tracking:

  Dataset_v1.2 + Feature_eng_v3 + Preprocessing_A + Hyperparams_set1 → Model_X_v1.0 (92% accuracy) → Production Deployment (2024-01-15)

Metadata Enables:

  • Which data contributed to this model?
  • What preprocessing was applied?
  • Can we reproduce this exact result?
  • What happens if we update the dataset?

Governance Benefits:

  • Reproducibility: Exact recreation of training conditions
  • Compliance: Audit trail for regulated industries
  • Debugging: Trace issues back to root causes
  • Optimization: Identify which factors improve performance
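
A minimal sketch of what a lineage record supporting these questions might look like; the field names are illustrative, not a prescribed schema:

# Illustrative lineage record for the example above.
lineage_record = {
    "model": "Model_X_v1.0",
    "accuracy": 0.92,
    "dataset": "Dataset_v1.2",
    "feature_engineering": "Feature_eng_v3",
    "preprocessing": "Preprocessing_A",
    "hyperparameters": "Hyperparams_set1",
    "deployed_on": "2024-01-15",
}

def can_reproduce(record, required=("dataset", "preprocessing", "hyperparameters")):
    """A model run is reproducible only if every required lineage field was recorded."""
    return all(record.get(field) for field in required)

print(can_reproduce(lineage_record))   # True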

3.4. Exercises

  1. Q: If the training step failed due to loss of training data source, what should we do?

    A: We should rerun data ingestion before retrying the model training step since this failure is permanent, and simply retrying would lead to repetitive failures.

  2. Q: What type of metadata can be collected if we look at individual workers or parameter servers?

    A: Various network performance metrics while the model is being trained (e.g., bandwidth, throughput, and latency). This type of information is very useful when we want to detect when workers experience poor network performance that blocks the entire training process.


4. Summary and Exercises

Key Concepts Mastered

📊 Scheduling Patterns
  • Fair-Share: Equal resource distribution
  • Priority: Importance-based execution
  • Gang: Coordinated distributed starts
  • Elastic: Dynamic worker scaling
📋 Metadata Patterns
  • Root Cause Analysis
  • Progress-Based Recovery
  • Resource-Based Decisions
  • Lineage Tracking

Core Principles

  1. Resource Management: Intelligent scheduling prevents conflicts and maximizes utilization
  2. Failure Recovery: Metadata-driven decisions minimize downtime and optimize outcomes
  3. System Observability: Comprehensive monitoring enables proactive optimization
  4. Operational Efficiency: Patterns reduce manual intervention and improve reliability

Real-World Applications

  • Multi-tenant ML platforms with fair resource sharing
  • Production training pipelines with intelligent failure recovery
  • High-availability inference systems with priority scheduling
  • Regulatory compliance with comprehensive metadata tracking

Insight

Operations patterns transform ML systems from fragile, hard-to-debug monoliths into resilient, observable, and efficiently managed platforms. Master these patterns to build systems that scale gracefully and fail intelligently.

Practice Exercises

Scheduling Scenarios:

  1. Design fair-share allocation for 3 teams with 2, 4, and 6 members respectively
  2. Plan priority levels for: user experiments, production retraining, system maintenance
  3. Determine when to use gang vs elastic scheduling for different workload types

Failure Recovery:

  4. Create metadata collection strategy for a 24/7 recommendation system
  5. Design recovery procedures for: data corruption, worker failures, resource exhaustion
  6. Plan rollback strategies for failed model deployments

System Design:

  7. Architect scheduling system for mixed workloads: batch training, real-time inference, experimentation
  8. Design metadata schema for regulatory compliance in financial ML systems
  9. Create monitoring dashboard for multi-team ML platform


Previous: Chapter 5: Workflow Patterns | Next: Chapter 7: Project Overview & System Architecture