Chapter 6: Operation Patterns
Managing and optimizing distributed machine learning systems in production
Table of Contents
- What are Operations in Machine Learning Systems?
- Scheduling Patterns: Assigning Resources Effectively
- Metadata Pattern: Handling Failures Appropriately
- Summary and Exercises
Think of ML operations like managing a busy airport — you need intelligent scheduling to manage runway traffic (resource allocation), comprehensive monitoring to track all flights (metadata collection), and smart protocols to handle delays and cancellations (failure recovery) while keeping passengers satisfied.
1. What are Operations in Machine Learning Systems?
Operations in ML systems go beyond managing individual components like data ingestion or model training. They focus on cross-cutting concerns that affect the entire system: resource management, failure handling, monitoring, and optimization.
1.1. Beyond Individual Components
Unlike patterns specific to individual components, operational patterns address challenges that span multiple steps in ML workflows. Think of it as the difference between optimizing individual instruments in an orchestra versus conducting the entire performance.
1.2. The Challenge of Black Box Systems
Problem Scenario: [Figure: a multi-step ML workflow in which some steps succeed and others fail, with no indication of why.]
The Challenge: When failures occur, we can only see that something failed, not why it failed or how to fix it efficiently.
Insight
Operations patterns are like installing a comprehensive monitoring and control system in your production facility. Instead of just knowing "something broke," you get detailed diagnostics about what, why, and how to fix it optimally.
2. Scheduling Patterns: Assigning Resources Effectively
2.1. The Problem: Resource Competition and Blocking
Scenario: Multiple data scientists sharing a cluster for model training.
User Stories:
- User A: Training fraud detection models
- User B: Building condition monitoring for industrial equipment
- User C: Developing recommendation systems
Problem with Simple First-Come-First-Served Scheduling: jobs run strictly in submission order, so a long-running job at the front of the queue blocks everything behind it, regardless of job size or urgency.
Additional Problems:
- Resource Competition: Users wake up at 3 AM to submit jobs when fewer people are online
- Distributed Training Blocking: If only 3 out of 4 required workers are available, the job can't start
- Resource Waste: Allocated but unused workers sit idle
2.2. The Solution: Intelligent Scheduling Strategies
Fair-Share Scheduling
Core Principle: Divide computational resources equally among users or groups.
Simple Fair-Share Example (4 Users): each user receives 25% of the cluster.
What if User B starts 2 processes?
User A: 25%  └─ Process A1: 25%
User B: 25%  ├─ Process B1: 12.5%
             └─ Process B2: 12.5%
User C: 25%  └─ Process C1: 25%
User D: 25%  └─ Process D1: 25%
Hierarchical Fair-Share (Groups + Users): each of the three groups receives 33.3% of the cluster, which is then split equally among that group's users:
Group 1: 33.3% ÷ 3 users = 11.1% per user
Group 2: 33.3% ÷ 2 users = 16.7% per user
Group 3: 33.3% ÷ 4 users = 8.3% per user
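A minimal sketch of how these hierarchical quotas could be computed; the group and user names are illustrative and not tied to any particular scheduler.

def hierarchical_fair_share(groups):
    """Split the cluster equally among groups, then equally among each group's users.

    `groups` maps a group name to a list of user names (illustrative structure).
    Returns a dict mapping each user to their fraction of total cluster resources.
    """
    shares = {}
    group_share = 1.0 / len(groups)            # each group gets an equal slice
    for group, users in groups.items():
        user_share = group_share / len(users)  # split the group's slice equally
        for user in users:
            shares[user] = user_share
    return shares

# Example matching the numbers above: groups of 3, 2, and 4 users.
groups = {
    "group1": ["u1", "u2", "u3"],
    "group2": ["u4", "u5"],
    "group3": ["u6", "u7", "u8", "u9"],
}
for user, share in hierarchical_fair_share(groups).items():
    print(f"{user}: {share:.1%}")  # 11.1%, 16.7%, or 8.3%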
Insight
Fair-share scheduling is like a restaurant with multiple dining rooms: each room gets equal space, and within each room, tables are shared equally among diners. This prevents any single group from monopolizing the entire restaurant.
Priority Scheduling
The Deadlock Problem:
- Admin (User 1), 11.1% CPU: a normal job is running; the admin's maintenance job is WAITING behind it.
- User 2, 11.1% CPU: Job 1 is RUNNING but depends on Job 2.
- User 3, 11.1% CPU: Job 2 is STUCK on a database issue and needs a restart.
DEADLOCK: Job 1 waits for Job 2 → Job 2 is stuck → the admin job that could restart the database waits behind Job 1 → no progress!
Priority Scheduling Solution: assign each job a priority (e.g., HIGH, MEDIUM, LOW) and run higher-priority jobs first, preempting lower-priority jobs when necessary. [Figure: a job queue reordered by priority: HIGH, HIGH, MEDIUM, LOW.]
Priority Scheduling with Preemption (example timeline, time units 0–20):
- Job A (HIGH): runs from time 5 to 8
- Job B (LOW): runs 0–5, 8–12, and 16 onward; preempted twice
- Job C (HIGH): never submitted in this window
- Job D (MEDIUM): runs from time 12 to 16
Time 0: Job B (LOW) starts running
Time 5: Job A (HIGH) submitted → Job B preempted → Job A starts
Time 8: Job A completes → Job B resumes
Time 12: Job D (MEDIUM) submitted → Job B preempted → Job D starts
Time 16: Job D completes → Job B resumes
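A minimal simulation of preemptive priority scheduling that reproduces the timeline above; the job names, priority mapping, and durations are illustrative assumptions.

import heapq

# Lower number = higher priority (illustrative mapping for this sketch).
PRIORITY = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}

def simulate(submissions, total_time):
    """Simulate preemptive priority scheduling, one job running per time unit.

    `submissions` maps a submission time to (job_name, priority, duration).
    Returns a string showing which job ran at each time step.
    """
    ready = []  # heap of (priority, submit_time, name, remaining_work)
    timeline = []
    for t in range(total_time):
        if t in submissions:
            name, prio, duration = submissions[t]
            heapq.heappush(ready, (PRIORITY[prio], t, name, duration))
        if ready:
            prio, submitted, name, remaining = heapq.heappop(ready)
            timeline.append(name)
            if remaining > 1:  # unfinished: requeue with one less unit of work
                heapq.heappush(ready, (prio, submitted, name, remaining - 1))
        else:
            timeline.append("-")
    return "".join(timeline)

# Matches the timeline above: B (LOW) at time 0, A (HIGH) at 5, D (MEDIUM) at 12.
submissions = {0: ("B", "LOW", 13), 5: ("A", "HIGH", 3), 12: ("D", "MEDIUM", 4)}
print(simulate(submissions, 20))  # BBBBBAAABBBBDDDDBBBB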
Gang Scheduling
The Problem: Distributed Training Coordination
Without Gang Scheduling (Resource Waste): some workers are READY (e.g., a worker that already has gradient v1) while others are BLOCKED waiting on missing gradients (v0, v2). The ready workers sit idle, wasting their allocated resources, because the collective communication cannot proceed until every worker participates.
With Gang Scheduling (Coordinated Start): no worker starts until every worker in the group is ready; once all of them report READY, they all start together.
Gang Scheduling Decision Process (a code sketch follows this list):
- Check if ALL workers are ready
- If not, don't start ANY workers
- Wait until network stabilizes
- Start all workers simultaneously
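A minimal sketch of this all-or-nothing start decision, assuming a simple in-process barrier; the worker IDs and gang size are illustrative.

import threading

class GangBarrier:
    """Either every worker in the gang starts, or none of them do."""

    def __init__(self, gang_size):
        self.gang_size = gang_size
        self.ready = set()
        self.lock = threading.Lock()
        self.all_ready = threading.Event()

    def report_ready(self, worker_id):
        # A worker calls this once it is scheduled and its network links are up.
        with self.lock:
            self.ready.add(worker_id)
            if len(self.ready) == self.gang_size:
                self.all_ready.set()  # the last worker releases the whole gang

    def wait_for_gang(self, timeout=None):
        # Block until ALL workers are ready; do no training work before this returns.
        return self.all_ready.wait(timeout)

# Usage: each of 4 workers calls report_ready(i) and then wait_for_gang() before training.
barrier = GangBarrier(gang_size=4)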
Insight
Gang scheduling is like coordinating a synchronized swimming team: you don't start the routine until all swimmers are ready and in position. Starting with missing teammates would waste everyone's effort and ruin the performance.
2.3. Discussion: Production Considerations
Scheduling Pattern Summary:
- Fair-Share: divides resources equally among users or groups, preventing monopolization
- Priority: runs important jobs first, preempting lower-priority work when needed
- Gang: starts all workers of a distributed job together, or not at all
- Elastic: starts with whatever workers are available and scales the worker count dynamically
Security Considerations:
- Priority Abuse: Malicious users creating only high-priority jobs
- Resource Limits: Administrators enforce priority quotas
- Fair Access: Prevent any user from starving others
Elastic Scheduling Alternative:
- Advantages: jobs start sooner and overall resource utilization improves
- Challenges: the effective batch size changes as workers join or leave, so hyperparameters such as the learning rate must be adjusted on the fly
2.4. Exercises
- Q: Can we only apply fair-share scheduling at the user level?
  A: No. Fair-share scheduling can be applied at each level of abstraction, such as processes, users, and groups.
- Q: Is gang scheduling suitable for all distributed model training jobs?
  A: No. Some machine learning frameworks support elastic scheduling, which lets a distributed training job start with however many workers are currently available instead of waiting for all requested workers to be ready for communication. In that case, gang scheduling is not suitable.
3. Metadata Pattern: Handling Failures Appropriately
3.1. The Problem: Complex Failure Scenarios
Simple Workflow Failure (Easy to Handle): step 1 succeeds, step 2 fails, and step 3 is never reached; recovery is simply retrying from the failed step.
Complex Workflow Failure (Challenging): an upstream data ingestion step completes, and one training step reaches its expected 85% accuracy, but the other two training steps (expected to reach 80% and 75%) fail. Only one model is available, so the downstream step that needs multiple models cannot proceed.
Challenge: Two training steps failed. Should we:
- Restart both from scratch? (Slow)
- Skip failed steps? (Lower quality)
- Something smarter?
3.2. The Solution: Intelligent Failure Recovery
Root Cause Analysis Through Metadata
Failure Types and Recovery Strategies:
- Transient failures (a simple retry usually succeeds): network hiccups, brief resource contention, worker preemption
- Permanent failures (retrying alone keeps failing; fix the root cause first): out of disk space, data source deleted, configuration errors
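A minimal sketch of a retry policy built on this transient/permanent distinction; the failure categories and retry limit are illustrative assumptions.

# Illustrative mapping from failure category to recovery action.
TRANSIENT = {"network_hiccup", "resource_contention", "worker_preempted"}
PERMANENT = {"out_of_disk", "data_source_deleted", "config_error"}

def recovery_action(failure_type, attempts, max_retries=3):
    if failure_type in TRANSIENT and attempts < max_retries:
        return "retry"                      # a simple retry usually succeeds
    if failure_type in PERMANENT:
        return "fix_root_cause_then_retry"  # e.g. rerun data ingestion, fix config
    return "escalate"                       # unknown failure or retries exhausted

print(recovery_action("network_hiccup", attempts=1))       # retry
print(recovery_action("data_source_deleted", attempts=0))  # fix_root_cause_then_retry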
Metadata Collection Example:
Metadata Collected Every 5 Minutes:
• Memory usage: 20MB → 25MB → 23MB → 27MB → 30MB → 28MB → 200MB → CRASH
• Data availability: YES → YES → YES → YES → YES → YES → YES → NO → NO
• CPU usage: 45% → 50% → 48% → 52% → 55% → 53% → 95% → CRASH
Diagnosis: Out-of-memory error. Solution: Increase memory allocation.
Advanced Failure Handling Strategies
Strategy 1: Progress-Based Recovery
- Training 1: 95% complete, 92% accuracy → ✅ KEEP
- Training 2: 90% complete, 89% accuracy → ❌ RETRY/RESUME
- Training 3: 20% complete, 27% accuracy → ❌ SKIP/CANCEL
Model Accuracy Trends:
Training 1: 30% → 45% → 60% → 75% → 85% → 90% → 92% (converging)
Training 2: 25% → 40% → 55% → 70% → 82% → 87% → 89% (converging)
Training 3: 35% → 40% → 42% → 38% → 35% → 30% → 27% (declining!)
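A minimal sketch of a progress-based recovery decision using the accuracy trends above; the convergence heuristic is an illustrative assumption, not a fixed rule.

def recovery_decision(failed, accuracy_history):
    """Pick a recovery action from a run's recorded accuracy metadata."""
    recent = accuracy_history[-4:]
    trend = (recent[-1] - recent[0]) / (len(recent) - 1)  # avg change per epoch
    if trend <= 0:
        return "skip/cancel"   # accuracy flat or declining: not converging
    if not failed:
        return "keep"          # healthy and still improving: let it finish
    return "retry/resume"      # converging but interrupted: resume it

# The three training runs shown above.
print(recovery_decision(False, [30, 45, 60, 75, 85, 90, 92]))  # keep
print(recovery_decision(True,  [25, 40, 55, 70, 82, 87, 89]))  # retry/resume
print(recovery_decision(True,  [35, 40, 42, 38, 35, 30, 27]))  # skip/cancel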
Strategy 2: Resource-Based Recovery
Training 3 Performance:
Progress: 1% every 30 minutes (4% every 2 hours) → roughly 50 hours to finish
Allocated Resources: 2 CPUs, 4 GB RAM
Option A: Continue Training 3
• Time: 50 hours
• Outcome: Likely poor model
• Cost: Block other experiments
Option B: Reallocate Resources
• Action: Cancel Training 3
• Benefit: Speed up Training 1 & 2
• Result: 2 good models in 25 hours
Decision: Cancel Training 3, reallocate resources
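A minimal sketch of this resource-based decision: extrapolate the remaining time from observed progress and reallocate when a run is both slow and not improving. The budget threshold is an illustrative assumption.

def estimated_hours_remaining(progress_fraction, hours_elapsed):
    """Rough linear extrapolation of time to completion from observed progress."""
    rate = progress_fraction / hours_elapsed           # fraction completed per hour
    return (1.0 - progress_fraction) / rate

def should_reallocate(progress_fraction, hours_elapsed, accuracy_trend, budget_hours):
    """Cancel a run and reallocate its resources if it is both too slow and not improving."""
    too_slow = estimated_hours_remaining(progress_fraction, hours_elapsed) > budget_hours
    not_improving = accuracy_trend <= 0
    return too_slow and not_improving

# Training 3: about 4% done after 2 hours, accuracy declining, 25-hour budget.
print(should_reallocate(0.04, 2.0, accuracy_trend=-3.0, budget_hours=25))  # True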
Insight
Metadata-driven failure recovery is like having a detailed medical record during emergency treatment. Instead of guessing what's wrong, doctors can see exactly what happened, when it happened, and what treatments will be most effective.
Implementation Example
import time

import psutil  # third-party library used here for resource metrics


class MetadataCollector:
    def __init__(self, step_name):
        self.step_name = step_name
        self.metadata = {
            'start_time': time.time(),
            'resource_usage': [],
            'performance_metrics': [],
            'dependencies': [],
            'checkpoints': []
        }

    def record_resource_usage(self):
        # Snapshot system-level resource metrics for later failure analysis.
        self.metadata['resource_usage'].append({
            'timestamp': time.time(),
            'memory_mb': psutil.virtual_memory().used / 1024 / 1024,
            'cpu_percent': psutil.cpu_percent(),
            'disk_io': psutil.disk_io_counters()
        })

    def record_training_progress(self, epoch, accuracy, loss):
        # Record per-epoch training metrics alongside a timestamp.
        self.metadata['performance_metrics'].append({
            'epoch': epoch,
            'accuracy': accuracy,
            'loss': loss,
            'timestamp': time.time()
        })

    def analyze_failure(self, error):
        # Analyze patterns in metadata to determine failure type
        recent_memory = [r['memory_mb'] for r in self.metadata['resource_usage'][-5:]]
        memory_spike = len(recent_memory) >= 2 and max(recent_memory) > 2 * min(recent_memory)
        if memory_spike:
            return {
                'failure_type': 'out_of_memory',
                'recommendation': 'increase_memory_allocation',
                'retry': True
            }
        # ... other analysis logic
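A brief usage sketch of the collector above; the step name and the simulated error are illustrative.

collector = MetadataCollector('train_fraud_model')
collector.record_resource_usage()  # call this periodically, e.g. every 5 minutes
collector.record_training_progress(epoch=1, accuracy=0.45, loss=1.2)

# On failure, ask the collector what the metadata suggests.
diagnosis = collector.analyze_failure(RuntimeError('worker crashed'))
print(diagnosis)  # an out_of_memory diagnosis if a memory spike was recorded, otherwise None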
3.3. Discussion: Advanced Metadata Applications
Network Performance Monitoring
- Worker 1: latency 50 ms, bandwidth 1 Gbps
- Worker 2: latency 45 ms, bandwidth 1 Gbps
- Worker 3: latency 500 ms (!), bandwidth 100 Mbps
Bottleneck: Worker 3 is slow.
Action: Replace Worker 3 via elastic scaling.
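A minimal sketch of spotting such a bottleneck from per-worker network metadata; the worker names and the 5× median-latency threshold are illustrative assumptions.

import statistics

workers = {
    'worker-1': {'latency_ms': 50,  'bandwidth_gbps': 1.0},
    'worker-2': {'latency_ms': 45,  'bandwidth_gbps': 1.0},
    'worker-3': {'latency_ms': 500, 'bandwidth_gbps': 0.1},
}

median_latency = statistics.median(w['latency_ms'] for w in workers.values())
slow = [name for name, w in workers.items() if w['latency_ms'] > 5 * median_latency]
print(slow)  # ['worker-3'] -> candidate for replacement via elastic scaling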
Model Lineage and Provenance
[Figure: lineage graph linking the training data, preprocessing steps, and the resulting model (92% accuracy, trained 2024-01-15).]
Metadata Enables:
- Which data contributed to this model?
- What preprocessing was applied?
- Can we reproduce this exact result?
- What happens if we update the dataset?
Governance Benefits:
- Reproducibility: Exact recreation of training conditions
- Compliance: Audit trail for regulated industries
- Debugging: Trace issues back to root causes
- Optimization: Identify which factors improve performance
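A minimal sketch of what a lineage record answering these questions could look like; the field names and values are illustrative, not a standard schema.

lineage_record = {
    'model_id': 'fraud-detector-v3',            # illustrative identifier
    'trained_at': '2024-01-15',
    'accuracy': 0.92,
    'dataset': {'name': 'transactions', 'version': '2024-01-10'},
    'preprocessing': ['deduplicate', 'normalize_amounts', 'one_hot_encode_merchant'],
    'training_config': {'epochs': 20, 'learning_rate': 1e-3},
    'code_commit': 'abc1234',                    # ties the result to exact code
}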
3.4. Exercises
- Q: If a training step failed because the training data source was lost, what should we do?
  A: Rerun data ingestion before retrying the model training step. This failure is permanent, so simply retrying would lead to repeated failures.
- Q: What type of metadata can be collected if we look at individual workers or parameter servers?
  A: Network performance metrics recorded while the model is being trained (e.g., bandwidth, throughput, and latency). This information is very useful for detecting workers whose poor network performance blocks the entire training process.
4. Summary and Exercises
Key Concepts Mastered
Scheduling patterns:
- Fair-Share: equal resource distribution
- Priority: importance-based execution
- Gang: coordinated distributed starts
- Elastic: dynamic worker scaling
Metadata patterns:
- Root cause analysis
- Progress-based recovery
- Resource-based decisions
- Lineage tracking
Core Principles
- Resource Management: Intelligent scheduling prevents conflicts and maximizes utilization
- Failure Recovery: Metadata-driven decisions minimize downtime and optimize outcomes
- System Observability: Comprehensive monitoring enables proactive optimization
- Operational Efficiency: Patterns reduce manual intervention and improve reliability
Real-World Applications
- Multi-tenant ML platforms with fair resource sharing
- Production training pipelines with intelligent failure recovery
- High-availability inference systems with priority scheduling
- Regulatory compliance with comprehensive metadata tracking
Insight
Operations patterns transform ML systems from fragile, hard-to-debug monoliths into resilient, observable, and efficiently managed platforms. Master these patterns to build systems that scale gracefully and fail intelligently.
Practice Exercises
Scheduling Scenarios:
1. Design fair-share allocation for 3 teams with 2, 4, and 6 members respectively
2. Plan priority levels for: user experiments, production retraining, system maintenance
3. Determine when to use gang vs. elastic scheduling for different workload types
Failure Recovery:
4. Create a metadata collection strategy for a 24/7 recommendation system
5. Design recovery procedures for: data corruption, worker failures, resource exhaustion
6. Plan rollback strategies for failed model deployments
System Design:
7. Architect a scheduling system for mixed workloads: batch training, real-time inference, experimentation
8. Design a metadata schema for regulatory compliance in financial ML systems
9. Create a monitoring dashboard for a multi-team ML platform