Chapter 5: Workflow Patterns
Orchestrating complex machine learning systems with reusable architectural patterns
Table of Contents
- What Is a Workflow?
- Fan-in and Fan-out Patterns: Composing Complex ML Workflows
- Synchronous and Asynchronous Patterns: Accelerating Workflows
- Step Memoization Pattern: Skipping Redundant Workloads
- Summary and Exercises
Think of building ML workflows like conducting a symphony orchestra - you need to coordinate multiple musicians (components), handle different instrument sections (patterns), and ensure everyone plays in harmony while some may have solo performances at different times.
1. What Is a Workflow?
In plain English: A workflow is like a recipe with multiple steps that must happen in a specific order, where each step takes input from previous steps and produces output for the next ones.
In technical terms: A workflow is a directed graph of computational steps with explicit dependencies, where each node represents a discrete operation (data ingestion, training, serving) and edges represent data flow between operations.
Why it matters: Proper workflow design ensures efficient resource utilization, enables parallel execution where possible, and provides clear visibility into system behavior and debugging.
A workflow consists of arbitrary combinations of components commonly seen in real-world ML applications:
- Data ingestion - Collecting and preprocessing raw data
- Distributed model training - Building ML models at scale
- Model serving - Deploying models for inference
1.1. Sequential Workflows vs Directed Acyclic Graphs (DAGs)
Simple Sequential Workflow:
Data Ingestion → Model Training → Model Serving
More Complex Workflow with Parallel Paths:
Data Ingestion → Model Training A → Model Serving A
Data Ingestion → Model Training B → Model Serving B
Insight
Think of workflows like restaurant operations: simple workflows are like a single chef making one dish at a time, while complex workflows are like multiple chefs working different stations simultaneously to serve multiple dishes.
1.2. Understanding Workflow Complexity
Sequential Workflow: Steps execute one after another in strict order.
Directed Acyclic Graph (DAG): Steps can have dependencies but never form closed loops.
Valid DAG (no cycles): for example, Data Ingestion → Model Training → Model Serving, with an optional extra edge Data Ingestion → Model Serving; every path moves forward.
Invalid graph (has a cycle): for example, Step A → Step B → Step C → Step A. A cycle means no step in the loop can ever start first, so no valid execution order exists and the graph cannot be scheduled as a workflow.
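To make the distinction concrete, here is a minimal sketch (illustrative, not from the chapter) of how a workflow engine might validate a step graph before scheduling it, using Kahn's topological sort; the step names and the `deps` structure are assumptions:

```python
from collections import deque

def topological_order(deps):
    """Return a valid execution order for a step graph, or None if it has a cycle.

    deps maps each step name to the set of steps it depends on.
    """
    # Track the unmet dependencies of every step.
    pending = {step: set(d) for step, d in deps.items()}
    ready = deque(step for step, d in pending.items() if not d)
    order = []
    while ready:
        step = ready.popleft()
        order.append(step)
        for other, unmet in pending.items():
            if step in unmet:
                unmet.remove(step)
                if not unmet:            # all dependencies satisfied
                    ready.append(other)
    # If any step still has unmet dependencies, the graph contains a cycle.
    return order if len(order) == len(pending) else None

valid = {"ingestion": set(), "training": {"ingestion"}, "serving": {"training"}}
cyclic = {"a": {"c"}, "b": {"a"}, "c": {"b"}}
print(topological_order(valid))   # ['ingestion', 'training', 'serving']
print(topological_order(cyclic))  # None (cycle: a -> b -> c -> a)
```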
2. Fan-in and Fan-out Patterns: Composing Complex ML Workflows
2.1. The Problem: Training Multiple Models
Imagine you want to build a video tagging system that uses multiple models to capture different aspects of videos. You need to:
- Train different model architectures
- Select the top-performing models
- Use multiple models for better coverage
- Aggregate results for comprehensive tagging
Example Use Case: YouTube-8M video entity tagging with an ensemble approach.
2.2. The Solution: Systematic Pattern Application
Baseline Workflow:
Data Ingestion → Model Training → Model Serving
Enhanced Multi-Model Workflow (Fan-out Pattern):
Data Ingestion → Model Training 1
Data Ingestion → Model Training 2
Data Ingestion → Model Training 3
Complete Workflow with Model Selection:
The three training steps produce models at 90%, 92%, and 75% accuracy. A model selection step fans these results in, keeps the two top performers (92% and 90%), and deploys them to Model Serving A and Model Serving B; the 75% model is dropped.
Why multiple models work better: different architectures capture different aspects of the videos, so serving the top performers together tags more entity types than any single model could.
Fan-out pattern: a single upstream step (data ingestion) feeds multiple downstream steps (the parallel training steps).
Fan-in pattern: multiple upstream steps (the training results) merge into a single downstream step (model selection).
Insight
Fan-out and fan-in patterns are like a river delta: water flows from one source (fan-out) into multiple channels, then these channels may converge again downstream (fan-in). This natural flow pattern applies beautifully to ML workflows.
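As a rough sketch of both patterns working together (the training function, model names, and accuracies below are placeholders, not the chapter's code), the fan-out submits three independent training jobs to a thread pool, and the fan-in gathers their results into a single selection step:

```python
from concurrent.futures import ThreadPoolExecutor

def train_model(name):
    """Placeholder training job; returns the model name and a dummy accuracy."""
    accuracies = {"model_1": 0.90, "model_2": 0.92, "model_3": 0.75}
    return name, accuracies[name]

def select_top_models(results, k=2):
    """Fan-in step: keep the k best-performing models."""
    return sorted(results, key=lambda r: r[1], reverse=True)[:k]

with ThreadPoolExecutor() as pool:
    # Fan-out: one ingestion output feeds three parallel training steps.
    futures = [pool.submit(train_model, n)
               for n in ("model_1", "model_2", "model_3")]
    # Fan-in: collect all results and merge them in one selection step.
    results = [f.result() for f in futures]

for name, acc in select_top_models(results):
    print(f"deploying {name} (accuracy {acc:.0%})")
```

Swapping `ThreadPoolExecutor` for a process pool or a cluster scheduler changes where the work runs, not the shape of the pattern.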
2.3. Discussion: When to Use These Patterns
Use fan-in/fan-out patterns when:
- Multiple steps are independent - Steps can run without waiting for each other
- Sequential execution is too slow - Parallel execution provides significant speedup
Avoid these patterns when:
- Steps have strict dependencies (e.g., ensemble models that need all sub-models first)
- Steps need specific execution order
- Resource constraints limit parallel execution
Ensemble model challenge: the ensemble training step must wait for ALL sub-model training steps to complete before it can start, so the slowest sub-model gates the entire workflow.
Insight
Think of dependencies like a cooking recipe: you can chop vegetables in parallel (fan-out), but you can't make the sauce until all ingredients are ready (dependency constraint).
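A minimal sketch of this constraint (function names and timings are illustrative): with `concurrent.futures`, the ensemble step is gated on `wait(..., return_when=ALL_COMPLETED)`, so the slowest sub-model determines when ensemble training can begin:

```python
from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED
import time

def train_submodel(name, seconds):
    time.sleep(seconds)  # stand-in for real training time
    return f"{name}-weights"

with ThreadPoolExecutor() as pool:
    futures = [
        pool.submit(train_submodel, "cnn", 1),
        pool.submit(train_submodel, "rnn", 2),
        pool.submit(train_submodel, "transformer", 5),  # slowest sub-model
    ]
    # The ensemble step depends on ALL sub-models: total wait ~5s, not ~1s.
    done, _ = wait(futures, return_when=ALL_COMPLETED)
    submodels = [f.result() for f in done]

print(f"training ensemble from {len(submodels)} sub-models")
```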
2.4. Exercises
- Q: If steps are not independent of each other, can we use fan-in or fan-out patterns?
  A: No, because we have no guarantee of the order in which concurrent copies of those steps will run.
- Q: What's the main problem when trying to build ensemble models with the fan-in pattern?
  A: Training the ensemble depends on every sub-model training step completing first. The ensemble training step must wait for all of them before it can start, which adds waiting time and delays the entire workflow.
3. Synchronous and Asynchronous Patterns: Accelerating Workflows
3.1. The Problem: Long-Running Step Bottlenecks
Imagine three model training steps with vastly different completion times: for example, a simple model that finishes in one hour, a medium model that takes a few hours, and a complex model that takes eight hours.
The bottleneck: all subsequent steps (model selection, serving) must wait for the slowest step to complete.
3.2. The Solution: Concurrent Execution Strategies
Naive approach (remove the slow step): Data Ingestion → fast training steps only → Model Selection.
Problem: we lose the potentially best model from the complex training step.
Better approach (asynchronous execution): let each downstream step start as soon as its own inputs are ready, while the slow training step keeps running in the background.
Synchronous vs. asynchronous execution: synchronously, the next stage starts only after every training step has finished; asynchronously, fast models flow downstream immediately and better models replace them as they arrive.
Insight
Asynchronous execution is like a restaurant: as soon as the appetizer is ready, it goes to the table while the main course is still cooking. Customers don't wait for everything to be ready at once.
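A minimal sketch of the asynchronous approach, with illustrative model names and training times (one second here stands in for one hour): `asyncio.as_completed` yields each training task the moment it finishes, so fast models can be deployed immediately while slower ones keep running.

```python
import asyncio

async def train(name, hours):
    await asyncio.sleep(hours)  # one simulated second per training hour
    return name

async def main():
    tasks = [asyncio.create_task(train(n, h))
             for n, h in (("simple", 1), ("medium", 3), ("complex", 8))]
    # as_completed yields results in finishing order, not submission order.
    for finished in asyncio.as_completed(tasks):
        model = await finished
        print(f"{model} model ready -> deploying now, without waiting for the rest")

asyncio.run(main())
```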
3.3. Discussion: Speed vs Quality Trade-offs
Early models vs. final models:

| Stage | Models deployed | Entities tagged | Coverage | Trade-off |
|---|---|---|---|---|
| Early | Simple Model A | Food, Car | 2 entities | Fast delivery |
| Middle | Model A + B | Food, Car, Music, Sports | 4 entities | Good quality |
| Final | A + B + C | Food, Car, Music, Sports, Tech, Art, Science | 7 entities | Best quality |
Decision Framework:
- Speed Priority: Deploy models as soon as available
- Quality Priority: Wait for better models to complete
- Balanced: Use progressive deployment with user feedback
Insight
Consider the Uber model: when you request a ride, you see the closest driver immediately (fast result), but the app continues searching for better options and may upgrade your match (quality improvement over time).
3.4. Exercises
- Q: What causes each model training step to start?
  A: The completion of the step it depends on. Because completion times vary, each following step starts only once its upstream step has finished.
- Q: Are steps blocking each other if they are running asynchronously?
  A: No, asynchronous steps won't block each other; each downstream step starts as soon as its own inputs are ready.
- Q: What do we need to consider when deciding whether to use any available trained model as early as possible?
  A: Whether users prioritize seeing results faster or seeing better results. If the goal is early results, users may not get the quality they expect; if delays are acceptable, waiting for better models is preferable.
4. Step Memoization Pattern: Skipping Redundant Workloads
4.1. The Problem: Unnecessary Re-execution
Scenario 1 (regular data updates): the entire workflow (Data Ingestion → Model Training → Model Serving) re-runs on every scheduled update, repeating the expensive ingestion step even when the underlying data has barely changed.
Scenario 2 (model experimentation): every experiment with a new architecture (for example, switching from a CNN to an RNN) re-runs the same slow data ingestion step, even though the data itself has not changed between experiments.
4.2. The Solution: Intelligent Caching Strategies
Time-based caching: skip a step when its cached result is still fresh, for example when the data was last ingested within the past 24 hours.
Content-based caching: skip a step when the content it depends on has not changed significantly, for example when the number of new records is below a threshold.
Step Memoization Implementation:
```python
import time

CACHE_TYPE = "time_based"            # or "content_based"
FRESHNESS_THRESHOLD = 24 * 60 * 60   # seconds: re-ingest after 24 hours
RECORD_CHANGE_THRESHOLD = 1000       # records: re-ingest after enough new data

def should_skip_data_ingestion(cache):
    """Decide from cached metadata whether the ingestion step can be skipped."""
    if cache is None:
        return False
    if CACHE_TYPE == "time_based":
        return time.time() - cache["created_at"] < FRESHNESS_THRESHOLD
    if CACHE_TYPE == "content_based":
        return cache["new_record_count"] < RECORD_CHANGE_THRESHOLD
    return False

def execute_workflow():
    cache = get_cache()  # load cached step metadata, if any
    if should_skip_data_ingestion(cache):
        data_location = cache["data_location"]
    else:
        data_location = run_data_ingestion()
        cache_data_location(data_location)
    # Continue with the remaining steps as usual.
    model = run_model_training(data_location)
    deploy_model(model)
```
Insight
Step memoization is like a smart GPS: it remembers which routes you've taken recently and their conditions. If the route hasn't changed significantly, it skips the expensive route calculation and uses the cached path.
4.3. Discussion: Cache Management Considerations
Cache lifecycle management: at 1,000 workflow runs per day with 100 cached steps each, the system creates 100,000 cache entries per day. Storage usage grows linearly, so automated garbage collection is required.
Garbage collection strategy: evict caches by age (a time-to-live) or by usage (least recently used), keeping frequently accessed entries available.
Cache content varies by step: a data ingestion step might cache the storage location of the ingested dataset, while a model training step might cache the trained model artifact and its evaluation metrics.
Insight
Cache management is like maintaining a library: you need to periodically remove old, unused books to make space for new ones, while keeping frequently accessed materials easily available.
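A minimal sketch of a time-to-live collector under the assumptions above (cache entries carrying a `created_at` timestamp; all names are illustrative), which would run on a schedule to cap storage growth:

```python
import time

TTL_SECONDS = 7 * 24 * 3600  # drop caches older than one week

def collect_garbage(cache_store):
    """Delete cache entries whose age exceeds the TTL.

    cache_store maps a cache key to metadata including 'created_at'.
    """
    now = time.time()
    expired = [key for key, meta in cache_store.items()
               if now - meta["created_at"] > TTL_SECONDS]
    for key in expired:
        del cache_store[key]  # in production: also delete the backing artifact
    return len(expired)

store = {
    "ingestion/last-month": {"created_at": time.time() - 30 * 24 * 3600},
    "ingestion/today": {"created_at": time.time()},
}
print(collect_garbage(store), "caches deleted")  # 1 caches deleted
```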
4.4. Exercises
- Q: What type of steps can benefit most from step memoization?
  A: Steps that are time-consuming or require a huge amount of computational resources.
- Q: How do we tell whether a step's execution can be skipped?
  A: By using information stored in the cache, such as when the cache was created or metadata collected from the step, to decide whether the step's execution can be skipped.
- Q: What do we need to manage and maintain once we've applied the pattern at scale?
  A: A garbage collection mechanism that recycles and deletes the created caches automatically.
5. Summary and Exercises
Key Concepts Mastered
- Workflows and DAGs: linear execution flow, step dependencies matter, order is critical, no cycles allowed
- Fan-in and fan-out: parallel task execution, result merging, independent workloads, better resource utilization
- Synchronous and asynchronous execution: concurrent execution, non-blocking operations, progressive results, speed vs. quality trade-offs
- Step memoization: cache results intelligently, skip redundant work, time- and content-based caching, garbage collection needed
Core Principles
- Workflow Design: Connect ML components systematically using proven patterns
- Parallel Execution: Use fan-in/fan-out for independent, time-consuming tasks
- Async Optimization: Don't let slow steps block fast ones
- Smart Caching: Avoid redundant computation through intelligent memoization
Real-World Applications
- Video tagging systems with ensemble models
- A/B testing frameworks with parallel experiments
- Model pipeline optimization with cached intermediate results
- Real-time inference with progressive model deployment
Insight
Master these four workflow patterns and you'll have the building blocks to design efficient, scalable ML systems that can handle the complexity demands of production environments while minimizing computational waste.
Practice Exercises
Pattern Recognition:
1. Identify which pattern to use when training 5 different models simultaneously
2. Design a caching strategy for a daily model retraining pipeline
3. Plan async deployment for models with 1-hour vs. 8-hour training times

System Design:
4. Architect a workflow for A/B testing 3 recommendation algorithms
5. Design garbage collection for a high-frequency experimentation platform
6. Create a progressive deployment strategy for improving model quality over time