Chapter 5: Workflow Patterns

Orchestrating complex machine learning systems with reusable architectural patterns


Table of Contents

  1. What Is a Workflow?
  2. Fan-in and Fan-out Patterns: Composing Complex ML Workflows
  3. Synchronous and Asynchronous Patterns: Accelerating Workflows
  4. Step Memoization Pattern: Skipping Redundant Workloads
  5. Summary and Exercises

Think of building ML workflows like conducting a symphony orchestra - you need to coordinate multiple musicians (components), handle different instrument sections (patterns), and ensure everyone plays in harmony while some may have solo performances at different times.


1. What Is a Workflow?

In plain English: A workflow is like a recipe with multiple steps that must happen in a specific order, where each step takes input from previous steps and produces output for the next ones.

In technical terms: A workflow is a directed graph of computational steps with explicit dependencies, where each node represents a discrete operation (data ingestion, training, serving) and edges represent data flow between operations.

Why it matters: Proper workflow design ensures efficient resource utilization, enables parallel execution where possible, and provides clear visibility into system behavior and debugging.

A workflow consists of arbitrary combinations of components commonly seen in real-world ML applications:

  • Data ingestion - Collecting and preprocessing raw data
  • Distributed model training - Building ML models at scale
  • Model serving - Deploying models for inference
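
As a minimal sketch (the three functions below are hypothetical placeholders, not the book's code), a sequential workflow chains these components so each step consumes the previous step's output:

# A minimal sequential-workflow sketch; ingest_data(), train_model(),
# and serve_model() are hypothetical stand-ins for real components.
def ingest_data() -> list:
    return [{"video_id": 1, "features": [0.1, 0.2]}]  # toy dataset

def train_model(dataset: list) -> dict:
    return {"weights": len(dataset)}  # toy "model"

def serve_model(model: dict) -> None:
    print(f"serving model: {model}")

# Each step consumes the previous step's output, in strict order.
dataset = ingest_data()
model = train_model(dataset)
serve_model(model)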

1.1. Sequential Workflows vs Directed Acyclic Graphs (DAGs)

Simple Sequential Workflow:

Data Ingestion → Model Training → Model Serving

More Complex Workflow with Parallel Paths:

                 ┌→ Model Training A → Model Serving A
Data Ingestion ──┤
                 └→ Model Training B → Model Serving B

Insight

Think of workflows like restaurant operations: simple workflows are like a single chef making one dish at a time, while complex workflows are like multiple chefs working different stations simultaneously to serve multiple dishes.

1.2. Understanding Workflow Complexity

Sequential Workflow: Steps execute one after another in strict order.

Step A (done) → Step B (starts after A completes) → Step C (starts after B completes)

Directed Acyclic Graph (DAG): Steps can have dependencies but never form closed loops.

Valid DAG (No Cycles):

Step A → Step B → Step D
Step A → Step C → Step D
(every edge points forward; no path returns to an earlier step)

Invalid DAG (Has Cycles):

Step A → Step B → Step C → Step A  (the edge back to Step A creates a cycle!)
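
As a rough illustration (not from the chapter), a workflow can be represented as an adjacency map and checked for cycles with a depth-first search:

# A hedged sketch: represent a workflow as an adjacency map and detect
# cycles with depth-first search (a cyclic graph is not a valid DAG).
def has_cycle(graph: dict) -> bool:
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node: str) -> bool:
        color[node] = GRAY  # node is on the current DFS path
        for nxt in graph.get(node, []):
            if color[nxt] == GRAY:  # back-edge: a cycle exists
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK  # fully explored, no cycle through here
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)

valid = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
invalid = {"A": ["B"], "B": ["C"], "C": ["A"]}
assert not has_cycle(valid)
assert has_cycle(invalid)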

2. Fan-in and Fan-out Patterns: Composing Complex ML Workflows

2.1. The Problem: Training Multiple Models

Imagine you want to build a video tagging system that uses multiple models to capture different aspects of videos. You need to:

  1. Train different model architectures
  2. Select the top-performing models
  3. Use multiple models for better coverage
  4. Aggregate results for comprehensive tagging

Example Use Case: YouTube-8M video entity tagging with an ensemble approach.

2.2. The Solution: Systematic Pattern Application

Baseline Workflow:

Data Ingestion → Model Training → Model Serving

Enhanced Multi-Model Workflow (Fan-out Pattern):

                 ┌→ Model Training 1
Data Ingestion ──┼→ Model Training 2
                 └→ Model Training 3

Complete Workflow with Model Selection:

Data Ingestion
  → Fan-out (training phase): Training 1 (90% acc) | Training 2 (92% acc) | Training 3 (75% acc)
  → Model Selection (keep top 2: Training 1 and Training 2)
  → Serving phase: Model Serving A | Model Serving B
  → Fan-in: Result Aggregation

Why Multiple Models Work Better:

            Model A Knowledge                Combined Knowledge (A + B)
Entities    Food, Car, Animals, Nature       Food, Car, Animals, Nature, Music,
            (4 entities)                     Sports, Technology (7 entities)
Coverage    Limited domain coverage          Broader domain coverage
Accuracy    Single model perspective         Ensemble benefits from diversity

Fan-out Pattern Structure (one input feeds multiple outputs):

                 ┌→ Training 1
Data Ingestion ──┼→ Training 2
                 └→ Training 3

Fan-in Pattern Structure (multiple inputs combine into one output):

Serving A ──┐
            ├→ Result Aggregation
Serving B ──┘
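
As a rough sketch of how these two patterns compose in code (not from the chapter; train_model and aggregate are hypothetical placeholders), Python's concurrent.futures can fan a single input out to parallel steps and then fan the results back in:

# A minimal fan-out/fan-in sketch using only the standard library.
# train_model() and aggregate() stand in for real pipeline steps.
from concurrent.futures import ThreadPoolExecutor

def train_model(config: str) -> dict:
    # Placeholder: a real step would launch a (possibly distributed) training job.
    return {"config": config, "accuracy": 0.9}

def aggregate(results: list) -> dict:
    # Fan-in: combine the outputs of all parallel branches into one result.
    return {"models": results}

configs = ["training_1", "training_2", "training_3"]

with ThreadPoolExecutor() as executor:
    # Fan-out: one input (the config list) feeds three independent steps.
    results = list(executor.map(train_model, configs))

combined = aggregate(results)  # Fan-in: merge into a single output
print(combined)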

Insight

Fan-out and fan-in patterns are like a river delta: water flows from one source (fan-out) into multiple channels, then these channels may converge again downstream (fan-in). This natural flow pattern applies beautifully to ML workflows.

2.3. Discussion: When to Use These Patterns

Use fan-in/fan-out patterns when:

  1. Multiple steps are independent - Steps can run without waiting for each other
  2. Sequential execution is too slow - Parallel execution provides significant speedup

Avoid these patterns when:

  • Steps have strict dependencies (e.g., ensemble models that need all sub-models first)
  • Steps need specific execution order
  • Resource constraints limit parallel execution

Ensemble Model Challenge:

Training A ──┐
Training B ──┼→ (must wait for ALL to complete) → Ensemble Training
Training C ──┘
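
The constraint shows up directly in a small sketch (hypothetical helpers, not the book's code): the ensemble step blocks until every sub-model future has resolved.

# A hedged sketch of the dependency constraint: wait() blocks until ALL
# sub-model training futures complete before the ensemble step can start.
from concurrent.futures import ThreadPoolExecutor, wait

def train_sub_model(name: str) -> str:
    return f"{name}-weights"  # placeholder for a real training job

with ThreadPoolExecutor() as executor:
    futures = [executor.submit(train_sub_model, n) for n in ("A", "B", "C")]
    done, _ = wait(futures)  # no ensemble work can begin before this returns

sub_models = [f.result() for f in done]
# Only now can a (hypothetical) ensemble training step begin:
# ensemble = train_ensemble(sub_models)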

Insight

Think of dependencies like a cooking recipe: you can chop vegetables in parallel (fan-out), but you can't make the sauce until all ingredients are ready (dependency constraint).

2.4. Exercises

  1. Q: If steps are not independent of each other, can we use fan-in or fan-out patterns?

    A: No, because we would have no guarantee of the order in which the concurrent steps run.

  2. Q: What's the main problem when trying to build ensemble models with the fan-in pattern?

    A: Training an ensemble model depends on completing other model training steps for the sub-models. We cannot use the fan-in pattern because the ensemble model training step will need to wait for other model training to complete before it can start running, which would require extra waiting and delay the entire workflow.


3. Synchronous and Asynchronous Patterns: Accelerating Workflows

3.1. The Problem: Long-Running Step Bottlenecks

Imagine three model training steps with vastly different completion times:

Training 1: 1 week
Training 2: 1 week
Training 3: 2 weeks → downstream steps wait an extra week!

The bottleneck: All subsequent steps (model selection, serving) must wait for the slowest step to complete.

3.2. The Solution: Concurrent Execution Strategies

Naive Approach - Remove Slow Step:

Data Ingestion → Training 1 | Training 2 (fast training only; Training 3 skipped) → Model Selection

Problem: We lose what may be the best model, which could come from the slower, more complex training step.

Better Approach - Asynchronous Execution:

  1. Week 1 - Deploy First Model: Training 1 completes → deploy immediately
  2. Week 2 - Ensemble with Two Models: Training 2 completes → update with a two-model ensemble
  3. Week 3 - Full Ensemble: Training 3 completes → best-quality results
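
A minimal sketch of this asynchronous pattern (sleep seconds stand in for training weeks; train and deploy are hypothetical placeholders) uses as_completed to act on each model the moment it finishes:

# An asynchronous-pattern sketch: deploy each model as soon as its own
# training finishes instead of blocking on the slowest step.
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def train(name: str, duration: int) -> str:
    time.sleep(duration)  # seconds standing in for weeks of training
    return name

def deploy(model: str) -> None:
    print(f"deployed {model}")  # placeholder for a real rollout

jobs = {"training_1": 1, "training_2": 1, "training_3": 2}

with ThreadPoolExecutor() as executor:
    futures = [executor.submit(train, name, t) for name, t in jobs.items()]
    for future in as_completed(futures):  # yields futures in completion order
        deploy(future.result())  # fast models go live while slow ones still run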

Synchronous vs Asynchronous Execution:

                      Synchronous (Traditional)                Asynchronous (Optimized)
Execution             Step A → wait → Step B → wait → Step C   Step A → B, C, D start immediately
Bottlenecks           Each step blocks the next                No waiting between parallel steps
Time to First Result  Must wait for all steps                  Deploy as soon as the first step completes
User Experience       Long wait, then results                  Quick initial results, improving quality

Insight

Asynchronous execution is like a restaurant: as soon as the appetizer is ready, it goes to the table while the main course is still cooking. Customers don't wait for everything to be ready at once.

3.3. Discussion: Speed vs Quality Trade-offs

Early Models vs Final Models:

  1. Week 1 Results: Simple Model A - Food, Car (2 entities) - FAST delivery
  2. Week 2 Results: Models A + B - Food, Car, Music, Sports (4 entities) - GOOD quality
  3. Week 3 Results: Models A + B + C - Food, Car, Music, Sports, Tech, Art, Science (7 entities) - BEST quality

Decision Framework:

  • Speed Priority: Deploy models as soon as available
  • Quality Priority: Wait for better models to complete
  • Balanced: Use progressive deployment with user feedback

Insight

Consider the Uber model: when you request a ride, you see the closest driver immediately (fast result), but the app continues searching for better options and may upgrade your match (quality improvement over time).

3.4. Exercises

  1. Q: What causes each model training step to start?

    A: The completion of the step it depends on: in synchronous execution, each following step starts only when the previous step has completed, so variations in completion time propagate downstream.

  2. Q: Are steps blocking each other if they are running asynchronously?

    A: No, asynchronous steps won't block each other.

  3. Q: What do we need to consider when deciding whether to use any available trained model as early as possible?

    A: We need to consider whether users prioritize seeing results faster or seeing better results. If the goal is early results, users may not get the quality they expect. If delays are acceptable, waiting for better models is preferable.


4. Step Memoization Pattern: Skipping Redundant Workloads

4.1. The Problem: Unnecessary Re-execution

Scenario 1 - Regular Data Updates:

Week 1 (new YouTube videos added): Data Ingestion → Model Training → Model Serving
Week 2 (more videos added):        Data Ingestion → Model Training → Model Serving

Scenario 2 - Model Experimentation:

Experiment 1 (try CNN architecture): Data Ingestion (SLOW!) → Model Training (CNN) → Model Serving
Experiment 2 (try RNN architecture): Data Ingestion (SLOW!) → Model Training (RNN) → Model Serving
                                     (the same data is re-ingested!)

4.2. The Solution: Intelligent Caching Strategies

Time-Based Caching:

  1. Workflow triggered: a new experiment starts
  2. Check cache: last updated 1 week ago; freshness window is 2 weeks
  3. Decision: data is fresh (< 2 weeks old), so SKIP ingestion
  4. Start training directly from the cached data

Content-Based Caching:

  1. Workflow triggered: a new training job starts
  2. Check cache: cached dataset has 1M videos; current source has 2M videos
  3. Decision: significant change (2x larger), so RE-INGEST
  4. Full ingestion: process all the new data

Step Memoization Implementation:

def execute_workflow():
    # Check the cache before executing the expensive ingestion step.
    if should_skip_data_ingestion():
        data_location = get_cached_data_location()
    else:
        data_location = run_data_ingestion()
        cache_data_location(data_location)

    # Continue with the remaining steps.
    model = run_model_training(data_location)
    deploy_model(model)

def should_skip_data_ingestion():
    # Pseudocode: `cache` is assumed to expose the caching strategy in use
    # plus the measurements needed for each decision rule.
    cache = get_cache()
    if cache.type == "time_based":
        # Skip if the cached data is younger than the freshness window.
        return cache.time_since_update < cache.freshness_threshold
    elif cache.type == "content_based":
        # Skip if the record count has not changed significantly.
        return cache.record_count_change < cache.significance_threshold
    return False
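
For the time-based branch, a concrete check might look like the following sketch (assuming the cache stores a POSIX timestamp of the last ingestion; the names are illustrative):

# A hedged, concrete version of the time-based check above, assuming the
# cache stores a POSIX timestamp of the last successful ingestion run.
import time

FRESHNESS_WINDOW = 14 * 24 * 3600  # two weeks, in seconds

def is_cache_fresh(last_ingestion_ts: float) -> bool:
    return (time.time() - last_ingestion_ts) < FRESHNESS_WINDOW

# Data ingested one week ago is still inside the window, so skip ingestion.
one_week_ago = time.time() - 7 * 24 * 3600
assert is_cache_fresh(one_week_ago)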

Insight

Step memoization is like a smart GPS: it remembers which routes you've taken recently and their conditions. If the route hasn't changed significantly, it skips the expensive route calculation and uses the cached path.

4.3. Discussion: Cache Management Considerations

Cache Lifecycle Management:

Cache Growth Over Time

Daily Workflow Execution:
1,000 workflows × 100 cached steps = 100,000 caches/day

Day 1: 100K caches
Day 2: 200K caches
Day 3: 300K caches
Day 7: 700K caches
Day 30: 3M caches

Storage usage grows linearly! Need garbage collection!

Garbage Collection Strategy:

  1. Record timestamp - track when each cache entry was last used
  2. Periodic scan - scan all cache entries regularly
  3. Delete unused - remove entries that have gone unused for longer than a threshold
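
A toy implementation of this strategy (an in-memory dict stands in for a real cache backend; the names and threshold are illustrative) could look like:

# A minimal garbage-collection sketch for the three-step strategy above.
import time

MAX_IDLE_SECONDS = 30 * 24 * 3600  # delete entries unused for > 30 days

cache = {
    "step_a": {"last_used": time.time() - 40 * 24 * 3600, "data": "..."},
    "step_b": {"last_used": time.time(), "data": "..."},
}

def collect_garbage(cache: dict) -> None:
    now = time.time()  # periodic scan: check every entry's last-used time
    stale = [k for k, v in cache.items()
             if now - v["last_used"] > MAX_IDLE_SECONDS]
    for key in stale:
        del cache[key]  # recycle the storage held by the stale entry

collect_garbage(cache)  # "step_a" is deleted, "step_b" survives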

Cache Content Strategy for Different Steps:

Data Ingestion Cache
  • Metadata: record count, last update timestamp
  • Content: data location, schema version
  • Validation: data quality metrics

Model Training Cache
  • Metadata: model architecture, hyperparameters
  • Content: model artifacts, performance metrics
  • Validation: training dataset fingerprint

Model Serving Cache
  • Metadata: model version, deployment config
  • Content: service endpoints, resource usage
  • Validation: performance benchmarks
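
One way to hold these fields is a simple record type; the sketch below covers the data ingestion case, and the field names are assumptions mirroring the metadata/content/validation split above, not part of any specific framework:

# An illustrative cache-entry record for the data ingestion step.
from dataclasses import dataclass, field

@dataclass
class IngestionCacheEntry:
    record_count: int          # metadata
    last_update_ts: float      # metadata
    data_location: str         # content
    schema_version: str        # content
    quality_metrics: dict = field(default_factory=dict)  # validation

entry = IngestionCacheEntry(
    record_count=1_000_000,
    last_update_ts=1700000000.0,
    data_location="s3://bucket/youtube8m/",
    schema_version="v2",
)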

Insight

Cache management is like maintaining a library: you need to periodically remove old, unused books to make space for new ones, while keeping frequently accessed materials easily available.

4.4. Exercises

  1. Q: What type of steps can most benefit from step memoization?

    A: Steps that are time-consuming or require a huge amount of computational resources.

  2. Q: How do we tell whether a step's execution can be skipped?

    A: We can use information stored in the cache, such as when the cache was initially created or metadata collected from the step, to decide whether we should skip the execution of a particular step.

  3. Q: What do we need to manage and maintain once we've applied the pattern at scale?

    A: We need to set up a garbage collection mechanism to recycle and delete created caches automatically.


5. Summary and Exercises

Key Concepts Mastered

🔄 Sequential/DAG Patterns
  • Linear execution flow
  • Step dependencies matter
  • Order is critical
  • No cycles allowed

🌟 Fan-in/Fan-out
  • Parallel task execution
  • Result merging
  • Independent workloads
  • Better resource utilization

⚡ Synchronous/Asynchronous
  • Concurrent execution
  • Non-blocking operations
  • Progressive results
  • Speed vs quality trade-offs

💾 Step Memoization
  • Cache results intelligently
  • Skip redundant work
  • Time & content-based caching
  • Garbage collection needed

Core Principles

  1. Workflow Design: Connect ML components systematically using proven patterns
  2. Parallel Execution: Use fan-in/fan-out for independent, time-consuming tasks
  3. Async Optimization: Don't let slow steps block fast ones
  4. Smart Caching: Avoid redundant computation through intelligent memoization

Real-World Applications

  • Video tagging systems with ensemble models
  • A/B testing frameworks with parallel experiments
  • Model pipeline optimization with cached intermediate results
  • Real-time inference with progressive model deployment

Insight

Master these four workflow patterns and you'll have the building blocks to design efficient, scalable ML systems that can handle the complexity demands of production environments while minimizing computational waste.

Practice Exercises

Pattern Recognition:

  1. Identify which pattern to use when training 5 different models simultaneously
  2. Design a caching strategy for a daily model retraining pipeline
  3. Plan async deployment for models with 1-hour vs 8-hour training times

System Design:

  4. Architect a workflow for A/B testing 3 recommendation algorithms
  5. Design garbage collection for a high-frequency experimentation platform
  6. Create a progressive deployment strategy for improving model quality over time

