Chapter 4: Model Serving Patterns
From trained models to production magic - serving millions of predictions per second
Table of Contents
- What is Model Serving?
- Replicated Services Pattern: Handling Growing Traffic
- Sharded Services Pattern: Processing Large Requests
- Event-Driven Processing Pattern: Dynamic Resource Allocation
- Answers to Exercises
- Summary
Introduction
Restaurant Analogy: You've learned to cook amazing dishes (train models), but now comes the real challenge: running a restaurant that serves hundreds of customers every hour. Model serving is the art of turning your carefully crafted models into production systems that can handle real-world traffic, from a single user to millions of concurrent requests.
Think of the transformation:
- Training: Like perfecting a recipe in your kitchen
- Serving: Like running a restaurant chain that serves that recipe to millions
This chapter explores three fundamental patterns that make this transformation possible: replicated services for handling more customers, sharded services for processing large orders, and event-driven processing for handling unpredictable rush hours.
1. What is Model Serving?
In plain English: Model serving is like setting up a restaurant where customers (users) come with questions (input data), and your trained chef (the model) quickly prepares answers (predictions) for them. The challenge is doing this for millions of customers simultaneously while keeping everyone happy.
In technical terms: Model serving is the process of loading a previously trained machine learning model to generate predictions or make inferences on new input data in a production environment.
Why it matters: The best-trained model is worthless if it can't serve predictions quickly, reliably, and cost-effectively to real users. Model serving bridges the gap between research and production impact.
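To make the idea concrete, here is a minimal sketch of a prediction endpoint in Python. It assumes a Flask app and a scikit-learn-style model saved with joblib; the file name and request format are illustrative, not part of any specific example in this chapter.

```python
# Minimal model-serving sketch (assumes Flask and joblib are installed).
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)

# Load the trained model once at startup so every request reuses it.
model = joblib.load("video_tagger.joblib")  # hypothetical artifact path

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()                      # e.g. {"features": [[...], ...]}
    predictions = model.predict(payload["features"])  # assumes a NumPy-array return
    return jsonify({"predictions": predictions.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```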
Model Serving in the ML Pipeline:
Data Ingestion → Model Training → Model Serving ⭐ → Users (Millions!)
Traditional vs Distributed Model Serving:
Real-World Example: YouTube Video Tagging
Remember our YouTube-8M model from Chapter 3? Now we need to serve it to real users:
- Response time: < 200ms
- Accuracy: > 95%
- Availability: 99.9% uptime
The Scale Challenge:
- Day 1 (Beta): 10 requests/hour → single server
- Month 1 (Launch): 1,000 requests/hour → need replicas
- Year 1 (Popular): 100,000 requests/hour → need sharding
- Year 3 (Viral): 10M requests/hour → need everything
Additional pressures:
- Video files grow from 10 MB to 1 GB+ as 4K uploads arrive
- User expectations: instant results
- Business requirements: 24/7 availability
- Cost pressure: efficient resource usage
Serving Architecture Overview:
User uploads a video → Model Server (loads the trained model from Model Storage) → predicted tags returned to the user
2. Replicated Services Pattern: Handling Growing Traffic
2.1. The Problem: Single Server Bottleneck
In plain English: Imagine a popular restaurant with only one chef. During lunch rush, orders pile up, customers wait 2 hours for food, and many leave hungry and angry. This is exactly what happens with single-server model serving.
In technical terms: A single server has limited CPU, memory, and I/O capacity. When concurrent requests exceed these limits, they queue up, causing exponentially increasing response times and eventually service degradation or failure.
Why it matters: User expectations for web services are ruthless - a 3-second delay can cause 40% of users to abandon a site. Single-server bottlenecks directly translate to lost customers, revenue, and reputation.
The Single Server Reality:
A single model server under load:
- Processing: Request #1 (status: busy)
- Queue: 47 requests waiting
- Average wait: 23 minutes
- ❌ User experience: terrible
YouTube Video Tagging - Traffic Growth Story:
Warning
Result: Users abandon the service. Solution needed: Horizontal scaling
Request Processing Timeline:
Single Server Processing (Sequential):
Time: 0s 5s 10s 15s 20s
Request: [███ A ███][██ B ██][████ C ████][█ D █]...
Status: Processing Waiting Waiting Waiting
User A: Happy (5s response)
User B: Annoyed (10s response)
User C: Frustrated (15s response)
User D: Abandoning (20s+ response)
Problem: Each user waits for all previous users
YouTube-8M Video Examples:
- "Cat playing with yarn"
- File size: 45 MB
- Processing time: ~3 seconds
- Expected tags: ["Pet", "Animal", "Indoor"]
- "Mountain biking adventure"
- File size: 120 MB
- Processing time: ~8 seconds
- Expected tags: ["Sport", "Outdoor", "Vehicle"]
- "Cooking pasta tutorial"
- File size: 80 MB
- Processing time: ~5 seconds
- Expected tags: ["Food", "Education", "Indoor"]
Warning
Total wait for the third video in the queue: 3 s + 8 s + 5 s = 16 seconds (unacceptable!)
2.2. The Solution: Horizontal Scaling with Replicas
In plain English: Instead of making one chef work faster (vertical scaling = buying a faster stove), hire more chefs (horizontal scaling = add more servers). Each chef can work on different orders simultaneously.
In technical terms: Horizontal scaling deploys multiple identical instances (replicas) of the model server behind a load balancer. Each replica is stateless and can independently process requests. The load balancer distributes incoming traffic across replicas using algorithms like round-robin or least-connections.
Why it matters: Horizontal scaling is the foundation of modern cloud infrastructure. It provides linear throughput scaling, built-in redundancy, and cost-effective capacity growth without expensive hardware upgrades.
Horizontal vs Vertical Scaling:
- Vertical scaling: upgrade one server with a faster CPU and more memory (buy a faster stove) - quick, but expensive and capped by hardware limits.
- Horizontal scaling: add more identical servers behind a load balancer (hire more chefs) - throughput grows roughly linearly and the extra servers provide redundancy.
Replicated Services Architecture:
User Requests → Load Balancer → Replica 1 / Replica 2 / Replica 3 (each serving the same model)
Parallel Processing Timeline:
Replicated Servers Processing (Parallel):
Time: 0s 5s 10s 15s 20s
Server 1: [███ A ███][███ E ███][███ I ███]...
Server 2: [██ B ██][████ F ████][██ J ██]...
Server 3: [████ C ████][█ G █][███ K ███]...
User A: Happy (5s response) ✅
User B: Happy (5s response) ✅
User C: Happy (5s response) ✅
All users served simultaneously!
Improvement: 3x throughput, consistent response times
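The throughput gain is easy to see in a toy simulation. The sketch below uses a thread pool to stand in for three replicas and `time.sleep` to stand in for inference; the 5-second figures mirror the timeline above.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def handle_request(name, seconds):
    """Stand-in for a replica running inference; sleep simulates processing time."""
    time.sleep(seconds)
    return f"request {name} served in {seconds}s"

incoming = [("A", 5), ("B", 5), ("C", 5)]

start = time.time()
# One server would process these sequentially (~15 s total);
# three "replicas" process them in parallel (~5 s total).
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(lambda r: handle_request(*r), incoming))
print(results, f"elapsed ~{time.time() - start:.0f}s")
```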
Load Balancer Algorithms:
Round robin:
- Request 1 → Server A
- Request 2 → Server B
- Request 3 → Server C
- Request 4 → Server A (cycle repeats)
- Simple and fair distribution
Least connections:
- Server A: 2 active connections
- Server B: 1 active connection ← new request goes here
- Server C: 3 active connections
- Always routes to the least loaded server
Weighted round robin:
- Server A (powerful): weight 3 → gets 3/6 of requests
- Server B (medium): weight 2 → gets 2/6 of requests
- Server C (basic): weight 1 → gets 1/6 of requests
- Accounts for different server capabilities
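As a rough illustration (not any particular load balancer's implementation), the three strategies above can be sketched in a few lines of Python; the server names, connection counts, and weights are made up.

```python
import itertools
import random

servers = ["server-a", "server-b", "server-c"]

# Round robin: hand out servers in a fixed rotating order.
rotation = itertools.cycle(servers)
def pick_round_robin():
    return next(rotation)

# Least connections: route to whichever server has the fewest active requests.
active_connections = {"server-a": 2, "server-b": 1, "server-c": 3}
def pick_least_connections():
    return min(active_connections, key=active_connections.get)

# Weighted: more capable servers appear more often in the pool,
# so they receive proportionally more requests (via weighted random choice here).
weights = {"server-a": 3, "server-b": 2, "server-c": 1}
weighted_pool = [name for name, w in weights.items() for _ in range(w)]
def pick_weighted():
    return random.choice(weighted_pool)

print(pick_round_robin(), pick_least_connections(), pick_weighted())
```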
Performance Results:
2.3. Discussion: High Availability and Load Balancing
In plain English: High availability means your service stays up even when individual servers fail. It's like having backup generators in a hospital - the lights must never go out. Load balancing is the traffic cop that ensures every server gets a fair share of work.
In technical terms: High availability (HA) is achieved through redundancy and health monitoring. Service Level Agreements (SLAs) define acceptable downtime thresholds. Load balancers use readiness probes to detect unhealthy servers and route traffic only to healthy instances.
Why it matters: Downtime is expensive. For e-commerce, every minute offline can cost thousands of dollars. For critical services like healthcare or finance, downtime can be life-threatening or legally problematic. High availability is not a luxury - it's a business requirement.
Three-Nines Availability:
99.9% ("three nines"):
- Allowed downtime per day: 1.44 minutes
- Allowed downtime per month: 43.8 minutes
- Allowed downtime per year: 8.76 hours
- Achievable with: Replicated Services
For comparison, 99.99% ("four nines"):
- Allowed downtime per day: 8.6 seconds
- Allowed downtime per month: 4.38 minutes
- Allowed downtime per year: 52.56 minutes
- Requires: advanced redundancy
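These downtime budgets follow directly from the availability percentage; a quick calculation reproduces them.

```python
def allowed_downtime_minutes(availability: float, period_hours: float) -> float:
    """Downtime budget (in minutes) for a given availability target and period."""
    return (1 - availability) * period_hours * 60

print(allowed_downtime_minutes(0.999, 24))     # three nines, per day  ≈ 1.44 minutes
print(allowed_downtime_minutes(0.999, 8760))   # three nines, per year ≈ 525.6 minutes (~8.76 h)
print(allowed_downtime_minutes(0.9999, 8760))  # four nines, per year  ≈ 52.56 minutes
```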
Readiness Probes:
Load Balancer → Server A (Status: ✅ Ready, receives traffic) and Server B (Status: ❌ Not Ready, no traffic routed)
A server reports ready only when:
- ✅ It can connect to model storage
- ✅ The model loaded successfully
- ✅ Required memory is available
- ✅ Network connectivity is stable
Insight
In production, readiness probes are crucial. A server that appears "running" but can't access the model storage will accept requests but fail them all, creating a terrible user experience.
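A readiness endpoint can be as simple as the sketch below: it reports ready only once the model is actually loaded. The route name and checks are illustrative assumptions; in a Kubernetes-style setup, the platform would poll such a path and withhold traffic until it returns success.

```python
from flask import Flask, jsonify

app = Flask(__name__)
model = None  # would be populated once the model is pulled from storage

@app.route("/ready")
def ready():
    # Report ready only when the server can actually serve predictions;
    # a process that is "running" but has no model should fail this check.
    checks = {
        "model_loaded": model is not None,
        # further checks (storage reachable, memory headroom, ...) would go here
    }
    status_code = 200 if all(checks.values()) else 503
    return jsonify(checks), status_code
```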
Failure Scenarios and Recovery:
2.4. Exercises
1. Are replicated model servers stateless or stateful?
2. What happens when we don't have a load balancer as part of the model serving system?
3. Can we achieve three-nines service-level agreements with only one model server instance?
3. Sharded Services Pattern: Processing Large Requests
3.1. The Problem: High-Resolution Videos Overwhelming Memory
In plain English: Replicated services work great for handling more customers, but what if a customer orders something huge - like editing a 4-hour movie? Even with multiple editors, each individual editor might not have enough resources to handle such a massive project alone.
In technical terms: When individual request payloads exceed single-server memory capacity, replication doesn't help. A 20GB video file can't fit in a server with 4GB RAM, regardless of how many such servers you have. Vertical scaling (upgrading servers) is prohibitively expensive for edge cases.
Why it matters: Modern applications often need to handle large files - 4K videos, high-resolution medical images, genomic data, or massive documents. A serving system that rejects these requests loses valuable use cases and customers.
The Large Request Challenge:
- File size: 20 GB
- Memory needed: 80 GB
- Processing time: 10 minutes
- Status: ❌ No single server can handle it
Memory Overflow Scenario:
Each replica's configuration: 4 CPU cores | 4 GB RAM | 100 GB storage | 1 Gbps network
The 20 GB request cannot fit into any replica's memory, no matter how many replicas exist.
Result: return an error to the user
User experience: 😡 very unhappy
Why Vertical Scaling Isn't Practical:
Workload distribution:
- 95% of videos: standard resolution (50 MB)
- 4% of videos: high resolution (2 GB)
- 1% of videos: professional (20 GB)
Upgrading every server for the 1% case:
- New config: 16 cores, 128 GB RAM, 1 TB storage
- Cost: 10x more expensive
- Resource utilization: 95% of the time you're using 1% of the capacity (wasteful!)
- Average utilization: ~3% (terrible economics!)
- You pay for resources you rarely use
3.2. The Solution: Divide and Conquer Approach
In plain English: Instead of hiring one super-editor with a massive workstation, divide the movie into scenes and have multiple regular editors work on different scenes simultaneously. Then combine their work into the final product.
In technical terms: Sharding splits large requests into smaller segments that fit within individual server capacity. Each segment is processed by a separate shard (server instance). A sharding function (typically a hash) determines which shard handles each segment. Results are merged after parallel processing.
Why it matters: Sharding enables processing of arbitrarily large requests without expensive vertical scaling. It's the technique behind video processing pipelines, distributed databases, and big data systems. The tradeoff is increased complexity - you now manage state across multiple servers.
Video Sharding Strategy:
- Segment 1 (7 GB, "Dog playing") → Shard A (4 GB RAM) → ["Dog", "Animal", "Pet"]
- Segment 2 (7 GB, "Kid laughing") → Shard B (4 GB RAM) → ["Child", "Human", "Play"]
- Segment 3 (6 GB, "Park scene") → Shard C (4 GB RAM) → ["Park", "Outdoor", "Nature"]
- Merge results → Final Tags: ["Dog", "Animal", "Pet", "Child", "Human", "Play", "Park", "Outdoor", "Nature"]
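Merging the per-shard outputs is the most mechanical step. A small sketch of an order-preserving, de-duplicating merge, using the example tags above, might look like this:

```python
def merge_tags(shard_results):
    """Combine per-shard tag lists into one de-duplicated list, preserving order."""
    seen, merged = set(), []
    for tags in shard_results:
        for tag in tags:
            if tag not in seen:
                seen.add(tag)
                merged.append(tag)
    return merged

shard_results = [
    ["Dog", "Animal", "Pet"],       # from Shard A
    ["Child", "Human", "Play"],     # from Shard B
    ["Park", "Outdoor", "Nature"],  # from Shard C
]
print(merge_tags(shard_results))
```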
Sharded Services Architecture:
Large Request (20 GB) → Request Splitter → Shard 1 (7 GB), Shard 2 (7 GB), Shard 3 (6 GB), plus a standby shard → Result Merger → Final Tags
Sharding Function:
Sharding Function: hash(video_segment) % num_shards
- Segment 1: hash("dog_playing") % 4 = 1 → Shard 1
- Segment 2: hash("kid_laughing") % 4 = 3 → Shard 3
- Segment 3: hash("park_scene") % 4 = 0 → Shard 0
- Segment 4: hash("final_scene") % 4 = 2 → Shard 2
- ✅ Deterministic: Same input → same shard
- ✅ Uniform distribution: Even load across shards
- ✅ Fast computation: Minimal overhead
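A sketch of such a sharding function in Python is shown below. It uses hashlib rather than the built-in `hash()` so results stay stable across processes; the shard indices it produces will differ from the illustrative assignments above.

```python
import hashlib

NUM_SHARDS = 4

def shard_for(segment_id: str) -> int:
    """Deterministically map a segment identifier to a shard index."""
    digest = hashlib.md5(segment_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

for segment in ["dog_playing", "kid_laughing", "park_scene", "final_scene"]:
    print(segment, "-> shard", shard_for(segment))
```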
Performance Comparison:
3.3. Discussion: Stateful Services and Trade-offs
In plain English: Stateless services are like fast-food workers - each handles any order independently with no memory of past orders. Stateful services are like coordinating a wedding - everyone needs to remember who's handling the cake, who's setting up chairs, and how it all comes together.
In technical terms: Stateless services process each request independently, with no shared state between requests. Stateful services maintain context across requests or coordinate between multiple components. Sharded services are stateful because they must track which segment belongs to which request and merge results correctly.
Why it matters: Stateful systems are harder to build, debug, and scale. But some problems require state - you can't merge sharded results without knowing which results belong together. Understanding when stateful complexity is worth the tradeoff is a key distributed systems skill.
Stateful vs Stateless Services:
Challenges with Sharding:
Splitting can destroy meaning:
- Original: "Cat jumps over fence"
- Segment 1: "Cat jum" → ❌ meaningless
- Segment 2: "ps over" → ❌ meaningless
- Segment 3: "fence" → ✅ partially meaningful
- Solution: smart segmentation, e.g. splitting video at scene boundaries
Uneven segment sizes:
- Segment 1: 2 GB (fast processing)
- Segment 2: 8 GB (slow processing)
- Segment 3: 10 GB (very slow processing)
- Result: the slowest shard determines total response time
- Solution: dynamic load balancing
Merging results is not trivial:
- Duplicate tags: ["dog", "dog", "animal"]
- Conflicting tags: ["indoor", "outdoor"]
- Context loss: ["running" without "person"]
- Solution: intelligent merging algorithms
Insight
Sharding works best for "embarrassingly parallel" problems where segments can be processed independently. Video analysis is ideal, but tasks requiring global context (like sentiment analysis of entire documents) are more challenging.
When to Use Sharding:
Use sharding when:
- Individual requests exceed single-server capacity
- Work can be divided into independent segments
- Segment results can be meaningfully combined
- The added complexity of state management is acceptable
- Examples: video/audio processing, large image analysis, batch data, page-by-page document analysis
Avoid sharding when:
- All requests fit on a single server
- Work requires global context
- Segments would be meaningless on their own
- The team lacks distributed-systems expertise
- Examples: real-time chat, financial transactions, sequential processing, small requests
3.4. Exercises
1. Would vertical scaling be helpful when handling large requests?
2. Are the model server shards stateful or stateless?
4. Event-Driven Processing Pattern: Dynamic Resource Allocation
4.1. The Problem: Variable Traffic Patterns
In plain English: Traditional model serving is like a taxi stand with a fixed number of taxis always waiting, even at 3 AM when no customers exist. Event-driven serving is like Uber - drivers (resources) appear on-demand when riders (requests) need them.
In technical terms: Fixed resource allocation forces a choice between over-provisioning (expensive, wasteful) or under-provisioning (poor user experience). Event-driven architecture allocates resources dynamically based on actual demand, scaling from zero to thousands of instances automatically.
Why it matters: Most real-world traffic is bursty and unpredictable. Holiday shopping spikes, viral social media posts, or news events can cause 10-100x traffic increases within minutes. Fixed capacity either fails during peaks or wastes money during troughs. Event-driven serving solves both problems.
Real-World Example: Holiday Hotel Price Prediction
Let's build a system that predicts hotel prices for holiday bookings:
Traffic Pattern Reality:
Holiday booking traffic over time: requests per hour stay low for most of the year, then spike sharply around the Christmas and summer holiday peaks.
Problems with fixed resources:
- ❌ Jan-Oct: 90% of resources idle (wasted money)
- ❌ Nov-Dec: 200% of capacity demanded (poor user experience)
- ❌ Unpredictable spikes, e.g. conference announcements
Resource Allocation Challenge:
Option 1 - provision for typical load:
- Resources: 2 CPUs, 20 GB memory
- ✅ Works 70% of the year
- ❌ Crashes during holidays
- ❌ Angry customers during peak season
Option 2 - provision for peak load:
- Resources: 20 CPUs, 200 GB memory
- ✅ Handles all traffic
- ❌ 90% idle most of the year
- ❌ Very expensive (~$50,000/year wasted)
Option 3 - manual scaling:
- Resources: adjusted monthly based on the calendar
- ⚠️ Better than a fixed allocation
- ❌ Can't predict unexpected events
- ❌ Requires constant monitoring
- ❌ Prone to human error
Unexpected Event Example:
4.2. The Solution: On-Demand Resource Utilization
In plain English: Instead of having servers running 24/7 waiting for work, spawn new "workers" only when requests arrive. Each worker handles one request and then disappears. It's like hiring temporary staff only during rush hours.
In technical terms: Event-driven processing uses Function-as-a-Service (FaaS) platforms where code executes in ephemeral containers triggered by events. The platform manages provisioning, scaling, and deprovisioning automatically. You pay only for actual execution time, not idle capacity.
Why it matters: Event-driven architecture can reduce costs by 80%+ while improving user experience during traffic spikes. It eliminates the over-provisioning vs. under-provisioning dilemma entirely. The tradeoff is cold start latency and stateless-only processing.
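The sketch below shows roughly what an event-driven prediction function can look like, following the common AWS Lambda handler convention. The stub model is a hypothetical stand-in; the point is the caching pattern, where the model is loaded once per container and reused on warm invocations.

```python
import json

_model = None  # cached in the container and reused across warm invocations

class _StubModel:
    """Stand-in for a real hotel-price model; only the caching pattern matters here."""
    def predict(self, rows):
        return [100.0 + sum(row) for row in rows]  # placeholder arithmetic

def _load_model():
    # Cold start: the first invocation in a fresh container pays the load cost;
    # warm invocations find the model already cached and skip it.
    global _model
    if _model is None:
        _model = _StubModel()
    return _model

def handler(event, context):
    """Entry point in the style of the common AWS Lambda handler convention."""
    model = _load_model()
    features = json.loads(event["body"])["features"]
    price = model.predict([features])[0]
    return {"statusCode": 200, "body": json.dumps({"predicted_price": price})}
```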
Shared Resource Pool Architecture:
Shared resource pool: 1,000 CPU cores, 5 TB RAM, and 100 TB storage available, shared across workloads:
- Hotel Price Prediction: currently 200 cores, 800 GB RAM
- Data Ingestion: currently 50 cores, 200 GB RAM
- Model Training: currently 100 cores, 1 TB RAM
When a spike hits:
1. Traffic spike detected
2. Hotel service requests 500 more cores
3. Resource pool grants the request
4. Auto-scale: 200 → 700 cores in 2 minutes
5. Spike handled successfully
Event-Driven Processing Flow:
Traffic Spike Handling:
Normal load:
- Active instances: 2
- Resource usage: 8 cores, 32 GB RAM
- Response time: 1.5 seconds
- Cost: $2/hour
During the spike:
- Auto-scaling triggers
- New instances: 40 (scaled up 20x in 3 minutes)
- Resource usage: 160 cores, 640 GB RAM
- Response time: 1.8 seconds (still fast!)
- Cost: $35/hour (only during the spike)
After the spike:
- Instances automatically scaled back down
- Resource usage: 8 cores, 32 GB RAM
- Response time: 1.5 seconds
- Cost: $2/hour
Outcome:
- ✅ Handled an unexpected 20x traffic spike
- ✅ Maintained good user experience
- ✅ Only paid for resources when needed
- ✅ No manual intervention required
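The scale-out decision itself is usually a simple target-tracking rule. The sketch below has the same shape as the Kubernetes Horizontal Pod Autoscaler formula, desired = ceil(current × currentMetric / targetMetric), with numbers chosen to match the 2 → 40 instance jump above.

```python
import math

def desired_replicas(current: int, current_load: float, target_load: float) -> int:
    """Target-tracking scale rule, same shape as Kubernetes' HPA formula."""
    return max(1, math.ceil(current * current_load / target_load))

# 2 instances running at 20x their target utilization -> scale to 40 instances,
# matching the spike-handling numbers above.
print(desired_replicas(2, current_load=10.0, target_load=0.5))  # -> 40
```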
Rate Limiting and DDoS Protection:
Free tier:
- Rate limit: 10 requests/hour
- Burst allowance: 3 requests/minute
- Queue position: lower priority
- Purpose: prevent abuse while allowing exploration
Registered users:
- Rate limit: 100 requests/hour
- Burst allowance: 20 requests/minute
- Queue position: higher priority
- Purpose: better service for legitimate users
Premium tier:
- Rate limit: 1,000 requests/hour
- Burst allowance: 100 requests/minute
- Queue position: highest priority
- Purpose: enterprise-level service
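One common way to enforce limits like these is a token bucket per user or tier; the minimal sketch below plugs in the registered-user numbers above for illustration.

```python
import time

class TokenBucket:
    """Allows `capacity` burst requests; refills at `rate` tokens per second."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Roughly the registered tier above: ~100 requests/hour with bursts of up to 20.
registered = TokenBucket(rate=100 / 3600, capacity=20)
print(registered.allow())  # True until the burst allowance is exhausted
```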
Cost Comparison:
4.3. Discussion: When to Use Event-Driven Architecture
In plain English: Use event-driven serving when your traffic is unpredictable and you care about costs. Stick with always-on servers when you need ultra-low latency or maintain complex state between requests.
In technical terms: Event-driven architecture excels for stateless, bursty workloads where cold start latency (typically 100ms-3s) is acceptable. Long-running services are better for persistent connections, sub-100ms latency requirements, or stateful processing.
Why it matters: Architecture choice significantly impacts both costs and user experience. Event-driven can reduce costs by 80% but adds 500ms-2s of cold start latency. Understanding your workload characteristics determines the right pattern.
Event-Driven vs Long-Running Services:
Good fits for event-driven processing:
- Batch processing
- Seasonal applications
- Webhook handlers
- Image/video processing
- ETL pipelines
Better served by long-running services:
- Real-time chat
- Live streaming
- Database connections
- Gaming servers
- Financial trading
Cold Start Challenges:
Cold start breakdown:
- Container provisioning: 200 ms
- Language runtime startup: 300 ms
- Model loading: 2,000 ms
- First request processing: 500 ms
- Total cold start: ~3 seconds
Warm instance:
- Processing only: 500 ms
- 6x faster than a cold start!
Optimization Strategies:
- Pre-load frequently used models
- Keep some instances ready
- Anticipate traffic patterns
- Reduce loading time
- Load model parts on demand
Typical result with a warm pool:
- 90% of requests: warm instances (500 ms)
- 10% of requests: cold starts (3 s) during scale-up
- Average response: 750 ms (acceptable for many use cases)
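The 750 ms figure is just a weighted average of warm and cold latencies; a one-liner makes the arithmetic explicit.

```python
def expected_latency_ms(warm_ratio: float, warm_ms: float = 500, cold_ms: float = 3000) -> float:
    """Average response time given the fraction of requests served by warm instances."""
    return warm_ratio * warm_ms + (1 - warm_ratio) * cold_ms

print(expected_latency_ms(0.9))  # -> 750.0 ms, matching the estimate above
```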
Insight
Modern serverless platforms (AWS Lambda, Google Cloud Functions) have dramatically reduced cold start times. For many ML applications, the cost savings outweigh the occasional latency spike from cold starts.
Function-as-a-Service vs Traditional Deployment:
4.4. Exercises
1. Suppose we allocate a fixed amount of computational resources over the lifetime of the hotel price prediction serving system. What would the resource utilization rate look like over time?
2. Are replicated services and sharded services long-running systems?
3. Is event-driven processing stateless or stateful?
5. Answers to Exercises
Section 2.4
1. Stateless - each replica processes requests independently without maintaining state between requests.
2. Without a load balancer, incoming traffic isn't distributed across the replicas: clients have to target individual servers, some replicas sit overloaded while others idle, and there is no health-aware routing around failed instances.
3. Not realistically. In theory a single server could satisfy 99.9% availability if it never exceeded roughly 1.44 minutes of downtime per day, but a single instance is a single point of failure - any crash, deployment, or maintenance window can blow through that budget. Replication is the practical way to meet a three-nines SLA.
Section 3.4
1. Yes, vertical scaling would help with large requests, but it would decrease overall resource utilization because most requests don't need the extra capacity.
2. Stateful - shards must maintain partial results from processing their segments until merging is complete.
Section 4.4
1. Resource utilization would vary significantly over time: very low during off-peak periods and potentially overloaded during holidays, leading to poor overall efficiency.
2. Yes - both replicated and sharded services require servers to keep running continuously to accept user requests, with computational resources allocated and occupied at all times.
3. Stateless - event-driven functions process each request independently without maintaining state between invocations.
Summary
What We Learned:
- ✅ Model Serving Fundamentals: transforming trained models into production systems
- ✅ Replicated Services Pattern: horizontal scaling to handle more concurrent requests
- ✅ Sharded Services Pattern: processing large requests that exceed single-server capacity
- ✅ Event-Driven Processing Pattern: dynamic resource allocation for variable traffic
Pattern Selection Guide:
| Challenge | Solution | When to Use |
|---|---|---|
| Too many requests | Replicated Services | Predictable traffic, stateless processing |
| Requests too large | Sharded Services | Large payloads, divisible workloads |
| Variable traffic | Event-Driven | Unpredictable spikes, cost optimization |
Performance Improvements:
- Replicated Services: Linear throughput scaling (3x servers ≈ 3x capacity)
- Sharded Services: Handle arbitrarily large requests (20x larger than memory)
- Event-Driven: 80%+ cost reduction with automatic scaling
Real-World Impact:
Insight
Production ML serving systems typically combine all three patterns: replicated services for base load, sharding for large requests, and event-driven scaling for traffic spikes. This creates robust, cost-effective serving infrastructure.
Architecture Evolution:
Typical Production Evolution:
Single Server (basic serving) → Replicated Services (scalable traffic) → Add Sharding (handle large requests) → Event-Driven Scaling (cost-optimized & dynamic)
Next Steps:
In Chapter 5, we'll explore workflow patterns that orchestrate the entire ML pipeline - from data ingestion through training to serving. You'll learn to build automated systems that manage the complete lifecycle.
Remember: Model serving patterns bridge the gap between research and production. Master these, and you can serve any model to any scale of users.
Previous: Chapter 3: Distributed Training Patterns | Next: Chapter 5: Workflow Patterns