
Chapter 4: Model Serving Patterns

From trained models to production magic - serving millions of predictions per second


Table of Contents

  1. What is Model Serving?
  2. Replicated Services Pattern: Handling Growing Traffic
  3. Sharded Services Pattern: Processing Large Requests
  4. Event-Driven Processing Pattern: Dynamic Resource Allocation
  5. Answers to Exercises
  6. Summary

Introduction

Restaurant Analogy: You've learned to cook amazing dishes (train models), but now comes the real challenge: running a restaurant that serves hundreds of customers every hour. Model serving is the art of turning your carefully crafted models into production systems that can handle real-world traffic, from a single user to millions of concurrent requests.

Think of the transformation:

  • Training: Like perfecting a recipe in your kitchen
  • Serving: Like running a restaurant chain that serves that recipe to millions

This chapter explores three fundamental patterns that make this transformation possible: replicated services for handling more customers, sharded services for processing large orders, and event-driven processing for handling unpredictable rush hours.


1. What is Model Serving?

In plain English: Model serving is like setting up a restaurant where customers (users) come with questions (input data), and your trained chef (the model) quickly prepares answers (predictions) for them. The challenge is doing this for millions of customers simultaneously while keeping everyone happy.

In technical terms: Model serving is the process of loading a previously trained machine learning model to generate predictions or make inferences on new input data in a production environment.

Why it matters: The best-trained model is worthless if it can't serve predictions quickly, reliably, and cost-effectively to real users. Model serving bridges the gap between research and production impact.

Model Serving in the ML Pipeline:

ML Pipeline Overview: Data Ingestion → Model Training → Model Serving ⭐ → Users (Millions!)

Traditional vs Distributed Model Serving:

| | Traditional Serving | Distributed Serving |
| --- | --- | --- |
| Computational resources | Personal laptop or single server | Cluster of machines |
| Dataset location | Local disk | Remote distributed database |
| Model & data size | Small enough for a single machine | Medium to extremely large |
| Traffic handling | Sequential processing | Concurrent processing |
| Availability | Single point of failure | High availability |
| Scaling | Vertical (bigger machine) | Horizontal (more machines) |

Real-World Example: YouTube Video Tagging

Remember our YouTube-8M model from Chapter 3? Now we need to serve it to real users:

User Upload Flow:
  • 📤 User uploads a video: "My cat playing with a ball"
  • ⚙️ Model serving system processes the video and extracts features
  • 🏷️ Predicted tags: ["Pet", "Animal", "Toy", "Indoor"]

Requirements:
  • Response time: < 200ms
  • Accuracy: > 95%
  • Availability: 99.9% uptime

The Scale Challenge:

📈 Serving Requirements Growth
  • Day 1 (Beta): 10 requests/hour → Single server
  • Month 1 (Launch): 1,000 requests/hour → Need replicas
  • Year 1 (Popular): 100,000 requests/hour → Need sharding
  • Year 3 (Viral): 10M requests/hour → Need everything

⚠️ Additional Challenges
  • Video files grow from 10 MB → 4K videos (1 GB+)
  • User expectations: instant results
  • Business requirements: 24/7 availability
  • Cost pressure: efficient resource usage

Serving Architecture Overview:

Model Serving System Architecture: a user uploads a video → the request goes to the Model Server, which loads the model from Model Storage → the response (tags) is returned to the user.

2. Replicated Services Pattern: Handling Growing Traffic

2.1. The Problem: Single Server Bottleneck

In plain English: Imagine a popular restaurant with only one chef. During lunch rush, orders pile up, customers wait 2 hours for food, and many leave hungry and angry. This is exactly what happens with single-server model serving.

In technical terms: A single server has limited CPU, memory, and I/O capacity. When concurrent requests exceed these limits, they queue up, causing exponentially increasing response times and eventually service degradation or failure.

Why it matters: User expectations for web services are ruthless - a 3-second delay can cause 40% of users to abandon a site. Single-server bottlenecks directly translate to lost customers, revenue, and reputation.

The Single Server Reality:

Single Server Model Serving:
  • Request queue: [Req #2] [Req #3] [Req #4] [Req #5] ... [Req #47]
  • Single model server processing Request #1, status: busy
  • Queue: 47 requests waiting
  • Average wait: 23 minutes
  • ❌ User experience: terrible

YouTube Video Tagging - Traffic Growth Story:

  1. Week 1: Beta test - 10 videos/day → response: 100 ms ✅
  2. Month 1: Public launch - 1,000 videos/day → response: 2 seconds ⚠️
  3. Month 3: Viral social media post - 10,000 videos/day → response: 20 seconds ❌
  4. Month 6: News coverage - 100,000 videos/day → response: 5 minutes ❌

Warning

Result: Users abandon the service. Solution needed: Horizontal scaling

Request Processing Timeline:

Single Server Processing (Sequential):
Time: 0s 5s 10s 15s 20s
Request: [███ A ███][██ B ██][████ C ████][█ D █]...
Status: Processing Waiting Waiting Waiting

User A: Happy (5s response)
User B: Annoyed (10s response)
User C: Frustrated (15s response)
User D: Abandoning (20s+ response)

Problem: Each user waits for all previous users

YouTube-8M Video Examples:

🐱 Queue Position 1
  • "Cat playing with yarn"
  • File size: 45 MB
  • Processing time: ~3 seconds
  • Expected tags: ["Pet", "Animal", "Indoor"]

🚵 Queue Position 2
  • "Mountain biking adventure"
  • File size: 120 MB
  • Processing time: ~8 seconds
  • Expected tags: ["Sport", "Outdoor", "Vehicle"]

🍝 Queue Position 3
  • "Cooking pasta tutorial"
  • File size: 80 MB
  • Processing time: ~5 seconds
  • Expected tags: ["Food", "Education", "Indoor"]

Warning

Total wait for Position 3: 16 seconds (unacceptable!)

2.2. The Solution: Horizontal Scaling with Replicas

In plain English: Instead of making one chef work faster (vertical scaling = buying a faster stove), hire more chefs (horizontal scaling = add more servers). Each chef can work on different orders simultaneously.

In technical terms: Horizontal scaling deploys multiple identical instances (replicas) of the model server behind a load balancer. Each replica is stateless and can independently process requests. The load balancer distributes incoming traffic across replicas using algorithms like round-robin or least-connections.

Why it matters: Horizontal scaling is the foundation of modern cloud infrastructure. It provides linear throughput scaling, built-in redundancy, and cost-effective capacity growth without expensive hardware upgrades.

Horizontal vs Vertical Scaling:

| | Vertical Scaling (Scale Up) | Horizontal Scaling (Scale Out) |
| --- | --- | --- |
| Configuration | Single server: 4 cores, 8 GB RAM → upgraded to a powerful server: 32 cores, 128 GB RAM | 3 servers, each with 4 cores, 8 GB RAM |
| Pros | ✅ Simple to implement | ✅ Cost-effective, ✅ fault-tolerant, ✅ scalable |
| Cons | ❌ Expensive, ❌ single point of failure | Requires a load balancer |

Replicated Services Architecture:

Replicated Model Serving: user requests → Load Balancer → Replica 1, Replica 2, Replica 3 (each serving, each loading the same model from Model Storage).

Key property: each replica is stateless and independent (a minimal replica sketch follows).
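
To make this concrete, here is a minimal sketch of what a single stateless replica might look like, assuming a Flask app; the model loader and the tag prediction are placeholders, not the book's actual implementation.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def load_model(path):
    """Placeholder for pulling the trained tagging model from shared model storage."""
    return lambda video_bytes: ["Pet", "Animal", "Indoor"]  # dummy prediction

# Loaded once at startup; afterwards the replica only reads it, so no per-request
# state is kept and the load balancer can send any request to any replica.
model = load_model("models/video-tagger")

@app.route("/predict", methods=["POST"])
def predict():
    video_bytes = request.get_data()   # raw video payload from the user
    tags = model(video_bytes)          # run inference; nothing is remembered afterwards
    return jsonify({"tags": tags})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Because the replica keeps nothing between requests, scaling out is simply a matter of starting more identical copies behind the load balancer.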

Parallel Processing Timeline:

Replicated Servers Processing (Parallel):
Time: 0s 5s 10s 15s 20s
Server 1: [███ A ███][███ E ███][███ I ███]...
Server 2: [██ B ██][████ F ████][██ J ██]...
Server 3: [████ C ████][█ G █][███ K ███]...

User A: Happy (5s response) ✅
User B: Happy (5s response) ✅
User C: Happy (5s response) ✅
All users served simultaneously!

Improvement: 3x throughput, consistent response times

Load Balancer Algorithms (sketched in code after this list):

🔄 Round Robin
  • Request 1 → Server A
  • Request 2 → Server B
  • Request 3 → Server C
  • Request 4 → Server A (cycle repeats)
  • Simple and fair distribution

⚖️ Least Connections
  • Server A: 2 active connections
  • Server B: 1 active connection ← New request goes here
  • Server C: 3 active connections
  • Always choose the least loaded server

Weighted Round Robin
  • Server A (powerful): Weight 3 → gets 3/6 requests
  • Server B (medium): Weight 2 → gets 2/6 requests
  • Server C (basic): Weight 1 → gets 1/6 requests
  • Accounts for different server capabilities
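
The three strategies boil down to very small pieces of selection logic. A rough sketch, with server names and connection counts made up for illustration:

```python
import itertools

servers = ["server-a", "server-b", "server-c"]

# Round robin: cycle through servers in a fixed order.
round_robin = itertools.cycle(servers)
def pick_round_robin():
    return next(round_robin)

# Least connections: pick the server with the fewest active connections.
# A real balancer updates these counts as connections open and close.
active_connections = {"server-a": 2, "server-b": 1, "server-c": 3}
def pick_least_connections():
    return min(active_connections, key=active_connections.get)

# Weighted round robin: repeat each server proportionally to its weight.
weights = {"server-a": 3, "server-b": 2, "server-c": 1}
weighted_order = itertools.cycle(
    [name for name, w in weights.items() for _ in range(w)]
)
def pick_weighted():
    return next(weighted_order)

print(pick_round_robin())        # server-a, then server-b, server-c, server-a, ...
print(pick_least_connections())  # server-b (only 1 active connection)
print(pick_weighted())           # server-a gets 3 of every 6 picks
```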

Performance Results:

| | Before (Single Server) | After (3 Replicas + Load Balancer) |
| --- | --- | --- |
| Throughput | 720 videos/day | 2,160 videos/day (3x improvement) |
| Average response | 2-20 seconds | 2-5 seconds |
| Peak response | 5+ minutes | 8 seconds |
| User satisfaction | 15% ❌ | 89% ✅ |
| Uptime | 95-98% | 99.9% |
| Fault tolerance | None | 1 server can fail |

2.3. Discussion: High Availability and Load Balancing

In plain English: High availability means your service stays up even when individual servers fail. It's like having backup generators in a hospital - the lights must never go out. Load balancing is the traffic cop that ensures every server gets a fair share of work.

In technical terms: High availability (HA) is achieved through redundancy and health monitoring. Service Level Agreements (SLAs) define acceptable downtime thresholds. Load balancers use readiness probes to detect unhealthy servers and route traffic only to healthy instances.

Why it matters: Downtime is expensive. For e-commerce, every minute offline can cost thousands of dollars. For critical services like healthcare or finance, downtime can be life-threatening or legally problematic. High availability is not a luxury - it's a business requirement.

Three-Nines Availability:

99.9% Availability ("Three Nines")
  • Allowed downtime per day: 1.44 minutes
  • Allowed downtime per month: 43.8 minutes
  • Allowed downtime per year: 8.76 hours
  • Achievable with: Replicated Services
99.99% Availability ("Four Nines")
  • Allowed downtime per day: 8.6 seconds
  • Allowed downtime per month: 4.38 minutes
  • Allowed downtime per year: 52.56 minutes
  • Requires: Advanced redundancy
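
The downtime budgets above follow directly from the availability percentage; a quick way to reproduce the figures (month approximated as 30.42 days):

```python
def downtime_budget(availability_pct):
    """Allowed downtime per day, month, and year for a given availability percentage."""
    down_fraction = 1 - availability_pct / 100
    return {
        "per day (minutes)": round(down_fraction * 24 * 60, 2),
        "per month (minutes)": round(down_fraction * 24 * 60 * 30.42, 2),
        "per year (hours)": round(down_fraction * 24 * 365, 2),
    }

print(downtime_budget(99.9))   # ~1.44 min/day, ~43.8 min/month, ~8.76 h/year
print(downtime_budget(99.99))  # ~0.14 min/day (about 8.6 s), ~4.38 min/month, ~0.88 h/year
```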

Readiness Probes:

Server Health Monitoring:
  • Load balancer → health check → Server A: status ✅ Ready
  • Load balancer → health check (timeout) → Server B: status ❌ Not Ready

Readiness Probe Checks:
  • ✅ Can connect to model storage
  • ✅ Model loaded successfully
  • ✅ Required memory available
  • ✅ Network connectivity stable

Insight

In production, readiness probes are crucial. A server that appears "running" but can't access the model storage will accept requests but fail them all, creating a terrible user experience.
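
Building on the replica sketch from Section 2.2, a readiness endpoint can run checks like those listed above before reporting the replica as healthy. This is a minimal sketch assuming Flask, with the individual checks stubbed out as placeholders:

```python
from flask import Flask, jsonify

app = Flask(__name__)

def storage_reachable():
    """Placeholder: verify the replica can reach model storage."""
    return True

def model_loaded():
    """Placeholder: verify the model object is loaded in memory."""
    return True

@app.route("/ready")
def ready():
    checks = {
        "storage_reachable": storage_reachable(),
        "model_loaded": model_loaded(),
    }
    ok = all(checks.values())
    # The load balancer calls this endpoint periodically and only routes
    # traffic to replicas that answer 200.
    return jsonify(checks), (200 if ok else 503)
```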

Failure Scenarios and Recovery:

  • Normal operation (3 servers): load 1,000 requests/hour, per server 333 requests/hour ✅
  • ⚠️ Server B fails: load 1,000 requests/hour, per server 500 requests/hour ⚠️ (slower but still working)
  • Server C also fails: load 1,000 requests/hour on one server ❌ (overloaded, degraded service)
  • 🔄 Auto-scaling response: detect high load → spin up new servers → wait for readiness probes → resume normal operation

2.4. Exercises

  1. Are replicated model servers stateless or stateful?

  2. What happens when we don't have a load balancer as part of the model serving system?

  3. Can we achieve three-nines service-level agreements with only one model server instance?


3. Sharded Services Pattern: Processing Large Requests

3.1. The Problem: High-Resolution Videos Overwhelming Memory

In plain English: Replicated services work great for handling more customers, but what if a customer orders something huge - like editing a 4-hour movie? Even with multiple editors, each individual editor might not have enough resources to handle such a massive project alone.

In technical terms: When individual request payloads exceed single-server memory capacity, replication doesn't help. A 20GB video file can't fit in a server with 4GB RAM, regardless of how many such servers you have. Vertical scaling (upgrading servers) is prohibitively expensive for edge cases.

Why it matters: Modern applications often need to handle large files - 4K videos, high-resolution medical images, genomic data, or massive documents. A serving system that rejects these requests loses valuable use cases and customers.

The Large Request Challenge:

| | Standard Video (720p) | High-Resolution Video (4K) |
| --- | --- | --- |
| File size | 50 MB | 2 GB |
| Memory needed | 200 MB | 8 GB |
| Processing time | 3 seconds | 45 seconds |
| Status | ✅ Works fine with replicas | ❌ Replica only has 4 GB RAM |

Professional Video (8K RAW):
  • File size: 20 GB
  • Memory needed: 80 GB
  • Processing time: 10 minutes
  • Status: ❌ No single server can handle it

Memory Overflow Scenario:

Server Resource Limits

Each replica configuration: CPU: 4 cores | RAM: 4 GB | Storage: 100 GB | Network: 1 Gbps

  1. Download video → ✅ (fits in storage)
  2. Load into memory → ❌ (20 GB > 4 GB RAM)
  3. Process video → ❌ (out-of-memory error)

Result: return an error to the user
User experience: 😡 very unhappy

Why Vertical Scaling Isn't Practical:

📊 Traffic Reality
  • 95% of videos: standard resolution (50 MB)
  • 4% of videos: high resolution (2 GB)
  • 1% of videos: professional (20 GB)

💰 Option 1: Upgrade all replicas
  • New config: 16 cores, 128 GB RAM, 1 TB storage
  • Cost: 10x more expensive
  • Resource utilization: 95% of the time only 1% of capacity is used (wasteful!)
  • Average utilization: 3% (terrible economics!)

Problem: paying for resources you rarely use.

3.2. The Solution: Divide and Conquer Approach

In plain English: Instead of hiring one super-editor with a massive workstation, divide the movie into scenes and have multiple regular editors work on different scenes simultaneously. Then combine their work into the final product.

In technical terms: Sharding splits large requests into smaller segments that fit within individual server capacity. Each segment is processed by a separate shard (server instance). A sharding function (typically a hash) determines which shard handles each segment. Results are merged after parallel processing.

Why it matters: Sharding enables processing of arbitrarily large requests without expensive vertical scaling. It's the technique behind video processing pipelines, distributed databases, and big data systems. The tradeoff is increased complexity - you now manage state across multiple servers.

Video Sharding Strategy:

Video Sharding Process:
  • 🎬 Original 8K video (20 GB): "Dog and kid playing in the park"
  • ✂️ Split the video into meaningful segments
  • Segment 1 (7 GB, "dog playing") → Shard A (4 GB RAM) → ["Dog", "Animal", "Pet"]
  • Segment 2 (7 GB, "kid laughing") → Shard B (4 GB RAM) → ["Child", "Human", "Play"]
  • Segment 3 (6 GB, "park scene") → Shard C (4 GB RAM) → ["Park", "Outdoor", "Nature"]
  • Merge results → Final tags: ["Dog", "Animal", "Pet", "Child", "Human", "Play", "Park", "Outdoor", "Nature"]

Sharded Services Architecture:

Sharded Model Serving: a large request (20 GB) → Request Splitter → Shard 1 (7 GB), Shard 2 (7 GB), Shard 3 (6 GB), Shard 4 (standby), each loading the model from Model Storage → Result Merger → final tags.

Sharding Function:

Determining Shard Assignment

Sharding Function: hash(video_segment) % num_shards

Example:
  • Segment 1: hash("dog_playing") % 4 = 1 → Shard 1
  • Segment 2: hash("kid_laughing") % 4 = 3 → Shard 3
  • Segment 3: hash("park_scene") % 4 = 0 → Shard 0
  • Segment 4: hash("final_scene") % 4 = 2 → Shard 2
Key Properties of the Hash Function (a small sketch follows this list):
  • ✅ Deterministic: Same input → same shard
  • ✅ Uniform distribution: Even load across shards
  • ✅ Fast computation: Minimal overhead
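
A small sketch of the sharding function described above. Python's built-in `hash()` is randomized per process for strings, so a stable hash such as MD5 is used here to keep the assignment deterministic across machines; the segment names are illustrative and the resulting shard numbers will not necessarily match the example values in the text.

```python
import hashlib

NUM_SHARDS = 4

def assign_shard(segment_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Deterministically map a segment to a shard: hash(segment) % num_shards."""
    digest = hashlib.md5(segment_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

for segment in ["dog_playing", "kid_laughing", "park_scene", "final_scene"]:
    print(segment, "-> shard", assign_shard(segment))
```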

Performance Comparison:

| | Replicated Services (Failure) | Sharded Services (Success) |
| --- | --- | --- |
| Time to result | 30 seconds (then failure) | 85 seconds (15s split + 60s process + 10s merge) |
| Result | Error message | Complete tag list |
| User experience | 😡 Frustrated | 😊 Happy |
| Resource utilization | 0% (failed) | 95% (efficient) |

Improvement: 20 GB videos now processable!

3.3. Discussion: Stateful Services and Trade-offs

In plain English: Stateless services are like fast-food workers - each handles any order independently with no memory of past orders. Stateful services are like coordinating a wedding - everyone needs to remember who's handling the cake, who's setting up chairs, and how it all comes together.

In technical terms: Stateless services process each request independently, with no shared state between requests. Stateful services maintain context across requests or coordinate between multiple components. Sharded services are stateful because they must track which segment belongs to which request and merge results correctly.

Why it matters: Stateful systems are harder to build, debug, and scale. But some problems require state - you can't merge sharded results without knowing which results belong together. Understanding when stateful complexity is worth the tradeoff is a key distributed systems skill.

Stateful vs Stateless Services:

| | Replicated Services (Stateless) | Sharded Services (Stateful) |
| --- | --- | --- |
| Independence | ✅ Each request is independent | ⚠️ Must maintain partial results |
| Scaling | ✅ Easy to scale up/down | ⚠️ Complex failure handling |
| Load balancing | ✅ Simple load balancing | ⚠️ Coordination required |
| Fault tolerance | ✅ Fault tolerant | Can handle large requests |
| Request processing | Process any request independently | Must store segment results for merging |

Challenges with Sharding:

Problem 1: Meaningless Segments
  • Original: "Cat jumps over fence"
  • Segment 1: "Cat jum" → ❌ Meaningless
  • Segment 2: "ps over" → ❌ Meaningless
  • Segment 3: "fence" → ✅ Partially meaningful
  • Solution: Smart segmentation by scene detection

⚖️ Problem 2: Uneven Segment Sizes
  • Segment 1: 2 GB (fast processing)
  • Segment 2: 8 GB (slow processing)
  • Segment 3: 10 GB (very slow processing)
  • Result: The slowest shard determines total time
  • Solution: Dynamic load balancing

🔀 Problem 3: Result Merging Complexity
  • Duplicate tags: ["dog", "dog", "animal"]
  • Conflicting tags: ["indoor", "outdoor"]
  • Context loss: ["running" without "person"]
  • Solution: Intelligent merging algorithms (a simple deduplication sketch follows this list)
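
For the merging step, here is a deliberately simplistic sketch that only deduplicates tags while preserving the order in which shards returned them; a real merger would also resolve conflicting tags and restore lost context, as noted above.

```python
def merge_shard_tags(shard_results):
    """Combine per-shard tag lists into one list, dropping duplicates but keeping order."""
    merged, seen = [], set()
    for tags in shard_results:
        for tag in tags:
            if tag not in seen:
                seen.add(tag)
                merged.append(tag)
    return merged

shard_results = [
    ["Dog", "Animal", "Pet"],       # from shard A
    ["Child", "Human", "Play"],     # from shard B
    ["Park", "Outdoor", "Nature"],  # from shard C
]
print(merge_shard_tags(shard_results))
# ['Dog', 'Animal', 'Pet', 'Child', 'Human', 'Play', 'Park', 'Outdoor', 'Nature']
```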

Insight

Sharding works best for "embarrassingly parallel" problems where segments can be processed independently. Video analysis is ideal, but tasks requiring global context (like sentiment analysis of entire documents) are more challenging.

When to Use Sharding:

Use Sharding When
  • Individual requests > single server capacity
  • Work can be divided into independent segments
  • Segment results can be meaningfully combined
  • Complex state management is acceptable
  • Examples: Video/audio processing, large image analysis, batch data, document analysis by pages
Don't Use Sharding When
  • All requests fit on single server
  • Work requires global context
  • Segments would be meaningless
  • Team lacks distributed systems expertise
  • Examples: Real-time chat, financial transactions, sequential processing, small requests

3.4. Exercises

  1. Would vertical scaling be helpful when handling large requests?

  2. Are the model server shards stateful or stateless?


4. Event-Driven Processing Pattern: Dynamic Resource Allocation

4.1. The Problem: Variable Traffic Patterns

In plain English: Traditional model serving is like a taxi stand with a fixed number of taxis always waiting, even at 3 AM when no customers exist. Event-driven serving is like Uber - drivers (resources) appear on-demand when riders (requests) need them.

In technical terms: Fixed resource allocation forces a choice between over-provisioning (expensive, wasteful) or under-provisioning (poor user experience). Event-driven architecture allocates resources dynamically based on actual demand, scaling from zero to thousands of instances automatically.

Why it matters: Most real-world traffic is bursty and unpredictable. Holiday shopping spikes, viral social media posts, or news events can cause 10-100x traffic increases within minutes. Fixed capacity either fails during peaks or wastes money during troughs. Event-driven serving solves both problems.

Real-World Example: Holiday Hotel Price Prediction

Let's build a system that predicts hotel prices for holiday bookings:

Hotel Price Prediction System:
  • 📝 User input: Destination: "Aspen, Colorado" | Dates: "Dec 24-28, 2024" | Guests: 4 people | Budget: $500/night
  • 🤖 ML model processing: historical price data, seasonal patterns, local events, demand forecasting
  • 💰 Predicted prices: Hotel A: $450/night (92% confidence) | Hotel B: $520/night (88% confidence) | Hotel C: $380/night (95% confidence)

Traffic Pattern Reality:

Holiday Booking Traffic Over Time: requests per hour stay low for most of the year, with sharp peaks around the 🎄 Christmas and 🏖️ summer holidays.

Problems with Fixed Resources:
  • ❌ Jan-Oct: 90% of resources idle (wasted money)
  • ❌ Nov-Dec: 200% demand (poor user experience)
  • ❌ Unpredictable spikes: conference announcements

Resource Allocation Challenge:

📊 Option 1: Provision for Average Load
  • Resources: 2 CPUs, 20 GB memory
  • ✅ Works 70% of the year
  • ❌ Crashes during holidays
  • ❌ Angry customers during peak season

📈 Option 2: Provision for Peak Load
  • Resources: 20 CPUs, 200 GB memory
  • ✅ Handles all traffic
  • ❌ 90% idle most of the year
  • ❌ Very expensive ($50,000/year wasted)

👷 Option 3: Manual Scaling
  • Resources: Adjust monthly based on the calendar
  • ⚠️ Better than fixed
  • ❌ Can't predict unexpected events
  • ❌ Requires constant monitoring
  • ❌ Prone to human error

Unexpected Event Example:

Surprise Traffic Spike:
  • 📰 December 15th, 2024: "Major Tech Conference Announced in Aspen!"
  • 9:00 AM: normal traffic (50 requests/hour)
  • ⚠️ 11:00 AM: 500 requests/hour (10x normal!)
  • 12:00 PM: 1,000 requests/hour (20x normal!)
  • 💥 1:00 PM: system completely overwhelmed | response time: timeout | success rate: 10% | business impact: lost bookings worth $500,000

4.2. The Solution: On-Demand Resource Utilization

In plain English: Instead of having servers running 24/7 waiting for work, spawn new "workers" only when requests arrive. Each worker handles one request and then disappears. It's like hiring temporary staff only during rush hours.

In technical terms: Event-driven processing uses Function-as-a-Service (FaaS) platforms where code executes in ephemeral containers triggered by events. The platform manages provisioning, scaling, and deprovisioning automatically. You pay only for actual execution time, not idle capacity.

Why it matters: Event-driven architecture can reduce costs by 80%+ while improving user experience during traffic spikes. It eliminates the over-provisioning vs. under-provisioning dilemma entirely. The tradeoff is cold start latency and stateless-only processing.

Shared Resource Pool Architecture:

Event-Driven Resource Allocation

Shared resource pool: CPU: 1,000 cores available | RAM: 5 TB available | Storage: 100 TB available

Current allocations:
  • Hotel price prediction: 200 cores, 800 GB RAM
  • Data ingestion: 50 cores, 200 GB RAM
  • Model training: 100 cores, 1 TB RAM

Event-Driven Scaling:
  • Traffic spike detected
  • Hotel service requests 500 more cores
  • Resource pool grants the request
  • Auto-scale: 200 → 700 cores in 2 minutes
  • Spike handled successfully

Event-Driven Processing Flow:

  1. User request arrives: "Predict hotel prices for Aspen Dec 24-28"
  2. Event trigger: "Hotel prediction event detected" | requested resource allocation: 4 CPU cores, 16 GB RAM
  3. Resource provisioning: resources allocated to a new function instance | startup time: 800 ms
  4. Model loading & processing: the instance loads the hotel price model from storage and runs the prediction on the input data | duration: 1.2 seconds
  5. Return results & cleanup: the user receives the predicted prices, the request completes, and the resources are released back to the pool

Total response time: 2 seconds | Resource utilization: 100% (no idle time!) A handler-style sketch of this flow follows.
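
In code, the five steps above roughly correspond to a single handler function that a FaaS platform invokes once per event; the event shape and the price model here are hypothetical stand-ins, not a specific platform's API.

```python
import json

def load_price_model():
    """Placeholder: fetch the hotel price model from storage (the loading part of step 4)."""
    return lambda request: {"Hotel A": 450, "Hotel B": 520, "Hotel C": 380}

def handle_event(event):
    """Invoked by the platform for each request; resources exist only for this call."""
    # Steps 1-3: by the time this runs, the platform has already detected the event
    # and provisioned an instance for it.
    request = json.loads(event["body"])  # e.g. {"destination": "Aspen", "dates": "Dec 24-28"}

    # Step 4: load the model and run the prediction.
    model = load_price_model()
    prices = model(request)

    # Step 5: return the result; the platform reclaims the instance afterwards.
    return {"statusCode": 200, "body": json.dumps({"predicted_prices": prices})}

# Local usage example:
print(handle_event({"body": json.dumps({"destination": "Aspen", "dates": "Dec 24-28"})}))
```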

Traffic Spike Handling:

🌤️ Normal Day (50 requests/hour)
  • Active instances: 2
  • Resource usage: 8 cores, 32 GB RAM
  • Response time: 1.5 seconds
  • Cost: $2/hour

🔥 Conference Spike (1,000 requests/hour)
  • Auto-scaling triggers
  • New instances: 40 (scaled up 20x in 3 minutes)
  • Resource usage: 160 cores, 640 GB RAM
  • Response time: 1.8 seconds (still fast!)
  • Cost: $35/hour (only during the spike)

☀️ Post-Spike (returns to normal)
  • Instances automatically scaled down
  • Resource usage: 8 cores, 32 GB RAM
  • Response time: 1.5 seconds
  • Cost: $2/hour

Benefits:
  • ✅ Handled an unexpected 20x traffic spike
  • ✅ Maintained good user experience
  • ✅ Only paid for resources when needed
  • ✅ No manual intervention required

Rate Limiting and DDoS Protection (a token-bucket sketch follows this list):

👤 Anonymous Users
  • Rate limit: 10 requests/hour
  • Burst allowance: 3 requests/minute
  • Queue position: lower priority
  • Purpose: prevent abuse while allowing exploration

👥 Authenticated Users
  • Rate limit: 100 requests/hour
  • Burst allowance: 20 requests/minute
  • Queue position: higher priority
  • Purpose: better service for legitimate users

Premium Users
  • Rate limit: 1,000 requests/hour
  • Burst allowance: 100 requests/minute
  • Queue position: highest priority
  • Purpose: enterprise-level service
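
A rough sketch of how the hourly limits per tier could be enforced with one token bucket per user. The tier numbers mirror the list above; the in-memory dictionary is purely for illustration (a production system would use shared storage such as a cache), and the burst limits are omitted for brevity.

```python
import time

# Requests allowed per hour for each tier (from the list above).
TIER_LIMITS = {"anonymous": 10, "authenticated": 100, "premium": 1000}

class TokenBucket:
    """Refills continuously up to `capacity` tokens per hour; one token per request."""
    def __init__(self, capacity_per_hour):
        self.capacity = capacity_per_hour
        self.tokens = float(capacity_per_hour)
        self.last_refill = time.time()

    def allow(self):
        now = time.time()
        elapsed_hours = (now - self.last_refill) / 3600
        self.tokens = min(self.capacity, self.tokens + elapsed_hours * self.capacity)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}  # user_id -> TokenBucket (in-memory, illustration only)

def is_allowed(user_id, tier):
    bucket = buckets.setdefault(user_id, TokenBucket(TIER_LIMITS[tier]))
    return bucket.allow()

print(is_allowed("user-1", "anonymous"))  # True until the 10-per-hour budget is spent
```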

Cost Comparison:

| | Traditional Fixed Allocation (Peak Provisioning) | Event-Driven Allocation |
| --- | --- | --- |
| Resources | 1,000 cores, 4 TB RAM (always on) | 0-1,000 cores (dynamic) |
| Utilization | 5% average, 90% peak | 95% when active |
| Monthly cost | $50,000 | $8,000 |
| Wasted spend | $47,500/month (95% waste!) | |

Savings: $42,000/month (84% cost reduction!)

4.3. Discussion: When to Use Event-Driven Architecture

In plain English: Use event-driven serving when your traffic is unpredictable and you care about costs. Stick with always-on servers when you need ultra-low latency or maintain complex state between requests.

In technical terms: Event-driven architecture excels for stateless, bursty workloads where cold start latency (typically 100ms-3s) is acceptable. Long-running services are better for persistent connections, sub-100ms latency requirements, or stateful processing.

Why it matters: Architecture choice significantly impacts both costs and user experience. Event-driven can reduce costs by 80% but adds 500ms-2s of cold start latency. Understanding your workload characteristics determines the right pattern.

Event-Driven vs Long-Running Services:

| | Use Event-Driven When | Use Long-Running Services When |
| --- | --- | --- |
| Traffic | ✅ Highly variable or unpredictable | ✅ Consistent traffic patterns |
| Idle time | ✅ Long periods of zero/low traffic | Continuous activity expected |
| Cost | ✅ Cost optimization is critical | Performance is the priority |
| State | ✅ Stateless processing only | ✅ Stateful processing needed |
| Latency | Can tolerate cold-start delays | ✅ Low-latency requirements (< 100 ms) |

Examples:

Event-Driven Good For:
  • Batch processing
  • Seasonal applications
  • Webhook handlers
  • Image/video processing
  • ETL pipelines

🔄 Long-Running Good For:
  • Real-time chat
  • Live streaming
  • Database connections
  • Gaming servers
  • Financial trading

Cold Start Challenges:

Function Cold Start Analysis

❄️ Cold Start Timeline
  • Container provisioning: 200 ms
  • Language runtime: 300 ms
  • Model loading: 2,000 ms
  • First request: 500 ms
  • Total cold start: 3 seconds

🔥 Warm Request Timeline
  • Processing only: 500 ms
  • 6x faster than a cold start!

Optimization Strategies:

💾 Model Caching
  • Pre-load frequently used models (a caching sketch follows this list)

♨️ Warm Pools
  • Keep some instances ready

📊 Predictive Scaling
  • Anticipate traffic patterns

🪶 Lightweight Models
  • Reduce loading time

Progressive Loading
  • Load model parts on demand

Production Pattern:
  • 90% of requests: warm instances (500 ms)
  • 10% of requests: cold start (3 s) during scale-up
  • Average response: 750 ms (acceptable for many use cases)
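
In practice, the model-caching strategy often amounts to keeping the loaded model outside the handler so that warm invocations reuse it. A sketch under the same hypothetical hotel-price setup as before; the module-level cache survives only while the platform keeps the instance warm.

```python
_model_cache = None  # survives between invocations while the instance stays warm

def get_model():
    """Load the model on the first (cold) invocation, reuse it on warm ones."""
    global _model_cache
    if _model_cache is None:
        # Cold start: this is where the ~2 s model-loading cost is paid.
        _model_cache = lambda request: {"Hotel A": 450, "Hotel B": 520, "Hotel C": 380}
    return _model_cache

def handle_event(event):
    model = get_model()   # warm invocations skip the expensive load
    return model(event)   # only the processing time remains

# The first call pays the load cost; later calls on the same instance do not.
print(handle_event({"destination": "Aspen"}))
print(handle_event({"destination": "Aspen"}))
```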

Insight

Modern serverless platforms (AWS Lambda, Google Cloud Functions) have dramatically reduced cold start times. For many ML applications, the cost savings outweigh the occasional latency spike from cold starts.

Function-as-a-Service vs Traditional Deployment:

| | Traditional Deployment | Function-as-a-Service |
| --- | --- | --- |
| Deploy | Complex (servers, load balancers, monitoring) | Simple (upload code) |
| Scale | Manual (predict and provision) | Automatic (based on demand) |
| Monitor | Custom setup (logs, metrics, alerts) | Built-in (platform provides) |
| Maintain | High ops overhead | Low ops overhead |
| Cost | Fixed (pay for capacity) | Variable (pay per execution) |
| Time to production | 2-4 weeks | 1 day |

4.4. Exercises

  1. Suppose we allocate the same amount of computational resources over the lifetime of the model serving system for hotel price prediction. What would the resource utilization rate look like over time?

  2. Are the replicated services or sharded services long-running systems?

  3. Is event-driven processing stateless or stateful?


5. Answers to Exercises

Section 2.4

  1. Stateless - Each replica processes requests independently without maintaining state between requests.

  2. Model server replicas would not know which requests to process, leading to conflicts and duplicate work when multiple replicas try to process the same requests.

  3. Yes, but only if the single server has no more than 1.44 minutes of downtime per day (99.9% availability requirement).

Section 3.4

  1. Yes, vertical scaling helps, but it would decrease overall resource utilization because most requests don't need the extra capacity.

  2. Stateful - Shards must maintain partial results from processing their segments until merging is complete.

Section 4.4

  1. Resource utilization would vary significantly over time - very low during off-peak periods and potentially overloaded during holidays, leading to poor efficiency.

  2. Yes - Both replicated and sharded services require servers to keep running continuously to accept user requests, with computational resources allocated and occupied at all times.

  3. Stateless - Event-driven functions process each request independently without maintaining state between invocations.


Summary

What We Learned:

  • ✅ Model Serving Fundamentals: transforming trained models into production systems
  • ✅ Replicated Services Pattern: horizontal scaling to handle more concurrent requests
  • ✅ Sharded Services Pattern: processing large requests that exceed single-server capacity
  • ✅ Event-Driven Processing Pattern: dynamic resource allocation for variable traffic

Pattern Selection Guide:

| Challenge | Solution | When to Use |
| --- | --- | --- |
| Too many requests | Replicated Services | Predictable traffic, stateless processing |
| Requests too large | Sharded Services | Large payloads, divisible workloads |
| Variable traffic | Event-Driven | Unpredictable spikes, cost optimization |

Performance Improvements:

  • Replicated Services: Linear throughput scaling (3x servers ≈ 3x capacity)
  • Sharded Services: Handle arbitrarily large requests (20x larger than memory)
  • Event-Driven: 80%+ cost reduction with automatic scaling

Real-World Impact:

Insight

Production ML serving systems typically combine all three patterns: replicated services for base load, sharding for large requests, and event-driven scaling for traffic spikes. This creates robust, cost-effective serving infrastructure.

Architecture Evolution:

Typical Production Evolution:
  • Single Server (basic serving) → Replicated Services (scalable traffic) → Add Sharding (handle large requests) → Event-Driven Scaling (cost-optimized & dynamic)

Next Steps:

In Chapter 5, we'll explore workflow patterns that orchestrate the entire ML pipeline - from data ingestion through training to serving. You'll learn to build automated systems that manage the complete lifecycle.


Remember: Model serving patterns bridge the gap between research and production. Master these, and you can serve any model to any scale of users.


Previous: Chapter 3: Distributed Training Patterns | Next: Chapter 5: Workflow Patterns