Chapter 4: Model Serving Patterns

From trained models to production magic - serving millions of predictions per second

What is Model Serving?
Replicated Services Pattern: Handling Growing Traffic
- 2.1. The Problem: Single Server Bottleneck
- 2.2. The Solution: Horizontal Scaling with Replicas
- 2.3. Discussion: High Availability and Load Balancing
- 2.4. Exercises
Sharded Services Pattern: Processing Large Requests
Event-Driven Processing Pattern: Dynamic Resource Allocation
- 4.1. The Problem: Variable Traffic Patterns
- 4.2. The Solution: On-Demand Resource Utilization
- 4.3. Discussion: When to Use Event-Driven Architecture
- 4.4. Exercises
Answers to Exercises
Summary

Introduction

Restaurant Analogy: You've learned to cook amazing dishes (train models), but now comes the real challenge: running a restaurant that serves hundreds of customers every hour. Model serving is the art of turning your carefully crafted models into production systems that can handle real-world traffic, from a single user to millions of concurrent requests.

Think of the transformation:

Training: Like perfecting a recipe in your kitchen
Serving: Like running a restaurant chain that serves that recipe to millions

This chapter explores three fundamental patterns that make this transformation possible: replicated services for handling more customers, sharded services for processing large orders, and event-driven processing for handling unpredictable rush hours.

1. What is Model Serving?

In plain English: Model serving is like setting up a restaurant where customers (users) come with questions (input data), and your trained chef (the model) quickly prepares answers (predictions) for them. The challenge is doing this for millions of customers simultaneously while keeping everyone happy.

In technical terms: Model serving is the process of loading a previously trained machine learning model to generate predictions or make inferences on new input data in a production environment.

Why it matters: The best-trained model is worthless if it can't serve predictions quickly, reliably, and cost-effectively to real users. Model serving bridges the gap between research and production impact.

Model Serving in the ML Pipeline:

ML Pipeline Overview

Data
Ingestion

→

Model
Training

→

Model
Serving ⭐

↓

Users (Millions!)

Traditional vs Distributed Model Serving:

Traditional Serving

Distributed Serving

Computational Resources

Personal laptop or single server

Cluster of machines

Dataset Location

Local disk

Remote distributed database

Model & Data Size

Small enough for single machine

Medium to extremely large

Traffic Handling

Sequential processing

Concurrent processing

Availability

Single point of failure

High availability

Scaling

Vertical (bigger machine)

Horizontal (more machines)

Real-World Example: YouTube Video Tagging

Remember our YouTube-8M model from Chapter 3? Now we need to serve it to real users:

User Upload Flow

📤

User uploads video

"My cat playing with a ball"

↓

⚙️

Model Serving System

Processes video and extracts features

↓

🏷️

Predicted tags

["Pet", "Animal", "Toy", "Indoor"]

Requirements:

Response time: < 200ms
Accuracy: > 95%
Availability: 99.9% uptime

The Scale Challenge:

📈

Serving Requirements Growth

Day 1 (Beta): 10 requests/hour → Single server
Month 1 (Launch): 1,000 requests/hour → Need replicas
Year 1 (Popular): 100,000 requests/hour → Need sharding
Year 3 (Viral): 10M requests/hour → Need everything

⚠️

Additional Challenges

Video files grow from 10MB → 4K videos (1GB+)
User expectations: instant results
Business requirements: 24/7 availability
Cost pressure: efficient resource usage

Serving Architecture Overview:

Model Serving System Architecture

User
Uploads
Video

Request→

Model
Server

↓

Model
Storage

Response (Tags)←

2. Replicated Services Pattern: Handling Growing Traffic

2.1. The Problem: Single Server Bottleneck

In plain English: Imagine a popular restaurant with only one chef. During lunch rush, orders pile up, customers wait 2 hours for food, and many leave hungry and angry. This is exactly what happens with single-server model serving.

In technical terms: A single server has limited CPU, memory, and I/O capacity. When concurrent requests exceed these limits, they queue up, causing exponentially increasing response times and eventually service degradation or failure.

Why it matters: User expectations for web services are ruthless - a 3-second delay can cause 40% of users to abandon a site. Single-server bottlenecks directly translate to lost customers, revenue, and reputation.

The Single Server Reality:

Single Server Model Serving

Request Queue

[Req #2]

[Req #3]

[Req #4]

[Req #5]

...

[Req #47]

→

Single Model Server
Processing: Request #1
Status: Busy
Queue: 47 requests waiting
Average wait: 23 minutes
❌ User Experience: Terrible

YouTube Video Tagging - Traffic Growth Story:

Week 1: Beta test

10 videos/day → Response: 100ms ✅

↓

Month 1: Public launch

1,000 videos/day → Response: 2 seconds ⚠️

↓

Month 3: Viral social media post

10,000 videos/day → Response: 20 seconds ❌

↓

Month 6: News coverage

100,000 videos/day → Response: 5 minutes ❌

Warning

Result: Users abandon the service. Solution needed: Horizontal scaling

Request Processing Timeline:

Single Server Processing (Sequential):
Time:     0s        5s       10s       15s       20s
Request:  [███ A ███][██ B ██][████ C ████][█ D █]...
Status:   Processing  Waiting   Waiting   Waiting

User A: Happy (5s response)
User B: Annoyed (10s response)
User C: Frustrated (15s response)
User D: Abandoning (20s+ response)

Problem: Each user waits for all previous users

YouTube-8M Video Examples:

🐱

Queue Position 1

"Cat playing with yarn"
File size: 45 MB
Processing time: ~3 seconds
Expected tags: ["Pet", "Animal", "Indoor"]

🚵

Queue Position 2

"Mountain biking adventure"
File size: 120 MB
Processing time: ~8 seconds
Expected tags: ["Sport", "Outdoor", "Vehicle"]

🍝

Queue Position 3

"Cooking pasta tutorial"
File size: 80 MB
Processing time: ~5 seconds
Expected tags: ["Food", "Education", "Indoor"]

Warning

Total wait for Position 3: 16 seconds (unacceptable!)

2.2. The Solution: Horizontal Scaling with Replicas

In plain English: Instead of making one chef work faster (vertical scaling = buying a faster stove), hire more chefs (horizontal scaling = add more servers). Each chef can work on different orders simultaneously.

In technical terms: Horizontal scaling deploys multiple identical instances (replicas) of the model server behind a load balancer. Each replica is stateless and can independently process requests. The load balancer distributes incoming traffic across replicas using algorithms like round-robin or least-connections.

Why it matters: Horizontal scaling is the foundation of modern cloud infrastructure. It provides linear throughput scaling, built-in redundancy, and cost-effective capacity growth without expensive hardware upgrades.

Horizontal vs Vertical Scaling:

Vertical Scaling (Scale Up)

Horizontal Scaling (Scale Out)

Configuration

Single Server: 4 cores, 8 GB RAM → Upgrade to: Powerful Server: 32 cores, 128 GB RAM

3 Servers: Each with 4 cores, 8 GB RAM

Pros

✅ Simple to implement

✅ Cost effective, ✅ Fault tolerant, ✅ Scalable

Cons

❌ Expensive, ❌ Single point of failure

Requires load balancer

Replicated Services Architecture:

Replicated Model Serving

User Requests

↓

Load Balancer

↓

Replica 1
(Serving)

Model Storage

↓

Replica 2
(Serving)

Model Storage

↓

Replica 3
(Serving)

Model Storage

Key Property: Each replica is stateless and independent

Parallel Processing Timeline:

Replicated Servers Processing (Parallel):
Time:     0s        5s       10s       15s       20s
Server 1: [███ A ███][███ E ███][███ I ███]...
Server 2: [██ B ██][████ F ████][██ J ██]...
Server 3: [████ C ████][█ G █][███ K ███]...

User A: Happy (5s response)   ✅
User B: Happy (5s response)   ✅
User C: Happy (5s response)   ✅
All users served simultaneously!

Improvement: 3x throughput, consistent response times

Load Balancer Algorithms:

🔄

Round Robin

Request 1 → Server A
Request 2 → Server B
Request 3 → Server C
Request 4 → Server A (cycle repeats)
Simple and fair distribution

⚖️

Least Connections

Server A: 2 active connections
Server B: 1 active connection ← New request goes here
Server C: 3 active connections
Always choose least loaded

⚡

Weighted Round Robin

Server A (powerful): Weight 3 → gets 3/6 requests
Server B (medium): Weight 2 → gets 2/6 requests
Server C (basic): Weight 1 → gets 1/6 requests
Accounts for different server capabilities

Performance Results:

Before (Single Server)

After (3 Replicas + Load Balancer)

Throughput

720 videos/day

2,160 videos/day (3x improvement)

Average response

2-20 seconds

2-5 seconds

Peak response

5+ minutes

8 seconds

User satisfaction

15% ❌

89% ✅

Uptime

95-98%

99.9%

Fault tolerance

None

1 server can fail

2.3. Discussion: High Availability and Load Balancing

In plain English: High availability means your service stays up even when individual servers fail. It's like having backup generators in a hospital - the lights must never go out. Load balancing is the traffic cop that ensures every server gets a fair share of work.

In technical terms: High availability (HA) is achieved through redundancy and health monitoring. Service Level Agreements (SLAs) define acceptable downtime thresholds. Load balancers use readiness probes to detect unhealthy servers and route traffic only to healthy instances.

Why it matters: Downtime is expensive. For e-commerce, every minute offline can cost thousands of dollars. For critical services like healthcare or finance, downtime can be life-threatening or legally problematic. High availability is not a luxury - it's a business requirement.

Three-Nines Availability:

✅

99.9% Availability ("Three Nines")

Allowed downtime per day: 1.44 minutes
Allowed downtime per month: 43.8 minutes
Allowed downtime per year: 8.76 hours
Achievable with: Replicated Services

⭐

99.99% Availability ("Four Nines")

Allowed downtime per day: 8.6 seconds
Allowed downtime per month: 4.38 minutes
Allowed downtime per year: 52.56 minutes
Requires: Advanced redundancy

Readiness Probes:

Server Health Monitoring

Load
Balancer

Health Check→

Server A
Status: ✅ Ready

Health Check (Timeout)→

Server B
Status: ❌ Not Ready

Readiness Probe Checks:

✅ Can connect to model storage
✅ Model loaded successfully
✅ Required memory available
✅ Network connectivity stable

Insight

In production, readiness probes are crucial. A server that appears "running" but can't access the model storage will accept requests but fail them all, creating a terrible user experience.

Failure Scenarios and Recovery:

✅

Normal Operation (3 servers)

Load: 1000 requests/hour, Per server: 333 requests/hour ✅

↓

⚠️

Server B Fails

Load: 1000 requests/hour, Per server: 500 requests/hour ⚠️ (Slower but still working)

↓

❌

Server C Also Fails

Load: 1000 requests/hour, Per server: 1000 requests/hour ❌ (Overloaded, degraded service)

↓

🔄

Auto-scaling Response

Detect high load → Spin up new servers → Wait for readiness probes → Resume normal operation

2.4. Exercises

Are replicated model servers stateless or stateful?
What happens when we don't have a load balancer as part of the model serving system?
Can we achieve three-nines service-level agreements with only one model server instance?

3. Sharded Services Pattern: Processing Large Requests

3.1. The Problem: High-Resolution Videos Overwhelming Memory

In plain English: Replicated services work great for handling more customers, but what if a customer orders something huge - like editing a 4-hour movie? Even with multiple editors, each individual editor might not have enough resources to handle such a massive project alone.

In technical terms: When individual request payloads exceed single-server memory capacity, replication doesn't help. A 20GB video file can't fit in a server with 4GB RAM, regardless of how many such servers you have. Vertical scaling (upgrading servers) is prohibitively expensive for edge cases.

Why it matters: Modern applications often need to handle large files - 4K videos, high-resolution medical images, genomic data, or massive documents. A serving system that rejects these requests loses valuable use cases and customers.

The Large Request Challenge:

Standard Video (720p)

High-Resolution Video (4K)

File size

50 MB

2 GB

Memory needed

200 MB

8 GB

Processing time

3 seconds

45 seconds

Status

✅ Works fine with replicas

❌ Replica only has 4 GB RAM

Professional Video (8K RAW):

File size: 20 GB
Memory needed: 80 GB
Processing time: 10 minutes
Status: ❌ No single server can handle

Memory Overflow Scenario:

Server Resource Limits

Each Replica Configuration:
CPU: 4 cores | RAM: 4 GB | Storage: 100 GB | Network: 1 Gbps

Step 1: Download video

✅ (fits in storage)

↓

Step 2: Load into memory

❌ (20 GB > 4 GB RAM)

↓

Step 3: Process video

❌ (out of memory error)

Result: Return error to user
User Experience: 😡 Very unhappy

Why Vertical Scaling Isn't Practical:

📊

Traffic Reality

95% of videos: Standard resolution (50 MB)
4% of videos: High resolution (2 GB)
1% of videos: Professional (20 GB)

💰

Option 1: Upgrade all replicas

New config: 16 cores, 128 GB RAM, 1 TB storage
Cost: 10x more expensive
Resource utilization: 95% of time using 1% of capacity (wasteful!)
Average utilization: 3% (terrible economics!)

❌

Problem

Paying for resources you rarely use

3.2. The Solution: Divide and Conquer Approach

In plain English: Instead of hiring one super-editor with a massive workstation, divide the movie into scenes and have multiple regular editors work on different scenes simultaneously. Then combine their work into the final product.

In technical terms: Sharding splits large requests into smaller segments that fit within individual server capacity. Each segment is processed by a separate shard (server instance). A sharding function (typically a hash) determines which shard handles each segment. Results are merged after parallel processing.

Why it matters: Sharding enables processing of arbitrarily large requests without expensive vertical scaling. It's the technique behind video processing pipelines, distributed databases, and big data systems. The tradeoff is increased complexity - you now manage state across multiple servers.

Video Sharding Strategy:

Video Sharding Process

🎬

Original 8K Video (20 GB)

"Dog and kid playing in the park"

↓

✂️

Split Video

Divide into meaningful segments

Segment 1
7 GB
"Dog playing"

↓

Shard A
(4 GB RAM)

↓

["Dog", "Animal", "Pet"]

Segment 2
7 GB
"Kid laughing"

↓

Shard B
(4 GB RAM)

↓

["Child", "Human", "Play"]

Segment 3
6 GB
"Park scene"

↓

Shard C
(4 GB RAM)

↓

["Park", "Outdoor", "Nature"]

↓

Merge Results
Final Tags: ["Dog", "Animal", "Pet", "Child", "Human", "Play", "Park", "Outdoor", "Nature"]

Sharded Services Architecture:

Sharded Model Serving

Large Request (20 GB)

↓

Request Splitter

Shard 1
(7 GB)

Model Storage

Shard 2
(7 GB)

Model Storage

Shard 3
(6 GB)

Model Storage

Shard 4
(Standby)

Model Storage

↓

Result Merger

↓

Final Tags

Sharding Function:

Determining Shard Assignment

Sharding Function: hash(video_segment) % num_shards

Example:

Segment 1: hash("dog_playing") % 4 = 1 → Shard 1
Segment 2: hash("kid_laughing") % 4 = 3 → Shard 3
Segment 3: hash("park_scene") % 4 = 0 → Shard 0
Segment 4: hash("final_scene") % 4 = 2 → Shard 2

Key Properties of Hash Function:

✅ Deterministic: Same input → same shard
✅ Uniform distribution: Even load across shards
✅ Fast computation: Minimal overhead

Performance Comparison:

Replicated Services (Failure)

Sharded Services (Success)

Time to result

30 seconds (then failure)

85 seconds (15s split + 60s process + 10s merge)

Result

Error message

Complete tag list

User experience

😡 Frustrated

😊 Happy

Resource utilization

0% (failed)

95% (efficient)

Improvement: 20 GB videos now processable!

3.3. Discussion: Stateful Services and Trade-offs

In plain English: Stateless services are like fast-food workers - each handles any order independently with no memory of past orders. Stateful services are like coordinating a wedding - everyone needs to remember who's handling the cake, who's setting up chairs, and how it all comes together.

In technical terms: Stateless services process each request independently, with no shared state between requests. Stateful services maintain context across requests or coordinate between multiple components. Sharded services are stateful because they must track which segment belongs to which request and merge results correctly.

Why it matters: Stateful systems are harder to build, debug, and scale. But some problems require state - you can't merge sharded results without knowing which results belong together. Understanding when stateful complexity is worth the tradeoff is a key distributed systems skill.

Stateful vs Stateless Services:

Replicated Services (Stateless)

Sharded Services (Stateful)

Independence

✅ Each request is independent

⚠️ Must maintain partial results

Scaling

✅ Easy to scale up/down

⚠️ Complex failure handling

Load Balancing

✅ Simple load balancing

⚠️ Coordination required

Fault Tolerance

✅ Fault tolerant

Can handle large requests

Request Processing

Process any request independently

Must store segment results for merging

Challenges with Sharding:

❌

Problem 1: Meaningless Segments

Original: "Cat jumps over fence"
Segment 1: "Cat jum" → ❌ Meaningless
Segment 2: "ps over" → ❌ Meaningless
Segment 3: "fence" → ✅ Partially meaningful
Solution: Smart segmentation by scene detection

⚖️

Problem 2: Uneven Segment Sizes

Segment 1: 2 GB (fast processing)
Segment 2: 8 GB (slow processing)
Segment 3: 10 GB (very slow processing)
Result: Slowest shard determines total time
Solution: Dynamic load balancing

🔀

Problem 3: Result Merging Complexity

Duplicate tags: ["dog", "dog", "animal"]
Conflicting tags: ["indoor", "outdoor"]
Context loss: ["running" without "person"]
Solution: Intelligent merging algorithms

Insight

Sharding works best for "embarrassingly parallel" problems where segments can be processed independently. Video analysis is ideal, but tasks requiring global context (like sentiment analysis of entire documents) are more challenging.

When to Use Sharding:

✅

Use Sharding When

Individual requests > single server capacity
Work can be divided into independent segments
Segment results can be meaningfully combined
Complex state management is acceptable
Examples: Video/audio processing, large image analysis, batch data, document analysis by pages

❌

Don't Use Sharding When

All requests fit on single server
Work requires global context
Segments would be meaningless
Team lacks distributed systems expertise
Examples: Real-time chat, financial transactions, sequential processing, small requests

3.4. Exercises

Would vertical scaling be helpful when handling large requests?
Are the model server shards stateful or stateless?

4. Event-Driven Processing Pattern: Dynamic Resource Allocation

4.1. The Problem: Variable Traffic Patterns

In plain English: Traditional model serving is like a taxi stand with a fixed number of taxis always waiting, even at 3 AM when no customers exist. Event-driven serving is like Uber - drivers (resources) appear on-demand when riders (requests) need them.

In technical terms: Fixed resource allocation forces a choice between over-provisioning (expensive, wasteful) or under-provisioning (poor user experience). Event-driven architecture allocates resources dynamically based on actual demand, scaling from zero to thousands of instances automatically.

Why it matters: Most real-world traffic is bursty and unpredictable. Holiday shopping spikes, viral social media posts, or news events can cause 10-100x traffic increases within minutes. Fixed capacity either fails during peaks or wastes money during troughs. Event-driven serving solves both problems.

Real-World Example: Holiday Hotel Price Prediction

Let's build a system that predicts hotel prices for holiday bookings:

Hotel Price Prediction System

📝

User Input

Destination: "Aspen, Colorado" | Dates: "Dec 24-28, 2024" | Guests: 4 people | Budget: $500/night

↓

🤖

ML Model Processing

Historical price data, seasonal patterns, local events, demand forecasting

↓

💰

Predicted Prices

Hotel A: $450/night (92% confidence) | Hotel B: $520/night (88% confidence) | Hotel C: $380/night (95% confidence)

Traffic Pattern Reality:

Holiday Booking Traffic Over Time:
Requests
per Hour
  ↑     🎄Christmas                           🏖️Summer
  │      Peak                                  Peak
  │        ┌─┐                                  ┌─┐
  │    ┌─┐ │ │     ┌─┐                      ┌─┐│ │
  │  ┌─│ │ │ │  ┌─┐│ │ ┌─┐              ┌─┐│ ││ │
  │─┴┴─┴─┴─┴─┴──┴─┴┴─┴─┴─┴─────────────┴─┴┴─┴┴─┴────► Time
     J F M A M J J A S O N D J F M A M J J A S O N D

Problems with Fixed Resources:
❌ Jan-Oct: 90% resources idle (wasted money)
❌ Nov-Dec: 200% demand (poor user experience)
❌ Unpredictable spikes: Conference announcements

Resource Allocation Challenge:

📊

Option 1: Provision for Average Load

Resources: 2 CPUs, 20 GB memory
✅ Works 70% of year
❌ Crashes during holidays
❌ Angry customers during peak season

📈

Option 2: Provision for Peak Load

Resources: 20 CPUs, 200 GB memory
✅ Handles all traffic
❌ 90% idle most of the year
❌ Very expensive ($50,000/year wasted)

👷

Option 3: Manual Scaling

Resources: Adjust monthly based on calendar
⚠️ Better than fixed
❌ Can't predict unexpected events
❌ Requires constant monitoring
❌ Human error prone

Unexpected Event Example:

Surprise Traffic Spike

📰

December 15th, 2024

📰 "Major Tech Conference Announced in Aspen!"

↓

✅

9:00 AM

Normal traffic (50 requests/hour)

↓

⚠️

11:00 AM

500 requests/hour (10x normal!)

↓

❌

12:00 PM

1,000 requests/hour (20x normal!)

↓

💥

1:00 PM

System completely overwhelmed | Response time: timeout | Success rate: 10% | Business impact: Lost bookings worth $500,000

4.2. The Solution: On-Demand Resource Utilization

In plain English: Instead of having servers running 24/7 waiting for work, spawn new "workers" only when requests arrive. Each worker handles one request and then disappears. It's like hiring temporary staff only during rush hours.

In technical terms: Event-driven processing uses Function-as-a-Service (FaaS) platforms where code executes in ephemeral containers triggered by events. The platform manages provisioning, scaling, and deprovisioning automatically. You pay only for actual execution time, not idle capacity.

Why it matters: Event-driven architecture can reduce costs by 80%+ while improving user experience during traffic spikes. It eliminates the over-provisioning vs. under-provisioning dilemma entirely. The tradeoff is cold start latency and stateless-only processing.

Shared Resource Pool Architecture:

Event-Driven Resource Allocation

Shared Resource Pool
CPU: 1000 cores available
RAM: 5 TB available
Storage: 100 TB available

↓

Hotel Price Predict
Current: 200 cores
800 GB RAM

↓

Data Ingestion
Current: 50 cores
200 GB RAM

↓

Model Training
Current: 100 cores
1 TB RAM

Event-Driven Scaling:

Traffic spike detected
Hotel service requests 500 more cores
Resource pool grants request
Auto-scale: 200 → 700 cores in 2 minutes
Handle spike successfully

Event-Driven Processing Flow:

Step 1: User Request Arrives

User: "Predict hotel prices for Aspen Dec 24-28"

↓

Step 2: Event Trigger

System: "Hotel prediction event detected" | Request resource allocation: 4 CPU cores, 16 GB RAM

↓

Step 3: Resource Provisioning

Pool: "Allocated resources to new function instance" | Startup time: 800ms

↓

Step 4: Model Loading & Processing

Instance: "Loading hotel price model from storage" | Processing: "Running prediction on input data" | Duration: 1.2 seconds

↓

Step 5: Return Results & Cleanup

User: Receives predicted prices | System: "Request completed, releasing resources" | Pool: "Resources returned to pool"

Total Response Time: 2 seconds | Resource Utilization: 100% (no idle time!)

Traffic Spike Handling:

🌤️

Normal Day (50 requests/hour)

Active instances: 2
Resource usage: 8 cores, 32 GB RAM
Response time: 1.5 seconds
Cost: $2/hour

🔥

Conference Spike (1000 requests/hour)

Auto-scaling triggers
New instances: 40 (scaled up 20x in 3 minutes)
Resource usage: 160 cores, 640 GB RAM
Response time: 1.8 seconds (still fast!)
Cost: $35/hour (only during spike)

☀️

Post-Spike (returns to normal)

Instances automatically scaled down
Resource usage: 8 cores, 32 GB RAM
Response time: 1.5 seconds
Cost: $2/hour

Benefits:

✅ Handled unexpected 20x traffic spike
✅ Maintained good user experience
✅ Only paid for resources when needed
✅ No manual intervention required

Rate Limiting and DDoS Protection:

👤

Anonymous Users

Rate limit: 10 requests/hour
Burst allowance: 3 requests/minute
Queue position: Lower priority
Purpose: Prevent abuse while allowing exploration

👥

Authenticated Users

Rate limit: 100 requests/hour
Burst allowance: 20 requests/minute
Queue position: Higher priority
Purpose: Better service for legitimate users

⭐

Premium Users

Rate limit: 1,000 requests/hour
Burst allowance: 100 requests/minute
Queue position: Highest priority
Purpose: Enterprise-level service

Cost Comparison:

Traditional Fixed Allocation (Peak Provisioning)

Event-Driven Allocation

Resources

1000 cores, 4 TB RAM (always on)

0-1000 cores (dynamic)

Utilization

5% average, 90% peak

95% when active

Monthly cost

$50,000

$8,000

Wasted spend

$47,500/month (95% waste!)

Savings: $42,000/month (84% cost reduction!)

4.3. Discussion: When to Use Event-Driven Architecture

In plain English: Use event-driven serving when your traffic is unpredictable and you care about costs. Stick with always-on servers when you need ultra-low latency or maintain complex state between requests.

In technical terms: Event-driven architecture excels for stateless, bursty workloads where cold start latency (typically 100ms-3s) is acceptable. Long-running services are better for persistent connections, sub-100ms latency requirements, or stateful processing.

Why it matters: Architecture choice significantly impacts both costs and user experience. Event-driven can reduce costs by 80% but adds 500ms-2s of cold start latency. Understanding your workload characteristics determines the right pattern.

Event-Driven vs Long-Running Services:

Use Event-Driven When

Use Long-Running Services When

Traffic

✅ Highly variable or unpredictable

✅ Consistent traffic patterns

Idle time

✅ Long periods of zero/low traffic

Continuous activity expected

Cost

✅ Cost optimization is critical

Performance is priority

State

✅ Stateless processing only

✅ Stateful processing needed

Latency

Can tolerate cold start delays

✅ Low latency requirements (< 100ms)

Examples:

⚡

Event-Driven Good For

Batch processing
Seasonal applications
Webhook handlers
Image/video processing
ETL pipelines

🔄

Long-Running Good For

Real-time chat
Live streaming
Database connections
Gaming servers
Financial trading

Cold Start Challenges:

Function Cold Start Analysis

❄️

Cold Start Timeline

Container provisioning: 200ms
Language runtime: 300ms
Model loading: 2,000ms
First request: 500ms
Total cold start: 3 seconds

🔥

Warm Request Timeline

Processing only: 500ms
6x faster than cold start!

Optimization Strategies:

💾

Model Caching

Pre-load frequently used models

♨️

Warm Pools

Keep some instances ready

📊

Predictive Scaling

Anticipate traffic patterns

🪶

Lightweight Models

Reduce loading time

⏳

Progressive Loading

Load model parts on demand

Production Pattern:

90% requests: Warm instances (500ms)
10% requests: Cold start (3s) during scale-up
Average response: 750ms (acceptable for many use cases)

Insight

Modern serverless platforms (AWS Lambda, Google Cloud Functions) have dramatically reduced cold start times. For many ML applications, the cost savings outweigh the occasional latency spike from cold starts.

Function-as-a-Service vs Traditional Deployment:

Traditional Deployment

Function-as-a-Service

Deploy

Complex (servers, load balancers, monitoring)

Simple (upload code)

Scale

Manual (predict and provision)

Automatic (based on demand)

Monitor

Custom setup (logs, metrics, alerts)

Built-in (platform provides)

Maintain

High ops overhead

Low ops overhead

Cost

Fixed (pay for capacity)

Variable (pay per execution)

Time to production

2-4 weeks

1 day

4.4. Exercises

Suppose we allocate the same amount of computational resources over the lifetime of the model serving system for hotel price prediction. What would the resource utilization rate look like over time?
Are the replicated services or sharded services long-running systems?
Is event-driven processing stateless or stateful?

5. Answers to Exercises

Section 2.4

Stateless - Each replica processes requests independently without maintaining state between requests.
Model server replicas would not know which requests to process, leading to conflicts and duplicate work when multiple replicas try to process the same requests.
Yes, but only if the single server has no more than 1.44 minutes of downtime per day (99.9% availability requirement).

Section 3.4

Yes, vertical scaling helps, but it would decrease overall resource utilization because most requests don't need the extra capacity.
Stateful - Shards must maintain partial results from processing their segments until merging is complete.

Section 4.4

Resource utilization would vary significantly over time - very low during off-peak periods and potentially overloaded during holidays, leading to poor efficiency.
Yes - Both replicated and sharded services require servers to keep running continuously to accept user requests, with computational resources allocated and occupied at all times.
Stateless - Event-driven functions process each request independently without maintaining state between invocations.

Summary

What We Learned:

✅ Model Serving Fundamentals: Transforming trained models into production systems ✅ Replicated Services Pattern: Horizontal scaling to handle more concurrent requests ✅ Sharded Services Pattern: Processing large requests that exceed single-server capacity ✅ Event-Driven Processing Pattern: Dynamic resource allocation for variable traffic

Pattern Selection Guide:

Challenge	Solution	When to Use
Too many requests	Replicated Services	Predictable traffic, stateless processing
Requests too large	Sharded Services	Large payloads, divisible workloads
Variable traffic	Event-Driven	Unpredictable spikes, cost optimization

Performance Improvements:

Replicated Services: Linear throughput scaling (3x servers ≈ 3x capacity)
Sharded Services: Handle arbitrarily large requests (20x larger than memory)
Event-Driven: 80%+ cost reduction with automatic scaling

Real-World Impact:

Insight

Production ML serving systems typically combine all three patterns: replicated services for base load, sharding for large requests, and event-driven scaling for traffic spikes. This creates robust, cost-effective serving infrastructure.

Architecture Evolution:

Typical Production Evolution:
Single Server → Replicated Services → Add Sharding → Event-Driven Scaling
     ↓                ↓                    ↓               ↓
   Basic            Scalable          Handle Large    Cost Optimized
  Serving          Traffic           Requests        & Dynamic

Next Steps:

In Chapter 5, we'll explore workflow patterns that orchestrate the entire ML pipeline - from data ingestion through training to serving. You'll learn to build automated systems that manage the complete lifecycle.

Remember: Model serving patterns bridge the gap between research and production. Master these, and you can serve any model to any scale of users.

Previous: Chapter 3: Distributed Training Patterns | Next: Chapter 5: Workflow Patterns

Table of Contents​

Introduction​

1. What is Model Serving?​

2. Replicated Services Pattern: Handling Growing Traffic​

2.1. The Problem: Single Server Bottleneck​

2.2. The Solution: Horizontal Scaling with Replicas​

2.3. Discussion: High Availability and Load Balancing​

2.4. Exercises​

3. Sharded Services Pattern: Processing Large Requests​

3.1. The Problem: High-Resolution Videos Overwhelming Memory​

3.2. The Solution: Divide and Conquer Approach​

3.3. Discussion: Stateful Services and Trade-offs​

3.4. Exercises​

4. Event-Driven Processing Pattern: Dynamic Resource Allocation​

4.1. The Problem: Variable Traffic Patterns​

4.2. The Solution: On-Demand Resource Utilization​

4.3. Discussion: When to Use Event-Driven Architecture​

4.4. Exercises​

5. Answers to Exercises​

Section 2.4​

Section 3.4​

Section 4.4​

Summary​

Table of Contents

Introduction

1. What is Model Serving?

2. Replicated Services Pattern: Handling Growing Traffic

2.1. The Problem: Single Server Bottleneck

2.2. The Solution: Horizontal Scaling with Replicas

2.3. Discussion: High Availability and Load Balancing

2.4. Exercises

3. Sharded Services Pattern: Processing Large Requests

3.1. The Problem: High-Resolution Videos Overwhelming Memory

3.2. The Solution: Divide and Conquer Approach

3.3. Discussion: Stateful Services and Trade-offs

3.4. Exercises

4. Event-Driven Processing Pattern: Dynamic Resource Allocation

4.1. The Problem: Variable Traffic Patterns

4.2. The Solution: On-Demand Resource Utilization

4.3. Discussion: When to Use Event-Driven Architecture

4.4. Exercises

5. Answers to Exercises

Section 2.4

Section 3.4

Section 4.4

Summary