
Chapter 9: A Complete Implementation

Building the end-to-end distributed machine learning system: From Fashion-MNIST to production


Table of Contents

  1. Data Ingestion
  2. Model Training
  3. Model Serving
  4. The End-to-End Workflow
  5. Summary and Exercises

Think of this chapter as building a complete smart factory - we'll construct every production line (data pipeline), assembly station (model training), quality control center (model selection), packaging facility (model serving), and master control system (workflow orchestration) that works together seamlessly.


1. Data Ingestion

The foundation of our distributed ML system is robust data ingestion. We'll implement both single-node and distributed pipelines using the Fashion-MNIST dataset to feed our training infrastructure.

Figure: Data ingestion component. Fashion-MNIST is downloaded (60K training / 10K test images), passed through batching and normalization, and cached. Raw 28×28 grayscale images in the 0-255 range become preprocessed images normalized to the 0-1 range, ready for distributed training.

Fashion-MNIST Dataset Overview:

📊 Dataset Statistics
  • Training Set: 60,000 images (28×28 grayscale)
  • Test Set: 10,000 images (28×28 grayscale)
  • Pixel Range: 0-255 (uint8) → Normalized to 0-1 (float32)
  • Classes: 10 clothing categories
  • Size: ~30MB compressed
👕 10 Fashion Classes
  • 0: T-shirt/top
  • 1: Trouser
  • 2: Pullover
  • 3: Dress
  • 4: Coat
  • 5: Sandal
  • 6: Shirt
  • 7: Sneaker
  • 8: Bag
  • 9: Ankle boot

1.1. Single-Node Data Pipeline

Let's start with a local data pipeline that works on a single machine before scaling to distributed processing.

Core Data Loading Implementation:

import tensorflow_datasets as tfds
import tensorflow as tf

def make_datasets_unbatched():
    """
    Create an unbatched Fashion-MNIST dataset with preprocessing

    Returns:
        tuple: (tf.data.Dataset, tfds.core.DatasetInfo) -- the preprocessed,
        cached, and shuffled training dataset plus its metadata
    """
    def scale(image, label):
        # Convert to float32 and normalize to [0, 1] range
        image = tf.cast(image, tf.float32)
        image /= 255.0
        return image, label

    # Load Fashion-MNIST from TensorFlow Datasets
    datasets, dataset_info = tfds.load(
        name='fashion_mnist',
        with_info=True,
        as_supervised=True
    )

    # Apply preprocessing pipeline
    train_dataset = (datasets['train']
                     .map(scale, num_parallel_calls=tf.data.AUTOTUNE)
                     .cache()
                     .shuffle(buffer_size=10000))

    return train_dataset, dataset_info

Single-node pipeline flow: 📥 raw 28×28 uint8 images (0-255 range) → ⚙️ normalize (convert to float32, divide by 255) → 💾 cache (store in memory for fast access) → 🔀 shuffle (randomize with a 10K buffer).

Inspect Dataset Properties:

# Load and inspect the dataset
ds, info = make_datasets_unbatched()

print("Dataset Information:")
print(f"Dataset: {info.name}")
print(f"Classes: {info.features['label'].names}")
print(f"Total training examples: {info.splits['train'].num_examples}")

# Check dataset structure
print(f"\nDataset element spec:")
print(ds.element_spec)
# Output: (TensorSpec(shape=(28, 28, 1), dtype=tf.float32, name=None),
# TensorSpec(shape=(), dtype=tf.int64, name=None))
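
If you want to confirm that normalization worked, a quick spot check on a single example (reusing the ds and info objects from above) looks like this:

# Take one preprocessed example and confirm pixel values are in [0, 1]
image, label = next(iter(ds.take(1)))
print("Image shape:", image.shape)  # (28, 28, 1)
print("Pixel range:", float(tf.reduce_min(image)), "to", float(tf.reduce_max(image)))
print("Label:", int(label), "->", info.features['label'].names[int(label)])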

Insight

Think of the data pipeline as a smart conveyor belt in a factory: raw materials (images) enter, get processed through stations (normalization, caching), and come out ready for the assembly line (model training). Caching is like having a buffer zone that prevents having to fetch materials from the warehouse repeatedly.

1.2. Distributed Data Pipeline

For distributed training, we need to configure data sharding and proper batch sizing across multiple workers.

Distributed Configuration:

def create_distributed_dataset(strategy):
    """
    Create distributed dataset for multi-worker training

    Args:
        strategy: tf.distribute.Strategy object

    Returns:
        tf.data.Dataset: Distributed dataset ready for training
    """
    # Calculate batch sizes for distributed training
    BATCH_SIZE_PER_REPLICA = 64
    GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

    with strategy.scope():
        # Create base dataset
        ds_train = make_datasets_unbatched()[0]

        # Apply batching and repetition
        ds_train = ds_train.batch(GLOBAL_BATCH_SIZE).repeat()

        # Configure automatic sharding
        options = tf.data.Options()
        options.experimental_distribute.auto_shard_policy = \
            tf.data.experimental.AutoShardPolicy.DATA
        ds_train = ds_train.with_options(options)

    return ds_train
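
As a quick usage sketch (when run inside a worker pod the training operator supplies TF_CONFIG for you, as shown later in this chapter; run locally it falls back to a single worker):

# Usage sketch: build the sharded dataset under a multi-worker strategy
strategy = tf.distribute.MultiWorkerMirroredStrategy()
ds_train = create_distributed_dataset(strategy)
print("Replicas in sync:", strategy.num_replicas_in_sync)
print("Element spec:", ds_train.element_spec)
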
Figure: Multi-worker data distribution. The 60K-image Fashion-MNIST dataset is automatically sharded (DATA policy) across Worker 0 (images 0-19999), Worker 1 (images 20000-39999), and Worker 2 (images 40000-59999), each with a per-replica batch size of 64.

Advanced Data Sharding Explanation:

# Sharding ensures no data overlap between workers:
# each worker keeps only its own slice of the dataset.

# Example with 2 workers and a per-replica batch size of 64:
#   Global batch size = 64 * 2 = 128
#   Because sharding is applied after batching, AutoShardPolicy.DATA has
#   every worker read the batched dataset but keep only every 2nd batch
#   (worker 0 keeps batches 0, 2, 4, ...; worker 1 keeps batches 1, 3, 5, ...)

# This guarantees:
# 1. No duplicate processing
# 2. Deterministic data distribution
# 3. Efficient gradient aggregation
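
To build intuition for what the DATA policy does per worker, here is a minimal, self-contained sketch that mimics the element-level split by hand with tf.data.Dataset.shard; the worker count and indices are illustrative, not values you set yourself when the training operator manages the cluster:

import tensorflow as tf

# Toy dataset standing in for the batched Fashion-MNIST pipeline
ds = tf.data.Dataset.range(10)

NUM_WORKERS = 2  # hypothetical cluster size
for worker_index in range(NUM_WORKERS):
    # shard(n, i) keeps every n-th element starting at offset i, which is
    # conceptually what AutoShardPolicy.DATA does for each worker
    shard = ds.shard(num_shards=NUM_WORKERS, index=worker_index)
    print(f"Worker {worker_index}: {[int(x) for x in shard.as_numpy_iterator()]}")
# Worker 0: [0, 2, 4, 6, 8]
# Worker 1: [1, 3, 5, 7, 9]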

1.3. Exercises

  1. Q: What happens if we don't configure automatic sharding for distributed training?

    A: Without automatic sharding, all workers would process the same data, leading to redundant computation and incorrect gradient updates since each worker would see the full dataset instead of unique shards.

  2. Q: Why do we use cache() in the data pipeline?

    A: Caching stores preprocessed data in memory, eliminating the need to repeatedly decode and normalize images in subsequent epochs, significantly speeding up training.


2. Model Training

Our training component implements three CNN variants with distributed execution, followed by automated model selection. This demonstrates the collective communication pattern at scale.

Distributed Model Training
Model Type 1: Basic CNN
Conv Layers
W1 · W2
AllReduce
Model Type 2: CNN + Dropout
Conv + Dropout
W1 · W2
AllReduce
Model Type 3: CNN + BatchNorm
Conv + BatchNorm
W1 · W2
AllReduce
compare performance

Model Selection (Best Accuracy)

2.1. Model Definition and Single-Node Training

Let's define three CNN architectures that compete for the best performance.

Basic CNN Model:

from tensorflow.keras import layers, models
import tensorflow as tf

def build_and_compile_cnn_model():
    """
    Build basic CNN model for Fashion-MNIST classification

    Returns:
        tf.keras.Model: Compiled CNN model
    """
    print("Training Basic CNN model")

    model = models.Sequential([
        # Input layer with explicit name for serving
        layers.Input(shape=(28, 28, 1), name='image_bytes'),

        # Convolutional feature extraction
        layers.Conv2D(32, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),

        # Classification head
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dense(10, activation='softmax')  # 10 fashion classes
    ])

    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

    model.summary()
    return model
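
As a quick sanity check on the parameter count cited in the comparison below, you can instantiate the model and ask Keras directly; this is just a usage sketch:

# Verify the model size reported in the comparison table
model = build_and_compile_cnn_model()
print(f"Trainable parameters: {model.count_params():,}")
# Expected: roughly 93K parameters for this architecture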

Enhanced CNN with Dropout:

def build_and_compile_cnn_model_with_dropout():
    """
    Build CNN model with dropout regularization

    Returns:
        tf.keras.Model: Compiled CNN model with dropout
    """
    print("Training CNN model with Dropout")

    model = models.Sequential([
        layers.Input(shape=(28, 28, 1), name='image_bytes'),

        # Feature extraction with dropout
        layers.Conv2D(32, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.5),  # Regularization layer
        layers.Conv2D(64, (3, 3), activation='relu'),

        # Classification with regularization
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dense(10, activation='softmax')
    ])

    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

    return model

Enhanced CNN with Batch Normalization:

def build_and_compile_cnn_model_with_batch_norm():
    """
    Build CNN model with batch normalization

    Returns:
        tf.keras.Model: Compiled CNN model with batch normalization
    """
    print("Training CNN model with Batch Normalization")

    model = models.Sequential([
        layers.Input(shape=(28, 28, 1), name='image_bytes'),

        # Normalized feature extraction
        layers.Conv2D(32, (3, 3), activation='relu'),
        layers.BatchNormalization(),
        layers.Activation('sigmoid'),
        layers.MaxPooling2D((2, 2)),

        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.BatchNormalization(),
        layers.Activation('sigmoid'),
        layers.MaxPooling2D((2, 2)),

        layers.Conv2D(64, (3, 3), activation='relu'),

        # Classification head
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dense(10, activation='softmax')
    ])

    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

    return model

Comparison of the Basic CNN and the enhanced variants:

  • Architecture: Basic CNN uses Conv → Pool → Conv → Pool → Conv → Dense | Dropout adds regularization layers | BatchNorm normalizes activations
  • Parameters: Basic CNN has ~93K parameters | Dropout: ~93K | BatchNorm: similar, plus BN parameters
  • Training speed: Basic CNN is the baseline | Dropout: similar | BatchNorm: slightly slower
  • Generalization: Basic CNN may overfit | Dropout: better | BatchNorm: better, with more stable training

Training with Callbacks:

import os
import argparse

import tensorflow as tf

def setup_training_callbacks(args):
    """
    Configure training callbacks for monitoring and checkpointing

    Args:
        args: Command line arguments with directories

    Returns:
        list: Configured callbacks
    """
    checkpoint_prefix = os.path.join(args.checkpoint_dir, "ckpt_{epoch}")

    # Custom callback to monitor learning rate
    class PrintLR(tf.keras.callbacks.Callback):
        def on_epoch_end(self, epoch, logs=None):
            current_lr = self.model.optimizer.learning_rate.numpy()
            print(f'\nLearning rate for epoch {epoch + 1} is {current_lr}')

    # Learning rate decay function
    def lr_decay(epoch):
        return 0.001 * (0.9 ** epoch)

    callbacks = [
        tf.keras.callbacks.TensorBoard(log_dir='./logs'),
        tf.keras.callbacks.ModelCheckpoint(
            filepath=checkpoint_prefix,
            save_weights_only=True
        ),
        tf.keras.callbacks.LearningRateScheduler(lr_decay),
        PrintLR()
    ]

    return callbacks

# Train model with callbacks
def train_single_model(model, dataset, args):
    """Train model with full monitoring setup"""
    callbacks = setup_training_callbacks(args)

    history = model.fit(
        dataset,
        epochs=1,  # Reduced for demo
        steps_per_epoch=70,
        callbacks=callbacks
    )

    return history
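
To tie these pieces together on a single machine, a minimal driver might look like the sketch below; the directory paths are placeholders of our own choosing, not paths the chapter mandates.

# Minimal single-node driver sketch (paths are illustrative)
train_ds, _ = make_datasets_unbatched()
train_ds = train_ds.batch(64)

args = argparse.Namespace(
    checkpoint_dir="/tmp/fashion_ckpt",          # hypothetical location
    saved_model_dir="/tmp/fashion_saved_model"   # hypothetical location
)

model = build_and_compile_cnn_model()
history = train_single_model(model, train_ds, args)
print("Final training accuracy:", history.history["accuracy"][-1])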

2.2. Distributed Model Training

Now let's implement distributed training using TensorFlow's MultiWorkerMirroredStrategy.

Distributed Training Setup:

import json
import os

import tensorflow as tf

def create_distributed_strategy():
    """
    Create and configure distributed training strategy

    Returns:
        tf.distribute.Strategy: Configured distribution strategy
    """
    strategy = tf.distribute.MultiWorkerMirroredStrategy(
        communication_options=tf.distribute.experimental.CommunicationOptions(
            # CollectiveCommunication is a deprecated alias of
            # CommunicationImplementation in recent TensorFlow releases
            implementation=tf.distribute.experimental.CollectiveCommunication.AUTO
        )
    )

    print(f"Number of devices: {strategy.num_replicas_in_sync}")
    return strategy

def distributed_training_pipeline(args):
    """
    Complete distributed training pipeline

    Args:
        args: Command line arguments
    """
    # Create distribution strategy
    strategy = create_distributed_strategy()

    # Configure batch size for distributed training
    BATCH_SIZE_PER_REPLICA = 64
    GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

    with strategy.scope():
        # Create distributed dataset
        ds_train = create_distributed_dataset(strategy)

        # Select model type based on arguments
        if args.model_type == "cnn":
            model = build_and_compile_cnn_model()
        elif args.model_type == "dropout":
            model = build_and_compile_cnn_model_with_dropout()
        elif args.model_type == "batch_norm":
            model = build_and_compile_cnn_model_with_batch_norm()
        else:
            raise ValueError(f"Unsupported model type: {args.model_type}")

    # Train model (outside strategy scope)
    history = model.fit(
        ds_train,
        epochs=1,
        steps_per_epoch=70
    )

    # Save model (only on chief worker)
    save_model_distributed(model, args)

    return model, history
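
The TFJob manifest later in this section launches the training script with --saved_model_dir, --checkpoint_dir, and --model_type flags; a plausible entrypoint that wires those flags into the pipeline (a sketch, not the book's exact script) could look like this:

# Sketch of a script entrypoint matching the flags used in the TFJob manifest
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Multi-worker distributed training")
    parser.add_argument("--saved_model_dir", type=str, required=True,
                        help="Directory where the chief worker saves the final model")
    parser.add_argument("--checkpoint_dir", type=str, required=True,
                        help="Directory for per-epoch checkpoints")
    parser.add_argument("--model_type", type=str, default="cnn",
                        choices=["cnn", "dropout", "batch_norm"],
                        help="Which CNN variant to train")
    args = parser.parse_args()

    distributed_training_pipeline(args)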

Chief Worker Model Saving:

def save_model_distributed(model, args):
    """
    Save model only on chief worker to avoid conflicts

    Args:
        model: Trained model
        args: Command line arguments
    """
    def is_chief():
        """Check if current worker is the chief"""
        tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
        task_index = tf_config.get('task', {}).get('index', 0)
        return task_index == 0

    # Determine save path based on worker role
    if is_chief():
        model_path = args.saved_model_dir
        print("Chief worker: Saving final model")
    else:
        # Worker-specific temporary path
        tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
        task_index = tf_config.get('task', {}).get('index', 0)
        model_path = f"{args.saved_model_dir}/worker_tmp_{task_index}"
        print(f"Worker {task_index}: Saving temporary model")

    model.save(model_path)
    print(f"Model saved to: {model_path}")
Distributed training coordination proceeds in four phases:

  1. 🚀 Initialize workers: Worker 0 (chief), Worker 1, and Worker 2 receive data shards A, B, and C.
  2. Forward and backward pass (parallel): each worker computes gradients on its own data shard independently.
  3. 🔄 AllReduce communication: gradients are aggregated, e.g. (∇w₁ + ∇w₂ + ∇w₃) / 3.
  4. Synchronized update: all workers apply the same averaged gradients to their model copies.

Container and Kubernetes Setup:

# Dockerfile for distributed training
FROM python:3.9

# Install TensorFlow and dependencies
RUN pip install tensorflow==2.11.0 tensorflow_datasets==4.7.0

# Copy training scripts
COPY multi-worker-distributed-training.py /
COPY model-selection.py /
COPY data-ingestion.py /

# Set working directory
WORKDIR /

Build the image and import it into the k3d cluster:

# Build and import to k3d cluster
docker build -f Dockerfile -t kubeflow/multi-worker-strategy:v0.1 .
k3d image import kubeflow/multi-worker-strategy:v0.1 --cluster distml
kubectl config set-context --current --namespace=kubeflow

Persistent Storage Configuration:

# multi-worker-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: strategy-volume
  namespace: kubeflow
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi

Insight

Distributed training is like coordinating a synchronized dance troupe: each dancer (worker) performs their part simultaneously, but they must synchronize at key moments (AllReduce) to maintain perfect coordination. The chief dancer (chief worker) takes responsibility for the final performance recording (model saving).

TFJob Configuration:

# multi-worker-tfjob.yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: multi-worker-training
  namespace: kubeflow
spec:
  runPolicy:
    cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: kubeflow/multi-worker-strategy:v0.1
              imagePullPolicy: IfNotPresent
              command:
                - "python"
                - "/multi-worker-distributed-training.py"
                - "--saved_model_dir"
                - "/trained_model/saved_model_versions/1/"
                - "--checkpoint_dir"
                - "/trained_model/checkpoint"
                - "--model_type"
                - "cnn"
              volumeMounts:
                - mountPath: /trained_model
                  name: training
              resources:
                limits:
                  cpu: 500m
                  memory: 1Gi
          volumes:
            - name: training
              persistentVolumeClaim:
                claimName: strategy-volume

2.3. Model Selection

After training all three models, we implement automated selection based on validation accuracy.

Model Evaluation and Selection:

import os
import shutil

import tensorflow as tf
import tensorflow_datasets as tfds

def evaluate_and_select_best_model():
    """
    Evaluate all trained models and select the best one

    Returns:
        str: Path to the best model
    """
    def scale(image, label):
        """Preprocessing function for evaluation"""
        image = tf.cast(image, tf.float32)
        image /= 255.0
        return image, label

    # Load test dataset
    datasets, _ = tfds.load(
        name='fashion_mnist',
        with_info=True,
        as_supervised=True
    )
    test_ds = (datasets['test']
               .map(scale)
               .cache()
               .shuffle(10000)
               .batch(64))

    best_accuracy = 0.0
    best_model_path = ""
    model_results = {}

    # Evaluate each model
    model_types = ["cnn", "dropout", "batch_norm"]
    for i, model_type in enumerate(model_types, 1):
        model_path = f"trained_model/saved_model_versions/{i}"

        try:
            # Load and evaluate model
            model = tf.keras.models.load_model(model_path)
            loss, accuracy = model.evaluate(test_ds, verbose=0)

            model_results[model_type] = {
                'path': model_path,
                'accuracy': accuracy,
                'loss': loss
            }

            print(f"Model {i} ({model_type}): Accuracy = {accuracy:.4f}, Loss = {loss:.4f}")

            # Track best model
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_model_path = model_path

        except Exception as e:
            print(f"Error evaluating model {i}: {e}")
            continue

    # Copy best model to serving location
    destination = "trained_model/saved_model_versions/4"
    if os.path.exists(destination):
        shutil.rmtree(destination)

    if best_model_path:
        shutil.copytree(best_model_path, destination)
        print(f"\nBest model (accuracy: {best_accuracy:.4f}) copied to {destination}")

    # Print detailed results
    print("\nModel Comparison:")
    print("=" * 50)
    for model_type, results in model_results.items():
        status = "★ SELECTED" if results['path'] == best_model_path else ""
        print(f"{model_type:12} | Acc: {results['accuracy']:.4f} | {status}")

    return destination
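
The workflow in section 4 runs this logic as /model-selection.py; a minimal entrypoint for that script (an assumption about how the file is wired up) is simply:

# Script entrypoint sketch for /model-selection.py
if __name__ == "__main__":
    best_model_destination = evaluate_and_select_best_model()
    print(f"Serving-ready model available at: {best_model_destination}")
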
Model Performance Comparison:

  • Basic CNN: 71.55% accuracy | loss 0.7520
  • CNN + Dropout: 72.67% accuracy | loss 0.7568
  • ★ CNN + BatchNorm: 72.82% accuracy | loss 0.7683 (SELECTED)

Final model saved to: saved_model_versions/4/

2.4. Exercises

  1. Q: Why do we only save the model on the chief worker?

    A: Saving on all workers could cause conflicts and corruption. The chief worker ensures a single, consistent model is saved while other workers may have temporary or incomplete states.

  2. Q: How does automatic data sharding prevent workers from processing duplicate data?

    A: The AutoShardPolicy.DATA ensures each worker receives unique, non-overlapping shards of the dataset based on worker index, guaranteeing no duplicate processing across the cluster.


3. Model Serving

Our serving component implements both single-instance and replicated serving architectures using KServe for production-grade inference.

Figure: Model serving architecture. Client requests (JSON input) pass through a load balancer, which distributes traffic across replicated TensorFlow Serving instances (Replicas 1-3), each serving Fashion model v4.

3.1. Single-Server Model Inference

Let's start with basic model serving before scaling to production.

Serving Signature Configuration:

import tensorflow as tf

def create_serving_signature(model):
    """
    Create serving signature for image input processing

    Args:
        model: Trained TensorFlow model

    Returns:
        dict: Serving signatures for TensorFlow Serving
    """

    def _preprocess_image(bytes_inputs):
        """
        Preprocess base64 encoded images for inference

        Args:
            bytes_inputs: Base64 encoded image tensor

        Returns:
            tf.Tensor: Preprocessed image tensor
        """
        # Decode JPEG from base64
        decoded = tf.io.decode_jpeg(bytes_inputs, channels=1)

        # Resize to model input size
        resized = tf.image.resize(decoded, size=(28, 28))

        # Cast to uint8 (values stay in the 0-255 range here; scale to [0, 1]
        # if you want inference inputs to match the training preprocessing)
        normalized = tf.cast(resized, dtype=tf.uint8)

        return normalized

    @tf.function(input_signature=[
        tf.TensorSpec([None], dtype=tf.string, name='image_bytes')
    ])
    def serve_image_fn(bytes_inputs):
        """
        Serving function for image classification

        Args:
            bytes_inputs: Batch of base64 encoded images

        Returns:
            tf.Tensor: Classification predictions
        """
        # Preprocess all images in batch
        decoded_images = tf.map_fn(
            _preprocess_image,
            bytes_inputs,
            dtype=tf.uint8
        )

        # Run inference
        predictions = model(decoded_images)

        return predictions

    # Create concrete function for serving
    signatures = {
        "serving_default": serve_image_fn.get_concrete_function(
            tf.TensorSpec(shape=[None], dtype=tf.string, name='image_bytes')
        )
    }

    return signatures

def save_model_with_signature(model, model_path):
    """
    Save model with proper serving signature

    Args:
        model: Trained model
        model_path: Path to save model
    """
    signatures = create_serving_signature(model)
    tf.saved_model.save(model, model_path, signatures=signatures)
    print(f"Model with serving signature saved to: {model_path}")

Input Format Specification:

{
  "instances": [
    {
      "image_bytes": {
        "b64": "/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAYEBQYFBAYGBQYHBwYIChAKCgkJChQODwwQFxQYGBcUFhYaHSUfGhsjHBYWICwgIyYnKSopGR8tMC0oMCUoKSj..."
      }
    }
  ]
}

Python Inference Client:

import requests
import json
import base64

def send_inference_request(image_path, model_endpoint):
    """
    Send inference request to model serving endpoint

    Args:
        image_path: Path to image file
        model_endpoint: Serving endpoint URL

    Returns:
        dict: Prediction response
    """
    # Read and encode image
    with open(image_path, 'rb') as image_file:
        image_data = image_file.read()
        encoded_image = base64.b64encode(image_data).decode('utf-8')

    # Prepare request payload
    payload = {
        "instances": [
            {
                "image_bytes": {
                    "b64": encoded_image
                }
            }
        ]
    }

    # Send request
    response = requests.post(
        url=model_endpoint,
        data=json.dumps(payload),
        headers={'Content-Type': 'application/json'}
    )

    if response.status_code == 200:
        return response.json()
    else:
        raise Exception(f"Request failed: {response.status_code} - {response.text}")

# Example usage
result = send_inference_request(
    image_path="test_image.jpg",
    model_endpoint="http://localhost:8080/v1/models/fashion-sample:predict"
)

print("Prediction:", result['predictions'])

3.2. Replicated Model Servers

For production workloads, we implement autoscaling replicated services using KServe.

KServe Inference Service:

# inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fashion-sample
  namespace: kubeflow
spec:
  predictor:
    model:
      modelFormat:
        name: tensorflow
      storageUri: "pvc://strategy-volume/saved_model_versions"

Autoscaling Configuration:

# inference-service-autoscaling.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fashion-sample
  namespace: kubeflow
spec:
  predictor:
    # Autoscaling parameters
    scaleTarget: 1          # Target concurrent requests per pod
    scaleMetric: concurrency
    minReplicas: 1
    maxReplicas: 10

    model:
      modelFormat:
        name: tensorflow
      storageUri: "pvc://strategy-volume/saved_model_versions"

      # Resource constraints
      resources:
        requests:
          cpu: 100m
          memory: 512Mi
        limits:
          cpu: 1000m
          memory: 2Gi

Deployment and Testing:

# Install KServe (if not already installed)
curl -s "https://raw.githubusercontent.com/kserve/kserve/v0.10.0-rc1/hack/quick_install.sh" | bash

# Deploy inference service
kubectl create -f inference-service-autoscaling.yaml

# Check service status
kubectl get inferenceservice
# NAME URL READY AGE
# fashion-sample http://fashion-sample.kubeflow... True 30s

# Port forward for local access
INGRESS_GATEWAY_SERVICE=$(kubectl get svc --namespace istio-system \
--selector="app=istio-ingressgateway" \
--output jsonpath='{.items[0].metadata.name}')

kubectl port-forward --namespace istio-system \
svc/${INGRESS_GATEWAY_SERVICE} 8080:80

Load Testing with Autoscaling:

# Install hey load testing tool
# https://github.com/rakyll/hey

# Get service hostname
SERVICE_HOSTNAME=$(kubectl get inferenceservice fashion-sample \
-o jsonpath='{.status.url}' | cut -d "/" -f 3)

# Send sustained load to trigger autoscaling
hey -z 30s -c 5 -m POST \
-host ${SERVICE_HOSTNAME} \
-D inference-input.json \
"http://localhost:8080/v1/models/fashion-sample:predict"

Autoscaling response to load (example timeline):

  • 0s: 1 RPS → 1 pod, 50 ms latency
  • 10s: 25 RPS → 1 pod, 80 ms latency (scaling triggered)
  • 20s: 50 RPS → 3 pods, 45 ms latency (scaled up)
  • 30s: 75 RPS → 5 pods, 40 ms latency (peak load)
  • 40s: 10 RPS → 2 pods, 50 ms latency (scaling down)
  • 60s: 1 RPS → 1 pod, 45 ms latency (steady state)

Production Performance Results:

Load Test Results (30 seconds, 5 concurrent requests):

Summary:
Total: 30.0475 secs
Slowest: 0.2797 secs
Fastest: 0.0043 secs
Average: 0.0522 secs
Requests/sec: 95.7483
Total data: 230,160 bytes

Response time histogram:
0.004 [1] |
0.032 [1437] |████████████████████████████████████████
0.059 [3] |
0.087 [823] |███████████████████████
0.114 [527] |███████████████

Latency distribution:
50% in 0.0337 secs
90% in 0.0966 secs
99% in 0.1835 secs

Status code distribution:
[200] 2877 responses

Insight

KServe autoscaling is like having a smart restaurant that automatically adjusts staff based on customer flow: during rush hours it adds more servers (pods), and during quiet times it reduces staff to save costs, all while maintaining consistent service quality.

3.3. Exercises

  1. Q: What happens if we don't configure proper serving signatures for our model?

    A: Without proper serving signatures, TensorFlow Serving won't know how to preprocess inputs, leading to inference failures or incorrect predictions due to input format mismatches.

  2. Q: How does KServe determine when to scale up or down?

    A: KServe monitors the configured metric (concurrency, RPS, CPU, memory) and compares it to the target value. It scales up when demand exceeds the target and scales down when demand drops below the target for a sustained period.


4. The End-to-End Workflow

Now we orchestrate all components into a complete, automated ML pipeline using Argo Workflows, demonstrating advanced workflow patterns.

End-to-end ML workflow:

  1. 📊 Data ingestion: download Fashion-MNIST, cache for 1 hour (memoized)
  2. 🤖 Parallel model training: train 3 CNN variants (Basic, Dropout, BatchNorm) with 2 workers each
  3. 🏆 Model selection: evaluate all 3 models and select the best accuracy
  4. 🚀 Model serving: deploy the best model with KServe autoscaling

4.1. Sequential Steps

Let's implement the complete workflow using Argo Workflows with proper dependency management.

Main Workflow Definition:

# workflow.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: fashion-ml-pipeline-
  namespace: kubeflow
spec:
  entrypoint: ml-pipeline

  # Pod cleanup strategy
  podGC:
    strategy: OnPodSuccess

  # Shared persistent volume
  volumes:
    - name: model-storage
      persistentVolumeClaim:
        claimName: strategy-volume

  templates:
    # Main workflow entry point
    - name: ml-pipeline
      steps:
        # Step 1: Data ingestion (with memoization)
        - - name: data-ingestion
            template: data-ingestion-step

        # Step 2: Parallel model training
        - - name: train-models
            template: parallel-training-steps

        # Step 3: Model selection
        - - name: select-best-model
            template: model-selection-step

        # Step 4: Deploy serving
        - - name: deploy-serving
            template: model-serving-step

    # Data ingestion with caching
    - name: data-ingestion-step
      serviceAccountName: argo
      memoize:
        key: "fashion-mnist-data"
        maxAge: "1h"
        cache:
          configMap:
            name: pipeline-cache
            key: data-cache
      container:
        image: kubeflow/multi-worker-strategy:v0.1
        imagePullPolicy: IfNotPresent
        command: ["python", "/data-ingestion.py"]
        volumeMounts:
          - name: model-storage
            mountPath: /trained_model

    # Parallel training steps (all three run in the same step group)
    - name: parallel-training-steps
      steps:
        - - name: train-cnn
            template: train-cnn-model
          - name: train-dropout
            template: train-dropout-model
          - name: train-batchnorm
            template: train-batchnorm-model

    # Individual training job templates
    - name: train-cnn-model
      serviceAccountName: training-operator
      resource:
        action: create
        setOwnerReference: true
        successCondition: status.replicaStatuses.Worker.succeeded = 2
        failureCondition: status.replicaStatuses.Worker.failed > 0
        manifest: |
          apiVersion: kubeflow.org/v1
          kind: TFJob
          metadata:
            generateName: cnn-training-
          spec:
            runPolicy:
              cleanPodPolicy: None
            tfReplicaSpecs:
              Worker:
                replicas: 2
                restartPolicy: Never
                template:
                  spec:
                    containers:
                      - name: tensorflow
                        image: kubeflow/multi-worker-strategy:v0.1
                        imagePullPolicy: IfNotPresent
                        command:
                          - "python"
                          - "/multi-worker-distributed-training.py"
                          - "--saved_model_dir"
                          - "/trained_model/saved_model_versions/1/"
                          - "--checkpoint_dir"
                          - "/trained_model/checkpoint/cnn"
                          - "--model_type"
                          - "cnn"
                        volumeMounts:
                          - mountPath: /trained_model
                            name: training
                        resources:
                          limits:
                            cpu: 500m
                            memory: 1Gi
                          requests:
                            cpu: 100m
                            memory: 512Mi
                    volumes:
                      - name: training
                        persistentVolumeClaim:
                          claimName: strategy-volume

    # Similar templates for dropout and batchnorm models...

    # Model selection step
    - name: model-selection-step
      serviceAccountName: argo
      container:
        image: kubeflow/multi-worker-strategy:v0.1
        imagePullPolicy: IfNotPresent
        command: ["python", "/model-selection.py"]
        volumeMounts:
          - name: model-storage
            mountPath: /trained_model

    # Model serving deployment
    - name: model-serving-step
      serviceAccountName: training-operator
      resource:
        action: create
        setOwnerReference: true
        successCondition: status.modelStatus.states.transitionStatus = UpToDate
        manifest: |
          apiVersion: serving.kserve.io/v1beta1
          kind: InferenceService
          metadata:
            name: fashion-sample
          spec:
            predictor:
              scaleTarget: 1
              scaleMetric: concurrency
              model:
                modelFormat:
                  name: tensorflow
                storageUri: "pvc://strategy-volume/saved_model_versions"
                resources:
                  requests:
                    cpu: 100m
                    memory: 512Mi
                  limits:
                    cpu: 1000m
                    memory: 2Gi

Workflow Execution Monitoring:

# Submit the complete workflow
kubectl create -f workflow.yaml

# Monitor workflow progress
kubectl get workflows

# Watch detailed workflow status
kubectl get workflow fashion-ml-pipeline-xxxxx -o yaml

# Monitor pods during execution
watch kubectl get pods

# Check specific step logs
kubectl logs fashion-ml-pipeline-xxxxx-data-ingestion-xxxxx

4.2. Step Memoization

Implement intelligent caching to avoid redundant data processing and speed up development iterations.

Advanced Memoization Configuration:

# Enhanced data ingestion with smart caching
- name: data-ingestion-step
  serviceAccountName: argo
  memoize:
    key: "fashion-mnist-{{workflow.parameters.dataset-version}}"
    maxAge: "24h"  # Cache for 24 hours
    cache:
      configMap:
        name: pipeline-cache
        key: data-cache
  inputs:
    parameters:
      - name: dataset-version
        value: "3.0.1"  # TensorFlow Datasets version
  container:
    image: kubeflow/multi-worker-strategy:v0.1
    imagePullPolicy: IfNotPresent
    command: ["python", "/data-ingestion.py"]
    args: ["--version", "{{inputs.parameters.dataset-version}}"]
    volumeMounts:
      - name: model-storage
        mountPath: /trained_model

Pipeline timing, run #1 (cold cache) versus run #2 (cache hit):

  • Data ingestion: 3 minutes → under 1 second (cached)
  • Model training: 12 minutes → 12 minutes
  • Model selection: 2 minutes → 2 minutes
  • Model serving: 1 minute → 1 minute
  • Total time: 18 minutes → 15 minutes (about 17% faster)

Cache Management:

# Check cache status
kubectl get configmap pipeline-cache -o yaml

# View cache details
kubectl describe configmap pipeline-cache

# Monitor cache hits in workflow status
kubectl get workflow fashion-ml-pipeline-xxxxx -o jsonpath='{.status.nodes}'

Production Cache Management:

# Python script for cache cleanup
import subprocess
from datetime import datetime, timedelta, timezone

import yaml

def cleanup_expired_caches(max_age_hours=24):
    """
    Clean up expired workflow caches

    Args:
        max_age_hours: Maximum cache age before cleanup
    """
    # Get all cache ConfigMaps
    result = subprocess.run([
        'kubectl', 'get', 'configmap',
        '-l', 'workflows.argoproj.io/configmap-type=Cache',
        '-o', 'yaml'
    ], capture_output=True, text=True)

    if result.returncode != 0:
        print(f"Error getting ConfigMaps: {result.stderr}")
        return

    configmaps = yaml.safe_load(result.stdout)
    # Use a timezone-aware cutoff so it compares cleanly with the
    # timezone-aware creationTimestamp parsed below
    cutoff_time = datetime.now(timezone.utc) - timedelta(hours=max_age_hours)

    for cm in configmaps.get('items', []):
        creation_time = datetime.fromisoformat(
            cm['metadata']['creationTimestamp'].replace('Z', '+00:00')
        )

        if creation_time < cutoff_time:
            cm_name = cm['metadata']['name']
            print(f"Deleting expired cache: {cm_name}")

            subprocess.run([
                'kubectl', 'delete', 'configmap', cm_name
            ])

# Run cache cleanup
cleanup_expired_caches(max_age_hours=24)

Insight

Step memoization in ML workflows is like having a smart assistant that remembers what you've already done: "You downloaded this dataset yesterday and it hasn't changed, so let's skip that step and go straight to the new experiment you want to try." This dramatically speeds up the development cycle.

4.3. Exercises

  1. Q: When would step memoization be inappropriate for data ingestion?

    A: When dealing with streaming data, frequently updated datasets, or when data freshness is critical for model performance. Memoization should be disabled for real-time or rapidly changing data sources.

  2. Q: How does Argo Workflows ensure that dependent steps wait for parallel training to complete?

    A: Argo uses a DAG structure where the model selection step explicitly depends on all three training steps. The workflow engine automatically waits for all parallel training jobs to succeed before proceeding to selection.


5. Summary and Exercises

Complete System Implementation

Congratulations! You've built a production-ready distributed machine learning system that demonstrates every pattern and technology covered in this tutorial.

Complete ML System Architecture

Workflow Orchestration Layer
  • Argo Workflows for end-to-end automation
  • Step memoization for efficiency
  • Parallel execution for speed
  • Comprehensive monitoring and logging

Model Serving Layer
  • KServe-based inference services
  • Autoscaling based on request load
  • Load balancing across replicas
  • REST API with JSON input/output

Model Training Layer
  • Three CNN variants (Basic, Dropout, BatchNorm)
  • Distributed training with AllReduce
  • Automatic model selection based on accuracy
  • Persistent model storage

Data Ingestion Layer
  • Fashion-MNIST pipeline with caching
  • Distributed data sharding
  • Automatic preprocessing and normalization

Key Achievements

📊
Data Ingestion Patterns
  • Batching pattern for memory efficiency
  • Caching pattern for multi-epoch training
  • Sharding pattern for distributed processing
🤖
Model Training Patterns
  • Collective communication with AllReduce
  • Multi-worker coordination
  • Fault-tolerant distributed training
🚀
Model Serving Patterns
  • Replicated services with load balancing
  • Autoscaling based on demand
  • High-availability deployment
🔄
Workflow Patterns
  • Fan-out/fan-in for parallel model training
  • Step memoization for efficiency
  • Asynchronous execution optimization

Production Readiness Checklist

System Reliability:

  • ☑ Fault tolerance in distributed training
  • ☑ Automatic model selection and validation
  • ☑ Autoscaling for varying loads
  • ☑ Persistent storage for model artifacts
  • ☑ Comprehensive error handling
  • ☑ Resource limits and requests configured
  • ☑ Health checks and monitoring

Performance Optimization:

  • ☑ Data pipeline optimization with caching
  • ☑ Parallel model training execution
  • ☑ Efficient batch processing
  • ☑ Load balancing across replicas
  • ☑ Step memoization for fast iterations
  • ☑ Resource-aware scheduling

Operational Excellence:

  • ☑ Complete workflow automation
  • ☑ Model version management
  • ☑ Cache management and cleanup
  • ☑ Monitoring and observability
  • ☑ Configuration management
  • ☑ Disaster recovery procedures

Real-World Deployment Considerations

Scaling to Production:

  1. Data Volume: System handles Fashion-MNIST (70K images) → Scale to millions of images
  2. Model Complexity: CNN variants → Large language models, transformers
  3. Request Load: Local testing → Thousands of requests per second
  4. Geographic Distribution: Single cluster → Multi-region deployment

Cost Optimization:

  • Use spot instances for training workloads
  • Implement intelligent cache management
  • Optimize resource requests and limits
  • Use horizontal pod autoscaling effectively

Security Considerations:

  • Implement network policies
  • Use service accounts with minimal permissions
  • Encrypt data at rest and in transit
  • Regular security updates and vulnerability scanning

Advanced Extensions

Next-Level Enhancements:

  1. A/B Testing Framework: Deploy multiple model versions simultaneously
  2. Model Drift Detection: Monitor for data and model drift in production
  3. Federated Learning: Extend to privacy-preserving distributed training
  4. Multi-Modal Models: Combine text, image, and structured data
  5. Real-Time Training: Implement online learning with streaming data

Insight

You've built more than just a machine learning system - you've created a comprehensive platform that demonstrates enterprise-grade ML engineering. This foundation can be adapted for any domain: recommendation systems, fraud detection, autonomous vehicles, or medical diagnosis. The patterns and practices you've learned scale from startup MVP to Fortune 500 production systems.

Hands-On Exercises

System Validation:

  1. End-to-End Testing:

    • Run the complete workflow from data ingestion to serving
    • Verify model accuracy meets expectations (>70%)
    • Test autoscaling with load generation
    • Validate cache performance with multiple runs
  2. Performance Benchmarking:

    • Measure training time with different worker counts
    • Compare single-node vs distributed training efficiency
    • Test serving latency under various loads
    • Analyze resource utilization patterns
  3. Failure Testing:

    • Simulate worker failures during training
    • Test model serving resilience with pod disruptions
    • Verify cache recovery after system restarts
    • Validate workflow retry mechanisms

Advanced Challenges:

  1. Custom Model Integration:

    • Replace CNN with a transformer architecture
    • Implement custom loss functions
    • Add model compression techniques
    • Integrate with external data sources
  2. Production Hardening:

    • Implement proper logging and monitoring
    • Add authentication and authorization
    • Configure network policies and security
    • Set up backup and disaster recovery
  3. Scale Testing:

    • Test with larger datasets (CIFAR-10, ImageNet)
    • Implement distributed data preprocessing
    • Optimize for GPU acceleration
    • Scale to multi-cluster deployments

Final Project: Your Own ML System

Design Challenge: Build a distributed ML system for your domain using the patterns and technologies from this tutorial.

Requirements:

  • Choose a real dataset and problem
  • Implement at least 3 of the 5 pattern categories
  • Use the complete technology stack
  • Document your design decisions
  • Demonstrate production readiness

Example Domains:

  • E-commerce: Product recommendation with collaborative filtering
  • Finance: Fraud detection with real-time scoring
  • Healthcare: Medical image analysis with privacy constraints
  • IoT: Predictive maintenance with sensor data
  • Social Media: Content moderation with multi-modal inputs

Congratulations! You've completed the journey from distributed ML fundamentals to production implementation. You now have the knowledge and hands-on experience to build scalable, reliable machine learning systems that can handle real-world challenges.

The patterns you've learned are timeless, but the technologies will evolve. Keep the principles, adapt the tools, and continue building amazing ML systems!

