Chapter 8: Overview of Relevant Technologies

Mastering the production ML technology stack: TensorFlow, Kubernetes, Kubeflow, and Argo Workflows


Table of Contents

  1. TensorFlow: The Machine Learning Framework
  2. Kubernetes: The Distributed Container Orchestration System
  3. Kubeflow: Machine Learning Workloads on Kubernetes
  4. Argo Workflows: Container-Native Workflow Engine
  5. Summary and Exercises

Think of this chapter as learning to use a professional kitchen - TensorFlow is your cooking expertise, Kubernetes is the kitchen infrastructure, Kubeflow provides specialized ML cooking equipment, and Argo Workflows orchestrates the entire service from prep to delivery.


1. TensorFlow: The Machine Learning Framework

TensorFlow is the foundation of our ML development stack - a comprehensive platform for building, training, and deploying machine learning models at scale. It's the industry standard used by major tech companies for everything from image classification to recommendation systems.

In plain English: TensorFlow is like a complete woodworking shop - it provides all the tools (functions), raw materials (data structures), and blueprints (architectures) you need to craft machine learning models from scratch or modify existing designs.

In technical terms: TensorFlow is an end-to-end open-source platform for machine learning that provides a comprehensive ecosystem of tools, libraries, and community resources for building and deploying ML-powered applications.

Why it matters: TensorFlow handles the complex mathematics and computational optimization automatically, letting you focus on designing models and solving problems rather than implementing backpropagation algorithms from scratch.
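
To make that concrete, here is a minimal sketch (not from the original chapter) of TensorFlow's automatic differentiation with tf.GradientTape; the toy loss and the single weight are illustrative only:

import tensorflow as tf

# A single trainable weight and a toy loss: L(w) = (3w - 6)^2
w = tf.Variable(1.0)

with tf.GradientTape() as tape:
    loss = tf.square(3.0 * w - 6.0)

# TensorFlow derives dL/dw automatically; no hand-written backpropagation needed.
grad = tape.gradient(loss, w)
print(grad.numpy())  # -18.0, since dL/dw = 2 * (3w - 6) * 3 = -18 at w = 1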

TensorFlow Ecosystem Overview

  • 🌐 TensorFlow.js: browser & Node.js, client-side ML, real-time inference
  • 📱 TensorFlow Lite: mobile & edge, low latency, optimized models
  • 🏭 TFX Pipeline: production MLOps, end-to-end workflows, scalable serving
  • 🚀 TensorFlow Serving: model deployment, REST & gRPC APIs, version management
  • 🏪 TensorFlow Hub: pre-trained models, transfer learning, model repository
  • 📊 Core Framework: research & development, model building, custom training loops

Insight

TensorFlow's ecosystem is like a complete auto manufacturing plant: TensorFlow Core designs the cars, TensorFlow Lite makes compact city cars, TensorFlow Serving runs the dealerships, and TFX manages the entire production line.

1.1 The Basics: Training Models and Hyperparameter Tuning

Environment Setup

Step 1: Create Conda Environment

# Create isolated Python environment
conda create --name dist-ml python=3.9 -y
conda activate dist-ml

# Install TensorFlow
pip install --upgrade pip
pip install tensorflow==2.10.0

# Handle potential NumPy conflicts
pip install numpy --ignore-installed

Step 2: Verify Installation

import tensorflow as tf
print(f"TensorFlow version: {tf.__version__}")
print(f"GPU available: {tf.config.list_physical_devices('GPU')}")

Basic MNIST Classification

Dataset Overview:

MNIST Handwritten Digits Dataset (sample digits 0-9)

Dataset Structure:
  • Training: 60,000 images (28×28 grayscale)
  • Testing: 10,000 images (28×28 grayscale)
  • Classes: 10 digits (0-9)
  • Pixel values: 0-255 (uint8)
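
If you want to eyeball a few of those digits before training, here is a quick sketch (assuming matplotlib is installed in the environment; it is not part of the setup above):

import tensorflow as tf
import matplotlib.pyplot as plt

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()

# Plot the first 10 digits with their labels.
fig, axes = plt.subplots(1, 10, figsize=(15, 2))
for ax, image, label in zip(axes, x_train[:10], y_train[:10]):
    ax.imshow(image, cmap="gray")
    ax.set_title(int(label))
    ax.axis("off")
plt.show()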

Complete Training Pipeline:

Full TensorFlow Code:

import tensorflow as tf

# 1. Load and inspect data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

print(f"Training images shape: {x_train.shape}")  # (60000, 28, 28)
print(f"Training labels shape: {y_train.shape}")  # (60000,)
print(f"Pixel value range: {x_train.min()} to {x_train.max()}")  # 0 to 255

# 2. Preprocessing: Normalize pixel values to [0, 1]
def preprocess(ds):
    return ds / 255.0

x_train = preprocess(x_train)
x_test = preprocess(x_test)

print(f"After preprocessing: {x_train.min()} to {x_train.max()}")  # 0.0 to 1.0

# 3. Define model architecture
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),   # 28×28 → 784 features
    tf.keras.layers.Dense(128, activation='relu'),   # Hidden layer
    tf.keras.layers.Dropout(0.2),                    # Prevent overfitting
    tf.keras.layers.Dense(10, activation='softmax')  # 10-class output
])

# 4. Compile model with optimizer, loss, and metrics
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# 5. Train the model
print("Training model...")
history = model.fit(x_train, y_train, epochs=5)

# 6. Evaluate on test data
test_loss, test_accuracy = model.evaluate(x_test, y_test)
print(f"Test accuracy: {test_accuracy:.4f}")

# 7. Save trained model
model.save('my_model.h5')
print("Model saved successfully!")

Neural Network Training Flow

  1. 📊 Load Data: MNIST dataset
  2. ⚙️ Preprocess: normalize to [0, 1]
  3. 🏗️ Build Model: Sequential layers
  4. 🚀 Train: 5 epochs
  5. Evaluate: test accuracy
  6. 💾 Save: export .h5

Hyperparameter Tuning with Keras Tuner

Advanced Model Optimization:

import tensorflow as tf
import keras_tuner as kt

# Install Keras Tuner first:
# pip install -q -U keras-tuner

def model_builder(hp):
    """Define model with tunable hyperparameters"""
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Flatten(input_shape=(28, 28)))

    # Tunable number of units in hidden layer
    hp_units = hp.Int('units', min_value=32, max_value=512, step=32)
    model.add(tf.keras.layers.Dense(units=hp_units, activation='relu'))
    model.add(tf.keras.layers.Dense(10))

    # Tunable learning rate
    hp_learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=hp_learning_rate),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy']
    )

    return model

# Configure hyperparameter tuner
tuner = kt.Hyperband(
    model_builder,
    objective='val_accuracy',
    max_epochs=10,
    factor=3,
    directory='my_dir',
    project_name='intro_to_kt'
)

# Early stopping to prevent overfitting
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=4
)

# Start hyperparameter search
print("Starting hyperparameter search...")
tuner.search(
    x_train, y_train,
    epochs=30,
    validation_split=0.2,
    callbacks=[early_stop]
)

# Get best hyperparameters and train final model
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]
print(f"Best units: {best_hps.get('units')}")
print(f"Best learning rate: {best_hps.get('learning_rate')}")

# Train final model with best hyperparameters
model = tuner.hypermodel.build(best_hps)
model.fit(x_train, y_train, epochs=50, validation_split=0.2)

# Evaluate final model
final_accuracy = model.evaluate(x_test, y_test)[1]
print(f"Final optimized accuracy: {final_accuracy:.4f}")

Hyperparameter Tuning Process

  1. 🎯 Define Search Space: units 32-512, learning rate in [1e-2, 1e-3, 1e-4]
  2. 🔄 Run Trials: the Hyperband algorithm explores combinations
  3. 📊 Evaluate Performance: track validation accuracy per trial
  4. Select Best Config: choose the highest-performing hyperparameters
  5. 🏆 Train Final Model: use the optimal settings for production

Model Loading and Reuse:

# Load saved model in new session
import tensorflow as tf
model = tf.keras.models.load_model('my_model.h5')

# Use for predictions
predictions = model.predict(x_test[:5])
predicted_classes = tf.argmax(predictions, axis=1)
print(f"Predictions: {predicted_classes.numpy()}")
print(f"Actual: {y_test[:5]}")

Insight

Hyperparameter tuning is like adjusting a recipe: you try different amounts of ingredients (units, learning rate) and cooking times (epochs) to find the perfect combination that produces the best dish (highest accuracy).

1.2 Exercises

  1. Q: Can you use the previously saved model directly for model evaluation?

    A: Yes, via model = tf.keras.models.load_model('my_model.h5'); model.evaluate(x_test, y_test)

  2. Q: Instead of using the Hyperband tuning algorithm, could you try the random search algorithm?

    A: Yes. You can do it by swapping the tuner for kt.RandomSearch, as in the sketch below.
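
A minimal sketch of that swap (the objective and max_trials values here are illustrative, not from the chapter):

import keras_tuner as kt

# Reuses the model_builder function defined earlier; only the tuner type changes.
tuner = kt.RandomSearch(
    model_builder,
    objective='val_accuracy',
    max_trials=10,               # number of random configurations to try
    directory='my_dir',
    project_name='random_search_kt'
)

tuner.search(x_train, y_train, epochs=10, validation_split=0.2)
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]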


2. Kubernetes: The Distributed Container Orchestration System

Kubernetes (K8s) is our distributed infrastructure foundation - it automates deployment, scaling, and management of containerized applications across clusters of machines.

In plain English: Kubernetes is like a smart building manager that automatically assigns office space to teams, ensures everyone has the power and network connections they need, and can quickly reorganize spaces when requirements change.

In technical terms: Kubernetes is an open-source container orchestration platform that automates the deployment, scaling, and operations of application containers across clusters of hosts, providing container-centric infrastructure.

Why it matters: Instead of manually managing which servers run which applications and monitoring their health, Kubernetes automatically handles scheduling, healing failed containers, scaling based on load, and managing service discovery.

Kubernetes Architecture Overview

  • Control Plane: API Server, Scheduler, Controller, etcd; orchestrates the cluster
  • Worker Nodes: Kubelet, Container Runtime, Pods; run workloads and report status back to the control plane
  • ⌨️ kubectl (CLI): the command-line interface for talking to the API Server
  • 🚀 Applications (ML Workloads): your containers, scheduled into Pods on the worker nodes

2.1 The Basics: Clusters, Pods, and Container Management

Setting Up Local Kubernetes Cluster

Step 1: Create Cluster with k3d

# Install k3d (lightweight Kubernetes)
# k3d creates Kubernetes clusters using Docker containers

# Create cluster named 'distml'
k3d cluster create distml --image rancher/k3s:v1.25.3-rc3-k3s1

# Verify cluster creation
kubectl get nodes

Expected Output:

NAME                  STATUS   ROLES                  AGE   VERSION
k3d-distml-server-0   Ready    control-plane,master   1m    v1.25.3+k3s1

Step 2: Inspect Node Details

# Get detailed node information
kubectl describe node k3d-distml-server-0

Key Node Information:

Labels:
  beta.kubernetes.io/arch=arm64
  beta.kubernetes.io/os=linux
  kubernetes.io/hostname=k3d-distml-server-0
  node-role.kubernetes.io/control-plane=true

System Info:
  Operating System:  linux
  Architecture:      arm64
  Container Runtime: containerd://1.5.9-k3s1

Capacity:
  cpu:    4
  memory: 8142116Ki
  pods:   110

Addresses:
  InternalIP: 172.18.0.3
  Hostname:   k3d-distml-server-0

Working with Namespaces

Create and Manage Namespaces:

# Create namespace for our examples
kubectl create ns basics

# Install kubectx/kubens for easier navigation (optional but helpful)
# These tools simplify switching between clusters and namespaces

# List available contexts and namespaces
kubectx # Show clusters
kubens # Show namespaces

# Switch to our cluster and namespace
kubectx k3d-distml
kubens basics

Namespace Structure (inside the Kubernetes cluster)

  • default: system default
  • kube-system: K8s core
  • kube-public: public
  • kube-node-lease: heartbeats
  • basics: our workspace
Understanding Pods

Pod Concept:

Pod Structure - Smallest Deployable Unit

A Pod wraps one or more containers that are scheduled together:
  • Container 1: the application / main logic
  • Container 2: a sidecar, e.g. for logging
  • Shared resources: network (one IP address), storage volumes, and a common lifecycle

Create a Simple Pod:

hello-world.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: whalesay
spec:
  containers:
  - name: whalesay
    image: docker/whalesay:latest
    command: [cowsay]
    args: ["hello world"]

Deploy and Monitor Pod:

# Create the pod
kubectl create -f hello-world.yaml

# Check pod status
kubectl get pods
# Output: NAME       READY   STATUS      RESTARTS   AGE
#         whalesay   0/1     Completed   2          37s

# View pod logs
kubectl logs whalesay

Expected Output:

 _____________ 
< hello world >
 ------------- 
    \
     \
      \
                    ##        .
              ## ## ##       ==
           ## ## ## ##      ===
       /""""""""""""""""___/ ===
  ~~~ {~~ ~~~~ ~~~ ~~~~ ~~ ~ /  ===- ~~~
       \______ o          __/
        \    \        __/
          \____\______/

Inspect Pod Details:

# Get complete pod specification
kubectl get pod whalesay -o yaml

# Get pod info in JSON format
kubectl get pod whalesay -o json

Pod Lifecycle

  • Pending: waiting for scheduling
  • ▶️ Running: container(s) executing
  • Succeeded: completed successfully
  • Failed: error occurred

Common problem states:
  • CrashLoopBackOff: repeatedly failing
  • ImagePullBackOff: cannot pull image
  • Evicted: resource pressure

Insight

Think of Kubernetes like a smart building manager: it knows every room (node), can quickly move teams (pods) to available spaces, and ensures everyone has the resources they need (CPU, memory, storage) while maintaining building-wide coordination.

2.2 Exercises

  1. Q: How do you get the Pod information in JSON format?

    A: kubectl get pod <pod-name> -o json (see also the Python client sketch after these exercises)

  2. Q: Can a Pod contain multiple containers?

    A: Yes. You can define additional containers under pod.spec.containers alongside the existing one, as shown in the sketch below.
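
Both answers can also be exercised programmatically. The following is a minimal sketch using the official kubernetes Python client (an assumption: pip install kubernetes and a kubeconfig pointing at the k3d cluster); the sidecar container is purely illustrative:

from kubernetes import client, config

# Load credentials from the local kubeconfig (e.g. the k3d-distml context).
config.load_kube_config()
v1 = client.CoreV1Api()

# Exercise 1: the API returns the same structured data as `kubectl get pod -o json`.
pod = v1.read_namespaced_pod(name="whalesay", namespace="basics")
print(pod.status.phase)

# Exercise 2: a Pod spec can list several containers sharing network and volumes.
two_container_pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="whalesay-with-sidecar"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(name="whalesay", image="docker/whalesay:latest",
                               command=["cowsay"], args=["hello world"]),
            client.V1Container(name="sidecar", image="alpine:3.7",
                               command=["sh", "-c"], args=["echo sidecar done"]),
        ],
    ),
)
v1.create_namespaced_pod(namespace="basics", body=two_container_pod)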


3. Kubeflow: Machine Learning Workloads on Kubernetes

Kubeflow transforms Kubernetes into a comprehensive ML platform - making it simple to deploy, scale, and manage machine learning workflows on any Kubernetes cluster.

In plain English: Kubeflow is like adding a professional film studio to a construction site - while Kubernetes provides the building infrastructure, Kubeflow adds specialized equipment for producing movies (ML models), from cameras and lighting (training tools) to editing suites (serving infrastructure).

In technical terms: Kubeflow is a machine learning toolkit for Kubernetes that orchestrates complex ML workflows including distributed training, hyperparameter tuning, model serving, and pipeline execution, all as native Kubernetes resources.

Why it matters: Instead of writing custom Kubernetes manifests and management scripts for each ML task, Kubeflow provides high-level abstractions that handle distributed training setup, resource allocation, monitoring, and serving automatically.

Kubeflow Platform: ML Platform on Kubernetes

  • ML Applications: Pipelines, Katib (AutoML), KServe, Notebooks
  • Kubeflow Components: Training Operator, Central Dashboard, Workflow Engine
  • Kubernetes Foundation: container orchestration, resource management, service discovery

Key Components:

  • 🔄 Pipelines (KFP): orchestrate workflows, DAG execution, artifact tracking
  • 🔍 Katib: hyperparameter tuning, AutoML, neural architecture search
  • 🚀 KServe: model serving, autoscaling, multi-framework support
  • 📊 Training Operator: distributed training, TensorFlow/PyTorch, multi-replica jobs
  • 💻 Notebooks: Jupyter environments, interactive development, GPU access
  • 🌐 Central UI: unified interface, pipeline visualization, job monitoring

3.1 The Basics: Distributed Training with TFJob

Setup Kubeflow Training Operator

Step 1: Prepare Namespace and Install Components

# Create dedicated namespace
kubectl create ns kubeflow
kubens kubeflow

# Install Kubeflow Training Operator and Argo Workflows
cd code/project
kubectl kustomize manifests | kubectl apply -f -

What Gets Installed:

  • Training Operator (TensorFlow, PyTorch, MXNet)
  • Custom Resource Definitions (CRDs)
  • RBAC permissions
  • Argo Workflows (pipeline engine)

Understanding Custom Resources

TFJob Custom Resource Definition:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: tfjobs.kubeflow.org
spec:
  group: kubeflow.org
  names:
    kind: TFJob       # What users create
    plural: tfjobs    # For kubectl get tfjobs
    singular: tfjob   # For kubectl get tfjob
  # ... rest of specification

Training Operator Controller:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-operator
spec:
  replicas: 1
  template:
    spec:
      containers:
      - name: training-operator
        image: kubeflow/training-operator
        command: [/manager]
        # This controller watches for TFJob resources
        # and creates the necessary pods/services

Creating Distributed TensorFlow Jobs

TFJob Specification:

tfjob.yaml:

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  namespace: kubeflow
  generateName: distributed-tfjob-
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2              # Two worker processes
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
            command:
            - "python"
            - "/var/tf_mnist/mnist_with_summaries.py"
            - "--log_dir=/train/metrics"
            - "--learning_rate=0.01"
            - "--batch_size=100"

Distributed TFJob Architecture

  • Worker 0: data shard A, model copy, gradient computation
  • Worker 1: data shard B, model copy, gradient computation
  • AllReduce synchronization between the workers keeps the model updates in sync
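
For reference, here is a minimal sketch of what the training script inside each worker container might look like; this is not the mnist_with_summaries.py script used above. It assumes the Training Operator injects the TF_CONFIG environment variable (which it does for TFJobs) and that MultiWorkerMirroredStrategy is used for AllReduce:

import tensorflow as tf

# The Training Operator sets TF_CONFIG on every replica; the strategy reads it
# and wires up AllReduce communication between the workers automatically.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

with strategy.scope():
    # Variables created inside the scope are mirrored and kept in sync across workers.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

# Each worker processes a shard of the data; gradients are averaged via AllReduce.
model.fit(x_train, y_train, epochs=5, batch_size=64)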

Deploy and Monitor TFJob:

# Submit distributed training job
kubectl create -f tfjob.yaml
# Output: tfjob.kubeflow.org/distributed-tfjob-qc8fh created

# Check TFJob status
kubectl get tfjob
# Output: NAME                      AGE
#         distributed-tfjob-qc8fh   1s

# Monitor worker pods
kubectl get pods
# Output: NAME                               READY   STATUS
#         distributed-tfjob-qc8fh-worker-0   1/1     Running
#         distributed-tfjob-qc8fh-worker-1   1/1     Running

# Check training logs
kubectl logs distributed-tfjob-qc8fh-worker-0
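
Because TFJob is just a custom resource, the same status check can be done programmatically. Here is a minimal sketch using the CustomObjectsApi of the official kubernetes Python client (an assumption: pip install kubernetes):

from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Equivalent to `kubectl get tfjobs -n kubeflow`.
tfjobs = api.list_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="kubeflow", plural="tfjobs"
)
for job in tfjobs["items"]:
    print(job["metadata"]["name"])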

Parameter Server Configuration (Alternative):

# For very large models requiring parameter servers
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: ps-training-job
spec:
  tfReplicaSpecs:
    PS:                # Parameter Server replicas
      replicas: 2
      template:
        spec:
          containers:
          - name: tensorflow
            image: your-ps-image:latest
    Worker:            # Worker replicas
      replicas: 4
      template:
        spec:
          containers:
          - name: tensorflow
            image: your-worker-image:latest

Parameter Server vs AllReduce (Ring)

  • Architecture: centralized parameter servers vs peer-to-peer workers
  • Communication: Workers ↔ PS vs Worker ↔ Worker
  • Bottleneck: PS network bandwidth vs load distributed evenly across workers
  • Best for: large models (PS) vs fast networks (AllReduce)

Insight

Kubeflow Training Operator is like a skilled orchestra conductor: it coordinates multiple musicians (workers) to play the same piece (train the same model) in perfect harmony, automatically handling the complex timing and communication between all participants.

3.2 Exercises

  1. Q: If your model training requires parameter servers, can you express that in a TFJob?

    A: Yes. Similar to Worker replicas, add a PS entry under tfReplicaSpecs (as in the example above) to specify the number of parameter servers.


4. Argo Workflows: Container-Native Workflow Engine

Argo Workflows is our workflow orchestration engine - it connects all ML components into cohesive, automated pipelines that can handle complex dependencies and conditional logic.

In plain English: Argo Workflows is like a sophisticated assembly line manager that coordinates different stations (tasks), ensures work happens in the right order (dependencies), can make decisions based on quality checks (conditionals), and runs multiple production lines simultaneously (parallel execution).

In technical terms: Argo Workflows is a container-native workflow engine for orchestrating parallel jobs on Kubernetes, using directed acyclic graphs (DAGs) to define complex workflows with dependencies, conditionals, loops, and artifact passing.

Why it matters: ML pipelines require coordinating data preprocessing, training, validation, and serving steps with proper dependencies and error handling. Argo automates this orchestration while providing visibility into execution and the ability to retry failed steps.

Argo Project Suite

  • 🔄 Argo Workflows: container-native engine, DAG execution, parallel jobs
  • 📦 Argo CD: GitOps delivery, continuous deployment, Kubernetes sync
  • 🔄 Argo Rollouts: progressive delivery, canary deployments, blue-green releases
  • Argo Events: event-driven automation, webhook triggers, resource watchers

Argo Workflows Core Features:

  • DAG Support: Complex workflow dependencies
  • Conditional Logic: Branch based on outcomes
  • Parallel Execution: Maximize resource utilization
  • Artifact Management: Pass data between steps
  • Web UI: Visual monitoring and debugging

4.1 The Basics: From Simple Tasks to Complex DAGs

Setting Up Argo Workflows UI

Access the Web Interface:

# Port-forward to access UI locally
kubectl port-forward svc/argo-server 2746:2746

# Visit https://localhost:2746 in your browser

Argo Workflows UI Features

  • 📊 Workflow List: running workflows, status overview, resource usage
  • 📈 DAG Visualization: node dependencies, execution flow, conditional paths
  • 📝 Log Viewer: real-time logs, error details, performance metrics
  • Workflow Management: submit, suspend, resume, delete

Simple Workflow Examples

1. Hello World Workflow:

argo-hello-world.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
spec:
  entrypoint: whalesay
  serviceAccountName: argo
  templates:
  - name: whalesay
    container:
      image: docker/whalesay
      command: [cowsay]
      args: ["hello world"]

Submit and Monitor:

# Submit workflow
kubectl create -f argo-hello-world.yaml
# Output: workflow.argoproj.io/hello-world-zns4g created

# Check workflow status
kubectl get wf
# Output: NAME                STATUS    AGE
#         hello-world-zns4g   Running   2s

# Get pods created by workflow
kubectl get pods -l workflows.argoproj.io/workflow=hello-world-zns4g

# Check logs
kubectl logs hello-world-zns4g -c main

2. Kubernetes Resource Template:

argo-resource-template.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: k8s-resource-
spec:
  entrypoint: k8s-resource
  serviceAccountName: argo
  templates:
  - name: k8s-resource
    resource:
      action: create
      manifest: |
        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: cm-example
        data:
          some: value

3. Python Script Template:

argo-script-template.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: script-tmpl-
spec:
  entrypoint: gen-random-int
  serviceAccountName: argo
  templates:
  - name: gen-random-int
    script:
      image: python:alpine3.6
      command: [python]
      source: |
        import random
        i = random.randint(1, 100)
        print(f"Generated random number: {i}")

Complex Workflow Patterns

4. Diamond DAG Example:

argo-dag-diamond.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-diamond-
spec:
  serviceAccountName: argo
  entrypoint: diamond
  templates:
  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo, "{{inputs.parameters.message}}"]

  - name: diamond
    dag:
      tasks:
      - name: A
        template: echo
        arguments:
          parameters: [{name: message, value: A}]

      - name: B
        dependencies: [A]
        template: echo
        arguments:
          parameters: [{name: message, value: B}]

      - name: C
        dependencies: [A]
        template: echo
        arguments:
          parameters: [{name: message, value: C}]

      - name: D
        dependencies: [B, C]
        template: echo
        arguments:
          parameters: [{name: message, value: D}]

Diamond DAG Execution

  A (start) → B and C (run in parallel) → D (end)

Execution Flow:

  1. A runs first (no dependencies)
  2. B and C run in parallel (both depend on A)
  3. D runs last (depends on both B and C)
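
Workflows are custom resources too, so the same diamond DAG can be submitted from Python instead of kubectl. Here is a minimal sketch using the official kubernetes client and PyYAML (both assumptions, not part of the chapter's setup); the namespace assumes the kubeflow namespace used earlier:

import yaml
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Load the diamond DAG manifest shown above and submit it as a Workflow resource.
with open("argo-dag-diamond.yaml") as f:
    manifest = yaml.safe_load(f)

created = api.create_namespaced_custom_object(
    group="argoproj.io", version="v1alpha1",
    namespace="kubeflow", plural="workflows", body=manifest,
)
print(created["metadata"]["name"])  # generateName is resolved, e.g. dag-diamond-xxxxx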

5. Conditional Workflow (Coin Flip):

argo-coinflip.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: coinflip-
spec:
  serviceAccountName: argo
  entrypoint: coinflip
  templates:
  - name: coinflip
    steps:
    - - name: flip-coin
        template: flip-coin
    - - name: heads
        template: heads
        when: "{{steps.flip-coin.outputs.result}} == heads"
      - name: tails
        template: tails
        when: "{{steps.flip-coin.outputs.result}} == tails"

  - name: flip-coin
    script:
      image: python:alpine3.6
      command: [python]
      source: |
        import random
        result = "heads" if random.randint(0,1) == 0 else "tails"
        print(result)

  - name: heads
    container:
      image: alpine:3.6
      command: [sh, -c]
      args: ["echo \"It was heads!\""]

  - name: tails
    container:
      image: alpine:3.6
      command: [sh, -c]
      args: ["echo \"It was tails!\""]

Conditional Logic Flow

  🎲 Flip Coin (random): generates "heads" or "tails"
  • heads → run the heads template
  • tails → run the tails template
  Only one branch executes!

Advanced Workflow Features

Parameter Passing:

# Workflow with parameters
spec:
  arguments:
    parameters:
    - name: message
      value: "Hello World"

  templates:
  - name: print-message
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo]
      args: ["{{inputs.parameters.message}}"]

Artifact Management:

# Pass files between workflow steps
templates:
- name: generate-artifact
  container:
    image: python:3.8
    command: [python]
    args: ["-c", "open('/tmp/hello_world.txt', 'w').write('hello world')"]
  outputs:
    artifacts:
    - name: hello-art
      path: /tmp/hello_world.txt

- name: consume-artifact
  inputs:
    artifacts:
    - name: hello-art
      path: /tmp/hello_world.txt
  container:
    image: python:3.8
    command: [cat]
    args: ["/tmp/hello_world.txt"]

Artifact Flow

  📝 Step 1 generates a file → 💾 saved as an output artifact → 📦 transferred to the next step → 📖 Step 2 consumes the file

Insight

Argo Workflows is like a sophisticated film director: it coordinates multiple departments (data prep, training, serving), manages complex shooting schedules (DAGs), handles script changes on the fly (conditionals), and ensures everything comes together for the final production (ML pipeline).

4.2 Exercises

  1. Q: Besides accessing the output of each step like {{steps.flip-coin.outputs.result}}, what are other available variables?

    A: The complete list is available at Argo Variables Reference.

  2. Q: Can you trigger workflows automatically by Git commits or other events?

    A: Yes, you can use Argo Events to watch Git events and trigger workflows.


5. Summary and Exercises

Technology Stack Integration

Production ML Technology Stack

  • TensorFlow (ML framework): model development, training, hyperparameter tuning, persistence
  • Kubernetes (infrastructure): container orchestration, resource management, service discovery
  • Kubeflow (ML on K8s): distributed training, model serving, AutoML
  • Argo Workflows (orchestration): end-to-end pipelines, DAG execution, conditional logic

Key Concepts Mastered

  • 🎯 TensorFlow Proficiency: model architecture design, hyperparameter optimization, model persistence, Keras Tuner integration
  • ⚙️ Kubernetes Fundamentals: cluster management, Pod lifecycle, namespace isolation, resource organization
  • 🚀 Kubeflow Integration: Custom Resource Definitions, distributed TFJob training, Training Operator architecture, multi-replica workloads
  • 🔄 Argo Workflows Mastery: DAG-based workflows, conditional execution, parameter passing, artifact management

Technology Readiness Checklist

Before Chapter 9 Implementation:

  • TensorFlow environment configured
  • Local Kubernetes cluster running
  • Kubeflow Training Operator installed
  • Argo Workflows deployed and accessible
  • Basic examples successfully executed
  • UI access configured for monitoring

Ready for:

  • End-to-end ML pipeline implementation
  • Fashion-MNIST distributed training
  • Production-scale model serving
  • Complete workflow automation

Next Chapter Preview

Chapter 9 Implementation Plan:

  • 📊 Data Pipeline: Fashion-MNIST ingestion with caching
  • 🎯 Training Pipeline: multi-model distributed training
  • 🚀 Serving Pipeline: high-availability deployment
  • 🔄 Orchestration: complete Argo integration
  • 📈 Monitoring: end-to-end observability

Insight

You now have all the tools needed to build production-scale ML systems. These technologies work together like a well-orchestrated symphony: TensorFlow provides the musical talent, Kubernetes supplies the concert hall infrastructure, Kubeflow offers specialized ML instruments, and Argo Workflows conducts the entire performance.

Practice Exercises

Technology Integration:

  1. Create a TFJob that uses custom hyperparameters via ConfigMaps
  2. Design an Argo Workflow that runs multiple TensorFlow experiments in parallel
  3. Build a complete pipeline: data preparation → training → model validation

Advanced Scenarios:

  1. Implement conditional model deployment based on accuracy thresholds
  2. Create a workflow that automatically retries failed training jobs
  3. Design a multi-stage pipeline with artifact passing between components

Production Readiness:

  1. Configure resource limits and requests for all workloads
  2. Implement proper logging and monitoring for distributed training
  3. Create workflows that handle different failure scenarios gracefully
