Chapter 8: Overview of Relevant Technologies

Mastering the production ML technology stack: TensorFlow, Kubernetes, Kubeflow, and Argo Workflows


Table of Contents

  1. TensorFlow: The Machine Learning Framework
  2. Kubernetes: The Distributed Container Orchestration System
  3. Kubeflow: Machine Learning Workloads on Kubernetes
  4. Argo Workflows: Container-Native Workflow Engine
  5. Summary and Exercises

Think of this chapter as learning to use a professional kitchen - TensorFlow is your cooking expertise, Kubernetes is the kitchen infrastructure, Kubeflow provides specialized ML cooking equipment, and Argo Workflows orchestrates the entire service from prep to delivery.


1. TensorFlow: The Machine Learning Framework

TensorFlow is the foundation of our ML development stack - a comprehensive platform for building, training, and deploying machine learning models at scale. It's the industry standard used by major tech companies for everything from image classification to recommendation systems.

In plain English: TensorFlow is like a complete woodworking shop - it provides all the tools (functions), raw materials (data structures), and blueprints (architectures) you need to craft machine learning models from scratch or modify existing designs.

In technical terms: TensorFlow is an end-to-end open-source platform for machine learning that provides a comprehensive ecosystem of tools, libraries, and community resources for building and deploying ML-powered applications.

Why it matters: TensorFlow handles the complex mathematics and computational optimization automatically, letting you focus on designing models and solving problems rather than implementing backpropagation algorithms from scratch.
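
To make that concrete, here is a minimal sketch (not from the original chapter) of TensorFlow's automatic differentiation with tf.GradientTape; the toy loss and the single weight are illustrative only:

import tensorflow as tf

# A single trainable weight and a toy loss: L(w) = (3w - 6)^2
w = tf.Variable(1.0)

with tf.GradientTape() as tape:
    loss = tf.square(3.0 * w - 6.0)

# TensorFlow derives dL/dw automatically; no hand-written backpropagation needed.
grad = tape.gradient(loss, w)
print(grad.numpy())  # -18.0, since dL/dw = 2 * (3w - 6) * 3 = -18 at w = 1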

TensorFlow Ecosystem Overview

  • 🌐 TensorFlow.js: browser & Node.js, client-side ML, real-time inference
  • 📱 TensorFlow Lite: mobile & edge, low latency, optimized models
  • 🏭 TFX Pipeline: production MLOps, end-to-end workflows, scalable serving
  • 🚀 TensorFlow Serving: model deployment, REST & gRPC APIs, version management
  • 🏪 TensorFlow Hub: pre-trained models, transfer learning, model repository
  • 📊 Core Framework: research & development, model building, custom training loops

Insight

TensorFlow's ecosystem is like a complete auto manufacturing plant: TensorFlow Core designs the cars, TensorFlow Lite makes compact city cars, TensorFlow Serving runs the dealerships, and TFX manages the entire production line.

1.1 The Basics: Training Models and Hyperparameter Tuning

Environment Setup

Step 1: Create Conda Environment

# Create isolated Python environment
conda create --name dist-ml python=3.9 -y
conda activate dist-ml

# Install TensorFlow
pip install --upgrade pip
pip install tensorflow==2.10.0

# Handle potential NumPy conflicts
pip install numpy --ignore-installed

Step 2: Verify Installation

import tensorflow as tf
print(f"TensorFlow version: {tf.__version__}")
print(f"GPU available: {tf.config.list_physical_devices('GPU')}")

Basic MNIST Classification

Dataset Overview:

MNIST Handwritten Digits Dataset (sample digits 0-9)

Dataset Structure:
  • Training: 60,000 images (28×28 grayscale)
  • Testing: 10,000 images (28×28 grayscale)
  • Classes: 10 digits (0-9)
  • Pixel values: 0-255 (uint8)
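
If you want to eyeball a few of those digits before training, here is a quick sketch (assuming matplotlib is installed in the environment; it is not part of the setup above):

import tensorflow as tf
import matplotlib.pyplot as plt

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()

# Plot the first 10 digits with their labels.
fig, axes = plt.subplots(1, 10, figsize=(15, 2))
for ax, image, label in zip(axes, x_train[:10], y_train[:10]):
    ax.imshow(image, cmap="gray")
    ax.set_title(int(label))
    ax.axis("off")
plt.show()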

Complete Training Pipeline:

Full TensorFlow Code:

import tensorflow as tf

# 1. Load and inspect data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

print(f"Training images shape: {x_train.shape}")  # (60000, 28, 28)
print(f"Training labels shape: {y_train.shape}")  # (60000,)
print(f"Pixel value range: {x_train.min()} to {x_train.max()}")  # 0 to 255

# 2. Preprocessing: Normalize pixel values to [0, 1]
def preprocess(ds):
    return ds / 255.0

x_train = preprocess(x_train)
x_test = preprocess(x_test)

print(f"After preprocessing: {x_train.min()} to {x_train.max()}")  # 0.0 to 1.0

# 3. Define model architecture
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),   # 28×28 → 784 features
    tf.keras.layers.Dense(128, activation='relu'),   # Hidden layer
    tf.keras.layers.Dropout(0.2),                    # Prevent overfitting
    tf.keras.layers.Dense(10, activation='softmax')  # 10-class output
])

# 4. Compile model with optimizer, loss, and metrics
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# 5. Train the model
print("Training model...")
history = model.fit(x_train, y_train, epochs=5)

# 6. Evaluate on test data
test_loss, test_accuracy = model.evaluate(x_test, y_test)
print(f"Test accuracy: {test_accuracy:.4f}")

# 7. Save trained model
model.save('my_model.h5')
print("Model saved successfully!")

Neural Network Training Flow

  1. 📊 Load Data: MNIST dataset
  2. ⚙️ Preprocess: normalize to [0, 1]
  3. 🏗️ Build Model: Sequential layers
  4. 🚀 Train: 5 epochs
  5. Evaluate: test accuracy
  6. 💾 Save: export .h5

Hyperparameter Tuning with Keras Tuner

Advanced Model Optimization:

import tensorflow as tf
import keras_tuner as kt

# Install Keras Tuner first:
# pip install -q -U keras-tuner

def model_builder(hp):
    """Define model with tunable hyperparameters"""
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Flatten(input_shape=(28, 28)))

    # Tunable number of units in hidden layer
    hp_units = hp.Int('units', min_value=32, max_value=512, step=32)
    model.add(tf.keras.layers.Dense(units=hp_units, activation='relu'))
    model.add(tf.keras.layers.Dense(10))

    # Tunable learning rate
    hp_learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=hp_learning_rate),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy']
    )

    return model

# Configure hyperparameter tuner
tuner = kt.Hyperband(
    model_builder,
    objective='val_accuracy',
    max_epochs=10,
    factor=3,
    directory='my_dir',
    project_name='intro_to_kt'
)

# Early stopping to prevent overfitting
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=4
)

# Start hyperparameter search
print("Starting hyperparameter search...")
tuner.search(
    x_train, y_train,
    epochs=30,
    validation_split=0.2,
    callbacks=[early_stop]
)

# Get best hyperparameters and train final model
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]
print(f"Best units: {best_hps.get('units')}")
print(f"Best learning rate: {best_hps.get('learning_rate')}")

# Train final model with best hyperparameters
model = tuner.hypermodel.build(best_hps)
model.fit(x_train, y_train, epochs=50, validation_split=0.2)

# Evaluate final model
final_accuracy = model.evaluate(x_test, y_test)[1]
print(f"Final optimized accuracy: {final_accuracy:.4f}")

Hyperparameter Tuning Process

  1. 🎯 Define Search Space: units 32-512, learning rate in [1e-2, 1e-3, 1e-4]
  2. 🔄 Run Trials: the Hyperband algorithm explores combinations
  3. 📊 Evaluate Performance: track validation accuracy per trial
  4. Select Best Config: choose the highest-performing hyperparameters
  5. 🏆 Train Final Model: use the optimal settings for production

Model Loading and Reuse:

# Load saved model in new session
import tensorflow as tf
model = tf.keras.models.load_model('my_model.h5')

# Use for predictions
predictions = model.predict(x_test[:5])
predicted_classes = tf.argmax(predictions, axis=1)
print(f"Predictions: {predicted_classes.numpy()}")
print(f"Actual: {y_test[:5]}")

Insight

Hyperparameter tuning is like adjusting a recipe: you try different amounts of ingredients (units, learning rate) and cooking times (epochs) to find the perfect combination that produces the best dish (highest accuracy).

1.2 Exercises

  1. Q: Can you use the previously saved model directly for model evaluation?

    A: Yes, via model = tf.keras.models.load_model('my_model.h5'); model.evaluate(x_test, y_test)

  2. Q: Instead of using the Hyperband tuning algorithm, could you try the random search algorithm?

    A: Yes. You can do it by swapping the tuner for kt.RandomSearch, as in the sketch below.
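
A minimal sketch of that swap (the objective and max_trials values here are illustrative, not from the chapter):

import keras_tuner as kt

# Reuses the model_builder function defined earlier; only the tuner type changes.
tuner = kt.RandomSearch(
    model_builder,
    objective='val_accuracy',
    max_trials=10,               # number of random configurations to try
    directory='my_dir',
    project_name='random_search_kt'
)

tuner.search(x_train, y_train, epochs=10, validation_split=0.2)
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]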


2. Kubernetes: The Distributed Container Orchestration System

Kubernetes (K8s) is our distributed infrastructure foundation - it automates deployment, scaling, and management of containerized applications across clusters of machines.

In plain English: Kubernetes is like a smart building manager that automatically assigns office space to teams, ensures everyone has the power and network connections they need, and can quickly reorganize spaces when requirements change.

In technical terms: Kubernetes is an open-source container orchestration platform that automates the deployment, scaling, and operations of application containers across clusters of hosts, providing container-centric infrastructure.

Why it matters: Instead of manually managing which servers run which applications and monitoring their health, Kubernetes automatically handles scheduling, healing failed containers, scaling based on load, and managing service discovery.

Kubernetes Architecture Overview

  • Control Plane: API Server, Scheduler, Controller, etcd; orchestrates the cluster
  • Worker Nodes: Kubelet, Container Runtime, Pods; run workloads and report status back to the control plane
  • ⌨️ kubectl (CLI): the command-line interface for talking to the API Server
  • 🚀 Applications (ML Workloads): your containers, scheduled into Pods on the worker nodes

2.1 The Basics: Clusters, Pods, and Container Management

Setting Up Local Kubernetes Cluster

Step 1: Create Cluster with k3d

# Install k3d (lightweight Kubernetes)
# k3d creates Kubernetes clusters using Docker containers

# Create cluster named 'distml'
k3d cluster create distml --image rancher/k3s:v1.25.3-rc3-k3s1

# Verify cluster creation
kubectl get nodes

Expected Output:

NAME                  STATUS   ROLES                  AGE   VERSION
k3d-distml-server-0   Ready    control-plane,master   1m    v1.25.3+k3s1

Step 2: Inspect Node Details

# Get detailed node information
kubectl describe node k3d-distml-server-0

Key Node Information:

Labels:
  beta.kubernetes.io/arch=arm64
  beta.kubernetes.io/os=linux
  kubernetes.io/hostname=k3d-distml-server-0
  node-role.kubernetes.io/control-plane=true

System Info:
  Operating System:  linux
  Architecture:      arm64
  Container Runtime: containerd://1.5.9-k3s1

Capacity:
  cpu:    4
  memory: 8142116Ki
  pods:   110

Addresses:
  InternalIP: 172.18.0.3
  Hostname:   k3d-distml-server-0

Working with Namespaces

Create and Manage Namespaces:

# Create namespace for our examples
kubectl create ns basics

# Install kubectx/kubens for easier navigation (optional but helpful)
# These tools simplify switching between clusters and namespaces

# List available contexts and namespaces
kubectx # Show clusters
kubens # Show namespaces

# Switch to our cluster and namespace
kubectx k3d-distml
kubens basics

Namespace Structure (inside the Kubernetes cluster)

  • default: system default
  • kube-system: K8s core
  • kube-public: public
  • kube-node-lease: heartbeats
  • basics: our workspace
Understanding Pods

Pod Concept:

Pod Structure - Smallest Deployable Unit

A Pod wraps one or more containers that are scheduled together:
  • Container 1: the application / main logic
  • Container 2: a sidecar, e.g. for logging
  • Shared resources: network (one IP address), storage volumes, and a common lifecycle

Create a Simple Pod:

hello-world.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: whalesay
spec:
  containers:
  - name: whalesay
    image: docker/whalesay:latest
    command: [cowsay]
    args: ["hello world"]

Deploy and Monitor Pod:

# Create the pod
kubectl create -f hello-world.yaml

# Check pod status
kubectl get pods
# Output: NAME       READY   STATUS      RESTARTS   AGE
#         whalesay   0/1     Completed   2          37s

# View pod logs
kubectl logs whalesay

Expected Output:

 _____________ 
< hello world >
 ------------- 
    \
     \
      \
                    ##        .
              ## ## ##       ==
           ## ## ## ##      ===
       /""""""""""""""""___/ ===
  ~~~ {~~ ~~~~ ~~~ ~~~~ ~~ ~ /  ===- ~~~
       \______ o          __/
        \    \        __/
          \____\______/

Inspect Pod Details:

# Get complete pod specification
kubectl get pod whalesay -o yaml

# Get pod info in JSON format
kubectl get pod whalesay -o json

Pod Lifecycle

  • Pending: waiting for scheduling
  • ▶️ Running: container(s) executing
  • Succeeded: completed successfully
  • Failed: error occurred

Common problem states:
  • CrashLoopBackOff: repeatedly failing
  • ImagePullBackOff: cannot pull image
  • Evicted: resource pressure

Insight

Think of Kubernetes like a smart building manager: it knows every room (node), can quickly move teams (pods) to available spaces, and ensures everyone has the resources they need (CPU, memory, storage) while maintaining building-wide coordination.

2.2 Exercises

  1. Q: How do you get the Pod information in JSON format?

    A: kubectl get pod <pod-name> -o json (see also the Python client sketch after these exercises)

  2. Q: Can a Pod contain multiple containers?

    A: Yes. You can define additional containers under pod.spec.containers alongside the existing one, as shown in the sketch below.
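
Both answers can also be exercised programmatically. The following is a minimal sketch using the official kubernetes Python client (an assumption: pip install kubernetes and a kubeconfig pointing at the k3d cluster); the sidecar container is purely illustrative:

from kubernetes import client, config

# Load credentials from the local kubeconfig (e.g. the k3d-distml context).
config.load_kube_config()
v1 = client.CoreV1Api()

# Exercise 1: the API returns the same structured data as `kubectl get pod -o json`.
pod = v1.read_namespaced_pod(name="whalesay", namespace="basics")
print(pod.status.phase)

# Exercise 2: a Pod spec can list several containers sharing network and volumes.
two_container_pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="whalesay-with-sidecar"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(name="whalesay", image="docker/whalesay:latest",
                               command=["cowsay"], args=["hello world"]),
            client.V1Container(name="sidecar", image="alpine:3.7",
                               command=["sh", "-c"], args=["echo sidecar done"]),
        ],
    ),
)
v1.create_namespaced_pod(namespace="basics", body=two_container_pod)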


3. Kubeflow: Machine Learning Workloads on Kubernetes

Kubeflow transforms Kubernetes into a comprehensive ML platform - making it simple to deploy, scale, and manage machine learning workflows on any Kubernetes cluster.

In plain English: Kubeflow is like adding a professional film studio to a construction site - while Kubernetes provides the building infrastructure, Kubeflow adds specialized equipment for producing movies (ML models), from cameras and lighting (training tools) to editing suites (serving infrastructure).

In technical terms: Kubeflow is a machine learning toolkit for Kubernetes that orchestrates complex ML workflows including distributed training, hyperparameter tuning, model serving, and pipeline execution, all as native Kubernetes resources.

Why it matters: Instead of writing custom Kubernetes manifests and management scripts for each ML task, Kubeflow provides high-level abstractions that handle distributed training setup, resource allocation, monitoring, and serving automatically.

Kubeflow Platform: ML Platform on Kubernetes

  • ML Applications: Pipelines, Katib (AutoML), KServe, Notebooks
  • Kubeflow Components: Training Operator, Central Dashboard, Workflow Engine
  • Kubernetes Foundation: container orchestration, resource management, service discovery

Key Components:

  • 🔄 Pipelines (KFP): orchestrate workflows, DAG execution, artifact tracking
  • 🔍 Katib: hyperparameter tuning, AutoML, neural architecture search
  • 🚀 KServe: model serving, autoscaling, multi-framework support
  • 📊 Training Operator: distributed training, TensorFlow/PyTorch, multi-replica jobs
  • 💻 Notebooks: Jupyter environments, interactive development, GPU access
  • 🌐 Central UI: unified interface, pipeline visualization, job monitoring

3.1 The Basics: Distributed Training with TFJob

Setup Kubeflow Training Operator

Step 1: Prepare Namespace and Install Components

# Create dedicated namespace
kubectl create ns kubeflow
kubens kubeflow

# Install Kubeflow Training Operator and Argo Workflows
cd code/project
kubectl kustomize manifests | kubectl apply -f -

What Gets Installed:

  • Training Operator (TensorFlow, PyTorch, MXNet)
  • Custom Resource Definitions (CRDs)
  • RBAC permissions
  • Argo Workflows (pipeline engine)

Understanding Custom Resources

TFJob Custom Resource Definition:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: tfjobs.kubeflow.org
spec:
  group: kubeflow.org
  names:
    kind: TFJob       # What users create
    plural: tfjobs    # For kubectl get tfjobs
    singular: tfjob   # For kubectl get tfjob
  # ... rest of specification

Training Operator Controller:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-operator
spec:
  replicas: 1
  template:
    spec:
      containers:
      - name: training-operator
        image: kubeflow/training-operator
        command: [/manager]
        # This controller watches for TFJob resources
        # and creates the necessary pods/services

Creating Distributed TensorFlow Jobs

TFJob Specification:

tfjob.yaml:

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  namespace: kubeflow
  generateName: distributed-tfjob-
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2              # Two worker processes
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
            command:
            - "python"
            - "/var/tf_mnist/mnist_with_summaries.py"
            - "--log_dir=/train/metrics"
            - "--learning_rate=0.01"
            - "--batch_size=100"

Distributed TFJob Architecture

  • Worker 0: data shard A, model copy, gradient computation
  • Worker 1: data shard B, model copy, gradient computation
  • AllReduce synchronization between the workers keeps the model updates in sync
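
For reference, here is a minimal sketch of what the training script inside each worker container might look like; this is not the mnist_with_summaries.py script used above. It assumes the Training Operator injects the TF_CONFIG environment variable (which it does for TFJobs) and that MultiWorkerMirroredStrategy is used for AllReduce:

import tensorflow as tf

# The Training Operator sets TF_CONFIG on every replica; the strategy reads it
# and wires up AllReduce communication between the workers automatically.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

with strategy.scope():
    # Variables created inside the scope are mirrored and kept in sync across workers.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

# Each worker processes a shard of the data; gradients are averaged via AllReduce.
model.fit(x_train, y_train, epochs=5, batch_size=64)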

Deploy and Monitor TFJob:

# Submit distributed training job
kubectl create -f tfjob.yaml
# Output: tfjob.kubeflow.org/distributed-tfjob-qc8fh created

# Check TFJob status
kubectl get tfjob
# Output: NAME                      AGE
#         distributed-tfjob-qc8fh   1s

# Monitor worker pods
kubectl get pods
# Output: NAME                               READY   STATUS
#         distributed-tfjob-qc8fh-worker-0   1/1     Running
#         distributed-tfjob-qc8fh-worker-1   1/1     Running

# Check training logs
kubectl logs distributed-tfjob-qc8fh-worker-0
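
Because TFJob is just a custom resource, the same status check can be done programmatically. Here is a minimal sketch using the CustomObjectsApi of the official kubernetes Python client (an assumption: pip install kubernetes):

from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Equivalent to `kubectl get tfjobs -n kubeflow`.
tfjobs = api.list_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="kubeflow", plural="tfjobs"
)
for job in tfjobs["items"]:
    print(job["metadata"]["name"])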

Parameter Server Configuration (Alternative):

# For very large models requiring parameter servers
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: ps-training-job
spec:
  tfReplicaSpecs:
    PS:                # Parameter Server replicas
      replicas: 2
      template:
        spec:
          containers:
          - name: tensorflow
            image: your-ps-image:latest
    Worker:            # Worker replicas
      replicas: 4
      template:
        spec:
          containers:
          - name: tensorflow
            image: your-worker-image:latest

Parameter Server vs AllReduce (Ring)

  • Architecture: centralized parameter servers vs peer-to-peer workers
  • Communication: Workers ↔ PS vs Worker ↔ Worker
  • Bottleneck: PS network bandwidth vs load distributed evenly across workers
  • Best for: large models (PS) vs fast networks (AllReduce)

Insight

Kubeflow Training Operator is like a skilled orchestra conductor: it coordinates multiple musicians (workers) to play the same piece (train the same model) in perfect harmony, automatically handling the complex timing and communication between all participants.

3.2 Exercises

  1. Q: If your model training requires parameter servers, can you express that in a TFJob?

    A: Yes. Similar to Worker replicas, add a PS entry under tfReplicaSpecs (as in the example above) to specify the number of parameter servers.


4. Argo Workflows: Container-Native Workflow Engine

Argo Workflows is our workflow orchestration engine - it connects all ML components into cohesive, automated pipelines that can handle complex dependencies and conditional logic.

In plain English: Argo Workflows is like a sophisticated assembly line manager that coordinates different stations (tasks), ensures work happens in the right order (dependencies), can make decisions based on quality checks (conditionals), and runs multiple production lines simultaneously (parallel execution).

In technical terms: Argo Workflows is a container-native workflow engine for orchestrating parallel jobs on Kubernetes, using directed acyclic graphs (DAGs) to define complex workflows with dependencies, conditionals, loops, and artifact passing.

Why it matters: ML pipelines require coordinating data preprocessing, training, validation, and serving steps with proper dependencies and error handling. Argo automates this orchestration while providing visibility into execution and the ability to retry failed steps.

Argo Project Suite

  • 🔄 Argo Workflows: container-native engine, DAG execution, parallel jobs
  • 📦 Argo CD: GitOps delivery, continuous deployment, Kubernetes sync
  • 🔄 Argo Rollouts: progressive delivery, canary deployments, blue-green releases
  • Argo Events: event-driven automation, webhook triggers, resource watchers

Argo Workflows Core Features:

  • DAG Support: Complex workflow dependencies
  • Conditional Logic: Branch based on outcomes
  • Parallel Execution: Maximize resource utilization
  • Artifact Management: Pass data between steps
  • Web UI: Visual monitoring and debugging

4.1 The Basics: From Simple Tasks to Complex DAGs

Setting Up Argo Workflows UI

Access the Web Interface:

# Port-forward to access UI locally
kubectl port-forward svc/argo-server 2746:2746

# Visit https://localhost:2746 in your browser

Argo Workflows UI Features

  • 📊 Workflow List: running workflows, status overview, resource usage
  • 📈 DAG Visualization: node dependencies, execution flow, conditional paths
  • 📝 Log Viewer: real-time logs, error details, performance metrics
  • Workflow Management: submit, suspend, resume, delete

Simple Workflow Examples

1. Hello World Workflow:

argo-hello-world.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
spec:
  entrypoint: whalesay
  serviceAccountName: argo
  templates:
  - name: whalesay
    container:
      image: docker/whalesay
      command: [cowsay]
      args: ["hello world"]

Submit and Monitor:

# Submit workflow
kubectl create -f argo-hello-world.yaml
# Output: workflow.argoproj.io/hello-world-zns4g created

# Check workflow status
kubectl get wf
# Output: NAME                STATUS    AGE
#         hello-world-zns4g   Running   2s

# Get pods created by workflow
kubectl get pods -l workflows.argoproj.io/workflow=hello-world-zns4g

# Check logs
kubectl logs hello-world-zns4g -c main

2. Kubernetes Resource Template:

argo-resource-template.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: k8s-resource-
spec:
  entrypoint: k8s-resource
  serviceAccountName: argo
  templates:
  - name: k8s-resource
    resource:
      action: create
      manifest: |
        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: cm-example
        data:
          some: value

3. Python Script Template:

argo-script-template.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: script-tmpl-
spec:
  entrypoint: gen-random-int
  serviceAccountName: argo
  templates:
  - name: gen-random-int
    script:
      image: python:alpine3.6
      command: [python]
      source: |
        import random
        i = random.randint(1, 100)
        print(f"Generated random number: {i}")

Complex Workflow Patterns

4. Diamond DAG Example:

argo-dag-diamond.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-diamond-
spec:
  serviceAccountName: argo
  entrypoint: diamond
  templates:
  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo, "{{inputs.parameters.message}}"]

  - name: diamond
    dag:
      tasks:
      - name: A
        template: echo
        arguments:
          parameters: [{name: message, value: A}]

      - name: B
        dependencies: [A]
        template: echo
        arguments:
          parameters: [{name: message, value: B}]

      - name: C
        dependencies: [A]
        template: echo
        arguments:
          parameters: [{name: message, value: C}]

      - name: D
        dependencies: [B, C]
        template: echo
        arguments:
          parameters: [{name: message, value: D}]

Diamond DAG Execution

  A (start) → B and C (run in parallel) → D (end)

Execution Flow:

  1. A runs first (no dependencies)
  2. B and C run in parallel (both depend on A)
  3. D runs last (depends on both B and C)
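
Workflows are custom resources too, so the same diamond DAG can be submitted from Python instead of kubectl. Here is a minimal sketch using the official kubernetes client and PyYAML (both assumptions, not part of the chapter's setup); the namespace assumes the kubeflow namespace used earlier:

import yaml
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Load the diamond DAG manifest shown above and submit it as a Workflow resource.
with open("argo-dag-diamond.yaml") as f:
    manifest = yaml.safe_load(f)

created = api.create_namespaced_custom_object(
    group="argoproj.io", version="v1alpha1",
    namespace="kubeflow", plural="workflows", body=manifest,
)
print(created["metadata"]["name"])  # generateName is resolved, e.g. dag-diamond-xxxxx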

5. Conditional Workflow (Coin Flip):

argo-coinflip.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: coinflip-
spec:
  serviceAccountName: argo
  entrypoint: coinflip
  templates:
  - name: coinflip
    steps:
    - - name: flip-coin
        template: flip-coin
    - - name: heads
        template: heads
        when: "{{steps.flip-coin.outputs.result}} == heads"
      - name: tails
        template: tails
        when: "{{steps.flip-coin.outputs.result}} == tails"

  - name: flip-coin
    script:
      image: python:alpine3.6
      command: [python]
      source: |
        import random
        result = "heads" if random.randint(0,1) == 0 else "tails"
        print(result)

  - name: heads
    container:
      image: alpine:3.6
      command: [sh, -c]
      args: ["echo \"It was heads!\""]

  - name: tails
    container:
      image: alpine:3.6
      command: [sh, -c]
      args: ["echo \"It was tails!\""]

Conditional Logic Flow

  🎲 Flip Coin (random): generates "heads" or "tails"
  • heads → run the heads template
  • tails → run the tails template
  Only one branch executes!

Advanced Workflow Features

Parameter Passing:

# Workflow with parameters
spec:
  arguments:
    parameters:
    - name: message
      value: "Hello World"

  templates:
  - name: print-message
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo]
      args: ["{{inputs.parameters.message}}"]

Artifact Management:

# Pass files between workflow steps
templates:
- name: generate-artifact
  container:
    image: python:3.8
    command: [python]
    args: ["-c", "open('/tmp/hello_world.txt', 'w').write('hello world')"]
  outputs:
    artifacts:
    - name: hello-art
      path: /tmp/hello_world.txt

- name: consume-artifact
  inputs:
    artifacts:
    - name: hello-art
      path: /tmp/hello_world.txt
  container:
    image: python:3.8
    command: [cat]
    args: ["/tmp/hello_world.txt"]

Artifact Flow

  📝 Step 1 generates a file → 💾 saved as an output artifact → 📦 transferred to the next step → 📖 Step 2 consumes the file

Insight

Argo Workflows is like a sophisticated film director: it coordinates multiple departments (data prep, training, serving), manages complex shooting schedules (DAGs), handles script changes on the fly (conditionals), and ensures everything comes together for the final production (ML pipeline).

4.2 Exercises

  1. Q: Besides accessing the output of each step like {{steps.flip-coin.outputs.result}}, what are other available variables?

    A: The complete list is available at Argo Variables Reference.

  2. Q: Can you trigger workflows automatically by Git commits or other events?

    A: Yes, you can use Argo Events to watch Git events and trigger workflows.


5. Summary and Exercises

Technology Stack Integration

Production ML Technology Stack

  • TensorFlow (ML framework): model development, training, hyperparameter tuning, persistence
  • Kubernetes (infrastructure): container orchestration, resource management, service discovery
  • Kubeflow (ML on K8s): distributed training, model serving, AutoML
  • Argo Workflows (orchestration): end-to-end pipelines, DAG execution, conditional logic

Key Concepts Mastered

  • 🎯 TensorFlow Proficiency: model architecture design, hyperparameter optimization, model persistence, Keras Tuner integration
  • ⚙️ Kubernetes Fundamentals: cluster management, Pod lifecycle, namespace isolation, resource organization
  • 🚀 Kubeflow Integration: Custom Resource Definitions, distributed TFJob training, Training Operator architecture, multi-replica workloads
  • 🔄 Argo Workflows Mastery: DAG-based workflows, conditional execution, parameter passing, artifact management

Technology Readiness Checklist

Before Chapter 9 Implementation:

  • TensorFlow environment configured
  • Local Kubernetes cluster running
  • Kubeflow Training Operator installed
  • Argo Workflows deployed and accessible
  • Basic examples successfully executed
  • UI access configured for monitoring

Ready for:

  • End-to-end ML pipeline implementation
  • Fashion-MNIST distributed training
  • Production-scale model serving
  • Complete workflow automation

Next Chapter Preview

Chapter 9 Implementation Plan:

  • 📊 Data Pipeline: Fashion-MNIST ingestion with caching
  • 🎯 Training Pipeline: multi-model distributed training
  • 🚀 Serving Pipeline: high-availability deployment
  • 🔄 Orchestration: complete Argo integration
  • 📈 Monitoring: end-to-end observability

Insight

You now have all the tools needed to build production-scale ML systems. These technologies work together like a well-orchestrated symphony: TensorFlow provides the musical talent, Kubernetes supplies the concert hall infrastructure, Kubeflow offers specialized ML instruments, and Argo Workflows conducts the entire performance.

Practice Exercises

Technology Integration:

  1. Create a TFJob that uses custom hyperparameters via ConfigMaps
  2. Design an Argo Workflow that runs multiple TensorFlow experiments in parallel
  3. Build a complete pipeline: data preparation → training → model validation

Advanced Scenarios:

  1. Implement conditional model deployment based on accuracy thresholds
  2. Create a workflow that automatically retries failed training jobs
  3. Design a multi-stage pipeline with artifact passing between components

Production Readiness:

  1. Configure resource limits and requests for all workloads
  2. Implement proper logging and monitoring for distributed training
  3. Create workflows that handle different failure scenarios gracefully
