Fiooodooor/masterclass-quick-reference-handout.md

Last active April 2, 2026 09:37

masterclass-quick-reference-handout.md

Quick Reference — Bare Metal to Kubernetes

Master Class Handout (1-pager per session)

SESSION 1 CHEAT SHEET: Bare Metal → Virtualization

The Stack

VMs → Hypervisor → Hardware

Hypervisor Types

	Type 1 (Bare Metal)	Type 2 (Hosted)
Runs on	Hardware directly	Host OS
Performance	Near-native	10–30% overhead
Use	Production / Cloud	Dev / Test
Examples	ESXi, KVM, Hyper-V, Xen	VirtualBox, VMware Workstation

Hypervisor Reference Model (Popek & Goldberg, 1974)

DISPATCHER  → Entry point, routes traps
ALLOCATOR   → Manages resources for VMs
INTERPRETER → Emulates privileged instructions

Hardware Virtualization Extensions

Intel VT-x / AMD-V — CPU virtualization (VM entry / VM exit)
Intel VT-d / AMD-Vi — I/O virtualization (IOMMU, DMA isolation)
SR-IOV — NIC hardware partitioning (virtual functions)

Key Commands

# Check if CPU supports virtualization
grep -E 'vmx|svm' /proc/cpuinfo

# Check KVM availability
lsmod | grep kvm

# List VMs (libvirt)
virsh list --all
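The CPU check above can be wrapped in a small helper for scripts. This is a minimal sketch, assuming a Linux host with x86 CPU flags; the function name `virt_flag` is hypothetical, and it reads cpuinfo-style text from stdin so it can be tried on sample input.

```shell
#!/bin/sh
# virt_flag: read /proc/cpuinfo-style text on stdin and report which
# hardware virtualization extension the CPU flags advertise.
# (Hypothetical helper name; vmx = Intel VT-x, svm = AMD-V.)
virt_flag() {
    flags=$(grep -E '^flags' | head -n 1)
    case " $flags " in
        *" vmx "*) echo "vt-x" ;;
        *" svm "*) echo "amd-v" ;;
        *)         echo "none" ;;
    esac
}

# Usage on a real host:
#   virt_flag < /proc/cpuinfo
```

A result of `none` on capable hardware usually means the extension is disabled in firmware settings.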

SESSION 2 CHEAT SHEET: Containers → Kubernetes

The Container "Recipe"

Container = Namespaces + cgroups + seccomp + capabilities + rootfs

Linux Namespaces

NS	Isolates	Flag
mnt	Filesystems	CLONE_NEWNS
pid	Process IDs	CLONE_NEWPID
net	Network stack	CLONE_NEWNET
uts	Hostname	CLONE_NEWUTS
ipc	IPC resources	CLONE_NEWIPC
user	UID/GID	CLONE_NEWUSER
cgroup	cgroup view	CLONE_NEWCGROUP

Container Runtime Stack

HIGH: Docker / Podman / nerdctl     (UX: build, run, push)
MID:  containerd / CRI-O            (lifecycle: create, start, stop)
LOW:  runc / crun / kata / gVisor   (OCI: clone, unshare, execve)

CRI = Container Runtime Interface

kubelet ←─ gRPC ─→ containerd or CRI-O ──→ runc/crun

Two services: RuntimeService (pod/container lifecycle) + ImageService (pull/list/remove)

Kubernetes Components

CONTROL PLANE:                 WORKER NODE:
├── API Server (REST)          ├── kubelet (CRI client)
├── etcd (state store)         ├── container runtime
├── Scheduler (placement)      └── kube-proxy (networking)
└── Controller Manager (loops)

Pod = Scheduling Unit

Shares: network namespace, IPC, volumes
Each pod gets its own IP
"pause" container holds the namespace

Docker vs Podman

	Docker	Podman
Daemon	Yes (dockerd)	No
Root	Default	Rootless default
Build	docker build / buildx	podman build / buildah
K8s	via containerd	via CRI-O

Essential Commands

# Namespaces
unshare --pid --net --mount --fork /bin/bash
lsns
nsenter -t <PID> -n                  # enter network ns

# cgroups v2 (run inside a container; the root cgroup has no memory.max or cpu.max)
cat /sys/fs/cgroup/memory.max
cat /sys/fs/cgroup/cpu.max

# Docker / Podman
docker build -t myapp:1.0 .
docker run --rm -p 8080:80 myapp:1.0
podman pod create --name mypod

# Kubernetes
kubeadm init --pod-network-cidr=10.244.0.0/16
kubeadm join <cp>:6443 --token <tok> --discovery-token-ca-cert-hash sha256:<hash>
kubectl get nodes
kubectl get pods -A

# CRI (on K8s node)
crictl pods
crictl ps
crictl info
crictl inspect <id>

Dockerfile Best Practices

# Multi-stage, pinned version, non-root
FROM golang:1.22-alpine AS builder
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /app ./cmd/server

FROM gcr.io/distroless/static-debian12:nonroot
COPY --from=builder /app /app
USER nonroot:nonroot
ENTRYPOINT ["/app"]

Security Layers (defense in depth)

1. Namespaces        → visibility isolation
2. cgroups           → resource limits
3. seccomp           → syscall filtering
4. Capabilities      → fine-grained privileges
5. AppArmor/SELinux  → mandatory access control
6. Network Policies  → pod-to-pod firewall
7. Read-only rootfs  → immutable filesystem
8. Non-root user     → no UID 0

masterclass-session-1--bare-metal-to-virtualization.md

🖥️ Session 1

From Bare Metal to Virtualization

Master Class — Infrastructure Foundations

Duration: 90 minutes
Level: Intermediate → Advanced

📋 Session 1 — Agenda

#	Topic	Time
1	Bare Metal OS Deployment	15 min
2	What is Virtualization?	10 min
3	The Hypervisor — Core Concepts	20 min
4	Type 1 vs Type 2 Hypervisors	15 min
5	Hypervisor Reference Model Deep Dive	10 min
6	Virtualization vs Containerization — Preview	10 min
7	Q&A / Discussion	10 min

Part 1

Bare Metal OS Deployment

What is "Bare Metal"?

Bare metal = software running directly on hardware without any intervening virtualization layer.

┌─────────────────────────────┐
│       Application(s)        │
├─────────────────────────────┤
│     Operating System        │
│   (Linux / Windows / BSD)   │
├─────────────────────────────┤
│   Hardware (CPU, RAM, NIC,  │
│     Storage, GPU, etc.)     │
└─────────────────────────────┘

1:1 relationship — one OS owns the entire machine
Full, unmediated hardware access
Maximum performance, minimum abstraction

Bare Metal — Characteristics

✅ Advantages

Maximum performance — no virtualization overhead
Full hardware access — DMA, IOMMU, SR-IOV, and GPUs are used directly, with no passthrough layer
Deterministic latency — critical for HPC, real-time, HFT
Simpler debugging — no hypervisor layer to reason about

❌ Disadvantages

Low utilization — typical server uses only 5–15% of capacity
Slow provisioning — hours to days (PXE boot, Kickstart, Preseed)
No isolation — a rogue process can crash the entire machine
Scaling = buying hardware — no elasticity

Bare Metal Provisioning Methods

Method	Description	Use Case
PXE/iPXE + Kickstart	Network boot → automated install	Data center fleet
Cloud-Init	First-boot config injection	Cloud bare metal (e.g., AWS i3.metal)
Ironic (OpenStack)	Bare metal as a service	Private cloud
MAAS (Canonical)	Metal as a Service	Ubuntu-centric DC
Tinkerbell (Equinix)	Declarative bare metal workflow	Edge / hybrid
Manual ISO Install	USB/DVD boot + manual steps	Lab / dev

Key insight: Bare metal provisioning is fundamentally slower than VM or container creation. This drove the industry toward virtualization.

Linux on Bare Metal — Key Subsystems

User Space
├── systemd (PID 1, service management)
├── Applications & daemons
├── Shared libraries (glibc, libssl, ...)
│
Kernel Space
├── Process Scheduler (CFS / EEVDF)
├── Memory Management (page tables, NUMA, hugepages)
├── Virtual File System (VFS)
├── Network Stack (netfilter, tc, XDP, eBPF)
├── Device Drivers (NIC, storage, GPU)
├── Security Modules (SELinux, AppArmor, seccomp)
│
Hardware
├── CPU (rings 0-3, VMX extensions)
├── RAM (DDR4/5, NUMA nodes)
├── NIC (queues, RSS, offloads)
├── Storage (NVMe, SATA, HBA)
└── IOMMU, SR-IOV, PCIe topology

The Problem Bare Metal Couldn't Solve

Scenario: A company has 50 physical servers

Server 1:  Web App     → 8% CPU utilization
Server 2:  Database    → 12% CPU utilization
Server 3:  CI Runner   → 3% avg, 90% peak (bursts)
Server 4:  Mail Server → 5% CPU utilization
...
Server 50: Monitoring  → 2% CPU utilization

Average utilization: ~8% → 92% of purchased compute is wasted
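The waste in this scenario can be turned into a rough consolidation estimate. The sketch below is only back-of-the-envelope arithmetic: the helper name `hosts_needed` and the 70% target are assumptions, and it ignores memory, I/O, and peak bursts such as the CI runner's.

```shell
#!/bin/sh
# hosts_needed SERVERS AVG_UTIL_PCT TARGET_UTIL_PCT
# Estimate how many equally sized hosts could carry the same CPU load
# if each were driven at TARGET_UTIL_PCT. Uses integer ceiling division.
# (Hypothetical helper; CPU only, no headroom for bursts.)
hosts_needed() {
    servers=$1 avg=$2 target=$3
    load=$((servers * avg))                  # total load, in percent of one server
    echo $(( (load + target - 1) / target ))
}

hosts_needed 50 8 70    # 50 servers at ~8%, consolidated at 70%: prints 6
```

Real capacity planning sizes for peaks and failure headroom, so the practical number is higher than this lower bound.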

💡 The question that launched an industry:

"Can we run multiple isolated workloads on one physical machine?"

Answer: Yes — Virtualization.

Part 2

What is Virtualization?

Virtualization — Definition

Virtualization is the creation of a virtual (rather than physical) version of something — servers, storage, networks, or operating systems — using a software abstraction layer.

┌──────────┐ ┌──────────┐ ┌──────────┐
│   VM 1   │ │   VM 2   │ │   VM 3   │
│ (Ubuntu) │ │(Windows) │ │(FreeBSD) │
│ App A    │ │ App B    │ │ App C    │
├──────────┤ ├──────────┤ ├──────────┤
│ Guest OS │ │ Guest OS │ │ Guest OS │
└────┬─────┘ └────┬─────┘ └────┬─────┘
     │             │             │
┌────┴─────────────┴─────────────┴────┐
│         HYPERVISOR (VMM)            │
├─────────────────────────────────────┤
│         Physical Hardware           │
└─────────────────────────────────────┘

A Brief History of Virtualization

Year	Milestone
1967	IBM CP-40 — first hypervisor, on a modified System/360 Model 40 (its successor CP-67 ran on the Model 67)
1972	IBM VM/370 — commercial virtual machine OS
1998	VMware founded — x86 virtualization via binary translation
1999	VMware Workstation 1.0 released
2003	Xen hypervisor open-sourced (paravirtualization)
2005–06	Intel VT-x and AMD-V — hardware-assisted virtualization
2007	KVM merged into Linux kernel (2.6.20)
2008	Microsoft Hyper-V released
2010s	Cloud era — EC2, GCE, Azure all built on hypervisors
2018–2020s	Lightweight VMMs: Firecracker (AWS Lambda, released 2018), Cloud Hypervisor

x86 did not satisfy the Popek & Goldberg (1974) requirements for trap-and-emulate virtualization. The gap was worked around in software by VMware's binary translation (1999) and closed in hardware by Intel VT-x (2005).

Why Virtualization? — The Core Value

Before Virtualization (Physical Servers)

1 app = 1 server
5–15% average CPU utilization
Weeks to provision new servers
Hardware lock-in

After Virtualization

Many apps on 1 server → 60–80% utilization
Minutes to create new VMs → agility
Hardware abstraction → portability
Snapshots & live migration → disaster recovery
Isolation → security boundaries between tenants

Part 3

The Hypervisor — Core Concepts

Hypervisor — Definition

A hypervisor (also called Virtual Machine Monitor — VMM) is software, firmware, or hardware that creates and runs virtual machines by separating a computer's software from its hardware.

What it does:

Partitions physical resources (CPU, memory, I/O) among VMs
Isolates VMs from each other
Emulates or paravirtualizes hardware for guest OSes
Schedules VM execution on physical CPUs
Intercepts privileged instructions from guest kernels

The Contract:

Each VM believes it has exclusive access to dedicated hardware. The hypervisor maintains this illusion while sharing the real hardware.

Types of Hypervisors — Overview

         TYPE 1 (Bare Metal)              TYPE 2 (Hosted)
     ┌────────┐  ┌────────┐         ┌────────┐  ┌────────┐
     │  VM 1  │  │  VM 2  │         │  VM 1  │  │  VM 2  │
     │Guest OS│  │Guest OS│         │Guest OS│  │Guest OS│
     └───┬────┘  └───┬────┘         └───┬────┘  └───┬────┘
         │           │                   │           │
  ┌──────┴───────────┴──────┐     ┌──────┴───────────┴──────┐
  │   TYPE 1 HYPERVISOR     │     │   TYPE 2 HYPERVISOR     │
  │   (runs ON hardware)    │     │   (runs ON host OS)     │
  ├─────────────────────────┤     ├─────────────────────────┤
  │     Physical Hardware   │     │     Host OS (Linux,     │
  └─────────────────────────┘     │     Windows, macOS)     │
                                  ├─────────────────────────┤
                                  │   Physical Hardware     │
                                  └─────────────────────────┘

Type 1 — Bare Metal Hypervisor

Runs directly on hardware — it is the operating system (or functionally replaces it).

Examples:

Hypervisor	Vendor	Notes
VMware ESXi	Broadcom	Industry standard for enterprise
Microsoft Hyper-V	Microsoft	Built into Windows Server
KVM	Linux/Red Hat	Kernel module — Linux IS the hypervisor
Xen	Linux Foundation	Used by AWS EC2 (legacy instances)
Proxmox VE	Proxmox	KVM + LXC, open-source
Firecracker	AWS	MicroVM for Lambda/Fargate
bhyve	FreeBSD	Native FreeBSD hypervisor

Type 1 — Characteristics

✅ Pros

Near-native performance — minimal overhead (1–5%)
Strong isolation — thin attack surface
Hardware-assisted — leverages VT-x/AMD-V, VT-d, SR-IOV
Scalable — run hundreds of VMs per host
Live migration — move running VMs between hosts with near-zero downtime

❌ Cons

Complex to set up — dedicated infrastructure
Requires compatible hardware — VT-x/AMD-V, IOMMU
Management overhead — needs vCenter, oVirt, Proxmox UI, etc.
Expensive licensing (VMware vSphere, Microsoft Datacenter)

🎯 Use Cases

Enterprise data centers, cloud providers, production workloads, multi-tenant hosting

Type 2 — Hosted Hypervisor

Runs as an application on top of a conventional operating system.

Examples:

Hypervisor	Platform	Notes
Oracle VirtualBox	Cross-platform	Free, open-source
VMware Workstation	Windows/Linux	Commercial, feature-rich
VMware Fusion	macOS	Workstation equivalent for Mac
Parallels Desktop	macOS	Best macOS integration
QEMU	Cross-platform	Emulator + virtualizer
GNOME Boxes	Linux	Simple QEMU frontend

Type 2 — Characteristics

✅ Pros

Easy to install — just another application
No dedicated hardware — runs on your laptop/desktop
Great host–guest integration — shared folders, clipboard, drag & drop
Snapshots — quick state save/restore
Perfect for development, testing, malware analysis

❌ Cons

Performance degradation — overhead from host OS layer
Security degradation — host OS compromise → all VMs compromised
Resource contention — competes with host OS and other apps
Not suitable for production — no live migration, limited HA

🎯 Use Cases

Development, testing, learning, malware sandboxing, running legacy apps

KVM — A Special Case (Type 1.5?)

┌──────────────────────────────────────────────┐
│              User Space (Linux)               │
│  ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│  │ QEMU/VM1 │ │ QEMU/VM2 │ │ Normal Apps  │ │
│  └────┬─────┘ └────┬─────┘ └──────────────┘ │
│       │             │                         │
├───────┴─────────────┴─────────────────────────┤
│          Linux Kernel + KVM module            │
│  ┌─────────────────────────────────────────┐  │
│  │  KVM: /dev/kvm — hardware virt. API     │  │
│  │  QEMU: device emulation in userspace    │  │
│  │  virtio: paravirtualized I/O drivers    │  │
│  └─────────────────────────────────────────┘  │
├───────────────────────────────────────────────┤
│     Hardware (VT-x / AMD-V, VT-d, SR-IOV)   │
└───────────────────────────────────────────────┘

KVM turns Linux itself into a Type-1 hypervisor. Each VM is a regular Linux process managed by QEMU. The kernel's scheduler, memory manager, and driver stack are reused — no separate hypervisor OS.
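Because KVM is exposed to user space through the `/dev/kvm` device node, a script can check whether the current user could start a KVM guest. This is a minimal sketch; the helper name `kvm_usable` is hypothetical, and the optional path argument exists only so the function can be exercised without the real device.

```shell
#!/bin/sh
# kvm_usable [DEVICE]: succeed if DEVICE (default /dev/kvm) is a character
# device that the current user can both read and write.
# (Hypothetical helper; QEMU needs read/write access to open the device.)
kvm_usable() {
    dev=${1:-/dev/kvm}
    [ -c "$dev" ] && [ -r "$dev" ] && [ -w "$dev" ]
}

# Usage:
#   kvm_usable && echo "KVM acceleration available" || echo "no usable /dev/kvm"
```

A missing device usually means the kvm modules are not loaded or virtualization is disabled in firmware; a permission failure usually means the user is not in the group that owns the device.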

Type 1 vs Type 2 — Comparison Matrix

Aspect	Type 1 (Bare Metal)	Type 2 (Hosted)
Runs on	Hardware directly	Host operating system
Performance	Near-native (1–5% overhead)	Moderate (10–30% overhead)
Security	Strong isolation	Host compromise = game over
Use case	Production, cloud, DC	Dev, test, sandbox
Boot time	Seconds (microVMs) to minutes	Minutes (host + hypervisor + VM)
Management	vCenter, oVirt, Proxmox	GUI application
Live migration	✅ Yes	❌ No
Cost	$$$ (licenses + dedicated HW)	$ (free or cheap)
Examples	ESXi, KVM, Hyper-V, Xen	VirtualBox, VMware Workstation

Part 4

Hypervisor Reference Model

(Popek & Goldberg, 1974)

The Three Pillars of a Hypervisor

Popek & Goldberg (1974) defined the formal requirements for virtualizable architectures and the three core modules that coordinate to emulate hardware:

         ┌─────────────────────────────────────┐
         │         VIRTUAL MACHINE             │
         │     (guest OS + applications)       │
         └──┬───────────┬──────────────┬───────┘
            │           │              │
            ▼           ▼              ▼
   ┌────────────┐ ┌──────────┐ ┌─────────────┐
   │ DISPATCHER │ │ALLOCATOR │ │ INTERPRETER  │
   │            │ │          │ │              │
   │ Entry point│ │ Resource │ │  Privileged  │
   │ Routes     │ │ manager  │ │  instruction │
   │ traps to   │ │ for VMs  │ │  emulation   │
   │ handlers   │ │          │ │              │
   └────────────┘ └──────────┘ └─────────────┘

Module 1: DISPATCHER

The entry point of the VMM. All traps from guest VMs arrive here first.

How it works:

1. Guest VM executes a sensitive instruction (privileged or behavior-sensitive)
2. Hardware traps to the hypervisor (via VT-x VMEXIT or ring transition)
3. Dispatcher receives the trap
4. Dispatcher examines the trap reason
5. Dispatcher routes to either the Allocator or the Interpreter

Modern implementations:

KVM: per-vendor VM-exit handler in the kernel module, e.g. vmx_handle_exit() on Intel
ESXi: VMM world trap handler
Xen: Hypercall handler + trap dispatch

Think of the Dispatcher as a traffic controller — it doesn't do the work, it decides who does.

Module 2: ALLOCATOR

Decides and manages what physical resources each VM gets.

Responsibilities:

CPU scheduling — which VM runs on which physical core, for how long
Memory allocation — how much RAM each VM gets, shadow/nested page tables
I/O assignment — virtual devices mapped to physical or emulated devices
Resource limits — prevent one VM from starving others

Triggered when:

A guest instruction changes the machine's resource mapping — e.g., setting up new page tables, accessing a new I/O port, changing interrupt vectors.

Modern implementations:

KVM: Linux CFS/EEVDF scheduler + KSM memory dedup + cgroups
ESXi: VMkernel CPU scheduler + memory ballooning (DRS balances VMs across hosts at the vCenter cluster level)
Xen: Credit/Credit2 scheduler + Xen grant tables

Module 3: INTERPRETER

Contains interpreter routines that emulate privileged instructions.

Responsibilities:

Execute an equivalent safe sequence when a guest tries a privileged operation
Emulate hardware behavior without giving the guest real hardware access
Maintain the virtual CPU state (virtual registers, flags, control registers)

Examples of interpreted instructions:

Guest Instruction	What Interpreter Does
`LGDT` (load GDT)	Updates virtual GDT, maintains shadow GDT
`MOV CR3` (page table switch)	Updates shadow/nested page tables
`IN/OUT` (I/O port access)	Routes to virtual device emulator
`HLT` (halt CPU)	Deschedules vCPU, wakes on virtual interrupt
`INVLPG` (TLB invalidation)	Flushes relevant shadow TLB entries

Reference Model in Action — Full Flow

Guest VM: MOV to CR3 (switch page tables)
    │
    ▼ VMEXIT (hardware trap)
┌──────────┐
│DISPATCHER │──── Trap reason: CR3 write
└──┬───┬───┘
   │   │
   │   ▼ (resource change detected)
   │ ┌──────────┐
   │ │ALLOCATOR │──── Update VM's memory mapping
   │ └──┬───────┘     Track new page table base address
   │    │
   │    ▼ (privileged instruction)
   ▼ ┌───────────┐
     │INTERPRETER│──── Emulate CR3 load safely
     └──┬────────┘     Update nested/shadow page tables
        │              Flush relevant TLB entries
        ▼
    VM entry via VMRESUME (resume guest)
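The routing decision in this flow can be sketched as a lookup from trap reason to handler. This is a toy model only: the function name `route_exit` and the reason names are made up for illustration, whereas real hypervisors switch on hardware-defined exit codes.

```shell
#!/bin/sh
# route_exit REASON
# Toy model of the Dispatcher: print which module(s) handle a trap reason.
# Reason names are illustrative, not real hardware exit codes.
route_exit() {
    case "$1" in
        cr3_write)        echo "allocator interpreter" ;;  # mapping changes, then emulate
        io_port|hlt|lgdt) echo "interpreter" ;;            # pure instruction emulation
        *)                echo "unhandled"; return 1 ;;
    esac
}

route_exit cr3_write    # prints: allocator interpreter
```

The point of the sketch is that the Dispatcher holds no emulation logic of its own; it only selects the handler.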

Popek & Goldberg — The Formal Theorem

Three types of instructions in an ISA:

Privileged instructions — cause a trap when executed in user mode
Sensitive instructions:
- Control-sensitive — change system configuration (e.g., I/O, page tables)
- Behavior-sensitive — behave differently depending on privilege level

The Theorem (1974):

A virtual machine monitor may be constructed for any conventional third-generation computer if the set of sensitive instructions is a subset of the set of privileged instructions.

The x86 Problem:

x86 had 17 sensitive but non-privileged instructions (e.g., SGDT, SIDT, POPF) — they didn't trap! Solutions:

Binary translation (VMware, 1999) — rewrite guest code on-the-fly
Paravirtualization (Xen, 2003) — modify guest OS to use hypercalls
Hardware VT-x/AMD-V (2005–06) — new CPU mode with proper trapping
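The theorem's condition is a plain subset test, which can be sketched over toy instruction lists. The helper name `untrapped` is hypothetical, and the lists hold only a few instructions named in the text, not the full ISA.

```shell
#!/bin/sh
# untrapped SENSITIVE PRIVILEGED
# Print the sensitive instructions that are NOT in the privileged set.
# Empty output means sensitive is a subset of privileged, so the
# Popek & Goldberg condition holds for the (toy) sets given.
untrapped() {
    out=""
    for insn in $1; do
        case " $2 " in
            *" $insn "*) ;;                   # privileged: traps, fine
            *) out="${out:+$out }$insn" ;;    # sensitive but does not trap
        esac
    done
    echo "$out"
}

# Pre-VT-x x86, using instructions mentioned above:
untrapped "LGDT SGDT SIDT POPF" "LGDT HLT"    # prints: SGDT SIDT POPF
```

Any non-empty result is exactly the set that binary translation, paravirtualization, or hardware support has to deal with.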

Hardware-Assisted Virtualization (VT-x)

┌─────────────────────────────────────────┐
│              VMX Operation              │
│                                         │
│  ┌─────────┐         ┌───────────────┐ │
│  │VMX root │◄──VMEXIT──│VMX non-root│ │
│  │(host/   │          │  (guest VM)  │ │
│  │hyperv.) │──VMENTER──►│             │ │
│  └─────────┘         └───────────────┘ │
│                                         │
│  VMCS (Virtual Machine Control Struct.) │
│  ┌─────────────────────────────────────┐│
│  │ Guest state area (regs, CR, EFER)  ││
│  │ Host state area (return state)     ││
│  │ VM-execution controls (what traps) ││
│  │ Exit reason + qualification        ││
│  └─────────────────────────────────────┘│
└─────────────────────────────────────────┘

VMLAUNCH/VMRESUME → enter guest (VMX non-root)
VMEXIT → trap back to host (VMX root) on sensitive operations
VMCS → per-vCPU control structure (what to trap, guest/host state)

Key Benefits of Hypervisors — Summary

🔧 Efficiency

Maximizes hardware utilization — run multiple virtual servers on one physical machine (60–80% utilization vs. 5–15% bare metal)

🔒 Isolation & Security

If one VM crashes or is compromised, others remain unaffected — strong security boundary (especially Type 1)

🔄 Flexibility

Run different operating systems (Linux, Windows, BSD) simultaneously on the same hardware

💰 Cost Savings

Reduces physical hardware count → lower energy, cooling, space, and maintenance costs

☁️ Cloud Foundation

Hypervisors are the foundation of modern cloud services — AWS (Xen → Nitro/Firecracker), Google Cloud (KVM), Azure (Hyper-V)

Part 5

Virtualization vs. Containerization

(Preview for Session 2)

The Two Paradigms

      VIRTUALIZATION                    CONTAINERIZATION

┌──────┐ ┌──────┐ ┌──────┐     ┌──────┐ ┌──────┐ ┌──────┐
│App A │ │App B │ │App C │     │App A │ │App B │ │App C │
├──────┤ ├──────┤ ├──────┤     ├──────┤ ├──────┤ ├──────┤
│Bins/ │ │Bins/ │ │Bins/ │     │Bins/ │ │Bins/ │ │Bins/ │
│Libs  │ │Libs  │ │Libs  │     │Libs  │ │Libs  │ │Libs  │
├──────┤ ├──────┤ ├──────┤     └──┬───┘ └──┬───┘ └──┬───┘
│GuestOS││GuestOS││GuestOS│        │        │        │
└──┬───┘ └──┬───┘ └──┬───┘   ┌────┴────────┴────────┴────┐
   │        │        │       │     Container Runtime      │
┌──┴────────┴────────┴────┐  │   (Docker/containerd/CRI-O)│
│       Hypervisor        │  ├────────────────────────────┤
├─────────────────────────┤  │    Host OS Kernel (shared) │
│   Hardware              │  ├────────────────────────────┤
└─────────────────────────┘  │    Hardware                │
                             └────────────────────────────┘

Side-by-Side Comparison

Aspect	Virtualization	Containerization
Isolation	Full OS per VM (strong)	Shared kernel (process-level)
Resource Usage	Heavy — each VM has full OS	Lightweight — shared kernel
Performance	1–5% overhead (Type 1)	Near-native (<1% overhead)
Startup Time	Seconds to minutes	Milliseconds to seconds
Image Size	GBs (full OS image)	MBs (only app + deps)
Portability	Less portable (OS-specific)	Highly portable (OCI images)
Density	10–100 VMs per host	100–1000+ containers per host
Security	Strong (hardware isolation)	Weaker (kernel shared)
Ecosystem	VMware, Hyper-V, KVM	Docker, Kubernetes, Podman
Best For	Multi-OS, legacy, strong isolation	Microservices, CI/CD, cloud-native

When to Use Which?

Choose Virtualization when:

Running different operating systems on one host
Need strong security isolation (multi-tenant, compliance)
Running legacy applications not designed for containers
Need full kernel control (custom kernel modules, drivers)
Compliance requires hardware-level separation

Choose Containerization when:

Building microservices architectures
Need fast scaling (autoscaling, burst capacity)
Want CI/CD pipeline integration (build → test → deploy)
Need maximum resource density (cost optimization)
Building cloud-native applications

🔑 Real-world answer: Both. Containers typically run inside VMs in production.

The Full Stack in Production

┌────────────────────────────────────────────────────┐
│              YOUR APPLICATIONS                     │
│   ┌──────────┐ ┌──────────┐ ┌──────────┐          │
│   │Container │ │Container │ │Container │  ...      │
│   └────┬─────┘ └────┬─────┘ └────┬─────┘          │
│        └──────┬──────┘            │                │
│        ┌──────┴───────────────────┴──────────┐     │
│        │   Kubernetes / Container Orchestrator │    │
│        ├─────────────────────────────────────┤     │
│        │   Container Runtime (containerd/CRI-O)│   │
│        ├─────────────────────────────────────┤     │
│        │   Linux Kernel (namespaces, cgroups) │    │
│        └─────────────────────────────────────┘     │
│                    ┌──────────┐                     │
│                    │    VM    │ ← VM per K8s node   │
│                    └────┬─────┘                     │
│               ┌─────────┴──────────┐                │
│               │   Hypervisor       │                │
│               │ (ESXi/KVM/Hyper-V) │                │
│               ├────────────────────┤                │
│               │ Physical Hardware  │                │
│               └────────────────────┘                │
└────────────────────────────────────────────────────┘

Next session: We'll deep-dive into what's inside that container & Kubernetes layer →

🧠 Session 1 — Key Takeaways

Bare metal gives maximum performance but wastes resources and is slow to provision
Hypervisors solve this by multiplexing hardware across isolated VMs
Type 1 (bare metal) hypervisors are for production; Type 2 (hosted) for dev/test
The Dispatcher → Allocator → Interpreter triad is the classic reference model for hypervisors
Hardware-assisted virtualization (VT-x/AMD-V) solved x86's virtualization gap
Containers ≠ replacement for VMs — they complement each other
In production: apps in containers, containers in VMs, VMs on hypervisors

📖 Session 1 — Recommended Reading

Popek & Goldberg, "Formal Requirements for Virtualizable Third Generation Architectures" (1974)
Intel SDM Volume 3, Chapters 23–33 — VMX architecture
Smith & Nair, "Virtual Machines: Versatile Platforms for Systems and Processes" (2005)
The Borg Paper (Google) — large-scale cluster management
Firecracker: Lightweight Virtualization for Serverless
KVM Architecture Overview

❓ Questions & Discussion

(10 minutes)

Discussion prompts:

When would you choose bare metal over VMs?
Why did cloud providers build on Type 1 hypervisors instead of containers alone?
What's the security difference between VM isolation and container isolation?

See you in Session 2!

From Containers to Kubernetes

🐳 → ☸️

masterclass-session-2--containers-to-kubernetes.md

🐳 Session 2

From Containers to Kubernetes

Master Class — Container Orchestration & CRI

Duration: 90 minutes
Level: Intermediate → Advanced
Prerequisite: Session 1 (Bare Metal to Virtualization)

📋 Session 2 — Agenda

#	Topic	Time
1	Linux Kernel Primitives for Containers	15 min
2	Containerization Deep Dive	10 min
3	Dockerfiles & Image Building	15 min
4	Container Runtimes: Docker, Podman, Buildx, CRI-O	15 min
5	Kubernetes Architecture	15 min
6	CRI — Container Runtime Interface	10 min
7	Bare Metal to K8s Cluster — Full Journey	5 min
8	Q&A / Discussion	5 min

Part 1

Linux Kernel Primitives

The Foundation Containers Are Built On

Containers Are NOT a Kernel Feature

There is no "container" system call in Linux. A "container" is a user-space concept built from combining multiple independent kernel primitives.

The building blocks:

Primitive	Purpose	Year
chroot	Filesystem isolation	1979
Namespaces	Resource visibility isolation	2002–2020
cgroups (v1/v2)	Resource limits & accounting	2007/2016
seccomp-bpf	System call filtering	2012
Capabilities	Fine-grained privilege control	1999 (POSIX)
AppArmor / SELinux	Mandatory access control	2010 / 2003 (mainline merge)
OverlayFS	Layered filesystem (image layers)	2014

A container = namespaces + cgroups + seccomp + capabilities + rootfs

Linux Namespaces — Visibility Isolation

Namespaces make a process think it's alone on the system.

Namespace	Flag	Isolates	Since
Mount	`CLONE_NEWNS`	Filesystem mount points	2.4.19 (2002)
UTS	`CLONE_NEWUTS`	Hostname and domain name	2.6.19 (2006)
IPC	`CLONE_NEWIPC`	System V IPC, POSIX MQs	2.6.19 (2006)
PID	`CLONE_NEWPID`	Process IDs (init = PID 1)	2.6.24 (2008)
Network	`CLONE_NEWNET`	Network stack, interfaces, routes	2.6.24 (2008)
User	`CLONE_NEWUSER`	UID/GID mappings (rootless)	3.8 (2013)
Cgroup	`CLONE_NEWCGROUP`	cgroup root visibility	4.6 (2016)
Time	`CLONE_NEWTIME`	Boot and monotonic clocks	5.6 (2020)

# Create a process with new PID + NET + MNT namespaces (needs root or CAP_SYS_ADMIN):
unshare --pid --net --mount --fork /bin/bash
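Namespace membership is visible under `/proc/<pid>/ns/`, where each entry is a symlink whose target names the namespace. The sketch below reads those links; the helper names `ns_id` and `same_ns` are hypothetical, and it assumes a Linux host with `/proc` mounted.

```shell
#!/bin/sh
# ns_id TYPE [PID]: print the namespace identifier, e.g. "net:[4026531840]".
# TYPE is one of: mnt uts ipc pid net user cgroup time.
ns_id() {
    readlink "/proc/${2:-self}/ns/$1"
}

# same_ns TYPE PID_A PID_B: succeed if both PIDs are in the same namespace.
# Fails (rather than guessing) if either link cannot be read.
same_ns() {
    a=$(ns_id "$1" "$2") || return 1
    b=$(ns_id "$1" "$3") || return 1
    [ "$a" = "$b" ]
}

# Usage (Linux only):
#   ns_id pid            # PID namespace of the calling process
#   same_ns net $$ 1     # same network namespace as init? (may need root)
```

Running `ns_id net` before and inside an `unshare --net` shell shows two different identifiers, which is the isolation the table describes.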

Network Namespace — Deep Dive

Each network namespace has its own complete, isolated network stack.

┌─────────────────── Host Network Namespace ───────────────────┐
│                                                               │
│  eth0 (physical NIC)    docker0 (bridge)      veth-host-1    │
│  10.0.0.5               172.17.0.1            ─────────┐     │
│                              │                          │     │
│                              │                     ┌────┤     │
│                              │                     │veth│     │
│                              │                     │pair│     │
│                              │                     └────┤     │
│                                                         │     │
│  ┌──────────── Container Network Namespace ──────────┐  │     │
│  │                                                   │  │     │
│  │  eth0 (veth-container-1)     lo (loopback)        │  │     │
│  │  172.17.0.2                  127.0.0.1            │  │     │
│  │                                                   │  │     │
│  │  Routing table:   default via 172.17.0.1          │  │     │
│  │  iptables rules:  (independent)                   │  │     │
│  │  /proc/net/...:   (container's own view)          │  │     │
│  │                                                   │  │     │
│  └───────────────────────────────────────────────────┘  │     │
│                                                               │
└───────────────────────────────────────────────────────────────┘

PID Namespace — Process Isolation

Host PID Namespace:
  PID 1: systemd
  PID 100: dockerd
  PID 200: containerd
  PID 500: containerd-shim → (container process)
  PID 501: nginx (from host's view)

Container PID Namespace:
  PID 1: nginx  ← same process, different PID!
  PID 2: nginx worker
  PID 3: nginx worker

  # Inside container:
  $ ps aux
  PID  USER  COMMAND
  1    root  nginx: master process
  2    nginx nginx: worker process
  3    nginx nginx: worker process
  # Can't see host processes = isolation

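Container's PID 1 = the entrypoint process. If PID 1 dies → the container stops. Signal handling for PID 1 is special: the kernel delivers a signal to PID 1 only if the process has installed a handler for it, so an unhandled SIGTERM is ignored instead of stopping the container (SIGKILL sent from the host still works).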
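One common way to cope with PID 1 signal semantics is a thin wrapper that installs its own handler and forwards the signal to the real workload. This is a minimal sketch, not a replacement for a purpose-built init; the function name `forward_signals` is hypothetical.

```shell
#!/bin/sh
# forward_signals CMD [ARGS...]
# Run CMD as a child, forward SIGTERM/SIGINT to it, and return its exit status.
# Intended as the body of a container entrypoint script (hypothetical helper).
forward_signals() {
    "$@" &
    child=$!
    # Installing a handler is what makes SIGTERM deliverable to PID 1.
    trap 'kill -TERM "$child" 2>/dev/null' TERM INT
    status=0
    wait "$child" || status=$?
    # A trapped signal interrupts wait with a status above 128;
    # wait again so the child's real exit status is collected.
    if [ "$status" -gt 128 ] && kill -0 "$child" 2>/dev/null; then
        status=0
        wait "$child" || status=$?
    fi
    trap - TERM INT
    return "$status"
}

# Usage in an entrypoint script:
#   forward_signals nginx -g 'daemon off;'
```

When the workload itself handles SIGTERM, running it directly with the exec form of ENTRYPOINT avoids the wrapper entirely.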

cgroups — Resource Limits & Accounting

Control Groups limit, account for, and isolate resource usage (CPU, memory, I/O, network).

cgroups v2 hierarchy (modern):

/sys/fs/cgroup/
├── system.slice/              ← system services
├── user.slice/                ← user sessions
└── kubepods.slice/            ← Kubernetes pods
    ├── kubepods-burstable.slice/
    │   └── kubepods-burstable-pod<UID>.slice/
    │       ├── cri-containerd-<ID>.scope/
    │       │   ├── cpu.max          → "100000 100000" (100% of 1 CPU)
    │       │   ├── memory.max       → "536870912" (512 MiB)
    │       │   ├── memory.current   → "234881024" (current usage)
    │       │   ├── io.max           → "8:0 rbps=104857600" (100 MiB/s)
    │       │   └── pids.max         → "1024"

Key cgroup controllers:

Controller	Manages
`cpu`	CPU time (weight, quota, period)
`memory`	Memory limit, swap, OOM behavior
`io`	Block I/O bandwidth, IOPS
`pids`	Max number of processes
`cpuset`	Pin to specific CPUs/NUMA nodes
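The values in `cpu.max` and `memory.max` are plain text, so they are easy to interpret programmatically. A small sketch, assuming the cgroups v2 file formats shown in the tree above (the function names are made up for illustration):

```python
def parse_cpu_max(text):
    """Return the CPU limit in cores, or None when the quota is 'max'."""
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)

def parse_memory_max(text):
    """Return the memory limit in MiB, or None when the limit is 'max'."""
    value = text.strip()
    if value == "max":
        return None
    return int(value) / (1024 * 1024)

if __name__ == "__main__":
    print(parse_cpu_max("100000 100000"))   # quota equals period: one full CPU
    print(parse_memory_max("536870912"))    # bytes converted to MiB
```

On a real node the input would come from reading the files under the container's `.scope` directory.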

seccomp-bpf — System Call Filtering

seccomp restricts which system calls a process can make.

Default Docker seccomp profile blocks ~44 of ~330+ syscalls:

BLOCKED (dangerous):                  ALLOWED (safe):
├── reboot()                         ├── read() / write()
├── kexec_load()                     ├── open() / close()
├── mount() / umount2()              ├── mmap() / mprotect()
├── swapon() / swapoff()             ├── socket() / connect()
├── init_module() / delete_module()  ├── fork() / clone()
├── acct()                           ├── execve()
├── settimeofday()                   ├── getpid() / getuid()
├── syslog()                         ├── stat() / fstat()
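├── ptrace() *                       └── ... (most normal ops)
└── bpf() *

* Conditional: allowed only when the container holds the matching capability (for example CAP_SYS_PTRACE or CAP_SYS_ADMIN); details vary by Docker and kernel version.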

Defense in depth: Even if an attacker escapes namespace isolation, seccomp blocks dangerous kernel interactions.
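Conceptually, a seccomp profile is a default action plus a list of per-syscall rules. The sketch below evaluates a tiny profile written in the JSON shape Docker uses (`defaultAction`, `syscalls[].names`, `syscalls[].action`). It is a simplification: real profiles can also match on arguments and capabilities, which this ignores, and the profile content itself is invented for the example.

```python
import json

# Hypothetical, heavily shortened profile: deny by default, allow a few calls.
PROFILE_JSON = """
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {"names": ["read", "write", "openat", "close", "execve"],
     "action": "SCMP_ACT_ALLOW"}
  ]
}
"""

def action_for(profile, syscall):
    """Return the action of the first rule naming `syscall`, else the default."""
    for rule in profile.get("syscalls", []):
        if syscall in rule["names"]:
            return rule["action"]
    return profile["defaultAction"]

if __name__ == "__main__":
    profile = json.loads(PROFILE_JSON)
    for name in ("read", "reboot"):
        print(name, action_for(profile, name))
```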

Capabilities — Fine-Grained Privileges

Linux splits the old root / non-root binary into ~40 individual capabilities.

Traditional model:        Capabilities model:
  UID 0 = ALL power         CAP_NET_BIND_SERVICE → bind < 1024
  UID !0 = no power         CAP_NET_RAW          → raw sockets
                            CAP_SYS_ADMIN        → mount, bpf, ...
                            CAP_SYS_PTRACE       → ptrace
                            CAP_DAC_OVERRIDE     → bypass file perms
                            CAP_CHOWN            → change file owner
                            ... (~40 total)

Docker default capability set (whitelist):

GRANTED (14):                       DROPPED (everything else):
  CAP_CHOWN                          CAP_SYS_ADMIN
  CAP_DAC_OVERRIDE                   CAP_NET_ADMIN
  CAP_FSETID                         CAP_SYS_PTRACE
  CAP_FOWNER                         CAP_SYS_MODULE
  CAP_NET_BIND_SERVICE               CAP_SYS_RAWIO
  CAP_SETGID / CAP_SETUID            CAP_SYS_TIME
  CAP_KILL / CAP_AUDIT_WRITE         ...
  CAP_SETPCAP / CAP_SETFCAP
  CAP_MKNOD / CAP_SYS_CHROOT
  CAP_NET_RAW (*)

(*) Docker grants CAP_NET_RAW by default; some other runtimes (for example CRI-O) drop it.
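Inside a container, the effective set is visible as the hexadecimal `CapEff` field in `/proc/self/status`. The sketch below decodes such a mask into names using the bit positions from `<linux/capability.h>`. The sample mask `00000000a80425fb` is assumed here to be what a default Docker container reports; verify it on your own system.

```python
# Bit positions 0..31 as defined in <linux/capability.h>.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE",
    "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK",
    "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT",
    "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT",
    "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG",
    "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL",
    "CAP_SETFCAP",
]

def decode_caps(mask_hex):
    """Translate a CapEff-style hex mask into a list of capability names."""
    mask = int(mask_hex, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]

if __name__ == "__main__":
    # Assumed sample value for a default Docker container.
    for name in decode_caps("00000000a80425fb"):
        print(name)
```

Capabilities numbered 32 and above (such as CAP_BPF) are left out of the table to keep the sketch short.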

Part 2

Containerization Deep Dive

What is a Container?

A container is an isolated process (or group of processes) running on a shared kernel, with its own filesystem, network, and process view.

    Container = isolated process on shared kernel

    ┌─ Namespace isolation ──────────────────────┐
    │                                             │
    │  PID namespace  → own PID 1                 │
    │  NET namespace  → own eth0, routes, iptables│
    │  MNT namespace  → own root filesystem       │
    │  UTS namespace  → own hostname              │
    │  USER namespace → own uid mapping           │
    │                                             │
    │  + cgroup limits (CPU, mem, I/O)            │
    │  + seccomp filter (syscall whitelist)       │
    │  + capabilities (fine-grained privs)        │
    │  + OverlayFS (layered rootfs)               │
    │                                             │
    └─────────────────────────────────────────────┘
         │
         ▼
    Host Linux Kernel (shared by ALL containers)

Container Images — Layered Filesystem

A container image is a stack of read-only layers; each running container adds a thin read-write layer on top.

┌─────────────────────────────────────────────┐
│  Writable Container Layer (ephemeral)       │  ← changes here
├─────────────────────────────────────────────┤
│  Layer 4: COPY app.py /app/                 │  ← your code
├─────────────────────────────────────────────┤
│  Layer 3: RUN pip install flask             │  ← dependencies
├─────────────────────────────────────────────┤
│  Layer 2: RUN apt-get install python3       │  ← runtime
├─────────────────────────────────────────────┤
│  Layer 1: Ubuntu 24.04 base image           │  ← base OS
└─────────────────────────────────────────────┘

Storage driver: OverlayFS (overlay2)
  - Lower layers: read-only, shared between containers
  - Upper layer: read-write, container-specific (copy-on-write)
  - Merged view: union mount presented to the container

Efficiency: 100 containers from the same image share the same base layers — only the writable layer is unique per container.
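The lookup and copy-on-write behavior can be modeled with plain dictionaries. This is only a toy model of the idea (paths mapped to contents, a set of whiteouts for deletions), not how the kernel's overlay driver is implemented; the class name is invented for the example.

```python
class OverlayView:
    """Toy union view: shared read-only lower layers plus a private upper layer."""

    def __init__(self, lower_layers):
        self.lower = lower_layers   # list of dicts, base layer first; never modified
        self.upper = {}             # container-specific writable layer
        self.whiteouts = set()      # paths deleted in this container

    def read(self, path):
        if path in self.whiteouts:
            raise FileNotFoundError(path)
        if path in self.upper:
            return self.upper[path]
        for layer in reversed(self.lower):   # topmost lower layer wins
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        # Copy-on-write: the change lands in the upper layer only.
        self.whiteouts.discard(path)
        self.upper[path] = data

    def delete(self, path):
        self.read(path)             # raises if the path does not exist
        self.upper.pop(path, None)
        self.whiteouts.add(path)    # hide the lower copy without touching it
```

Two `OverlayView` objects built from the same layer list share those layers, so a write in one view is invisible to the other.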

OCI Standards — The Universal Contract

The Open Container Initiative (OCI) defines open standards so images and runtimes are interchangeable.

Three specifications:

Spec	Defines	Key Points
Image Spec	Image format & layout	Layers, manifests, config JSON
Runtime Spec	How to run a container	`config.json` → namespaces, mounts, hooks
Distribution Spec	How to push/pull images	Registry API (Docker Hub, GHCR, ECR)

OCI Image Manifest:
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": { "digest": "sha256:abc...", "size": 7023 },
  "layers": [
    { "digest": "sha256:def...", "size": 32654 },   ← base
    { "digest": "sha256:ghi...", "size": 16724 },   ← deps
    { "digest": "sha256:jkl...", "size": 73109 }    ← app
  ]
}
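Every `digest` in the manifest is the SHA-256 of the referenced blob, which is what makes images content-addressable. A short sketch of building and checking such a descriptor; the helper names are illustrative and the descriptor is reduced to the two fields shown above:

```python
import hashlib

def digest_of(blob: bytes) -> str:
    """Return the OCI-style digest string for a blob."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()

def descriptor_for(blob: bytes) -> dict:
    """Build a minimal descriptor (digest + size) for a blob."""
    return {"digest": digest_of(blob), "size": len(blob)}

def verify(blob: bytes, descriptor: dict) -> bool:
    """Check that a downloaded blob matches what the manifest promised."""
    return (len(blob) == descriptor["size"]
            and digest_of(blob) == descriptor["digest"])

if __name__ == "__main__":
    layer = b"pretend this is a gzipped tar layer"
    desc = descriptor_for(layer)
    print(desc["digest"], verify(layer, desc), verify(layer + b"x", desc))
```

A client that verifies each layer this way detects corrupted or tampered blobs regardless of which registry served them.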

Part 3

Dockerfiles & Image Building

Dockerfile — Anatomy

# ─── Build stage ──────────────────────────────────
FROM golang:1.22-alpine AS builder
WORKDIR /src

# Cache dependencies separately from source code
COPY go.mod go.sum ./
RUN go mod download

# Copy source and build
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o /app/server ./cmd/server

# ─── Production stage ─────────────────────────────
FROM gcr.io/distroless/static-debian12:nonroot

COPY --from=builder /app/server /server

EXPOSE 8080
USER nonroot:nonroot

ENTRYPOINT ["/server"]

Key principles:

Multi-stage builds → small final images (no compiler in production)
Layer caching → put rarely-changing layers first (go.mod before source)
Non-root user → security best practice
Distroless base → minimal attack surface (no shell, no package manager)

Dockerfile Best Practices

DO ✅

# Pin versions for reproducibility
FROM python:3.12.3-slim-bookworm

# Combine RUN commands to reduce layers
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl && \
    rm -rf /var/lib/apt/lists/*

# Copy dependency file first for caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Use .dockerignore to exclude unnecessary files
# Run as non-root
USER 1000:1000

DON'T ❌

FROM ubuntu:latest           # unpinned tag
RUN apt-get update           # separate from install (cache bug)
RUN apt-get install python3  # missing -y, missing cleanup
COPY . .                     # copies everything including .git
USER root                    # running as root in production
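Two of these DON'T patterns are easy to catch automatically. The sketch below is a deliberately naive checker written for this handout, not an existing tool: it flags unpinned base images and a final user of root. It will also flag `FROM scratch` and references to earlier build stages, which a real linter would handle.

```python
def lint_dockerfile(text):
    """Return findings for unpinned base images and a root final user."""
    findings = []
    last_user = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        instruction = parts[0].upper()
        args = [p for p in parts[1:] if not p.startswith("--")]
        if instruction == "FROM" and args:
            image = args[0]
            name = image.rsplit("/", 1)[-1]   # ignore a registry host:port prefix
            pinned = "@" in image or (":" in name and not name.endswith(":latest"))
            if not pinned:
                findings.append(f"unpinned base image: {image}")
        elif instruction == "USER" and args:
            last_user = args[0]
    if last_user is None or last_user.split(":")[0] in ("root", "0"):
        findings.append("final USER is root (or never set)")
    return findings
```

Running it over the two examples above should report nothing for the DO version and both findings for the DON'T version.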

Multi-Architecture Builds with `docker buildx`

# Create a multi-platform builder
docker buildx create --name mybuilder --use --bootstrap

# Build for multiple architectures simultaneously
docker buildx build \
  --platform linux/amd64,linux/arm64,linux/arm/v7 \
  --tag myregistry/myapp:1.0.0 \
  --push \
  .

How it works:

docker buildx build --platform linux/amd64,linux/arm64
         │
         ▼
┌─────────────────────────────────────────┐
│  BuildKit (Moby buildkitd)              │
│                                         │
│  amd64: native build on x86_64 host    │
│  arm64: QEMU emulation (or remote)      │
│                                         │
│  → Multi-arch manifest (fat manifest)   │
│    ├── linux/amd64 → sha256:abc...      │
│    └── linux/arm64 → sha256:def...      │
└─────────────────────────────────────────┘
         │
         ▼
    Client selects the matching image from
    the manifest list based on its platform
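Picking the right image out of a multi-arch manifest is a simple match on the `platform` object of each entry. A sketch using the field names from the OCI image index (`manifests`, `platform.os`, `platform.architecture`, `platform.variant`); the digests are placeholders, not real values:

```python
# Hypothetical image index with placeholder digests.
INDEX = {
    "manifests": [
        {"digest": "sha256:placeholder-amd64",
         "platform": {"os": "linux", "architecture": "amd64"}},
        {"digest": "sha256:placeholder-arm64",
         "platform": {"os": "linux", "architecture": "arm64"}},
        {"digest": "sha256:placeholder-armv7",
         "platform": {"os": "linux", "architecture": "arm", "variant": "v7"}},
    ]
}

def select_manifest(index, os_name, architecture, variant=None):
    """Return the digest of the first entry matching the requested platform."""
    for entry in index["manifests"]:
        platform = entry.get("platform", {})
        if platform.get("os") != os_name:
            continue
        if platform.get("architecture") != architecture:
            continue
        if variant is not None and platform.get("variant") != variant:
            continue
        return entry["digest"]
    return None
```

Real clients apply more elaborate matching (for example variant fallbacks), which this sketch leaves out.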

Part 4

Container Runtimes

Docker, Podman, containerd, CRI-O

The Container Runtime Landscape

                                HIGH-LEVEL
                          (Image mgmt + Build + API)
                    ┌─────────────────┬──────────┬──────────┐
                    │  Docker Engine  │  Podman  │ nerdctl  │
                    │  (dockerd)      │          │          │
                    └───────┬─────────┴────┬─────┴────┬─────┘
                            │              │          │
                                MID-LEVEL
                    (Container lifecycle management)
                    ┌───────┴──────────────┴──────────┴─────┐
                    │   containerd       │    CRI-O         │
                    │  (Docker's core)   │  (K8s native)    │
                    └───────┬────────────┴──────┬───────────┘
                            │                   │
                          LOW-LEVEL (OCI Runtime)
                    ┌───────┴───────────────────┴───────────┐
                    │  runc   │  crun   │  kata   │ gVisor  │
                    │ (ref    │ (fast,  │ (micro  │ (user   │
                    │  impl)  │  C)     │  VMs)   │  space) │
                    └─────────┴─────────┴─────────┴─────────┘

Docker Engine — Architecture

                     docker CLI
                        │
                   REST API (unix socket / TCP)
                        │
                        ▼
               ┌─────────────────┐
               │    dockerd      │  ← Docker daemon
               │  (Docker Engine)│     Image builds, networking,
               │                 │     volumes, orchestration
               └────────┬────────┘
                        │ gRPC
                        ▼
               ┌─────────────────┐
               │   containerd    │  ← Container runtime
               │                 │     Lifecycle, snapshots,
               │                 │     image pull/push
               └────────┬────────┘
                        │ OCI runtime exec
                        ▼
               ┌─────────────────┐
               │   runc          │  ← OCI runtime
               │                 │     clone() + execve()
               │                 │     namespaces + cgroups
               └─────────────────┘

Docker = dockerd + containerd + runc — three layers. Kubernetes talks to containerd directly, skipping dockerd.

Podman — Daemonless Alternative

                     podman CLI
                        │
                    (direct, no daemon)
                        │
                        ▼
               ┌─────────────────┐
               │    Podman       │  ← Library (libpod)
               │   (no daemon!)  │     Fork + exec model
               │                 │     Rootless by default
               └────────┬────────┘
                        │
                        ▼
               ┌─────────────────┐
               │   conmon        │  ← Container monitor
               │                 │     Holds stdio, exit code
               └────────┬────────┘
                        │
                        ▼
               ┌─────────────────┐
               │ crun (or runc)  │  ← OCI runtime
               └─────────────────┘

Key differences from Docker:

No daemon — each podman call is a direct process
Rootless by default — runs entirely in user namespaces
Pod-native — podman pod directly models K8s pods
Docker CLI compatible — alias docker=podman works
systemd integration — podman generate systemd

Docker vs Podman vs Buildah vs Skopeo

Tool	Role	Daemon?	Root?
Docker	Build + Run + Push/Pull	Yes (dockerd)	Yes (default)
Podman	Run containers (+ build)	No	No (rootless)
Buildah	Build OCI images	No	No (rootless)
Skopeo	Copy/inspect images between registries	No	No
Buildx	Docker multi-arch builder (BuildKit)	Yes (BuildKit)	Yes

The Red Hat container toolchain:

Building images:     Buildah (or Podman build)
Running containers:  Podman
Moving images:       Skopeo
Kubernetes runtime:  CRI-O
OCI runtime:         crun (or runc)

The Docker toolchain:

Building images:     docker build (or docker buildx)
Running containers:  docker run (dockerd → containerd → runc)
Moving images:       docker push/pull
Kubernetes runtime:  containerd
OCI runtime:         runc

OCI Runtimes — The Lowest Layer

Runtime	Language	Key Feature
runc	Go	Reference implementation, Docker default
crun	C	2× faster startup, lower memory, Podman default
Kata Containers	Go	Runs each container in a lightweight VM (hardware isolation)
gVisor (runsc)	Go	User-space kernel — intercepts syscalls (Google)
Firecracker	Rust	MicroVM monitor (VMM) behind AWS Lambda / Fargate; not an OCI runtime itself, used via firecracker-containerd or Kata
youki	Rust	Rust rewrite of runc

Security spectrum:

Less isolation ◄──────────────────────────────► More isolation
   (faster)                                      (slower)

  runc/crun         gVisor            Kata / Firecracker
  namespaces        user-space        micro-VMs
  + cgroups         kernel            + hardware isolation
  (shared kernel)   (syscall proxy)   (separate kernel per container)

Part 5

Kubernetes Architecture

What is Kubernetes?

Kubernetes (K8s) is an open-source container orchestration platform that automates deployment, scaling, and management of containerized applications.

What K8s manages:

Scheduling — which node runs which container
Scaling — horizontal pod autoscaling (HPA)
Networking — service discovery, load balancing, ingress
Storage — persistent volumes, storage classes
Self-healing — restart failed containers, reschedule on healthy nodes
Rolling updates — zero-downtime deployments
Config & secrets — centralized configuration management

What K8s is NOT:

Not a PaaS (no app-level framework)
Not a CI/CD system (but integrates with them)
Not a VM orchestrator (that's OpenStack, vSphere)

Kubernetes Architecture — The Big Picture

┌─────────────────── Control Plane ───────────────────┐
│                                                      │
│  ┌──────────┐  ┌──────────────┐  ┌───────────────┐  │
│  │ API      │  │ etcd         │  │ Scheduler     │  │
│  │ Server   │  │ (consensus   │  │ (where to     │  │
│  │ (REST)   │  │  store)      │  │  place pods)  │  │
│  └────┬─────┘  └──────────────┘  └───────────────┘  │
│       │                                              │
│  ┌────┴─────────────────┐  ┌──────────────────────┐  │
│  │ Controller Manager   │  │ Cloud Controller Mgr │  │
│  │ (reconciliation      │  │ (cloud-specific:     │  │
│  │  loops)              │  │  LB, nodes, routes)  │  │
│  └──────────────────────┘  └──────────────────────┘  │
└──────────────────────────────────────────────────────┘
              │ kubectl / API calls
              ▼
┌─────────────── Worker Nodes ────────────────────────┐
│  Node 1              Node 2              Node 3     │
│  ┌──────────┐       ┌──────────┐       ┌────────┐  │
│  │ kubelet  │       │ kubelet  │       │kubelet │  │
│  │ CRI ─────┤       │ CRI ─────┤       │CRI ───┤  │
│  │ runtime  │       │ runtime  │       │runtime │  │
│  │ kube-    │       │ kube-    │       │kube-   │  │
│  │  proxy   │       │  proxy   │       │ proxy  │  │
│  └──────────┘       └──────────┘       └────────┘  │
└─────────────────────────────────────────────────────┘

Control Plane Components

API Server (`kube-apiserver`)

Front door for all K8s operations
RESTful API — all communication goes through here
Authentication, authorization (RBAC), admission control
Stores/retrieves state from etcd

etcd

Distributed key-value store (Raft consensus)
Single source of truth for all cluster state
All K8s objects stored here (pods, services, secrets, configs)

Scheduler (`kube-scheduler`)

Decides which node a new pod should run on
Considers: resources, affinity, anti-affinity, taints, tolerations
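The scheduler's decision boils down to two phases: filter out nodes that cannot run the pod, then score the rest. The sketch below is a heavily simplified model of that idea, with made-up dictionary fields and a single scoring rule; the real kube-scheduler uses a plugin framework with many more filters and scores.

```python
def schedule(pod, nodes):
    """Pick a node name for `pod`, or None if no node fits."""
    feasible = []
    for node in nodes:
        # Filter 1: enough free CPU and memory.
        fits = node["free_cpu"] >= pod["cpu"] and node["free_mem"] >= pod["mem"]
        # Filter 2: every taint on the node must be tolerated by the pod.
        tolerated = set(node.get("taints", [])) <= set(pod.get("tolerations", []))
        if fits and tolerated:
            feasible.append(node)
    if not feasible:
        return None   # the pod would stay Pending
    # Score: prefer the node with the most CPU left after placement.
    return max(feasible, key=lambda n: n["free_cpu"] - pod["cpu"])["name"]
```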

Controller Manager (`kube-controller-manager`)

Runs reconciliation loops (controllers)
Deployment controller, ReplicaSet controller, Node controller, etc.
Watches desired state vs actual state → takes corrective action
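A reconciliation loop compares desired and actual state and emits the actions that close the gap. A minimal sketch for a replica count, with an invented action format; real controllers watch the API server and act through it rather than returning a list.

```python
def reconcile(desired_replicas, running_pods):
    """Return the actions needed to move actual state toward desired state."""
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        # Too few pods: create the missing ones.
        return [("create", None)] * diff
    if diff < 0:
        # Too many pods: delete the surplus (here: first names alphabetically).
        return [("delete", name) for name in sorted(running_pods)[:-diff]]
    return []   # already converged, nothing to do

if __name__ == "__main__":
    print(reconcile(3, ["web-a"]))
    print(reconcile(1, ["web-b", "web-a", "web-c"]))
```

Running the function again after applying its actions returns an empty list, which is the idempotence that makes the loop safe to repeat.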

Worker Node Components

kubelet

Agent running on every node
Receives pod specs from API server
Calls the Container Runtime via CRI to start/stop containers
Reports node and pod status back to control plane
Manages liveness/readiness probes

Container Runtime

Actually runs the containers
Must implement the CRI (Container Runtime Interface)
Options: containerd, CRI-O, or Docker Engine through the external cri-dockerd adapter (the built-in dockershim was removed in v1.24)

kube-proxy

Manages network rules on each node
Implements Kubernetes Services (ClusterIP, NodePort, LoadBalancer)
Modes: iptables (default), IPVS (high-performance), nftables (new)

The Pod — Kubernetes' Atomic Unit

┌─────────────────── Pod ──────────────────────┐
│                                               │
│  Shared:                                      │
│  ├── Network namespace (same IP, localhost)   │
│  ├── IPC namespace                            │
│  ├── Volumes (shared storage)                 │
│  └── (optionally) PID namespace               │
│                                               │
│  ┌─────────────┐  ┌──────────────────┐        │
│  │ Container 1 │  │ Container 2      │        │
│  │ (main app)  │  │ (sidecar/proxy)  │        │
│  │             │  │                  │        │
│  │ Port 8080   │  │ Port 15001       │        │
│  └─────────────┘  └──────────────────┘        │
│                                               │
│  ┌─────────────────────────────┐              │
│  │ Init Container(s)          │ ← run first  │
│  │ (setup, migration, etc.)   │   then exit  │
│  └─────────────────────────────┘              │
│                                               │
│  Pod IP: 10.244.1.5 (all containers share)    │
│  Node: worker-02                              │
└───────────────────────────────────────────────┘

All containers in a pod share the same network namespace — they can reach each other on localhost.

Part 6

CRI — Container Runtime Interface

The Bridge Between kubelet and Containers

Why CRI Exists

The Docker Problem (pre-CRI):

Before CRI (K8s < 1.5):

  kubelet ──── Docker-specific code ──── dockerd ──── containerd ──── runc

  Problems:
  ✗ Kubelet was hardcoded to Docker's API
  ✗ Adding a new runtime = modifying kubelet source code
  ✗ Docker daemon had features K8s didn't need (build, swarm, etc.)
  ✗ Extra layer of indirection (kubelet→dockerd→containerd→runc)

The CRI Solution (introduced in K8s 1.5; CRI v1 API required since 1.26):

After CRI:

  kubelet ──── CRI (gRPC) ──── containerd ──── runc
                    or
  kubelet ──── CRI (gRPC) ──── CRI-O     ──── runc/crun

  Benefits:
  ✓ kubelet is runtime-agnostic
  ✓ Any CRI-compliant runtime works
  ✓ Direct path — no unnecessary Docker daemon
  ✓ Kubernetes removed dockershim in v1.24

CRI — The gRPC Interface

CRI defines a gRPC protocol with two services:

1. RuntimeService — Container Lifecycle

service RuntimeService {
    // Sandbox (pod-level) operations
    rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse);
    rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse);
    rpc RemovePodSandbox(RemovePodSandboxRequest) returns (RemovePodSandboxResponse);

    // Container operations
    rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse);
    rpc StartContainer(StartContainerRequest) returns (StartContainerResponse);
    rpc StopContainer(StopContainerRequest) returns (StopContainerResponse);
    rpc RemoveContainer(RemoveContainerRequest) returns (RemoveContainerResponse);

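    // Exec / Attach / Port-forward
    rpc ExecSync(ExecSyncRequest) returns (ExecSyncResponse);
    rpc Exec(ExecRequest) returns (ExecResponse);
    rpc Attach(AttachRequest) returns (AttachResponse);
    rpc PortForward(PortForwardRequest) returns (PortForwardResponse);

    // ... (excerpt; the full service also defines List* and *Status RPCs)
}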

2. ImageService — Image Management

service ImageService {
    rpc PullImage(PullImageRequest) returns (PullImageResponse);
    rpc RemoveImage(RemoveImageRequest) returns (RemoveImageResponse);
    rpc ListImages(ListImagesRequest) returns (ListImagesResponse);
    rpc ImageStatus(ImageStatusRequest) returns (ImageStatusResponse);
}

CRI-O — Purpose-Built for Kubernetes

CRI-O is a lightweight CRI implementation that does one thing: run containers for Kubernetes.

┌───────────────── kubelet ──────────────────┐
│                                             │
│  "I need a pod with nginx:1.27 container"  │
│                                             │
└──────────────────┬──────────────────────────┘
                   │ CRI gRPC (unix socket)
                   ▼
┌───────────────── CRI-O ───────────────────┐
│                                            │
│  1. Pull image from registry               │
│  2. Create pod sandbox (pause container)   │
│  3. Set up networking (CNI plugin call)    │
│  4. Create container in sandbox            │
│  5. Invoke OCI runtime (runc/crun)         │
│  6. Monitor via conmon                     │
│                                            │
│  Scope: ONLY what K8s needs               │
│  No build, no docker CLI, no swarm        │
│                                            │
└──────────────────┬─────────────────────────┘
                   │ OCI spec
                   ▼
┌───────────────── runc / crun ─────────────┐
│  clone() → unshare() → pivot_root()       │
│  → execve() the container entrypoint      │
└────────────────────────────────────────────┘

CRI-O vs containerd — Comparison

Feature	CRI-O	containerd
Purpose	K8s-only CRI runtime	General-purpose container runtime
Versioning	Matches K8s versions (1.30.x)	Independent releases
Scope	Minimal — CRI + OCI	Broader — CRI + Docker + others
Used by Docker?	No	Yes (Docker's core runtime)
Default in	OpenShift, SUSE Rancher	GKE, EKS, AKS, kubeadm default
Image pull	containers/image library	Own image pull implementation
Networking	CNI plugins	CNI plugins
Storage	containers/storage	Own snapshotter framework
Build images?	No	No (but nerdctl can)
Configuration	Drop-in config files	TOML config
OCI runtimes	runc, crun, Kata, gVisor	runc, gVisor, Kata

Rule of thumb: CRI-O = lean & K8s-only. containerd = versatile & widely adopted.

The Full Pod Start Sequence

kubectl apply -f pod.yaml
    │
    ▼
API Server → stores in etcd
    │
    ▼
Scheduler → assigns to Node-2
    │ (binding written to etcd)
    ▼
kubelet on Node-2 (watches API server)
    │
    ▼ CRI: RunPodSandbox()
Container Runtime (CRI-O / containerd)
    │
    ├── 1. Create pod sandbox (pause container with new namespaces)
    ├── 2. Call CNI plugin → allocate IP, set up veth pair
    ├── 3. CRI: CreateContainer() → prepare rootfs (overlay mount)
    ├── 4. CRI: StartContainer() → invoke OCI runtime
    │       └── runc/crun: clone(NEWNS|NEWPID|NEWNET|...)
    │                      → pivot_root() → execve("nginx")
    ├── 5. conmon (CRI-O) or containerd-shim (containerd) monitors stdio + exit
    └── 6. kubelet reports pod status → API server → etcd

Total time: typically 1-3 seconds (for a cached image)
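The ordering of the CRI calls in this sequence can be made explicit with an in-memory stand-in for the runtime. Everything here is hypothetical scaffolding for teaching (`FakeRuntime` runs nothing and does not speak gRPC); only the three call names mirror the CRI RPCs.

```python
class FakeRuntime:
    """In-memory stand-in for a CRI runtime; records calls, runs nothing."""

    def __init__(self):
        self.sandboxes = {}
        self.containers = {}
        self.calls = []

    def run_pod_sandbox(self, pod_name):
        sandbox_id = f"sandbox-{len(self.sandboxes)}"
        self.sandboxes[sandbox_id] = pod_name
        self.calls.append("RunPodSandbox")
        return sandbox_id

    def create_container(self, sandbox_id, image):
        if sandbox_id not in self.sandboxes:
            raise ValueError("sandbox must exist before CreateContainer")
        container_id = f"ctr-{len(self.containers)}"
        self.containers[container_id] = {"image": image, "state": "created"}
        self.calls.append("CreateContainer")
        return container_id

    def start_container(self, container_id):
        if self.containers[container_id]["state"] != "created":
            raise ValueError("container must be created before StartContainer")
        self.containers[container_id]["state"] = "running"
        self.calls.append("StartContainer")

def start_pod(runtime, pod_name, images):
    """Play the kubelet's role: sandbox first, then create and start each container."""
    sandbox_id = runtime.run_pod_sandbox(pod_name)
    container_ids = []
    for image in images:
        container_id = runtime.create_container(sandbox_id, image)
        runtime.start_container(container_id)
        container_ids.append(container_id)
    return container_ids
```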

Part 7

Bare Metal to K8s — The Full Journey

The Complete Evolution

LEVEL 0: BARE METAL
┌──────────────────────────┐
│  App A     App B    App C│  1 server = 1 (or few) apps
│  ════════════════════════│  Low utilization, slow provisioning
│  Host OS (Linux)         │
│  Physical Hardware       │
└──────────────────────────┘
         │ need isolation + better utilization
         ▼
LEVEL 1: VIRTUALIZATION
┌──────────────────────────┐
│ ┌──VM──┐ ┌──VM──┐       │  Hardware-level isolation
│ │App A │ │App B │  ...   │  Full OS per workload
│ │OS    │ │OS    │        │  Minutes to provision
│ └──────┘ └──────┘        │
│ Hypervisor (KVM/ESXi)    │
│ Physical Hardware        │
└──────────────────────────┘
         │ need faster scaling + less overhead
         ▼
LEVEL 2: CONTAINERIZATION
┌──────────────────────────┐
│ ┌────┐ ┌────┐ ┌────┐    │  Process-level isolation
│ │ A  │ │ B  │ │ C  │... │  Shared kernel
│ └────┘ └────┘ └────┘    │  Seconds to provision
│ Container Runtime        │
│ Host OS (Linux Kernel)   │
│ (VM or Bare Metal)       │
└──────────────────────────┘
         │ need orchestration at scale
         ▼
LEVEL 3: KUBERNETES
┌──────────────────────────┐
│ K8s Control Plane        │  Automated scheduling, scaling,
│ ┌─Node─┐ ┌─Node─┐       │  self-healing, networking,
│ │Pod Pod│ │Pod Pod│ ...  │  service discovery, config mgmt
│ │CRI   │ │CRI   │       │  Declarative desired-state model
│ └──────┘ └──────┘        │
│ Infrastructure (VMs/BM)  │
└──────────────────────────┘

Building a K8s Cluster from Bare Metal Linux

Step-by-step:

1. PREPARE THE NODE (bare metal Linux)
   ├── Disable swap:           swapoff -a
   ├── Load kernel modules:    overlay, br_netfilter
   ├── Set sysctl:             net.bridge.bridge-nf-call-iptables = 1
   │                           net.ipv4.ip_forward = 1
   └── Install container runtime (choose one):
       ├── containerd (+ CNI plugins)
       └── CRI-O

2. INSTALL KUBERNETES COMPONENTS
   ├── kubeadm  (cluster bootstrapper)
   ├── kubelet  (node agent)
   └── kubectl  (CLI client)

3. INITIALIZE CONTROL PLANE (first node)
   └── kubeadm init --pod-network-cidr=10.244.0.0/16

4. INSTALL CNI PLUGIN (pod networking)
   └── kubectl apply -f calico.yaml  (or Cilium, Flannel, ...)

5. JOIN WORKER NODES
   └── kubeadm join <control-plane>:6443 --token <token> ...

6. VERIFY
   └── kubectl get nodes  →  Ready, Ready, Ready
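The node preparation in step 1 can be sanity-checked before running `kubeadm init`. The sketch below only parses text in the formats of `/proc/swaps` and the sysctl files under `/proc/sys`; the function names and the idea of passing file contents in as strings are choices made for this example, and `kubeadm` runs its own preflight checks regardless.

```python
def swap_devices(proc_swaps_text):
    """Return the active swap devices listed in /proc/swaps-style text."""
    lines = proc_swaps_text.strip().splitlines()
    # The first line is the column header; every further line is one device.
    return [line.split()[0] for line in lines[1:] if line.strip()]

def sysctl_enabled(value_text):
    """True when a /proc/sys file (e.g. net/ipv4/ip_forward) contains 1."""
    return value_text.strip() == "1"

def preflight(proc_swaps_text, ip_forward_text, bridge_nf_text):
    """Collect human-readable problems; an empty list means the checks passed."""
    problems = []
    devices = swap_devices(proc_swaps_text)
    if devices:
        problems.append("swap is enabled on: " + ", ".join(devices))
    if not sysctl_enabled(ip_forward_text):
        problems.append("net.ipv4.ip_forward is not 1")
    if not sysctl_enabled(bridge_nf_text):
        problems.append("net.bridge.bridge-nf-call-iptables is not 1")
    return problems
```

On a node, the three arguments would come from `/proc/swaps`, `/proc/sys/net/ipv4/ip_forward`, and `/proc/sys/net/bridge/bridge-nf-call-iptables`.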

The Network Stack — What Actually Happens

Internet
    │
    ▼
┌─────────────── Physical NIC (eth0) ─────────────┐
│  IP: 10.0.0.100                                  │
│                                                  │
│  kube-proxy (iptables/IPVS rules)                │
│  ├── NodePort 30080 → Service ClusterIP → Pod IP │
│  └── LoadBalancer → External IP → Pods           │
│                                                  │
│  ┌──── CNI (Calico/Cilium) ─────────────────┐   │
│  │  Pod Network: 10.244.0.0/16              │   │
│  │                                           │   │
│  │  ┌── Pod A ──┐     ┌── Pod B ──┐         │   │
│  │  │ eth0      │     │ eth0      │         │   │
│  │  │10.244.1.5 │←───→│10.244.1.6 │         │   │
│  │  └───────────┘     └───────────┘         │   │
│  │     veth             veth                 │   │
│  │       └──── bridge/tunnel/eBPF ────┘      │   │
│  └───────────────────────────────────────────┘   │
│                                                  │
│  Cross-node: VXLAN / IPIP / BGP / WireGuard      │
└──────────────────────────────────────────────────┘
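The addresses in the diagram follow from carving the pod CIDR into per-node subnets. A small sketch with Python's `ipaddress` module, assuming one /24 per node (a common default, but configurable) and using the example addresses from the diagram:

```python
import ipaddress

POD_CIDR = ipaddress.ip_network("10.244.0.0/16")

def node_subnets(prefix=24):
    """Split the pod CIDR into per-node subnets of the given prefix length."""
    return list(POD_CIDR.subnets(new_prefix=prefix))

def same_node_subnet(ip_a, ip_b, prefix=24):
    """True when both pod IPs fall into the same per-node subnet."""
    net_a = ipaddress.ip_network(f"{ip_a}/{prefix}", strict=False)
    return ipaddress.ip_address(ip_b) in net_a

if __name__ == "__main__":
    print(len(node_subnets()))                            # number of /24 node ranges
    print(same_node_subnet("10.244.1.5", "10.244.1.6"))   # same node: local bridge
    print(same_node_subnet("10.244.1.5", "10.244.2.7"))   # cross-node: overlay or routing
```

Traffic between pods in different per-node subnets is what the cross-node mechanisms (VXLAN, IPIP, BGP, WireGuard) have to carry.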

Analogy: Medical Triage → Kubernetes Scheduling

Medical Triage determines the priority of patient admission based on urgency.

Kubernetes Scheduling determines the priority and placement of pods based on resource needs and constraints.

Medical Triage	Kubernetes Scheduling
Immediate (Red) — life-threatening	PriorityClass: system-node-critical / system-cluster-critical — must run first
Urgent (Orange) — serious but stable	Guaranteed QoS — resources reserved
Standard (Yellow) — can wait	Burstable QoS — requests < limits
Non-urgent (Green) — minor	BestEffort QoS — no guarantees
Deceased — no treatment	Evicted/Preempted — killed for higher priority
Available beds determine placement	Available node resources determine scheduling
Specialist wards (cardio, neuro)	Node affinity / taints (GPU node, high-mem)

Both systems: assess → classify → prioritize → assign resources under constraints.
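The three QoS classes in the table are derived from each container's requests and limits. The sketch below follows the documented rules in simplified form: it compares quantities as plain strings (so "500m" and "0.5" would not count as equal) and takes containers as ordinary dictionaries rather than real pod specs.

```python
def qos_class(containers):
    """Classify a pod from its containers' {"requests": {...}, "limits": {...}}."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A missing request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"      # nothing specified anywhere
    return "Guaranteed" if guaranteed else "Burstable"
```

Under node pressure, BestEffort pods are the first eviction candidates and Guaranteed pods the last.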

Summary — The Full Stack Map

Layer 7: APPLICATION
  │  Your microservices, APIs, frontends, ML models
  │
Layer 6: ORCHESTRATION
  │  Kubernetes (scheduling, scaling, networking, self-healing)
  │  Helm charts, Operators, GitOps (ArgoCD/Flux)
  │
Layer 5: CONTAINER RUNTIME INTERFACE (CRI)
  │  kubelet ← gRPC → containerd / CRI-O
  │
Layer 4: CONTAINER RUNTIME
  │  containerd, CRI-O, Podman (standalone)
  │
Layer 3: OCI RUNTIME
  │  runc, crun, Kata Containers, gVisor
  │
Layer 2: LINUX KERNEL PRIMITIVES
  │  Namespaces, cgroups, seccomp, capabilities, OverlayFS
  │
Layer 1: OPERATING SYSTEM
  │  Linux (Ubuntu, RHEL, Flatcar, Talos)
  │
Layer 0: INFRASTRUCTURE
     Bare Metal  or  Virtual Machines (Type 1 hypervisor)

🧠 Session 2 — Key Takeaways

Containers are NOT a kernel feature — they're built from namespaces + cgroups + seccomp + capabilities
Network namespaces give each container its own full network stack
OCI standards ensure images and runtimes are interchangeable
Multi-stage Dockerfiles with non-root users are the gold standard
Docker ≠ the only option — Podman (daemonless), Buildah, CRI-O are production-proven
CRI decoupled Kubernetes from Docker — any CRI-compliant runtime works
CRI-O is purpose-built for K8s; containerd is more general-purpose
The real-world stack: App → K8s → CRI → containerd/CRI-O → runc → Linux kernel → VM → Hypervisor → Hardware

📖 Recommended Reading & Resources

Specifications

Books

Brendan Burns, "Kubernetes: Up and Running" (3rd ed., O'Reilly)
Liz Rice, "Container Security" (O'Reilly) — Linux primitives deep-dive
Michael Hausenblas, "Learning Modern Linux" (O'Reilly)

Hands-on

Kubernetes the Hard Way — Kelsey Hightower
Play with Kubernetes — browser-based lab
unshare + nsenter — build a container from scratch in 20 commands

❓ Questions & Discussion

Discussion prompts:

Why did Kubernetes remove Docker support (dockershim)?
When would you choose CRI-O over containerd?
What stops a container from escaping to the host?
How does the triage analogy map to your team's deployment priorities?

Thank You!

🖥️ Session 1: Bare Metal → Virtualization

🐳 Session 2: Containers → Kubernetes

"You can't build cloud-native without understanding what's beneath the clouds."

Next steps for your team:

Run unshare --pid --net --mount --fork /bin/bash on a test box
Build a multi-stage Dockerfile for one of your services
Try kubeadm init on a spare node
Explore crictl to interact with CRI directly

masterclass-speaker-notes.md

Master Class — Speaker Notes & Teaching Guide

From Bare Metal to Kubernetes (2 × 90 min)

SESSION 1: From Bare Metal to Virtualization

Slide Timing Guide

Slide(s)	Topic	Mins	Teaching Notes
1–2	Title + Agenda	2	Set expectations: "By end of today, you'll understand everything between hardware and VMs"
3–6	Bare Metal	15	Ask the room: "Who has installed an OS on bare metal?" Start with what they know. Highlight the utilization problem — this motivates everything that follows.
7–9	What is Virtualization	10	History slide is a great storytelling moment. The Popek & Goldberg theorem is the "aha": x86 wasn't classically virtualizable (pure trap-and-emulate) until VT-x arrived in 2005!
10–13	Hypervisor Core	20	Key teaching point: Explain Type 1 vs Type 2 with the building analogy: Type 1 = the building foundation itself, Type 2 = a room inside someone else's building. KVM slide is critical — "Linux IS the hypervisor" blows minds.
14–17	Reference Model	15	Use the traffic controller analogy for Dispatcher. Walk through the CR3 example step by step — this makes it concrete. Poll: "What happens when a VM tries to reboot?"
18–21	Virt vs Container	10	This is the bridge to Session 2. Key message: "In production, you use BOTH." The full stack diagram at the end is the money shot.
22–23	Takeaways + Q&A	10	Recap the 7 key points. Open discussion.

Key Stories & Analogies to Use

The Hotel Analogy (Virtualization)

"Think of a hypervisor as a hotel building. Each VM is a complete hotel room with its own bathroom, kitchen, and bedroom. Guests are fully isolated — what happens in room 301 doesn't affect room 302. But each room takes significant space and resources."

The Apartment Analogy (Containerization — preview)

"Containers are like apartments in a shared building. They share plumbing (kernel), electrical (CPU scheduler), and foundation (hardware). Each has its own locked door (namespaces) and utility meter (cgroups). Much more efficient, but the building superintendent (kernel) is a shared dependency."

The Traffic Controller (Dispatcher)

"The dispatcher is a traffic controller at an intersection. It doesn't drive any car, it doesn't fuel any car — it just decides which lane each car goes to. Privileged instruction? Go to the Interpreter lane. Resource change? Go to the Allocator lane."

The x86 Problem — Tell It As a Mystery

"In 1974, Popek and Goldberg proved that you CAN virtualize any architecture... IF sensitive instructions always trap. For 25 years, x86 couldn't do this — 17 instructions were sensitive but didn't trap. VMware's genius was binary translation: rewrite the guest code on-the-fly to replace those sneaky instructions with safe trapping versions. Then Intel said 'fine, we'll fix it in hardware' — VT-x in 2005."

Common Questions & Answers

Q: Is Docker a hypervisor? A: No. Docker is a container runtime that uses Linux kernel primitives (namespaces, cgroups). It doesn't run full operating systems — just isolated processes on a shared kernel.

Q: Is KVM Type 1 or Type 2? A: This is debated! Technically it's Type 1 — the KVM module makes the Linux kernel itself into a hypervisor. But since you still have a full Linux userspace, some argue it's a "hybrid." The practical answer: it delivers Type 1 performance with Type 2 convenience.

Q: Why can't I just use containers for everything? A: Containers share the host kernel, so a kernel vulnerability affects ALL containers. VMs provide hardware-level isolation. In regulated environments (banking, defense, healthcare), VM isolation is often a compliance requirement. Also, a container can only run software built for the host kernel: Windows containers cannot run on a Linux host, and Linux containers on Windows rely on a VM (WSL2).

Q: What about WSL2? A: WSL2 is actually a lightweight Hyper-V VM running a real Linux kernel. It's Type 1 virtualization (Hyper-V is bare metal, Windows runs in the root partition) with a great developer experience layer on top.

SESSION 2: From Containers to Kubernetes

Slide Timing Guide

Slide(s)	Topic	Mins	Teaching Notes
1–2	Title + Agenda	2	Quick recap of Session 1's key points before diving in
3–8	Linux Kernel Primitives	15	DEMO OPPORTUNITY: Run `unshare` live. Show `lsns`. Create a network namespace and show isolated `ip addr`. This is the most educational part — demystify the "magic" of containers.
9–11	Containerization	10	The layer diagram and OCI spec slide are key. Emphasize: "A container image is just a tarball of filesystem layers + metadata JSON."
12–14	Dockerfiles & Buildx	15	LIVE CODING: Write a Dockerfile together. Show the DO vs DON'T side by side. Multi-stage builds are the #1 practical takeaway.
15–19	Container Runtimes	15	The landscape diagram is the anchor slide. Key message: "Docker is 3 layers: dockerd → containerd → runc. Kubernetes skips dockerd."
20–23	Kubernetes Architecture	15	Draw on whiteboard: Start with "a user types kubectl apply" and trace the full path. The pod start sequence is the master slide.
24–28	CRI Deep Dive	10	The protobuf definitions make it concrete — CRI is just a gRPC API. CRI-O vs containerd comparison is a common team decision point.
29–31	Full Journey + Triage	5	The triage analogy lands well with mixed audiences. The full stack map is the synthesis of both sessions.
32–33	Takeaways + Q&A	5	End with the hands-on next steps.

Live Demo Script (Session 2)

Demo 1: Build a Container From Scratch (5 min)

# Show current namespaces
lsns

# Create a new PID + mount + UTS namespace
# (--mount-proc remounts /proc so that ps only sees the new PID namespace)
sudo unshare --pid --mount --uts --fork --mount-proc /bin/bash

# Inside the new namespace:
hostname container-demo
hostname  # shows "container-demo"

ps aux    # only shows processes in this namespace!
# PID 1 is our bash shell

# Exit and show host hostname is unchanged
exit
hostname  # still the original hostname
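If someone asks how to verify that the shell really moved into new namespaces, one option is to compare namespace IDs on the host and inside the `unshare` shell. The sketch below is an illustration, not part of the original demo. `ns_ids` is a hypothetical helper name, and it assumes a Linux host with `/proc` mounted.

```shell
# Print the namespace IDs of a process (default: the calling process).
# Every Linux process, containerized or not, has one entry per namespace type.
ns_ids() {
    local pid="${1:-self}"
    local ns
    for ns in /proc/"$pid"/ns/*; do
        # Each entry is a symlink whose target looks like "uts:[4026531838]"
        printf '%s -> %s\n' "${ns##*/}" "$(readlink "$ns")"
    done
}

ns_ids
```

Run `ns_ids` once on the host and once inside the `unshare` shell. The `uts` and `mnt` IDs should differ, while `net` should be identical because the demo did not create a new network namespace.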

Demo 2: Network Namespace (5 min)

# Create a network namespace
sudo ip netns add demo-ns

# Show it's almost empty: only a loopback interface, and it's DOWN
sudo ip netns exec demo-ns ip addr

# Bring up loopback
sudo ip netns exec demo-ns ip link set lo up

# Create a veth pair (virtual ethernet cable)
sudo ip link add veth-host type veth peer name veth-ns

# Move one end into the namespace
sudo ip link set veth-ns netns demo-ns

# Assign IPs
sudo ip addr add 10.0.0.1/24 dev veth-host
sudo ip link set veth-host up
sudo ip netns exec demo-ns ip addr add 10.0.0.2/24 dev veth-ns
sudo ip netns exec demo-ns ip link set veth-ns up

# Ping across the namespace boundary!
ping -c 2 10.0.0.2

# Cleanup (deleting the namespace destroys veth-ns, which also removes its peer veth-host)
sudo ip netns del demo-ns

Demo 3: cgroup resource limit (3 min)

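# Create a cgroup with a 50 MiB memory limit (cgroups v2)
sudo mkdir /sys/fs/cgroup/demo
echo "52428800" | sudo tee /sys/fs/cgroup/demo/memory.max
# If the host has swap, cap it at 0 so the process is OOM killed instead of swapped out
echo "0" | sudo tee /sys/fs/cgroup/demo/memory.swap.max

# Move the current shell into that cgroup (child processes inherit it)
echo $$ | sudo tee /sys/fs/cgroup/demo/cgroup.procs

# Try to allocate more than 50 MiB → OOM killed!
python3 -c "x = ' ' * 60_000_000"
# Killed!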
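The magic number 52428800 in the demo is 50 × 1024 × 1024 bytes. The sketch below shows that arithmetic and one way to confirm afterwards that the kernel recorded an OOM kill for the cgroup. Both helper names (`mib_to_bytes`, `oom_kills`) are hypothetical, and it assumes cgroups v2, where each cgroup exposes a `memory.events` file with an `oom_kill` counter.

```shell
# Convert MiB to bytes for cgroup v2 limit files such as memory.max
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# Print the OOM kill count recorded for a cgroup directory (0 if unreadable)
oom_kills() {
    local events="$1/memory.events"
    [ -r "$events" ] || { echo 0; return; }
    awk '$1 == "oom_kill" { print $2 }' "$events"
}

mib_to_bytes 50                 # 52428800
oom_kills /sys/fs/cgroup/demo   # 0 if the demo cgroup does not exist
```

After running the demo, a non-zero value from `oom_kills /sys/fs/cgroup/demo` gives the audience evidence that the limit, not Python, killed the process.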

Demo 4: crictl basics (2 min, needs a K8s node)

# List pods via CRI
sudo crictl pods

# List containers
sudo crictl ps

# Inspect a container
sudo crictl inspect <container-id>

# Pull an image via CRI
sudo crictl pull nginx:1.27

# Check runtime info
sudo crictl info

Common Questions & Answers (Session 2)

Q: Why did Kubernetes remove Docker support? A: Kubernetes never needed Docker itself, only the container runtime inside it (containerd). The dockershim was a translation layer in kubelet that converted CRI calls to Docker API calls, which then called containerd anyway. Removing it (in Kubernetes v1.24): (a) eliminated a maintenance burden, (b) removed an unnecessary indirection layer, (c) let kubelet talk directly to containerd via CRI. Your container images still work because they follow the OCI standard.

Q: Should we use CRI-O or containerd? A: Both are production-grade.

containerd if: you want the most widely adopted option with the largest community (default for GKE, EKS, and AKS, and the usual choice with kubeadm).
CRI-O if: you want a minimal, Kubernetes-only runtime whose releases track Kubernetes minor versions (default for OpenShift).

Neither is "better"; the choice mostly comes down to organizational preference.

Q: Are containers less secure than VMs? A: Yes, by default. Containers share the host kernel → a kernel exploit affects all containers. VMs have hardware isolation (VT-x, separate kernel per VM). However, container security can be hardened significantly with: seccomp profiles, AppArmor/SELinux, rootless containers, read-only rootfs, network policies, and tools like Falco. For maximum isolation, use Kata Containers (container UX, VM isolation).
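To illustrate one of the hardening layers listed above, the sketch below reports whether a process runs under a seccomp filter by reading the `Seccomp:` field of `/proc/<pid>/status`. `seccomp_mode` is a hypothetical helper name, and it assumes a Linux kernel that exposes that field.

```shell
# Report the seccomp mode of a process (default: the current shell).
# Kernel values: 0 = disabled, 1 = strict, 2 = filter (profile-based)
seccomp_mode() {
    local pid="${1:-$$}"
    local mode
    mode="$(awk '$1 == "Seccomp:" { print $2 }' "/proc/$pid/status" 2>/dev/null)"
    case "$mode" in
        0) echo "disabled" ;;
        1) echo "strict" ;;
        2) echo "filter" ;;
        *) echo "unknown" ;;   # field missing or process not readable
    esac
}

seccomp_mode
```

A login shell on the host typically prints `disabled`, while a shell inside a Docker container using the default seccomp profile would normally print `filter`.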

Q: What is a "pause" container? A: When CRI-O or containerd creates a pod, it first starts a tiny "pause" container that literally calls the pause() syscall and sleeps forever. This container keeps the pod's network namespace alive. Application containers added to the pod join this existing namespace. If an app container crashes and restarts, the network namespace (and the pod's IP address) survive because the pause container is still running.

Q: Can I run Kubernetes on bare metal? A: Absolutely — and many high-performance workloads do (no hypervisor overhead). Tools for bare metal K8s: kubeadm, k3s, Talos Linux, Flatcar Container Linux, Tinkerbell for provisioning, MetalLB for load balancing, Rook/Ceph for storage.

Whiteboard Diagrams to Draw Live

1. "What Happens When You Type `kubectl run nginx --image=nginx`"

Draw step by step:

kubectl → API Server (REST call)
API Server → etcd (store pod spec)
Scheduler watches → picks node
Scheduler → API Server (binding)
kubelet on chosen node watches → sees new pod
kubelet → CRI-O (RunPodSandbox)
CRI-O → runc (create pause container + namespaces)
CRI-O → CNI (setup networking, assign IP)
kubelet → CRI-O (CreateContainer, StartContainer)
CRI-O → runc (new mount/PID namespaces, join the pause container's network namespace, execve nginx)
kubelet → API Server (pod status: Running)
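After drawing the sequence, it can help to show the same path as recorded by the cluster. The sketch below is one possible way to do that, assuming `kubectl` is installed and pointed at a disposable test cluster. `trace_pod_start` is a hypothetical helper name and the pod name `nginx` is only an example.

```shell
# Start a pod, wait for it, then list its events in creation order.
trace_pod_start() {
    local pod="${1:-nginx}"
    if ! command -v kubectl >/dev/null 2>&1; then
        echo "kubectl not found; skipping" >&2
        return 0
    fi
    kubectl run "$pod" --image=nginx
    kubectl wait --for=condition=Ready "pod/$pod" --timeout=60s
    # Event reasons usually read: Scheduled, Pulling, Pulled, Created, Started
    kubectl get events \
        --field-selector "involvedObject.name=$pod" \
        --sort-by=.metadata.creationTimestamp
}

trace_pod_start nginx
```

The event list covers the scheduler binding and the kubelet's image pull, container creation, and start. The CRI and CNI calls in between happen on the node and do not appear as events.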

2. "The Isolation Stack"

Draw as concentric security rings:

Outer: Hardware (separate machines)
Next: VMs (hypervisor isolation, separate kernels)
Next: Containers (namespace + cgroup isolation, shared kernel)
Next: Processes (standard OS isolation)
Inner: Threads (shared address space)

Each ring = different cost/performance/isolation tradeoff.

General Teaching Tips

Start each concept with WHY, then HOW. "Before we explain cgroups, let's understand why you need them — imagine 50 containers and one starts eating all the memory..."
Use the "zoom in" technique. Show the full stack diagram → "Today we're zooming into THIS layer."
Every 15 minutes, interact. Ask a question, run a demo, or do a quick poll. 90 minutes of pure slides = sleeping audience.
The triage analogy works. Most people know the basics of medical triage. Map it: "immediate = critical pods, urgent = Guaranteed QoS, standard = Burstable, non-urgent = BestEffort, deceased = evicted."
End with hands-on homework. Give specific commands to try. People remember what they do, not what they hear.
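For the triage tip above, attendees sometimes ask how a pod ends up in a given QoS class. The sketch below is a simplified illustration for a single-container pod, not the kubelet's actual implementation. `qos_class` is a hypothetical helper, and it compares values as plain strings, so `500m` and `0.5` would count as different.

```shell
# Classify a single-container pod into a Kubernetes QoS class.
# Args: cpu_request cpu_limit mem_request mem_limit (empty string = unset)
qos_class() {
    local cpu_req="$1" cpu_lim="$2" mem_req="$3" mem_lim="$4"
    # No requests or limits at all: BestEffort
    if [ -z "$cpu_req$cpu_lim$mem_req$mem_lim" ]; then
        echo "BestEffort"
        return
    fi
    # An unset request defaults to the limit when a limit is present
    : "${cpu_req:=$cpu_lim}"
    : "${mem_req:=$mem_lim}"
    # Both limits set and equal to their requests: Guaranteed
    if [ -n "$cpu_lim" ] && [ -n "$mem_lim" ] \
        && [ "$cpu_req" = "$cpu_lim" ] && [ "$mem_req" = "$mem_lim" ]; then
        echo "Guaranteed"
    else
        echo "Burstable"
    fi
}

qos_class 500m 500m 256Mi 256Mi   # Guaranteed
qos_class 250m 500m 256Mi 256Mi   # Burstable
qos_class "" "" "" ""             # BestEffort
```

On a real cluster the assigned class can be read back with `kubectl get pod <name> -o jsonpath='{.status.qosClass}'`.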