kubelet ←─ gRPC ─→ containerd or CRI-O ──→ runc/crun
Two services: RuntimeService (pod/container lifecycle) + ImageService (pull/list/remove)
Kubernetes Components
CONTROL PLANE:                   WORKER NODE:
├── API Server (REST)            ├── kubelet (CRI client)
├── etcd (state store)           ├── container runtime
├── Scheduler (placement)        └── kube-proxy (networking)
└── Controller Manager (loops)
Pod = Scheduling Unit
Shares: network namespace, IPC, volumes
Each pod gets its own IP
"pause" container holds the namespace
Docker vs Podman
| | Docker | Podman |
|---|---|---|
| Daemon | Yes (dockerd) | No |
| Root | Root by default | Rootless by default |
| Build | docker build / buildx | podman build / buildah |
| K8s | via containerd | via CRI-O |
Essential Commands
# Namespaces
unshare --pid --net --mount --fork /bin/bash
lsns
nsenter -t <PID> -n   # enter network ns
# cgroups
cat /sys/fs/cgroup/memory.max
cat /sys/fs/cgroup/cpu.max
# Docker / Podman
docker build -t myapp:1.0 .
docker run --rm -p 8080:80 myapp:1.0
podman pod create --name mypod
# Kubernetes
kubeadm init --pod-network-cidr=10.244.0.0/16
kubeadm join <cp>:6443 --token <tok>
kubectl get nodes
kubectl get pods -A
# CRI (on K8s node)
crictl pods
crictl ps
crictl info
crictl inspect <id>
Dockerfile Best Practices
# Multi-stage, pinned version, non-root
FROM golang:1.22-alpine AS builder
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /app ./cmd/server
FROM gcr.io/distroless/static-debian12:nonroot
COPY --from=builder /app /app
USER nonroot:nonroot
ENTRYPOINT ["/app"]
Full hardware access — DMA, IOMMU, SR-IOV, GPU passthrough native
Deterministic latency — critical for HPC, real-time, HFT
Simpler debugging — no hypervisor layer to reason about
❌ Disadvantages
Low utilization — typical server uses only 5–15% of capacity
Slow provisioning — hours to days (PXE boot, Kickstart, Preseed)
No isolation — a rogue process can crash the entire machine
Scaling = buying hardware — no elasticity
Bare Metal Provisioning Methods
| Method | Description | Use Case |
|---|---|---|
| PXE/iPXE + Kickstart | Network boot → automated install | Data center fleet |
| Cloud-Init | First-boot config injection | Cloud bare metal (e.g., AWS i3.metal) |
| Ironic (OpenStack) | Bare metal as a service | Private cloud |
| MAAS (Canonical) | Metal as a Service | Ubuntu-centric DC |
| Tinkerbell (Equinix) | Declarative bare metal workflow | Edge / hybrid |
| Manual ISO Install | USB/DVD boot + manual steps | Lab / dev |
Key insight: Bare metal provisioning is fundamentally slower than VM or container creation. This drove the industry toward virtualization.
Linux on Bare Metal — Key Subsystems
User Space
├── systemd (PID 1, service management)
├── Applications & daemons
├── Shared libraries (glibc, libssl, ...)
│
Kernel Space
├── Process Scheduler (CFS / EEVDF)
├── Memory Management (page tables, NUMA, hugepages)
├── Virtual File System (VFS)
├── Network Stack (netfilter, tc, XDP, eBPF)
├── Device Drivers (NIC, storage, GPU)
├── Security Modules (SELinux, AppArmor, seccomp)
│
Hardware
├── CPU (rings 0-3, VMX extensions)
├── RAM (DDR4/5, NUMA nodes)
├── NIC (queues, RSS, offloads)
├── Storage (NVMe, SATA, HBA)
└── IOMMU, SR-IOV, PCIe topology
The Problem Bare Metal Couldn't Solve
Scenario: A company has 50 physical servers
Server 1: Web App → 8% CPU utilization
Server 2: Database → 12% CPU utilization
Server 3: CI Runner → 3% avg, 90% peak (bursts)
Server 4: Mail Server → 5% CPU utilization
...
Server 50: Monitoring → 2% CPU utilization
Average utilization: ~8% → 92% of purchased compute is wasted
💡 The question that launched an industry:
"Can we run multiple isolated workloads on one physical machine?"
Answer: Yes — Virtualization.
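The consolidation argument can be checked with a few lines of arithmetic. This is a rough sketch, not a sizing method: the 70% target utilization is an assumption chosen for illustration, and the helper name `hosts_needed` is hypothetical. It also looks only at average CPU and ignores bursts (such as the CI runner's 90% peaks), memory, and I/O.

```python
import math

def hosts_needed(server_count: int, avg_util_pct: float, target_util_pct: float) -> int:
    """Estimate hosts required if the same total load ran at a higher target utilization."""
    total_load = server_count * avg_util_pct  # in "percent of one server" units
    return math.ceil(total_load / target_util_pct)

# Figures from the scenario: 50 servers at ~8% average CPU utilization.
# The 70% target is an assumption for illustration only.
print(hosts_needed(50, 8.0, 70.0))  # 6
```

Under those assumptions the 50 servers' average load would fit on about 6 hosts, which is the gap virtualization set out to close.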
Part 2
What is Virtualization?
Virtualization — Definition
Virtualization is the creation of a virtual (rather than physical) version of something — servers, storage, networks, or operating systems — using a software abstraction layer.
┌──────────┐  ┌──────────┐  ┌──────────┐
│   VM 1   │  │   VM 2   │  │   VM 3   │
│ (Ubuntu) │  │(Windows) │  │(FreeBSD) │
│  App A   │  │  App B   │  │  App C   │
├──────────┤  ├──────────┤  ├──────────┤
│ Guest OS │  │ Guest OS │  │ Guest OS │
└────┬─────┘  └────┬─────┘  └────┬─────┘
     │             │             │
┌────┴─────────────┴─────────────┴────┐
│          HYPERVISOR (VMM)           │
├─────────────────────────────────────┤
│          Physical Hardware          │
└─────────────────────────────────────┘
A Brief History of Virtualization
| Year | Milestone |
|---|---|
| 1967 | IBM CP-40: first hypervisor, running on a modified System/360 Model 40 (its successor, CP-67, targeted the Model 67) |
| 1972 | IBM VM/370: commercial virtual machine OS |
| 1998 | VMware founded: x86 virtualization via binary translation |
| 1999 | VMware Workstation 1.0 released |
| 2003 | Xen hypervisor open-sourced (paravirtualization) |
| 2005–06 | Intel VT-x and AMD-V: hardware-assisted virtualization |
| 2007 | KVM merged into Linux kernel (2.6.20) |
| 2008 | Microsoft Hyper-V released |
| 2010s | Cloud era: EC2, GCE, Azure all built on hypervisors |
The x86 trap-and-emulate problem (Popek & Goldberg, 1974) wasn't solved until VMware's binary translation (1999) and Intel VT-x (2005).
Why Virtualization? — The Core Value
Before Virtualization (Physical Servers)
1 app = 1 server
5–15% average CPU utilization
Weeks to provision new servers
Hardware lock-in
After Virtualization
Many apps on 1 server → 60–80% utilization
Minutes to create new VMs → agility
Hardware abstraction → portability
Snapshots & live migration → disaster recovery
Isolation → security boundaries between tenants
Part 3
The Hypervisor — Core Concepts
Hypervisor — Definition
A hypervisor (also called Virtual Machine Monitor — VMM) is software, firmware, or hardware that creates and runs virtual machines by separating a computer's software from its hardware.
What it does:
Partitions physical resources (CPU, memory, I/O) among VMs
Isolates VMs from each other
Emulates or paravirtualizes hardware for guest OSes
Schedules VM execution on physical CPUs
Intercepts privileged instructions from guest kernels
The Contract:
Each VM believes it has exclusive access to dedicated hardware. The hypervisor maintains this illusion while sharing the real hardware.
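A toy model can make the "illusion" concrete. The sketch below assumes a hypothetical `ToyHypervisor` class and two made-up instruction names (`MOV_CR3`, `HLT`); it only illustrates the idea that a trapped privileged instruction is applied to per-VM virtual state rather than to the real hardware. It does not describe how any real hypervisor is implemented.

```python
from dataclasses import dataclass, field

@dataclass
class VirtualCPU:
    """Per-VM copy of privileged state; the guest never touches the real registers."""
    cr3: int = 0          # guest's view of the page-table base register
    halted: bool = False

@dataclass
class ToyHypervisor:
    vcpus: dict = field(default_factory=dict)
    trap_log: list = field(default_factory=list)

    def create_vm(self, name: str) -> None:
        self.vcpus[name] = VirtualCPU()

    def trap(self, vm: str, instruction: str, operand: int = 0) -> None:
        """Called when a guest executes a privileged instruction."""
        vcpu = self.vcpus[vm]
        self.trap_log.append((vm, instruction))
        if instruction == "MOV_CR3":
            vcpu.cr3 = operand      # update only this VM's virtual CR3
        elif instruction == "HLT":
            vcpu.halted = True      # halt the vCPU, not the physical CPU
        else:
            raise ValueError(f"unhandled privileged instruction: {instruction}")

hv = ToyHypervisor()
hv.create_vm("vm1")
hv.create_vm("vm2")
hv.trap("vm1", "MOV_CR3", 0x1000)
hv.trap("vm2", "HLT")
print(hv.vcpus["vm1"], hv.vcpus["vm2"])
```

Each VM's write lands in its own `VirtualCPU`, so `vm1` changing its page-table base has no effect on `vm2`, and `vm2` halting does not stop `vm1`.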
Types of Hypervisors — Overview
    TYPE 1 (Bare Metal)              TYPE 2 (Hosted)
   ┌────────┐  ┌────────┐         ┌────────┐  ┌────────┐
   │  VM 1  │  │  VM 2  │         │  VM 1  │  │  VM 2  │
   │Guest OS│  │Guest OS│         │Guest OS│  │Guest OS│
   └───┬────┘  └───┬────┘         └───┬────┘  └───┬────┘
       │           │                  │           │
┌──────┴───────────┴──────┐    ┌──────┴───────────┴──────┐
│    TYPE 1 HYPERVISOR    │    │    TYPE 2 HYPERVISOR    │
│   (runs ON hardware)    │    │    (runs ON host OS)    │
├─────────────────────────┤    ├─────────────────────────┤
│    Physical Hardware    │    │     Host OS (Linux,     │
└─────────────────────────┘    │     Windows, macOS)     │
                               ├─────────────────────────┤
                               │    Physical Hardware    │
                               └─────────────────────────┘
Type 1 — Bare Metal Hypervisor
Runs directly on hardware — it is the operating system (or functionally replaces it).
KVM turns Linux itself into a Type-1 hypervisor. Each VM is a regular Linux process managed by QEMU. The kernel's scheduler, memory manager, and driver stack are reused — no separate hypervisor OS.
Type 1 vs Type 2 — Comparison Matrix
| Aspect | Type 1 (Bare Metal) | Type 2 (Hosted) |
|---|---|---|
| Runs on | Hardware directly | Host operating system |
| Performance | Near-native (1–5% overhead) | Moderate (10–30% overhead) |
| Security | Strong isolation | Host compromise = game over |
| Use case | Production, cloud, DC | Dev, test, sandbox |
| Boot time | Seconds (microVMs) to minutes | Minutes (host + hypervisor + VM) |
| Management | vCenter, oVirt, Proxmox | GUI application |
| Live migration | ✅ Yes | ❌ No |
| Cost | $$$ (licenses + dedicated HW) | $ (free or cheap) |
| Examples | ESXi, KVM, Hyper-V, Xen | VirtualBox, VMware Workstation |
Part 4
Hypervisor Reference Model
(Popek & Goldberg, 1974)
The Three Pillars of a Hypervisor
Popek & Goldberg (1974) defined the formal requirements for virtualizable architectures and the three core modules that coordinate to emulate hardware:
Privileged instructions — cause a trap when executed in user mode
Sensitive instructions:
Control-sensitive — change system configuration (e.g., I/O, page tables)
Behavior-sensitive — behave differently depending on privilege level
The Theorem (1974):
A virtual machine monitor may be constructed for any conventional third-generation computer if the set of sensitive instructions is a subset of the set of privileged instructions.
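The theorem's condition is a plain subset test, which can be sketched in a few lines. The instruction sets below are illustrative and heavily abbreviated, not a complete ISA listing, and the function name `is_classically_virtualizable` is hypothetical.

```python
def is_classically_virtualizable(sensitive: set, privileged: set) -> bool:
    """Popek & Goldberg condition: every sensitive instruction must also be privileged (must trap)."""
    return sensitive <= privileged

# Illustrative, abbreviated instruction sets (assumption: not a full ISA listing).
privileged = {"LGDT", "LIDT", "HLT", "MOV_TO_CR3"}

ideal_sensitive = {"LGDT", "LIDT", "MOV_TO_CR3"}
# On pre-VT-x x86, these ran in user mode without trapping.
x86_sensitive = ideal_sensitive | {"SGDT", "SIDT", "POPF"}

print(is_classically_virtualizable(ideal_sensitive, privileged))  # True
print(is_classically_virtualizable(x86_sensitive, privileged))    # False
print(sorted(x86_sensitive - privileged))                         # ['POPF', 'SGDT', 'SIDT']
```

The set difference on the last line is the part a hypervisor cannot see by trapping alone, which is why extra techniques were needed on x86.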
The x86 Problem:
x86 had 17 sensitive but non-privileged instructions (e.g., SGDT, SIDT, POPF) — they didn't trap! Solutions:
# ─── Build stage ──────────────────────────────────
FROM golang:1.22-alpine AS builder
WORKDIR /src
# Cache dependencies separately from source code
COPY go.mod go.sum ./
RUN go mod download
# Copy source and build
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o /app/server ./cmd/server
# ─── Production stage ─────────────────────────────
FROM gcr.io/distroless/static-debian12:nonroot
COPY --from=builder /app/server /server
EXPOSE 8080
USER nonroot:nonroot
ENTRYPOINT ["/server"]
Key principles:
Multi-stage builds → small final images (no compiler in production)
Layer caching → put rarely-changing layers first (go.mod before source)
Non-root user → security best practice
Distroless base → minimal attack surface (no shell, no package manager)
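The layer-caching principle can be illustrated with a toy cache-key chain. This is a simplified model, assuming each layer's key depends on its parent's key, the instruction text, and a digest of the files it copies; real builders compute their keys differently. The names `layer_keys` and `build`, and the content labels, are hypothetical.

```python
import hashlib

def layer_keys(steps):
    """Return one cache key per build step; each key chains the previous key,
    the instruction text, and a stand-in digest of any files the step copies in."""
    keys, parent = [], ""
    for instruction, content in steps:
        key = hashlib.sha256(f"{parent}|{instruction}|{content}".encode()).hexdigest()
        keys.append(key)
        parent = key
    return keys

def build(src_version: str):
    # The dependency manifest is copied before the source, as in the Dockerfile above.
    return layer_keys([
        ("COPY go.mod go.sum ./", "deps-v1"),
        ("RUN go mod download", ""),
        ("COPY . .", src_version),
        ("RUN go build", ""),
    ])

before, after = build("src-v1"), build("src-v2")
reused = [a == b for a, b in zip(before, after)]
print(reused)  # [True, True, False, False]
```

A source-only change leaves the first two keys untouched, so the dependency download is reused and only the layers from `COPY . .` onward are rebuilt.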
Dockerfile Best Practices
DO ✅
# Pin versions for reproducibility
FROM python:3.12.3-slim-bookworm
# Combine RUN commands to reduce layers
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl && \
    rm -rf /var/lib/apt/lists/*
# Copy dependency file first for caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Use .dockerignore to exclude unnecessary files
# Run as non-root
USER 1000:1000
DON'T ❌
FROM ubuntu:latest # unpinned tag
RUN apt-get update # separate from install (cache bug)
RUN apt-get install python3 # missing -y, missing cleanup
COPY . . # copies everything including .git
USER root # running as root in production
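Some of the anti-patterns above are mechanical enough to check automatically. The following is a toy checker written for this section, not a real linter: the function `lint_dockerfile` and its messages are hypothetical, and it deliberately ignores edge cases such as registry ports, digests, `--platform` flags, and `#` characters inside commands.

```python
def lint_dockerfile(text: str) -> list:
    """Flag a few anti-patterns from the DON'T list. Simplified on purpose."""
    findings = []
    last_user = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop trailing comments
        if not line:
            continue
        instruction, _, rest = line.partition(" ")
        instruction, rest = instruction.upper(), rest.strip()
        if instruction == "FROM":
            image = rest.split()[0]
            if image.endswith(":latest") or ":" not in image:
                findings.append(f"unpinned base image: {image}")
        elif instruction == "RUN":
            if rest == "apt-get update":
                findings.append("apt-get update in its own layer")
            if "apt-get install" in rest and "-y" not in rest.split():
                findings.append("apt-get install without -y")
        elif instruction == "USER":
            last_user = rest
    if last_user in ("root", "0"):
        findings.append("final USER is root")
    return findings

bad = """FROM ubuntu:latest # unpinned tag
RUN apt-get update # separate from install (cache bug)
RUN apt-get install python3 # missing -y, missing cleanup
COPY . . # copies everything including .git
USER root # running as root in production
"""
good = """FROM python:3.12.3-slim-bookworm
RUN apt-get update && apt-get install -y --no-install-recommends curl
USER 1000:1000
"""
print(lint_dockerfile(bad))
print(lint_dockerfile(good))  # []
```

For production use, an established Dockerfile linter is a better choice; the sketch only shows that the rules are simple pattern checks.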
All containers in a pod share the same network namespace — they can reach each other on localhost.
Part 6
CRI — Container Runtime Interface
The Bridge Between kubelet and Containers
Why CRI Exists
The Docker Problem (pre-CRI):
Before CRI (K8s < 1.5):
kubelet ──── Docker-specific code ──── dockerd ──── containerd ──── runc
Problems:
✗ Kubelet was hardcoded to Docker's API
✗ Adding a new runtime = modifying kubelet source code
✗ Docker daemon had features K8s didn't need (build, swarm, etc.)
✗ Extra layer of indirection (kubelet→dockerd→containerd→runc)
The CRI Solution (K8s 1.5+, stable 1.26+):
After CRI:
kubelet ──── CRI (gRPC) ──── containerd ──── runc
or
kubelet ──── CRI (gRPC) ──── CRI-O ──── runc/crun
Benefits:
✓ kubelet is runtime-agnostic
✓ Any CRI-compliant runtime works
✓ Direct path — no unnecessary Docker daemon
✓ Kubernetes removed dockershim in v1.24
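Because CRI is an interface, any runtime that answers the same calls can sit behind kubelet. The sketch below is an in-memory stand-in, not a gRPC client: `FakeRuntime` and `start_pod` are hypothetical names, and the methods mirror a handful of CRI RPCs (RunPodSandbox, PullImage, CreateContainer, StartContainer) in snake_case. The call order is a simplification of what a kubelet-like client does.

```python
import itertools

class FakeRuntime:
    """In-memory stand-in for a CRI runtime (assumption: no real gRPC involved)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.images = set()
        self.sandboxes = {}
        self.containers = {}

    # ImageService
    def pull_image(self, image: str) -> str:
        self.images.add(image)
        return image

    # RuntimeService
    def run_pod_sandbox(self, pod_name: str) -> str:
        sandbox_id = f"sandbox-{next(self._ids)}"
        self.sandboxes[sandbox_id] = pod_name
        return sandbox_id

    def create_container(self, sandbox_id: str, image: str) -> str:
        if image not in self.images:
            raise RuntimeError(f"image not present: {image}")
        container_id = f"ctr-{next(self._ids)}"
        self.containers[container_id] = {
            "sandbox": sandbox_id, "image": image, "state": "CONTAINER_CREATED",
        }
        return container_id

    def start_container(self, container_id: str) -> None:
        self.containers[container_id]["state"] = "CONTAINER_RUNNING"

def start_pod(runtime, pod_name: str, image: str) -> str:
    """Simplified call order: sandbox first, then image, then container."""
    sandbox_id = runtime.run_pod_sandbox(pod_name)
    runtime.pull_image(image)
    container_id = runtime.create_container(sandbox_id, image)
    runtime.start_container(container_id)
    return container_id

runtime = FakeRuntime()
cid = start_pod(runtime, "web", "nginx:1.27")
print(cid, runtime.containers[cid]["state"])
```

`start_pod` never references a concrete runtime type, which is the runtime-agnostic property listed above: swapping `FakeRuntime` for another object with the same methods requires no change to the caller.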
| Slide(s) | Topic | Mins | Teaching Notes |
|---|---|---|---|
| | | | Set expectations: "By end of today, you'll understand everything between hardware and VMs" |
| 3–6 | Bare Metal | 15 | Ask the room: "Who has installed an OS on bare metal?" Start with what they know. Highlight the utilization problem; it motivates everything that follows. |
| 7–9 | What is Virtualization | 10 | History slide is a great storytelling moment. The Popek & Goldberg theorem is the "aha": x86 wasn't classically virtualizable until 2005! |
| 10–13 | Hypervisor Core | 20 | Key teaching point: Explain Type 1 vs Type 2 with the building analogy: Type 1 = the building foundation itself, Type 2 = a room inside someone else's building. KVM slide is critical: "Linux IS the hypervisor" blows minds. |
| 14–17 | Reference Model | 15 | Use the traffic controller analogy for Dispatcher. Walk through the CR3 example step by step to make it concrete. Poll: "What happens when a VM tries to reboot?" |
| 18–21 | Virt vs Container | 10 | This is the bridge to Session 2. Key message: "In production, you use BOTH." The full stack diagram at the end is the money shot. |
| 22–23 | Takeaways + Q&A | 10 | Recap the 7 key points. Open discussion. |
Key Stories & Analogies to Use
The Hotel Analogy (Virtualization)
"Think of a hypervisor as a hotel building. Each VM is a complete hotel room with its own bathroom, kitchen, and bedroom. Guests are fully isolated — what happens in room 301 doesn't affect room 302. But each room takes significant space and resources."
The Apartment Analogy (Containerization — preview)
"Containers are like apartments in a shared building. They share plumbing (kernel), electrical (CPU scheduler), and foundation (hardware). Each has its own locked door (namespaces) and utility meter (cgroups). Much more efficient, but the building superintendent (kernel) is a shared dependency."
The Traffic Controller (Dispatcher)
"The dispatcher is a traffic controller at an intersection. It doesn't drive any car, it doesn't fuel any car — it just decides which lane each car goes to. Privileged instruction? Go to the Interpreter lane. Resource change? Go to the Allocator lane."
The x86 Problem — Tell It As a Mystery
"In 1974, Popek and Goldberg proved that you CAN virtualize any architecture... IF sensitive instructions always trap. For 25 years, x86 couldn't do this — 17 instructions were sensitive but didn't trap. VMware's genius was binary translation: rewrite the guest code on-the-fly to replace those sneaky instructions with safe trapping versions. Then Intel said 'fine, we'll fix it in hardware' — VT-x in 2005."
Common Questions & Answers
Q: Is Docker a hypervisor?
A: No. Docker is a container runtime that uses Linux kernel primitives (namespaces, cgroups). It doesn't run full operating systems — just isolated processes on a shared kernel.
Q: Is KVM Type 1 or Type 2?
A: This is debated! Technically it's Type 1 — the KVM module makes the Linux kernel itself into a hypervisor. But since you still have a full Linux userspace, some argue it's a "hybrid." The practical answer: it delivers Type 1 performance with Type 2 convenience.
Q: Why can't I just use containers for everything?
A: Containers share the host kernel, so a kernel vulnerability affects ALL containers. VMs provide hardware-level isolation. In regulated environments (banking, defense, healthcare), VM isolation is often a compliance requirement. Also, you can't run Windows containers natively on a Linux kernel, because a container uses the host's kernel rather than bringing its own; crossing that boundary takes a VM (WSL2 does the reverse, running Linux on Windows inside a VM).
Q: What about WSL2?
A: WSL2 is actually a lightweight Hyper-V VM running a real Linux kernel. It's Type 1 virtualization (Hyper-V is bare metal, Windows runs in the root partition) with a great developer experience layer on top.
SESSION 2: From Containers to Kubernetes
Slide Timing Guide
| Slide(s) | Topic | Mins | Teaching Notes |
|---|---|---|---|
| 1–2 | Title + Agenda | 2 | Quick recap of Session 1's key points before diving in |
| 3–8 | Linux Kernel Primitives | 15 | DEMO OPPORTUNITY: Run unshare live. Show lsns. Create a network namespace and show isolated ip addr. This is the most educational part: demystify the "magic" of containers. |
| 9–11 | Containerization | 10 | The layer diagram and OCI spec slide are key. Emphasize: "A container image is just a tarball of filesystem layers + metadata JSON." |
| 12–14 | Dockerfiles & Buildx | 15 | LIVE CODING: Write a Dockerfile together. Show the DO vs DON'T side by side. Multi-stage builds are the #1 practical takeaway. |
| 15–19 | Container Runtimes | 15 | The landscape diagram is the anchor slide. Key message: "Docker is 3 layers: dockerd → containerd → runc. Kubernetes skips dockerd." |
| 20–23 | Kubernetes Architecture | 15 | Draw on whiteboard: Start with "a user types kubectl apply" and trace the full path. The pod start sequence is the master slide. |
| 24–28 | CRI Deep Dive | 10 | The protobuf definitions make it concrete: CRI is just a gRPC API. CRI-O vs containerd comparison is a common team decision point. |
| 29–31 | Full Journey + Triage | 5 | The triage analogy lands well with mixed audiences. The full stack map is the synthesis of both sessions. |
| 32–33 | Takeaways + Q&A | 5 | End with the hands-on next steps. |
Live Demo Script (Session 2)
Demo 1: Build a Container From Scratch (5 min)
# Show current namespaces
lsns
# Create a new PID + mount + UTS namespace
# (--mount-proc remounts /proc so ps sees only the new PID namespace)
sudo unshare --pid --mount --uts --fork --mount-proc /bin/bash
# Inside the new namespace:
hostname container-demo
hostname # shows "container-demo"
ps aux # only shows processes in this namespace!
# PID 1 is our bash shell
# Exit and show host hostname is unchanged
exit
hostname # still the original hostname
Demo 2: Network Namespace (5 min)
# Create a network namespace
sudo ip netns add demo-ns
# Show it's nearly empty
sudo ip netns exec demo-ns ip addr
# Only loopback, and it's DOWN
# Bring up loopback
sudo ip netns exec demo-ns ip link set lo up
# Create a veth pair (virtual ethernet cable)
sudo ip link add veth-host type veth peer name veth-ns
# Move one end into the namespace
sudo ip link set veth-ns netns demo-ns
# Assign IPs
sudo ip addr add 10.0.0.1/24 dev veth-host
sudo ip link set veth-host up
sudo ip netns exec demo-ns ip addr add 10.0.0.2/24 dev veth-ns
sudo ip netns exec demo-ns ip link set veth-ns up
# Ping across the namespace boundary!
ping -c 2 10.0.0.2
# Cleanup
sudo ip netns del demo-ns
Demo 3: cgroup resource limit (3 min)
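# Create a cgroup with 50MB memory limit (cgroups v2)
# Assumes the memory controller is enabled for child cgroups of the root
sudo mkdir /sys/fs/cgroup/demo
echo "52428800" | sudo tee /sys/fs/cgroup/demo/memory.max
# Run a process in that cgroup (this moves the current shell into it)
echo $$ | sudo tee /sys/fs/cgroup/demo/cgroup.procs
# Try to allocate more than 50MB → OOM killed!
# (if swap is enabled, the process may swap instead of being killed)
python3 -c "x = ' ' * 60_000_000" # Killed!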
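The cgroup v2 control files used in this demo are plain text, so they are easy to interpret programmatically. The sketch below parses the contents of `memory.max` and `cpu.max` from strings rather than reading `/sys/fs/cgroup`, so it runs anywhere; the helper names `parse_memory_max` and `parse_cpu_max` are hypothetical.

```python
def parse_memory_max(text: str):
    """memory.max holds a byte count, or the literal 'max' for no limit."""
    value = text.strip()
    return None if value == "max" else int(value)

def parse_cpu_max(text: str):
    """cpu.max holds '<quota> <period>' in microseconds; quota may be 'max'.
    Returns the limit in CPUs, or None when unlimited."""
    quota, period = text.split()
    return None if quota == "max" else int(quota) / int(period)

print(parse_memory_max("52428800\n") / 2**20)  # 50.0 (MiB, the demo's limit)
print(parse_cpu_max("50000 100000\n"))         # 0.5
print(parse_cpu_max("max 100000\n"))           # None
```

A quota of 50000 µs per 100000 µs period means the group may use half of one CPU on average.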
Demo 4: crictl basics (2 min, needs a K8s node)
# List pods via CRI
sudo crictl pods
# List containers
sudo crictl ps
# Inspect a container
sudo crictl inspect <container-id>
# Pull an image via CRI
sudo crictl pull nginx:1.27
# Check runtime info
sudo crictl info
Common Questions & Answers (Session 2)
Q: Why did Kubernetes remove Docker support?
A: Kubernetes did not drop support for Docker-built images; it removed dockershim, the translation layer in kubelet that converted CRI calls to Docker API calls, which Docker then passed on to containerd anyway. Removing it: (a) eliminated a maintenance burden, (b) removed an unnecessary indirection layer, (c) let kubelet talk directly to containerd via CRI. Your container images still work because they follow the OCI standard.
Q: Should we use CRI-O or containerd?
A: Both are production-grade.
containerd if: you want the most widely adopted option with the largest community (default for GKE, EKS, and AKS, and a common choice with kubeadm).
CRI-O if: you want a minimal, K8s-only runtime with version-locked releases (default for OpenShift).
Neither is "better" — it's organizational preference.
Q: Are containers less secure than VMs?
A: Yes, by default. Containers share the host kernel → a kernel exploit affects all containers. VMs have hardware isolation (VT-x, separate kernel per VM). However, container security can be hardened significantly with: seccomp profiles, AppArmor/SELinux, rootless containers, read-only rootfs, network policies, and tools like Falco. For maximum isolation, use Kata Containers (container UX, VM isolation).
Q: What is a "pause" container?
A: When CRI-O or containerd creates a pod, it first starts a tiny "pause" container (it literally calls the pause() syscall and sleeps until signaled). This container keeps the pod's network namespace alive. Application containers added to the pod join this existing namespace. If an app container crashes and restarts, the network namespace (and the pod's IP address) survives because the pause container is still running.
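The ownership relationship described in this answer can be modeled in a few lines. This is a conceptual sketch only: the `Pod` class is hypothetical, the namespace is a plain dictionary, and the IP address is an invented example.

```python
class Pod:
    """Toy model: the pause container owns the namespace; app containers only join it."""

    def __init__(self, ip: str):
        # Created once at sandbox start and held by the pause container.
        self.netns = {"ip": ip}
        self.app_restarts = 0

    def start_app(self) -> dict:
        # An app container joins the existing namespace instead of creating one.
        return self.netns

    def restart_app(self) -> dict:
        self.app_restarts += 1
        return self.start_app()

pod = Pod("10.244.1.7")   # example address, chosen for illustration
first = pod.start_app()
second = pod.restart_app()
print(first is second, second["ip"])  # True 10.244.1.7
```

Because the restarted app receives the same namespace object, its address does not change; only tearing down the pod itself would release it.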
Q: Can I run Kubernetes on bare metal?
A: Absolutely — and many high-performance workloads do (no hypervisor overhead). Tools for bare metal K8s: kubeadm, k3s, Talos Linux, Flatcar Container Linux, Tinkerbell for provisioning, MetalLB for load balancing, Rook/Ceph for storage.
Whiteboard Diagrams to Draw Live
1. "What Happens When You Type kubectl run nginx --image=nginx"
Each ring = different cost/performance/isolation tradeoff.
General Teaching Tips
Start each concept with WHY, then HOW. "Before we explain cgroups, let's understand why you need them — imagine 50 containers and one starts eating all the memory..."
Use the "zoom in" technique. Show the full stack diagram → "Today we're zooming into THIS layer."
Every 15 minutes, interact. Ask a question, run a demo, or do a quick poll. 90 minutes of pure slides = sleeping audience.
The triage analogy works. Your team likely knows medical triage from common knowledge. Map it: "immediate = critical pods, urgent = guaranteed QoS, standard = burstable, non-urgent = best-effort, deceased = evicted."
End with hands-on homework. Give specific commands to try. People remember what they do, not what they hear.
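For the triage mapping, it can help to show how a pod ends up in a QoS class in the first place. The sketch below follows the documented Kubernetes rules in simplified form: it compares resource values as plain strings and does not normalize units. The function name `qos_class` and the dictionary layout are assumptions made for this example.

```python
def qos_class(containers: list) -> str:
    """Classify a pod's QoS class (simplified).
    Each container is a dict with optional 'requests' and 'limits' dicts
    keyed by 'cpu' and 'memory'."""
    def get(container, kind):
        return container.get(kind) or {}

    # No requests or limits anywhere: BestEffort.
    if all(not get(c, "requests") and not get(c, "limits") for c in containers):
        return "BestEffort"

    def guaranteed(container):
        limits, requests = get(container, "limits"), get(container, "requests")
        # Both resources need a limit; an omitted request defaults to the limit.
        return all(
            res in limits and requests.get(res, limits[res]) == limits[res]
            for res in ("cpu", "memory")
        )

    return "Guaranteed" if all(guaranteed(c) for c in containers) else "Burstable"

print(qos_class([{}]))                                          # BestEffort
print(qos_class([{"requests": {"cpu": "250m"}}]))               # Burstable
print(qos_class([{"requests": {"cpu": "250m", "memory": "64Mi"},
                  "limits": {"cpu": "250m", "memory": "64Mi"}}]))  # Guaranteed
```

Under node memory pressure, BestEffort pods are generally evicted first and Guaranteed pods last, which is what the triage ordering conveys.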