@Fiooodooor
Last active April 2, 2026 09:37

Quick Reference — Bare Metal to Kubernetes

Master Class Handout (1-pager per session)


SESSION 1 CHEAT SHEET: Bare Metal → Virtualization

The Stack

VMs → Hypervisor → Hardware

Hypervisor Types

Aspect | Type 1 (Bare Metal) | Type 2 (Hosted)
Runs on | Hardware directly | Host OS
Performance | Near-native | 10–30% overhead
Use | Production / Cloud | Dev / Test
Examples | ESXi, KVM, Hyper-V, Xen | VirtualBox, VMware Workstation

Hypervisor Reference Model (Popek & Goldberg, 1974)

DISPATCHER  → Entry point, routes traps
ALLOCATOR   → Manages resources for VMs
INTERPRETER → Emulates privileged instructions

Hardware Virtualization Extensions

  • Intel VT-x / AMD-V — CPU virtualization (VM entry / VM exit)
  • Intel VT-d / AMD-Vi — I/O virtualization (IOMMU, DMA isolation)
  • SR-IOV — NIC hardware partitioning (virtual functions)

Key Commands

# Check if CPU supports virtualization
grep -E 'vmx|svm' /proc/cpuinfo

# Check KVM availability
lsmod | grep kvm

# List VMs (libvirt)
virsh list --all

SESSION 2 CHEAT SHEET: Containers → Kubernetes

The Container "Recipe"

Container = Namespaces + cgroups + seccomp + capabilities + rootfs

Linux Namespaces

NS | Isolates | Flag
mnt | Filesystems | CLONE_NEWNS
pid | Process IDs | CLONE_NEWPID
net | Network stack | CLONE_NEWNET
uts | Hostname | CLONE_NEWUTS
ipc | IPC resources | CLONE_NEWIPC
user | UID/GID | CLONE_NEWUSER
cgroup | cgroup view | CLONE_NEWCGROUP

Container Runtime Stack

HIGH: Docker / Podman / nerdctl     (UX: build, run, push)
MID:  containerd / CRI-O            (lifecycle: create, start, stop)
LOW:  runc / crun / kata / gVisor   (OCI: clone, unshare, execve)

CRI = Container Runtime Interface

kubelet ←─ gRPC ─→ containerd or CRI-O ──→ runc/crun

Two services: RuntimeService (pod/container lifecycle) + ImageService (pull/list/remove)

Kubernetes Components

CONTROL PLANE:                 WORKER NODE:
├── API Server (REST)          ├── kubelet (CRI client)
├── etcd (state store)         ├── container runtime
├── Scheduler (placement)      └── kube-proxy (networking)
└── Controller Manager (loops)

Pod = Scheduling Unit

  • Shares: network namespace, IPC, volumes
  • Each pod gets its own IP
  • "pause" container holds the namespace

Docker vs Podman

Aspect | Docker | Podman
Daemon | Yes (dockerd) | No
Root | Root by default | Rootless by default
Build | docker build / buildx | podman build / buildah
K8s | via containerd | via CRI-O (sibling project)

Essential Commands

# Namespaces
unshare --pid --net --mount --fork /bin/bash
lsns
nsenter -t <PID> -n                  # enter network ns

# cgroups (v2; the *.max files exist only in non-root cgroups, e.g. inside a container)
cat /sys/fs/cgroup/memory.max
cat /sys/fs/cgroup/cpu.max

# Docker / Podman
docker build -t myapp:1.0 .
docker run --rm -p 8080:80 myapp:1.0
podman pod create --name mypod

# Kubernetes
kubeadm init --pod-network-cidr=10.244.0.0/16
kubeadm join <cp>:6443 --token <tok> --discovery-token-ca-cert-hash sha256:<hash>
kubectl get nodes
kubectl get pods -A

# CRI (on K8s node)
crictl pods
crictl ps
crictl info
crictl inspect <id>

Dockerfile Best Practices

# Multi-stage, pinned version, non-root
FROM golang:1.22-alpine AS builder
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /app ./cmd/server

FROM gcr.io/distroless/static-debian12:nonroot
COPY --from=builder /app /app
USER nonroot:nonroot
ENTRYPOINT ["/app"]

Security Layers (defense in depth)

1. Namespaces        → visibility isolation
2. cgroups           → resource limits
3. seccomp           → syscall filtering
4. Capabilities      → fine-grained privileges
5. AppArmor/SELinux  → mandatory access control
6. Network Policies  → pod-to-pod firewall
7. Read-only rootfs  → immutable filesystem
8. Non-root user     → no UID 0

🖥️ Session 1

From Bare Metal to Virtualization

Master Class — Infrastructure Foundations

Duration: 90 minutes
Level: Intermediate → Advanced


📋 Session 1 — Agenda

# | Topic | Time
1 | Bare Metal OS Deployment | 15 min
2 | What is Virtualization? | 10 min
3 | The Hypervisor — Core Concepts | 20 min
4 | Type 1 vs Type 2 Hypervisors | 15 min
5 | Hypervisor Reference Model Deep Dive | 10 min
6 | Virtualization vs Containerization — Preview | 10 min
7 | Q&A / Discussion | 10 min

Part 1

Bare Metal OS Deployment


What is "Bare Metal"?

Bare metal = software running directly on hardware without any intervening virtualization layer.

┌─────────────────────────────┐
│       Application(s)        │
├─────────────────────────────┤
│     Operating System        │
│   (Linux / Windows / BSD)   │
├─────────────────────────────┤
│   Hardware (CPU, RAM, NIC,  │
│     Storage, GPU, etc.)     │
└─────────────────────────────┘
  • 1:1 relationship — one OS owns the entire machine
  • Full, unmediated hardware access
  • Maximum performance, minimum abstraction

Bare Metal — Characteristics

✅ Advantages

  • Maximum performance — no virtualization overhead
  • Full hardware access — DMA, IOMMU, SR-IOV, GPU passthrough native
  • Deterministic latency — critical for HPC, real-time, HFT
  • Simpler debugging — no hypervisor layer to reason about

❌ Disadvantages

  • Low utilization — typical server uses only 5–15% of capacity
  • Slow provisioning — hours to days (PXE boot, Kickstart, Preseed)
  • No isolation — a rogue process can crash the entire machine
  • Scaling = buying hardware — no elasticity

Bare Metal Provisioning Methods

Method | Description | Use Case
PXE/iPXE + Kickstart | Network boot → automated install | Data center fleet
Cloud-Init | First-boot config injection | Cloud bare metal (e.g., AWS i3.metal)
Ironic (OpenStack) | Bare metal as a service | Private cloud
MAAS (Canonical) | Metal as a Service | Ubuntu-centric DC
Tinkerbell (Equinix) | Declarative bare metal workflow | Edge / hybrid
Manual ISO Install | USB/DVD boot + manual steps | Lab / dev

Key insight: Bare metal provisioning is fundamentally slower than VM or container creation. This drove the industry toward virtualization.


Linux on Bare Metal — Key Subsystems

User Space
├── systemd (PID 1, service management)
├── Applications & daemons
├── Shared libraries (glibc, libssl, ...)
│
Kernel Space
├── Process Scheduler (CFS / EEVDF)
├── Memory Management (page tables, NUMA, hugepages)
├── Virtual File System (VFS)
├── Network Stack (netfilter, tc, XDP, eBPF)
├── Device Drivers (NIC, storage, GPU)
├── Security Modules (SELinux, AppArmor, seccomp)
│
Hardware
├── CPU (rings 0-3, VMX extensions)
├── RAM (DDR4/5, NUMA nodes)
├── NIC (queues, RSS, offloads)
├── Storage (NVMe, SATA, HBA)
└── IOMMU, SR-IOV, PCIe topology

The Problem Bare Metal Couldn't Solve

Scenario: A company has 50 physical servers

Server 1:  Web App     → 8% CPU utilization
Server 2:  Database    → 12% CPU utilization
Server 3:  CI Runner   → 3% avg, 90% peak (bursts)
Server 4:  Mail Server → 5% CPU utilization
...
Server 50: Monitoring  → 2% CPU utilization

Average utilization: ~8% → 92% of purchased compute is wasted

💡 The question that launched an industry:

"Can we run multiple isolated workloads on one physical machine?"

Answer: Yes — Virtualization.
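
The arithmetic behind the scenario above can be sketched in a few lines of shell. This is a rough sizing estimate only: the function name `hosts_needed` and the 70% target are assumptions made for illustration, the hosts are assumed to be identical, and averaging hides the CI runner's 90% bursts, which real capacity planning would have to cover.

```shell
# hosts_needed SERVERS AVG_UTIL_PCT TARGET_UTIL_PCT
# Estimate how many consolidated hosts cover the same average load.
hosts_needed() {
    # Total demand, measured in "percent of one server".
    demand=$(( $1 * $2 ))
    # Ceiling division by the utilization each host is allowed to reach.
    echo $(( (demand + $3 - 1) / $3 ))
}

# 50 servers at ~8% average, consolidated onto hosts run at 70%:
hosts_needed 50 8 70    # prints 6
```

Under these assumptions, 50 lightly loaded machines collapse to roughly 6 hosts, which is the economic argument for the hypervisor in one number.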


Part 2

What is Virtualization?


Virtualization — Definition

Virtualization is the creation of a virtual (rather than physical) version of something — servers, storage, networks, or operating systems — using a software abstraction layer.

┌──────────┐ ┌──────────┐ ┌──────────┐
│   VM 1   │ │   VM 2   │ │   VM 3   │
│ (Ubuntu) │ │(Windows) │ │(FreeBSD) │
│ App A    │ │ App B    │ │ App C    │
├──────────┤ ├──────────┤ ├──────────┤
│ Guest OS │ │ Guest OS │ │ Guest OS │
└────┬─────┘ └────┬─────┘ └────┬─────┘
     │             │             │
┌────┴─────────────┴─────────────┴────┐
│         HYPERVISOR (VMM)            │
├─────────────────────────────────────┤
│         Physical Hardware           │
└─────────────────────────────────────┘

A Brief History of Virtualization

Year | Milestone
1967 | IBM CP-40 — first hypervisor, on a modified System/360 Model 40 (CP-67 followed on the Model 67)
1972 | IBM VM/370 — commercial virtual machine OS
1998 | VMware founded — x86 virtualization via binary translation
1999 | VMware Workstation 1.0 released
2003 | Xen hypervisor open-sourced (paravirtualization)
2005–06 | Intel VT-x and AMD-V — hardware-assisted virtualization
2007 | KVM merged into Linux kernel (2.6.20)
2008 | Microsoft Hyper-V released
2010s | Cloud era — EC2, GCE, Azure all built on hypervisors
2020s | Lightweight VMMs: Firecracker (AWS Lambda), Cloud Hypervisor

The x86 trap-and-emulate problem (Popek & Goldberg, 1974) wasn't solved until VMware's binary translation (1999) and Intel VT-x (2005).


Why Virtualization? — The Core Value

Before Virtualization (Physical Servers)

  • 1 app = 1 server
  • 5–15% average CPU utilization
  • Weeks to provision new servers
  • Hardware lock-in

After Virtualization

  • Many apps on 1 server → 60–80% utilization
  • Minutes to create new VMs → agility
  • Hardware abstraction → portability
  • Snapshots & live migration → disaster recovery
  • Isolation → security boundaries between tenants

Part 3

The Hypervisor — Core Concepts


Hypervisor — Definition

A hypervisor (also called Virtual Machine Monitor — VMM) is software, firmware, or hardware that creates and runs virtual machines by separating a computer's software from its hardware.

What it does:

  1. Partitions physical resources (CPU, memory, I/O) among VMs
  2. Isolates VMs from each other
  3. Emulates or paravirtualizes hardware for guest OSes
  4. Schedules VM execution on physical CPUs
  5. Intercepts privileged instructions from guest kernels

The Contract:

Each VM believes it has exclusive access to dedicated hardware. The hypervisor maintains this illusion while sharing the real hardware.


Types of Hypervisors — Overview

         TYPE 1 (Bare Metal)              TYPE 2 (Hosted)
     ┌────────┐  ┌────────┐         ┌────────┐  ┌────────┐
     │  VM 1  │  │  VM 2  │         │  VM 1  │  │  VM 2  │
     │Guest OS│  │Guest OS│         │Guest OS│  │Guest OS│
     └───┬────┘  └───┬────┘         └───┬────┘  └───┬────┘
         │           │                   │           │
  ┌──────┴───────────┴──────┐     ┌──────┴───────────┴──────┐
  │   TYPE 1 HYPERVISOR     │     │   TYPE 2 HYPERVISOR     │
  │   (runs ON hardware)    │     │   (runs ON host OS)     │
  ├─────────────────────────┤     ├─────────────────────────┤
  │     Physical Hardware   │     │     Host OS (Linux,     │
  └─────────────────────────┘     │     Windows, macOS)     │
                                  ├─────────────────────────┤
                                  │   Physical Hardware     │
                                  └─────────────────────────┘

Type 1 — Bare Metal Hypervisor

Runs directly on hardware — it is the operating system (or functionally replaces it).

Examples:

Hypervisor | Vendor | Notes
VMware ESXi | Broadcom | Industry standard for enterprise
Microsoft Hyper-V | Microsoft | Built into Windows Server
KVM | Linux/Red Hat | Kernel module — Linux IS the hypervisor
Xen | Linux Foundation | Used by AWS EC2 (legacy instances)
Proxmox VE | Proxmox | KVM + LXC, open-source
Firecracker | AWS | MicroVM for Lambda/Fargate
bhyve | FreeBSD | Native FreeBSD hypervisor

Type 1 — Characteristics

✅ Pros

  • Near-native performance — minimal overhead (1–5%)
  • Strong isolation — small attack surface
  • Hardware-assisted — leverages VT-x/AMD-V, VT-d, SR-IOV
  • Scalable — run hundreds of VMs per host
  • Live migration — move running VMs between hosts with near-zero downtime

❌ Cons

  • Complex to set up — dedicated infrastructure
  • Requires compatible hardware — VT-x/AMD-V, IOMMU
  • Management overhead — needs vCenter, oVirt, Proxmox UI, etc.
  • Expensive licensing (VMware vSphere, Microsoft Datacenter)

🎯 Use Cases

Enterprise data centers, cloud providers, production workloads, multi-tenant hosting


Type 2 — Hosted Hypervisor

Runs as an application on top of a conventional operating system.

Examples:

Hypervisor | Platform | Notes
Oracle VirtualBox | Cross-platform | Free, open-source
VMware Workstation | Windows/Linux | Commercial, feature-rich
VMware Fusion | macOS | Workstation equivalent for Mac
Parallels Desktop | macOS | Best macOS integration
QEMU | Cross-platform | Emulator + virtualizer
GNOME Boxes | Linux | Simple QEMU frontend

Type 2 — Characteristics

✅ Pros

  • Easy to install — just another application
  • No dedicated hardware — runs on your laptop/desktop
  • Great host–guest integration — shared folders, clipboard, drag & drop
  • Snapshots — quick state save/restore
  • Perfect for development, testing, malware analysis

❌ Cons

  • Performance degradation — overhead from host OS layer
  • Security degradation — host OS compromise → all VMs compromised
  • Resource contention — competes with host OS and other apps
  • Not suitable for production — no live migration, limited HA

🎯 Use Cases

Development, testing, learning, malware sandboxing, running legacy apps


KVM — A Special Case (Type 1.5?)

┌──────────────────────────────────────────────┐
│              User Space (Linux)               │
│  ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│  │ QEMU/VM1 │ │ QEMU/VM2 │ │ Normal Apps  │ │
│  └────┬─────┘ └────┬─────┘ └──────────────┘ │
│       │             │                         │
├───────┴─────────────┴─────────────────────────┤
│          Linux Kernel + KVM module            │
│  ┌─────────────────────────────────────────┐  │
│  │  KVM: /dev/kvm — hardware virt. API     │  │
│  │  QEMU: device emulation in userspace    │  │
│  │  virtio: paravirtualized I/O drivers    │  │
│  └─────────────────────────────────────────┘  │
├───────────────────────────────────────────────┤
│     Hardware (VT-x / AMD-V, VT-d, SR-IOV)   │
└───────────────────────────────────────────────┘

KVM turns Linux itself into a Type-1 hypervisor. Each VM is a regular Linux process (typically QEMU, with one thread per vCPU). The kernel's scheduler, memory manager, and driver stack are reused, so there is no separate hypervisor OS.


Type 1 vs Type 2 — Comparison Matrix

Aspect | Type 1 (Bare Metal) | Type 2 (Hosted)
Runs on | Hardware directly | Host operating system
Performance | Near-native (1–5% overhead) | Moderate (10–30% overhead)
Security | Strong isolation | Host compromise = game over
Use case | Production, cloud, DC | Dev, test, sandbox
Boot time | Seconds (microVMs) to minutes | Minutes (host + hypervisor + VM)
Management | vCenter, oVirt, Proxmox | GUI application
Live migration | ✅ Yes | ❌ No
Cost | $$$ (licenses + dedicated HW) | $ (free or cheap)
Examples | ESXi, KVM, Hyper-V, Xen | VirtualBox, VMware Workstation

Part 4

Hypervisor Reference Model

(Popek & Goldberg, 1974)


The Three Pillars of a Hypervisor

Popek & Goldberg (1974) defined the formal requirements for virtualizable architectures and the three core modules that coordinate to emulate hardware:

         ┌─────────────────────────────────────┐
         │         VIRTUAL MACHINE             │
         │     (guest OS + applications)       │
         └──┬───────────┬──────────────┬───────┘
            │           │              │
            ▼           ▼              ▼
   ┌────────────┐ ┌──────────┐ ┌─────────────┐
   │ DISPATCHER │ │ALLOCATOR │ │ INTERPRETER  │
   │            │ │          │ │              │
   │ Entry point│ │ Resource │ │  Privileged  │
   │ Routes     │ │ manager  │ │  instruction │
   │ traps to   │ │ for VMs  │ │  emulation   │
   │ handlers   │ │          │ │              │
   └────────────┘ └──────────┘ └─────────────┘

Module 1: DISPATCHER

The entry point of the VMM. All traps from guest VMs arrive here first.

How it works:

  1. Guest VM executes a sensitive instruction (privileged or behavior-sensitive)
  2. Hardware traps to the hypervisor (via VT-x VMEXIT or ring transition)
  3. Dispatcher receives the trap
  4. Dispatcher examines the trap reason
  5. Dispatcher routes to either the Allocator or the Interpreter

Modern implementations:

  • KVM: VM-exit handler in the kernel module (e.g., vmx_handle_exit() on Intel)
  • ESXi: VMM world trap handler
  • Xen: Hypercall handler + trap dispatch

Think of the Dispatcher as a traffic controller — it doesn't do the work, it decides who does.


Module 2: ALLOCATOR

Decides and manages what physical resources each VM gets.

Responsibilities:

  • CPU scheduling — which VM runs on which physical core, for how long
  • Memory allocation — how much RAM each VM gets, shadow/nested page tables
  • I/O assignment — virtual devices mapped to physical or emulated devices
  • Resource limits — prevent one VM from starving others

Triggered when:

A guest instruction changes the machine's resource mapping — e.g., setting up new page tables, accessing a new I/O port, changing interrupt vectors.

Modern implementations:

  • KVM: Linux CFS/EEVDF scheduler + KSM memory dedup + cgroups
  • ESXi: host CPU/memory schedulers + memory ballooning (DRS balances VMs across hosts at the cluster level)
  • Xen: Credit/Credit2 scheduler + Xen grant tables

Module 3: INTERPRETER

Contains interpreter routines that emulate privileged instructions.

Responsibilities:

  • Execute an equivalent safe sequence when a guest tries a privileged operation
  • Emulate hardware behavior without giving the guest real hardware access
  • Maintain the virtual CPU state (virtual registers, flags, control registers)

Examples of interpreted instructions:

Guest Instruction | What Interpreter Does
LGDT (load GDT) | Updates virtual GDT, maintains shadow GDT
MOV CR3 (page table switch) | Updates shadow/nested page tables
IN/OUT (I/O port access) | Routes to virtual device emulator
HLT (halt CPU) | Deschedules vCPU, wakes on virtual interrupt
INVLPG (TLB invalidation) | Flushes relevant shadow TLB entries

Reference Model in Action — Full Flow

Guest VM: MOV to CR3 (switch page tables)
    │
    ▼ VMEXIT (hardware trap)
┌──────────┐
│DISPATCHER │──── Trap reason: CR3 write
└──┬───┬───┘
   │   │
   │   ▼ (resource change detected)
   │ ┌──────────┐
   │ │ALLOCATOR │──── Update VM's memory mapping
   │ └──┬───────┘     Track new page table base address
   │    │
   │    ▼ (privileged instruction)
   ▼ ┌───────────┐
     │INTERPRETER│──── Emulate CR3 load safely
     └──┬────────┘     Update nested/shadow page tables
        │              Flush relevant TLB entries
        ▼
    VM entry (resume guest)
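
The flow above can be mimicked with a toy model. This is only a sketch of the routing idea, not how any real hypervisor is written: the exit-reason labels (`CR3_WRITE`, `IO_PORT`, `HLT`) and the function names are made up for illustration, and the routing follows the slides (a CR3 write touches both the Allocator and the Interpreter, the others go to the Interpreter).

```shell
# dispatch EXIT_REASON: print the modules that handle the trap, in order.
# The exit-reason labels are hypothetical names used only in this sketch.
dispatch() {
    case "$1" in
        CR3_WRITE) echo "allocator interpreter" ;;  # mapping change, then emulate the load
        IO_PORT)   echo "interpreter" ;;            # hand off to the virtual device emulator
        HLT)       echo "interpreter" ;;            # deschedule the vCPU until an interrupt
        *)         echo "unhandled" ;;              # anything this toy model does not know
    esac
}

# handle_exit EXIT_REASON: trace one trap from exit to resume.
handle_exit() {
    echo "exit: $1"
    for module in $(dispatch "$1"); do
        echo "  -> $module"
    done
    echo "resume guest"
}

handle_exit CR3_WRITE
```

The point of the sketch is the shape: the dispatcher only decides who handles the trap, and the guest resumes once the named modules have run.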

Popek & Goldberg — The Formal Theorem

Three types of instructions in an ISA:

  1. Privileged instructions — cause a trap when executed in user mode
  2. Sensitive instructions:
    • Control-sensitive — change system configuration (e.g., I/O, page tables)
    • Behavior-sensitive — behave differently depending on privilege level

The Theorem (1974):

A virtual machine monitor may be constructed for any conventional third-generation computer if the set of sensitive instructions is a subset of the set of privileged instructions.

The x86 Problem:

x86 had 17 sensitive but non-privileged instructions (e.g., SGDT, SIDT, POPF) — they didn't trap! Solutions:

  • Binary translation (VMware, 1999) — rewrite guest code on-the-fly
  • Paravirtualization (Xen, 2003) — modify guest OS to use hypercalls
  • Hardware VT-x/AMD-V (2005–06) — new CPU mode with proper trapping
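
The theorem's condition is a plain subset test, which a short sketch can make concrete. The instruction lists below are a tiny illustrative sample, not a full ISA classification, and the function name `non_trapping` is an assumption of this example.

```shell
# non_trapping SENSITIVE_LIST PRIVILEGED_LIST
# Print every sensitive instruction that is NOT privileged. Empty output
# means sensitive is a subset of privileged, i.e. trap-and-emulate works.
non_trapping() {
    sensitive=$1
    privileged=$2
    for insn in $sensitive; do
        case " $privileged " in
            *" $insn "*) ;;                  # sensitive and privileged: it traps
            *) printf '%s\n' "$insn" ;;      # sensitive but unprivileged: the gap
        esac
    done
}

# A small sample in the spirit of pre-VT-x x86:
non_trapping "LGDT SGDT SIDT POPF HLT" "LGDT HLT"
```

For this sample the function prints SGDT, SIDT and POPF, the same kind of instructions the slide lists as the x86 problem cases.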

Hardware-Assisted Virtualization (VT-x)

┌─────────────────────────────────────────────┐
│                VMX Operation                │
│                                             │
│  ┌──────────┐               ┌────────────┐  │
│  │ VMX root │◄── VM exit ───│VMX non-root│  │
│  │ (host /  │               │ (guest VM) │  │
│  │ hyperv.) │── VM entry ──►│            │  │
│  └──────────┘               └────────────┘  │
│                                             │
│  VMCS (Virtual Machine Control Structure)   │
│  ┌───────────────────────────────────────┐  │
│  │ Guest-state area (regs, CRs, EFER)    │  │
│  │ Host-state area (return state)        │  │
│  │ VM-execution controls (what traps)    │  │
│  │ Exit reason + qualification           │  │
│  └───────────────────────────────────────┘  │
└─────────────────────────────────────────────┘
  • VMLAUNCH/VMRESUME → enter guest (VMX non-root)
  • VMEXIT → trap back to host (VMX root) on sensitive operations
  • VMCS → per-vCPU control structure (what to trap, guest/host state)
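
Whether a given host exposes these extensions can be checked from the CPU flags. The helper below is a minimal sketch: the name `virt_ext` is hypothetical, it only looks at the first `flags` line, and it accepts an alternative file path so it can be tried against a saved copy of `/proc/cpuinfo`.

```shell
# virt_ext [CPUINFO_FILE]: report which hardware virtualization extension
# the first CPU advertises (vmx = Intel VT-x, svm = AMD-V).
virt_ext() {
    cpuinfo=${1:-/proc/cpuinfo}
    flags=$(grep '^flags' "$cpuinfo" 2>/dev/null | head -n 1)
    case " $flags " in
        *" vmx "*) echo "VT-x" ;;
        *" svm "*) echo "AMD-V" ;;
        *)         echo "none" ;;
    esac
}

virt_ext
```

A result of "none" inside a VM usually means nested virtualization is not enabled on the host, not that the physical CPU lacks the feature.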

Key Benefits of Hypervisors — Summary

🔧 Efficiency

Maximizes hardware utilization — run multiple virtual servers on one physical machine (60–80% utilization vs. 5–15% bare metal)

🔒 Isolation & Security

If one VM crashes or is compromised, others remain unaffected — strong security boundary (especially Type 1)

🔄 Flexibility

Run different operating systems (Linux, Windows, BSD) simultaneously on the same hardware

💰 Cost Savings

Reduces physical hardware count → lower energy, cooling, space, and maintenance costs

☁️ Cloud Foundation

Hypervisors are the foundation of modern cloud services — AWS (Xen → Nitro/Firecracker), Google Cloud (KVM), Azure (Hyper-V)


Part 5

Virtualization vs. Containerization

(Preview for Session 2)


The Two Paradigms

      VIRTUALIZATION                    CONTAINERIZATION

┌──────┐ ┌──────┐ ┌──────┐     ┌──────┐ ┌──────┐ ┌──────┐
│App A │ │App B │ │App C │     │App A │ │App B │ │App C │
├──────┤ ├──────┤ ├──────┤     ├──────┤ ├──────┤ ├──────┤
│Bins/ │ │Bins/ │ │Bins/ │     │Bins/ │ │Bins/ │ │Bins/ │
│Libs  │ │Libs  │ │Libs  │     │Libs  │ │Libs  │ │Libs  │
├──────┤ ├──────┤ ├──────┤     └──┬───┘ └──┬───┘ └──┬───┘
│GuestOS││GuestOS││GuestOS│        │        │        │
└──┬───┘ └──┬───┘ └──┬───┘   ┌────┴────────┴────────┴────┐
   │        │        │       │     Container Runtime      │
┌──┴────────┴────────┴────┐  │   (Docker/containerd/CRI-O)│
│       Hypervisor        │  ├────────────────────────────┤
├─────────────────────────┤  │    Host OS Kernel (shared) │
│   Hardware              │  ├────────────────────────────┤
└─────────────────────────┘  │    Hardware                │
                             └────────────────────────────┘

Side-by-Side Comparison

Aspect | Virtualization | Containerization
Isolation | Full OS per VM (strong) | Shared kernel (process-level)
Resource Usage | Heavy — each VM has full OS | Lightweight — shared kernel
Performance | 1–5% overhead (Type 1) | Near-native (<1% overhead)
Startup Time | Seconds to minutes | Milliseconds to seconds
Image Size | GBs (full OS image) | MBs (only app + deps)
Portability | Less portable (OS-specific) | Highly portable (OCI images)
Density | 10–100 VMs per host | 100–1000+ containers per host
Security | Strong (hardware isolation) | Weaker (kernel shared)
Ecosystem | VMware, Hyper-V, KVM | Docker, Kubernetes, Podman
Best For | Multi-OS, legacy, strong isolation | Microservices, CI/CD, cloud-native

When to Use Which?

Choose Virtualization when:

  • Running different operating systems on one host
  • Need strong security isolation (multi-tenant, compliance)
  • Running legacy applications not designed for containers
  • Need full kernel control (custom kernel modules, drivers)
  • Compliance requires hardware-level separation

Choose Containerization when:

  • Building microservices architectures
  • Need fast scaling (autoscaling, burst capacity)
  • Want CI/CD pipeline integration (build → test → deploy)
  • Need maximum resource density (cost optimization)
  • Building cloud-native applications

🔑 Real-world answer: Both. Containers typically run inside VMs in production.


The Full Stack in Production

┌────────────────────────────────────────────────────┐
│              YOUR APPLICATIONS                     │
│   ┌──────────┐ ┌──────────┐ ┌──────────┐          │
│   │Container │ │Container │ │Container │  ...      │
│   └────┬─────┘ └────┬─────┘ └────┬─────┘          │
│        └──────┬──────┘            │                │
│        ┌──────┴───────────────────┴──────────┐     │
│        │   Kubernetes / Container Orchestrator │    │
│        ├─────────────────────────────────────┤     │
│        │   Container Runtime (containerd/CRI-O)│   │
│        ├─────────────────────────────────────┤     │
│        │   Linux Kernel (namespaces, cgroups) │    │
│        └─────────────────────────────────────┘     │
│                    ┌──────────┐                     │
│                    │    VM    │ ← VM per K8s node   │
│                    └────┬─────┘                     │
│               ┌─────────┴──────────┐                │
│               │   Hypervisor       │                │
│               │ (ESXi/KVM/Hyper-V) │                │
│               ├────────────────────┤                │
│               │ Physical Hardware  │                │
│               └────────────────────┘                │
└────────────────────────────────────────────────────┘

Next session: We'll deep-dive into what's inside that container & Kubernetes layer →


🧠 Session 1 — Key Takeaways

  1. Bare metal gives maximum performance but wastes resources and is slow to provision
  2. Hypervisors solve this by multiplexing hardware across isolated VMs
  3. Type 1 (bare metal) hypervisors are for production; Type 2 (hosted) for dev/test
  4. The Dispatcher → Allocator → Interpreter triad is the classic reference model for a hypervisor
  5. Hardware-assisted virtualization (VT-x/AMD-V) solved x86's virtualization gap
  6. Containers ≠ replacement for VMs — they complement each other
  7. In production: apps in containers, containers in VMs, VMs on hypervisors

📖 Session 1 — Recommended Reading


❓ Questions & Discussion

(10 minutes)

Discussion prompts:

  1. When would you choose bare metal over VMs?
  2. Why did cloud providers build on Type 1 hypervisors instead of containers alone?
  3. What's the security difference between VM isolation and container isolation?

See you in Session 2!

From Containers to Kubernetes

🐳 → ☸️


🐳 Session 2

From Containers to Kubernetes

Master Class — Container Orchestration & CRI

Duration: 90 minutes
Level: Intermediate → Advanced
Prerequisite: Session 1 (Bare Metal to Virtualization)


📋 Session 2 — Agenda

# | Topic | Time
1 | Linux Kernel Primitives for Containers | 15 min
2 | Containerization Deep Dive | 10 min
3 | Dockerfiles & Image Building | 15 min
4 | Container Runtimes: Docker, Podman, Buildx, CRI-O | 15 min
5 | Kubernetes Architecture | 15 min
6 | CRI — Container Runtime Interface | 10 min
7 | Bare Metal to K8s Cluster — Full Journey | 5 min
8 | Q&A / Discussion | 5 min

Part 1

Linux Kernel Primitives

The Foundation Containers Are Built On


Containers Are NOT a Kernel Feature

There is no "container" system call in Linux. A "container" is a user-space concept built from combining multiple independent kernel primitives.

The building blocks:

Primitive | Purpose | Year
chroot | Filesystem isolation | 1979
Namespaces | Resource visibility isolation | 2002–2020
cgroups (v1/v2) | Resource limits & accounting | 2007/2016
seccomp-bpf | System call filtering | 2012
Capabilities | Fine-grained privilege control | 1999 (POSIX)
AppArmor / SELinux | Mandatory access control | 2003/2000
OverlayFS | Layered filesystem (image layers) | 2014

A container = namespaces + cgroups + seccomp + capabilities + rootfs


Linux Namespaces — Visibility Isolation

Namespaces make a process think it's alone on the system.

Namespace | Flag | Isolates | Since
Mount | CLONE_NEWNS | Filesystem mount points | 2.4.19 (2002)
UTS | CLONE_NEWUTS | Hostname and domain name | 2.6.19 (2006)
IPC | CLONE_NEWIPC | System V IPC, POSIX MQs | 2.6.19 (2006)
PID | CLONE_NEWPID | Process IDs (init = PID 1) | 2.6.24 (2008)
Network | CLONE_NEWNET | Network stack, interfaces, routes | 2.6.24 (2008)
User | CLONE_NEWUSER | UID/GID mappings (rootless) | 3.8 (2013)
Cgroup | CLONE_NEWCGROUP | cgroup root visibility | 4.6 (2016)
Time | CLONE_NEWTIME | Boot and monotonic clocks | 5.6 (2020)
# Create a process with new PID + NET + MNT namespaces:
unshare --pid --net --mount --fork /bin/bash
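
To see which namespaces a process actually lives in, the symlinks under `/proc/<PID>/ns` can be read without any privileges. The helper below is a small sketch (the name `list_ns` is an assumption of this example); it takes an optional directory so it can also be pointed at another process, e.g. `/proc/1/ns`.

```shell
# list_ns [NS_DIR]: print "name -> type:[inode]" for each namespace link.
# Defaults to the namespaces of the current shell.
list_ns() {
    dir=${1:-/proc/$$/ns}
    for link in "$dir"/*; do
        [ -L "$link" ] || continue     # skip anything that is not a symlink
        printf '%s -> %s\n' "${link##*/}" "$(readlink "$link")"
    done
}

list_ns
```

Two processes share a namespace exactly when the identifier in brackets is the same, so running this before and inside the `unshare` shell above shows the pid, net and mnt entries changing while the rest stay put.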

Network Namespace — Deep Dive

Each network namespace has its own complete, isolated network stack.

┌─────────────────── Host Network Namespace ───────────────────┐
│                                                               │
│  eth0 (physical NIC)    docker0 (bridge)      veth-host-1    │
│  10.0.0.5               172.17.0.1            ─────────┐     │
│                              │                          │     │
│                              │                     ┌────┤     │
│                              │                     │veth│     │
│                              │                     │pair│     │
│                              │                     └────┤     │
│                                                         │     │
│  ┌──────────── Container Network Namespace ──────────┐  │     │
│  │                                                   │  │     │
│  │  eth0 (veth-container-1)     lo (loopback)        │  │     │
│  │  172.17.0.2                  127.0.0.1            │  │     │
│  │                                                   │  │     │
│  │  Routing table:   default via 172.17.0.1          │  │     │
│  │  iptables rules:  (independent)                   │  │     │
│  │  /proc/net/...:   (container's own view)          │  │     │
│  │                                                   │  │     │
│  └───────────────────────────────────────────────────┘  │     │
│                                                               │
└───────────────────────────────────────────────────────────────┘

PID Namespace — Process Isolation

Host PID Namespace:
  PID 1: systemd
  PID 100: dockerd
  PID 200: containerd
  PID 500: containerd-shim → (container process)
  PID 501: nginx (from host's view)

Container PID Namespace:
  PID 1: nginx  ← same process, different PID!
  PID 2: nginx worker
  PID 3: nginx worker

  # Inside container:
  $ ps aux
  PID  USER  COMMAND
  1    root  nginx: master process
  2    nginx nginx: worker process
  3    nginx nginx: worker process
  # Can't see host processes = isolation

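Container's PID 1 = the entrypoint process. If PID 1 dies → the container stops. Signal handling for PID 1 is special: the kernel only delivers signals that PID 1 has installed a handler for, so SIGTERM is ignored unless the entrypoint handles it (SIGKILL from the host still works).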
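
A shell entrypoint can cover the PID 1 case by installing its own SIGTERM handler and forwarding the signal to the real workload. The function below is a minimal sketch under that assumption; `run_as_init` is a hypothetical name, and production images more often rely on a small init such as the one injected by `docker run --init`.

```shell
# run_as_init CMD [ARGS...]
# Run CMD as a child, forward SIGTERM to it, and return its exit status.
run_as_init() {
    got_term=0
    child=
    # With no handler installed, SIGTERM sent to PID 1 of a PID namespace
    # is dropped by the kernel, so the container would not stop gracefully.
    trap 'got_term=1; [ -n "$child" ] && kill -TERM "$child" 2>/dev/null' TERM
    "$@" &
    child=$!
    status=0
    wait "$child" || status=$?
    if [ "$got_term" -eq 1 ]; then
        # The first wait was interrupted by the trap; wait again to reap
        # the child and collect its real exit status.
        status=0
        wait "$child" || status=$?
    fi
    trap - TERM
    return "$status"
}
```

Used as the last line of an entrypoint script (for example `run_as_init nginx -g 'daemon off;'`), this lets a stop request reach the workload instead of timing out into SIGKILL.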


cgroups — Resource Limits & Accounting

Control Groups limit, account for, and isolate resource usage (CPU, memory, I/O, network).

cgroups v2 hierarchy (modern):

/sys/fs/cgroup/
├── system.slice/              ← system services
├── user.slice/                ← user sessions
└── kubepods.slice/            ← Kubernetes pods
    ├── kubepods-burstable.slice/
    │   └── kubepods-burstable-pod<UID>.slice/
    │       ├── cri-containerd-<ID>.scope/
    │       │   ├── cpu.max          → "100000 100000" (100% of 1 CPU)
    │       │   ├── memory.max       → "536870912" (512 MiB)
    │       │   ├── memory.current   → "234881024" (current usage)
    │       │   ├── io.max           → "8:0 rbps=104857600" (100MB/s)
    │       │   └── pids.max         → "1024"

Key cgroup controllers:

Controller | Manages
cpu | CPU time (shares, quota, period)
memory | Memory limit, swap, OOM behavior
io | Block I/O bandwidth, IOPS
pids | Max number of processes
cpuset | Pin to specific CPUs/NUMA nodes
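
The raw values in `cpu.max` and `memory.max` are easier to reason about once converted. The two helpers below are a sketch with hypothetical names; they take the file contents as an argument, so in practice you would pass something like `"$(cat /sys/fs/cgroup/cpu.max)"` from inside a container.

```shell
# cpu_limit "QUOTA PERIOD": print the CPU limit in units of whole CPUs.
cpu_limit() {
    echo "$1" | awk '{
        if ($1 == "max") print "unlimited"
        else printf "%.2f\n", $1 / $2
    }'
}

# mem_limit_mib BYTES: print a memory.max value in MiB.
mem_limit_mib() {
    if [ "$1" = "max" ]; then
        echo "unlimited"
    else
        echo $(( $1 / 1048576 ))
    fi
}

cpu_limit "100000 100000"    # prints 1.00
mem_limit_mib 536870912      # prints 512
```

The quota and period are both in microseconds, so a quota of 50000 against a period of 100000 means half a CPU.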

seccomp-bpf — System Call Filtering

seccomp restricts which system calls a process can make.

Default Docker seccomp profile blocks ~44 of ~330+ syscalls:

BLOCKED (dangerous):                  ALLOWED (safe):
├── reboot()                         ├── read() / write()
├── kexec_load()                     ├── open() / close()
├── mount() / umount2()              ├── mmap() / mprotect()
├── swapon() / swapoff()             ├── socket() / connect()
├── init_module() / delete_module()  ├── fork() / clone()
├── acct()                           ├── execve()
├── settimeofday()                   ├── getpid() / getuid()
├── syslog()                         ├── stat() / fstat()
├── ptrace() *                       └── ... (most normal ops)
└── bpf() *

* Conditional: the outcome depends on kernel version or granted capabilities.

Defense in depth: Even if an attacker escapes namespace isolation, seccomp blocks dangerous kernel interactions.
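A seccomp profile is an ordinary JSON document. The sketch below builds a tiny allowlist in the shape Docker's profile files use (`defaultAction` plus a `syscalls` list) and evaluates syscall names against it. The syscall set is an illustrative assumption and far smaller than any real profile, and argument filters are ignored.

```python
import json


def build_profile(allowed_syscalls):
    """Build an allowlist profile: deny everything, then allow the named calls."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {"names": sorted(allowed_syscalls), "action": "SCMP_ACT_ALLOW"}
        ],
    }


def is_allowed(profile, syscall):
    """Evaluate one syscall name against the profile (names only, no arg rules)."""
    for rule in profile["syscalls"]:
        if syscall in rule["names"]:
            return rule["action"] == "SCMP_ACT_ALLOW"
    return profile["defaultAction"] == "SCMP_ACT_ALLOW"


# Illustrative allowlist; a real workload needs many more syscalls than this.
profile = build_profile({"read", "write", "openat", "close", "exit_group"})
profile_json = json.dumps(profile, indent=2)
```

Written to a file, a profile like this can be passed to Docker with `--security-opt seccomp=<file>`, although a process restricted this tightly would probably fail to start.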


Capabilities — Fine-Grained Privileges

Linux splits the old root / non-root binary into ~40 individual capabilities.

Traditional model:        Capabilities model:
  UID 0 = ALL power         CAP_NET_BIND_SERVICE → bind < 1024
  UID !0 = no power         CAP_NET_RAW          → raw sockets
                            CAP_SYS_ADMIN        → mount, bpf, ...
                            CAP_SYS_PTRACE       → ptrace
                            CAP_DAC_OVERRIDE     → bypass file perms
                            CAP_CHOWN            → change file owner
                            ... (~40 total)

Docker default capability set (whitelist):

GRANTED (14):                       DROPPED (everything else):
  CAP_CHOWN                          CAP_SYS_ADMIN
  CAP_DAC_OVERRIDE                   CAP_NET_ADMIN
  CAP_FSETID                         CAP_SYS_PTRACE
  CAP_FOWNER                         CAP_SYS_MODULE
  CAP_NET_BIND_SERVICE               CAP_SYS_RAWIO
  CAP_SETGID / CAP_SETUID            CAP_SYS_TIME
  CAP_KILL / CAP_AUDIT_WRITE         ...
  CAP_SETPCAP / CAP_SETFCAP
  CAP_MKNOD / CAP_SYS_CHROOT
  CAP_NET_RAW (*)

(*) Docker still grants CAP_NET_RAW by default; some other runtimes, such as CRI-O, drop it by default.
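The capability set of a running process is visible as a hex bitmask in the `CapEff:` line of `/proc/<pid>/status`. This sketch decodes such a mask into names using the bit numbers from the kernel's capability header. The sample mask `00000000a80425fb` is an assumption: it is the value commonly seen for a container started with Docker's defaults.

```python
# Capability names indexed by bit number (0 = CAP_CHOWN ... 40 = CAP_CHECKPOINT_RESTORE).
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]


def decode_caps(hex_mask):
    """Turn a hex capability mask into the list of capability names it contains."""
    mask = int(hex_mask, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


def effective_caps(status_text):
    """Extract and decode the CapEff line from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_caps(line.split()[1])
    return []
```

Decoding the sample mask yields 14 names, including `CAP_NET_RAW` and `CAP_SYS_CHROOT` but not `CAP_SYS_ADMIN`.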

Part 2

Containerization Deep Dive


What is a Container?

A container is an isolated process (or group of processes) running on a shared kernel, with its own filesystem, network, and process view.

    Container = isolated process on shared kernel

    ┌─ Namespace isolation ──────────────────────┐
    │                                             │
    │  PID namespace  → own PID 1                 │
    │  NET namespace  → own eth0, routes, iptables│
    │  MNT namespace  → own root filesystem       │
    │  UTS namespace  → own hostname              │
    │  USER namespace → own uid mapping           │
    │                                             │
    │  + cgroup limits (CPU, mem, I/O)            │
    │  + seccomp filter (syscall whitelist)       │
    │  + capabilities (fine-grained privs)        │
    │  + OverlayFS (layered rootfs)               │
    │                                             │
    └─────────────────────────────────────────────┘
         │
         ▼
    Host Linux Kernel (shared by ALL containers)

Container Images — Layered Filesystem

A container image is a stack of read-only layers; each running container adds a thin read-write layer on top.

┌─────────────────────────────────────────────┐
│  Writable Container Layer (ephemeral)       │  ← changes here
├─────────────────────────────────────────────┤
│  Layer 4: COPY app.py /app/                 │  ← your code
├─────────────────────────────────────────────┤
│  Layer 3: RUN pip install flask             │  ← dependencies
├─────────────────────────────────────────────┤
│  Layer 2: RUN apt-get install python3       │  ← runtime
├─────────────────────────────────────────────┤
│  Layer 1: Ubuntu 24.04 base image           │  ← base OS
└─────────────────────────────────────────────┘

Storage driver: OverlayFS (overlay2)
  - Lower layers: read-only, shared between containers
  - Upper layer: read-write, container-specific (copy-on-write)
  - Merged view: union mount presented to the container

Efficiency: 100 containers from the same image share the same base layers — only the writable layer is unique per container.
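The lookup rules of a union mount can be modelled in a few lines. This is a toy sketch using dictionaries as layers, not how OverlayFS is implemented: real copy-on-write copies the file up on first write, and whiteouts are special files rather than a Python marker.

```python
WHITEOUT = object()  # marker meaning "deleted in this layer"


def lookup(path, upper, lowers):
    """Resolve a path in the merged view: upper layer first, then lowers top-down."""
    for layer in [upper, *lowers]:
        if path in layer:
            content = layer[path]
            return None if content is WHITEOUT else content
    return None


def write(path, content, upper):
    """Writes always land in the container's own upper layer."""
    upper[path] = content


def delete(path, upper):
    """Deleting hides the lower copy with a whiteout; the image layer is untouched."""
    upper[path] = WHITEOUT
```

Two containers that share the same `lowers` list but have separate `upper` dictionaries see each other's image files but never each other's changes, which is the sharing described above.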


OCI Standards — The Universal Contract

The Open Container Initiative (OCI) defines open standards so images and runtimes are interchangeable.

Three specifications:

| Spec | Defines | Key Points |
|---|---|---|
| Image Spec | Image format & layout | Layers, manifests, config JSON |
| Runtime Spec | How to run a container | config.json → namespaces, mounts, hooks |
| Distribution Spec | How to push/pull images | Registry API (Docker Hub, GHCR, ECR) |

OCI Image Manifest:
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": { "digest": "sha256:abc...", "size": 7023 },
  "layers": [
    { "digest": "sha256:def...", "size": 32654 },   ← base
    { "digest": "sha256:ghi...", "size": 16724 },   ← deps
    { "digest": "sha256:jkl...", "size": 73109 }    ← app
  ]
}
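Every `digest` in a manifest is the SHA-256 of the referenced blob, which is what lets a client verify what it downloaded. A small sketch of that check, with helper names chosen here for illustration:

```python
import hashlib


def oci_digest(blob):
    """Content address of a blob in the 'sha256:<hex>' form used by manifests."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def verify_blob(descriptor, blob):
    """Check a downloaded blob against the digest and size in its descriptor."""
    return (
        descriptor["size"] == len(blob)
        and descriptor["digest"] == oci_digest(blob)
    )


def total_layer_size(manifest):
    """Sum of the layer sizes listed in the manifest, in bytes."""
    return sum(layer["size"] for layer in manifest["layers"])
```

Applied to the manifest above, `total_layer_size` adds the three layer sizes to 122487 bytes.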

Part 3

Dockerfiles & Image Building


Dockerfile — Anatomy

# ─── Build stage ──────────────────────────────────
FROM golang:1.22-alpine AS builder
WORKDIR /src

# Cache dependencies separately from source code
COPY go.mod go.sum ./
RUN go mod download

# Copy source and build
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o /app/server ./cmd/server

# ─── Production stage ─────────────────────────────
FROM gcr.io/distroless/static-debian12:nonroot

COPY --from=builder /app/server /server

EXPOSE 8080
USER nonroot:nonroot

ENTRYPOINT ["/server"]

Key principles:

  • Multi-stage builds → small final images (no compiler in production)
  • Layer caching → put rarely-changing layers first (go.mod before source)
  • Non-root user → security best practice
  • Distroless base → minimal attack surface (no shell, no package manager)

Dockerfile Best Practices

DO ✅

# Pin versions for reproducibility
FROM python:3.12.3-slim-bookworm

# Combine RUN commands to reduce layers
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl && \
    rm -rf /var/lib/apt/lists/*

# Copy dependency file first for caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Use .dockerignore to exclude unnecessary files
# Run as non-root
USER 1000:1000

DON'T ❌

FROM ubuntu:latest           # unpinned tag
RUN apt-get update           # separate from install (cache bug)
RUN apt-get install python3  # missing -y, missing cleanup
COPY . .                     # copies everything including .git
USER root                    # running as root in production

Multi-Architecture Builds with docker buildx

# Create a multi-platform builder
docker buildx create --name mybuilder --use --bootstrap

# Build for multiple architectures simultaneously
docker buildx build \
  --platform linux/amd64,linux/arm64,linux/arm/v7 \
  --tag myregistry/myapp:1.0.0 \
  --push \
  .

How it works:

docker buildx build --platform linux/amd64,linux/arm64
         │
         ▼
┌─────────────────────────────────────────┐
│  BuildKit (Moby buildkitd)              │
│                                         │
│  amd64: native build on x86_64 host    │
│  arm64: QEMU emulation (or remote)      │
│                                         │
│  → Multi-arch manifest (fat manifest)   │
│    ├── linux/amd64 → sha256:abc...      │
│    └── linux/arm64 → sha256:def...      │
└─────────────────────────────────────────┘
         │
         ▼
    Client pulls the manifest list and selects
    the image matching its own platform
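Platform selection from a multi-arch manifest is a simple match on the `platform` object of each entry. The sketch below assumes an index shaped like an OCI image index; the digests are placeholders, not real images.

```python
def select_manifest(index, os_name, architecture, variant=None):
    """Return the digest of the first entry matching the requested platform."""
    for entry in index["manifests"]:
        platform = entry.get("platform", {})
        if platform.get("os") != os_name:
            continue
        if platform.get("architecture") != architecture:
            continue
        if variant is not None and platform.get("variant") != variant:
            continue
        return entry["digest"]
    return None  # no image published for this platform


# Placeholder index for the three platforms built in the example above.
example_index = {
    "manifests": [
        {"digest": "sha256:aaa", "platform": {"os": "linux", "architecture": "amd64"}},
        {"digest": "sha256:bbb", "platform": {"os": "linux", "architecture": "arm64"}},
        {"digest": "sha256:ccc",
         "platform": {"os": "linux", "architecture": "arm", "variant": "v7"}},
    ]
}
```

A pull on an arm64 machine would resolve to `sha256:bbb` here, while a platform with no entry yields `None`, which corresponds to a "no matching manifest" error.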

Part 4

Container Runtimes

Docker, Podman, containerd, CRI-O


The Container Runtime Landscape

                                HIGH-LEVEL
                          (Image mgmt + Build + API)
                    ┌──────────────────────────────────┐
                    │  Docker Engine  │  Podman  │ Nerd │
                    │  (dockerd)      │          │(nerdctl)
                    └───────┬─────────┴────┬─────┴──┬──┘
                            │              │        │
                        MID-LEVEL                    
                    (Container lifecycle management)  
                    ┌───────┴──────────────┴────────┴──┐
                    │   containerd       │    CRI-O     │
                    │                    │              │
                    │  (Docker's core)   │  (K8s native)│
                    └───────┬────────────┴──────┬───────┘
                            │                   │
                        LOW-LEVEL (OCI Runtime)  
                    ┌───────┴───────────────────┴───────┐
                    │  runc   │  crun   │  kata  │ gVisor│
                    │ (ref    │ (fast,  │(micro  │(user  │
                    │  impl)  │  C)     │ VMs)   │space) │
                    └──────────────────────────────────┘

Docker Engine — Architecture

                     docker CLI
                        │
                   REST API (unix socket / TCP)
                        │
                        ▼
               ┌─────────────────┐
               │    dockerd      │  ← Docker daemon
               │  (Docker Engine)│     Image builds, networking,
               │                 │     volumes, orchestration
               └────────┬────────┘
                        │ gRPC
                        ▼
               ┌─────────────────┐
               │   containerd    │  ← Container runtime
               │                 │     Lifecycle, snapshots,
               │                 │     image pull/push
               └────────┬────────┘
                        │ OCI runtime exec
                        ▼
               ┌─────────────────┐
               │   runc          │  ← OCI runtime
               │                 │     clone() + execve()
               │                 │     namespaces + cgroups
               └─────────────────┘

Docker = dockerd + containerd + runc — three layers. Kubernetes talks to containerd directly, skipping dockerd.


Podman — Daemonless Alternative

                     podman CLI
                        │
                    (direct, no daemon)
                        │
                        ▼
               ┌─────────────────┐
               │    Podman       │  ← Library (libpod)
               │   (no daemon!)  │     Fork + exec model
               │                 │     Rootless by default
               └────────┬────────┘
                        │
                        ▼
               ┌─────────────────┐
               │   conmon        │  ← Container monitor
               │                 │     Holds stdio, exit code
               └────────┬────────┘
                        │
                        ▼
               ┌─────────────────┐
               │ crun (or runc)  │  ← OCI runtime
               └─────────────────┘

Key differences from Docker:

  • No daemon: each podman call is a direct process
  • Rootless by default: runs entirely in user namespaces
  • Pod-native: podman pod directly models K8s pods
  • Docker CLI compatible: alias docker=podman works
  • systemd integration: podman generate systemd

Docker vs Podman vs Buildah vs Skopeo

| Tool | Role | Daemon? | Root? |
|---|---|---|---|
| Docker | Build + Run + Push/Pull | Yes (dockerd) | Yes (default) |
| Podman | Run containers (+ build) | No | No (rootless) |
| Buildah | Build OCI images | No | No (rootless) |
| Skopeo | Copy/inspect images between registries | No | No |
| Buildx | Docker multi-arch builder (BuildKit) | Yes (BuildKit) | Yes |

The Red Hat container toolchain:

Building images:     Buildah (or Podman build)
Running containers:  Podman
Moving images:       Skopeo
Kubernetes runtime:  CRI-O
OCI runtime:         crun (or runc)

The Docker toolchain:

Building images:     docker build (or docker buildx)
Running containers:  docker run (dockerd → containerd → runc)
Moving images:       docker push/pull
Kubernetes runtime:  containerd
OCI runtime:         runc

OCI Runtimes — The Lowest Layer

| Runtime | Language | Key Feature |
|---|---|---|
| runc | Go | Reference implementation, Docker default |
| crun | C | Faster startup and lower memory use than runc, Podman default |
| Kata Containers | Go | Runs each container in a lightweight VM (hardware isolation) |
| gVisor (runsc) | Go | User-space kernel that intercepts syscalls (Google) |
| Firecracker | Rust | MicroVM backend for AWS Lambda / Fargate |
| youki | Rust | Rust rewrite of runc |

Security spectrum:

Less isolation ◄──────────────────────────────► More isolation
   (faster)                                      (slower)

  runc/crun         gVisor            Kata / Firecracker
  namespaces        user-space        micro-VMs
  + cgroups         kernel            + hardware isolation
  (shared kernel)   (syscall proxy)   (separate kernel per container)

Part 5

Kubernetes Architecture


What is Kubernetes?

Kubernetes (K8s) is an open-source container orchestration platform that automates deployment, scaling, and management of containerized applications.

What K8s manages:

  • Scheduling — which node runs which container
  • Scaling — horizontal pod autoscaling (HPA)
  • Networking — service discovery, load balancing, ingress
  • Storage — persistent volumes, storage classes
  • Self-healing — restart failed containers, reschedule on healthy nodes
  • Rolling updates — zero-downtime deployments
  • Config & secrets — centralized configuration management

What K8s is NOT:

  • Not a PaaS (no app-level framework)
  • Not a CI/CD system (but integrates with them)
  • Not a VM orchestrator (that's OpenStack, vSphere)

Kubernetes Architecture — The Big Picture

┌─────────────────── Control Plane ───────────────────┐
│                                                      │
│  ┌──────────┐  ┌──────────────┐  ┌───────────────┐  │
│  │ API      │  │ etcd         │  │ Scheduler     │  │
│  │ Server   │  │ (consensus   │  │ (where to     │  │
│  │ (REST)   │  │  store)      │  │  place pods)  │  │
│  └────┬─────┘  └──────────────┘  └───────────────┘  │
│       │                                              │
│  ┌────┴─────────────────┐  ┌──────────────────────┐  │
│  │ Controller Manager   │  │ Cloud Controller Mgr │  │
│  │ (reconciliation      │  │ (cloud-specific:     │  │
│  │  loops)              │  │  LB, nodes, routes)  │  │
│  └──────────────────────┘  └──────────────────────┘  │
└──────────────────────────────────────────────────────┘
              │ kubectl / API calls
              ▼
┌─────────────── Worker Nodes ────────────────────────┐
│  Node 1              Node 2              Node 3     │
│  ┌──────────┐       ┌──────────┐       ┌────────┐  │
│  │ kubelet  │       │ kubelet  │       │kubelet │  │
│  │ CRI ─────┤       │ CRI ─────┤       │CRI ───┤  │
│  │ runtime  │       │ runtime  │       │runtime │  │
│  │ kube-    │       │ kube-    │       │kube-   │  │
│  │  proxy   │       │  proxy   │       │ proxy  │  │
│  └──────────┘       └──────────┘       └────────┘  │
└─────────────────────────────────────────────────────┘

Control Plane Components

API Server (kube-apiserver)

  • Front door for all K8s operations
  • RESTful API — all communication goes through here
  • Authentication, authorization (RBAC), admission control
  • Stores/retrieves state from etcd

etcd

  • Distributed key-value store (Raft consensus)
  • Single source of truth for all cluster state
  • All K8s objects stored here (pods, services, secrets, configs)

Scheduler (kube-scheduler)

  • Decides which node a new pod should run on
  • Considers: resources, affinity, anti-affinity, taints, tolerations
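The scheduler's decision can be pictured as a filter step followed by a score step. This is a simplified sketch under stated assumptions: nodes and pods are plain dictionaries, taints are matched by name only, and the single scoring rule (most CPU left) is hypothetical, whereas the real scheduler combines many plugins.

```python
def feasible(node, pod):
    """Filter step: enough free resources and every node taint tolerated."""
    fits = (
        node["free_cpu_m"] >= pod["cpu_m"]
        and node["free_mem_mi"] >= pod["mem_mi"]
    )
    tolerated = set(node.get("taints", [])) <= set(pod.get("tolerations", []))
    return fits and tolerated


def score(node, pod):
    """Score step: prefer the node with the most CPU left after placement."""
    return node["free_cpu_m"] - pod["cpu_m"]


def schedule(nodes, pod):
    """Return the chosen node name, or None if the pod has to stay Pending."""
    candidates = [n for n in nodes if feasible(n, pod)]
    if not candidates:
        return None
    return max(candidates, key=lambda n: score(n, pod))["name"]
```

A pod without a matching toleration never lands on a tainted node, no matter how much capacity that node has free.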

Controller Manager (kube-controller-manager)

  • Runs reconciliation loops (controllers)
  • Deployment controller, ReplicaSet controller, Node controller, etc.
  • Watches desired state vs actual state → takes corrective action
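A reconciliation loop compares desired state with actual state and emits the actions that close the gap. The sketch below does this for a replica count; the `web-N` naming scheme and the `apply` helper are assumptions for illustration, not how the ReplicaSet controller names or deletes pods.

```python
def reconcile(desired_replicas, running_pods):
    """Return the actions that move the actual state toward the desired state."""
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        return [("create", None)] * diff
    if diff < 0:
        # Remove the surplus, taking pods in sorted-name order for determinism.
        return [("delete", name) for name in sorted(running_pods)[:-diff]]
    return []


def apply(actions, running_pods, name_prefix="web"):
    """Apply actions to a copy of the pod set and return the new actual state."""
    pods = set(running_pods)
    counter = 0
    for verb, name in actions:
        if verb == "create":
            # Find the first unused name of the form <prefix>-<n>.
            while f"{name_prefix}-{counter}" in pods:
                counter += 1
            pods.add(f"{name_prefix}-{counter}")
        else:
            pods.discard(name)
    return pods
```

Running `reconcile` again on the state returned by `apply` produces an empty action list, which is the steady state a controller is always working toward.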

Worker Node Components

kubelet

  • Agent running on every node
  • Receives pod specs from API server
  • Calls the Container Runtime via CRI to start/stop containers
  • Reports node and pod status back to control plane
  • Manages liveness/readiness probes

Container Runtime

  • Actually runs the containers
  • Must implement the CRI (Container Runtime Interface)
  • Options: containerd, CRI-O, or Docker Engine via the external cri-dockerd adapter (the built-in dockershim was removed in v1.24)

kube-proxy

  • Manages network rules on each node
  • Implements Kubernetes Services (ClusterIP, NodePort, LoadBalancer)
  • Modes: iptables (default), IPVS (high-performance), nftables (new)
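In iptables mode, load balancing across a Service's endpoints is done with a chain of rules that each match with a fixed probability. The sketch below works out those probabilities so that every endpoint receives an equal share; the function names are chosen here for illustration.

```python
from fractions import Fraction


def rule_probabilities(endpoint_count):
    """Per-rule match probability: 1/n, 1/(n-1), ..., 1 for n endpoints."""
    return [Fraction(1, endpoint_count - i) for i in range(endpoint_count)]


def overall_shares(probabilities):
    """Chance that a connection is taken by each rule after falling through earlier ones."""
    shares, remaining = [], Fraction(1)
    for p in probabilities:
        shares.append(remaining * p)
        remaining *= 1 - p
    return shares
```

With three endpoints the rules match with probability 1/3, 1/2 and 1, which gives each endpoint exactly one third of new connections.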

The Pod — Kubernetes' Atomic Unit

┌─────────────────── Pod ──────────────────────┐
│                                               │
│  Shared:                                      │
│  ├── Network namespace (same IP, localhost)   │
│  ├── IPC namespace                            │
│  ├── Volumes (shared storage)                 │
│  └── (optionally) PID namespace               │
│                                               │
│  ┌─────────────┐  ┌──────────────────┐        │
│  │ Container 1 │  │ Container 2      │        │
│  │ (main app)  │  │ (sidecar/proxy)  │        │
│  │             │  │                  │        │
│  │ Port 8080   │  │ Port 15001       │        │
│  └─────────────┘  └──────────────────┘        │
│                                               │
│  ┌─────────────────────────────┐              │
│  │ Init Container(s)          │ ← run first  │
│  │ (setup, migration, etc.)   │   then exit  │
│  └─────────────────────────────┘              │
│                                               │
│  Pod IP: 10.244.1.5 (all containers share)    │
│  Node: worker-02                              │
└───────────────────────────────────────────────┘

All containers in a pod share the same network namespace — they can reach each other on localhost.


Part 6

CRI — Container Runtime Interface

The Bridge Between kubelet and Containers


Why CRI Exists

The Docker Problem (pre-CRI):

Before CRI (K8s < 1.5):

  kubelet ──── Docker-specific code ──── dockerd ──── containerd ──── runc

  Problems:
  ✗ Kubelet was hardcoded to Docker's API
  ✗ Adding a new runtime = modifying kubelet source code
  ✗ Docker daemon had features K8s didn't need (build, swarm, etc.)
  ✗ Extra layer of indirection (kubelet→dockerd→containerd→runc)

The CRI Solution (K8s 1.5+, stable 1.26+):

After CRI:

  kubelet ──── CRI (gRPC) ──── containerd ──── runc
                    or
  kubelet ──── CRI (gRPC) ──── CRI-O     ──── runc/crun

  Benefits:
  ✓ kubelet is runtime-agnostic
  ✓ Any CRI-compliant runtime works
  ✓ Direct path — no unnecessary Docker daemon
  ✓ Kubernetes removed dockershim in v1.24

CRI — The gRPC Interface

CRI defines a gRPC protocol with two services:

1. RuntimeService — Container Lifecycle

service RuntimeService {
    // Sandbox (pod-level) operations
    rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse);
    rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse);
    rpc RemovePodSandbox(RemovePodSandboxRequest) returns (RemovePodSandboxResponse);

    // Container operations
    rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse);
    rpc StartContainer(StartContainerRequest) returns (StartContainerResponse);
    rpc StopContainer(StopContainerRequest) returns (StopContainerResponse);
    rpc RemoveContainer(RemoveContainerRequest) returns (RemoveContainerResponse);

    // Exec / Attach / Port-forward
    rpc ExecSync(ExecSyncRequest) returns (ExecSyncResponse);
    rpc Exec(ExecRequest) returns (ExecResponse);
    rpc Attach(AttachRequest) returns (AttachResponse);
    rpc PortForward(PortForwardRequest) returns (PortForwardResponse);
}

2. ImageService — Image Management

service ImageService {
    rpc PullImage(PullImageRequest) returns (PullImageResponse);
    rpc RemoveImage(RemoveImageRequest) returns (RemoveImageResponse);
    rpc ListImages(ListImagesRequest) returns (ListImagesResponse);
    rpc ImageStatus(ImageStatusRequest) returns (ImageStatusResponse);
}

CRI-O — Purpose-Built for Kubernetes

CRI-O is a lightweight CRI implementation that does one thing: run containers for Kubernetes.

┌───────────────── kubelet ──────────────────┐
│                                             │
│  "I need a pod with nginx:1.27 container"  │
│                                             │
└──────────────────┬──────────────────────────┘
                   │ CRI gRPC (unix socket)
                   ▼
┌───────────────── CRI-O ───────────────────┐
│                                            │
│  1. Pull image from registry               │
│  2. Create pod sandbox (pause container)   │
│  3. Set up networking (CNI plugin call)    │
│  4. Create container in sandbox            │
│  5. Invoke OCI runtime (runc/crun)         │
│  6. Monitor via conmon                     │
│                                            │
│  Scope: ONLY what K8s needs               │
│  No build, no docker CLI, no swarm        │
│                                            │
└──────────────────┬─────────────────────────┘
                   │ OCI spec
                   ▼
┌───────────────── runc / crun ─────────────┐
│  clone() → unshare() → pivot_root()       │
│  → execve() the container entrypoint      │
└────────────────────────────────────────────┘

CRI-O vs containerd — Comparison

| Feature | CRI-O | containerd |
|---|---|---|
| Purpose | K8s-only CRI runtime | General-purpose container runtime |
| Versioning | Matches K8s versions (1.30.x) | Independent releases |
| Scope | Minimal: CRI + OCI | Broader: CRI + Docker + others |
| Used by Docker? | No | Yes (Docker's core runtime) |
| Default in | OpenShift | GKE, EKS, AKS, K3s/RKE2 (Rancher) |
| Image pull | containers/image library | Own image pull implementation |
| Networking | CNI plugins | CNI plugins |
| Storage | containers/storage | Own snapshotter framework |
| Build images? | No | No (but nerdctl can) |
| Configuration | Drop-in config files | TOML config |
| OCI runtimes | runc, crun, Kata, gVisor | runc, gVisor, Kata |

Rule of thumb: CRI-O = lean & K8s-only. containerd = versatile & widely adopted.


The Full Pod Start Sequence

kubectl apply -f pod.yaml
    │
    ▼
API Server → stores in etcd
    │
    ▼
Scheduler → assigns to Node-2
    │ (binding written to etcd)
    ▼
kubelet on Node-2 (watches API server)
    │
    ▼ CRI: RunPodSandbox()
Container Runtime (CRI-O / containerd)
    │
    ├── 1. Create pod sandbox (pause container with new namespaces)
    ├── 2. Call CNI plugin → allocate IP, set up veth pair
    ├── 3. CRI: CreateContainer() → prepare rootfs (overlay mount)
    ├── 4. CRI: StartContainer() → invoke OCI runtime
    │       └── runc/crun: clone(NEWNS|NEWPID|NEWNET|...)
    │                      → pivot_root() → execve("nginx")
    ├── 5. conmon (CRI-O) or containerd-shim (containerd) monitors container stdio + exit
    └── 6. kubelet reports pod status → API server → etcd

Total time: 1-3 seconds (for a cached image)
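The order of CRI calls in the sequence above can be traced with a stand-in runtime. `FakeRuntime` is a hypothetical in-memory class that only records calls and does not talk to a real CRI socket; the method names mirror the CRI RPCs, and the image pull is shown as its own step before container creation.

```python
class FakeRuntime:
    """Hypothetical stand-in for a CRI runtime; records each call it receives."""

    def __init__(self):
        self.calls = []
        self._next_id = 0

    def _new_id(self, prefix):
        self._next_id += 1
        return f"{prefix}-{self._next_id}"

    def run_pod_sandbox(self, pod_name):
        self.calls.append("RunPodSandbox")
        return self._new_id("sandbox")

    def pull_image(self, image):
        self.calls.append("PullImage")
        return image

    def create_container(self, sandbox_id, image):
        self.calls.append("CreateContainer")
        return self._new_id("container")

    def start_container(self, container_id):
        self.calls.append("StartContainer")


def start_pod(runtime, pod_name, images):
    """Drive the runtime the way the kubelet does: sandbox first, then containers."""
    sandbox_id = runtime.run_pod_sandbox(pod_name)
    started = []
    for image in images:
        runtime.pull_image(image)
        container_id = runtime.create_container(sandbox_id, image)
        runtime.start_container(container_id)
        started.append(container_id)
    return sandbox_id, started
```

The sandbox is created once per pod, while the pull, create and start steps repeat for each container in it.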

Part 7

Bare Metal to K8s — The Full Journey


The Complete Evolution

LEVEL 0: BARE METAL
┌──────────────────────────┐
│  App A     App B    App C│  1 server = 1 (or few) apps
│  ════════════════════════│  Low utilization, slow provisioning
│  Host OS (Linux)         │
│  Physical Hardware       │
└──────────────────────────┘
         │ need isolation + better utilization
         ▼
LEVEL 1: VIRTUALIZATION
┌──────────────────────────┐
│ ┌──VM──┐ ┌──VM──┐       │  Hardware-level isolation
│ │App A │ │App B │  ...   │  Full OS per workload
│ │OS    │ │OS    │        │  Minutes to provision
│ └──────┘ └──────┘        │
│ Hypervisor (KVM/ESXi)    │
│ Physical Hardware        │
└──────────────────────────┘
         │ need faster scaling + less overhead
         ▼
LEVEL 2: CONTAINERIZATION
┌──────────────────────────┐
│ ┌────┐ ┌────┐ ┌────┐    │  Process-level isolation
│ │ A  │ │ B  │ │ C  │... │  Shared kernel
│ └────┘ └────┘ └────┘    │  Seconds to provision
│ Container Runtime        │
│ Host OS (Linux Kernel)   │
│ (VM or Bare Metal)       │
└──────────────────────────┘
         │ need orchestration at scale
         ▼
LEVEL 3: KUBERNETES
┌──────────────────────────┐
│ K8s Control Plane        │  Automated scheduling, scaling,
│ ┌─Node─┐ ┌─Node─┐       │  self-healing, networking,
│ │Pod Pod│ │Pod Pod│ ...  │  service discovery, config mgmt
│ │CRI   │ │CRI   │       │  Declarative desired-state model
│ └──────┘ └──────┘        │
│ Infrastructure (VMs/BM)  │
└──────────────────────────┘

Building a K8s Cluster from Bare Metal Linux

Step-by-step:

1. PREPARE THE NODE (bare metal Linux)
   ├── Disable swap:           swapoff -a
   ├── Load kernel modules:    overlay, br_netfilter
   ├── Set sysctl:             net.bridge.bridge-nf-call-iptables = 1
   │                           net.ipv4.ip_forward = 1
   └── Install container runtime (choose one):
       ├── containerd (+ CNI plugins)
       └── CRI-O

2. INSTALL KUBERNETES COMPONENTS
   ├── kubeadm  (cluster bootstrapper)
   ├── kubelet  (node agent)
   └── kubectl  (CLI client)

3. INITIALIZE CONTROL PLANE (first node)
   └── kubeadm init --pod-network-cidr=10.244.0.0/16

4. INSTALL CNI PLUGIN (pod networking)
   └── kubectl apply -f calico.yaml  (or Cilium, Flannel, ...)

5. JOIN WORKER NODES
   └── kubeadm join <control-plane>:6443 --token <token> ...

6. VERIFY
   └── kubectl get nodes  →  Ready, Ready, Ready

The Network Stack — What Actually Happens

Internet
    │
    ▼
┌─────────────── Physical NIC (eth0) ─────────────┐
│  IP: 10.0.0.100                                  │
│                                                  │
│  kube-proxy (iptables/IPVS rules)                │
│  ├── NodePort 30080 → Service ClusterIP → Pod IP │
│  └── LoadBalancer → External IP → Pods           │
│                                                  │
│  ┌──── CNI (Calico/Cilium) ─────────────────┐   │
│  │  Pod Network: 10.244.0.0/16              │   │
│  │                                           │   │
│  │  ┌── Pod A ──┐     ┌── Pod B ──┐         │   │
│  │  │ eth0      │     │ eth0      │         │   │
│  │  │10.244.1.5 │←───→│10.244.1.6 │         │   │
│  │  └───────────┘     └───────────┘         │   │
│  │     veth             veth                 │   │
│  │       └──── bridge/tunnel/eBPF ────┘      │   │
│  └───────────────────────────────────────────┘   │
│                                                  │
│  Cross-node: VXLAN / IPIP / BGP / WireGuard      │
└──────────────────────────────────────────────────┘
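The addressing in the diagram can be checked with the standard `ipaddress` module. This sketch assumes the cluster CIDR is carved into one /24 per node, which matches the default node mask of the built-in node IPAM; CNI plugins with their own IPAM may allocate differently.

```python
import ipaddress


def node_subnets(cluster_cidr, node_prefix=24):
    """Split the pod network into per-node subnets (per-node /24 is an assumption)."""
    return list(ipaddress.ip_network(cluster_cidr).subnets(new_prefix=node_prefix))


def subnet_for(pod_ip, subnets):
    """Return the node subnet that contains the given pod IP, or None."""
    addr = ipaddress.ip_address(pod_ip)
    for net in subnets:
        if addr in net:
            return net
    return None


def same_node_subnet(ip_a, ip_b, subnets):
    """True when both pod IPs fall into the same per-node subnet."""
    net_a = subnet_for(ip_a, subnets)
    return net_a is not None and net_a == subnet_for(ip_b, subnets)
```

Under that assumption, 10.244.1.5 and 10.244.1.6 sit in the same node subnet and talk over the local bridge or veth path, while a pod in 10.244.2.0/24 is reached through the cross-node mechanism.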

Analogy: Medical Triage → Kubernetes Scheduling

Medical Triage determines the priority of patient admission based on urgency.

Kubernetes Scheduling determines the priority and placement of pods based on resource needs and constraints.

| Medical Triage | Kubernetes Scheduling |
|---|---|
| Immediate (Red): life-threatening | PriorityClass system-cluster-critical / system-node-critical: must run first |
| Urgent (Orange): serious but stable | Guaranteed QoS: requests = limits, resources reserved |
| Standard (Yellow): can wait | Burstable QoS: requests < limits |
| Non-urgent (Green): minor | BestEffort QoS: no guarantees |
| Deceased: no treatment | Evicted/Preempted: killed for higher priority |
| Available beds determine placement | Available node resources determine scheduling |
| Specialist wards (cardio, neuro) | Node affinity / taints (GPU node, high-mem) |

Both systems: assess → classify → prioritize → assign resources under constraints.
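The three QoS classes in the table follow from a pod's resource requests and limits. The sketch below applies the classification rules to plain dictionaries; quantities are compared as strings, so it assumes requests and limits are written in the same notation.

```python
def qos_class(containers):
    """Classify a pod from its containers' cpu/memory requests and limits."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # An unset request defaults to the limit, as Kubernetes does.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

A pod is only Guaranteed when every container sets both CPU and memory limits with equal requests; a single container that misses one of them makes the whole pod Burstable.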


Summary — The Full Stack Map

Layer 7: APPLICATION
  │  Your microservices, APIs, frontends, ML models
  │
Layer 6: ORCHESTRATION
  │  Kubernetes (scheduling, scaling, networking, self-healing)
  │  Helm charts, Operators, GitOps (ArgoCD/Flux)
  │
Layer 5: CONTAINER RUNTIME INTERFACE (CRI)
  │  kubelet ← gRPC → containerd / CRI-O
  │
Layer 4: CONTAINER RUNTIME
  │  containerd, CRI-O, Podman (standalone)
  │
Layer 3: OCI RUNTIME
  │  runc, crun, Kata Containers, gVisor
  │
Layer 2: LINUX KERNEL PRIMITIVES
  │  Namespaces, cgroups, seccomp, capabilities, OverlayFS
  │
Layer 1: OPERATING SYSTEM
  │  Linux (Ubuntu, RHEL, Flatcar, Talos)
  │
Layer 0: INFRASTRUCTURE
     Bare Metal  or  Virtual Machines (Type 1 hypervisor)

🧠 Session 2 — Key Takeaways

  1. There is no single "container" object in the kernel; containers are assembled from namespaces + cgroups + seccomp + capabilities
  2. Network namespaces give each container its own full network stack
  3. OCI standards ensure images and runtimes are interchangeable
  4. Multi-stage Dockerfiles with non-root users are the gold standard
  5. Docker ≠ the only option — Podman (daemonless), Buildah, CRI-O are production-proven
  6. CRI decoupled Kubernetes from Docker — any CRI-compliant runtime works
  7. CRI-O is purpose-built for K8s; containerd is more general-purpose
  8. The real-world stack: App → K8s → CRI → containerd/CRI-O → runc → Linux kernel → VM → Hypervisor → Hardware

📖 Recommended Reading & Resources

Specifications

Books

  • Brendan Burns, "Kubernetes: Up and Running" (3rd ed., O'Reilly)
  • Liz Rice, "Container Security" (O'Reilly) — Linux primitives deep-dive
  • Michael Hausenblas, "Learning Modern Linux" (O'Reilly)

Hands-on


❓ Questions & Discussion

Discussion prompts:

  1. Why did Kubernetes remove Docker support (dockershim)?
  2. When would you choose CRI-O over containerd?
  3. What stops a container from escaping to the host?
  4. How does the triage analogy map to your team's deployment priorities?

Thank You!

🖥️ Session 1: Bare Metal → Virtualization

🐳 Session 2: Containers → Kubernetes

"You can't build cloud-native without understanding what's beneath the clouds."

Next steps for your team:

  1. Run unshare --pid --net --mount --fork /bin/bash on a test box
  2. Build a multi-stage Dockerfile for one of your services
  3. Try kubeadm init on a spare node
  4. Explore crictl to interact with CRI directly

Master Class — Speaker Notes & Teaching Guide

From Bare Metal to Kubernetes (2 × 90 min)


SESSION 1: From Bare Metal to Virtualization

Slide Timing Guide

| Slide(s) | Topic | Mins | Teaching Notes |
|---|---|---|---|
| 1–2 | Title + Agenda | 2 | Set expectations: "By end of today, you'll understand everything between hardware and VMs" |
| 3–6 | Bare Metal | 15 | Ask the room: "Who has installed an OS on bare metal?" Start with what they know. Highlight the utilization problem, because it motivates everything that follows. |
| 7–9 | What is Virtualization | 10 | History slide is a great storytelling moment. The Popek & Goldberg theorem is the "aha": x86 wasn't virtualizable until 2005! |
| 10–13 | Hypervisor Core | 20 | Key teaching point: Explain Type 1 vs Type 2 with the building analogy: Type 1 = the building foundation itself, Type 2 = a room inside someone else's building. KVM slide is critical: "Linux IS the hypervisor" blows minds. |
| 14–17 | Reference Model | 15 | Use the traffic controller analogy for Dispatcher. Walk through the CR3 example step by step to make it concrete. Poll: "What happens when a VM tries to reboot?" |
| 18–21 | Virt vs Container | 10 | This is the bridge to Session 2. Key message: "In production, you use BOTH." The full stack diagram at the end is the money shot. |
| 22–23 | Takeaways + Q&A | 10 | Recap the 7 key points. Open discussion. |

Key Stories & Analogies to Use

The Hotel Analogy (Virtualization)

"Think of a hypervisor as a hotel building. Each VM is a complete hotel room with its own bathroom, kitchen, and bedroom. Guests are fully isolated — what happens in room 301 doesn't affect room 302. But each room takes significant space and resources."

The Apartment Analogy (Containerization — preview)

"Containers are like apartments in a shared building. They share plumbing (kernel), electrical (CPU scheduler), and foundation (hardware). Each has its own locked door (namespaces) and utility meter (cgroups). Much more efficient, but the building superintendent (kernel) is a shared dependency."

The Traffic Controller (Dispatcher)

"The dispatcher is a traffic controller at an intersection. It doesn't drive any car, it doesn't fuel any car — it just decides which lane each car goes to. Privileged instruction? Go to the Interpreter lane. Resource change? Go to the Allocator lane."

The x86 Problem — Tell It As a Mystery

"In 1974, Popek and Goldberg proved that you CAN virtualize any architecture... IF sensitive instructions always trap. For 25 years, x86 couldn't do this — 17 instructions were sensitive but didn't trap. VMware's genius was binary translation: rewrite the guest code on-the-fly to replace those sneaky instructions with safe trapping versions. Then Intel said 'fine, we'll fix it in hardware' — VT-x in 2005."


Common Questions & Answers

Q: Is Docker a hypervisor? A: No. Docker is a container runtime that uses Linux kernel primitives (namespaces, cgroups). It doesn't run full operating systems — just isolated processes on a shared kernel.

Q: Is KVM Type 1 or Type 2? A: This is debated! Technically it's Type 1 — the KVM module makes the Linux kernel itself into a hypervisor. But since you still have a full Linux userspace, some argue it's a "hybrid." The practical answer: it delivers Type 1 performance with Type 2 convenience.

Q: Why can't I just use containers for everything? A: Containers share the host kernel, so a single kernel vulnerability can expose every container on that host. VMs run separate kernels behind a hardware-enforced hypervisor boundary. In regulated environments (banking, defense, healthcare), VM isolation is often a compliance requirement. Containers also can't bring a different OS kernel: Windows containers won't run natively on a Linux host, and Linux containers on Windows (via WSL2) actually run inside a VM.

Q: What about WSL2? A: WSL2 is actually a lightweight Hyper-V VM running a real Linux kernel. It's Type 1 virtualization (Hyper-V is bare metal, Windows runs in the root partition) with a great developer experience layer on top.


SESSION 2: From Containers to Kubernetes

Slide Timing Guide

| Slide(s) | Topic | Mins | Teaching Notes |
|----------|-------|------|----------------|
| 1–2 | Title + Agenda | 2 | Quick recap of Session 1's key points before diving in |
| 3–8 | Linux Kernel Primitives | 15 | DEMO OPPORTUNITY: Run unshare live. Show lsns. Create a network namespace and show isolated ip addr. This is the most educational part: demystify the "magic" of containers. |
| 9–11 | Containerization | 10 | The layer diagram and OCI spec slide are key. Emphasize: "A container image is just a tarball of filesystem layers + metadata JSON." |
| 12–14 | Dockerfiles & Buildx | 15 | LIVE CODING: Write a Dockerfile together. Show the DO vs DON'T side by side. Multi-stage builds are the #1 practical takeaway. |
| 15–19 | Container Runtimes | 15 | The landscape diagram is the anchor slide. Key message: "Docker is 3 layers: dockerd → containerd → runc. Kubernetes skips dockerd." |
| 20–23 | Kubernetes Architecture | 15 | Draw on whiteboard: Start with "a user types kubectl apply" and trace the full path. The pod start sequence is the master slide. |
| 24–28 | CRI Deep Dive | 10 | The protobuf definitions make it concrete: CRI is just a gRPC API. CRI-O vs containerd comparison is a common team decision point. |
| 29–31 | Full Journey + Triage | 5 | The triage analogy lands well with mixed audiences. The full stack map is the synthesis of both sessions. |
| 32–33 | Takeaways + Q&A | 5 | End with the hands-on next steps. |

Live Demo Script (Session 2)

Demo 1: Build a Container From Scratch (5 min)

# Show current namespaces
lsns

# Create a new PID + mount + UTS namespace
# (--mount-proc remounts /proc so ps sees only this PID namespace)
sudo unshare --pid --mount --uts --fork --mount-proc /bin/bash

# Inside the new namespace:
hostname container-demo
hostname  # shows "container-demo"

ps aux    # only shows processes in this namespace!
# PID 1 is our bash shell

# Exit and show host hostname is unchanged
exit
hostname  # still the original hostname
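
To make the effect of Demo 1 visible without root, one option is to compare namespace identifiers directly. This is only a sketch: it assumes a Linux host with /proc mounted, and the helper names ns_id and same_ns are made up for illustration.

```shell
# Print the namespace identifier (e.g. "uts:[4026531838]") for a PID and namespace kind.
ns_id() {
  pid=$1
  kind=$2
  readlink "/proc/${pid}/ns/${kind}"
}

# Succeed when two PIDs share the namespace of the given kind.
same_ns() {
  [ "$(ns_id "$1" "$3")" = "$(ns_id "$2" "$3")" ]
}

# The current shell and its child processes start out in the same UTS namespace.
if same_ns "$$" self uts; then
  echo "same UTS namespace"
fi
```

Run `ns_id $$ uts` on the host and again inside the unshare shell: the identifiers should differ for each namespace that was unshared.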

Demo 2: Network Namespace (5 min)

# Create a network namespace
sudo ip netns add demo-ns

# Show it's nearly empty (no interfaces except loopback)
sudo ip netns exec demo-ns ip addr
# Only loopback (lo), and it's DOWN

# Bring up loopback
sudo ip netns exec demo-ns ip link set lo up

# Create a veth pair (virtual ethernet cable)
sudo ip link add veth-host type veth peer name veth-ns

# Move one end into the namespace
sudo ip link set veth-ns netns demo-ns

# Assign IPs
sudo ip addr add 10.0.0.1/24 dev veth-host
sudo ip link set veth-host up
sudo ip netns exec demo-ns ip addr add 10.0.0.2/24 dev veth-ns
sudo ip netns exec demo-ns ip link set veth-ns up

# Ping across the namespace boundary!
ping -c 2 10.0.0.2

# Cleanup (deleting the namespace also destroys the veth pair)
sudo ip netns del demo-ns

Demo 3: cgroup resource limit (3 min)

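# Create a cgroup with a 50 MiB memory limit (cgroups v2)
sudo mkdir /sys/fs/cgroup/demo
echo "52428800" | sudo tee /sys/fs/cgroup/demo/memory.max
# Disable swap for the group so the limit triggers the OOM killer instead of swapping
# (skip this line if memory.swap.max does not exist on your system)
echo "0" | sudo tee /sys/fs/cgroup/demo/memory.swap.max

# Move the current shell into that cgroup (child processes inherit it)
echo $$ | sudo tee /sys/fs/cgroup/demo/cgroup.procs

# Try to allocate more than 50 MiB → OOM killed!
python3 -c "x = ' ' * 60_000_000"
# Killed!

# Cleanup: move the shell to the root cgroup, then remove the demo group
echo $$ | sudo tee /sys/fs/cgroup/cgroup.procs
sudo rmdir /sys/fs/cgroup/demo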
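
Two small checks can help before running Demo 3. The sketch below converts a limit in MiB to the byte value that memory.max expects and reports the filesystem type mounted at /sys/fs/cgroup (cgroup2fs indicates cgroups v2). The helper names mib_to_bytes and cgroup_fs_type are illustrative, and the stat flags assume GNU coreutils or a compatible stat.

```shell
# Convert MiB to bytes, the unit used by memory.max.
mib_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

# Report the filesystem type at /sys/fs/cgroup ("cgroup2fs" means cgroups v2).
# Falls back to "unknown" when stat cannot inspect the path.
cgroup_fs_type() {
  stat -fc %T /sys/fs/cgroup 2>/dev/null || echo "unknown"
}

echo "50 MiB = $(mib_to_bytes 50) bytes"
echo "cgroup filesystem: $(cgroup_fs_type)"
```

If the second line does not print cgroup2fs, the memory.max path used in the demo will probably not exist and the demo needs a cgroups v2 host.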

Demo 4: crictl basics (2 min, needs a K8s node)

# List pods via CRI
sudo crictl pods

# List containers
sudo crictl ps

# Inspect a container
sudo crictl inspect <container-id>

# Pull an image via CRI
sudo crictl pull nginx:1.27

# Check runtime info
sudo crictl info

Common Questions & Answers (Session 2)

Q: Why did Kubernetes remove Docker support? A: It removed dockershim, not support for Docker-built images. Kubelet never needed Docker itself: dockershim was a translation layer in kubelet that converted CRI calls to Docker API calls, which Docker then passed on to containerd anyway. Removing dockershim (in Kubernetes v1.24): (a) eliminated a maintenance burden, (b) removed an unnecessary indirection layer, (c) let kubelet talk directly to containerd via CRI. Your container images still work because they follow the OCI standard.
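
A quick way to show the audience which runtime a cluster actually uses is the containerRuntimeVersion field that each node reports, which has the form name://version. The sketch below is hedged: the function names are made up for illustration, the sample strings are invented values, and list_node_runtimes assumes kubectl is installed and pointed at a cluster.

```shell
# Split a containerRuntimeVersion string such as "containerd://1.7.13".
runtime_name() {
  echo "${1%%://*}"
}

runtime_version() {
  echo "${1#*://}"
}

# Print "<node>\t<runtime>://<version>" for every node (requires kubectl and a cluster).
list_node_runtimes() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'
}

# Sample value for illustration only.
sample="containerd://1.7.13"
echo "runtime: $(runtime_name "$sample"), version: $(runtime_version "$sample")"
```

On a live cluster, `kubectl get nodes -o wide` shows the same information in the CONTAINER-RUNTIME column.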

Q: Should we use CRI-O or containerd? A: Both are production-grade.

  • containerd if: you want the most widely adopted option with the largest community (default for GKE, EKS, and AKS, and a common choice with kubeadm).
  • CRI-O if: you want a minimal, K8s-only runtime with releases versioned in step with Kubernetes (default for OpenShift).

Neither is "better"; it comes down to organizational preference.

Q: Are containers less secure than VMs? A: Yes, by default. Containers share the host kernel → a kernel exploit affects all containers. VMs have hardware isolation (VT-x, separate kernel per VM). However, container security can be hardened significantly with: seccomp profiles, AppArmor/SELinux, rootless containers, read-only rootfs, network policies, and tools like Falco. For maximum isolation, use Kata Containers (container UX, VM isolation).

Q: What is a "pause" container? A: When CRI-O/containerd creates a pod, they first start a tiny "pause" container (literally does pause() syscall — sleeps forever). This container holds the pod's network namespace alive. When you add application containers to the pod, they join this existing namespace. If the app container crashes and restarts, the network namespace (and IP address) survive because the pause container is still running.

Q: Can I run Kubernetes on bare metal? A: Absolutely — and many high-performance workloads do (no hypervisor overhead). Tools for bare metal K8s: kubeadm, k3s, Talos Linux, Flatcar Container Linux, Tinkerbell for provisioning, MetalLB for load balancing, Rook/Ceph for storage.


Whiteboard Diagrams to Draw Live

1. "What Happens When You Type kubectl run nginx --image=nginx"

Draw step by step:

  1. kubectl → API Server (REST call)
  2. API Server → etcd (store pod spec)
  3. Scheduler watches → picks node
  4. Scheduler → API Server (binding)
  5. kubelet on chosen node watches → sees new pod
  6. kubelet → CRI-O (RunPodSandbox)
  7. CRI-O → runc (create pause container + namespaces)
  8. CRI-O → CNI (setup networking, assign IP)
  9. kubelet → CRI-O (CreateContainer, StartContainer)
  10. CRI-O → runc (unshare + execve nginx)
  11. kubelet → API Server (pod status: Running)

2. "The Isolation Stack"

Draw as concentric security rings:

  • Outer: Hardware (separate machines)
  • Next: VMs (hypervisor isolation, separate kernels)
  • Next: Containers (namespace + cgroup isolation, shared kernel)
  • Next: Processes (standard OS isolation)
  • Inner: Threads (shared address space)

Each ring = different cost/performance/isolation tradeoff.


General Teaching Tips

  1. Start each concept with WHY, then HOW. "Before we explain cgroups, let's understand why you need them — imagine 50 containers and one starts eating all the memory..."

  2. Use the "zoom in" technique. Show the full stack diagram → "Today we're zooming into THIS layer."

  3. Every 15 minutes, interact. Ask a question, run a demo, or do a quick poll. 90 minutes of pure slides = sleeping audience.

  4. The triage analogy works. Most audiences already know the idea of medical triage. Map it: "immediate = critical pods, urgent = guaranteed QoS, standard = burstable, non-urgent = best-effort, deceased = evicted."

  5. End with hands-on homework. Give specific commands to try. People remember what they do, not what they hear.
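
For the triage mapping in tip 4, it can help to show how a pod lands in a QoS class. The following is a simplified sketch for a single-container pod only: the function name qos_class is made up, quantities are compared as plain strings (so "500m" and "0.5" would not match), and the real classification considers every container in the pod.

```shell
# Classify a single-container pod from its CPU/memory requests and limits.
# Arguments: cpu_request cpu_limit mem_request mem_limit (empty string = not set).
qos_class() {
  cpu_req=$1
  cpu_lim=$2
  mem_req=$3
  mem_lim=$4

  # Nothing set at all: BestEffort.
  if [ -z "${cpu_req}${cpu_lim}${mem_req}${mem_lim}" ]; then
    echo "BestEffort"
    return 0
  fi

  # An unset request defaults to the limit.
  if [ -z "$cpu_req" ]; then cpu_req=$cpu_lim; fi
  if [ -z "$mem_req" ]; then mem_req=$mem_lim; fi

  # Both limits set and equal to their requests: Guaranteed. Anything else: Burstable.
  if [ -n "$cpu_lim" ] && [ -n "$mem_lim" ] &&
     [ "$cpu_req" = "$cpu_lim" ] && [ "$mem_req" = "$mem_lim" ]; then
    echo "Guaranteed"
  else
    echo "Burstable"
  fi
}

qos_class "500m" "500m" "128Mi" "128Mi"
qos_class "250m" "" "64Mi" "128Mi"
qos_class "" "" "" ""
```

As homework, attendees can compare this with the class Kubernetes assigns by reading the pod's status.qosClass field.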
