@dexterbrylle
Created April 24, 2026 10:38
homelab-docs

DevSecOps Homelab - Full Documentation Aggregation

Aggregated on Fri 24 Apr 2026 18:38:02 PST


Source: docs/architecture/gitops-diagrams.md


GitOps Workflow Diagrams


1. Repository Structure (Multi-Environment)

To simulate a professional environment, the gitops-apps repository is structured to separate global infrastructure from individual product environments.

%%{init: {'theme': 'dark'}}%%
flowchart TB
    subgraph GiteaOrg["Gitea Organization: homelab"]
        subgraph GitopsApps["gitops-apps (Main Repository)"]
            A1["argocd-apps/<br/>Root App Manifests"]
            A2["production/<br/>prd-alpha, prd-beta"]
            A3["development/<br/>dev-gamma, dev-delta"]
            A4["infrastructure/<br/>Longhorn, LGTM Stack, Ingress"]
            A5["security/<br/>Falco, Trivy, Kyverno"]
        end

        subgraph IaC["Infrastructure as Code"]
            T1["terraform-proxmox/<br/>Proxmox VM Provisioning"]
            AN1["ansible-playbooks/<br/>K3s & Service Bootstrap"]
        end
    end

2. ArgoCD App of Apps Pattern

The "Root" application manages the state of all other applications. This allows you to add a new "Product" just by adding a YAML file to the production/ or development/ folder.

%%{init: {'theme': 'dark'}}%%
flowchart TB
    subgraph Root["Root Management"]
        RootApp["root-application.yaml<br/>(The Master Sync)"]
    end

    subgraph Projects["ArgoCD AppProjects"]
        InfraP["Project: infrastructure<br/>(Cluster Wide)"]
        ProdP["Project: production<br/>(Strict Security)"]
        DevP["Project: development<br/>(Audit Security)"]
    end

    subgraph Apps["Sync Targets"]
        InfraP --> Longhorn & LGTMStack & Ingress
        ProdP --> ProductAlpha & ProductBeta
        DevP --> SandboxApp
    end

    RootApp --> InfraP & ProdP & DevP

3. Secret Management Flow (Vault + GitOps)

Secrets are never stored in Git. Instead, Git contains a "Secret Reference" that Vault uses to inject real credentials at runtime.
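
In practice the "Secret Reference" is a set of Vault Agent injector annotations on the workload's pod template. A minimal illustrative Deployment, assuming the homelab-apps Vault role and a KV-v2 path like those created in Phase 7 (all names are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: product-alpha            # placeholder app name
  namespace: production
spec:
  replicas: 1
  selector:
    matchLabels:
      app: product-alpha
  template:
    metadata:
      labels:
        app: product-alpha
      annotations:
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/role: "homelab-apps"                       # Vault Kubernetes auth role
        vault.hashicorp.com/agent-inject-secret-db: "homelab/data/product-alpha/db"   # assumed KV-v2 path
    spec:
      serviceAccountName: product-alpha
      containers:
        - name: app
          image: nginx:1.27      # stand-in image; the injected secret appears at /vault/secrets/db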

%%{init: {'theme': 'dark'}}%%
flowchart LR
    subgraph Git["Git Repo"]
        YAML["Deployment YAML<br/>(with Annotations)"]
    end

    subgraph K8s["Kubernetes Cluster"]
        Pod["Application Pod"]
        Sidecar["Vault Agent (Sidecar)"]
    end

    subgraph V["HashiCorp Vault"]
        Store[("Encrypted Secrets")]
    end

    YAML -->|ArgoCD Sync| Pod
    Sidecar -->|Auth via K8s ServiceAccount| V
    Store -->|Inject as File| Sidecar
    Sidecar -->|Mounted Volume| Pod

Source: docs/architecture/kubernetes-diagrams.md


Kubernetes Cluster Architecture Diagrams

All diagrams use Mermaid syntax. Render in VS Code with Mermaid extension or view on GitHub.


1. K3s Cluster Workload Distribution (24/7 Master)

Since pve-vader and pve-sidious are online 24/7, all critical infrastructure runs on them. pve-maul is used only for ephemeral sandbox testing.

%%{init: {'theme': 'dark'}}%%
flowchart TB
    subgraph pveVader["pve-vader (24/7 Master)"]
        subgraph VaderVMs["Proxmox VMs/LXCs"]
            pfS["pfSense VM"]
            DNS["AdGuard Home"]
            VPN["Tailscale"]
        end
        subgraph NodeV["k3s-master-01"]
            K3sS["K3s Control Plane"]
            Argo["ArgoCD"]
            Vault["Vault (Leader)"]
        end
    end

    subgraph pveSidious["pve-sidious (24/7 Node)"]
        subgraph NodeS["k3s-worker-01"]
            Prom["Prometheus"]
            Loki["Loki Storage"]
            Tempo["Tempo Tracing"]
            Vault2["Vault (Follower)"]
        end
    end

    subgraph pveMaul["pve-maul (Hack Box)"]
        subgraph NodeM["k3s-sandbox-01"]
            Kali["Kali Linux"]
            Vuln["Vulnerable Targets"]
        end
    end

    style pveVader fill:#0d47a1,stroke:#2196f3
    style pveSidious fill:#1b5e20,stroke:#4caf50
    style pveMaul fill:#4a1c1c,stroke:#ff5252

2. LGTM + OpenTelemetry Monitoring Flow

%%{init: {'theme': 'dark'}}%%
flowchart TB
    subgraph Apps["Instrumented Applications"]
        ProductA["Product A"]
        ProductB["Product B"]
    end

    subgraph OTel["Telemetry Collection"]
        Collector["OTel Collector (Hub)"]
    end

    subgraph Storage["LGTM Persistence Layer"]
        Loki[("Loki (Logs)")]
        Prom[("Prometheus (Metrics)")]
        Tempo[("Tempo (Traces)")]
    end

    subgraph Visualization["Visualization"]
        Grafana["Grafana Central"]
    end

    Apps -->|OTLP| Collector
    Collector -->|Push| Loki
    Collector -->|Push| Tempo
    Prom -->|Scrape| Collector

    Loki & Prom & Tempo --> Grafana

    style OTel fill:#7b1fa2,stroke:#ab47bc
    style Storage fill:#1565c0,stroke:#42a5f5

3. High Availability Behavior (Quorum)

With pve-maul off, the 3-replica HA services (Vault, Longhorn) enter a "Degraded but Available" state.

%%{init: {'theme': 'dark'}}%%
flowchart LR
    subgraph Cluster["Primary Lab Cluster"]
        Node1["Vader (ON)"]
        Node2["Sidious (ON)"]
        Node3["Maul (OFF)"]
    end

    subgraph HAState["HA Service Status (e.g. Vault)"]
        Replica1["Replica 1 (Active)"]
        Replica2["Replica 2 (Standby)"]
        Replica3["Replica 3 (Offline)"]
    end

    Node1 --- Replica1
    Node2 --- Replica2
    Node3 -.- Replica3

Note: with 2 of 3 replicas online, quorum is still maintained and the service remains available.


4. Storage Mapping (Virtual Disks)

Longhorn does not access the Proxmox host directly. It uses Virtual Disks backed by the physical SATA SSDs.

%%{init: {'theme': 'dark'}}%%
flowchart TB
    subgraph HostV["pve-vader"]
        SATA_V["SATA SSD"]
        ThinV["LVM Thin Pool"]
        VDisk_V["Virtual Disk (/dev/sdb)"]
    end

    subgraph VM_V["k3s-master-01"]
        LH_V["Longhorn Engine"]
    end

    SATA_V --> ThinV --> VDisk_V --> VM_V
    VM_V -->|Mount| LH_V

Source: docs/architecture/network-diagrams.md


Network Architecture Diagrams

All diagrams use Mermaid syntax. Render in VS Code with Mermaid extension or view on GitHub.


1. High-Level Network Topology (Double NAT)

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#1a1a2e', 'primaryTextColor': '#eaeaea', 'primaryBorderColor': '#16213e', 'lineColor': '#0f3460', 'secondaryColor': '#16213e', 'tertiaryColor': '#0f3460'}}}%%
flowchart TB
    subgraph Internet["🌐 Internet"]
        WAN((WAN))
    end

    subgraph Home["🏠 Home Network (192.168.1.0/24)"]
        Router["AX3000 Router (No Bridge Mode)<br/>192.168.1.1"]
        DevMachine["Dev Machine<br/>(MacBook)"]
    end

    subgraph Proxmox["Proxmox VE Cluster"]
        subgraph Vader["pve-vader (24/7 Master)"]
            pfSense["pfSense VM (Double NAT)<br/>WAN: 192.168.1.x<br/>LAN: 10.10.10.1"]
            AdGuard["AdGuard LXC<br/>DNS: 10.10.10.2"]
            Tailscale["Tailscale LXC<br/>VPN: 10.10.10.3"]
            K3sMaster["K3s Master VM<br/>10.10.10.10"]
        end

        subgraph Sidious["pve-sidious (24/7 Node)"]
            K3sWorker1["K3s Worker VM<br/>10.10.10.12"]
            BlueVM1["Blue Team VMs<br/>Wazuh/ELK"]
        end

        subgraph Maul["pve-maul (Hack Box - Optional)"]
            K3sWorker2["Sandbox K3s Node"]
            RedVM1["Red Team VMs<br/>Kali/Targets"]
        end
    end

    WAN --> Router
    Router -->|Physical Link| pfSense
    Router --> DevMachine
    pfSense --> AdGuard
    pfSense --> K3sMaster
    pfSense --> K3sWorker1
    pfSense --> K3sWorker2
    pfSense --> RedVM1
    pfSense --> BlueVM1

    style Vader fill:#0f3460,stroke:#e94560
    style Sidious fill:#1a1a2e,stroke:#e94560
    style Maul fill:#1a1a2e,stroke:#e94560

2. Physical to Virtual Network Mapping

%%{init: {'theme': 'dark'}}%%
flowchart TB
    subgraph Physical["Actual Hardware (lsblk verified)"]
        subgraph VaderHW["pve-vader (Master)"]
            V_CPU["Intel i5-8500"]
            V_NVMe["477GB NVMe (OS/VMs)"]
            V_SATA["238GB SATA SSD (Longhorn)"]
        end

        subgraph SidiousHW["pve-sidious (Worker)"]
            S_CPU["Intel i5-8500"]
            S_NVMe["477GB NVMe (OS/VMs)"]
            S_SATA["238GB SATA SSD (Longhorn)"]
        end

        subgraph MaulHW["pve-maul (Hack Box)"]
            M_CPU["Intel i5-8500T"]
            M_NVMe["238GB NVMe (Limited)"]
        end
    end

    subgraph VirtualVMs["Critical VM Placement"]
        pfSenseVM["pfSense (on Vader)"]
        K3sServer["K3s Master (on Vader)"]
        K3sWorker1["K3s Worker (on Sidious)"]
        KaliVM["Kali Linux (on Maul)"]
    end

    V_NVMe --> pfSenseVM
    V_NVMe --> K3sServer
    S_NVMe --> K3sWorker1
    M_NVMe --> KaliVM

    V_SATA -.->|Virtio Disk| K3sServer
    S_SATA -.->|Virtio Disk| K3sWorker1

3. Firewall Rules Summary

| Source | Destination | Action | Purpose |
|---|---|---|---|
| LAN (10.10.10.0/24) | WAN | ALLOW | Internet access |
| OPT (10.20.20.0/24) | WAN | ALLOW | Sandbox updates |
| OPT (10.20.20.0/24) | LAN (10.10.10.0/24) | BLOCK | Isolate hack box from lab |
| LAN (10.10.10.0/24) | Home (192.168.1.0/24) | BLOCK | Isolate lab from family |
| Tailscale Subnet | LAN | ALLOW | Remote admin access |

Source: docs/checklist/implementation-checklist.md


DevSecOps Homelab Implementation Checklist

Status Legend: [ ] Not Started | [~] In Progress | [x] Complete | [-] Skipped/N/A


Phase 0: Prerequisites & Planning

0.1 Hardware Preparation

  • Verify all 3 nodes have Proxmox VE 8.x installed
  • Confirm network connectivity between all nodes
  • Verify storage availability on each node:
    • pve-vader: 477GB NVMe + 238GB SATA SSD
    • pve-sidious: 477GB NVMe + 238GB SATA SSD
    • pve-maul: 238GB NVMe (Hack Box - 141GB local-lvm)
  • Document MAC addresses and IP assignments
  • Configure BIOS/UEFI settings (virtualization enabled, power management)

0.2 Network Planning Documentation

  • Reserve IP addresses on router for static assignments
  • Document final IP scheme:
    • Proxmox management IPs (192.168.1.x)
    • VNet1 Homelab-Net (10.10.10.0/24)
    • VNet2 Sandbox-Net (10.20.20.0/24)
  • Plan Tailscale subnet router placement

0.3 Development Machine Setup

  • Install Terraform >= 1.6
  • Install Ansible >= 2.15
  • Install kubectl
  • Install helm
  • Install Tailscale client
  • Configure SSH keypair for infrastructure access
  • Install Proxmox Terraform provider credentials

Phase 1: Proxmox Cluster Configuration

1.1 Cluster Formation

  • Create Proxmox cluster on pve-vader (Master)
    pvecm create homelab-cluster --link0 192.168.1.11
  • Join pve-sidious to cluster (24/7 Node)
    pvecm add 192.168.1.11 --link0 192.168.1.12
  • Join pve-maul to cluster (Hack Box)
    pvecm add 192.168.1.11 --link0 192.168.1.10
  • Verify cluster quorum: pvecm status (expect Quorate: Yes, Nodes: 3)

1.2 Storage Configuration

  • Physical Prep: Create LVM thin pools on vader/sidious SATA SSDs (See storage-checklist.md § Storage Operations)
    • Run wipefs, pvcreate, vgcreate vg-longhorn, lvcreate on both nodes
  • Register storage pool in Proxmox UI: Datacenter → Storage → Add → LVM-Thin
    • ID: vg-longhorn | VG: vg-longhorn | Thin Pool: tp-longhorn | Nodes: vader, sidious
    • (Required before Terraform can provision virtual disks from this pool)
  • Verify pool visible: pvesm status on each node
  • Enable directory storage for ISOs/templates on each node
  • Upload Ubuntu 24.04 Cloud-Init image to all nodes (used as VM template)

1.3 SDN (Software-Defined Network) Setup

  • Enable Proxmox SDN via UI: Datacenter → SDN → Zones → Add → VXLAN
    • Zone ID: vxlan-zone | MTU: 1450 | Nodes: vader, sidious, maul
  • Create VNet1: ID vnet-homelab | Tag: 100 | CIDR: 10.10.10.0/24
  • Create VNet2: ID vnet-sandbox | Tag: 200 | CIDR: 10.20.20.0/24
  • Apply SDN configuration:
    pvesh set /cluster/sdn
  • Verify VNets are visible on all nodes:
    pvesh get /cluster/sdn/vnets

Phase 2: Automation & IaC (Bootstrap)

Reference: Guide 01 (Local Setup) | Guide 03 (Terraform) | Guide 04 (Ansible)

2.1 Local Tooling

  • Install required tools (Guide 01): terraform >= 1.6, ansible >= 2.15, kubectl, helm, jq, yq
  • Generate SSH keypair: ssh-keygen -t ed25519 -C "homelab" -f ~/.ssh/homelab
  • Distribute public key to all Proxmox nodes
  • Create Proxmox API token (root@pam → Datacenter → Permissions → API Tokens)
  • Configure .envrc with PROXMOX_VE_ENDPOINT, PROXMOX_VE_API_TOKEN, KUBECONFIG
  • Verify API access: curl -sk -H "Authorization: PVEAPIToken=$PROXMOX_VE_API_TOKEN" https://192.168.1.11:8006/api2/json/nodes | jq .
  • Verify SSH access to all three nodes

2.2 Terraform Project Setup

  • Create file terraform/environments/homelab/providers.tf (provider: bpg/proxmox)
  • Create file terraform/environments/homelab/variables.tf
  • Create file terraform/environments/homelab/main.tf
  • Create reusable VM module: terraform/modules/vm/{main,variables,outputs}.tf
  • Run terraform init in terraform/environments/homelab/
  • Run terraform validate — expect zero errors
  • Run terraform plan — review all resources before applying

2.3 Ansible Project Setup

  • Create file ansible/ansible.cfg
  • Create file ansible/inventories/homelab/hosts.yml with all node IPs/users
  • Create file ansible/inventories/homelab/group_vars/all.yml
  • Install required collections: ansible-galaxy collection install community.general ansible.posix
  • Test inventory: ansible all -m ping -i ansible/inventories/homelab/hosts.yml
  • Create roles: common, adguard, tailscale, k3s

Phase 3: Core Infrastructure (on pve-vader)

3.1 pfSense Router VM

  • Provision VM via Terraform: 2 vCPU, 4GB RAM, 3 NICs (WAN, LAN, OPT)
  • Configure LAN interface (10.10.10.1) and OPT (10.20.20.1)
  • Set Firewall Rule: BLOCK OPT → LAN (Sandbox isolation)

3.2 Management Services

  • Provision AdGuard Home LXC (10.10.10.2)
  • Provision Tailscale Subnet Router LXC (10.10.10.3)
  • Configure DNS Rewrites for *.homelab.local

Phase 4: Kubernetes Cluster Deployment

4.1 K3s Master (on pve-vader)

  • Provision VM via Terraform: 4 vCPU, 8GB RAM, 100GB Disk
  • Attach secondary SATA-backed Virtual Disk for Longhorn
  • Bootstrap K3s Server with --cluster-init and --disable traefik
  • BACKUP: Configure automated etcd snapshots to local disk

4.2 K3s Worker (on pve-sidious)

  • Provision Worker VM: 4 vCPU, 8GB RAM, 100GB Disk
  • Attach secondary SATA-backed Virtual Disk for Longhorn
  • Join worker to the master node

4.3 Core Cluster Services (Post-K3s)

Note: Traefik is disabled at install time. ingress-nginx must be installed before any service can be exposed via Ingress.

  • Install Ingress-Nginx via Helm (critical — replaces disabled Traefik)
    helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
    helm install ingress-nginx ingress-nginx/ingress-nginx -n ingress-nginx --create-namespace \
      --set controller.service.type=NodePort \
      --set controller.service.nodePorts.http=30080 \
      --set controller.service.nodePorts.https=30443
  • Verify ingress controller is Running: kubectl get pods -n ingress-nginx
  • Install cert-manager via Helm (TLS automation; a sample ClusterIssuer follows this list)
    helm repo add jetstack https://charts.jetstack.io
    helm install cert-manager jetstack/cert-manager -n cert-manager --create-namespace \
      --set crds.enabled=true
  • Verify cert-manager pods are Running: kubectl get pods -n cert-manager
  • (Optional, Phase 10+) Deploy cloudflared tunnel pod for public service exposure
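
A minimal ClusterIssuer to pair with the cert-manager install above. Self-signed is enough for internal *.homelab.local certificates; swap in a CA or ACME issuer later if needed:

cat <<'EOF' | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned-issuer
spec:
  selfSigned: {}
EOF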

Phase 5: Storage Layer (Longhorn)

5.1 Installation & Disk Mapping

Reference: Guide 06 | Guide 05 (K3s)

  • Install iSCSI client on all K3s nodes: apt install -y open-iscsi && systemctl enable --now iscsid
  • Install Longhorn via Helm:
    helm repo add longhorn https://charts.longhorn.io
    helm install longhorn longhorn/longhorn -n longhorn-system --create-namespace \
      --set defaultSettings.defaultDataPath=/mnt/longhorn \
      --set defaultSettings.defaultReplicaCount=2 \
      --set defaultSettings.storageMinimalAvailablePercentage=10
  • Verify all Longhorn pods Running: kubectl get pods -n longhorn-system
  • Format and mount secondary disks inside VMs (see storage-checklist.md § Configure Longhorn):
    mkfs.ext4 /dev/sdb
    mkdir -p /mnt/longhorn
    echo '/dev/sdb /mnt/longhorn ext4 defaults,noatime,nofail 0 2' >> /etc/fstab && mount /mnt/longhorn
  • Apply longhorn-node-config.yaml to register disks: kubectl apply -f longhorn-node-config.yaml

5.2 Storage Classes

Important: longhorn-default (2 replicas) is set as the default StorageClass. With only vader + sidious active, a 3-replica class would immediately show Degraded. Only use longhorn-critical for workloads that truly require 3 replicas and where degraded state is acceptable when Maul is offline.

  • longhorn-default (2 replicas) — default StorageClass
  • longhorn-critical (3 replicas — HA, for Vault/Gitea)
  • longhorn-ephemeral (1 replica — cache/temp)
  • Verify: kubectl get storageclass
  • Verify Longhorn sees SATA disks: kubectl -n longhorn-system get nodes.longhorn.io -o wide
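
A sketch of the longhorn-default class described above; longhorn-critical and longhorn-ephemeral differ only in numberOfReplicas. This mirrors Longhorn's standard StorageClass parameters:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-default
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "30"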

Phase 6: GitOps Infrastructure

Reference: Guide 07

6.1 PostgreSQL + Gitea

  • Add Bitnami Helm repo: helm repo add bitnami https://charts.bitnami.com/bitnami
  • Deploy PostgreSQL (namespace: postgresql, storage: longhorn-critical)
  • Add Gitea Helm repo: helm repo add gitea-charts https://dl.gitea.com/charts/
  • Deploy Gitea (namespace: gitea, storage: longhorn-critical)
  • Access Gitea via NodePort and complete initial setup
  • Create Gitea organisation homelab and repositories: gitops-apps, terraform-proxmox, ansible-playbooks
  • Store Gitea admin credentials in Vault (Phase 7), not in shell history

6.2 ArgoCD

  • Add Argo Helm repo: helm repo add argo https://argoproj.github.io/argo-helm
  • Deploy ArgoCD (namespace: argocd, storage: longhorn-default)
    • Do not use --set server.dev.enabled=true; use a valid bcrypt hash for the admin password (see the sketch after this list)
  • Access ArgoCD UI, change default admin password
  • Connect ArgoCD to Gitea gitops-apps repository via SSH key (preferred) or HTTPS token
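
One way to supply that bcrypt hash at install time, using the htpasswd approach from the ArgoCD docs (the value path assumes the argo/argo-cd chart):

# Generate a bcrypt hash of the desired admin password (htpasswd is in apache2-utils)
htpasswd -nbBC 10 "" 'CHANGE-ME' | tr -d ':\n'

# Pass the hash to the chart at install time (quote it; it contains $ characters)
helm install argocd argo/argo-cd -n argocd --create-namespace \
  --set-string 'configs.secret.argocdServerAdminPassword=<bcrypt-hash-from-above>'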

6.3 App-of-Apps Bootstrap

  • Create repo folder structure: argocd-apps/, infrastructure/, services/, monitoring/, security/
  • Commit argocd-apps/root-application.yaml and argocd-apps/projects.yaml to git
  • Apply root application: kubectl apply -f gitops-apps/argocd-apps/root-application.yaml
  • Verify ArgoCD shows root app as Synced/Healthy

Phase 7: Security Tooling & Red/Blue Team

Reference: Guide 08 | Guide 09

7.1 HashiCorp Vault

  • Add HashiCorp Helm repo: helm repo add hashicorp https://helm.releases.hashicorp.com
  • Deploy Vault in standalone mode (not dev mode) with Longhorn-backed storage (longhorn-critical)
  • Initialise Vault: vault operator init -key-shares=5 -key-threshold=3; save the unseal keys and root token securely
  • Unseal Vault: provide 3 of 5 keys via vault operator unseal
  • Enable KV-v2 engine: vault secrets enable -path=homelab kv-v2
  • Enable Kubernetes auth: vault auth enable kubernetes
  • Create homelab-apps policy and role in Vault (example commands after this list)
  • Verify: vault status shows Sealed: false, HA Enabled: false (standalone)
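
The policy and role step above might look roughly like this, assuming the homelab KV-v2 mount from the previous step and read-only access for workloads in the product namespaces:

# Read-only policy for application secrets under the homelab KV-v2 mount
vault policy write homelab-apps - <<'EOF'
path "homelab/data/*" {
  capabilities = ["read"]
}
EOF

# Bind the policy to a Kubernetes auth role (namespaces are illustrative)
vault write auth/kubernetes/role/homelab-apps \
  bound_service_account_names="*" \
  bound_service_account_namespaces="production,development" \
  policies="homelab-apps" \
  ttl="1h"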

7.2 Runtime Security

  • Add Falco Helm repo: helm repo add falcosecurity https://falcosecurity.github.io/charts
  • Deploy Falco with eBPF driver (driver.kind=ebpf) — requires kernel ≥ 4.14 with BTF
  • Apply custom Falco rules ConfigMap
  • Test: trigger a shell-in-container event and confirm Falco logs it
  • Add Aqua Helm repo: helm repo add aqua https://aquasecurity.github.io/helm-charts/
  • Deploy Trivy Operator (namespace: trivy-system) with in-cluster DB server enabled
  • Add Kyverno Helm repo: helm repo add kyverno https://kyverno.github.io/kyverno/
  • Deploy Kyverno (namespace: kyverno)
  • Apply baseline policies: disallow-privileged, require-limits, disallow-latest-tag (start in audit mode; see the sample policy after this list)
  • Promote policies to enforce mode after validating no existing workloads violate them
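
As an example of the audit-mode baseline, a disallow-latest-tag ClusterPolicy might look like this (it closely follows the upstream Kyverno sample policy):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Audit   # switch to Enforce after the audit period
  background: true
  rules:
    - name: require-pinned-image-tag
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Container images must not use the ':latest' tag."
        pattern:
          spec:
            containers:
              - image: "!*:latest"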

7.3 Red Team (on pve-maul)

  • Download current Kali Linux installer ISO (not VMware image) and upload to pve-maul
  • Provision Kali Linux VM (80GB Disk) via qm create on pve-maul, attached to vnet-sandbox
  • Deploy DVWA and Juice Shop in isolated K3s namespaces (add WARNING: not production-safe label)

7.4 Blue Team

  • Deploy Wazuh (SIEM) in blue-team namespace via Helm — use longhorn-default storage
  • Configure Wazuh agents on K3s nodes and Proxmox hosts
  • Verify Wazuh dashboard accessible and receiving alerts

Phase 8: Observability (LGTM Stack)

Reference: Guide 10

8.1 Metrics & Logs

  • Add Prometheus community Helm repo: helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
  • Deploy kube-prometheus-stack (includes Grafana) — namespace: monitoring
  • Add Grafana Helm repo: helm repo add grafana https://grafana.github.io/helm-charts
  • Deploy Loki (standalone chart, not deprecated loki-stack) — namespace: logging
  • Deploy Promtail DaemonSet for log collection
  • Configure Grafana Loki datasource: http://loki.logging.svc.cluster.local:3100

8.2 Tracing & OpenTelemetry

  • Deploy Grafana Tempo — namespace: monitoring
  • Configure Grafana Tempo datasource: http://tempo.monitoring.svc.cluster.local:3200 (port 3200, not 3100)
  • Add OTel Helm repo: helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
  • Deploy OpenTelemetry Collector with prometheusremotewrite exporter for metrics (values sketch after this list)
  • Deploy a test app with OTLP instrumentation and verify traces appear in Grafana Tempo
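
A rough sketch of the collector values for the metrics and traces paths named above. The Prometheus service name assumes kube-prometheus-stack defaults, and Prometheus must have its remote-write receiver enabled for the metrics pipeline to work:

mode: deployment
config:
  receivers:
    otlp:
      protocols:
        grpc: {}
        http: {}
  exporters:
    prometheusremotewrite:
      endpoint: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090/api/v1/write
    otlp/tempo:
      endpoint: tempo.monitoring.svc.cluster.local:4317
      tls:
        insecure: true
  service:
    pipelines:
      metrics:
        receivers: [otlp]
        exporters: [prometheusremotewrite]
      traces:
        receivers: [otlp]
        exporters: [otlp/tempo]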

Phase 9: Sandbox Dev Cluster (on pve-maul)

9.1 Sandbox Provisioning

  • Create a single-node, disposable K3s cluster on the Hack Box
  • Use for high-risk learning or malware analysis practice

Verification Checklist

Post-Deployment Health Checks

| Check | Command | Expected Result |
|---|---|---|
| Proxmox Quorum | pvecm status | Quorate: Yes, Nodes: 3 |
| K3s Node Health | kubectl get nodes -o wide | All nodes Ready |
| Longhorn Storage | kubectl -n longhorn-system get nodes.longhorn.io | Both nodes healthy |
| ArgoCD Sync | kubectl get applications -n argocd | All apps Synced/Healthy |
| Vault Status | vault status | Sealed: false |
| Falco Running | kubectl get pods -n falco | All pods Running |
| Network Isolation | SSH to Kali: ping 10.10.10.2 | Request timeout |
| LGTM Traces | Deploy OTel test app | Traces visible in Grafana |
  • All health checks above pass
  • No Kyverno policy violations in audit logs: kubectl get policyreport -A
  • Trivy vulnerability report generated: kubectl get vulnerabilityreports -A

Phase 10: Documentation & Maintenance

10.1 Maintenance Schedule

  • Weekly: Review security alerts
  • Monthly: Update OS packages
  • Quarterly: Update Kubernetes version

10.2 Backup Strategy

  • PBS Setup: Provision a Proxmox Backup Server VM on pve-maul
    • Allocate 60GB-80GB for the backup datastore
    • Configure deduplication and compression
    • Set Retention Policy: Keep last 2 days of backups
  • Automated Jobs: Schedule nightly backups for all critical VMs (Vader/Sidious)
  • K8s Backups: Configure Longhorn recurring snapshots (every 4-8 hours; see the RecurringJob sketch after this list)
  • Offsite (Optional): Configure Velero to sync critical metadata to S3/Local Storage
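
The Longhorn snapshot schedule above can be declared as a RecurringJob, roughly like this (cron and retention are examples within the 4-8 hour window):

apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: snapshot-6h
  namespace: longhorn-system
spec:
  cron: "0 */6 * * *"     # every 6 hours
  task: snapshot
  groups:
    - default             # applies to volumes in the default group
  retain: 4
  concurrency: 2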

Phase 11: Identity & Access Management (SSO)

Reference: Guide 11

11.1 Identity Store

  • Choose backend:
    • Option A: Windows Server 2022 VM on pve-vader — promote to Domain Controller (homelab.local)
    • Option B: Deploy lldap via ArgoCD (lightweight, recommended for DevSecOps learning)
  • Create a standard user and homelab-admins group in the chosen directory
  • Create a service account (e.g., authelia-bind) for Authelia to query LDAP

11.2 Authelia Prerequisites

  • Deploy Redis (session store) in the security namespace
    • Authelia requires Redis for session management — do not skip
  • Re-use the existing PostgreSQL instance from Phase 6 (add an authelia database)

11.3 Authelia Deployment

  • Add Authelia Helm repo: helm repo add authelia https://charts.authelia.com
  • Create authelia-values.yaml with:
    • authentication_backend.ldap.url pointing to LLDAP/AD
    • storage.postgres pointing to the existing PostgreSQL
    • session.redis pointing to Redis
    • access_control rules (e.g., one_factor for internal tools, two_factor for ArgoCD/Vault)
  • Deploy Authelia in the security namespace: helm install authelia authelia/authelia -n security -f authelia-values.yaml
  • Integrate secret values via Vault (do not put passwords in authelia-values.yaml plaintext)

11.4 SSO Integration

  • Configure ingress-nginx forward-auth annotations on all protected Ingress objects (see the annotation sketch after this list)
  • Configure Gitea as an OIDC client of Authelia (client_id, client_secret in Authelia config)
  • Configure ArgoCD OIDC login via Authelia
  • Protect Grafana, Longhorn UI, and ArgoCD behind Authelia 2FA
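
The forward-auth wiring boils down to two ingress-nginx annotations per protected Ingress. A hedged example; the in-cluster service name, external auth hostname, and verify path vary with your Authelia deployment and version:

metadata:
  annotations:
    nginx.ingress.kubernetes.io/auth-url: "http://authelia.security.svc.cluster.local:9091/api/verify"
    nginx.ingress.kubernetes.io/auth-signin: "https://auth.homelab.local/?rd=$escaped_request_uri"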

11.5 Verification

  • Open private/incognito window → navigate to https://argocd.homelab.local
  • Confirm redirect to Authelia portal
  • Login and verify access is granted with correct TOTP/LDAP credentials
  • Confirm unauthenticated requests are rejected (HTTP 401/302)

Source: docs/checklist/storage-checklist.md


Storage Verification & Operations Checklist

Verified hardware output from each node and step-by-step storage operations for preparing Longhorn persistent storage.


Hardware Verification

pve-maul

root@pve-maul:~# lsblk
NAME               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
nvme0n1            259:0    0 238.5G  0 disk
├─nvme0n1p1        259:1    0  1007K  0 part
├─nvme0n1p2        259:2    0     1G  0 part /boot/efi
└─nvme0n1p3        259:3    0 237.5G  0 part
  ├─pve-swap       252:0    0     8G  0 lvm  [SWAP]
  ├─pve-root       252:1    0  69.4G  0 lvm  /
  ├─pve-data_tmeta 252:2    0   1.4G  0 lvm
  │ └─pve-data     252:4    0 141.2G  0 lvm
  └─pve-data_tdata 252:3    0 141.2G  0 lvm
    └─pve-data     252:4    0 141.2G  0 lvm
root@pve-maul:~# fdisk -l
Disk /dev/nvme0n1: 238.47 GiB, 256060514304 bytes, 500118192 sectors
Disk model: ADATA SX8200PNP
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: E61EB1D3-EC90-4BE5-96C2-7F88C8489778

Device           Start       End   Sectors   Size Type
/dev/nvme0n1p1      34      2047      2014  1007K BIOS boot
/dev/nvme0n1p2    2048   2099199   2097152     1G EFI System
/dev/nvme0n1p3 2099200 500118158 498018959 237.5G Linux LVM


Disk /dev/mapper/pve-swap: 8 GiB, 8589934592 bytes, 16777216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/mapper/pve-root: 69.37 GiB, 74482450432 bytes, 145473536 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

root@pve-maul:~# pvesm status
Name             Type     Status           Total            Used       Available        %
local             dir     active        71017632         3537964        63826448    4.98%
local-lvm     lvmthin     active       148086784               0       148086784    0.00%
root@pve-maul:~#   cat /etc/pve/storage.cfg
dir: local
        path /var/lib/vz
        content iso,vztmpl,backup

lvmthin: local-lvm
        thinpool data
        vgname pve
        content rootdir,images
root@pve-maul:~# df -h
Filesystem            Size  Used Avail Use% Mounted on
udev                   12G     0   12G   0% /dev
tmpfs                 2.4G  1.3M  2.4G   1% /run
/dev/mapper/pve-root   68G  3.4G   61G   6% /
tmpfs                  12G   28M   12G   1% /dev/shm
efivarfs              118K   55K   59K  48% /sys/firmware/efi/efivars
tmpfs                 5.0M     0  5.0M   0% /run/lock
tmpfs                 1.0M     0  1.0M   0% /run/credentials/systemd-journald.service
tmpfs                  12G     0   12G   0% /tmp
/dev/nvme0n1p2       1022M  8.8M 1014M   1% /boot/efi
/dev/fuse             128M   16K  128M   1% /etc/pve
tmpfs                 1.0M     0  1.0M   0% /run/credentials/getty@tty1.service
tmpfs                 2.4G  4.0K  2.4G   1% /run/user/0
root@pve-maul:~#
root@pve-maul:~# lsblk -f
NAME               FSTYPE      FSVER    LABEL UUID                                   FSAVAIL FSUSE% MOUNTPOINTS
nvme0n1
├─nvme0n1p1
├─nvme0n1p2        vfat        FAT32          E77D-92ED                              1013.2M     1% /boot/efi
└─nvme0n1p3        LVM2_member LVM2 001       vWU0Gw-xMdB-HvB7-Zoca-7UuH-q7ZB-GW1Rpg
  ├─pve-swap       swap        1              ea4070e5-eefe-4d85-b66f-fa13be69926a                  [SWAP]
  ├─pve-root       ext4        1.0            c3a3ef3d-b6a6-4cd2-85e6-bf756c3d1731     60.9G     5% /
  ├─pve-data_tmeta
  │ └─pve-data
  └─pve-data_tdata
    └─pve-data

pve-sidious

root@pve-sidious:~# lsblk
NAME               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                  8:0    0 238.5G  0 disk
└─sda1               8:1    0 238.5G  0 part
nvme0n1            259:0    0 476.9G  0 disk
├─nvme0n1p1        259:1    0  1007K  0 part
├─nvme0n1p2        259:2    0     1G  0 part /boot/efi
└─nvme0n1p3        259:3    0   475G  0 part
  ├─pve-swap       252:0    0     8G  0 lvm  [SWAP]
  ├─pve-root       252:1    0    96G  0 lvm  /
  ├─pve-data_tmeta 252:2    0   3.5G  0 lvm
  │ └─pve-data     252:4    0 347.9G  0 lvm
  └─pve-data_tdata 252:3    0 347.9G  0 lvm
    └─pve-data     252:4    0 347.9G  0 lvm
root@pve-sidious:~# fdisk-l
-bash: fdisk-l: command not found
root@pve-sidious:~# fdisk -l
Disk /dev/nvme0n1: 476.94 GiB, 512110190592 bytes, 1000215216 sectors
Disk model: Colorful CN600 512GB PRO
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: EAC52235-00AD-4DF6-852C-0750ED95AE21

Device           Start       End   Sectors  Size Type
/dev/nvme0n1p1      34      2047      2014 1007K BIOS boot
/dev/nvme0n1p2    2048   2099199   2097152    1G EFI System
/dev/nvme0n1p3 2099200 998244352 996145153  475G Linux LVM


Disk /dev/sda: 238.47 GiB, 256060514304 bytes, 500118192 sectors
Disk model: Colorful SL500 2
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: DEC3BA58-81C3-4088-B0A4-3509FBE7EE5E

Device     Start       End   Sectors   Size Type
/dev/sda1     34 500117503 500117470 238.5G Linux filesystem


Disk /dev/mapper/pve-swap: 8 GiB, 8589934592 bytes, 16777216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/mapper/pve-root: 96 GiB, 103079215104 bytes, 201326592 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
root@pve-sidious:~# pvesm status
Name             Type     Status           Total            Used       Available        %
local             dir     active        98497780         3575700        89872532    3.63%
local-lvm     lvmthin     active       364797952               0       364797952    0.00%
root@pve-sidious:~# cat /etc/pve/storage.cfg
dir: local
	path /var/lib/vz
	content iso,vztmpl,backup

lvmthin: local-lvm
	thinpool data
	vgname pve
	content rootdir,images
root@pve-sidious:~# df -h
Filesystem            Size  Used Avail Use% Mounted on
udev                   16G     0   16G   0% /dev
tmpfs                 3.2G  1.2M  3.2G   1% /run
/dev/mapper/pve-root   94G  3.5G   86G   4% /
tmpfs                  16G   28M   16G   1% /dev/shm
efivarfs              150K   78K   68K  54% /sys/firmware/efi/efivars
tmpfs                 5.0M     0  5.0M   0% /run/lock
tmpfs                 1.0M     0  1.0M   0% /run/credentials/systemd-journald.service
tmpfs                  16G     0   16G   0% /tmp
/dev/nvme0n1p2       1022M  8.8M 1014M   1% /boot/efi
/dev/fuse             128M   16K  128M   1% /etc/pve
tmpfs                 1.0M     0  1.0M   0% /run/credentials/getty@tty1.service
tmpfs                 3.2G  4.0K  3.2G   1% /run/user/0
root@pve-sidious:~# lsblk -f
NAME               FSTYPE      FSVER    LABEL UUID                                   FSAVAIL FSUSE% MOUNTPOINTS
sda
└─sda1             ext4        1.0      Files 52e05ce3-e51c-4a86-a4b2-5eacfd9b3096
nvme0n1
├─nvme0n1p1
├─nvme0n1p2        vfat        FAT32          4AD4-24E4                              1013.2M     1% /boot/efi
└─nvme0n1p3        LVM2_member LVM2 001       TQNOWn-EvuS-AYTt-qOZg-UnRA-iGdf-7GW5Mv
  ├─pve-swap       swap        1              d9e824f3-7c1b-4f1b-ae5d-0fe6949e7849                  [SWAP]
  ├─pve-root       ext4        1.0            da7a9d2c-3d52-4fc1-adf9-c543be9dca5d     85.7G     4% /
  ├─pve-data_tmeta
  │ └─pve-data
  └─pve-data_tdata
    └─pve-data

pve-vader

root@pve-vader:~# lsblk
NAME               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                  8:0    0 238.5G  0 disk
└─sda1               8:1    0 238.5G  0 part
nvme0n1            259:0    0 476.9G  0 disk
├─nvme0n1p1        259:1    0  1007K  0 part
├─nvme0n1p2        259:2    0     1G  0 part /boot/efi
└─nvme0n1p3        259:3    0 475.9G  0 part
  ├─pve-swap       252:0    0     8G  0 lvm  [SWAP]
  ├─pve-root       252:1    0    96G  0 lvm  /
  ├─pve-data_tmeta 252:2    0   3.6G  0 lvm
  │ └─pve-data     252:4    0 348.8G  0 lvm
  └─pve-data_tdata 252:3    0 348.8G  0 lvm
    └─pve-data     252:4    0 348.8G  0 lvm
root@pve-vader:~# fdisk -l
Disk /dev/nvme0n1: 476.94 GiB, 512110190592 bytes, 1000215216 sectors
Disk model: Colorful CN600 512GB PRO
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 0905714F-0DFC-4F16-9888-622B297706E0

Device           Start        End   Sectors   Size Type
/dev/nvme0n1p1      34       2047      2014  1007K BIOS boot
/dev/nvme0n1p2    2048    2099199   2097152     1G EFI System
/dev/nvme0n1p3 2099200 1000215182 998115983 475.9G Linux LVM


Disk /dev/sda: 238.47 GiB, 256060514304 bytes, 500118192 sectors
Disk model: Colorful SL500 2
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: A2A608B6-6638-48F6-AB47-461241BC3907

Device     Start       End   Sectors   Size Type
/dev/sda1   2048 500118158 500116111 238.5G Linux filesystem


Disk /dev/mapper/pve-swap: 8 GiB, 8589934592 bytes, 16777216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/mapper/pve-root: 96 GiB, 103079215104 bytes, 201326592 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
root@pve-vader:~# pvesm status
Name             Type     Status           Total            Used       Available        %
local             dir     active        98497780         3573856        89874376    3.63%
local-lvm     lvmthin     active       365760512               0       365760512    0.00%
root@pve-vader:~# cat /etc/pve/storage.cfg
dir: local
	path /var/lib/vz
	content iso,vztmpl,backup

lvmthin: local-lvm
	thinpool data
	vgname pve
	content rootdir,images
root@pve-vader:~# df -h
Filesystem            Size  Used Avail Use% Mounted on
udev                   16G     0   16G   0% /dev
tmpfs                 3.2G  1.2M  3.2G   1% /run
/dev/mapper/pve-root   94G  3.5G   86G   4% /
tmpfs                  16G   28M   16G   1% /dev/shm
efivarfs              150K   75K   71K  52% /sys/firmware/efi/efivars
tmpfs                 5.0M     0  5.0M   0% /run/lock
tmpfs                 1.0M     0  1.0M   0% /run/credentials/systemd-journald.service
tmpfs                  16G     0   16G   0% /tmp
/dev/nvme0n1p2       1022M  8.8M 1014M   1% /boot/efi
/dev/fuse             128M   20K  128M   1% /etc/pve
tmpfs                 1.0M     0  1.0M   0% /run/credentials/getty@tty1.service
tmpfs                 3.2G  4.0K  3.2G   1% /run/user/0
root@pve-vader:~# lsblk -f
NAME               FSTYPE      FSVER    LABEL UUID                                   FSAVAIL FSUSE% MOUNTPOINTS
sda
└─sda1             ext4        1.0            6bed5c80-5e4c-4fd4-842c-96cf73d40e12
nvme0n1
├─nvme0n1p1
├─nvme0n1p2        vfat        FAT32          7ED7-8304                              1013.2M     1% /boot/efi
└─nvme0n1p3        LVM2_member LVM2 001       4nPkAt-XcQV-0Upy-odzR-pVkS-jf2P-PV2FRc
  ├─pve-swap       swap        1              d3525fc7-22b0-4d5b-a5b8-e4fc0ec2cc60                  [SWAP]
  ├─pve-root       ext4        1.0            142fff43-9bd5-4ec8-8c57-798b66f37277     85.7G     4% /
  ├─pve-data_tmeta
  │ └─pve-data
  └─pve-data_tdata
    └─pve-data

Storage Plan

pve-vader = 24x7 node (Master node for Proxmox and K3s)
pve-sidious = 24x7 node (Proxmox Quorum + K3s Worker)
pve-maul = Hack Box (Security lab - turned on for practice)

| Node | Role | VMs/LXCs | Disk Source |
|---|---|---|---|
| pve-vader | 24x7 Master | pfSense (32GB), K3s Master (100GB), AdGuard (8GB), Tailscale (4GB) | local-lvm (~415GB free) + SATA SSD (256GB) |
| pve-sidious | 24x7 Worker | K3s Worker 01 (100GB), Longhorn on SATA SSD | local-lvm (~415GB free) + SATA SSD (256GB) |
| pve-maul | Hack Box | Kali Linux (80GB), Security Sandboxes | local-lvm (141GB free) |

Storage Operations

1. Prepare SATA SSDs for Longhorn (pve-vader & pve-sidious)

To make the SATA SSDs accessible to Longhorn (which runs inside K3s VMs), we will create a dedicated LVM thin pool on each host and attach a virtual disk to the VMs.

# === Run on BOTH pve-vader AND pve-sidious ===

# 1. Clear existing filesystem signatures (Critical)
wipefs -a /dev/sda1

# 2. Initialize as LVM Physical Volume
pvcreate /dev/sda1

# 3. Create a Volume Group on the SATA SSD
vgcreate vg-longhorn /dev/sda1

# 4. Create a Thin Pool
lvcreate -L 230G -T vg-longhorn/tp-longhorn

# 5. In Terraform (Guide 03), we will attach a disk from this pool to the K3s VMs
# It will appear inside the VM as /dev/sdb

2. Configure Longhorn to Use Dedicated Virtual Disk

After deploying Longhorn (Phase 4), configure the storage path inside the K3s VM.

# === Run inside K3s Master and Workers ===

# 1. Format the second disk
mkfs.ext4 /dev/sdb

# 2. Mount it
mkdir -p /mnt/longhorn
echo '/dev/sdb /mnt/longhorn ext4 defaults,noatime,nofail 0 2' >> /etc/fstab
mount /mnt/longhorn

Apply node configuration via manifest (using K3s node names):

# longhorn-node-config.yaml
apiVersion: longhorn.io/v1beta2
kind: Node
metadata:
  name: k3s-master-01  # Correct K3s node name
  namespace: longhorn-system
spec:
  disks:
    sata-ssd:
      path: /mnt/longhorn
      storageReserved: 26843545600  # 25 GiB in bytes (Longhorn expects bytes, not a percentage)
      tags:
        - ssd
        - longhorn
---
apiVersion: longhorn.io/v1beta2
kind: Node
metadata:
  name: k3s-worker-01  # Correct K3s node name
  namespace: longhorn-system
spec:
  disks:
    sata-ssd:
      path: /mnt/longhorn
      storageReserved: 26843545600  # 25 GiB in bytes (Longhorn expects bytes, not a percentage)
      tags:
        - ssd
        - longhorn

Apply it:

kubectl apply -f longhorn-node-config.yaml

3. Shrink pve-root on pve-vader & pve-sidious (96GB -> 40GB)

Caution

Take a Proxmox backup before proceeding. This is a destructive, irreversible operation on a live host. In the Proxmox UI, go to Datacenter → Backup → Backup Now for both pve-vader and pve-sidious and store the backup on a USB drive or pve-maul. Do not skip this step.

WARNING: This requires rebooting into rescue/live mode; a mounted root filesystem cannot be shrunk. Prerequisite: Boot from the Proxmox installer USB in rescue mode, or use a SystemRescue ISO.

# === Boot from Proxmox installer USB, select "Rescue Boot" ===
# Or boot a SystemRescue ISO via Proxmox ISO mount

# Activate LVM volumes
vgchange -ay

# Check filesystem before shrinking
e2fsck -f /dev/mapper/pve-root

# Shrink filesystem AND logical volume in one step (to 40GB)
# --resizefs shrinks the filesystem first, then the LV
lvreduce --resizefs -L 40G /dev/pve/root

# Extend the thin pool to consume freed space (~56GB per node)
lvextend -l +100%FREE /dev/pve/data

# Verify
lvs
# pve-root should show 40G
# pve-data (thin pool) should show ~404GB

# Reboot
exit
reboot

Expected result after shrinking both nodes:

| Node | pve-root | local-lvm (thin pool) |
|---|---|---|
| pve-vader | 40GB | ~404GB |
| pve-sidious | 40GB | ~404GB |
# After reboot, verify
pvesm status
df -h /
lvs

4. Verify Longhorn Replica Placement

After Longhorn is deployed and configured (Phase 4):

# Verify Longhorn sees the SATA SSD disks on both worker nodes
kubectl -n longhorn-system get nodes.longhorn.io -o wide

# Check disk status
kubectl -n longhorn-system get disks.longhorn.io -A

# Verify storage capacity
kubectl -n longhorn-system get nodes.longhorn.io \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.diskStatus.sata-ssd.storageAvailable}{"\n"}{end}'

# Create a test PVC to verify replica scheduling
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: longhorn-test
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
EOF

# Verify volume has 2 replicas across nodes
kubectl -n longhorn-system get volumes.longhorn.io

# Clean up test
kubectl delete pvc longhorn-test -n default

Expected capacity:

  • pve-vader: ~236GB (SATA SSD) dedicated to Longhorn
  • pve-sidious: ~236GB (SATA SSD) dedicated to Longhorn
  • Total replicated capacity: ~472GB with 2 replicas (effective ~236GB usable)

Source: docs/guides/01-local-environment-setup.md


Guide 01: Local Environment Setup

Configure your development machine for infrastructure provisioning and cluster management.


Overview

This guide prepares your local macOS/Linux machine with all required tools for deploying and managing the homelab infrastructure.

Time Required: ~15 minutes Prerequisites: macOS or Linux machine, internet connection


Tools to Install

| Tool | Version | Purpose |
|---|---|---|
| Terraform | >= 1.6 | Provision Proxmox VMs/LXCs |
| Ansible | >= 2.15 | Bootstrap services |
| kubectl | >= 1.28 | Kubernetes management |
| helm | >= 3.14 | Kubernetes package management |
| Tailscale | Latest | VPN access to homelab |
| direnv | Latest | Auto-load .envrc per project directory |
| jq | Latest | JSON processing |
| yq | Latest | YAML processing |

macOS Setup

1. Install Homebrew (if not installed)

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

2. Install Core Tools

# Terraform
brew tap hashicorp/tap
brew install hashicorp/tap/terraform

# Ansible
brew install ansible

# kubectl
brew install kubectl

# helm
brew install helm

# Tailscale
brew install --cask tailscale

# direnv (auto-loads .envrc when you cd into the project directory)
brew install direnv
# Add to your shell init file (~/.zshrc or ~/.bashrc):
echo 'eval "$(direnv hook zsh)"' >> ~/.zshrc && source ~/.zshrc

# jq (JSON processor)
brew install jq

# yq (YAML processor)
brew install yq

3. Verify Installations

terraform version
ansible --version
kubectl version --client
helm version
tailscale version
jq --version
yq --version

4. Start Tailscale

sudo tailscale up

Follow the browser prompt to authenticate.


Linux Setup (Ubuntu/Debian)

1. Update System

sudo apt update && sudo apt upgrade -y

2. Install Dependencies

sudo apt install -y curl wget git software-properties-common

3. Install Terraform

wget -O- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install terraform

4. Install Ansible

sudo apt update
sudo apt install -y ansible

5. Install kubectl

curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/

6. Install helm

curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

7. Install Tailscale

curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up

8. Install jq and yq

sudo apt install -y jq
sudo wget https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -O /usr/bin/yq
sudo chmod +x /usr/bin/yq

9. Install direnv

sudo apt install -y direnv
# Add to ~/.bashrc
echo 'eval "$(direnv hook bash)"' >> ~/.bashrc && source ~/.bashrc

SSH Key Generation

Generate SSH keys for infrastructure access:

# Generate new key pair
ssh-keygen -t ed25519 -C "homelab" -f ~/.ssh/homelab

# Display public key (add to Proxmox nodes)
cat ~/.ssh/homelab.pub

Copy the public key and add it to each Proxmox node:

# On each Proxmox node
mkdir -p ~/.ssh
echo "YOUR_PUBLIC_KEY_HERE" >> ~/.ssh/authorized_keys

Project Directory Structure

Create the project structure:

cd /Volumes/Codex/Projects/homelab

# Create directories
mkdir -p terraform/environments/homelab
mkdir -p ansible/inventories/homelab/group_vars
mkdir -p ansible/playbooks
mkdir -p ansible/roles
mkdir -p gitops-apps/{infrastructure,services,monitoring,security}

# Verify structure
tree -L 2 -d

Environment Variables

Create the project environment file and activate it with direnv:

# Create .envrc file
cat > .envrc << 'EOF'
# Proxmox Credentials (bpg/proxmox provider env var names)
export PROXMOX_VE_ENDPOINT="https://192.168.1.11:8006"
export PROXMOX_VE_API_TOKEN="root@pam!terraform=<your-token-here>"

# Kubernetes
export KUBECONFIG="${HOME}/.kube/homelab-config"

# Terraform SSH key (injected into Cloud-Init)
export TF_VAR_ssh_public_key=$(cat ~/.ssh/homelab.pub 2>/dev/null || echo "")

# Tailscale
export TS_AUTHKEY=""  # Optional: for automated Tailscale auth

# Git
export GIT_USERNAME="your-gitea-username"
export GIT_EMAIL="your-email@example.com"
EOF

# Allow direnv to load this file (run once per project)
# direnv will then auto-load/unload .envrc whenever you cd in/out of the directory
direnv allow .

Important

.envrc contains secrets. Make sure .envrc is in .gitignore before committing:

grep '.envrc' .gitignore || echo '.envrc' >> .gitignore

Proxmox API Token Setup

Create API Token on pve-vader

  1. Log into Proxmox web UI: https://192.168.1.11:8006
  2. Go to Datacenter > Permissions > API Tokens
  3. Click Add: Select root@pam
  4. Set Token ID: terraform
  5. Uncheck Privilege Separation (note: for production use, create a dedicated terraform@pve user with minimal permissions instead of root@pam)
  6. Click Generate
  7. Copy the token immediately (it won't be shown again)

Add Token to Environment

# Edit .envrc and set the token
export PROXMOX_VE_API_TOKEN="root@pam!terraform=YOUR_TOKEN_HERE"

# Reload
direnv allow .

Verify Configuration

Test Proxmox API Access

curl -s "https://192.168.1.11:8006/api2/json/nodes" \
  -H "Authorization: PVEAPIToken=root@pam!terraform@homelab=YOUR_TOKEN" \
  -k | jq .

Expected output:

{
  "data": [
    {"node": "vader", "status": "online", ...},
    {"node": "sidious", "status": "online", ...},
    {"node": "maul", "status": "online", ...}
  ]
}

Test SSH Access

# Test SSH to each node
ssh -i ~/.ssh/homelab root@192.168.1.11 "hostname"
ssh -i ~/.ssh/homelab root@192.168.1.12 "hostname"
ssh -i ~/.ssh/homelab root@192.168.1.10 "hostname"

Git Configuration

Configure git for the project:

git config --global user.name "Your Name"
git config --global user.email "your-email@example.com"
git config --global core.sshCommand "ssh -i ~/.ssh/homelab"

# Initialize git repository
git init
git add .
git commit -m "Initial: Project structure"

Quick Verification Script

Create and run verification script:

cat > verify-setup.sh << 'EOF'
#!/bin/bash

echo "🔍 Verifying local environment setup..."

# Check tools
tools=("terraform" "ansible" "kubectl" "helm" "jq" "yq")
for tool in "${tools[@]}"; do
  if command -v $tool &> /dev/null; then
    echo "✅ $tool: $(command -v $tool)"
  else
    echo "❌ $tool: NOT FOUND"
  fi
done

# Check Tailscale
if tailscale status &> /dev/null; then
  echo "✅ Tailscale: Connected"
else
  echo "⚠️  Tailscale: Not connected"
fi

# Check SSH key
if [ -f ~/.ssh/homelab ]; then
  echo "✅ SSH key exists"
else
  echo "❌ SSH key not found"
fi

# Check Proxmox API
if [ -n "$PM_PASS_TOKEN" ]; then
  echo "✅ Proxmox token configured"
else
  echo "❌ Proxmox token not set"
fi

echo "✨ Setup verification complete!"
EOF

chmod +x verify-setup.sh
./verify-setup.sh

Troubleshooting

Terraform not found

# macOS: Rehash brew
hash -r

# Linux: Check PATH
echo $PATH | grep -o "[^:]*"

Proxmox API connection refused

# Verify the API endpoint is correct
echo $PROXMOX_VE_ENDPOINT

# Test connectivity
ping -c 3 192.168.1.10

# Check if API token is valid
curl -k "https://192.168.1.10:8006/api2/json/version" \
  -H "Authorization: PVEAPIToken=root@pam!terraform@homelab=$PM_PASS_TOKEN"

SSH connection refused

# Verify SSH is running on nodes
ssh root@192.168.1.10 "systemctl status ssh"

# Check if key is added to node
ssh -i ~/.ssh/homelab root@192.168.1.10 "cat ~/.ssh/authorized_keys"

Next Steps

Once local setup is verified:

➡️ Continue to Guide 02: Proxmox Cluster


Checklist

  • All required tools installed
  • SSH key generated and distributed
  • Proxmox API token created
  • Project directory structure created
  • Environment variables configured
  • Proxmox API access verified
  • SSH access verified to all nodes
  • Tailscale connected

Source: docs/guides/02-proxmox-cluster.md


Guide 02: Proxmox Cluster Configuration

Form the Proxmox cluster and configure Software-Defined Networking (SDN) for network isolation.


Overview

This guide creates a 3-node Proxmox VE cluster and configures VXLAN-based virtual networks. pve-vader is the primary master.

Time Required: ~20 minutes Prerequisites: All three Proxmox nodes installed and reachable via SSH on the management network (192.168.1.x)


Phase 1: Verify Node Configuration

1.1 Verify Network on All Nodes

Run on each node (pve-vader, pve-sidious, pve-maul):

Ensure /etc/hosts contains all three hosts:

192.168.1.11 pve-vader.homelab.local pve-vader
192.168.1.12 pve-sidious.homelab.local pve-sidious
192.168.1.10 pve-maul.homelab.local pve-maul

Verify connectivity between nodes:

# Run on pve-vader
ping -c 3 192.168.1.12   # to pve-sidious
ping -c 3 192.168.1.10   # to pve-maul

Note

MTU consideration: VXLAN encapsulates with an overhead of 50 bytes. The physical NICs must have MTU ≥ 1500 (default). If you set NIC MTU to 9000 (jumbo frames), set VNet MTU to 8950 instead of 1450. Do not mix MTU sizes across nodes.


Phase 2: Create Cluster

2.1 Initialize Cluster on pve-vader (Master)

# SSH to pve-vader
ssh root@192.168.1.11

# Create the cluster using the management IP as link0
pvecm create homelab-cluster --link0 192.168.1.11

2.2 Join pve-sidious to Cluster (24/7 Node)

# SSH to pve-sidious
ssh root@192.168.1.12

# Join the cluster
pvecm add 192.168.1.11 --link0 192.168.1.12

2.3 Join pve-maul to Cluster (Hack Box)

# SSH to pve-maul
ssh root@192.168.1.10

# Join the cluster
pvecm add 192.168.1.11 --link0 192.168.1.10

2.4 Verify Cluster Formation

# Run on pve-vader
pvecm status

# Expected output (key fields):
# Quorate: Yes
# Total votes: 3
# Nodes: 3
#   ID  Votes Flags  Name
#    1    1   M,Ees  pve-vader
#    2    1   Ees    pve-sidious
#    3    1   Ees    pve-maul

# Also verify from the Proxmox UI:
# Datacenter → Cluster → check all 3 nodes show green

Phase 3: Configure Proxmox SDN (VXLAN)

3.1 Create VXLAN Zone

  1. Navigate to Datacenter > SDN > Zones
  2. Click Add: VXLAN
    • Zone: vxlan-zone
    • MTU: 1450 (must be 50 less than physical NIC MTU)
    • Nodes: Select all 3 nodes

3.2 Create VNets

| VNet Name | Tag | CIDR | Purpose |
|---|---|---|---|
| vnet-homelab | 100 | 10.10.10.0/24 | Primary K3s Cluster |
| vnet-sandbox | 200 | 10.20.20.0/24 | Isolated Hack Box |

3.3 Apply Configuration

# Apply SDN configuration to the cluster (Proxmox VE 8.x)
pvesh set /cluster/sdn

3.4 Verify SDN

# Verify VNets are visible on all nodes
pvesh get /cluster/sdn/vnets

# Expected: both vnet-homelab and vnet-sandbox listed

# Verify on individual node
ssh root@192.168.1.12 "pvesh get /nodes/pve-sidious/network" | grep -E 'vnet|vxlan'

Phase 4: Quorum & Availability

Since pve-vader and pve-sidious are online 24/7, they maintain quorum (2/3 votes) automatically even when pve-maul is powered off.

| State | Votes Online | Status |
|---|---|---|
| Vader + Sidious + Maul | 3/3 | Healthy |
| Vader + Sidious | 2/3 | Healthy (Quorum Maintained) |
| Vader Only | 1/3 | Cluster Read-Only (Quorum Lost) |

Completion Checklist

  • /etc/hosts updated on all 3 nodes with correct hostnames
  • Cluster created on pve-vader: pvecm create homelab-cluster --link0 192.168.1.11
  • pve-sidious joined cluster: pvecm status shows Nodes: 2, Quorate: Yes
  • pve-maul joined cluster: pvecm status shows Nodes: 3
  • VXLAN zone vxlan-zone created (MTU: 1450)
  • VNet vnet-homelab (Tag: 100) created
  • VNet vnet-sandbox (Tag: 200) created
  • SDN applied: pvesh set /cluster/sdn
  • VNets verified on all nodes: pvesh get /cluster/sdn/vnets

Source: docs/guides/03-terraform-infrastructure.md


Guide 03: Terraform Infrastructure Provisioning

Deploy all VMs and LXC containers using Terraform with the Proxmox provider.


Overview

This guide provisions the primary infrastructure. All management and control plane VMs are pinned to pve-vader (Master) and pve-sidious (24/7 Worker).

Time Required: ~30 minutes
Prerequisites: Guide 02 completed, direnv and Terraform installed (Guide 01), .envrc configured with PROXMOX_VE_API_TOKEN

Terraform Provider: bpg/proxmox — actively maintained, supports Cloud-Init and multi-disk VMs.


Phase 1: Prerequisite — Register Storage Pool in Proxmox

Important

This step must be completed before running terraform plan. Terraform references vg-longhorn by name and will fail if it isn't registered in Proxmox as a storage backend.

After creating the LVM thin pool on pve-vader and pve-sidious (see storage-checklist.md § Storage Operations):

  1. Proxmox UI → Datacenter → Storage → Add → LVM-Thin
  2. Repeat for each node (vader and sidious):
    • ID: vg-longhorn
    • Volume group: vg-longhorn
    • Thin pool: tp-longhorn
    • Nodes: select only the node you're adding it to
  3. Verify: pvesm status on each node should list vg-longhorn as active

Phase 2: Terraform Project Structure

The following files are already present in the repo. Review them before running:

terraform/
├── environments/
│   └── homelab/
│       ├── providers.tf   ← bpg/proxmox provider + required_version
│       ├── variables.tf   ← all input variables (API token, SSH key, etc.)
│       └── main.tf        ← all VM/LXC resource definitions
└── modules/
    └── vm/
        ├── main.tf        ← proxmox_virtual_environment_vm resource
        ├── variables.tf   ← module input variables
        └── outputs.tf     ← vm_id, vm_name, ip_address
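
For orientation, providers.tf contains something on the order of the following; the repo copy is authoritative, the version pin is an assumption, and credentials come from the PROXMOX_VE_* variables set in .envrc (Guide 01):

terraform {
  required_version = ">= 1.6"
  required_providers {
    proxmox = {
      source  = "bpg/proxmox"
      version = ">= 0.50"   # assumption: pin to the version the repo actually uses
    }
  }
}

provider "proxmox" {
  # endpoint and api_token are read from PROXMOX_VE_ENDPOINT / PROXMOX_VE_API_TOKEN
  insecure = true   # Proxmox ships a self-signed certificate by default
}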

VM Allocation Summary

| VM Name | Node | vCPU | RAM | Boot Disk | Data Disk | On Boot |
|---|---|---|---|---|---|---|
| pfSense-Primary | vader | 2 | 4GB | 32GB (local-lvm) | | |
| k3s-master-01 | vader | 4 | 8GB | 100GB (local-lvm) | 200GB (vg-longhorn) | |
| k3s-worker-01 | sidious | 4 | 8GB | 100GB (local-lvm) | 200GB (vg-longhorn) | |
| kali-linux | maul | 4 | 8GB | 80GB (local-lvm) | | |

Capacity Check

| Node | NVMe Committed | SATA Committed | Status |
|---|---|---|---|
| Vader | ~132GB | 200GB | ✅ Stable |
| Sidious | ~100GB | 200GB | ✅ Stable |
| Maul | ~80GB | 0GB | ✅ Stable |

Phase 3: Deploy Infrastructure

3.1 Initialise Terraform

cd terraform/environments/homelab

# Download provider plugins
terraform init

# Expected: "Terraform has been successfully initialized!"
# Provider bpg/proxmox will be downloaded from registry.terraform.io

3.2 Validate Configuration

# Check configuration syntax and references
terraform validate

# Expected: "Success! The configuration is valid."

3.3 Preview Changes

# Set your SSH public key for Cloud-Init injection
export TF_VAR_ssh_public_key=$(cat ~/.ssh/homelab.pub)

# Generate and review the execution plan
terraform plan -out=tfplan

# Review the output carefully:
# + resource "proxmox_virtual_environment_vm" means a VM will be created
# Read through all planned changes before applying

3.4 Apply Infrastructure

# Apply the plan (will prompt for confirmation)
terraform apply tfplan

# Monitor progress — VM cloning can take 2–5 minutes per VM
# Expected final line: "Apply complete! Resources: N added, 0 changed, 0 destroyed."

3.5 Verify VMs

# List all VMs via Proxmox API
terraform show

# Or via Proxmox CLI on each node:
ssh root@192.168.1.11 "qm list"  # vader VMs
ssh root@192.168.1.12 "qm list"  # sidious VMs
ssh root@192.168.1.10 "qm list"  # maul VMs

# Verify SSH access to K3s master (may take 60s for cloud-init to complete)
ssh -i ~/.ssh/homelab ubuntu@10.10.10.10 "uname -a"

Phase 4: pfSense Manual Configuration

pfSense cannot be fully automated via Terraform — the initial NIC assignment requires the console.

  1. Open Proxmox UI → pve-vader → VM 100 (pfSense-Primary) → Console
  2. Complete pfSense installer
  3. Assign interfaces:
    • vtnet0 → WAN (receives IP from home router via DHCP)
    • vtnet1 → LAN (10.10.10.1/24)
    • vtnet2 → OPT1 (10.20.20.1/24)
  4. Set firewall rule: Block OPT1 → LAN (sandbox isolation)
  5. In pfSense web UI (https://10.10.10.1): set DNS to point to AdGuard (10.10.10.2)

Completion Checklist

  • vg-longhorn registered in Proxmox UI as LVM-Thin storage on both vader and sidious
  • terraform init completed successfully
  • terraform validate returns "Success"
  • terraform plan reviewed — no unexpected changes
  • terraform apply completed — all VMs created
  • qm list on vader shows: pfSense (100), k3s-master-01 (200)
  • qm list on sidious shows: k3s-worker-01 (201)
  • qm list on maul shows: kali-linux (800) with status stopped
  • SSH to k3s-master-01 (ubuntu@10.10.10.10) succeeds
  • pfSense initial NIC assignment completed via console

Source: docs/guides/04-ansible-bootstrap.md


Guide 04: Ansible Bootstrap

Bootstrap services on provisioned VMs and LXCs using Ansible playbooks.


Overview

This guide uses Ansible to configure the base services on all provisioned infrastructure, including K3s prerequisites, AdGuard Home, and Tailscale.

Time Required: ~30 minutes
Prerequisites: Guide 03 completed, Ansible installed


Architecture

Ansible Playbooks
├── adguard.yml           # Configure AdGuard Home
├── tailscale.yml         # Configure Tailscale
├── k3s-prereqs.yml       # K3s prerequisites
└── common.yml            # Common configuration

Ansible Roles
├── common/               # Common tasks
├── adguard/              # AdGuard Home
├── tailscale/            # Tailscale VPN
└── k3s/                  # K3s Kubernetes

Phase 1: Create Ansible Configuration

1.1 Create ansible.cfg

Create ansible/ansible.cfg (this file has already been created — see repo root):
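
For reference, a minimal ansible.cfg for this layout might look like the following (illustrative only; the committed file is authoritative):

[defaults]
inventory         = inventories/homelab/hosts.yml
remote_user       = ubuntu
private_key_file  = ~/.ssh/homelab
roles_path        = roles
host_key_checking = False

[privilege_escalation]
become        = True
become_method = sudo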

Important

Before running any playbooks, install the required Ansible collections. These provide modules used throughout all roles.

ansible-galaxy collection install community.general ansible.posix

1.2 Verify Inventory

cd /Volumes/Codex/Projects/homelab/ansible

# Test inventory — all nodes must be reachable before proceeding
ansible -i inventories/homelab/hosts.yml all -m ping

# Expected output:
# k3s-master-01 | SUCCESS => {...}
# k3s-worker-01 | SUCCESS => {...}
# adguard       | SUCCESS => {...}
# tailscale     | SUCCESS => {...}

Phase 2: Create Common Role

2.1 Create Common Role Structure

mkdir -p ansible/roles/common/{tasks,handlers,files,templates}

2.2 Create Common Tasks

Create ansible/roles/common/tasks/main.yml:

---
# Common system configuration tasks

- name: Update apt cache
  ansible.builtin.apt:
    update_cache: true
    cache_valid_time: 3600
  become: true

- name: Upgrade all packages
  ansible.builtin.apt:
    upgrade: dist
    autoremove: true
    autoclean: true
  become: true

- name: Install common packages
  ansible.builtin.apt:
    name:
      - curl
      - wget
      - git
      - vim
      - htop
      - net-tools
      - dnsutils
      - jq
      - ca-certificates
      - gnupg
      - lsb-release
      - python3
      - python3-pip
    state: present
  become: true

- name: Set timezone to UTC
  community.general.timezone:
    name: UTC
  become: true

- name: Configure sysctl settings
  ansible.posix.sysctl:
    name: "{{ item.name }}"
    value: "{{ item.value }}"
    state: present
    reload: true
  loop:
    - { name: net.ipv4.ip_forward, value: 1 }
    - { name: net.ipv6.conf.all.forwarding, value: 1 }
    - { name: net.bridge.bridge-nf-call-iptables, value: 1 }
    - { name: net.bridge.bridge-nf-call-ip6tables, value: 1 }
  become: true
  when: inventory_hostname in groups['k3s_master'] + groups['k3s_workers']

- name: Disable swap
  ansible.builtin.command: swapoff -a
  become: true
  changed_when: false
  failed_when: false

- name: Remove swap from fstab
  ansible.builtin.lineinfile:
    path: /etc/fstab
    regexp: '^(.*\sswap\s+sw\s+.*)$'
    state: absent
  become: true

- name: Load kernel modules
  community.general.modprobe:
    name: "{{ item }}"
    state: present
  loop:
    - overlay
    - br_netfilter
    - nf_conntrack
  become: true

- name: Persist kernel modules
  ansible.builtin.copy:
    dest: "/etc/modules-load.d/{{ item }}.conf"
    content: "{{ item }}\n"
    mode: '0644'
  loop:
    - overlay
    - br_netfilter
  become: true

- name: Create homelab user
  ansible.builtin.user:
    name: homelab
    system: true
    shell: /bin/bash
    home: /opt/homelab
    create_home: true
  become: true

Phase 3: Create AdGuard Role

3.1 Create AdGuard Role Structure

mkdir -p ansible/roles/adguard/{tasks,handlers,templates}

3.2 Create AdGuard Tasks

Create ansible/roles/adguard/tasks/main.yml:

---
# AdGuard Home installation and configuration

- name: Download AdGuard Home
  ansible.builtin.get_url:
    url: "https://github.com/AdguardTeam/AdGuardHome/releases/latest/download/AdGuardHome_linux_amd64.tar.gz"
    dest: /tmp/AdGuardHome.tar.gz
    mode: '0644'
  become: true

- name: Create AdGuard directory
  ansible.builtin.file:
    path: /opt/AdGuardHome
    state: directory
    mode: '0755'
  become: true

- name: Extract AdGuard Home
  ansible.builtin.unarchive:
    src: /tmp/AdGuardHome.tar.gz
    dest: /opt                      # the tarball contains an AdGuardHome/ top-level directory
    remote_src: true
  become: true

- name: Install AdGuard Home
  ansible.builtin.command: /opt/AdGuardHome/AdGuardHome -s install
  args:
    chdir: /opt/AdGuardHome
    creates: /etc/systemd/system/AdGuardHome.service   # created by '-s install'
  become: true
  notify: restart adguard

- name: Start AdGuard Home
  ansible.builtin.systemd:
    name: AdGuardHome
    state: started
    enabled: true
    daemon_reload: true
  become: true

- name: Wait for AdGuard to be ready
  ansible.builtin.wait_for:
    port: 3000
    delay: 5
    timeout: 60

- name: Display AdGuard setup URL
  ansible.builtin.debug:
    msg: "AdGuard Home available at http://{{ ansible_host }}:3000"

Create ansible/roles/adguard/handlers/main.yml:

---
- name: restart adguard
  ansible.builtin.systemd:
    name: AdGuardHome
    state: restarted
  become: true

Phase 4: Create Tailscale Role

4.1 Create Tailscale Role Structure

mkdir -p ansible/roles/tailscale/{tasks,handlers}

4.2 Create Tailscale Tasks

Create ansible/roles/tailscale/tasks/main.yml:

---
# Tailscale installation and configuration
# Requires: ansible.posix collection (ansible-galaxy collection install ansible.posix)

- name: Create APT keyrings directory
  ansible.builtin.file:
    path: /etc/apt/keyrings
    state: directory
    mode: '0755'
  become: true

- name: Download Tailscale GPG key
  ansible.builtin.get_url:
    url: https://pkgs.tailscale.com/stable/ubuntu/noble.noarmor.gpg
    dest: /etc/apt/keyrings/tailscale.gpg
    mode: '0644'
  become: true

- name: Add Tailscale APT repository
  ansible.builtin.apt_repository:
    repo: "deb [signed-by=/etc/apt/keyrings/tailscale.gpg] https://pkgs.tailscale.com/stable/ubuntu noble main"
    state: present
    filename: tailscale
  become: true

- name: Install Tailscale
  ansible.builtin.apt:
    name:
      - tailscale
    state: present
    update_cache: true
  become: true

- name: Start Tailscale daemon
  ansible.builtin.systemd:
    name: tailscaled
    state: started
    enabled: true
  become: true

- name: Check Tailscale status
  ansible.builtin.command: tailscale status
  register: tailscale_status
  failed_when: false
  changed_when: false

- name: Display Tailscale authentication URL
  ansible.builtin.debug:
    msg: |
      Tailscale is not authenticated. Please authenticate manually:
      1. SSH to the node: ssh {{ ansible_user }}@{{ ansible_host }}
      2. Run: sudo tailscale up --advertise-routes=10.10.10.0/24,10.20.20.0/24
      3. Follow the browser prompt
  when: tailscale_status.rc != 0

Phase 5: Create Playbooks

5.1 Create Common Playbook

Create ansible/playbooks/common.yml:

---
# Apply common configuration to all hosts

- name: Apply common configuration
  hosts: all
  become: true
  gather_facts: true

  roles:
    - role: common
      tags: common

5.2 Create AdGuard Playbook

Create ansible/playbooks/adguard.yml:

---
# Deploy AdGuard Home

- name: Configure AdGuard Home
  hosts: network_services
  become: true
  gather_facts: true

  pre_tasks:
    - name: Include common role
      include_role:
        name: common
      tags: common

  roles:
    - role: adguard
      tags: adguard

5.3 Create Tailscale Playbook

Create ansible/playbooks/tailscale.yml:

---
# Deploy Tailscale

- name: Configure Tailscale
  hosts: network_services
  become: true
  gather_facts: true

  pre_tasks:
    - name: Include common role
      include_role:
        name: common
      tags: common

  roles:
    - role: tailscale
      tags: tailscale

5.4 Create K3s Prerequisites Playbook

Create ansible/playbooks/k3s-prereqs.yml:

---
# Prepare K3s nodes
# NOTE: Do NOT install the OS 'containerd' package here.
# K3s ships its own bundled containerd. Installing the OS package creates a
# conflicting socket at /run/containerd/containerd.sock and will cause K3s to fail.

- name: Prepare K3s nodes
  hosts: k3s_master, k3s_workers
  become: true
  gather_facts: true

  roles:
    - role: common
      tags: common

  post_tasks:
    - name: Install iSCSI client (required by Longhorn)
      ansible.builtin.apt:
        name:
          - open-iscsi
          - nfs-common
        state: present
        update_cache: true

    - name: Enable and start iSCSI daemon
      ansible.builtin.systemd:
        name: iscsid
        state: started
        enabled: true

    - name: Load required kernel modules for K3s networking
      community.general.modprobe:
        name: "{{ item }}"
        state: present
      loop:
        - br_netfilter
        - overlay
        - nf_conntrack

    - name: Persist kernel modules across reboots
      ansible.builtin.copy:
        dest: /etc/modules-load.d/k3s.conf
        content: |
          br_netfilter
          overlay
          nf_conntrack
        mode: '0644'

    - name: Configure required sysctl parameters for K3s
      ansible.posix.sysctl:
        name: "{{ item.key }}"
        value: "{{ item.value }}"
        state: present
        reload: true
      loop:
        - { key: "net.bridge.bridge-nf-call-iptables", value: "1" }
        - { key: "net.bridge.bridge-nf-call-ip6tables", value: "1" }
        - { key: "net.ipv4.ip_forward", value: "1" }

Phase 6: Run Playbooks

6.1 Apply Common Configuration

cd /Volumes/Codex/Projects/homelab/ansible

# Apply to all hosts
ansible-playbook -i inventories/homelab/hosts.yml playbooks/common.yml

# Expected output: All tasks completed successfully

6.2 Deploy AdGuard Home

# Deploy to network services
ansible-playbook -i inventories/homelab/hosts.yml playbooks/adguard.yml

# Expected output:
# TASK [adguard : Display AdGuard setup URL]
# ok: [adguard] => {
#     "msg": "AdGuard Home available at http://10.10.10.2:3000"
# }

6.3 Deploy Tailscale

# Deploy to network services
ansible-playbook -i inventories/homelab/hosts.yml playbooks/tailscale.yml

# Authenticate Tailscale manually on the node:
ssh -i ~/.ssh/homelab ubuntu@10.10.10.3
sudo tailscale up --advertise-routes=10.10.10.0/24,10.20.20.0/24

6.4 Apply K3s Prerequisites

# Prepare K3s nodes
ansible-playbook -i inventories/homelab/hosts.yml playbooks/k3s-prereqs.yml

# Verify iSCSI daemon is running on all K3s nodes
ansible -i inventories/homelab/hosts.yml k3s_master,k3s_workers \
  -m shell -a "systemctl status iscsid" --become

# Verify required kernel modules are loaded
ansible -i inventories/homelab/hosts.yml k3s_master,k3s_workers \
  -m shell -a "lsmod | grep -E 'br_netfilter|nf_conntrack'" --become

Phase 7: Verification

7.1 Verify AdGuard Home

# Check AdGuard is running
curl -s http://10.10.10.2:3000

# Should return HTML (AdGuard setup page)

Access in browser: http://10.10.10.2:3000

Complete the setup wizard:

  1. Create admin password
  2. Configure upstream DNS (Cloudflare: 1.1.1.1)
  3. Enable DNS filtering lists
  4. Set as DHCP DNS server on pfSense

7.2 Verify Tailscale

# Check Tailscale status
ssh -i ~/.ssh/homelab ubuntu@10.10.10.3 "sudo tailscale status"

# Should show:
# # Active connections:
# ...

7.3 Verify K3s Prerequisites

# Verify the OS 'containerd' package is NOT installed (K3s ships its own bundled containerd)
ansible -i inventories/homelab/hosts.yml k3s_master,k3s_workers -m shell -a "dpkg -s containerd 2>/dev/null || echo 'OK: no OS containerd'" --become

# Verify kernel modules
ansible -i inventories/homelab/hosts.yml k3s_master,k3s_workers -m shell -a "lsmod | grep -E 'overlay|br_netfilter'" --become

# Verify swap is disabled
ansible -i inventories/homelab/hosts.yml k3s_master,k3s_workers -m shell -a "free -h" --become

Phase 8: Create Group Variables

Create ansible/inventories/homelab/group_vars/all.yml:

---
# Common variables for all hosts

# Timezone
homelab_timezone: UTC

# DNS servers
homelab_dns_servers:
  - 10.10.10.2  # AdGuard Home
  - 1.1.1.1     # Cloudflare fallback

# NTP servers
homelab_ntp_servers:
  - pool.ntp.org

# Domain
homelab_domain: homelab.local

Create ansible/inventories/homelab/group_vars/k3s.yml:

---
# K3s cluster variables

# K3s version
k3s_version: "v1.28.5+k3s1"

# K3s configuration
k3s_cluster_cidr: "10.42.0.0/16"
k3s_service_cidr: "10.43.0.0/16"

# K3s server URL (for workers)
k3s_server_url: "https://10.10.10.10:6443"

# Disable Traefik (using ingress-nginx instead)
k3s_disable_traefik: true

Troubleshooting

Ansible Connection Issues

Issue: UNREACHABLE! => {"failed": true, "msg": "Failed to connect..."}

Solution:

# Test SSH manually
ssh -i ~/.ssh/homelab -vvv ubuntu@10.10.10.10

# Check if host is reachable
ping -c 3 10.10.10.10

# Verify ansible inventory
ansible-inventory -i inventories/homelab/hosts.yml --list

AdGuard Won't Start

Issue: AdGuard service fails to start

Solution:

# Check service logs
ssh -i ~/.ssh/homelab ubuntu@10.10.10.2 "sudo journalctl -u AdGuardHome -n 50"

# Check if port is already in use
ssh -i ~/.ssh/homelab ubuntu@10.10.10.2 "sudo netstat -tulpn | grep 3000"

# Restart manually
ssh -i ~/.ssh/homelab ubuntu@10.10.10.2 "sudo systemctl restart AdGuardHome"

Tailscale Authentication

Issue: Tailscale not authenticated

Solution:

# SSH to the node
ssh -i ~/.ssh/homelab ubuntu@10.10.10.3

# Run authentication command
sudo tailscale up --advertise-routes=10.10.10.0/24,10.20.20.0/24

# Follow browser prompt
# Approve subnet routes in Tailscale admin console

Next Steps

Base services configured:

➡️ Continue to Guide 05: K3s Cluster


Checklist

  • Ansible configuration created
  • Inventory verified
  • Common role created
  • AdGuard role created
  • Tailscale role created
  • Playbooks created
  • Common configuration applied
  • AdGuard Home deployed and configured
  • Tailscale deployed and authenticated
  • K3s prerequisites applied
  • All services verified

Source: docs/guides/05-k3s-cluster.md


Guide 05: K3s Kubernetes Cluster

Deploy a K3s Kubernetes cluster with one master node and two worker nodes.


Overview

This guide deploys a K3s Kubernetes cluster using Ansible. K3s is a lightweight, certified Kubernetes distribution perfect for edge computing and homelabs.

Time Required: ~45 minutes
Prerequisites: Guide 04 completed


Architecture

                    K3s Cluster
    ┌────────────────────────────────────────┐
    │                                        │
    │  ┌──────────────┐  ┌─────────────────┐ │
    │  │ k3s-master   │  │  k3s-worker-01  │ │
    │  │  10.10.10.10 │◄─┤   10.10.10.11   │ │
    │  │              │  │  (pve-vader)    │ │
    │  │  Control     │  │                 │ │
    │  │  Plane       │  │  Workloads      │ │
    │  └──────────────┘  └─────────────────┘ │
    │         │                               │
    │         │                               │
    │  ┌─────────────────┐                    │
    │  │ k3s-worker-02   │                    │
    │  │  10.10.10.12    │                    │
    │  │ (pve-sidious)   │                    │
    │  │                 │                    │
    │  │  Workloads      │                    │
    │  └─────────────────┘                    │
    └────────────────────────────────────────┘

Pod Network: 10.42.0.0/16 (Flannel)
Service Network: 10.43.0.0/16

Phase 1: Create K3s Ansible Role

1.1 Create Role Structure

mkdir -p ansible/roles/k3s/{tasks,handlers,templates,files}

1.2 Create K3s Master Tasks

Create ansible/roles/k3s/tasks/master.yml:

---
# K3s master installation tasks

- name: Create K3s config directory
  ansible.builtin.file:
    path: /etc/rancher/k3s
    state: directory
    mode: '0755'
  become: true

- name: Download K3s binary
  ansible.builtin.get_url:
    url: "https://github.com/k3s-io/k3s/releases/download/{{ k3s_version }}/k3s"
    dest: /usr/local/bin/k3s
    mode: '0755'
    owner: root
    group: root
  become: true

- name: Create K3s systemd service
  ansible.builtin.template:
    src: k3s.service.j2
    dest: /etc/systemd/system/k3s.service
    mode: '0644'
  become: true
  notify: restart k3s

- name: Enable and start K3s
  ansible.builtin.systemd:
    name: k3s
    state: started
    enabled: true
    daemon_reload: true
  become: true

- name: Wait for K3s to be ready
  ansible.builtin.wait_for:
    port: 6443
    delay: 10
    timeout: 300

- name: Get K3s node token
  ansible.builtin.slurp:
    src: /var/lib/rancher/k3s/server/node-token
  register: k3s_token
  become: true
  failed_when: false

- name: Save K3s token to file
  ansible.builtin.copy:
    content: "{{ k3s_token.content | b64decode }}"
    dest: /tmp/k3s-token
    mode: '0600'
  become: true
  delegate_to: localhost
  when: k3s_token.content is defined

- name: Display K3s master info
  ansible.builtin.debug:
    msg:
      - "K3s Master deployed successfully!"
      - "API Server: https://{{ ansible_host }}:6443"
      - "Node token saved to /tmp/k3s-token"

1.3 Create K3s Worker Tasks

Create ansible/roles/k3s/tasks/worker.yml:

---
# K3s worker installation tasks

- name: Download K3s binary
  ansible.builtin.get_url:
    url: "https://github.com/k3s-io/k3s/releases/download/{{ k3s_version }}/k3s"
    dest: /usr/local/bin/k3s
    mode: '0755'
    owner: root
    group: root
  become: true

- name: Create K3s config directory
  ansible.builtin.file:
    path: /etc/rancher/k3s
    state: directory
    mode: '0755'
  become: true

- name: Create K3s agent systemd service
  ansible.builtin.template:
    src: k3s-agent.service.j2
    dest: /etc/systemd/system/k3s-agent.service
    mode: '0644'
  become: true
  notify: restart k3s-agent

- name: Enable and start K3s agent
  ansible.builtin.systemd:
    name: k3s-agent
    state: started
    enabled: true
    daemon_reload: true
  become: true

- name: Wait for node to register
  ansible.builtin.pause:
    seconds: 30

- name: Display worker node info
  ansible.builtin.debug:
    msg: "K3s Worker {{ inventory_hostname }} deployed successfully!"

1.4 Create Service Templates

Create ansible/roles/k3s/templates/k3s.service.j2:

[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
Wants=network-online.target
After=network-online.target

[Service]
Type=notify
EnvironmentFile=-/etc/default/%i
EnvironmentFile=-/etc/sysconfig/%i
EnvironmentFile=-/etc/rancher/k3s/%i.conf

ExecStartPre=/bin/sh -xc '! test -f /etc/rancher/k3s/k3s-lock || exit 1; touch /etc/rancher/k3s/k3s-lock; rm -f /etc/rancher/k3s/k3s-lock'
ExecStart=/usr/local/bin/k3s \
    server \
    '--tls-san={{ ansible_host }}' \
    '--tls-san=k3s-master-01.homelab.local' \
    '--cluster-init' \
    {% if k3s_disable_traefik %}'--disable=traefik'{% endif %} \
    '--disable-cloud-controller' \
    '--write-kubeconfig-mode=644' \
    '--node-name={{ ansible_hostname }}'

KillMode=process
Delegate=yes
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target

Create ansible/roles/k3s/templates/k3s-agent.service.j2:

[Unit]
Description=Lightweight Kubernetes Agent
Documentation=https://k3s.io
Wants=network-online.target
After=network-online.target

[Service]
Type=exec
EnvironmentFile=-/etc/default/%i
EnvironmentFile=-/etc/sysconfig/%i

ExecStart=/usr/local/bin/k3s \
    agent \
    '--server={{ k3s_server_url }}' \
    '--token={{ k3s_token }}' \
    '--node-name={{ ansible_hostname }}'

KillMode=process
Delegate=yes
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target

1.5 Create Main Role File

Create ansible/roles/k3s/tasks/main.yml:

---
# Use import_tasks (static) so the k3s_master / k3s_worker tags propagate
# to the imported tasks when running with --tags.
- name: Import master tasks
  ansible.builtin.import_tasks: master.yml
  when: inventory_hostname in groups['k3s_master']
  tags: k3s_master

- name: Import worker tasks
  ansible.builtin.import_tasks: worker.yml
  when: inventory_hostname in groups['k3s_workers']
  tags: k3s_worker

Create ansible/roles/k3s/handlers/main.yml:

---
- name: restart k3s
  ansible.builtin.systemd:
    name: k3s
    state: restarted
  become: true

- name: restart k3s-agent
  ansible.builtin.systemd:
    name: k3s-agent
    state: restarted
  become: true

Phase 2: Create K3s Playbook

2.1 Create K3s Installation Playbook

Create ansible/playbooks/k3s-install.yml:

---
# Deploy K3s Kubernetes cluster

- name: Deploy K3s master
  hosts: k3s_master
  become: true
  gather_facts: true

  roles:
    - role: k3s
      tags: k3s

- name: Get K3s token
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Read K3s token
      ansible.builtin.slurp:
        src: /tmp/k3s-token
      register: k3s_token_file
      failed_when: false
      tags: always

    - name: Set token fact
      ansible.builtin.set_fact:
        k3s_token: "{{ k3s_token_file.content | b64decode | trim }}"
      when: k3s_token_file.content is defined
      tags: always

- name: Deploy K3s workers
  hosts: k3s_workers
  become: true
  gather_facts: true

  roles:
    - role: k3s
      tags: k3s

  vars:
    k3s_token: "{{ hostvars[groups['k3s_master'][0]]['k3s_token'] | default(hostvars['localhost']['k3s_token']) }}"

Phase 3: Deploy K3s Cluster

3.1 Deploy K3s Master

cd /Volumes/Codex/Projects/homelab/ansible

# Deploy K3s master only
ansible-playbook -i inventories/homelab/hosts.yml playbooks/k3s-install.yml --tags k3s_master

# Expected output:
# TASK [k3s : Display K3s master info]
# ok: [k3s-master-01] => {
#     "msg": [
#         "K3s Master deployed successfully!",
#         "API Server: https://10.10.10.10:6443"
#     ]
# }

3.2 Retrieve Kubeconfig

# Copy kubeconfig from master
ssh -i ~/.ssh/homelab ubuntu@10.10.10.10 "sudo cat /etc/rancher/k3s/k3s.yaml" > ~/.kube/homelab-config

# Fix server address
sed -i '' 's/127.0.0.1/10.10.10.10/g' ~/.kube/homelab-config

# Set KUBECONFIG
export KUBECONFIG=$HOME/.kube/homelab-config

# Test connectivity
kubectl get nodes

# Expected output:
# NAME           STATUS   ROLES                       AGE   VERSION
# k3s-master-01  Ready    control-plane,etcd,master   10s   v1.28.5+k3s1

3.3 Deploy K3s Workers

# Deploy K3s workers
ansible-playbook -i inventories/homelab/hosts.yml playbooks/k3s-install.yml --tags k3s_worker

# Expected output:
# TASK [k3s : Display worker node info]
# ok: [k3s-worker-01] => "K3s Worker k3s-worker-01 deployed successfully!"
# ok: [k3s-worker-02] => "K3s Worker k3s-worker-02 deployed successfully!"

3.4 Verify Cluster

# Check all nodes
kubectl get nodes -o wide

# Expected output:
# NAME           STATUS   ROLES                       AGE   VERSION
# k3s-master-01  Ready    control-plane,etcd,master   1m    v1.28.5+k3s1
# k3s-worker-01  Ready    <none>                      30s   v1.28.5+k3s1
# k3s-worker-02  Ready    <none>                      30s   v1.28.5+k3s1

# Check system pods
kubectl get pods -n kube-system

# Expected output: All pods Running or Completed

Phase 4: Post-Installation Configuration

4.1 Install Helm

# Download helm binary
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod +x get_helm.sh
./get_helm.sh

# Verify
helm version

4.2 Add Helm Repositories

# Add common repositories
helm repo add jetstack https://charts.jetstack.io
helm repo add longhorn https://charts.longhorn.io
helm repo add argo https://argoproj.github.io/argo-helm
helm repo add hashicorp https://helm.releases.hashicorp.com
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update

4.3 Install cert-manager

# Install CRDs
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.1/cert-manager.crds.yaml

# Create namespace
kubectl create namespace cert-manager

# Install cert-manager
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --version v1.13.1 \
  --set installCRDs=true

# Verify
kubectl get pods -n cert-manager
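
cert-manager issues nothing until an issuer exists. For internal homelab certificates, a self-signed ClusterIssuer is enough to start with (a minimal sketch; swap in an ACME or private-CA issuer later):

cat << EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: homelab-selfsigned
spec:
  selfSigned: {}
EOF

# Verify
kubectl get clusterissuer homelab-selfsigned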

4.4 Disable Traefik (Optional)

Already disabled during installation. Verify:

kubectl get pods -n kube-system | grep traefik

# Should return nothing (Traefik is disabled)

Phase 5: Cluster Verification

5.1 Test Pod Deployment

# Create test namespace
kubectl create namespace test

# Deploy nginx
kubectl create deployment nginx --image=nginx:latest -n test

# Expose service
kubectl expose deployment nginx --port=80 --type=NodePort -n test

# Get pod info
kubectl get pods -n test -o wide

# Get service info
kubectl get svc -n test

5.2 Test NodePort Access

# Get NodePort
NODE_PORT=$(kubectl get svc nginx -n test -o jsonpath='{.spec.ports[0].nodePort}')

# Test access from local machine
curl -I http://10.10.10.10:$NODE_PORT

# Should return: HTTP/1.1 200 OK

5.3 Clean Up Test Resources

# Delete test namespace
kubectl delete namespace test

Phase 6: Configure kubectl Autocomplete

# Add to shell config
cat >> ~/.zshrc << 'EOF'
# K3s Homelab
export KUBECONFIG=$HOME/.kube/homelab-config
alias k=kubectl
source <(kubectl completion zsh)
compdef _kubectl k
EOF

# Reload shell
source ~/.zshrc

# Test
k get nodes

Troubleshooting

K3s Master Won't Start

Issue: K3s service fails to start

Solution:

# Check service logs
ssh -i ~/.ssh/homelab ubuntu@10.10.10.10 "sudo journalctl -u k3s -n 50 --no-pager"

# Check if port 6443 is available
ssh -i ~/.ssh/homelab ubuntu@10.10.10.10 "sudo netstat -tulpn | grep 6443"

# Restart K3s
ssh -i ~/.ssh/homelab ubuntu@10.10.10.10 "sudo systemctl restart k3s"

Worker Nodes Can't Join

Issue: Workers stuck in "NotReady" state

Solution:

# Check worker logs
ssh -i ~/.ssh/homelab ubuntu@10.10.10.11 "sudo journalctl -u k3s-agent -n 50 --no-pager"

# Verify token is correct
ssh -i ~/.ssh/homelab ubuntu@10.10.10.10 "sudo cat /var/lib/rancher/k3s/server/node-token"

# Check firewall on master
ssh -i ~/.ssh/homelab ubuntu@10.10.10.10 "sudo iptables -L -n | grep 6443"

# Restart worker agent
ssh -i ~/.ssh/homelab ubuntu@10.10.10.11 "sudo systemctl restart k3s-agent"

Kubeconfig Connection Refused

Issue: kubectl get nodes returns connection refused

Solution:

# Verify kubeconfig points to correct IP
grep server: ~/.kube/homelab-config

# Should be: server: https://10.10.10.10:6443

# Test API server directly
curl -k https://10.10.10.10:6443/version

# Check if master node is ready
ssh -i ~/.ssh/homelab ubuntu@10.10.10.10 "sudo systemctl status k3s"

Next Steps

K3s cluster deployed and verified:

➡️ Continue to Guide 06: Longhorn Storage


Checklist

  • K3s Ansible role created
  • K3s installation playbook created
  • K3s master deployed
  • Kubeconfig retrieved and configured
  • kubectl connectivity verified
  • K3s workers deployed
  • All nodes in Ready state
  • System pods running
  • Helm installed
  • Helm repositories added
  • cert-manager installed
  • Test deployment successful
  • Test deployment cleaned up
  • kubectl autocomplete configured

Source: docs/guides/06-longhorn-storage.md


Guide 06: Longhorn Distributed Storage

Install and configure Longhorn for distributed block storage across your K3s cluster.


Overview

Longhorn is a lightweight, reliable distributed block storage solution for Kubernetes. This guide installs Longhorn with proper storage classes for different workload types.

Time Required: ~30 minutes
Prerequisites: Guide 05 completed, K3s cluster running


Architecture

                    Longhorn Storage Cluster
    ┌────────────────────────────────────────────────┐
    │                                                │
    │  Each node contributes 80GB for storage       │
    │                                                │
    │  ┌────────────┐  ┌────────────┐  ┌───────────┐ │
    │  │  k3s-      │  │  k3s-      │  │  k3s-     │ │
    │  │  master    │  │  worker-1  │  │ worker-2  │ │
    │  │            │  │            │  │           │ │
    │  │ Engine:    │  │ Engine:    │  │ Engine:   │ │
    │  │ Replica Mgr│  │ Replica Mgr│  │ Replica   │ │
    │  │ 80GB pool  │  │ 80GB pool  │  │ 80GB pool │ │
    │  └────────────┘  └────────────┘  └───────────┘ │
    │        │               │              │        │
    │        └───────────────┴──────────────┘        │
    │                       │                         │
    │              3-way replication                 │
    │              (for critical data)                │
    └────────────────────────────────────────────────┘

Storage Classes:
├── longhorn-critical (replicas: 3) - Vault, Gitea
├── longhorn-default   (replicas: 2) - Monitoring, Apps
└── longhorn-ephemeral (replicas: 1) - Cache, Temp data

Phase 1: Prerequisites

1.1 Verify Node Storage

# Check available disk space on each node
kubectl get nodes -o custom-columns=NAME:.metadata.name,DISK:.status.capacity[ephemeral-storage]

# SSH to nodes to verify
ssh -i ~/.ssh/homelab ubuntu@10.10.10.10 "df -h /var/lib/rancher/k3s"
ssh -i ~/.ssh/homelab ubuntu@10.10.10.11 "df -h /var/lib/rancher/k3s"
ssh -i ~/.ssh/homelab ubuntu@10.10.10.12 "df -h /var/lib/rancher/k3s"

1.2 Install Required Dependencies

# Install NFS client on all nodes (required for Longhorn)
ansible -i ansible/inventories/homelab/hosts.yml k3s_master,k3s_workers -m shell -a "sudo apt install -y open-iscsi nfs-common" --become

# Start and enable iscsi service
ansible -i ansible/inventories/homelab/hosts.yml k3s_master,k3s_workers -m shell -a "sudo systemctl enable --now iscsid" --become

Phase 2: Install Longhorn

2.1 Add Longhorn Helm Repository

helm repo add longhorn https://charts.longhorn.io
helm repo update

2.2 Create Namespace

kubectl create namespace longhorn-system

2.3 Install Longhorn

helm install longhorn longhorn/longhorn \
  --namespace longhorn-system \
  --set defaultSettings.defaultReplicaCount=2 \
  --set defaultSettings.defaultDataPath="/var/lib/longhorn" \
  --set defaultSettings.storageMinimalAvailablePercentage=10 \
  --set defaultSettings.targetBackupCount=2 \
  --set persistence.defaultClassReplicaCount=2 \
  --set csi.kubeletRootDir=/var/lib/kubelet \
  --wait

2.4 Verify Installation

# Check Longhorn pods
kubectl get pods -n longhorn-system -w

# Expected output (after ~2-3 minutes):
# NAME                                        READY   STATUS    RESTARTS   AGE
# longhorn-driver-deployer-xxx                1/1     Running   0          2m
# instance-manager-xxx                        1/1     Running   0          2m
# engine-image-xxx                            1/1     Running   0          2m
# ...

Phase 3: Create Storage Classes

3.1 Create Storage Class Manifests

Create file gitops-apps/infrastructure/longhorn-storageclasses.yaml:

---
# Longhorn Storage Classes
# Applied via: kubectl apply -f gitops-apps/infrastructure/longhorn-storageclasses.yaml
#
# Replica strategy for a 2-node active cluster (vader + sidious):
#   longhorn-default  = 2 replicas  ← DEFAULT (survives 1 node loss)
#   longhorn-critical = 3 replicas  (will show Degraded when Maul is off - acceptable)
#   longhorn-ephemeral = 1 replica  (cache / scratch data only)

---
# Default Storage Class (2 replicas) — safe on 2-node cluster
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-default
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "2880"
  fromBackup: ""
  fsType: "ext4"
  dataEngine: "v1"
reclaimPolicy: Delete
volumeBindingMode: Immediate

---
# Critical Storage Class (3 replicas — HA, for Vault and Gitea)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-critical
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "2880"
  fromBackup: ""
  fsType: "ext4"
  dataEngine: "v1"
reclaimPolicy: Delete
volumeBindingMode: Immediate

---
# Ephemeral Storage Class (1 replica — cache/temp only)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-ephemeral
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "1"
  staleReplicaTimeout: "2880"
  fromBackup: ""
  fsType: "ext4"
  dataEngine: "v1"
reclaimPolicy: Delete
volumeBindingMode: Immediate

3.2 Apply Storage Classes

kubectl apply -f gitops-apps/infrastructure/longhorn-storageclasses.yaml

# Verify storage classes
kubectl get storageclass

# Expected output:
# NAME                         PROVISIONER           RECLAIMPOLICY
# longhorn-default (default)   driver.longhorn.io    Delete
# longhorn-critical            driver.longhorn.io    Delete
# longhorn-ephemeral           driver.longhorn.io    Delete

Phase 4: Configure Longhorn

4.1 Access Longhorn UI

# Port forward to access UI
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80

# Access at: http://localhost:8080
# Note: the Longhorn UI has no built-in authentication; keep access limited to the
# port-forward above, or add basic auth at the ingress when you expose it later
# (see the sketch below).
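
If you later expose the UI through ingress-nginx, protect it with basic auth; a sketch using standard ingress-nginx annotations (the hostname is an assumption):

# Create an htpasswd secret for the UI (requires apache2-utils / httpd-tools locally)
htpasswd -c auth admin
kubectl create secret generic longhorn-auth -n longhorn-system --from-file=auth
rm auth

cat << EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: longhorn-ui
  namespace: longhorn-system
  annotations:
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: longhorn-auth
    nginx.ingress.kubernetes.io/auth-realm: "Longhorn UI"
spec:
  ingressClassName: nginx
  rules:
    - host: longhorn.homelab.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: longhorn-frontend
                port:
                  number: 80
EOF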

4.2 Adjust Longhorn Settings (Optional)

# Set the image pull policy for Longhorn system-managed components
# (Longhorn Setting objects carry their value at the top level, matching the backup-target patches below)
kubectl patch -n longhorn-system settings.longhorn.io system-managed-components-pods-image-pull-policy \
  --type=merge -p '{"value":"IfNotPresent"}'

# Configure recurring snapshot limit (optional)
kubectl patch -n longhorn-system settings.longhorn.io recurring-job-max \
  --type=merge -p '{"value":"5"}'

Phase 5: Test Longhorn

5.1 Deploy Test PVC

# Create test namespace
kubectl create namespace test-storage

# Create test PVC
cat << EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
  namespace: test-storage
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn-critical
  resources:
    requests:
      storage: 1Gi
EOF

5.2 Deploy Test Pod

cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
  namespace: test-storage
spec:
  containers:
  - name: test
    image: nginx:latest
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-pvc
EOF

5.3 Verify Volume Replication

# Check volume status
kubectl get volumes.longhorn.io -n longhorn-system

# Describe volume to see replica distribution
kubectl describe volumes.longhorn.io -n longhorn-system pvc-xxx

# Expected: replicas distributed across 3 nodes

5.4 Test Failover

# Simulate node failure (cordoning)
kubectl cordon k3s-worker-01

# Verify volume remains accessible
kubectl exec -n test-storage test-pod -- ls /data

# Uncordon node
kubectl uncordon k3s-worker-01

Phase 6: Clean Up Test Resources

# Delete test namespace
kubectl delete namespace test-storage

# Verify all volumes cleaned up
kubectl get volumes.longhorn.io -n longhorn-system

Phase 7: Configure Backups (Optional)

7.1 Configure Backup Target

If you have NFS or S3 storage for backups:

# Example: NFS backup target
kubectl patch -n longhorn-system settings.longhorn.io backup-target -p '{"value":"nfs://192.168.1.100:/backups/longhorn"}'

# Example: S3 backup target
kubectl patch -n longhorn-system settings.longhorn.io backup-target -p '{"value":"s3://backup-bucket@region/"}'

7.2 Configure Backup Credentials

# For S3, create secret
kubectl create secret generic -n longhorn-system longhorn-backup-secret \
  --from-literal=AWS_ACCESS_KEY_ID=your-key \
  --from-literal=AWS_SECRET_ACCESS_KEY=your-secret \
  --from-literal=AWS_ENDPOINTS=https://s3.amazonaws.com

# Update settings
kubectl patch -n longhorn-system settings.longhorn.io backup-target-credential-secret -p '{"value":"longhorn-backup-secret"}'

Troubleshooting

Longhorn Pods Pending

Issue: Longhorn pods stuck in Pending state

Solution:

# Describe pod to see issue
kubectl describe pod -n longhorn-system <pod-name>

# Common issue: node selector mismatch
kubectl get nodes --show-labels

# Fix: Add required labels to nodes
kubectl label nodes k3s-master-01 node.longhorn.io/create-default-disk=true

Volumes Not Provisioning

Issue: PVC stuck in Pending state

Solution:

# Check PVC events
kubectl describe pvc <pvc-name>

# Check Longhorn engine logs
kubectl logs -n longhorn-system -l app=longhorn-manager -f

# Verify node disk space
kubectl get nodes -o custom-columns=NAME:.metadata.name,DISK:.status.capacity[ephemeral-storage]

Replica Scheduling Issues

Issue: Replicas not distributed evenly

Solution:

# Check per-node disk capacity and replica counts as Longhorn sees them
kubectl get nodes.longhorn.io -n longhorn-system

# Inspect scheduling details for a specific node
kubectl describe nodes.longhorn.io -n longhorn-system k3s-worker-01

Next Steps

Longhorn storage installed and configured:

➡️ Continue to Guide 07: GitOps Stack


Checklist

  • Node storage verified
  • iSCSI and NFS clients installed
  • Longhorn Helm repository added
  • Longhorn installed
  • All Longhorn pods running
  • Storage classes created
  • Longhorn UI accessible
  • Admin password retrieved
  • Test PVC created
  • Test pod deployed
  • Volume replication verified
  • Failover tested
  • Test resources cleaned up
  • Backup configured (optional)

Source: docs/guides/07-gitops-stack.md


Guide 07: GitOps Stack (Gitea + ArgoCD)

Deploy a self-hosted Git platform and GitOps controller for declarative infrastructure management.


Overview

This guide deploys Gitea (self-hosted Git) and ArgoCD (GitOps controller) to enable Infrastructure as Code workflows with declarative Kubernetes management.

Time Required: ~45 minutes
Prerequisites: Guide 06 completed, Longhorn storage available


Architecture

┌─────────────────────────────────────────────────────────────┐
│                        GitOps Flow                          │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Developer            Gitea              ArgoCD              │
│  (local)              (Git)            (Controller)          │
│     │                   │                   │                │
│     ├─ git push ──────►│                   │                │
│     │                   ├─ webhook ───────►│                 │
│     │                   │                   │                │
│     │                   │◄── fetch ─────────┤                │
│     │                   │                   │                │
│     │                   │                   ├─ sync ───────►│
│     │                   │                   │    K8s         │
│     │                   │                   │                │
│  ┌──┴───────────────────┴───────────────────┴────────┐      │
│  │              kubectl get pods -A                     │      │
│  └────────────────────────────────────────────────────┘      │
│                                                              │
│  Services:                                                   │
│  ├── Gitea (git.homelab.local)                              │
│  ├── ArgoCD (argocd.homelab.local)                          │
│  └── Repositories:                                          │
│      ├── gitops-apps                                        │
│      ├── gitops-infrastructure                              │
│      └── ansible-playbooks                                  │
└─────────────────────────────────────────────────────────────┘

Phase 1: Deploy PostgreSQL

1.1 Create PostgreSQL Namespace

kubectl create namespace postgresql

1.2 Deploy PostgreSQL

helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

# Store passwords in a values file — do NOT inline credentials in shell history
# Create gitops-apps/infrastructure/postgresql-values.yaml (add to .gitignore!)
cat > /tmp/postgresql-values.yaml <<'EOF'
auth:
  enablePostgresUser: true
  postgresPassword: "<set-a-strong-password>"
  database: gitea
  username: gitea
  password: "<set-a-strong-password>"
primary:
  persistence:
    enabled: true
    storageClass: longhorn-critical
    size: 10Gi
EOF

helm install postgresql bitnami/postgresql \
  --namespace postgresql \
  --create-namespace \
  --values /tmp/postgresql-values.yaml \
  --wait

1.3 Get PostgreSQL Connection String

# Get PostgreSQL password
export POSTGRES_PASSWORD=$(kubectl get secret -n postgresql postgresql -o jsonpath="{.data.postgres-password}" | base64 -d)

# Get connection details
POSTGRES_HOST="postgresql.postgresql.svc.cluster.local"
POSTGRES_PORT="5432"
POSTGRES_USER="postgres"
POSTGRES_DB="gitea"

echo "PostgreSQL connection string:"
echo "postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@${POSTGRES_HOST}:${POSTGRES_PORT}/${POSTGRES_DB}"

Phase 2: Deploy Gitea

2.1 Create Gitea Namespace

kubectl create namespace gitea

2.2 Install Gitea

helm repo add gitea-charts https://dl.gitea.com/charts/
helm repo update

helm install gitea gitea-charts/gitea \
  --namespace gitea \
  --set gitea.config.server.DOMAIN=git.homelab.local \
  --set gitea.config.server.ROOT_URL=https://git.homelab.local \
  --set gitea.config.server.SSH_DOMAIN=git.homelab.local \
  --set gitea.config.server.SSH_PORT=2222 \
  --set gitea.config.database.DB_TYPE=postgres \
  --set gitea.config.database.HOST=${POSTGRES_HOST}:${POSTGRES_PORT} \
  --set gitea.config.database.NAME=${POSTGRES_DB} \
  --set gitea.config.database.USER=${POSTGRES_USER} \
  --set gitea.config.database.PASSWD=${POSTGRES_PASSWORD} \
  --set gitea.admin.username=admin \
  --set gitea.admin.password=ChangeMe!123 \
  --set gitea.admin.email=admin@homelab.local \
  --set persistence.enabled=true \
  --set persistence.storageClass=longhorn-critical \
  --set persistence.size=10Gi \
  --set service.ssh.type=LoadBalancer \
  --set service.ssh.ports.ssh=2222 \
  --set ingress.enabled=true \
  --set ingress.className=nginx \
  --set ingress.hosts[0].host=git.homelab.local \
  --set ingress.hosts[0].paths[0].path=/ \
  --set ingress.hosts[0].paths[0].pathType=Prefix \
  --set ingress.tls[0].hosts[0]=git.homelab.local \
  --set ingress.tls[0].secretName=git-homelab-tls \
  --wait

2.3 Configure Ingress (Manual)

If ingress-nginx is not yet installed, create service via NodePort:

# Patch service to NodePort for now
kubectl patch svc gitea-http -n gitea -p '{"spec":{"type":"NodePort","ports":[{"port":3000,"targetPort":3000,"nodePort":30080}],"selector":{"app.kubernetes.io/name":"gitea"}}}'

# Access Gitea
# http://10.10.10.10:30080

2.4 Access Gitea

  1. Open browser: http://10.10.10.10:30080 (or via ingress if configured)
  2. Login with admin credentials
    • Username: admin
    • Password: ChangeMe!123

Phase 3: Initialize Gitea Repositories

3.1 Create Organization

Via Gitea UI:

  1. Navigate to Organizations > Create Organization
  2. Name: homelab
  3. Visibility: Private
  4. Create

3.2 Create Repositories

Create the following repositories in the homelab organization:

| Repository            | Description              | Visibility |
|-----------------------|--------------------------|------------|
| gitops-apps           | Application manifests    | Private    |
| gitops-infrastructure | Infrastructure manifests | Private    |
| terraform-proxmox     | Terraform code           | Private    |
| ansible-playbooks     | Ansible playbooks        | Private    |

3.3 Get Git Credentials

# Generate personal access token in Gitea:
# User Settings > Applications > Generate Token

# Store token securely
echo "export GITEA_TOKEN=your_token_here" >> ~/.zshrc
source ~/.zshrc
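
With the token exported, the organization and repositories can also be created from the command line via the Gitea API instead of the UI (a sketch using the standard v1 endpoints; adjust the base URL if you access Gitea via the NodePort):

GITEA_URL=https://git.homelab.local

# Create the organization
curl -sk -X POST "$GITEA_URL/api/v1/orgs" \
  -H "Authorization: token $GITEA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"username": "homelab", "visibility": "private"}'

# Create each repository inside the organization
for repo in gitops-apps gitops-infrastructure terraform-proxmox ansible-playbooks; do
  curl -sk -X POST "$GITEA_URL/api/v1/orgs/homelab/repos" \
    -H "Authorization: token $GITEA_TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"name\": \"$repo\", \"private\": true}"
done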

Phase 4: Deploy ArgoCD

4.1 Create ArgoCD Namespace

kubectl create namespace argocd

4.2 Install ArgoCD

helm install argocd argo/argo-cd \
  --namespace argocd \
  --set server.service.type=NodePort \
  --set server.service.nodePortHttp=30081 \
  --set server.service.nodePortHttps=30443 \
  --wait

# After install, set a strong admin password via argocd CLI:
# argocd admin initial-password -n argocd   ← get the auto-generated initial password
# argocd login 10.10.10.10:30081
# argocd account update-password

4.3 Access ArgoCD UI

# Port forward to access UI
kubectl port-forward -n argocd svc/argocd-server 8080:443

# Access at: https://localhost:8080
# Accept self-signed certificate

4.4 Get Initial Admin Password

# The password is set to "admin" via the hashed password above
# Username: admin
# Password: admin

Change password on first login.


Phase 5: Configure ArgoCD

5.1 Connect ArgoCD to Gitea

Via ArgoCD UI:

  1. Navigate to Settings > Repositories
  2. Click Connect Repo
  3. Select Git
  4. Enter details:
    • Repository URL: https://git.homelab.local/homelab/gitops-apps.git
    • Username: admin
    • Password: <Gitea password or token>
    • Skip server verification: true
  5. Click Connect
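
The same connection can also be declared as a Kubernetes Secret labelled argocd.argoproj.io/secret-type: repository, which ArgoCD picks up automatically (a sketch; keep the real token out of Git, e.g. via Vault):

apiVersion: v1
kind: Secret
metadata:
  name: gitea-gitops-apps
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository
stringData:
  type: git
  url: https://git.homelab.local/homelab/gitops-apps.git
  username: admin
  password: <gitea-token>
  insecure: "true"   # skip TLS verification for the self-signed certificate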

5.2 Create ArgoCD Projects

Create project manifest gitops-apps/argocd-apps/projects.yaml:

---
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: homelab
  namespace: argocd
spec:
  description: Homelab Project
  sourceRepos:
    - '*'
  destinations:
    - namespace: '*'
      server: https://kubernetes.default.svc
  clusterResourceWhitelist:
    - group: '*'
      kind: '*'
  namespaceResourceWhitelist:
    - group: '*'
      kind: '*'
  orphanedResources:
    warn: false

Apply the project:

kubectl apply -f gitops-apps/argocd-apps/projects.yaml

Phase 6: Create App of Apps Pattern

6.1 Create Root Application

Create gitops-apps/argocd-apps/root-application.yaml:

---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root-app
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: homelab

  source:
    repoURL: https://git.homelab.local/homelab/gitops-apps.git
    targetRevision: main
    path: argocd-apps

  destination:
    server: https://kubernetes.default.svc
    namespace: argocd

  syncPolicy:
    automated:
      prune: true
      selfHeal: true
      allowEmpty: false
    syncOptions:
      - CreateNamespace=true
      - PruneLast=true

6.2 Create Application Directory Structure

cd /Volumes/Codex/Projects/homelab/gitops-apps

mkdir -p argocd-apps/{infrastructure,services,monitoring,security}

6.3 Create Infrastructure Applications

Create argocd-apps/infrastructure/longhorn.yaml:

---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: longhorn
  namespace: argocd
spec:
  project: homelab
  source:
    repoURL: https://git.homelab.local/homelab/gitops-apps.git
    targetRevision: main
    path: infrastructure/longhorn
  destination:
    server: https://kubernetes.default.svc
    namespace: longhorn-system
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

6.4 Push to Gitea

# Initialize git repository
cd /Volumes/Codex/Projects/homelab/gitops-apps

git init
git add .
git commit -m "Initial ArgoCD apps structure"

# Add Gitea remote
git remote add origin https://git.homelab.local/homelab/gitops-apps.git

# Push (will prompt for credentials)
git push -u origin main

6.5 Deploy Root Application

kubectl apply -f gitops-apps/argocd-apps/root-application.yaml

Phase 7: Verify GitOps Workflow

7.1 Test Workflow

# Create a test application
cat > gitops-apps/services/test-nginx.yaml << EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: test-nginx
  namespace: argocd
spec:
  project: homelab
  source:
    repoURL: https://git.homelab.local/homelab/gitops-apps.git
    targetRevision: main
    path: apps/test
  destination:
    server: https://kubernetes.default.svc
    namespace: test
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
EOF

# Create test deployment
mkdir -p gitops-apps/apps/test
cat > gitops-apps/apps/test/deployment.yaml << EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  selector:
    app: nginx
  ports:
  - port: 80
    targetPort: 80
EOF

# Push to git
git add gitops-apps/services/test-nginx.yaml gitops-apps/apps/test/
git commit -m "Add test application"
git push

# Watch ArgoCD sync
kubectl get application -n argocd -w

Troubleshooting

ArgoCD Can't Connect to Gitea

Issue: Repository connection fails

Solution:

# Verify Gitea is accessible
curl -k https://git.homelab.local/api/v1/version

# Check ArgoCD repo server logs
kubectl logs -n argocd -l app.kubernetes.io/name=argocd-repo-server -f

# Verify credentials against the Gitea API directly
curl -sk -u admin:<password-or-token> https://git.homelab.local/api/v1/repos/homelab/gitops-apps

Gitea Database Connection Issues

Issue: Gitea can't connect to PostgreSQL

Solution:

# Check PostgreSQL is running
kubectl get pods -n postgresql

# Check Gitea logs
kubectl logs -n gitea -l app.kubernetes.io/name=gitea -f

# Verify connection string
kubectl get secret -n gitea gitea -o jsonpath='{.data\.gitea\.config}' | base64 -d

Next Steps

GitOps stack deployed:

➡️ Continue to Guide 08: Security Tooling


Checklist

  • PostgreSQL deployed
  • PostgreSQL connection verified
  • Gitea namespace created
  • Gitea deployed via Helm
  • Gitea accessible via browser
  • Admin user created
  • Organization created
  • Repositories created
  • Git credentials configured
  • ArgoCD namespace created
  • ArgoCD deployed via Helm
  • ArgoCD UI accessible
  • Admin password retrieved
  • ArgoCD connected to Gitea
  • ArgoCD projects created
  • App of Apps pattern configured
  • Test application deployed
  • GitOps workflow verified

Source: docs/guides/08-security-tooling.md


Guide 08: Security Tooling

Deploy comprehensive security tooling including Vault, Falco, Trivy, and Kyverno.


Overview

This guide installs a complete security stack for your homelab, covering secrets management, runtime security, vulnerability scanning, and policy enforcement.

Time Required: ~60 minutes
Prerequisites: Guide 07 completed, ArgoCD running


Architecture

                    Security Stack
    ┌────────────────────────────────────────────┐
    │                                            │
    │  ┌───────────┐  ┌──────────────┐          │
    │  │   Vault   │  │    Falco     │          │
    │  │ Secrets   │  │ Runtime      │          │
    │  │ Mgmt      │  │ Security     │          │
    │  └───────────┘  └──────────────┘          │
    │                                            │
    │  ┌──────────────┐  ┌──────────────┐       │
    │  │    Trivy     │  │   Kyverno    │       │
    │  │ Vulnerability│  │ Policy       │       │
    │  │ Scanner      │  │ Engine       │       │
    │  └──────────────┘  └──────────────┘       │
    │                                            │
    │  Features:                                  │
    │  - Secrets injection via Vault             │
    │  - Real-time threat detection (Falco)      │
    │  - Image and manifest scanning (Trivy)     │
    │  - Policy enforcement (Kyverno)            │
    └────────────────────────────────────────────┘

Phase 1: Deploy HashiCorp Vault

1.1 Create Namespace

kubectl create namespace vault

1.2 Install Vault

helm repo add hashicorp https://helm.releases.hashicorp.com
helm repo update

helm install vault hashicorp/vault \
  --namespace vault \
  --set server.dev.enabled=false \
  --set server.standalone.enabled=true \
  --set ui.enabled=true \
  --set ui.serviceType=NodePort \
  --set ui.serviceNodePort=30820 \
  --set injector.enabled=true \
  --set server.dataStorage.enabled=true \
  --set server.dataStorage.storageClass=longhorn-critical \
  --set server.dataStorage.size=5Gi \
  --set 'server.standalone.config=storage "file" { path = "/vault/data" }
listener "tcp" {
  address     = "0.0.0.0:8200"
  tls_disable = true
}
api_addr     = "http://0.0.0.0:8200"
disable_mlock = true' \
  --wait

1.3 Initialize and Unseal Vault

# Port forward to access Vault
kubectl port-forward -n vault svc/vault 8200:8200 &

export VAULT_ADDR='http://127.0.0.1:8200'

# Initialise Vault (first time only — save ALL output securely!)
vault operator init -key-shares=5 -key-threshold=3

# Unseal using 3 of the 5 unseal keys returned above:
vault operator unseal   # repeat 3 times with different keys

# Log in with the root token returned by init:
export VAULT_TOKEN='<root-token-from-init>'

# Verify
vault status

1.4 Configure Vault

# Enable KV secrets engine
vault secrets enable -path=homelab kv-v2

# Create test secret
vault kv put homelab/test username=admin password=ChangeMe!

# Enable Kubernetes auth
vault auth enable kubernetes

# Configure Kubernetes auth
vault write auth/kubernetes/config \
  kubernetes_host="https://kubernetes.default.svc:443"

# Create policy for applications
vault policy write homelab-apps - << EOF
path "homelab/data/*" {
  capabilities = ["read"]
}
EOF

# Create role for application
vault write auth/kubernetes/role/homelab-apps \
  bound_service_account_names="*" \
  bound_service_account_namespaces="*" \
  policies=homelab-apps \
  ttl=24h
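
To consume a secret from a workload, reference the role above through the injector's pod annotations. A minimal sketch for the test secret created earlier (the secret file name and template are illustrative):

# Pod template snippet; the injector sidecar renders the secret to /vault/secrets/test
metadata:
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "homelab-apps"
    vault.hashicorp.com/agent-inject-secret-test: "homelab/data/test"
    vault.hashicorp.com/agent-inject-template-test: |
      {{- with secret "homelab/data/test" -}}
      username={{ .Data.data.username }}
      password={{ .Data.data.password }}
      {{- end -}}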

1.5 Create ArgoCD Application for Vault

Create gitops-apps/security/vault.yaml:

---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: vault
  namespace: argocd
spec:
  project: homelab
  source:
    repoURL: https://git.homelab.local/homelab/gitops-apps.git
    targetRevision: main
    path: security/vault
  destination:
    server: https://kubernetes.default.svc
    namespace: vault
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

Phase 2: Deploy Falco

2.1 Create Namespace

kubectl create namespace falco

2.2 Install Falco

helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update

helm install falco falcosecurity/falco \
  --namespace falco \
  --set driver.kind=ebpf \
  --set tty=true \
  --set falco.jsonOutput=true \
  --set falco.jsonIncludeOutputProperty=true \
  --set falco.logLevel=info \
  --set falco.priority=debug \
  --wait

2.3 Configure Falco Rules

Create config map gitops-apps/security/falco/rules.yaml:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: falco-custom-rules
  namespace: falco
data:
  homelab_rules.yaml: |
    - rule: Homelab Shell in Container
      desc: Detect shell spawned in container (homelab-specific)
      condition: >
        spawned_process
        and container
        and shell_procs
        and not user_expected_shell_spawn
      output: >
        Shell spawned in container (user=%user.name container_name=%container.name
        shell=%proc.name parent=%proc.pname cmdline=%proc.cmdline image=%container.image.repository)
      priority: WARNING
      tags: [shell, container]

Apply rules:

kubectl apply -f gitops-apps/security/falco/rules.yaml
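
If the custom rule does not show up in Falco's loaded rules, note that the upstream chart only mounts rules supplied through its customRules value; a standalone ConfigMap is not picked up automatically. One way to load it (a sketch, assuming the rule body has been saved locally as homelab_rules.yaml):

helm upgrade falco falcosecurity/falco \
  --namespace falco \
  --reuse-values \
  --set-file 'customRules.homelab_rules\.yaml=homelab_rules.yaml'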

2.4 Test Falco

# Trigger Falco event
kubectl run test-shell --image=nginx:latest --restart=Never -i -- sh -c "whoami"

# Check Falco logs
kubectl logs -n falco -l app.kubernetes.io/name=falco -f

# Cleanup
kubectl delete pod test-shell

Phase 3: Deploy Trivy

3.1 Create Namespace

kubectl create namespace trivy-system

3.2 Install Trivy Operator

helm repo add aqua https://aquasecurity.github.io/helm-charts/
helm repo update

helm install trivy-operator aqua/trivy-operator \
  --namespace trivy-system \
  --set serviceMonitor.enabled=true \
  --set trivy.enabled=true \
  --set trivy.image.repository=ghcr.io/aquasecurity/trivy \
  --set trivy.image.tag=latest \
  --set trivy.server.enabled=false \
  --set trivy.dbRepository=ghcr.io/aquasecurity/trivy-db \
  --set operator.builtInTrivyServer=false \
  --wait

3.3 Create Scan Job

# Scan all deployed images
cat << EOF | kubectl apply -f -
apiVersion: aquasecurity.github.io/v1alpha1
kind: ClusterComplianceReport
metadata:
  name: homelab-compliance
spec:
  cron: "0 0 * * *"
  reportType: summary
  format: json
  compliance:
    checks:
      - id: AVD-KSV-0015
        severity: HIGH
      - id: AVD-KSV-0016
        severity: MEDIUM
EOF

3.4 View Vulnerability Reports

# List vulnerability reports
kubectl get vulnerabilityreports -A

# Describe a specific report (reports are created in the namespace of the scanned workload)
kubectl describe vulnerabilityreports -n <workload-namespace> <report-name>
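
A compact severity summary across all namespaces can be pulled with custom columns (field paths follow the trivy-operator VulnerabilityReport schema):

kubectl get vulnerabilityreports -A \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CRITICAL:.report.summary.criticalCount,HIGH:.report.summary.highCount'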

Phase 4: Deploy Kyverno

4.1 Create Namespace

kubectl create namespace kyverno

4.2 Install Kyverno

helm repo add kyverno https://kyverno.github.io/kyverno/
helm repo update

helm install kyverno kyverno/kyverno \
  --namespace kyverno \
  --set replicaCount=1 \
  --set initContainer.resources.limits.memory=500Mi \
  --wait

4.3 Create Baseline Policies

Create gitops-apps/security/kyverno/policies.yaml:

---
# Policy: Disallow privileged containers
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged
spec:
  validationFailureAction: enforce
  background: true
  rules:
    - name: validate-privileged
      match:
        resources:
          kinds:
            - Pod
      validate:
        message: "Privileged mode is not allowed"
        pattern:
          spec:
            =(containers):
              - =(securityContext):
                  =(privileged): false

---
# Policy: Require resource limits
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-limits
spec:
  validationFailureAction: enforce
  background: true
  rules:
    - name: validate-resources
      match:
        resources:
          kinds:
            - Pod
      validate:
        message: "CPU and memory resource limits are required"
        pattern:
          spec:
            =(containers):
              - resources:
                  limits:
                    memory: "?*"
                    cpu: "?*"

---
# Policy: Disallow latest tag
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: enforce
  background: true
  rules:
    - name: validate-image-tag
      match:
        resources:
          kinds:
            - Pod
      validate:
        message: "Using the ':latest' tag is not allowed"
        foreach:
          - list: request.object.spec.containers
            pattern:
              image: "!*:latest"

---
# Policy: Auto-generate default-deny NetworkPolicy for production namespaces
# This generate rule fires when a Namespace with label environment=production is created.
# It creates a default-deny-ingress NetworkPolicy in that namespace automatically.
# Namespaces in the exclusion list are skipped.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-default-deny-networkpolicy
  annotations:
    policies.kyverno.io/title: Add Default-Deny NetworkPolicy
    policies.kyverno.io/description: >
      Automatically generate a default-deny-ingress NetworkPolicy in every
      namespace labelled 'environment: production'. This ensures all ingress
      is blocked unless explicitly allowed by another NetworkPolicy.
spec:
  rules:
    - name: generate-default-deny
      match:
        any:
          - resources:
              kinds:
                - Namespace
              selector:
                matchLabels:
                  environment: production
      exclude:
        any:
          - resources:
              namespaces:
                - kube-system
                - kyverno
                - falco
                - vault
                - argocd
                - longhorn-system
                - monitoring
                - logging
      generate:
        apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        name: default-deny-ingress
        namespace: "{{request.object.metadata.name}}"
        synchronize: true
        data:
          spec:
            podSelector: {}
            policyTypes:
              - Ingress

Apply policies:

kubectl apply -f gitops-apps/security/kyverno/policies.yaml

4.4 Test Kyverno Policies

# Test privileged container policy (should fail)
kubectl run test-privileged --image=nginx:latest --privileged --restart=Never

# Expected: Error from server (Forbidden): admission webhook denied the request

# Test latest tag policy (should fail)
kubectl run test-latest --image=nginx:latest --restart=Never

# Expected: Error from server (Forbidden): admission webhook denied the request

# Test with a pinned tag and explicit limits (should succeed; require-limits is also enforced)
kubectl run test-proper --image=nginx:1.25 --restart=Never \
  --overrides='{"apiVersion":"v1","spec":{"containers":[{"name":"test-proper","image":"nginx:1.25","resources":{"limits":{"cpu":"100m","memory":"128Mi"}}}]}}'

# Cleanup
kubectl delete pod test-proper
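
The generate rule can be exercised the same way; a minimal sketch using a throwaway namespace (any name works as long as it carries the environment=production label):

cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: prd-netpol-test
  labels:
    environment: production
EOF

# The default-deny policy should appear within a few seconds
kubectl get networkpolicy -n prd-netpol-test
kubectl delete namespace prd-netpol-test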

Phase 5: Create Security Dashboard

5.1 Install ingress-nginx (for dashboards)

kubectl create namespace ingress-nginx

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --set controller.service.type=NodePort \
  --set controller.service.nodePorts.http=30080 \
  --set controller.service.nodePorts.https=30443 \
  --wait
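
Verify the controller is reachable on its NodePort (any node IP works; 10.10.10.10 is the K3s master used elsewhere in this guide):

kubectl get pods -n ingress-nginx
curl -I http://10.10.10.10:30080
# Expect an HTTP 404 from the default backend until an Ingress resource is defined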

5.2 Access Security Tools

| Tool | URL / Command | Credentials |
|------|---------------|-------------|
| Vault | http://10.10.10.10:8200 | Root token (from vault operator init) |
| Falco Logs | kubectl logs -n falco -l app.kubernetes.io/name=falco | - |
| Trivy Reports | kubectl get vulnerabilityreports -A | - |
| Kyverno Policies | kubectl get clusterpolicies | - |

Phase 6: Create ArgoCD Applications

Create application manifests for all security tools in gitops-apps/security/:

# Create applications directory
mkdir -p gitops-apps/security/{vault,falco,trivy,kyverno}

# Create ArgoCD applications for each security tool
# (Similar to Vault example in Phase 1.5)

Push to Gitea:

git add gitops-apps/security/
git commit -m "Add security tooling applications"
git push

Troubleshooting

Vault Won't Initialize

Issue: Vault stuck in pending state

Solution:

# Check Vault logs
kubectl logs -n vault -l app.kubernetes.io/name=vault -f

# Check storage
kubectl get pvc -n vault

# Restart Vault
kubectl rollout restart deployment vault -n vault

Falco Not Detecting Events

Issue: No events in Falco logs

Solution:

# Check Falco status
kubectl get pods -n falco

# Verify the eBPF probe loaded (the eBPF driver is not a kernel module, so lsmod won't show it)
kubectl logs -n falco -l app.kubernetes.io/name=falco | grep -i ebpf

# Check Falco version (kubectl exec takes a pod or workload name, not a label selector)
kubectl exec -n falco daemonset/falco -- falco --version

Trivy Scans Timing Out

Issue: Trivy scans stuck in running state

Solution:

# Check Trivy operator logs
kubectl logs -n trivy-system -l app.kubernetes.io/name=trivy-operator -f

# Check available resources
kubectl top nodes
kubectl top pods -n trivy-system

# Increase the scan timeout via the Helm chart (there is no 'trivy' API resource to patch)
helm upgrade trivy-operator aqua/trivy-operator \
  --namespace trivy-system --reuse-values \
  --set trivy.timeout=10m0s

Next Steps

Security tooling deployed:

➡️ Continue to Guide 09: Red/Blue Team


Checklist

  • Vault namespace created
  • Vault deployed
  • Vault initialized and unsealed
  • KV secrets engine enabled
  • Kubernetes auth configured
  • Vault policies created
  • Falco namespace created
  • Falco installed with eBPF driver
  • Custom Falco rules applied
  • Falco tested with shell event
  • Trivy namespace created
  • Trivy operator installed
  • Scan jobs created
  • Vulnerability reports accessible
  • Kyverno namespace created
  • Kyverno installed
  • Baseline policies applied
  • Policies tested
  • Ingress controller installed
  • All tools accessible
  • ArgoCD applications created
  • Security stack verified

Source: docs/guides/09-red-blue-team.md


Guide 09: Red/Blue Team Infrastructure

Deploy isolated security sandbox environments for red team (attack) and blue team (defense) exercises.


Overview

This guide creates isolated network segments for security testing, including attack tools, vulnerable targets, and defensive monitoring infrastructure.

Time Required: ~45 minutes
Prerequisites: Guide 08 completed, pfSense configured with VNet2


Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Security Sandbox Network                  │
│                    (VNet2: 10.20.20.0/24)                     │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Red Team (Attack)              Blue Team (Defense)          │
│  ┌──────────────┐              ┌──────────────┐             │
│  │  Kali Linux  │              │ Security     │             │
│  │  10.20.20.10 │              │ Onion        │             │
│  │              │              │ (Wazuh)       │             │
│  │  Tools:      │              │              │             │
│  │  - Metasploit│              │ ELK Stack    │             │
│  │  - Burp      │              │ IDS/IPS      │             │
│  │  - Nmap      │              │ Log Analysis │             │
│  └──────────────┘              └──────────────┘             │
│         │                              │                     │
│         │     Vulnerable Targets       │                     │
│         │    ┌──────────────────┐      │                     │
│         └───►│ Metasploitable 3 │      │                     │
│              │ DVWA             │      │                     │
│              │ Juice Shop       │      │                     │
│              └──────────────────┘      │                     │
│                                          │                     │
│  Firewall Rules:                          │                     │
│  ❌ VNet2 → VNet1 (blocked)              │                     │
│  ✅ VNet2 → WAN (allowed)                │                     │
│  ⚠️  VNet1 → VNet2 (restricted)          │                     │
└─────────────────────────────────────────────────────────────┘

Phase 1: Configure pfSense Firewall

1.1 Access pfSense

Via Proxmox console:

# Open console for pfsense-router VM
# Default credentials: admin / pfsense

Or via web UI:

http://10.10.10.1

1.2 Configure OPT Interface (VNet2)

Via pfSense Web UI:

  1. Navigate to Interfaces > OPT1
  2. Enable interface
  3. Configure:
    • Name: SANDBOX
    • IPv4: 10.20.20.1/24
    • Gateway: None
  4. Save and Apply

1.3 Create Firewall Rules

Navigate to Firewall > Rules > SANDBOX

Add rules:

| Action | Interface | Source | Destination | Port | Description |
|--------|-----------|--------|-------------|------|-------------|
| Block | SANDBOX | * | Homelab Net | * | Block access to production |
| Pass | SANDBOX | * | WAN | * | Allow internet |
| Pass | Homelab Net | Mgmt IPs | SANDBOX | * | Allow management access |

Critical: The first rule blocks all traffic from VNet2 to VNet1.


Phase 2: Deploy Red Team Infrastructure

2.1 Create Kali Linux Template

This requires manual installation via ISO:

# Get the current Kali Linux installer ISO version
# Check https://www.kali.org/get-kali/#kali-installer-images for the latest version
# Example (update the version before running; \K strips the kali-linux- prefix so it isn't duplicated below):
KALI_VERSION=$(curl -s https://cdimage.kali.org/current/ | grep -oP 'kali-linux-\K[0-9]+\.[0-9a-z]+(?=-installer-amd64\.iso)' | head -1)
INSTALLER_ISO="kali-linux-${KALI_VERSION}-installer-amd64.iso"

# Download the installer ISO (NOT the vmware or live variant — Proxmox needs the installer)
wget "https://cdimage.kali.org/current/${INSTALLER_ISO}" -O /tmp/kali.iso

# Copy to pve-maul ISO storage
scp /tmp/kali.iso root@192.168.1.10:/var/lib/vz/template/iso/

# Create the VM via CLI, attached to vnet-sandbox (VNet2).
# Run these qm commands on pve-maul itself; qm always operates on the local node and has no --node flag.
qm create 8000 \
  --name kali-linux \
  --memory 4096 \
  --cores 4 \
  --net0 virtio,bridge=vnet-sandbox,tag=200 \
  --scsihw virtio-scsi-pci \
  --ide2 local:iso/${INSTALLER_ISO},media=cdrom \
  --scsi0 local-lvm:80,format=raw \
  --boot order=ide2 \
  --ostype l26

# Start VM and complete Kali installation interactively
qm start 8000

# Access the console in Proxmox UI: Nodes → pve-maul → 8000 (kali-linux) → Console
# After installation completes:
qm stop 8000

# Optional: Convert to template for fast cloning
qm template 8000
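
Cloning new attack boxes from the template is then a one-liner (a sketch; VM ID 8001 and the name are arbitrary):

qm clone 8000 8001 --name kali-01 --full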

2.2 Deploy Red Team VMs

Update Terraform to include red team VMs. Add to terraform/environments/homelab/main.tf:

# Red Team Infrastructure

# Kali Linux
module "kali_linux" {
  source        = "../../modules/vm"

  name          = "kali-linux"
  target_node   = "maul"
  clone_template = "kali-template"

  cores         = 4
  memory        = 8192
  cpu_type      = "host"

  network_bridge = "vnet-sandbox"
  network_tag     = 200
  network_firewall = true

  disk_size      = "80G"
  disk_storage   = "local-lvm"
  cloudinit_storage = "local-lvm"

  ip_address     = "10.20.20.10"
  gateway        = "10.20.20.1"

  ssh_public_keys = [var.ssh_public_key]

  onboot         = false
  tags           = ["security", "red-team", "sandbox"]
}

# Parrot OS (alternative to Kali)
module "parrot_os" {
  source        = "../../modules/vm"

  name          = "parrot-os"
  target_node   = "maul"
  clone_template = "parrot-template"

  cores         = 4
  memory        = 8192
  cpu_type      = "host"

  network_bridge = "vnet-sandbox"
  network_tag     = 200

  disk_size      = "80G"
  disk_storage   = "local-lvm"
  cloudinit_storage = "local-lvm"

  ip_address     = "10.20.20.11"
  gateway        = "10.20.20.1"

  ssh_public_keys = [var.ssh_public_key]

  onboot         = false
  tags           = ["security", "red-team", "sandbox"]
}

2.3 Apply Terraform

cd terraform/environments/homelab

terraform plan -out=tfplan-redteam
terraform apply tfplan-redteam
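
Confirm the VMs were created on pve-maul (192.168.1.10, per the ISO upload above):

ssh root@192.168.1.10 "qm list | grep -E 'kali-linux|parrot-os'"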

Phase 3: Deploy Vulnerable Targets

3.1 Deploy Metasploitable

Metasploitable3 is built with Vagrant/Packer rather than Docker (see the note at the end of the playbook); the Docker-native equivalent, Metasploitable2, deploys cleanly via Ansible:

Create ansible/playbooks/deploy-targets.yml:

---
# Deploy vulnerable targets for red team practice
# These run on the kali-linux VM inside the sandbox network (10.20.20.0/24)
# They are intentionally vulnerable — NEVER expose these to the internet.

- name: Deploy vulnerable targets
  hosts: kali_linux
  become: true
  gather_facts: true

  tasks:
    - name: Install Docker and dependencies
      ansible.builtin.apt:
        name:
          - docker.io
          - docker-compose-plugin
          - python3-docker
        state: present
        update_cache: true

    - name: Start Docker service
      ansible.builtin.systemd:
        name: docker
        state: started
        enabled: true

    # Option A: Metasploitable2 (Docker-native, always works)
    # Container image: tleemcjr/metasploitable2
    - name: Pull Metasploitable2 image
      community.docker.docker_image:
        name: tleemcjr/metasploitable2
        source: pull

    - name: Run Metasploitable2 container
      community.docker.docker_container:
        name: metasploitable2
        image: tleemcjr/metasploitable2
        ports:
          - "21:21"     # FTP
          - "22:22"     # SSH
          - "80:80"     # HTTP
          - "3306:3306" # MySQL
        restart_policy: unless-stopped
        state: started

    # Option B: DVWA (Damn Vulnerable Web Application)
    - name: Run DVWA container
      community.docker.docker_container:
        name: dvwa
        image: ghcr.io/digininja/dvwa:latest
        ports:
          - "8080:80"
        env:
          DB_SERVER: "dvwa-db"
        restart_policy: unless-stopped
        state: started

    - name: Display target info
      ansible.builtin.debug:
        msg:
          - "Metasploitable2: http://{{ ansible_host }} (ports 21, 22, 80, 3306)"
          - "DVWA: http://{{ ansible_host }}:8080"
          - "NOTE: For rapid7/metasploitable3 (Ubuntu-based), use Vagrant on a local machine:"
          - "  git clone https://github.com/rapid7/metasploitable3 && cd metasploitable3 && vagrant up"

3.2 Deploy DVWA

# Deploy DVWA in K3s cluster (isolated namespace)
kubectl create namespace dvwa

cat << EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dvwa
  namespace: dvwa
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dvwa
  template:
    metadata:
      labels:
        app: dvwa
    spec:
      nodeSelector:
        kubernetes.io/hostname: "k3s-worker-01"
      containers:
      - name: dvwa
        image: vulnerables/web-dvwa
        ports:
        - containerPort: 80
        env:
        - name: RECAPTCHA_DISABLED
          value: "true"
---
apiVersion: v1
kind: Service
metadata:
  name: dvwa
  namespace: dvwa
spec:
  type: NodePort
  selector:
    app: dvwa
  ports:
  - port: 80
    targetPort: 80
    nodePort: 30880
EOF

# Access DVWA
# http://10.10.10.11:30880
# Default credentials: admin / password

3.3 Deploy OWASP Juice Shop

kubectl create namespace juice-shop

cat << EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: juice-shop
  namespace: juice-shop
spec:
  replicas: 1
  selector:
    matchLabels:
      app: juice-shop
  template:
    metadata:
      labels:
        app: juice-shop
    spec:
      nodeSelector:
        kubernetes.io/hostname: "k3s-worker-01"
      containers:
      - name: juice-shop
        image: bkimminich/juice-shop:latest
        ports:
        - containerPort: 3000
---
apiVersion: v1
kind: Service
metadata:
  name: juice-shop
  namespace: juice-shop
spec:
  type: NodePort
  selector:
    app: juice-shop
  ports:
  - port: 3000
    targetPort: 3000
    nodePort: 30881
EOF

Phase 4: Deploy Blue Team Infrastructure

4.1 Create Blue Team Namespace

kubectl create namespace blue-team

4.2 Deploy Wazuh (SIEM)

helm repo add wazuh https://wazuh.github.io/wazuh-helm
helm repo update

helm install wazuh wazuh/wazuh \
  --namespace blue-team \
  --set wazuh-manager.enabled=true \
  --set wazuh-indexer.enabled=true \
  --set wazuh-dashboard.enabled=true \
  --set wazuh-manager.persistence.enabled=true \
  --set wazuh-manager.persistence.storageClass=longhorn-default \
  --set wazuh-manager.persistence.size=20Gi \
  --set wazuh-indexer.persistence.enabled=true \
  --set wazuh-indexer.persistence.storageClass=longhorn-default \
  --set wazuh-indexer.persistence.size=20Gi \
  --wait

Access Wazuh Dashboard:

# Port forward
kubectl port-forward -n blue-team svc/wazuh-dashboard 5601:5601

# Access at: https://localhost:5601
# Default credentials: admin / admin

4.3 Deploy ELK Stack (Alternative)

helm repo add elastic https://helm.elastic.co
helm repo update

helm install elasticsearch elastic/elasticsearch \
  --namespace blue-team \
  --set replicas=1 \
  --set minimumMasterNodes=1 \
  --set persistence.enabled=true \
  --set persistence.storageClass=longhorn-default \
  --set persistence.size=10Gi \
  --wait

helm install kibana elastic/kibana \
  --namespace blue-team \
  --set replicas=1 \
  --wait
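
Access Kibana via port-forward (kibana-kibana is the default service name generated by the elastic/kibana chart; adjust if you overrode the release name):

kubectl port-forward -n blue-team svc/kibana-kibana 5601:5601
# Access at: http://localhost:5601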

Phase 5: Configure Network Policies

5.1 Create Isolation Policies

Create gitops-apps/security/network-policies.yaml:

---
# Network policies for sandbox isolation

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-sandbox-to-production
  namespace: dvwa
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: dvwa

---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-blue-team-egress
  namespace: blue-team
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector: {}
  - to:
    - namespaceSelector: {}

---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-blue-team-ingress
  namespace: blue-team
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: blue-team
  - from:
    - namespaceSelector:
        matchLabels:
          name: argocd

Apply policies:

# Label namespaces
kubectl label namespace dvwa name=dvwa
kubectl label namespace juice-shop name=juice-shop
kubectl label namespace blue-team name=blue-team

# Apply policies
kubectl apply -f gitops-apps/security/network-policies.yaml

Phase 6: Verification

6.1 Test Network Isolation

# From Kali Linux (should fail)
ssh -i ~/.ssh/homelab root@10.20.20.10 "ping -c 3 10.10.10.2"

# Expected: Packet filtered / timeout

# From production (should fail)
kubectl run test-sandbox --image=nicolaka/netshoot -n test --rm -it --restart=Never -- wget -O- http://10.20.20.10

# Expected: Connection refused

6.2 Test Vulnerable Targets

# Access DVWA
curl -I http://10.10.10.11:30880

# Access Juice Shop
curl -I http://10.10.10.11:30881

# Access Metasploitable2 (from Kali)
ssh -i ~/.ssh/homelab root@10.20.20.10 "curl -I http://localhost"

6.3 Verify Blue Team Logging

# Access Wazuh dashboard
# Navigate to Discover > wazuh-alerts-*

# Check for alerts from vulnerable targets
# Verify Falco events are being logged

Phase 7: Cleanup

7.1 Cleanup Test Resources

# Delete test pods
kubectl delete pod test-sandbox -n test

# Delete vulnerable targets (when not in use)
kubectl delete namespace dvwa juice-shop

# Shutdown red team VMs when not in use (qm takes the numeric VMID, not the VM name)
qm stop 8000          # kali-linux
qm stop <parrot-vmid> # parrot-os

Troubleshooting

Network Isolation Not Working

Issue: Sandbox can still access production

Solution:

# Verify network policies
kubectl get networkpolicies -A

# Check pfSense rules
# Firewall > Rules > SANDBOX

# Verify VXLAN tag
ip link show | grep vxlan | grep 200

Wazuh Dashboard Won't Load

Issue: Can't access Wazuh UI

Solution:

# Check pod status
kubectl get pods -n blue-team

# Check logs
kubectl logs -n blue-team -l app=wazuh-dashboard -f

# Reset admin password
kubectl exec -it -n blue-team wazuh-dashboard-0 -- bash
# Inside: /usr/share/kibana/bin/kibana-setup-passwords

Next Steps

Red/Blue team infrastructure deployed:

➡️ Continue to Guide 10: Monitoring Stack


Checklist

  • pfSense OPT interface configured
  • Firewall rules created (VNet2 → VNet1 blocked)
  • Kali Linux template created
  • Red team VMs deployed
  • Metasploitable2 deployed (Metasploitable3 via Vagrant optional)
  • DVWA deployed
  • Juice Shop deployed
  • Wazuh/ELK deployed
  • Blue team namespace created
  • Network policies applied
  • Network isolation verified
  • Vulnerable targets accessible
  • Blue team logging verified
  • Cleanup procedures documented

Source: docs/guides/10-monitoring-stack.md


Guide 10: Monitoring Stack (LGTM + OpenTelemetry)

Deploy the full LGTM stack (Loki, Grafana, Tempo, Mimir/Metrics) with OpenTelemetry for enterprise-grade observability.


Overview

This guide implements a modern observability pipeline. Instead of apps talking directly to databases, everything sends data via the OpenTelemetry (OTel) Protocol to a central collector, which then routes it to the appropriate LGTM component.

Time Required: ~75 minutes
Prerequisites: Guide 09 completed


Architecture (LGTM Stack)

                    Observability Pipeline
    ┌──────────────────────────────────────────────────────┐
    │  Applications (Production / Development / Sandbox)   │
    └──────────┬───────────────────┬───────────────────────┘
               │ (Metrics/Logs/Traces via OTLP)
               ▼
    ┌──────────────────────────────────────────────────────┐
    │             OpenTelemetry (OTel) Collector           │
    │        (Processing, Batching, and Routing)           │
    └──────────┬────────┬──────────┬───────────────┬───────┘
               │        │          │               │
        ┌──────▼───┐┌───▼────┐┌────▼─────┐  ┌──────▼──────┐
        │Prometheus││  Loki  ││  Tempo   │  │ AlertManager│
        │ (Metrics)││ (Logs) ││ (Traces) │  │  (Alerts)   │
        └──────┬───┘└───┬────┘└────┬─────┘  └──────┬──────┘
               │        │          │               │
               └────────┴────┬─────┴───────────────┘
                             ▼
                    ┌──────────────────┐
                    │     Grafana      │
                    │ (Visualization)  │
                    └──────────────────┘

Phase 1: Deploy Prometheus Stack (Metrics)

1.1 Create Namespace

kubectl create namespace monitoring

1.2 Install kube-prometheus-stack

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClass=longhorn-default \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
  --set grafana.adminPassword=admin \
  --set grafana.service.type=NodePort \
  --set grafana.service.nodePort=30090 \
  --wait
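
Verify the stack is healthy before moving on:

kubectl get pods -n monitoring
# Expected: prometheus, alertmanager, grafana, kube-state-metrics and node-exporter pods Running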

Phase 2: Deploy Loki (Logs)

Important

grafana/loki-stack is deprecated. Use the standalone grafana/loki chart (SingleBinary mode for homelab) with grafana/alloy as the log collector agent.

2.1 Create Namespace

kubectl create namespace logging

2.2 Install Loki (SingleBinary mode)

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

helm install loki grafana/loki \
  --namespace logging \
  --set loki.commonConfig.replication_factor=1 \
  --set loki.storage.type=filesystem \
  --set singleBinary.replicas=1 \
  --set singleBinary.persistence.enabled=true \
  --set singleBinary.persistence.storageClass=longhorn-default \
  --set singleBinary.persistence.size=20Gi \
  --wait

2.3 Install Grafana Alloy (Log Collector Agent)

Alloy replaces the deprecated Promtail as the recommended log shipping agent:

helm install alloy grafana/alloy \
  --namespace logging \
  --set alloy.clustering.enabled=false \
  --set controller.type=daemonset \
  --wait

Verify Loki is running:

kubectl get pods -n logging
# Expected: loki-0 Running, alloy-* Running on each node

Phase 3: Deploy Grafana Tempo (Traces)

3.1 Install Tempo

helm install tempo grafana/tempo \
  --namespace monitoring \
  --set tempo.storage.trace.backend=local \
  --set tempo.persistence.enabled=true \
  --set tempo.persistence.storageClass=longhorn-default \
  --set tempo.persistence.size=20Gi \
  --wait

Verify Tempo is running:

kubectl get pods -n monitoring -l app.kubernetes.io/name=tempo

Phase 4: Deploy OpenTelemetry Collector

The OTel Collector is the single entry point for all telemetry data. It receives OTLP from applications and routes to the appropriate backend.

4.1 Install OTel Collector

Create otel-values.yaml:

mode: deployment
config:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: "0.0.0.0:4317"
        http:
          endpoint: "0.0.0.0:4318"

  exporters:
    # Metrics: push to Prometheus via remote_write (Prometheus is pull-based;
    # use prometheusremotewrite, NOT the 'prometheus' exporter which is a scrape endpoint only)
    prometheusremotewrite:
      endpoint: "http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/api/v1/write"

    # Logs: push to Loki
    loki:
      endpoint: "http://loki.logging.svc.cluster.local:3100/loki/api/v1/push"

    # Traces: forward to Tempo via OTLP gRPC
    otlp/tempo:
      endpoint: "tempo.monitoring.svc.cluster.local:4317"
      tls:
        insecure: true

  service:
    pipelines:
      metrics:
        receivers: [otlp]
        exporters: [prometheusremotewrite]
      logs:
        receivers: [otlp]
        exporters: [loki]
      traces:
        receivers: [otlp]
        exporters: [otlp/tempo]

Install the collector:

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install otel-collector open-telemetry/opentelemetry-collector \
  --namespace monitoring \
  --create-namespace \
  -f otel-values.yaml \
  --wait

Enable Prometheus remote_write (required for OTel metrics pipeline):

# Patch kube-prometheus-stack to enable the remote_write receiver
helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --reuse-values \
  --set prometheus.prometheusSpec.enableRemoteWriteReceiver=true
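
Confirm the collector started all three pipelines without exporter errors:

kubectl logs -n monitoring -l app.kubernetes.io/name=opentelemetry-collector --tail=20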

Phase 5: Configure Grafana LGTM Integration

Access Grafana: http://10.10.10.10:30090 (K3s master node)

Add datasources in Grafana UI (Connections → Data Sources → Add new):

| Datasource | Type | URL |
|------------|------|-----|
| Prometheus | Prometheus | http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090 |
| Loki | Loki | http://loki.logging.svc.cluster.local:3100 |
| Tempo | Tempo | http://tempo.monitoring.svc.cluster.local:3200 (port 3200, not 3100) |

Caution

Tempo's HTTP API is on port 3200. Port 3100 is Loki's push endpoint. Using the wrong port will cause "No data" in Grafana Explore.

Link traces to logs by configuring Tempo → Derived Fields → Loki in Grafana datasource settings.

# Verify datasource connectivity from Grafana pods
kubectl exec -n monitoring deploy/kube-prometheus-stack-grafana -- \
  wget -qO- http://loki.logging.svc.cluster.local:3100/ready
# Expected: "ready"

kubectl exec -n monitoring deploy/kube-prometheus-stack-grafana -- \
  wget -qO- http://tempo.monitoring.svc.cluster.local:3200/ready
# Expected: "ready"

Phase 6: Enterprise "Multi-Cluster" Simulation

To replicate an enterprise setup, use Namespace Labels and Resource Quotas to simulate different environments on the same physical hardware:

| Simulated Environment | Namespace Prefix | Kyverno Mode | Storage Class |
|-----------------------|------------------|--------------|---------------|
| Production | prd-* | Enforce | longhorn-critical (3 replicas) |
| Development | dev-* | Audit | longhorn-default (2 replicas) |
| Sandbox (Maul) | external | None | local-lvm |
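
A minimal sketch of one simulated production environment (the namespace name and quota values are illustrative):

kubectl create namespace prd-alpha
kubectl label namespace prd-alpha environment=production

cat << EOF | kubectl apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: prd-alpha-quota
  namespace: prd-alpha
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi
EOF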

Verification

Test Trace Generation

# Deploy a test telemetry app connected to the OTel Collector
kubectl run telemetry-test -n monitoring \
  --image=ghcr.io/open-telemetry/opentelemetry-demo/productcatalogservice:latest \
  --env="OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.monitoring.svc.cluster.local:4317" \
  --env="OTEL_SERVICE_NAME=telemetry-test"

# Wait ~30 seconds, then check Grafana → Explore → Tempo
kubectl delete pod telemetry-test -n monitoring

Troubleshooting

No Traces appearing in Tempo

kubectl logs -n monitoring -l app.kubernetes.io/name=opentelemetry-collector --tail=50
# Look for "exporter/otlp/tempo" errors
  • Confirm Tempo pod is Running: kubectl get pods -n monitoring -l app.kubernetes.io/name=tempo
  • Verify Grafana datasource URL uses port 3200

Loki not receiving logs

kubectl logs -n logging -l app.kubernetes.io/name=alloy --tail=50
  • Check Loki is healthy: kubectl get pods -n logging

Metrics not appearing in Prometheus

# Verify remote_write receiver is enabled
kubectl get prometheus -n monitoring -o yaml | grep enableRemoteWriteReceiver

Completion Checklist

  • Prometheus deployed (via kube-prometheus-stack)
  • Loki deployed (standalone chart, SingleBinary mode)
  • Grafana Alloy DaemonSet running on all nodes (log collection)
  • Tempo deployed
  • OTel Collector routing: metrics → Prometheus remote_write, logs → Loki, traces → Tempo
  • Grafana datasources configured: Prometheus (9090), Loki (3100), Tempo (3200)
  • Trace-to-log correlation configured in Grafana Tempo datasource
  • Multi-environment namespace simulation configured

Source: docs/guides/11-identity-sso.md


Guide 11: Identity & SSO (Authelia + LLDAP)

Implement enterprise-grade Single Sign-On (SSO) and Multi-Factor Authentication (MFA) across your homelab.


Overview

This guide integrates Authelia with an LLDAP (Lightweight LDAP) backend to protect lab services behind a unified login portal. It replicates a "Zero Trust" architecture where every request is authenticated before reaching the application.

Time Required: ~60 minutes
Prerequisites: Guide 07 completed (PostgreSQL and ArgoCD running), Guide 05 (ingress-nginx running)


Architecture

User
 │
 ▼
Ingress-Nginx ──► (forward-auth check) ──► Authelia (auth.homelab.local)
 │                                              │
 │    ┌─────────────────────────────────────────┤
 │    │  Session Store (Redis)                  │
 │    │  Storage (PostgreSQL)                   │
 │    │  Identity (LLDAP → LDAP protocol)       │
 │    └─────────────────────────────────────────┘
 │
 ▼
Protected Service (Grafana / ArgoCD / Gitea)

Phase 1: Deploy Identity Store (LLDAP)

LLDAP is a lightweight LDAP server with a modern web UI — the recommended option for DevSecOps learning.

Alternative: For enterprise Active Directory experience, provision a Windows Server 2022 VM on pve-vader, promote it to a Domain Controller (homelab.local), and create an authelia-bind service account. LLDAP requires no Windows licensing and all steps below work with both.

1.1 Create LLDAP ArgoCD Application

Create gitops-apps/security/lldap.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: lldap
  namespace: argocd
spec:
  project: homelab
  source:
    repoURL: https://lldap.github.io/charts
    chart: lldap
    targetRevision: "*"
    helm:
      values: |
        env:
          LLDAP_LDAP_PORT: "3890"
          LLDAP_HTTP_PORT: "17170"
          LLDAP_LDAP_BASE_DN: "dc=homelab,dc=local"
          LLDAP_JWT_SECRET:
            valueFrom:
              secretKeyRef:
                name: lldap-secrets
                key: jwt-secret
          LLDAP_LDAP_USER_PASS:
            valueFrom:
              secretKeyRef:
                name: lldap-secrets
                key: admin-password
        persistence:
          enabled: true
          storageClass: longhorn-default
          size: 1Gi
        service:
          type: ClusterIP
  destination:
    server: https://kubernetes.default.svc
    namespace: security
  syncPolicy:
    automated:
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

1.2 Create LLDAP Secret

kubectl create namespace security

kubectl create secret generic lldap-secrets \
  --namespace security \
  --from-literal=jwt-secret=$(openssl rand -base64 32) \
  --from-literal=admin-password=$(openssl rand -base64 16)

# Store these values in Vault too (recommended):
# vault kv put homelab/lldap jwt-secret=<value> admin-password=<value>

1.3 Set Up Users

# Port-forward the LLDAP web UI
kubectl port-forward -n security svc/lldap 17170:17170

# Open http://localhost:17170
# Default admin user: admin / (the password from the secret above)

In the LLDAP web UI:

  1. Create a group: homelab-admins
  2. Create a group: homelab-users
  3. Create your primary user and add them to homelab-admins
  4. Create an Authelia service account: authelia-bind (member of homelab-users only)

Phase 2: Redis (Session Store Prerequisite)

Important

Authelia requires Redis for session storage. Do not skip this step — Authelia will crash-loop without it.

helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

helm install redis bitnami/redis \
  --namespace security \
  --set auth.enabled=false \
  --set replica.replicaCount=1 \
  --set master.persistence.enabled=true \
  --set master.persistence.storageClass=longhorn-default \
  --set master.persistence.size=2Gi \
  --wait

# Verify Redis is running
kubectl get pods -n security -l app.kubernetes.io/name=redis

Phase 3: Deploy Authelia

3.1 Create Authelia Secret

kubectl create secret generic authelia-secrets \
  --namespace security \
  --from-literal=jwt-secret=$(openssl rand -base64 32) \
  --from-literal=session-secret=$(openssl rand -base64 32) \
  --from-literal=storage-encryption-key=$(openssl rand -base64 32) \
  --from-literal=postgres-password='<postgres-password-from-guide-07>' \
  --from-literal=ldap-password=$(kubectl get secret lldap-secrets -n security \
      -o jsonpath='{.data.admin-password}' | base64 -d)

3.2 Create PostgreSQL Database for Authelia

# Exec into the PostgreSQL pod and create the database
kubectl exec -n postgresql -it $(kubectl get pods -n postgresql -l app.kubernetes.io/name=postgresql -o name | head -1) -- \
  psql -U postgres -c "CREATE DATABASE authelia;"

3.3 Create Authelia Values File

Create gitops-apps/security/authelia-values.yaml (add to .gitignore — contains references to secrets):

domain: homelab.local

authentication_backend:
  ldap:
    implementation: custom
    url: ldap://lldap.security.svc.cluster.local:3890
    base_dn: dc=homelab,dc=local
    username_attribute: uid
    additional_users_dn: ou=people
    users_filter: "(&({username_attribute}={input})(objectClass=person))"
    additional_groups_dn: ou=groups
    groups_filter: "(member={dn})"
    group_name_attribute: cn
    mail_attribute: mail
    display_name_attribute: displayName
    user: uid=authelia-bind,ou=people,dc=homelab,dc=local
    password:
      secret_name: authelia-secrets
      secret_key: ldap-password

access_control:
  default_policy: deny
  rules:
    # Allow unauthenticated access to the auth portal itself
    - domain: auth.homelab.local
      policy: bypass
    # Management services require 2FA
    - domain:
        - argocd.homelab.local
        - vault.homelab.local
      policy: two_factor
      subject: "group:homelab-admins"
    # Internal tools require 1FA
    - domain:
        - grafana.homelab.local
        - longhorn.homelab.local
        - "*.homelab.local"
      policy: one_factor
      subject: "group:homelab-users"

session:
  name: authelia_session
  domain: homelab.local
  same_site: lax
  expiration: 1h
  inactivity: 5m
  redis:
    host: redis-master.security.svc.cluster.local
    port: 6379

storage:
  postgres:
    host: postgresql.postgresql.svc.cluster.local
    port: 5432
    database: authelia
    schema: public
    username: postgres
    password:
      secret_name: authelia-secrets
      secret_key: postgres-password

notifier:
  disable_startup_check: true
  filesystem:
    filename: /tmp/authelia-notifications.txt

identity_providers:
  oidc:
    hmac_secret:
      secret_name: authelia-secrets
      secret_key: session-secret
    issuer_private_key:
      path: /config/oidc.key
    clients:
      - id: gitea
        description: Gitea
        secret: "$plaintext$<generate-with: openssl rand -hex 32>"
        public: false
        authorization_policy: one_factor
        redirect_uris:
          - https://git.homelab.local/user/oauth2/authelia/callback
        scopes: [openid, profile, email, groups]
      - id: argocd
        description: ArgoCD
        secret: "$plaintext$<generate-with: openssl rand -hex 32>"
        public: false
        authorization_policy: two_factor
        redirect_uris:
          - https://argocd.homelab.local/auth/callback
        scopes: [openid, profile, email, groups]

3.4 Deploy Authelia

helm repo add authelia https://charts.authelia.com
helm repo update

helm install authelia authelia/authelia \
  --namespace security \
  --values gitops-apps/security/authelia-values.yaml \
  --set secret.existingSecret=authelia-secrets \
  --wait

# Verify
kubectl get pods -n security -l app.kubernetes.io/name=authelia

3.5 Create Ingress for Authelia

cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: authelia
  namespace: security
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
spec:
  ingressClassName: nginx
  rules:
    - host: auth.homelab.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: authelia
                port:
                  number: 80
EOF

Phase 4: Ingress-Nginx Forward-Auth Integration

Add these annotations to any Ingress object you want to protect with Authelia:

# Add to the Ingress metadata.annotations block of any service
annotations:
  nginx.ingress.kubernetes.io/auth-url: "https://auth.homelab.local/api/verify"
  nginx.ingress.kubernetes.io/auth-signin: "https://auth.homelab.local/?rd=$scheme://$host$request_uri"
  nginx.ingress.kubernetes.io/auth-response-headers: "Remote-User,Remote-Groups,Remote-Email,Remote-Name"
  nginx.ingress.kubernetes.io/proxy-buffer-size: "8k"

Example — protecting Grafana:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
  annotations:
    nginx.ingress.kubernetes.io/auth-url: "https://auth.homelab.local/api/verify"
    nginx.ingress.kubernetes.io/auth-signin: "https://auth.homelab.local/?rd=$scheme://$host$request_uri"
    nginx.ingress.kubernetes.io/auth-response-headers: "Remote-User,Remote-Groups,Remote-Email,Remote-Name"
spec:
  ingressClassName: nginx
  rules:
    - host: grafana.homelab.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: kube-prometheus-stack-grafana
                port:
                  number: 80

Phase 5: Application OIDC Configuration

5.1 Configure Gitea OIDC

In Gitea web UI → Site Administration → Authentication Sources → Add Authentication Source:

| Field | Value |
|-------|-------|
| Type | OAuth2 |
| Name | Authelia |
| OAuth2 Provider | OpenID Connect |
| Client ID | gitea |
| Client Secret | (value from authelia-values.yaml) |
| OpenID Connect Auto-Discovery URL | https://auth.homelab.local/.well-known/openid-configuration |
| Additional Scopes | groups |

5.2 Configure ArgoCD OIDC

Patch ArgoCD ConfigMap to add Authelia as an OIDC provider:

kubectl patch configmap argocd-cm -n argocd --type=merge -p '{
  "data": {
    "oidc.config": "name: Authelia\nissuer: https://auth.homelab.local\nclientID: argocd\nclientSecret: $oidc.authelia.clientSecret\nrequestedScopes: [\"openid\", \"profile\", \"email\", \"groups\"]\n",
    "url": "https://argocd.homelab.local"
  }
}'

# Store the OIDC client secret in ArgoCD secret
kubectl patch secret argocd-secret -n argocd --type=merge \
  -p '{"stringData": {"oidc.authelia.clientSecret": "<your-argocd-client-secret>"}}'

# Map the homelab-admins group to the ArgoCD admin role
kubectl patch configmap argocd-rbac-cm -n argocd --type=merge -p '{
  "data": {
    "policy.csv": "g, homelab-admins, role:admin\n",
    "policy.default": "role:readonly"
  }
}'
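
Restart the ArgoCD API server so the updated ConfigMaps and secret are picked up (the deployment name assumes the standard non-HA install):

kubectl rollout restart deployment argocd-server -n argocd
kubectl rollout status deployment argocd-server -n argocd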

Verification

# 1. Check all security namespace pods are Running
kubectl get pods -n security

# 2. Test LLDAP is responding to LDAP queries
kubectl exec -n security deploy/lldap -- \
  ldapsearch -H ldap://localhost:3890 -x -b "dc=homelab,dc=local" -D "uid=admin,ou=people,dc=homelab,dc=local" "(objectClass=person)" uid

# 3. Test Authelia health endpoint
kubectl exec -n security deploy/authelia -- \
  wget -qO- http://localhost:9091/api/health
# Expected: {"status":"OK"}

End-to-end browser test:

  1. Open a Private/Incognito window
  2. Navigate to https://argocd.homelab.local
  3. Confirm redirect to https://auth.homelab.local
  4. Log in with your LLDAP user credentials
  5. Complete TOTP (if two_factor policy applies)
  6. Confirm redirect back to ArgoCD with your user logged in
  7. Navigate to https://grafana.homelab.local — confirm Authelia portal appears

Completion Checklist

  • security namespace created
  • LLDAP deployed and accessible via web UI (kubectl port-forward)
  • LLDAP users created: primary user + homelab-admins group + authelia-bind service account
  • Redis deployed in security namespace
  • authelia-secrets Secret created with JWT, session, and storage-encryption keys
  • authelia PostgreSQL database created
  • Authelia deployed and all pods Running
  • auth.homelab.local Ingress created and resolves correctly
  • At least one service (Grafana) protected with forward-auth annotations
  • OIDC configured in Gitea — "Login with Authelia" button visible
  • OIDC configured in ArgoCD — single sign-on working
  • End-to-end browser test passed (redirect → login → TOTP → access granted)

Source: docs/guides/12-cicd-pipeline-security.md


Guide 12: CI/CD Pipeline Security

Secure the software delivery pipeline with Gitea Actions, secret scanning, vulnerability gates, and automated security checks.


Overview

This guide builds a shift-left security pipeline using Gitea Actions. Every push and PR triggers automated security scans — secret detection, dependency scanning, container scanning, IaC validation, and Kubernetes manifest checks — before code reaches ArgoCD for deployment.

Time Required: ~90 minutes
Prerequisites: Guide 07 (GitOps Stack) completed

                    CI/CD Security Pipeline
    ┌─────────────────────────────────────────────────┐
    │              Developer Push / PR                 │
    └──────────────────┬──────────────────────────────┘
                       ▼
    ┌─────────────────────────────────────────────────┐
    │            Gitea Actions Workflow                │
    │                                                  │
    │  Stage 1: Secret Scan (gitleaks)                 │
    │  Stage 2: Lint & Validate (yamllint, kubeval)   │
    │  Stage 3: IaC Scan (checkov, tfsec)              │
    │  Stage 4: Dependency Scan (trivy fs)             │
    │  Stage 5: Container Scan (trivy image)           │
    │  Stage 6: K8s Manifest Validation (conftest)     │
    │  Stage 7: Deploy Gate (ArgoCD sync)              │
    └──────────────────┬──────────────────────────────┘
                       │ (all gates pass)
                       ▼
    ┌─────────────────────────────────────────────────┐
    │        ArgoCD GitOps Auto-Sync                   │
    │   (only syncs if pipeline succeeded on branch)   │
    └─────────────────────────────────────────────────┘

Phase 1: Enable Gitea Actions

1.1 Enable Actions in Gitea

Gitea has built-in CI/CD compatible with GitHub Actions syntax.

# Enable actions in Gitea config
# SSH into the Gitea pod or edit via ConfigMap
kubectl edit configmap -n services gitea-config

Add or update in app.ini:

[actions]
ENABLED = true
DEFAULT_ACTIONS_URL = https://gitea.com

Restart Gitea:

kubectl rollout restart deployment -n services gitea
kubectl rollout status deployment -n services gitea

1.2 Enable Actions for the Repository

# Via Gitea API
GITEA_TOKEN="your-admin-token"
GITEA_URL="http://gitea.services.svc.cluster.local:3000"

# Enable actions for the homelab repo
curl -X PUT "${GITEA_URL}/api/v1/repos/homelab/gitops-apps/actions/enable" \
  -H "Authorization: token ${GITEA_TOKEN}"

Or via UI: Repository → Settings → Actions → Enable.

1.3 Create Runner Namespace

kubectl create namespace cicd
kubectl label namespace cicd environment=cicd

1.4 Deploy Gitea Act Runner

Create runner-values.yaml:

replicaCount: 1

runner:
  register: true
  config: |
    runner:
      labels:
        - "ubuntu-latest:docker://node:22-bookworm"
        - "self-hosted:kubernetes"

  # Register with Gitea instance
  name: "homelab-runner"
  token: ""  # Set via --set flag

image:
  repository: gitea/act_runner
  tag: "0.2.11"
  pullPolicy: IfNotPresent

resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    cpu: "2"
    memory: 1Gi

persistence:
  enabled: true
  size: 10Gi
  storageClass: longhorn-default

securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: false

env:
  GITEA_INSTANCE_URL: "http://gitea.services.svc.cluster.local:3000"
  GITEA_RUNNER_LABELS: "ubuntu-latest:docker://node:22-bookworm,self-hosted:kubernetes"

Install runner:

helm repo add gitea-charts https://dl.gitea.io/charts/
helm repo update

# Get runner token from Gitea UI: Site Administration → Actions → Runners → Register New Runner
helm install act-runner gitea-charts/act-runner \
  --namespace cicd \
  --set runner.token="YOUR_RUNNER_TOKEN" \
  -f runner-values.yaml \
  --wait

Verify:

kubectl get pods -n cicd
# Expected: act-runner-0 Running

Phase 2: Local Pre-Commit Hooks

2.1 Install Pre-Commit Framework

# On macOS workstation
pip3 install pre-commit gitleaks tfsec trivy yamllint

# Or via brew
brew install pre-commit gitleaks tfsec trivy yamllint

2.2 Create Pre-Commit Configuration

Create .pre-commit-config.yaml in the repository root:

repos:
  # Secret detection
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.21.2
    hooks:
      - id: gitleaks

  # YAML linting
  - repo: https://github.com/adrienverge/yamllint
    rev: v1.35.1
    hooks:
      - id: yamllint
        args: ['-d', '{extends: relaxed, rules: {line-length: {max: 120}}}']

  # Terraform formatting and validation
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.96.1
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
      - id: terraform_tflint
      - id: tfsec
        args: ['--force-all-dirs', '--no-color']

  # General checks
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
        args: ['--unsafe']
      - id: check-json
      - id: check-merge-conflict
      - id: detect-private-key
      - id: no-commit-to-branch
        args: ['--branch', 'main']

2.3 Install Hooks

cd /path/to/homelab
pre-commit install
pre-commit install --hook-type pre-push

# Run against all files to test
pre-commit run --all-files

Phase 3: Security Pipeline Workflow

3.1 Create Workflow Directory

mkdir -p gitops-apps/.gitea/workflows

3.2 Main Security Pipeline

Create gitops-apps/.gitea/workflows/security-pipeline.yaml:

name: Security Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  TRIVY_SEVERITY: "HIGH,CRITICAL"
  TRIVY_EXIT_CODE: "1"
  TRIVY_DB_REPOSITORY: "ghcr.io/aquasecurity/trivy-db"

jobs:
  # ── Stage 1: Secret Scanning ──────────────────────
  secret-scan:
    name: "🔍 Secret Detection"
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Gitleaks Scan
        uses: gitleaks/gitleaks-action@v2
        env:
          GITLEAKS_LICENSE: ""  # Community edition

  # ── Stage 2: Lint & Validate ─────────────────────
  lint:
    name: "📝 Lint & Validate"
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: YAML Lint
        run: |
          pip install yamllint
          yamllint -d '{extends: relaxed, rules: {line-length: {max: 120}}}' .

      - name: Validate K8s Manifests
        run: |
          # Install kubeval
          curl -L https://github.com/instrumenta/kubeval/releases/latest/download/kubeval-linux-amd64.tar.gz | tar xz
          sudo mv kubeval /usr/local/bin/
          find . -name '*.yaml' -o -name '*.yml' | grep -v '.gitea' | xargs kubeval --strict --ignore-missing-schemas

  # ── Stage 3: IaC Security Scan ────────────────────
  iac-scan:
    name: "🏗️ IaC Security Scan"
    runs-on: ubuntu-latest
    needs: [lint]
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Run Checkov
        run: |
          pip install checkov
          checkov -d . --framework terraform,kubernetes --skip-check CKV_K8S_21 \
            --output cli --output junitxml --output-file-path console,checkov-results.xml
        continue-on-error: false

      - name: Run tfsec (Terraform only)
        run: |
          curl -L https://github.com/aquasecurity/tfsec/releases/latest/download/tfsec-linux-amd64 -o tfsec
          chmod +x tfsec
          ./tfsec terraform/ --format junit --out tfsec-results.xml
        continue-on-error: false

      - name: Upload Scan Results
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: iac-scan-results
          path: |
            checkov-results.xml
            tfsec-results.xml

  # ── Stage 4: Dependency Scanning ──────────────────
  dependency-scan:
    name: "📦 Dependency Scan"
    runs-on: ubuntu-latest
    needs: [lint]
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Install Trivy
        run: |
          # Release tarballs are versioned, so use the official install script
          curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh | sudo sh -s -- -b /usr/local/bin

      - name: Filesystem Scan
        run: |
          trivy fs --severity HIGH,CRITICAL --exit-code 1 --format table .

      - name: Generate SBOM
        run: |
          trivy fs --format spdx-json --output sbom.spdx.json .
        continue-on-error: true

      - name: Upload SBOM
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: dependency-scan-results
          path: |
            sbom.spdx.json

  # ── Stage 5: Kubernetes Manifest Validation ───────
  k8s-validate:
    name: "☸️ K8s Manifest Validation"
    runs-on: ubuntu-latest
    needs: [lint]
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Conftest Policy Check
        run: |
          # Conftest release assets are versioned (conftest_<ver>_Linux_x86_64.tar.gz); pin one explicitly
          CONFTEST_VERSION=0.56.0
          curl -L "https://github.com/open-policy-agent/conftest/releases/download/v${CONFTEST_VERSION}/conftest_${CONFTEST_VERSION}_Linux_x86_64.tar.gz" | tar xz
          sudo mv conftest /usr/local/bin/
          conftest test --policy policies/ gitops-apps/

  # ── Stage 6: Security Gate Summary ────────────────
  security-gate:
    name: "🛡️ Security Gate"
    runs-on: ubuntu-latest
    needs: [secret-scan, iac-scan, dependency-scan, k8s-validate]
    if: always()
    steps:
      - name: Check All Scans Passed
        run: |
          echo "Secret Scan: ${{ needs.secret-scan.result }}"
          echo "IaC Scan: ${{ needs.iac-scan.result }}"
          echo "Dependency Scan: ${{ needs.dependency-scan.result }}"
          echo "K8s Validate: ${{ needs.k8s-validate.result }}"

          if [[ "${{ needs.secret-scan.result }}" == "failure" || \
                "${{ needs.iac-scan.result }}" == "failure" || \
                "${{ needs.dependency-scan.result }}" == "failure" || \
                "${{ needs.k8s-validate.result }}" == "failure" ]]; then
            echo "::error::Security gate FAILED — one or more security scans detected issues"
            exit 1
          fi

          echo "✅ All security gates passed"

3.3 Container Build & Scan Pipeline

Create gitops-apps/.gitea/workflows/container-pipeline.yaml:

name: Container Security Pipeline

on:
  push:
    paths:
      - 'container-images/**'
      - 'Dockerfile*'

jobs:
  build-and-scan:
    name: "🔨 Build & Scan Container"
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Build Image
        run: |
          docker build -t homelab-app:${{ github.sha }} .

      - name: Install Trivy
        run: |
          # Release tarballs are versioned, so use the official install script
          curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh | sudo sh -s -- -b /usr/local/bin

      - name: Container Vulnerability Scan
        run: |
          trivy image --severity HIGH,CRITICAL --exit-code 1 \
            --format table \
            --ignore-unfixed \
            homelab-app:${{ github.sha }}

      - name: Container Config Scan
        run: |
          trivy image --type config --severity HIGH,CRITICAL \
            homelab-app:${{ github.sha }}

      - name: Generate SBOM
        run: |
          trivy image --format spdx-json --output container-sbom.spdx.json \
            homelab-app:${{ github.sha }}

      - name: Generate SARIF Report
        run: |
          trivy image --format sarif --output trivy-results.sarif \
            homelab-app:${{ github.sha }}

      - name: Upload Results
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: container-scan-results
          path: |
            container-sbom.spdx.json
            trivy-results.sarif

Phase 4: Branch Protection

4.1 Configure via Gitea API

GITEA_URL="http://gitea.services.svc.cluster.local:3000"
GITEA_TOKEN="your-admin-token"
REPO="homelab/gitops-apps"

# Enable branch protection for main
curl -X PUT "${GITEA_URL}/api/v1/repos/${REPO}/branches/main/protection" \
  -H "Authorization: token ${GITEA_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "block_on_official_review_requests": true,
    "block_on_outdated_branch": true,
    "block_on_rejected_reviews": true,
    "dismiss_stale_reviews": true,
    "enable_push": false,
    "enable_status_check": true,
    "required_approvals_count": 1,
    "status_check_contexts": [
      "Security Pipeline / 🛡️ Security Gate"
    ],
    "required_signed_commits": true
  }'

4.2 Configure via UI

  1. Navigate to Repository → Settings → Branches
  2. Under Branch Protection for main:
    • Enable Require Pull Request (no direct pushes)
    • Enable Require Approval (minimum 1 reviewer)
    • Enable Require Signed Commits
    • Enable Require Status Checks: select Security Pipeline / 🛡️ Security Gate
    • Enable Dismiss Stale Reviews

Phase 5: Gitleaks Configuration

5.1 Create Gitleaks Config

Create .gitleaks.toml in the repository root:

title = "Homelab Gitleaks Configuration"

[extend]
# Use default rules as base
useDefault = true

# Allowlist patterns specific to the homelab
[[rules]]
id = "generic-api-key"
description = "Generic API Key"
regex = '''(?i)(?:key|api|token|secret|password|pwd|pw|auth)['"''\s]*(?::|=|\s+is\s+|->)\s*['"''']?[0-9a-zA-Z\-_.]{20,}'''

[[allowlist]]
description = "Allow Proxmox API URL (not a secret)"
regexes = ['''https://192\.168\.1\.11:8006''']

[[allowlist]]
description = "Allow K3s API server URL"
regexes = ['''https://10\.10\.10\.10:6443''']

[[allowlist]]
description = "Allow homelab domain references"
regexes = ['''homelab\.local''', '''\.svc\.cluster\.local''']

[[allowlist]]
paths = [
  '''^\.gitleaks\.toml$''',
  '''^docs/.*$''',
  '''^\.pre-commit-config\.yaml$'''
]

5.2 Test Gitleaks

# Scan entire repo
gitleaks detect --source . --verbose

# Scan with custom config
gitleaks detect --source . --config-path .gitleaks.toml

# Generate report
gitleaks detect --source . --report-format json --report-path gitleaks-report.json

Phase 6: Integration with ArgoCD

6.1 Conditional Sync Policy

Modify the ArgoCD root Application to only auto-sync when the pipeline passes:

# gitops-apps/argocd-apps/root-application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root-application
  namespace: argocd
  annotations:
    # Branch protection on main (Phase 4) ensures only commits that passed the security pipeline land here
    argocd.argoproj.io/sync-options: ServerSideApply=true
spec:
  project: homelab
  source:
    repoURL: https://git.homelab.local/homelab/gitops-apps.git
    targetRevision: main
    path: argocd-apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      selfHeal: true
      prune: false
    syncOptions:
      - ServerSideApply=true
    # Retry with backoff on transient failures
    retry:
      limit: 3
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m

6.2 Notification on Sync Failure

# Configure ArgoCD notifications for failed syncs
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  trigger.on-sync-failed: |
    - description: Application sync failed
      send:
        - app-sync-failed
      when: app.status.operationState.phase in ['Failed', 'Error']
  template.app-sync-failed: |
    message: |
      🔴 Application {{.app.metadata.name}} sync failed.
      Health: {{.app.status.health.status}}
      Sync Status: {{.app.status.sync.status}}
      Error: {{.app.status.operationState.message}}
EOF
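
The trigger and template above only define what to send. An Application still has to subscribe, and a notification service (Slack, webhook, email) must be configured in argocd-notifications-cm and argocd-notifications-secret. A subscription sketch, assuming a service named slack and a channel named homelab (both placeholders):

# Subscribe the root application to the on-sync-failed trigger
kubectl annotate application root-application -n argocd \
  notifications.argoproj.io/subscribe.on-sync-failed.slack="homelab" --overwrite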

Phase 7: GitOps Implementation Manifests

7.1 ArgoCD Application for Act Runner

Create gitops-apps/infrastructure/cicd/application.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: act-runner
  namespace: argocd
spec:
  project: homelab
  source:
    repoURL: https://git.homelab.local/homelab/gitops-apps.git
    targetRevision: main
    path: infrastructure/cicd
  destination:
    server: https://kubernetes.default.svc
    namespace: cicd
  syncPolicy:
    automated:
      selfHeal: true
      prune: false
    syncOptions:
      - ServerSideApply=true
      - CreateNamespace=true

7.2 Runner Deployment Manifest

Create gitops-apps/infrastructure/cicd/runner-deployment.yaml:

---
apiVersion: v1
kind: Secret
metadata:
  name: act-runner-token
  namespace: cicd
type: Opaque
stringData:
  token: ""  # Set after registration via: kubectl edit secret act-runner-token -n cicd
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: act-runner-data
  namespace: cicd
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn-default
  resources:
    requests:
      storage: 10Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: act-runner
  namespace: cicd
  labels:
    app.kubernetes.io/name: act-runner
    environment: cicd
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: act-runner
  template:
    metadata:
      labels:
        app.kubernetes.io/name: act-runner
        environment: cicd
    spec:
      containers:
        - name: runner
          image: gitea/act_runner:0.2.11
          env:
            - name: GITEA_INSTANCE_URL
              value: "http://gitea.services.svc.cluster.local:3000"
            - name: GITEA_RUNNER_NAME
              value: "homelab-runner"
            - name: GITEA_RUNNER_LABELS
              value: "ubuntu-latest:docker://node:22-bookworm,self-hosted:kubernetes"
            - name: GITEA_RUNNER_REGISTRATION_TOKEN
              valueFrom:
                secretKeyRef:
                  name: act-runner-token
                  key: token
          resources:
            requests:
              cpu: 500m
              memory: 256Mi
            limits:
              cpu: "2"
              memory: 1Gi
          volumeMounts:
            - name: runner-data
              mountPath: /data
            - name: docker-sock
              mountPath: /var/run/docker.sock
            - name: workspace
              mountPath: /home/runner/_work
      volumes:
        - name: runner-data
          persistentVolumeClaim:
            claimName: act-runner-data
        - name: docker-sock
          hostPath:
            path: /var/run/docker.sock
        - name: workspace
          emptyDir: {}
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: act-runner
  namespace: cicd

Note

The runner needs Docker access to execute job containers. K3s uses containerd rather than Docker, so the host socket mount in the manifest above usually won't work on a stock node; use the DinD (Docker-in-Docker) sidecar in 7.3, or switch to kubernetes mode where each job runs as a separate pod.
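
A quick way to confirm which variant you need (a sketch; assumes SSH access to a node, with the hostname and user as placeholders):

# K3s ships containerd, not Docker, so this usually fails on a stock node
ssh ubuntu@k3s-master-01 'ls -l /var/run/docker.sock || echo "no Docker socket - use the DinD variant in 7.3"'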

7.3 Runner with Docker-in-Docker

Create gitops-apps/infrastructure/cicd/runner-dind-deployment.yaml (alternative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: act-runner-dind
  namespace: cicd
  labels:
    app.kubernetes.io/name: act-runner
    environment: cicd
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: act-runner
  template:
    metadata:
      labels:
        app.kubernetes.io/name: act-runner
        environment: cicd
    spec:
      containers:
        - name: runner
          image: gitea/act_runner:0.2.11
          env:
            - name: GITEA_INSTANCE_URL
              value: "http://gitea.services.svc.cluster.local:3000"
            - name: GITEA_RUNNER_NAME
              value: "homelab-runner"
            - name: GITEA_RUNNER_LABELS
              value: "ubuntu-latest:docker://node:22-bookworm,self-hosted:kubernetes"
            - name: GITEA_RUNNER_REGISTRATION_TOKEN
              valueFrom:
                secretKeyRef:
                  name: act-runner-token
                  key: token
            - name: DOCKER_HOST
              value: "tcp://localhost:2376"
            - name: DOCKER_CERT_PATH
              value: "/certs/client"
            - name: DOCKER_TLS_VERIFY
              value: "1"
          resources:
            requests:
              cpu: 500m
              memory: 256Mi
            limits:
              cpu: "2"
              memory: 1Gi
          volumeMounts:
            - name: runner-data
              mountPath: /data
            - name: docker-certs
              mountPath: /certs/client
              readOnly: true
            - name: workspace
              mountPath: /home/runner/_work

        - name: dind
          image: docker:dind
          securityContext:
            privileged: true
          env:
            - name: DOCKER_TLS_CERTDIR
              value: "/certs"
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: "2"
              memory: 2Gi
          volumeMounts:
            - name: docker-certs
              mountPath: /certs/client
            - name: docker-storage
              mountPath: /var/lib/docker

      volumes:
        - name: runner-data
          persistentVolumeClaim:
            claimName: act-runner-data
        - name: docker-certs
          emptyDir: {}
        - name: docker-storage
          emptyDir: {}
        - name: workspace
          emptyDir: {}

7.4 Register the Runner

After deploying, register the runner with Gitea:

# Get registration token from Gitea admin
# UI: Site Administration → Actions → Runners → Generate Registration Token
# Or API:
GITEA_URL="http://gitea.services.svc.cluster.local:3000"
ADMIN_TOKEN="your-admin-token"

REG_TOKEN=$(curl -s "${GITEA_URL}/api/v1/admin/runners/registration-token" \
  -H "Authorization: token ${ADMIN_TOKEN}" | jq -r '.token')

# Update the secret
kubectl create secret generic act-runner-token \
  --namespace cicd \
  --from-literal=token="${REG_TOKEN}" \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart runner to pick up the token
kubectl rollout restart deployment -n cicd act-runner

7.5 Repository Secrets

Store pipeline secrets in Gitea:

REPO="homelab/gitops-apps"

# Registry credentials (for container push/pull)
curl -X PUT "${GITEA_URL}/api/v1/repos/${REPO}/actions/secrets/REGISTRY_USER" \
  -H "Authorization: token ${ADMIN_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"data": "admin"}'

curl -X PUT "${GITEA_URL}/api/v1/repos/${REPO}/actions/secrets/REGISTRY_TOKEN" \
  -H "Authorization: token ${ADMIN_TOKEN}" \
  -H "Content-Type: application/json" \
  -d "{\"data\": \"${GITEA_TOKEN}\"}"

# Vault token (for Cosign key retrieval)
curl -X PUT "${GITEA_URL}/api/v1/repos/${REPO}/actions/secrets/VAULT_TOKEN" \
  -H "Authorization: token ${ADMIN_TOKEN}" \
  -H "Content-Type: application/json" \
  -d "{\"data\": \"${VAULT_TOKEN}\"}"

7.6 Grafana Security Dashboard

7.6.1 Pipeline Metrics

Create a ConfigMap for pipeline metrics collection:

apiVersion: v1
kind: ConfigMap
metadata:
  name: pipeline-metrics-config
  namespace: monitoring
data:
  # Prometheus scrape config for Gitea metrics
  # Gitea exposes Prometheus metrics at /metrics when [metrics] ENABLED = true in app.ini;
  # set [metrics] TOKEN and use that value as the bearer token below.
  scrape-config.yaml: |
    - job_name: 'gitea'
      static_configs:
        - targets: ['gitea.services.svc.cluster.local:3000']
      metrics_path: '/metrics'
      bearer_token: '<gitea-metrics-token>'

7.6.2 Import Dashboard

Import the following dashboard in Grafana (Dashboard → Import → JSON):

Use the Gitea Actions dashboard ID or create a custom dashboard tracking:

  • Pipeline success/failure rate
  • Average pipeline duration
  • Security scan findings by severity (HIGH/CRITICAL)
  • Secret detection events
  • IaC scan violations over time

Verification

# Verify Gitea Actions is enabled
curl -s "http://gitea.services.svc.cluster.local:3000/api/v1/repos/homelab/gitops-apps/actions/runs" \
  -H "Authorization: token ${GITEA_TOKEN}" | jq '.total_count'

# Verify runner is registered
kubectl get pods -n cicd
kubectl logs -n cicd -l app.kubernetes.io/name=act-runner --tail=20

# Test pre-commit hooks
echo "password=supersecret123" > test-secret.yaml
git add test-secret.yaml
git commit -m "test" 2>&1
# Expected: gitleaks should block this commit
rm test-secret.yaml

# Trigger the pipeline manually: re-run the latest run from the Gitea UI
# (Repository → Actions), or push an empty commit on a branch with an open PR
git commit --allow-empty -m "ci: trigger security pipeline" && git push

# Check pipeline status
kubectl logs -n cicd -l app.kubernetes.io/name=act-runner --tail=50

Troubleshooting

Runner not connecting to Gitea

kubectl logs -n cicd -l app.kubernetes.io/name=act-runner
# Check: GITEA_INSTANCE_URL is correct, token is valid
# Verify: curl http://gitea.services.svc.cluster.local:3000 from runner pod
kubectl exec -n cicd deploy/act-runner -- curl -s http://gitea.services.svc.cluster.local:3000/api/v1/version

Pipeline fails on tfsec

# Run tfsec locally to reproduce
tfsec terraform/ --verbose
# Add exceptions in .tfsec.json for acceptable risks

Gitleaks false positives

# Update .gitleaks.toml allowlist
# Test: gitleaks detect --source . --config-path .gitleaks.toml --verbose

Branch protection blocking admin

# Admin can bypass in Gitea UI or use admin token
# Disable protection temporarily:
curl -X DELETE "${GITEA_URL}/api/v1/repos/${REPO}/branch_protections/main" \
  -H "Authorization: token ${GITEA_TOKEN}"

Completion Checklist

  • Gitea Actions enabled on the repository
  • Act Runner deployed in cicd namespace and registered
  • Runner registration token stored in Secret act-runner-token
  • Docker-in-Docker sidecar running (or host Docker socket mounted)
  • ArgoCD Application for act-runner deployed via GitOps
  • Repository secrets configured (REGISTRY_USER, REGISTRY_TOKEN, VAULT_TOKEN)
  • Pre-commit hooks installed locally (gitleaks, yamllint, tfsec)
  • .pre-commit-config.yaml configured
  • Security pipeline workflow created (.gitea/workflows/security-pipeline.yaml)
  • Secret scanning (gitleaks) running in pipeline
  • IaC scanning (checkov + tfsec) running in pipeline
  • Dependency scanning (trivy fs) running in pipeline
  • K8s manifest validation (conftest) running in pipeline
  • Container pipeline workflow created for image builds
  • Branch protection enabled on main (require PR, approvals, status checks, signed commits)
  • Gitleaks configuration tuned (.gitleaks.toml)
  • Pipeline fails on HIGH/CRITICAL findings
  • ArgoCD conditional sync configured
  • Grafana security dashboard created
  • Pipeline tested end-to-end with a sample push

Source: docs/guides/13-supply-chain-security.md


Guide 13: Software Supply Chain Security

Implement image signing, SBOM generation, and verification with Cosign, Syft, Grype, and Sigstore to secure the container supply chain.


Overview

This guide implements a complete software supply chain security pipeline. Every container image is scanned for vulnerabilities, documented with an SBOM, signed with Cosign, and verified at the Kubernetes admission stage before deployment.

Time Required: ~90 minutes
Prerequisites: Guide 12 (CI/CD Pipeline Security) completed

               Software Supply Chain Security
    ┌─────────────────────────────────────────────┐
    │           Container Build Pipeline           │
    │                                              │
    │  Build → Grype Scan → Syft SBOM → Cosign    │
    │                        Sign                  │
    └──────────────────┬──────────────────────────┘
                       │
          ┌────────────┼────────────┐
          ▼            ▼            ▼
    ┌──────────┐ ┌──────────┐ ┌──────────────┐
    │ Registry │ │  SBOM    │ │ Transparency │
    │ (Gitea)  │ │  Store   │ │   Log        │
    └────┬─────┘ └──────────┘ │  (Rekor)     │
         │                    └──────────────┘
         ▼
    ┌─────────────────────────────────────────────┐
    │       Kubernetes Admission Control           │
    │                                              │
    │  Kyverno Policy: Verify Cosign Signature     │
    │  → Only signed images allowed in production  │
    └─────────────────────────────────────────────┘

Phase 1: Install Supply Chain Tools

1.1 Install on Workstation

# macOS
brew install cosign syft grype

# Verify versions
cosign version
syft version
grype version

1.2 Install in CI Runner

The tools will be installed inline in the Gitea Actions workflows. No separate deployment needed.


Phase 2: Container Registry Setup

2.1 Enable Gitea Container Registry

Gitea has a built-in OCI-compliant container registry.

# Verify container registry is enabled in Gitea config
# Edit via ConfigMap or Gitea admin UI
# In app.ini:
[packages]
ENABLED = true

# Registry URL: git.homelab.local (port 3000)
# Push format: git.homelab.local:3000/homelab/<image>:<tag>

2.2 Test Registry Access

# Login to Gitea container registry
docker login git.homelab.local:3000 -u admin -p "${GITEA_TOKEN}"

# Test push
docker pull alpine:latest
docker tag alpine:latest git.homelab.local:3000/homelab/alpine:latest
docker push git.homelab.local:3000/homelab/alpine:latest

# Verify
curl -s "http://gitea.services.svc.cluster.local:3000/api/v1/packages/homelab" \
  -H "Authorization: token ${GITEA_TOKEN}" | jq .

Phase 3: Cosign Image Signing

3.1 Generate Signing Key Pair

# Generate key pair — store private key securely
cosign generate-key-pair

# Files created:
#   cosign.key    (private key — NEVER commit to git)
#   cosign.pub    (public key  — safe to commit)

# Move private key to Vault
kubectl port-forward -n security svc/vault 8200:8200 &
VAULT_ADDR="http://127.0.0.1:8200"

# Store private key in Vault
vault kv put secret/supply-chain/cosign \
  private-key="$(cat cosign.key)"

# Commit public key to repository
cp cosign.pub gitops-apps/security/cosign/cosign.pub
rm cosign.key cosign.pub

3.2 Sign an Image

# Pull and push image to Gitea registry
export COSIGN_PASSWORD=""  # Set if key is password-protected

# Retrieve private key from Vault for signing
vault kv get -field=private-key secret/supply-chain/cosign > /tmp/cosign.key

# Sign the image
cosign sign --key /tmp/cosign.key \
  git.homelab.local:3000/homelab/alpine:latest

# Verify the signature
cosign verify --key gitops-apps/security/cosign/cosign.pub \
  git.homelab.local:3000/homelab/alpine:latest

# Clean up
rm /tmp/cosign.key

3.3 Keyless Signing with Sigstore (Optional)

For air-gapped or fully local environments, use key-based signing. For public/internet-connected setups, Sigstore Fulcio provides ephemeral key signing:

# Keyless signing (requires internet for Sigstore)
cosign sign \
  --yes \
  git.homelab.local:3000/homelab/alpine:latest

# Verify with keyless (checks Rekor transparency log)
cosign verify \
  git.homelab.local:3000/homelab/alpine:latest

Note

Keyless signing requires outbound internet to sigstore.dev. For the homelab's isolated network, key-based signing with Vault is the recommended approach.


Phase 4: SBOM Generation with Syft

4.1 Generate SBOM for Container Image

# SPDX format
syft git.homelab.local:3000/homelab/alpine:latest -o spdx-json > alpine-sbom.spdx.json

# CycloneDX format
syft git.homelab.local:3000/homelab/alpine:latest -o cyclonedx-json > alpine-sbom.cyclonedx.json

# Table format (human-readable)
syft git.homelab.local:3000/homelab/alpine:latest -o table

4.2 Attach SBOM to Image

# Attach SBOM as OCI artifact to the image in registry
cosign attach sbom --sbom alpine-sbom.spdx.json \
  git.homelab.local:3000/homelab/alpine:latest

# Verify SBOM is attached
cosign download sbom git.homelab.local:3000/homelab/alpine:latest

4.3 Sign the SBOM Attestation

# Create signed attestation for the SBOM
# (re-fetch the private key from Vault as in 3.2; it was deleted after signing)
vault kv get -field=private-key secret/supply-chain/cosign > /tmp/cosign.key

cosign attest --predicate alpine-sbom.spdx.json --type spdx \
  --key /tmp/cosign.key \
  git.homelab.local:3000/homelab/alpine:latest

rm /tmp/cosign.key

# Verify the attestation
cosign verify-attestation --type spdx \
  --key gitops-apps/security/cosign/cosign.pub \
  git.homelab.local:3000/homelab/alpine:latest

Phase 5: Vulnerability Scanning with Grype

5.1 Scan Container Images

# Scan image directly
grype git.homelab.local:3000/homelab/alpine:latest

# Scan with severity filter
grype git.homelab.local:3000/homelab/alpine:latest --fail-on high

# Output SARIF format
grype git.homelab.local:3000/homelab/alpine:latest -o sarif > grype-results.sarif

# Output JSON for programmatic processing
grype git.homelab.local:3000/homelab/alpine:latest -o json > grype-results.json

5.2 Scan SBOMs Offline

# Scan from SBOM file (no registry access needed)
grype sbom:./alpine-sbom.spdx.json

# Useful for air-gapped scanning workflows
grype sbom:./alpine-sbom.spdx.json --fail-on critical

5.3 Grype Configuration

Create .grype.yaml:

# Fail pipeline on these severity levels
fail-on-severity: "high"

# Ignore specific vulnerabilities (with justification)
ignore:
  - vulnerability: CVE-2023-XXXXX
    fix-state: not-fixed
    reason: "No fix available; acceptable risk in homelab"

# Only show fixed vulnerabilities
only-fixed: false

# Registry auth for private registry
registry:
  auth:
    - authority: git.homelab.local:3000
      username: admin
      password: "${GITEA_TOKEN}"

Phase 6: CI/CD Pipeline Integration

6.1 Supply Chain Workflow

Create gitops-apps/.gitea/workflows/supply-chain.yaml:

name: Supply Chain Security

on:
  push:
    branches: [main]
    paths:
      - 'container-images/**'
      - 'Dockerfile*'

env:
  REGISTRY: git.homelab.local:3000
  IMAGE_NAME: homelab/app

jobs:
  build-scan-sign:
    name: "🔐 Build → Scan → Sign"
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Install Tools
        run: |
          # Cosign
          curl -L https://github.com/sigstore/cosign/releases/latest/download/cosign-linux-amd64 -o /usr/local/bin/cosign
          chmod +x /usr/local/bin/cosign
          # Syft
          curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | sh -s -- -b /usr/local/bin
          # Grype
          curl -sSfL https://raw.githubusercontent.com/anchore/grype/main/install.sh | sh -s -- -b /usr/local/bin
          # The Vault CLI is also required by the signing steps below; install it here
          # or bake it into the runner/job image (not shown)

      - name: Login to Registry
        run: |
          docker login ${REGISTRY} -u ${{ secrets.REGISTRY_USER }} -p ${{ secrets.REGISTRY_TOKEN }}

      - name: Build Image
        run: |
          docker build -t ${REGISTRY}/${IMAGE_NAME}:${{ github.sha }} .
          docker push ${REGISTRY}/${IMAGE_NAME}:${{ github.sha }}

      # Step 1: Vulnerability Scan
      - name: Grype Vulnerability Scan
        run: |
          grype ${REGISTRY}/${IMAGE_NAME}:${{ github.sha }} \
            --fail-on high -o json > grype-results.json
          grype ${REGISTRY}/${IMAGE_NAME}:${{ github.sha }} \
            -o sarif > grype-results.sarif

      # Step 2: Generate SBOM
      - name: Generate SBOM (SPDX)
        run: |
          syft ${REGISTRY}/${IMAGE_NAME}:${{ github.sha }} -o spdx-json > sbom.spdx.json
          syft ${REGISTRY}/${IMAGE_NAME}:${{ github.sha }} -o cyclonedx-json > sbom.cyclonedx.json

      # Step 3: Attach SBOM
      - name: Attach SBOM to Image
        run: |
          cosign attach sbom --sbom sbom.spdx.json \
            ${REGISTRY}/${IMAGE_NAME}:${{ github.sha }}

      # Step 4: Sign Image
      - name: Sign Image with Cosign
        run: |
          # Retrieve signing key from Vault
          export VAULT_ADDR="http://vault.security.svc.cluster.local:8200"
          export VAULT_TOKEN="${{ secrets.VAULT_TOKEN }}"
          vault kv get -field=private-key secret/supply-chain/cosign > /tmp/cosign.key

          cosign sign --key /tmp/cosign.key --yes \
            ${REGISTRY}/${IMAGE_NAME}:${{ github.sha }}

          rm -f /tmp/cosign.key

      # Step 5: Sign SBOM Attestation
      - name: Attest SBOM
        run: |
          export VAULT_ADDR="http://vault.security.svc.cluster.local:8200"
          export VAULT_TOKEN="${{ secrets.VAULT_TOKEN }}"
          vault kv get -field=private-key secret/supply-chain/cosign > /tmp/cosign.key

          cosign attest --predicate sbom.spdx.json --type spdx \
            --key /tmp/cosign.key \
            ${REGISTRY}/${IMAGE_NAME}:${{ github.sha }}

          rm -f /tmp/cosign.key

      - name: Upload Artifacts
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: supply-chain-artifacts
          path: |
            grype-results.sarif
            grype-results.json
            sbom.spdx.json
            sbom.cyclonedx.json

      - name: Verify Signature
        run: |
          cosign verify --key security/cosign/cosign.pub \
            ${REGISTRY}/${IMAGE_NAME}:${{ github.sha }}

6.2 Store Secrets in Gitea

# Add repository secrets for the pipeline
GITEA_URL="http://gitea.services.svc.cluster.local:3000"
REPO="homelab/gitops-apps"
TOKEN="your-admin-token"

curl -X PUT "${GITEA_URL}/api/v1/repos/${REPO}/actions/secrets/REGISTRY_USER" \
  -H "Authorization: token ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"data": "admin"}'

curl -X PUT "${GITEA_URL}/api/v1/repos/${REPO}/actions/secrets/REGISTRY_TOKEN" \
  -H "Authorization: token ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d "{\"data\": \"${GITEA_TOKEN}\"}"

Phase 7: Kyverno Signature Verification Policy

7.1 Create Verification Policy

Update gitops-apps/security/kyverno/policies.yaml to add signature verification:

---
# Only allow Cosign-signed images in production namespaces
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
  annotations:
    policies.kyverno.io/title: Verify Image Signatures
    policies.kyverno.io/category: Supply Chain Security
    policies.kyverno.io/severity: high
    policies.kyverno.io/description: >-
      Only allow container images that have been signed with the homelab
      Cosign key. Prevents deployment of unsigned or tampered images.
spec:
  validationFailureAction: Audit  # Change to Enforce after testing
  background: false
  rules:
    - name: verify-cosign-signature
      match:
        any:
          - resources:
              kinds:
                - Pod
              namespaces:
                - production
                - services
                - monitoring
                - security
      exclude:
        any:
          - resources:
              namespaces:
                - kube-system
                - longhorn-system
                - argocd
                - kyverno
                - falco
                - logging
                - cicd
      verifyImages:
        - imageReferences:
            - "git.homelab.local:3000/*"
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      # Paste contents of cosign.pub here
                      -----END PUBLIC KEY-----
          attestations:
            - type: https://spdx.dev/Document
              conditions:
                - all:
                    - key: "{{ contents[].SPDXID }}"
                      operator: NotEquals
                      value: ""

7.2 Apply the Policy

kubectl apply -f gitops-apps/security/kyverno/policies.yaml

# Verify policy is active
kubectl get clusterpolicy verify-image-signatures -o wide

7.3 Test the Policy

# Deploy an unsigned image — should be logged (audit mode)
kubectl run test-unsigned --image=git.homelab.local:3000/homelab/alpine:latest -n services

# Check Kyverno audit logs
kubectl logs -n kyverno -l app.kubernetes.io/name=kyverno --tail=20

# Deploy a signed image — should pass
# (After signing the image per Phase 3)
kubectl run test-signed --image=git.homelab.local:3000/homelab/alpine:latest -n services

# Switch to Enforce mode when ready
# Edit the policy: validationFailureAction: Enforce

Phase 8: Supply Chain Dashboard

8.1 Metrics Collection

Create a CronJob to periodically scan images and push metrics to Prometheus:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: supply-chain-scanner
  namespace: monitoring
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          # Needs a ServiceAccount bound to a ClusterRole that can list pods, and an
          # image containing grype, kubectl and jq (anchore/grype alone ships neither
          # kubectl nor jq, so build a small custom scanner image)
          serviceAccountName: supply-chain-scanner  # hypothetical name; create it with matching RBAC
          containers:
            - name: scanner
              image: anchore/grype:latest  # replace with the custom scanner image
              command:
                - /bin/sh
                - -c
                - |
                  # Scan all running container images in the cluster
                  IMAGES=$(kubectl get pods -A -o json | jq -r '.items[].spec.containers[].image' | sort -u)
                  for IMG in $IMAGES; do
                    echo "Scanning: $IMG"
                    grype "$IMG" -o json >> /tmp/scan-results.json
                  done
              env:
                - name: DOCKER_CONFIG
                  value: /tmp/.docker
          restartPolicy: OnFailure

8.2 Grafana Dashboard

Create a Grafana dashboard showing:

  • Total images scanned vs signed
  • Vulnerability count by severity over time
  • SBOM coverage (images with/without SBOM)
  • Unsigned deployment attempts (from Kyverno audit logs)
  • Most vulnerable images (top 10)

Import via Grafana provisioning ConfigMap or UI.
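
A minimal provisioning sketch, assuming the kube-prometheus-stack Grafana sidecar is enabled (it imports ConfigMaps labelled grafana_dashboard) and that dashboard.json is a dashboard exported from the Grafana UI:

# Package the exported dashboard JSON as a ConfigMap the sidecar will pick up
kubectl create configmap supply-chain-dashboard \
  --namespace monitoring \
  --from-file=supply-chain.json=dashboard.json
kubectl label configmap supply-chain-dashboard -n monitoring grafana_dashboard=1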


Verification

# Verify Cosign installation
cosign version

# Verify signing key in Vault
vault kv get secret/supply-chain/cosign

# Verify public key in repo
cat gitops-apps/security/cosign/cosign.pub

# Test full pipeline: sign, verify, scan
export IMAGE="git.homelab.local:3000/homelab/alpine:latest"

# Sign
vault kv get -field=private-key secret/supply-chain/cosign > /tmp/cosign.key
cosign sign --key /tmp/cosign.key --yes ${IMAGE}
rm /tmp/cosign.key

# Verify
cosign verify --key gitops-apps/security/cosign/cosign.pub ${IMAGE}

# Generate SBOM
syft ${IMAGE} -o table

# Scan with Grype
grype ${IMAGE} --fail-on critical

# Verify Kyverno policy is active
kubectl get clusterpolicy verify-image-signatures

Troubleshooting

Cosign cannot reach registry

# Test registry connectivity
curl -v http://gitea.services.svc.cluster.local:3000/v2/
# Check: DNS resolves, port is correct, auth is configured

Kyverno blocking all images

# Check policy is in audit mode first
kubectl get clusterpolicy verify-image-signatures -o yaml | grep validationFailureAction
# Start with Audit, verify logs, then switch to Enforce

Grype database update fails

# Grype downloads vulnerability DB from github.com
# For air-gapped setups, import a pre-downloaded DB archive: grype db import <db-archive.tar.gz>
grype db update

SBOM too large for attestation

# Use CycloneDX (more compact) instead of SPDX
syft ${IMAGE} -o cyclonedx-json > sbom.json
# Or filter to packages only
syft ${IMAGE} -o spdx-json --exclude '**/test/**' > sbom.json

Completion Checklist

  • Cosign installed and key pair generated
  • Private key stored in Vault (secret/supply-chain/cosign)
  • Public key committed to gitops-apps/security/cosign/cosign.pub
  • Gitea container registry enabled and tested
  • Syft generates SBOMs in SPDX and CycloneDX formats
  • Grype scans images with HIGH severity gate
  • Supply chain CI workflow created (build → scan → SBOM → sign → attest)
  • Cosign signs images and attaches SBOM attestations
  • Kyverno policy verify-image-signatures deployed in Audit mode
  • Unsigned image deployment detected by Kyverno
  • Signed image verification passes
  • Supply chain artifacts uploaded to Gitea Actions
  • Grafana supply chain dashboard created
  • CronJob for periodic image scanning deployed
  • .grype.yaml configuration committed

Source: docs/guides/14-iac-security.md


Guide 14: Infrastructure as Code Security

Scan Terraform and Kubernetes manifests with tfsec, Checkov, and Terrascan to catch misconfigurations before deployment.


Overview

This guide implements shift-left IaC security scanning. Every Terraform plan and Kubernetes manifest is scanned for misconfigurations, compliance violations, and security risks — both locally via pre-commit hooks and in the CI/CD pipeline.

Time Required: ~60 minutes
Prerequisites: Guide 03 (Terraform Infrastructure), Guide 12 (CI/CD Pipeline Security) completed

               IaC Security Scanning Pipeline
    ┌───────────────────────────────────────────────┐
    │           Developer Push / PR                  │
    └──────────────────┬────────────────────────────┘
                       ▼
    ┌───────────────────────────────────────────────┐
    │              Pre-Commit Hooks                  │
    │   tfsec · Checkov · Terrascan · Conftest      │
    └──────────────────┬────────────────────────────┘
                       │ (local pass)
                       ▼
    ┌───────────────────────────────────────────────┐
    │           Gitea Actions Pipeline               │
    │                                                │
    │  tfsec → Terraform scanning                    │
    │  Checkov → Terraform + K8s scanning            │
    │  Terrascan → OPA-based IaC scanning            │
    │  Conftest → K8s manifest policy checks         │
    └──────────────────┬────────────────────────────┘
                       │ (all pass)
                       ▼
    ┌───────────────────────────────────────────────┐
    │         ArgoCD GitOps Deployment               │
    └───────────────────────────────────────────────┘

Phase 1: Install IaC Security Tools

1.1 Workstation Installation

# macOS
brew install tfsec checkov terrascan conftest

# Verify installations
tfsec --version
checkov --version
terrascan version
conftest --version

1.2 Docker Alternatives (Optional)

# Run via Docker if preferred
docker run --rm -v $(pwd):/src aquasec/tfsec /src
docker run --rm -v $(pwd):/src bridgecrew/checkov -d /src
docker run --rm -v $(pwd):/src accurics/terrascan scan -d /src
docker run --rm -v $(pwd)/policies:/policies -v $(pwd)/manifests:/manifests openpolicyagent/conftest test /manifests

Phase 2: tfsec — Terraform Security Scanner

2.1 Scan the Homelab Terraform

cd /path/to/homelab

# Scan all Terraform code
tfsec terraform/ --verbose

# Output in JSON format
tfsec terraform/ --format json --out tfsec-results.json

# Output in SARIF (for CI integration)
tfsec terraform/ --format sarif --out tfsec-results.sarif

2.2 Configure tfsec

Create terraform/.tfsec.json:

{
  "exclude": [
    "GEN001",
    "GEN003"
  ],
  "severity_overrides": {
    "DS002": "HIGH",
    "AWS001": "CRITICAL"
  }
}

2.3 Suppress False Positives

Add inline suppressions to Terraform code where needed:

# tfsec:ignore:GEN001 Proxmox local-only — no remote state backend needed
terraform {
  required_providers {
    proxmox = {
      source  = "bpg/proxmox"
      version = "~> 0.73"
    }
  }
}

# tfsec:ignore:DS002 Using HTTP for internal Proxmox API — no public exposure
provider "proxmox" {
  endpoint = "https://192.168.1.11:8006"
  insecure = true
}

2.4 Custom tfsec Checks (Optional)

Create terraform/.tfsec/custom_checks/ directory with custom checks specific to the homelab:

# terraform/.tfsec/custom_checks/require_longhorn_storage.yaml
checks:
  - code: HOMELAB001
    description: "K3s VMs must use Longhorn storage class"
    requiredTypes:
      - resource
    requiredLabels:
      - proxmox_vm_qemu
    matchSpec:
      name: disk
      action: contains
      value: "longhorn"
    severity: MEDIUM
    relatedLinks:
      - "https://homelab.local/docs/storage"

Phase 3: Checkov — Multi-Framework Scanner

3.1 Scan Terraform

# Scan Terraform code
checkov -d terraform/ --framework terraform --output cli

# Scan with all frameworks
checkov -d . --output cli

# Output in JSON
checkov -d terraform/ --output json --output-file-path checkov-tf.json

# Output in SARIF
checkov -d terraform/ --output sarif --output-file-path checkov-tf.sarif

3.2 Scan Kubernetes Manifests

# Scan K8s manifests
checkov -d gitops-apps/ --framework kubernetes --output cli

# Scan specific directory
checkov -d gitops-apps/security/ --framework kubernetes

3.3 Configure Checkov

Create .checkov.yaml in the repository root:

# Checkov configuration for homelab
branch: main

# Skip specific checks that don't apply to the homelab
skip-check:
  - CKV_K8S_21  # "The default namespace should not be used" — acceptable for system namespaces
  - CKV_K8S_38  # "Ensure that Service Account Tokens are only mounted where necessary"
  - CKV_TF_1    # "Use HTTPS for Proxmox API" — internal network, self-signed cert
  - CKV_TF_2    # "Use remote state" — local state acceptable for homelab

# Soft fail — report but don't fail the pipeline on these
soft-fail-on:
  - CKV_K8S_14  # "Image tag should be specified" — some test images use latest
  - CKV_K8S_43  # "Image pull policy should be 'Always'"

# Framework-specific settings
framework:
  - terraform
  - kubernetes

# Output directory for results
output: cli
compact: true

3.4 Suppress False Positives in Code

In Terraform files:

# checkov:skip=CKV_TF_1 Internal Proxmox API on isolated network
provider "proxmox" {
  endpoint = "https://192.168.1.11:8006"
  insecure = true
}

In Kubernetes manifests:

metadata:
  annotations:
    checkov.io/skip: "CKV_K8S_21=System namespace, CKV_K8S_38=Service account token required for Vault auth"

3.5 Custom Checkov Policies

Create policies/checkov/ directory with custom Python policies:

# policies/checkov/require_longhorn_storage.py
from checkov.terraform.checks.resource.base_resource_check import BaseResourceCheck

class RequireLonghornStorage(BaseResourceCheck):
    def __init__(self):
        name = "Ensure K8s PVCs use Longhorn storage classes"
        id = "CKV_HOMELAB_1"
        supported_resources = ["kubernetes_persistent_volume_claim"]
        categories = ["storage"]
        super().__init__(name=name, id=id, categories=categories, supported_resources=supported_resources)

    def scan_resource_conf(self, conf):
        storage_class = conf.get("spec", [{}])[0].get("storage_class_name", "")
        allowed = ["longhorn-critical", "longhorn-default", "longhorn-ephemeral"]
        if storage_class in allowed:
            return CheckResult.PASSED
        return CheckResult.FAILED

check = RequireLonghornStorage()
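
Checkov does not discover the custom policy on its own; point a scan at the directory with --external-checks-dir (usage sketch):

# Load the homelab custom checks alongside the built-in ones
checkov -d . --framework terraform,kubernetes \
  --external-checks-dir policies/checkov/ --compact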

Phase 4: Terrascan — OPA-Based Scanner

4.1 Scan the Repository

# Scan Terraform code
terrascan scan -d terraform/ -i terraform

# Scan Kubernetes manifests
terrascan scan -d gitops-apps/ -i k8s

# Output in JSON
terrascan scan -d terraform/ -i terraform -o json > terrascan-results.json

# Output in YAML
terrascan scan -d gitops-apps/ -i k8s -o yaml > terrascan-k8s.yaml

4.2 Configure Terrascan

Create terrascan-config.toml:

[notifications]

[rules]
# Skip rules not applicable to homelab
skip-rules = [
    "AC_K8S_37",  # Require resource limits — handled by Kyverno
    "AC_K8S_38",  # Service account token mounting
]

# Severity threshold
severity = "medium"

# Categories to scan
categories = [
    "Security",
    "Compliance",
]

4.3 Custom Terrascan Policies

Create custom Rego policies in policies/terrascan/:

# policies/terrascan/require_homelab_labels.rego
package custom.kubernetes.require_labels

import future.keywords.in

__rego_metadata__ := {
    "id": "HOMELAB_001",
    "avd_id": "AVD-HOMELAB-0001",
    "title": "Resources must have homelab labels",
    "short_code": "require-labels",
    "version": "v1.0.0",
    "severity": "LOW",
    "type": "Kubernetes",
    "description": "All Kubernetes resources must have app.kubernetes.io/name label",
    "recommended_actions": "Add app.kubernetes.io/name label to resources",
    "url": "https://homelab.local/docs/labels",
}

deny[cause] {
    resource := input.resource
    not has_required_label(resource)
    cause := sprintf("Resource '%s' missing required label 'app.kubernetes.io/name'", [resource.metadata.name])
}

has_required_label(resource) {
    resource.metadata.labels["app.kubernetes.io/name"]
}
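
To run Terrascan against the custom Rego policies instead of only its built-in ones, pass the policy path explicitly (a sketch; verify the flag with terrascan scan --help on your version):

# Scan Kubernetes manifests using the homelab Rego policies
terrascan scan -i k8s -d gitops-apps/ -p policies/terrascan/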

Phase 5: Conftest — K8s Manifest Policy Engine

5.1 Create Policy Library

Create policies/conftest/ directory:

# policies/conftest/require_labels.rego
package main

deny[msg] {
    input.kind == "Deployment"
    not input.metadata.labels["app.kubernetes.io/name"]
    msg := sprintf("Deployment '%s' must have app.kubernetes.io/name label", [input.metadata.name])
}

deny[msg] {
    input.kind == "Deployment"
    container := input.spec.template.spec.containers[_]
    not container.resources.limits.cpu
    msg := sprintf("Container '%s' in Deployment '%s' must have a CPU limit", [container.name, input.metadata.name])
}

deny[msg] {
    input.kind == "Deployment"
    container := input.spec.template.spec.containers[_]
    not container.resources.limits.memory
    msg := sprintf("Container '%s' in Deployment '%s' must have a memory limit", [container.name, input.metadata.name])
}
# policies/conftest/disallow_latest.rego
package main

deny[msg] {
    input.kind == "Deployment"
    container := input.spec.template.spec.containers[_]
    endswith(container.image, ":latest")
    msg := sprintf("Container '%s' in Deployment '%s' uses :latest tag — use specific version", [container.name, input.metadata.name])
}
# policies/conftest/require_storage_class.rego
package main

import future.keywords.in

deny[msg] {
    input.kind == "PersistentVolumeClaim"
    not input.spec.storageClassName
    msg := sprintf("PVC '%s' must specify a storageClassName — use longhorn-default or longhorn-critical", [input.metadata.name])
}

deny[msg] {
    input.kind == "PersistentVolumeClaim"
    allowed := {"longhorn-critical", "longhorn-default", "longhorn-ephemeral"}
    not input.spec.storageClassName in allowed
    msg := sprintf("PVC '%s' uses unsupported storageClass '%s' — use Longhorn classes", [input.metadata.name, input.spec.storageClassName])
}

5.2 Test Policies

# Test all K8s manifests against policies
conftest test --policy policies/conftest/ gitops-apps/

# Test specific file
conftest test --policy policies/conftest/ gitops-apps/security/kyverno/policies.yaml

# Output in JSON
conftest test --policy policies/conftest/ --output json gitops-apps/

# Combine with Kubernetes rendering (if using Helm)
helm template my-chart ./chart | conftest test --policy policies/conftest/ -

5.3 Test the Policies (Unit Tests)

Create policies/conftest/require_labels_test.rego:

package main

test_deployment_has_label {
    count(deny) == 0 with input as {
        "kind": "Deployment",
        "metadata": {
            "name": "test",
            "labels": {"app.kubernetes.io/name": "test"}
        }
    }
}

test_deployment_missing_label {
    deny[_] with input as {
        "kind": "Deployment",
        "metadata": {"name": "test"}
    }
}

# Run policy unit tests
conftest verify --policy policies/conftest/

Phase 6: Pipeline Integration

6.1 IaC Scanning Workflow

Create gitops-apps/.gitea/workflows/iac-security.yaml:

name: IaC Security Scan

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  tfsec:
    name: "🔒 tfsec"
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run tfsec
        run: |
          curl -L https://github.com/aquasecurity/tfsec/releases/latest/download/tfsec-linux-amd64 -o tfsec
          chmod +x tfsec
          ./tfsec terraform/ --format json --out tfsec-results.json --no-color

      - name: Upload Results
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: tfsec-results
          path: tfsec-results.json

  checkov:
    name: "🛡️ Checkov"
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run Checkov
        run: |
          pip install checkov
          checkov -d . --framework terraform,kubernetes \
            --output cli \
            --output junitxml \
            --output-file-path console,checkov-results.xml \
            --compact \
            --config-file .checkov.yaml

      - name: Upload Results
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: checkov-results
          path: checkov-results.xml

  terrascan:
    name: "🔍 Terrascan"
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run Terrascan
        run: |
          curl -L "$(curl -s https://api.github.com/repos/accurics/terrascan/releases/latest | grep -o -E 'https://.+?_Linux_x86_64.tar.gz')" -o terrascan.tar.gz
          tar -xf terrascan.tar.gz terrascan && rm terrascan.tar.gz
          ./terrascan scan -d . -i terraform -o json > terrascan-tf-results.json
          ./terrascan scan -d . -i k8s -o json > terrascan-k8s-results.json

      - name: Upload Results
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: terrascan-results
          path: |
            terrascan-tf-results.json
            terrascan-k8s-results.json

  conftest:
    name: "📋 Conftest"
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run Conftest
        run: |
          curl -L https://github.com/open-policy-agent/conftest/releases/latest/download/conftest_linux_amd64.tar.gz | tar xz
          sudo mv conftest /usr/local/bin/
          conftest test --policy policies/conftest/ --output table gitops-apps/

      - name: Conftest Verify (Unit Tests)
        run: |
          conftest verify --policy policies/conftest/

Phase 7: Pre-Commit Integration

7.1 Update Pre-Commit Config

Add IaC scanners to .pre-commit-config.yaml:

repos:
  # Terraform security
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.96.1
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
      - id: terraform_tflint
      - id: terraform_tfsec
        args: ['--args=--force-all-dirs --no-color --config-file=terraform/.tfsec.json']

  # Checkov
  - repo: https://github.com/bridgecrewio/checkov
    rev: "3.2.232"
    hooks:
      - id: checkov
        args: ['--directory', '.', '--framework', 'terraform,kubernetes', '--compact', '--quiet']

  # YAML linting
  - repo: https://github.com/adrienverge/yamllint
    rev: v1.35.1
    hooks:
      - id: yamllint
        args: ['-d', '{extends: relaxed, rules: {line-length: {max: 120}}}']

Phase 8: Reporting

8.1 Centralized Results Storage

# Store scan results in Gitea Packages as generic artifacts
GITEA_URL="http://gitea.services.svc.cluster.local:3000"
TOKEN="your-token"

# Upload scan results after each pipeline run
for REPORT in tfsec-results.json checkov-results.xml terrascan-results.json; do
  curl -X PUT "${GITEA_URL}/api/v1/packages/homelab/generic/iac-reports/${GITHUB_SHA}/${REPORT}" \
    -H "Authorization: token ${TOKEN}" \
    --data-binary "@${REPORT}"
done

8.2 Historical Tracking

Create a simple script to track scan results over time:

#!/bin/bash
# scripts/track-iac-metrics.sh
# Push IaC scan metrics to Prometheus Pushgateway for Grafana visualization

PUSHGATEWAY="http://kube-prometheus-stack-prometheus-pushgateway.monitoring.svc.cluster.local:9091"

# Count findings by severity
TFSEC_HIGH=$(jq '[.results[] | select(.severity=="HIGH")] | length' tfsec-results.json 2>/dev/null || echo "0")
TFSEC_MED=$(jq '[.results[] | select(.severity=="MEDIUM")] | length' tfsec-results.json 2>/dev/null || echo "0")
TFSEC_LOW=$(jq '[.results[] | select(.severity=="LOW")] | length' tfsec-results.json 2>/dev/null || echo "0")

# Push to Pushgateway
cat <<EOF | curl --data-binary @- ${PUSHGATEWAY}/metrics/job/iac_security
iac_tfsec_findings{severity="high"} ${TFSEC_HIGH}
iac_tfsec_findings{severity="medium"} ${TFSEC_MED}
iac_tfsec_findings{severity="low"} ${TFSEC_LOW}
EOF
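
To confirm the metrics landed, query Prometheus directly (assumes the kube-prometheus-stack default Prometheus service name; adjust if yours differs):

# Port-forward Prometheus and query the pushed IaC metrics
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 &
curl -s 'http://127.0.0.1:9090/api/v1/query?query=iac_tfsec_findings' | jq '.data.result'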

Verification

# Run all scanners locally and verify they produce results
echo "=== tfsec ==="
tfsec terraform/ --no-color
echo ""
echo "=== Checkov (Terraform) ==="
checkov -d terraform/ --framework terraform --compact
echo ""
echo "=== Checkov (Kubernetes) ==="
checkov -d gitops-apps/ --framework kubernetes --compact
echo ""
echo "=== Terrascan ==="
terrascan scan -d terraform/ -i terraform
echo ""
echo "=== Conftest ==="
conftest test --policy policies/conftest/ gitops-apps/

# Verify config files exist
ls -la .checkov.yaml terraform/.tfsec.json terrascan-config.toml

# Verify policy directories
ls -la policies/conftest/
ls -la policies/checkov/

# Run pre-commit
pre-commit run --all-files

Troubleshooting

tfsec reports no results

# Ensure Terraform files are valid
cd terraform && terraform init && terraform validate
# tfsec scans valid Terraform — if files have syntax errors, tfsec skips them

Checkov too noisy

# Use --compact flag and configure .checkov.yaml skip-check list
checkov -d . --compact --skip-check CKV_K8S_21,CKV_K8S_38
# Add permanent skips to .checkov.yaml

Terrascan fails to parse

# Terrascan requires valid syntax — validate files first
terrascan scan -d . -i terraform --verbose
# Use --skip-rules for non-applicable checks
terrascan scan -d . -i terraform --skip-rules="AC_K8S_37"

Conftest policy errors

# Test Rego syntax
conftest parse gitops-apps/argocd-apps/root-application.yaml
# Verify policy with unit tests
conftest verify --policy policies/conftest/
# Debug with trace flag
conftest test --policy policies/conftest/ --trace gitops-apps/

Completion Checklist

  • tfsec installed and scanning Terraform code
  • .tfsec.json configuration created with homelab-specific exceptions
  • Checkov scanning both Terraform and Kubernetes manifests
  • .checkov.yaml configuration created
  • Terrascan installed and scanning with OPA policies
  • Terrascan config (terrascan-config.toml) created
  • Conftest policies created in policies/conftest/
  • Policy unit tests pass (conftest verify)
  • Custom Rego policies for homelab requirements (labels, storage class)
  • IaC scanning pipeline integrated into Gitea Actions
  • Pre-commit hooks include all IaC scanners
  • False positives suppressed with inline annotations
  • Scan results uploaded as CI artifacts
  • Metrics pushed to Prometheus Pushgateway
  • Grafana IaC security dashboard created

Source: docs/guides/15-certificate-management.md


Guide 15: Certificate Management

Automate TLS certificate provisioning with cert-manager and a private CA for all homelab services.


Overview

This guide sets up cert-manager with a self-signed root CA and a CA issuer to automatically provision TLS certificates for every *.homelab.local service. Certificates are auto-renewed, monitored in Grafana, and trusted across all workstations.

Time Required: ~60 minutes
Prerequisites: Guide 07 (GitOps Stack) completed

              Certificate Management Architecture
    ┌──────────────────────────────────────────────────┐
    │                  cert-manager                      │
    │                                                    │
    │  ┌─────────────┐    ┌──────────────────────────┐  │
    │  │ Self-Signed │───>│ CA Issuer                │  │
    │  │ Root CA     │    │ (signed by Root CA)      │  │
    │  └─────────────┘    └────────────┬─────────────┘  │
    │                                  │                 │
    │                    ┌─────────────┼────────────┐    │
    │                    ▼             ▼            ▼    │
    │              ┌─────────┐  ┌─────────┐  ┌───────┐  │
    │              │ Gitea   │  │ ArgoCD  │  │Vault  │  │
    │              │ TLS crt │  │ TLS crt │  │TLS crt│  │
    │              └─────────┘  └─────────┘  └───────┘  │
    │              ┌─────────┐  ┌─────────┐  ┌───────┐  │
    │              │ Grafana │  │Authelia │  │Longhor│  │
    │              │ TLS crt │  │ TLS crt │  │n TLS  │  │
    │              └─────────┘  └─────────┘  └───────┘  │
    └──────────────────────────────────────────────────┘
                       │
                       ▼
    ┌──────────────────────────────────────────────────┐
    │           ingress-nginx (TLS Termination)         │
    │    *.homelab.local → wildcard TLS certificate     │
    └──────────────────────────────────────────────────┘

Phase 1: Verify cert-manager

1.1 Check if cert-manager is Installed

cert-manager is not bundled with kube-prometheus-stack; check whether an earlier guide already deployed it, and install it with Helm if it is missing:

kubectl get pods -n cert-manager
# If not found, install separately:

helm repo add jetstack https://charts.jetstack.io
helm repo update

helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set crds.enabled=true \
  --set replicaCount=1 \
  --set prometheus.enabled=true \
  --set prometheus.servicemonitor.enabled=true \
  --set webhook.timeoutSeconds=10 \
  --wait

1.2 Verify CRDs

kubectl get crd | grep cert-manager
# Expected: certificates.cert-manager.io
#           certificaterequests.cert-manager.io
#           challenges.acme.cert-manager.io
#           clusterissuers.cert-manager.io
#           issuers.cert-manager.io
#           orders.acme.cert-manager.io

Phase 2: Create Root CA

2.1 Generate Root CA Certificate (Optional)

Steps 2.1 and 2.2 generate the root key with openssl so it can be kept and backed up outside the cluster. The ClusterIssuer manifest in 2.3 instead lets cert-manager bootstrap its own self-signed root into the homelab-root-ca-tls secret, which is what homelab-ca-issuer references; if you prefer the openssl-generated CA, point homelab-ca-issuer at the homelab-root-ca secret created in 2.2.

# Generate Root CA key and certificate
openssl genrsa -out homelab-root-ca.key 4096

openssl req -x509 -new -nodes -key homelab-root-ca.key \
  -sha256 -days 3650 \
  -subj "/C=US/ST=Homelab/L=Homelab/O=Homelab CA/CN=Homelab Root CA" \
  -addext "basicConstraints=critical,CA:TRUE" \
  -addext "keyUsage=critical,keyCertSign,cRLSign" \
  -out homelab-root-ca.crt

2.2 Create Root CA Secret

kubectl create namespace security

kubectl create secret tls homelab-root-ca \
  --namespace security \
  --cert=homelab-root-ca.crt \
  --key=homelab-root-ca.key \
  --dry-run=client -o yaml | kubectl apply -f -

2.3 Create Self-Signed ClusterIssuer

Create gitops-apps/infrastructure/cert-manager/cluster-issuer.yaml:

---
# Self-signed issuer to bootstrap the Root CA
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned-issuer
spec:
  selfSigned: {}

---
# Root CA certificate (self-signed, 10-year validity)
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: homelab-root-ca
  namespace: security
spec:
  isCA: true
  commonName: Homelab Root CA
  secretName: homelab-root-ca-tls
  duration: 87600h  # 10 years
  renewBefore: 720h  # Renew 30 days before expiry
  subject:
    organizations:
      - Homelab
    organizationalUnits:
      - Certificate Authority
  dnsNames:
    - homelab-root-ca
  issuerRef:
    name: selfsigned-issuer
    kind: ClusterIssuer
    group: cert-manager.io

---
# CA Issuer backed by the Root CA certificate
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: homelab-ca-issuer
spec:
  ca:
    secretName: homelab-root-ca-tls

2.4 Apply the Issuer

kubectl apply -f gitops-apps/infrastructure/cert-manager/cluster-issuer.yaml

# Wait for CA certificate to be issued
kubectl wait --for=condition=Ready certificate/homelab-root-ca -n security --timeout=60s

# Verify issuers
kubectl get clusterissuer
# Expected: selfsigned-issuer   Ready
#           homelab-ca-issuer   Ready
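
To double-check what cert-manager issued, decode the root CA from its secret (the secret name comes from the Certificate manifest above):

# Inspect the generated root CA certificate
kubectl get secret homelab-root-ca-tls -n security -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -subject -issuer -dates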

Phase 3: Wildcard Certificate

3.1 Issue *.homelab.local Wildcard Certificate

# gitops-apps/infrastructure/cert-manager/wildcard-certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: homelab-wildcard
  namespace: security
spec:
  secretName: homelab-wildcard-tls
  duration: 8760h    # 1 year
  renewBefore: 360h   # Renew 15 days before expiry
  subject:
    organizations:
      - Homelab
  dnsNames:
    - homelab.local
    - "*.homelab.local"
  issuerRef:
    name: homelab-ca-issuer
    kind: ClusterIssuer
    group: cert-manager.io
kubectl apply -f gitops-apps/infrastructure/cert-manager/wildcard-certificate.yaml

# Wait for issuance
kubectl wait --for=condition=Ready certificate/homelab-wildcard -n security --timeout=60s

# Verify
kubectl get certificate -n security
kubectl describe certificate homelab-wildcard -n security
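
A quick sanity check that the wildcard SANs made it into the issued certificate:

# Confirm homelab.local and *.homelab.local appear as SANs
kubectl get secret homelab-wildcard-tls -n security -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -text | grep -A1 "Subject Alternative Name"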

Phase 4: Ingress TLS Configuration

4.1 Configure ingress-nginx to Use Wildcard Certificate

Create a TLS secret in the ingress namespace by copying from security:

# gitops-apps/infrastructure/ingress/tls-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: homelab-wildcard-tls
  namespace: ingress-nginx
  annotations:
    # Requires the mittwald kubernetes-replicator controller (not cert-manager):
    # this empty destination secret pulls its data from the source secret in "security"
    replicator.v1.mittwald.de/replicate-from: "security/homelab-wildcard-tls"
type: kubernetes.io/tls
data: {}

4.2 Replicate the Wildcard Secret

If you would rather not run a replication controller (kubernetes-replicator, kubed), a simple CronJob can sync the wildcard secret from the security namespace into the namespaces that need it:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: sync-tls-secret
  namespace: security
spec:
  schedule: "*/15 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          # The secret-syncer ServiceAccount must exist, with RBAC to read the source
          # secret and apply secrets in the target namespaces
          serviceAccountName: secret-syncer
          containers:
            - name: sync
              image: bitnami/kubectl:latest
              command:
                - /bin/bash
                - -c
                - |
                  NAMESPACES="ingress-nginx argocd services monitoring security logging"
                  for NS in $NAMESPACES; do
                    kubectl get secret homelab-wildcard-tls -n security -o yaml \
                      | sed "s/namespace: security/namespace: $NS/" \
                      | kubectl apply -f -
                  done
          restartPolicy: OnFailure

4.3 Default TLS for Ingress

Patch ingress-nginx to use the wildcard certificate as default:

helm upgrade ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --reuse-values \
  --set controller.extraArgs.default-ssl-certificate="security/homelab-wildcard-tls"

4.4 Example Ingress with TLS

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gitea
  namespace: services
  annotations:
    cert-manager.io/cluster-issuer: homelab-ca-issuer
    nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - git.homelab.local
      secretName: gitea-tls
  rules:
    - host: git.homelab.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: gitea
                port:
                  number: 3000
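
Once DNS points git.homelab.local at the ingress and the root CA has been exported (Phase 7), the chain can be verified end to end (a sketch):

# Should return HTTP headers with no TLS warnings when trusting the homelab CA
curl --cacert homelab-ca.crt -I https://git.homelab.local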

Phase 5: Service Certificates

5.1 Create Certificates for Each Service

Create gitops-apps/infrastructure/cert-manager/service-certificates.yaml:

---
# Gitea
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: gitea-tls
  namespace: services
spec:
  secretName: gitea-tls
  duration: 8760h
  renewBefore: 360h
  dnsNames:
    - git.homelab.local
    - gitea.services.svc.cluster.local
  issuerRef:
    name: homelab-ca-issuer
    kind: ClusterIssuer

---
# ArgoCD
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: argocd-tls
  namespace: argocd
spec:
  secretName: argocd-tls
  duration: 8760h
  renewBefore: 360h
  dnsNames:
    - argocd.homelab.local
    - argocd-server.argocd.svc.cluster.local
  issuerRef:
    name: homelab-ca-issuer
    kind: ClusterIssuer

---
# Vault
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: vault-tls
  namespace: security
spec:
  secretName: vault-tls
  duration: 8760h
  renewBefore: 360h
  dnsNames:
    - vault.homelab.local
    - vault.security.svc.cluster.local
    - vault-internal.security.svc.cluster.local
  issuerRef:
    name: homelab-ca-issuer
    kind: ClusterIssuer

---
# Grafana
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: grafana-tls
  namespace: monitoring
spec:
  secretName: grafana-tls
  duration: 8760h
  renewBefore: 360h
  dnsNames:
    - grafana.homelab.local
    - kube-prometheus-stack-grafana.monitoring.svc.cluster.local
  issuerRef:
    name: homelab-ca-issuer
    kind: ClusterIssuer

---
# Authelia
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: authelia-tls
  namespace: security
spec:
  secretName: authelia-tls
  duration: 8760h
  renewBefore: 360h
  dnsNames:
    - auth.homelab.local
    - authelia.security.svc.cluster.local
  issuerRef:
    name: homelab-ca-issuer
    kind: ClusterIssuer

---
# Longhorn UI
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: longhorn-tls
  namespace: longhorn-system
spec:
  secretName: longhorn-tls
  duration: 8760h
  renewBefore: 360h
  dnsNames:
    - longhorn.homelab.local
    - longhorn-frontend.longhorn-system.svc.cluster.local
  issuerRef:
    name: homelab-ca-issuer
    kind: ClusterIssuer

kubectl apply -f gitops-apps/infrastructure/cert-manager/service-certificates.yaml

# Wait for all certificates
kubectl get certificates -A

Phase 6: Store CA in Vault

6.1 Distribute Root CA via Vault

# Store Root CA cert in Vault for distribution
vault kv put secret/certificates/root-ca \
  certificate="$(kubectl get secret homelab-root-ca-tls -n security -o jsonpath='{.data.ca\.crt}' | base64 -d)"

# Retrieve from any machine
vault kv get -field=certificate secret/certificates/root-ca > homelab-ca.crt

Phase 7: Trust the CA

7.1 macOS Workstation

# Get the Root CA certificate
kubectl get secret homelab-root-ca-tls -n security -o jsonpath='{.data.ca\.crt}' | base64 -d > homelab-ca.crt

# Add to macOS System Keychain (requires admin password)
sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain homelab-ca.crt

# Verify
security find-certificate -c "Homelab Root CA" /Library/Keychains/System.keychain

7.2 Linux VMs (K3s Nodes)

# Via Ansible — add to all K3s nodes
cat > ansible/playbooks/trust-ca.yml <<'EOF'
---
- name: Trust Homelab CA on all nodes
  hosts: k3s_cluster
  become: true
  tasks:
    - name: Copy CA certificate
      ansible.builtin.copy:
        src: homelab-ca.crt
        dest: /usr/local/share/ca-certificates/homelab-ca.crt
        mode: '0644'

    - name: Update CA certificates
      ansible.builtin.command: update-ca-certificates
      register: ca_update
      changed_when: "'added' in ca_update.stdout"
EOF

ansible-playbook -i ansible/inventories/homelab/hosts.yml ansible/playbooks/trust-ca.yml

7.3 Kubernetes Pods (via ConfigMap)

# Create ConfigMap with CA cert for pods that need it
kubectl create configmap homelab-ca \
  --from-file=ca.crt=homelab-ca.crt \
  --namespace=security \
  --dry-run=client -o yaml | kubectl apply -f -
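
Workloads that call other in-cluster services over HTTPS can mount this ConfigMap and point their TLS client at it. A minimal sketch using a throwaway curl pod — the pod name, image, and mount path are illustrative, not part of the stack:

# Hypothetical example: mount the homelab-ca ConfigMap and verify TLS against it
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: ca-trust-demo
  namespace: security
spec:
  containers:
    - name: curl
      image: curlimages/curl:latest
      command: ["sleep", "3600"]
      env:
        - name: CURL_CA_BUNDLE            # curl reads its trust bundle from this path
          value: /etc/ssl/homelab/ca.crt
      volumeMounts:
        - name: homelab-ca
          mountPath: /etc/ssl/homelab
          readOnly: true
  volumes:
    - name: homelab-ca
      configMap:
        name: homelab-ca
EOF

# Should return an HTTP status without certificate errors, then clean up
kubectl exec -n security ca-trust-demo -- curl -s -o /dev/null -w '%{http_code}\n' https://git.homelab.local
kubectl delete pod ca-trust-demo -n security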

Phase 8: Monitoring & Alerts

8.1 Certificate Expiry Alerts

cert-manager exposes Prometheus metrics. Create a PrometheusRule:

# gitops-apps/monitoring/cert-manager-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cert-manager-alerts
  namespace: monitoring
spec:
  groups:
    - name: cert-manager
      rules:
        - alert: CertificateExpirySoon
          expr: certmanager_certificate_expiration_timestamp_seconds - time() < 86400 * 30
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Certificate {{ $labels.name }} expires in less than 30 days"
            description: "Certificate {{ $labels.name }} in namespace {{ $labels.namespace }} expires on {{ $value | humanizeTimestamp }}"

        - alert: CertificateExpiryCritical
          expr: certmanager_certificate_expiration_timestamp_seconds - time() < 86400 * 7
          for: 1h
          labels:
            severity: critical
          annotations:
            summary: "Certificate {{ $labels.name }} expires in less than 7 days"
            description: "Certificate {{ $labels.name }} in namespace {{ $labels.namespace }} expires on {{ $value | humanizeTimestamp }}"

        - alert: CertificateFailedIssuance
          expr: certmanager_certificate_ready_status{condition="False"} == 1
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Certificate {{ $labels.name }} failed to issue"
            description: "Certificate {{ $labels.name }} in namespace {{ $labels.namespace }} is not ready"

8.2 Grafana Dashboard

Import the official cert-manager dashboard (Dashboard ID: 20842) or create a custom one showing:

  • Certificate status (Ready / Not Ready)
  • Time until expiry per certificate
  • Issuance success/failure rate
  • CA issuer health
# Import via Grafana UI: Dashboards → Import → 20842

Phase 9: ArgoCD Application

9.1 GitOps Deployment

Create gitops-apps/infrastructure/cert-manager/application.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cert-manager
  namespace: argocd
spec:
  project: homelab
  source:
    repoURL: https://git.homelab.local/homelab/gitops-apps.git
    targetRevision: main
    path: infrastructure/cert-manager
  destination:
    server: https://kubernetes.default.svc
    namespace: cert-manager
  syncPolicy:
    automated:
      selfHeal: true
      prune: false
    syncOptions:
      - ServerSideApply=true
      - CreateNamespace=true

Verification

# Check cert-manager is running
kubectl get pods -n cert-manager

# Check ClusterIssuers
kubectl get clusterissuer
# Expected: selfsigned-issuer   Ready, homelab-ca-issuer   Ready

# Check all certificates
kubectl get certificates -A
# All should show Ready: True

# Check certificate details
kubectl describe certificate homelab-wildcard -n security

# Verify TLS on a service
curl -v --cacert homelab-ca.crt https://git.homelab.local
# Expected: TLS handshake succeeds, certificate verified

# Check certificate expiry
kubectl get certificate -A -o custom-columns=NAME:.metadata.name,NS:.metadata.namespace,NOT_AFTER:.status.notAfter,READY:.status.conditions[0].status

# Verify Prometheus metrics
kubectl exec -n monitoring sts/prometheus-kube-prometheus-stack-prometheus -c prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/query?query=certmanager_certificate_expiration_timestamp_seconds' | jq .

Troubleshooting

Certificate stuck in Pending

kubectl describe certificate <name> -n <namespace>
# Check Events section for issuance errors
# Common: ClusterIssuer not Ready, CA secret missing

ClusterIssuer not Ready

kubectl describe clusterissuer homelab-ca-issuer
# Check: secret homelab-root-ca-tls exists in security namespace
kubectl get secret homelab-root-ca-tls -n security

CA not trusted in browser

# Re-import the CA certificate
# macOS: Keychain Access → System → Certificates → import homelab-ca.crt
# Set trust to "Always Trust"
# Restart browser after import

Certificate auto-renewal not working

# Check cert-manager logs
kubectl logs -n cert-manager -l app.kubernetes.io/name=cert-manager --tail=50
# Verify renewBefore is set (default is 2/3 of duration)

Completion Checklist

  • cert-manager installed with CRDs
  • Self-signed Root CA generated (10-year validity)
  • CA ClusterIssuer (homelab-ca-issuer) created and Ready
  • Wildcard certificate *.homelab.local issued
  • ingress-nginx configured with default TLS certificate
  • TLS certificates issued for: Gitea, ArgoCD, Vault, Grafana, Authelia, Longhorn
  • Root CA stored in Vault (secret/certificates/root-ca)
  • Root CA trusted on macOS workstation
  • Root CA trusted on K3s VMs (via Ansible)
  • Certificate expiry alerts configured (30-day warning, 7-day critical)
  • cert-manager Grafana dashboard imported
  • Secret sync CronJob replicating TLS to all namespaces
  • ArgoCD Application for cert-manager deployed
  • All services accessible via HTTPS with valid certificates

Source: docs/guides/16-backup-disaster-recovery.md


Guide 16: Backup & Disaster Recovery

Implement Velero for Kubernetes backups, Proxmox Backup Server for VM-level snapshots, and documented disaster recovery procedures.


Overview

This guide implements a three-tier backup strategy: Kubernetes resources and volumes via Velero, VM-level snapshots via Proxmox Backup Server, and application-specific backups for Gitea and Vault. Includes tested restore procedures and RPO/RTO targets.

Time Required: ~90 minutes
Prerequisites: Guide 06 (Longhorn Storage) completed

              Backup & Recovery Architecture
    ┌─────────────────────────────────────────────────┐
    │                Backup Sources                     │
    │                                                  │
    │  ┌─────────┐  ┌─────────┐  ┌─────────────────┐  │
    │  │ Velero  │  │  PBS    │  │ App-Specific    │  │
    │  │ (K8s)   │  │ (VMs)   │  │ Vault/Gitea     │  │
    │  └────┬────┘  └────┬────┘  └───────┬─────────┘  │
    └───────┼────────────┼───────────────┼─────────────┘
            │            │               │
            ▼            ▼               ▼
    ┌─────────────────────────────────────────────────┐
    │              Storage Targets                      │
    │                                                  │
    │  ┌──────────────┐  ┌────────────────────────┐   │
    │  │ Longhorn NFS │  │ Proxmox Backup Server  │   │
    │  │ / S3 Bucket  │  │ (deduplicated, Zstd)   │   │
    │  └──────────────┘  └────────────────────────┘   │
    └─────────────────────────────────────────────────┘

RPO/RTO Targets

| Component | RPO (max data loss) | RTO (max downtime) | Backup Frequency |
| --- | --- | --- | --- |
| K3s cluster resources | 1 hour | 30 minutes | Hourly |
| Monitoring data (Prometheus/Loki) | 24 hours | 2 hours | Daily |
| Gitea (repos + DB) | 6 hours | 1 hour | Every 6 hours |
| Vault (secrets) | 1 hour | 15 minutes | Hourly |
| Proxmox VMs (full) | 24 hours | 1 hour | Daily |
| Longhorn volumes | 24 hours | 30 minutes | Daily |

Phase 1: Deploy Velero

1.1 Prepare Backup Storage

Create an NFS share on pve-vader for Velero backups:

# On pve-vader
mkdir -p /mnt/data/velero-backups
# Export via NFS (add to /etc/exports)
echo "/mnt/data/velero-backups 10.10.10.0/24(rw,sync,no_subtree_check,no_root_squash)" >> /etc/exports
exportfs -arv

Or use Longhorn to provision a PVC for backups.

1.2 Install Velero

helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm repo update

# Create Velero namespace
kubectl create namespace velero
kubectl label namespace velero environment=infrastructure

# Install Velero with NFS storage
cat > velero-values.yaml <<'EOF'
configuration:
  backupStorageLocation:
    - name: default
      provider: aws
      bucket: velero
      default: true
      config:
        region: minio
        s3ForcePathStyle: "true"
        s3Url: http://minio.services.svc.cluster.local:9000
        publicUrl: http://minio.services.svc.cluster.local:9000
      credential:
        name: velero-cloud-credentials
        namespace: velero
  volumeSnapshotLocation:
    - name: default
      provider: aws
      config:
        region: minio

# Use local storage (NFS via restic/fs-backup)
# Alternative: use filesystem-based backup
initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.10.0
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - mountPath: /target
        name: plugins

deployNodeAgent: true

metrics:
  enabled: true
  serviceMonitor:
    enabled: true

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi

# For local/NFS backup without S3, use restic
# Or configure with local provider
EOF
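
The backupStorageLocation above points at an S3-style (MinIO) endpoint and references a velero-cloud-credentials secret, which must exist before the location can become Available. A minimal sketch, assuming MinIO-style access keys (the placeholder values and the conventional "cloud" key name are assumptions to adapt):

# Credentials file in the AWS format expected by velero-plugin-for-aws
cat > credentials-velero <<'EOF'
[default]
aws_access_key_id = <minio-access-key>
aws_secret_access_key = <minio-secret-key>
EOF

# Secret name must match what the values file references
kubectl create secret generic velero-cloud-credentials \
  --namespace velero \
  --from-file=cloud=credentials-velero \
  --dry-run=client -o yaml | kubectl apply -f -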

For a simpler setup using local filesystem backup:

# Install with restic for volume backup
helm install velero vmware-tanzu/velero \
  --namespace velero \
  --set configuration.backupStorageLocation[0].name=default \
  --set configuration.backupStorageLocation[0].provider=aws \
  --set configuration.backupStorageLocation[0].bucket=velero \
  --set configuration.backupStorageLocation[0].default=true \
  --set configuration.volumeSnapshotLocation[0].name=default \
  --set configuration.volumeSnapshotLocation[0].provider=aws \
  --set deployNodeAgent=true \
  --set metrics.enabled=true \
  --set metrics.serviceMonitor.enabled=true \
  --set snapshotsEnabled=true \
  --wait

1.3 Verify Velero

kubectl get pods -n velero
# Expected: velero Running, node-agent DaemonSet Running on all nodes

velero version
velero backup-location get
# Expected: default   Available

Phase 2: Backup Schedules

2.1 Create Backup Schedules

# Hourly: Critical namespaces (Vault, ArgoCD config)
velero schedule create critical-hourly \
  --include-namespaces security,argocd \
  --include-cluster-resources=true \
  --schedule="0 * * * *" \
  --ttl=72h \
  --snapshot-volumes=true

# Daily: Full cluster backup
velero schedule create full-daily \
  --include-namespaces '*' \
  --exclude-namespaces velero,kube-system \
  --include-cluster-resources=true \
  --schedule="15 2 * * *" \
  --ttl=168h \
  --snapshot-volumes=true

# Weekly: Long-term retention
velero schedule create weekly-archive \
  --include-namespaces '*' \
  --exclude-namespaces velero,kube-system \
  --include-cluster-resources=true \
  --schedule="0 3 * * 0" \
  --ttl=720h \
  --snapshot-volumes=true

2.2 Verify Schedules

velero schedule get
# Expected:
# critical-hourly   0 * * * *        72h
# full-daily        15 2 * * *       168h
# weekly-archive    0 3 * * 0        720h

2.3 Create a Manual Backup

# Test with a manual backup
velero backup create test-backup \
  --include-namespaces monitoring \
  --snapshot-volumes=true \
  --wait

# Check status
velero backup describe test-backup --details
velero backup logs test-backup

Phase 3: Restore Procedures

3.1 Full Cluster Restore

Use this after total cluster loss (all nodes down):

# Step 1: Rebuild K3s cluster (Guide 05)
# Step 2: Install Velero (this guide, Phase 1)
# Step 3: Restore from backup

# List available backups
velero backup get

# Restore full cluster from latest daily backup
velero restore create full-restore \
  --from-backup full-daily-<TIMESTAMP> \
  --wait

# Monitor restore progress
velero restore describe full-restore --details

3.2 Namespace-Level Restore

Use this for accidental deletion of a namespace:

# List backups containing the namespace
velero backup get -o wide

# Restore specific namespace
velero restore create restore-monitoring \
  --from-backup full-daily-<TIMESTAMP> \
  --include-namespaces monitoring \
  --wait

# Verify restored resources
velero restore describe restore-monitoring --details

3.3 Single Resource Restore

# Extract specific resource from backup
velero restore create restore-single-deployment \
  --from-backup full-daily-<TIMESTAMP> \
  --include-namespaces services \
  --include-resources deployments \
  --selector app=gitea \
  --wait

Phase 4: Proxmox Backup Server

4.1 Deploy PBS as VM on pve-vader

# Download Proxmox Backup Server ISO
# Create VM via Terraform or Proxmox UI:
#   - 2 CPU, 4GB RAM
#   - 100GB disk (use NVMe storage)
#   - Attached to vnet-homelab (10.10.10.0/24)
#   - IP: 10.10.10.5

# Add to Terraform if desired:
# terraform/environments/homelab/main.tf
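
If you prefer the CLI over the UI or Terraform, the same VM can be created with qm directly on pve-vader. A sketch only — the VMID, storage name, bridge, and ISO filename below are assumptions to adapt:

# On pve-vader — create and start the PBS VM (all values illustrative)
qm create 105 --name pbs --memory 4096 --cores 2 \
  --net0 virtio,bridge=vnet-homelab \
  --scsihw virtio-scsi-pci --scsi0 local-lvm:100 \
  --cdrom local:iso/proxmox-backup-server_<version>.iso \
  --ostype l26 --onboot 1
qm start 105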

4.2 Configure PBS

# After PBS installation, access web UI at https://10.10.10.5:8007

# Create datastore for backups
# UI: Datastore → Add → name: "homelab-backups", path: /mnt/data/homelab-backups

# Create backup user
# UI: Configuration → Access Control → Add user: backup@pbs
# Assign DatastoreBackup role on homelab-backups datastore
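
The same datastore and user can be created from the PBS shell instead of the UI. A sketch with proxmox-backup-manager (set the password via the UI if your PBS version's CLI flags differ):

# On the PBS VM — CLI equivalents of the UI steps above
proxmox-backup-manager datastore create homelab-backups /mnt/data/homelab-backups
proxmox-backup-manager user create backup@pbs --password '<pbs-password>'
proxmox-backup-manager acl update /datastore/homelab-backups DatastoreBackup --auth-id backup@pbs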

4.3 Configure Proxmox VE to Backup to PBS

On each Proxmox node:

# Add PBS as backup storage
# On pve-vader:
cat >> /etc/pve/storage.cfg <<'EOF'
pbs: pbs-backup
    server 10.10.10.5
    datastore homelab-backups
    username backup@pbs
    password <pbs-password>
    fingerprint <pbs-fingerprint>
    content backup
EOF

# Add to all nodes (or via Proxmox cluster config sync)
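
The <pbs-fingerprint> placeholder is the SHA-256 fingerprint of the PBS certificate; it can be read on the PBS host itself:

# On the PBS VM — print the certificate fingerprint to paste into storage.cfg
proxmox-backup-manager cert info | grep -i fingerprint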

4.4 Create VM Backup Jobs

# On any Proxmox node. vzdump runs a one-off backup; recurring schedules are
# configured under Datacenter → Backup (or /etc/pve/jobs.cfg).

# Daily 02:00 — critical VMs on Vader (pfSense 100, K3s master 200)
vzdump 100,200 --mode snapshot --storage pbs-backup \
  --compress zstd --mailto root --mailnotification failure

# Daily 03:00 — worker VM on Sidious (201)
vzdump 201 --mode snapshot --storage pbs-backup --compress zstd

# Weekly Sun 04:00 — Maul (hack box, 800)
vzdump 800 --mode snapshot --storage pbs-backup --compress zstd

# Verify backups
proxmox-backup-client snapshot list --repository backup@pbs@10.10.10.5:homelab-backups

4.5 PBS Pruning Schedule

# In PBS UI: Datastore → homelab-backups → Prune Options
# Keep:
#   - Last 7 daily backups
#   - Last 4 weekly backups
#   - Last 3 monthly backups
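
The same retention can be exercised from the command line against a single backup group; a sketch (the group name vm/100 is an example):

# Dry-run prune for one backup group using the retention above
proxmox-backup-client prune vm/100 \
  --repository backup@pbs@10.10.10.5:homelab-backups \
  --keep-daily 7 --keep-weekly 4 --keep-monthly 3 --dry-run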

Phase 5: Application-Specific Backups

5.1 Vault Backup

# Vault uses Raft storage — snapshot the Raft data
kubectl port-forward -n security svc/vault 8200:8200 &
export VAULT_ADDR="http://127.0.0.1:8200"
export VAULT_TOKEN="<root-or-backup-token>"   # raft snapshot save requires an authenticated token

# Create Raft snapshot
vault operator raft snapshot save vault-snapshot-$(date +%Y%m%d).snap

# Automated via CronJob:
cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: CronJob
metadata:
  name: vault-backup
  namespace: security
spec:
  schedule: "0 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: vault-backup
              image: hashicorp/vault:latest
              command:
                - /bin/sh
                - -c
                - |
                  export VAULT_ADDR="http://vault.security.svc.cluster.local:8200"
                  export VAULT_TOKEN="$(cat /vault/token)"
                  vault operator raft snapshot save /backups/vault-$(date +%Y%m%d-%H%M).snap
              volumeMounts:
                - name: token
                  mountPath: /vault
                - name: backups
                  mountPath: /backups
          volumes:
            - name: token
              secret:
                secretName: vault-root-token
            - name: backups
              persistentVolumeClaim:
                claimName: vault-backups-pvc
          restartPolicy: OnFailure
EOF

Create the PVC:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vault-backups-pvc
  namespace: security
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn-default
  resources:
    requests:
      storage: 5Gi

5.2 Gitea Backup

# Gitea dump (repos + database + config)
cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: CronJob
metadata:
  name: gitea-backup
  namespace: services
spec:
  schedule: "0 */6 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: gitea-backup
              image: gitea/gitea:latest
              command:
                - /bin/sh
                - -c
                - |
                  /usr/local/bin/gitea dump \
                    -c /data/gitea/conf/app.ini \
                    --file /backups/gitea-dump-$(date +%Y%m%d-%H%M).zip
              volumeMounts:
                - name: gitea-data
                  mountPath: /data
                - name: backups
                  mountPath: /backups
          volumes:
            - name: gitea-data
              persistentVolumeClaim:
                claimName: gitea-data
            - name: backups
              persistentVolumeClaim:
                claimName: gitea-backups-pvc
          restartPolicy: OnFailure
EOF
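
The CronJob above writes into a gitea-backups-pvc claim that is not created elsewhere in this guide; a minimal sketch mirroring the Vault backup PVC (the 10Gi size is an assumption):

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gitea-backups-pvc
  namespace: services
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn-default
  resources:
    requests:
      storage: 10Gi
EOF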

5.3 Longhorn Recurring Backups

# Configure Longhorn recurring backups via UI or kubectl
# Settings → General → Default Backup Store
# Set backup target: nfs://10.10.10.5:/mnt/data/longhorn-backups

# Create recurring backup job for critical volumes
cat <<'EOF' | kubectl apply -f -
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: daily-backup
  namespace: longhorn-system
spec:
  cron: "0 3 * * *"
  task: backup
  groups:
    - default
  retain: 7
  concurrency: 2
EOF
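
The backup target itself can also be set without the UI by patching Longhorn's Setting resource (a sketch; the NFS path must match the export you actually created):

# Point Longhorn at the NFS backup target and confirm
kubectl -n longhorn-system patch settings.longhorn.io backup-target \
  --type=merge -p '{"value": "nfs://10.10.10.5:/mnt/data/longhorn-backups"}'
kubectl -n longhorn-system get settings.longhorn.io backup-target -o jsonpath='{.value}'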

Phase 6: Disaster Recovery Runbooks

6.1 Total Loss (All 3 Nodes Down)

## DR-001: Total Cluster Loss

### Trigger
- All 3 Proxmox nodes powered off or hardware failure
- Complete data center loss (e.g., power outage + UPS failure)

### Steps
1. Power on pve-vader first, then pve-sidious
2. Wait for Proxmox cluster quorum (2/3 nodes)
3. Restore VMs from PBS:
   - Restore pfSense VM (ID 100) from latest backup
   - Restore K3s master VM (ID 200) from latest backup
   - Restore K3s worker VM (ID 201) from latest backup
4. Wait for K3s cluster to stabilize
5. Install Velero on new cluster
6. Point Velero to backup storage location
7. Restore from latest full-daily backup:
   velero restore create full-restore --from-backup full-daily-<LATEST>
8. Verify all namespaces and workloads
9. Restore Vault from Raft snapshot if needed
10. Check all services: Gitea, ArgoCD, Grafana, Authelia

### RTO Target: 2-4 hours

6.2 Single Node Loss

## DR-002: Single Node Failure (pve-vader or pve-sidious)

### Trigger
- One Proxmox node becomes unresponsive
- Hardware failure on a single node

### Steps (if pve-sidious fails):
1. K3s worker pods reschedule to master (if resources allow)
2. Longhorn replicas on failed node become degraded
3. Replace/repair hardware
4. Reboot node — Proxmox rejoins cluster automatically
5. VMs restart with `onboot=yes`
6. Longhorn rebuilds replicas from healthy copies

### Steps (if pve-vader fails):
1. K3s master is down — cluster is read-only
2. Longhorn 2/3 replicas remain (on sidious + virtual disks)
3. Repair/reboot vader
4. K3s master resumes — pods reschedule
5. Longhorn rebuilds vader replicas

### RTO Target: 30-60 minutes (reboot) / 2-4 hours (hardware)

6.3 Accidental Namespace Deletion

## DR-003: Namespace Deletion

### Trigger
- kubectl delete namespace <name> run accidentally
- ArgoCD sync removes resources incorrectly

### Steps
1. Identify the deleted namespace and timestamp
2. List available backups:
   velero backup get
3. Restore from most recent backup:
   velero restore create restore-<namespace> \
     --from-backup <latest-backup> \
     --include-namespaces <namespace> \
     --wait
4. Verify restored resources:
   kubectl get all -n <namespace>
5. Check ArgoCD sync status

### RTO Target: 15-30 minutes

Phase 7: Backup Monitoring

7.1 Velero Metrics Dashboard

Velero exposes Prometheus metrics. Import the Velero dashboard (Dashboard ID: 16871) in Grafana.

7.2 Backup Alerts

# gitops-apps/monitoring/backup-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: backup-alerts
  namespace: monitoring
spec:
  groups:
    - name: backup
      rules:
        - alert: VeleroBackupFailed
          expr: velero_backup_attempt_total - velero_backup_success_total > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Velero backup {{ $labels.schedule }} failed"
            description: "Backup {{ $labels.backup_name }} has failed. Check Velero logs."

        - alert: VeleroBackupStale
          expr: time() - velero_backup_last_successful_timestamp_seconds > 86400 * 2
          for: 1h
          labels:
            severity: critical
          annotations:
            summary: "No successful Velero backup in 48 hours"
            description: "Schedule {{ $labels.schedule }} has not had a successful backup in over 48 hours."

        - alert: ProxmoxBackupFailed
          expr: increase(pbs_backup_failed_total[24h]) > 0
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Proxmox backup job failed"

7.3 Monthly Restore Drill

#!/bin/bash
# scripts/monthly-restore-drill.sh — test the restore procedure monthly

echo "=== Monthly Restore Drill: $(date) ==="

# 1. Create a test namespace with resources
kubectl create namespace drill-test
kubectl run nginx --image=nginx -n drill-test
kubectl expose pod nginx --port=80 -n drill-test

# 2. Wait for backup to capture it
echo "Waiting for next scheduled backup..."
sleep 3600

# 3. Delete the namespace
kubectl delete namespace drill-test

# 4. Restore from Velero
# Pick the most recent Velero backup by creation time
LATEST=$(kubectl get backups.velero.io -n velero --sort-by=.metadata.creationTimestamp \
  -o jsonpath='{.items[*].metadata.name}' | awk '{print $NF}')
velero restore create drill-restore --from-backup "$LATEST" --include-namespaces drill-test --wait

# 5. Verify
kubectl get pods -n drill-test
kubectl get svc -n drill-test

# 6. Clean up
kubectl delete namespace drill-test

echo "=== Drill Complete ==="

Phase 8: ArgoCD Application

# gitops-apps/infrastructure/velero/application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: velero
  namespace: argocd
spec:
  project: homelab
  source:
    repoURL: https://git.homelab.local/homelab/gitops-apps.git
    targetRevision: main
    path: infrastructure/velero
  destination:
    server: https://kubernetes.default.svc
    namespace: velero
  syncPolicy:
    automated:
      selfHeal: true
      prune: false
    syncOptions:
      - ServerSideApply=true
      - CreateNamespace=true

Verification

# Velero status
velero version
velero backup-location get
velero schedule get

# Run a manual backup
velero backup create verification-backup \
  --include-namespaces security \
  --snapshot-volumes=true \
  --wait

# Check backup
velero backup describe verification-backup --details

# Test restore
velero restore create verification-restore \
  --from-backup verification-backup \
  --include-namespaces security \
  --wait

velero restore describe verification-restore

# PBS status (from Proxmox node)
proxmox-backup-client snapshot list --repository backup@pbs@10.10.10.5:homelab-backups

# Vault backup
vault operator raft snapshot save /tmp/test-snapshot.snap
ls -la /tmp/test-snapshot.snap

Troubleshooting

Velero backup fails

velero backup describe <backup-name> --details
velero backup logs <backup-name>
# Common: storage location unreachable, credentials invalid

Restic volume backup stuck

kubectl get pods -n velero -l name=node-agent
kubectl logs -n velero -l name=node-agent --tail=50
# Check: node-agent DaemonSet is running on all nodes

PBS connection refused

# Verify PBS is running
curl -k https://10.10.10.5:8007/api2/json/status
# Check PBS service: systemctl status proxmox-backup-server

Vault restore fails

# Ensure Vault is unsealed before restore
vault status
# Restore: vault operator raft snapshot restore <snapshot-file>

Completion Checklist

  • Velero installed in velero namespace
  • Backup storage location configured and Available
  • Node agent (restic) DaemonSet running on all nodes
  • Hourly backup schedule for critical namespaces (Vault, ArgoCD)
  • Daily full cluster backup schedule
  • Weekly long-term retention schedule
  • Manual backup and restore tested successfully
  • Proxmox Backup Server deployed (VM on vader, 10.10.10.5)
  • PBS datastore configured with pruning policy
  • Proxmox VM backup jobs scheduled (daily critical, weekly hack box)
  • Vault backup CronJob running hourly (Raft snapshots)
  • Gitea backup CronJob running every 6 hours
  • Longhorn recurring backup job configured
  • DR runbooks written: total loss, single node, namespace deletion
  • Backup alerts configured in Prometheus
  • Velero Grafana dashboard imported
  • Monthly restore drill script created
  • RPO/RTO targets documented
  • ArgoCD Application for Velero deployed

Source: docs/guides/17-compliance-hardening.md


Guide 17: Compliance & Hardening

Run CIS benchmarks with kube-bench, harden K3s nodes, and automate compliance scanning with OpenSCAP and Kyverno policies.


Overview

This guide hardens every layer of the homelab stack: Proxmox hosts, K3s nodes, and Kubernetes workloads. Uses CIS benchmarks as the compliance standard, with automated scanning and remediation through Ansible, kube-bench, and Kyverno.

Time Required: ~90 minutes
Prerequisites: Guide 05 (K3s Cluster), Guide 08 (Security Tooling) completed

              Compliance & Hardening Stack
    ┌─────────────────────────────────────────────────┐
    │              Compliance Layers                    │
    │                                                  │
    │  Layer 1: Proxmox Host Hardening                 │
    │    SSH · fail2ban · auditd · sysctl · firewall   │
    │                                                  │
    │  Layer 2: K3s Node Hardening                     │
    │    CIS Benchmark · kube-bench · kernel params    │
    │                                                  │
    │  Layer 3: Kubernetes Workload Policies            │
    │    Kyverno · Pod Security · RBAC · NetworkPolicy │
    │                                                  │
    │  Layer 4: Compliance Reporting                    │
    │    OpenSCAP · kube-bench → Grafana dashboards    │
    └─────────────────────────────────────────────────┘

Phase 1: Proxmox Host Hardening

1.1 SSH Hardening

Create ansible/playbooks/harden-hosts.yml:

---
- name: Harden Proxmox Hosts
  hosts: proxmox
  become: true
  tasks:
    - name: Configure SSH daemon
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: "{{ item.regexp }}"
        line: "{{ item.line }}"
        validate: "sshd -t -f %s"
      loop:
        - { regexp: "^#?PermitRootLogin", line: "PermitRootLogin prohibit-password" }
        - { regexp: "^#?PasswordAuthentication", line: "PasswordAuthentication no" }
        - { regexp: "^#?PubkeyAuthentication", line: "PubkeyAuthentication yes" }
        - { regexp: "^#?X11Forwarding", line: "X11Forwarding no" }
        - { regexp: "^#?MaxAuthTries", line: "MaxAuthTries 3" }
        - { regexp: "^#?ClientAliveInterval", line: "ClientAliveInterval 300" }
        - { regexp: "^#?ClientAliveCountMax", line: "ClientAliveCountMax 2" }
        - { regexp: "^#?LoginGraceTime", line: "LoginGraceTime 30" }
        - { regexp: "^#?PermitEmptyPasswords", line: "PermitEmptyPasswords no" }
        - { regexp: "^#?AllowAgentForwarding", line: "AllowAgentForwarding no" }
        - { regexp: "^#?AllowTcpForwarding", line: "AllowTcpForwarding no" }
      notify: Restart SSHD

    - name: Configure SSH ciphers
      ansible.builtin.blockinfile:
        path: /etc/ssh/sshd_config
        block: |
          Ciphers chacha20-poly1305@openssh.com,aes256-gcm@openssh.com,aes128-gcm@openssh.com
          MACs hmac-sha2-512-etm@openssh.com,hmac-sha2-256-etm@openssh.com
          KexAlgorithms curve25519-sha256,curve25519-sha256@libssh.org
      notify: Restart SSHD

  handlers:
    - name: Restart SSHD
      ansible.builtin.systemd:
        name: sshd
        state: restarted

1.2 Install and Configure fail2ban

Add to ansible/playbooks/harden-hosts.yml:

    - name: Install fail2ban
      ansible.builtin.apt:
        name: fail2ban
        state: present
        update_cache: true

    - name: Configure fail2ban for SSH
      ansible.builtin.copy:
        dest: /etc/fail2ban/jail.local
        content: |
          [DEFAULT]
          bantime = 3600
          findtime = 600
          maxretry = 3
          backend = systemd

          [sshd]
          enabled = true
          port = ssh
          filter = sshd
          logpath = /var/log/auth.log
          maxretry = 3
          bantime = 3600
        mode: '0644'
      notify: Restart fail2ban

  handlers:
    - name: Restart fail2ban
      ansible.builtin.systemd:
        name: fail2ban
        state: restarted

1.3 Kernel Hardening (sysctl)

    - name: Kernel hardening via sysctl
      ansible.posix.sysctl:
        name: "{{ item.key }}"
        value: "{{ item.value }}"
        sysctl_set: true
        state: present
        reload: true
      loop:
        - { key: "net.ipv4.ip_forward", value: "1" }           # Required for Proxmox SDN
        - { key: "net.ipv4.conf.all.send_redirects", value: "0" }
        - { key: "net.ipv4.conf.default.send_redirects", value: "0" }
        - { key: "net.ipv4.conf.all.accept_redirects", value: "0" }
        - { key: "net.ipv4.conf.default.accept_redirects", value: "0" }
        - { key: "net.ipv4.conf.all.accept_source_route", value: "0" }
        - { key: "net.ipv4.conf.default.accept_source_route", value: "0" }
        - { key: "net.ipv6.conf.all.accept_redirects", value: "0" }
        - { key: "kernel.dmesg_restrict", value: "1" }
        - { key: "kernel.kptr_restrict", value: "2" }
        - { key: "kernel.unprivileged_bpf_disabled", value: "1" }
        - { key: "fs.suid_dumpable", value: "0" }
        - { key: "net.core.bpf_jit_harden", value: "2" }

1.4 Configure auditd

    - name: Install auditd
      ansible.builtin.apt:
        name:
          - auditd
          - audispd-plugins
        state: present

    - name: Configure auditd rules
      ansible.builtin.copy:
        dest: /etc/audit/rules.d/homelab.rules
        content: |
          # Monitor privileged commands
          -a always,exit -F arch=b64 -S execve -F euid=0 -F auid>=1000 -F auid!=4294967295 -k privileged
          # Monitor SSH config changes
          -w /etc/ssh/sshd_config -p wa -k ssh_config
          # Monitor sudoers changes
          -w /etc/sudoers -p wa -k sudoers
          # Monitor Proxmox config changes
          -w /etc/pve/ -p wa -k proxmox_config
          # Monitor login events
          -w /var/log/auth.log -p wa -k logins
          # Monitor cron jobs
          -w /etc/cron* -p wa -k cron
          # Monitor system time changes
          -a exit,always -F arch=b64 -S clock_settime -k time_change
        mode: '0600'
      notify: Restart auditd

  handlers:
    - name: Restart auditd
      ansible.builtin.command: augenrules --load
      changed_when: true

1.5 Apply the Hardening Playbook

ansible-playbook -i ansible/inventories/homelab/hosts.yml ansible/playbooks/harden-hosts.yml

Phase 2: K3s Node Hardening

2.1 Create K3s CIS Config

Create ansible/inventories/homelab/group_vars/k3s.yml additions:

# CIS Benchmark compliance settings
k3s_server_args:
  - "--kube-apiserver-arg=enable-admission-plugins=NodeRestriction,PodSecurityPolicy"
  - "--kube-apiserver-arg=audit-log-path=/var/log/k3s/audit.log"
  - "--kube-apiserver-arg=audit-log-maxage=30"
  - "--kube-apiserver-arg=audit-log-maxbackup=10"
  - "--kube-apiserver-arg=audit-log-maxsize=100"
  - "--kube-apiserver-arg=authorization-mode=Node,RBAC"
  - "--kube-apiserver-arg=profiling=false"
  - "--kube-controller-manager-arg=profiling=false"
  - "--kube-scheduler-arg=profiling=false"
  - "--kubelet-arg=protect-kernel-defaults=true"
  - "--kubelet-arg=rotate-server-certificates=true"
  - "--write-kubeconfig-mode=644"

2.2 Apply K3s Hardening via Ansible

Create ansible/playbooks/harden-k3s.yml:

---
- name: Harden K3s Nodes
  hosts: k3s_cluster
  become: true
  tasks:
    # Kernel parameters required by CIS
    - name: CIS-required kernel parameters
      ansible.posix.sysctl:
        name: "{{ item.key }}"
        value: "{{ item.value }}"
        sysctl_set: true
        state: present
      loop:
        - { key: "vm.overcommit_memory", value: "1" }
        - { key: "vm.panic_on_oom", value: "0" }
        - { key: "kernel.panic", value: "10" }
        - { key: "kernel.panic_on_oops", value: "1" }
        - { key: "net.ipv4.tcp_max_syn_backlog", value: "12800" }

    # Create audit log directory
    - name: Create K3s audit log directory
      ansible.builtin.file:
        path: /var/log/k3s
        state: directory
        mode: '0700'

    # etcd data directory permissions
    - name: Secure etcd data directory
      ansible.builtin.file:
        path: /var/lib/rancher/k3s/server/db/etcd
        mode: '0700'
      when: "'k3s_master' in group_names"

    # K3s config file permissions
    - name: Secure K3s config
      ansible.builtin.file:
        path: /etc/rancher/k3s/config.yaml
        mode: '0600'
      ignore_errors: true

ansible-playbook -i ansible/inventories/homelab/hosts.yml ansible/playbooks/harden-k3s.yml

Phase 3: kube-bench — CIS Kubernetes Benchmark

3.1 Run kube-bench

# Run as a Job in the cluster
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: kube-bench
  namespace: monitoring
spec:
  template:
    spec:
      hostPID: true
      containers:
        - name: kube-bench
          image: docker.io/aquasec/kube-bench:latest
          command: ["kube-bench", "run", "--targets", "master,node", "--benchmark", "k3s-cis-1.8"]
          volumeMounts:
            - name: var-lib-etcd
              mountPath: /var/lib/etcd
            - name: var-lib-kubelet
              mountPath: /var/lib/kubelet
            - name: etc-systemd
              mountPath: /etc/systemd
            - name: etc-kubernetes
              mountPath: /etc/kubernetes
            - name: usr-bin
              mountPath: /usr/local/bin
      volumes:
        - name: var-lib-etcd
          hostPath:
            path: /var/lib/rancher/k3s/server/db/etcd
        - name: var-lib-kubelet
          hostPath:
            path: /var/lib/kubelet
        - name: etc-systemd
          hostPath:
            path: /etc/systemd
        - name: etc-kubernetes
          hostPath:
            path: /etc/rancher/k3s
        - name: usr-bin
          hostPath:
            path: /usr/local/bin
      restartPolicy: Never
      nodeSelector:
        node-role.kubernetes.io/control-plane: "true"
EOF

# Wait and get results
kubectl wait --for=condition=complete job/kube-bench -n monitoring --timeout=120s
kubectl logs job/kube-bench -n monitoring

3.2 Schedule Periodic Scans

apiVersion: batch/v1
kind: CronJob
metadata:
  name: kube-bench-weekly
  namespace: monitoring
spec:
  schedule: "0 3 * * 1"  # Weekly Monday 3 AM
  jobTemplate:
    spec:
      template:
        spec:
          hostPID: true
          containers:
            - name: kube-bench
              image: docker.io/aquasec/kube-bench:latest
              command:
                - /bin/sh
                - -c
                - |
                  kube-bench run --targets master,node --benchmark k3s-cis-1.8 \
                    --json > /results/kube-bench-$(date +%Y%m%d).json
              volumeMounts:
                - name: results
                  mountPath: /results
                - name: var-lib-etcd
                  mountPath: /var/lib/etcd
                - name: var-lib-kubelet
                  mountPath: /var/lib/kubelet
                - name: etc-systemd
                  mountPath: /etc/systemd
                - name: etc-kubernetes
                  mountPath: /etc/kubernetes
          volumes:
            - name: results
              persistentVolumeClaim:
                claimName: kube-bench-results
            - name: var-lib-etcd
              hostPath:
                path: /var/lib/rancher/k3s/server/db/etcd
            - name: var-lib-kubelet
              hostPath:
                path: /var/lib/kubelet
            - name: etc-systemd
              hostPath:
                path: /etc/systemd
            - name: etc-kubernetes
              hostPath:
                path: /etc/rancher/k3s
          restartPolicy: Never
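
The CronJob stores its JSON output on a kube-bench-results claim that is not defined elsewhere in this guide; a minimal sketch (storage class and 1Gi size are assumptions):

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kube-bench-results
  namespace: monitoring
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn-default
  resources:
    requests:
      storage: 1Gi
EOF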

3.3 kube-bench Grafana Dashboard

Parse kube-bench JSON results and push metrics to Prometheus Pushgateway:

# Script to push kube-bench results to Prometheus
# (assumes the job was run with --json so the log output is parseable by jq)
RESULTS=$(kubectl logs job/kube-bench -n monitoring --tail=-1)

# Extract pass/fail counts
PASS=$(echo "$RESULTS" | jq '[.Controls[].tests[].results[] | select(.status=="PASS")] | length')
FAIL=$(echo "$RESULTS" | jq '[.Controls[].tests[].results[] | select(.status=="FAIL")] | length')
WARN=$(echo "$RESULTS" | jq '[.Controls[].tests[].results[] | select(.status=="WARN")] | length')
INFO=$(echo "$RESULTS" | jq '[.Controls[].tests[].results[] | select(.status=="INFO")] | length')

# Push to Pushgateway
PUSHGATEWAY="http://kube-prometheus-stack-prometheus-pushgateway.monitoring.svc.cluster.local:9091"
cat <<EOF | curl --data-binary @- ${PUSHGATEWAY}/metrics/job/kube-bench
cis_benchmark_results{status="pass"} ${PASS}
cis_benchmark_results{status="fail"} ${FAIL}
cis_benchmark_results{status="warn"} ${WARN}
cis_benchmark_results{status="info"} ${INFO}
EOF

Phase 4: OpenSCAP VM Scanning

4.1 Install OpenSCAP on K3s VMs

# Tasks for ansible/playbooks/oscap-scan.yml (reuse the hosts/become header from harden-k3s.yml)
    - name: Install OpenSCAP
      ansible.builtin.apt:
        name:
          - openscap-scanner
          - scap-security-guide
        state: present
        update_cache: true

    - name: Run OpenSCAP CIS scan
      ansible.builtin.command: >
        oscap xccdf eval
        --profile xccdf_org.ssgproject.content_profile_cis
        --results /tmp/oscap-results.xml
        --report /tmp/oscap-report.html
        /usr/share/xml/scap/ssg/content/ssg-ubuntu2204-ds.xml
      register: oscap_scan
      failed_when: false
      changed_when: false

    - name: Fetch OpenSCAP report
      ansible.builtin.fetch:
        src: /tmp/oscap-report.html
        dest: "reports/oscap-{{ inventory_hostname }}.html"
        flat: true

4.2 Run the Scan

mkdir -p reports
ansible-playbook -i ansible/inventories/homelab/hosts.yml ansible/playbooks/oscap-scan.yml

# View reports
open reports/oscap-k3s-master-01.html
open reports/oscap-k3s-worker-01.html

Phase 5: Kyverno Compliance Policies

5.1 Additional Compliance Policies

Add to gitops-apps/security/kyverno/policies.yaml:

---
# Require app.kubernetes.io labels on all workloads
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-kubernetes-labels
  annotations:
    policies.kyverno.io/title: Require Kubernetes Standard Labels
    policies.kyverno.io/category: Compliance
spec:
  validationFailureAction: Audit
  rules:
    - name: require-app-label
      match:
        any:
          - resources:
              kinds:
                - Deployment
                - StatefulSet
                - DaemonSet
      validate:
        message: "All workloads must have app.kubernetes.io/name label"
        pattern:
          spec:
            template:
              metadata:
                labels:
                  app.kubernetes.io/name: "?*"

---
# Require read-only root filesystem
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-readonly-rootfs
  annotations:
    policies.kyverno.io/title: Require Read-Only Root Filesystem
    policies.kyverno.io/category: CIS 5.2.4
spec:
  validationFailureAction: Audit
  rules:
    - name: validate-rootfs
      match:
        any:
          - resources:
              kinds:
                - Deployment
                - StatefulSet
      validate:
        message: "Containers must use read-only root filesystem (securityContext.readOnlyRootFilesystem=true)"
        pattern:
          spec:
            template:
              spec:
                containers:
                  - securityContext:
                      readOnlyRootFilesystem: true

---
# Disallow hostPath mounts
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-hostpath
  annotations:
    policies.kyverno.io/title: Disallow hostPath Volume Mounts
    policies.kyverno.io/category: CIS 5.2.5
spec:
  validationFailureAction: Audit
  rules:
    - name: prevent-hostpath
      match:
        any:
          - resources:
              kinds:
                - Deployment
                - StatefulSet
      validate:
        message: "hostPath volume mounts are not allowed"
        pattern:
          spec:
            template:
              spec:
                =(volumes):
                  - X(hostPath): "null"

---
# Require resource quotas on production namespaces
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-quota
  annotations:
    policies.kyverno.io/title: Require ResourceQuota on Production Namespaces
    policies.kyverno.io/category: Compliance
spec:
  validationFailureAction: Audit
  rules:
    - name: check-resource-quota
      match:
        any:
          - resources:
              kinds:
                - Namespace
      context:
        # Count ResourceQuota objects already present in the namespace being admitted
        - name: quota_count
          apiCall:
            urlPath: "/api/v1/namespaces/{{ request.object.metadata.name }}/resourcequotas"
            jmesPath: "items | length(@)"
      validate:
        message: "Production namespaces must have a ResourceQuota"
        deny:
          conditions:
            all:
              - key: "{{ request.object.metadata.labels.environment || '' }}"
                operator: Equals
                value: "production"
              - key: "{{ quota_count }}"
                operator: Equals
                value: 0

Phase 6: CIS Controls Not Applied

Document CIS controls that are intentionally skipped for the homelab:

| CIS Control | Reason for Skipping |
| --- | --- |
| 1.2.1 (API server anonymous auth) | K3s requires anonymous auth for health checks |
| 1.4.1 (TLS certificates via CA) | Using self-signed CA (Guide 15) — acceptable for homelab |
| 3.2.1 (etcd encryption) | Not configured — Vault handles secret encryption |
| 4.2.1 (kubelet anonymous auth) | K3s default — acceptable in isolated network |
| 5.1.1 (PodSecurityPolicy) | PSP deprecated in K8s 1.25+; using Kyverno instead |
| 5.3.2 (Seccomp profile) | Using RuntimeDefault via Falco recommendations |

Verification

# Verify SSH hardening on Proxmox hosts
ssh root@192.168.1.11 "grep -E 'PermitRootLogin|PasswordAuthentication|X11Forwarding' /etc/ssh/sshd_config"

# Verify fail2ban
ssh root@192.168.1.11 "fail2ban-client status sshd"

# Verify kernel parameters
ssh root@192.168.1.11 "sysctl kernel.kptr_restrict kernel.dmesg_restrict net.core.bpf_jit_harden"

# Verify auditd
ssh root@192.168.1.11 "auditctl -l"

# Run kube-bench
kubectl logs job/kube-bench -n monitoring | grep -E "PASS|FAIL|WARN" | head -30

# Check Kyverno compliance policies
kubectl get clusterpolicy -o wide

# Verify OpenSCAP report exists
ls -la reports/oscap-*.html

Troubleshooting

kube-bench fails on K3s

# K3s has specific paths — use --benchmark k3s-cis-1.8
# Check kube-bench supports K3s version
kube-bench run --targets master --benchmark k3s-cis-1.8 --debug

SSH hardening locks you out

# Recovery: connect via Proxmox web console (no SSH needed)
# Fix sshd_config via console:
vi /etc/ssh/sshd_config
systemctl restart sshd

fail2ban blocking legitimate connections

# Check banned IPs
fail2ban-client status sshd
# Unban an IP
fail2ban-client set sshd unbanip <IP>
# Whitelist Tailscale range
# Add to /etc/fail2ban/jail.local: ignoreip = 127.0.0.1/8 100.64.0.0/10

Kyverno audit mode not logging

# Check Kyverno webhook
kubectl get validatingwebhookconfiguration -o yaml | grep kyverno
# Check Kyverno logs
kubectl logs -n kyverno -l app.kubernetes.io/name=kyverno --tail=50

Completion Checklist

  • SSH hardened on all Proxmox hosts (key-only auth, no root login, restricted ciphers)
  • fail2ban installed and active on all hosts
  • Kernel hardening sysctl parameters applied
  • auditd configured with homelab-specific rules
  • K3s CIS config parameters applied via Ansible
  • kube-bench runs successfully against K3s CIS benchmark
  • kube-bench weekly CronJob scheduled
  • kube-bench results pushed to Prometheus (Grafana dashboard)
  • OpenSCAP scans K3s VMs against CIS Ubuntu profile
  • OpenSCAP reports generated and reviewed
  • Kyverno compliance policies deployed (labels, read-only FS, no hostPath, resource quotas)
  • CIS controls not applied are documented with justification
  • All hardening playbooks committed to Gitea

Source: docs/guides/18-dast-testing.md


Guide 18: Dynamic Application Security Testing

Deploy OWASP ZAP to automatically scan vulnerable applications (DVWA, Juice Shop) and integrate results into the security pipeline.


Overview

This guide deploys OWASP ZAP as an automated DAST scanner targeting the vulnerable applications already running in the sandbox namespace (DVWA, Juice Shop). ZAP runs scheduled scans and pushes results to Loki and Grafana for centralized security visibility.

Time Required: ~60 minutes
Prerequisites: Guide 09 (Red/Blue Team) completed

                DAST Testing Architecture
    ┌─────────────────────────────────────────────────┐
    │              Sandbox Namespace (maul)            │
    │                                                  │
    │  ┌──────────┐  ┌──────────┐  ┌──────────────┐   │
    │  │  DVWA    │  │  Juice   │  │Metasploitable│   │
    │  │          │  │  Shop    │  │     2        │   │
    │  └─────┬────┘  └─────┬────┘  └──────┬───────┘   │
    │        │             │              │            │
    │        └──────┬──────┘──────────────┘            │
    │               │  (HTTP targets)                  │
    │               ▼                                  │
    │        ┌──────────────┐                          │
    │        │  OWASP ZAP   │ ◄── Scheduled scans      │
    │        │  (scanner)   │ ◄── CI/CD triggered      │
    │        └──────┬───────┘                          │
    │               │                                  │
    └───────────────┼──────────────────────────────────┘
                    ▼
    ┌─────────────────────────────────────────────────┐
    │           Results Pipeline                       │
    │  ZAP JSON → Parse → Loki (logs)                 │
    │                   → Prometheus (metrics)         │
    │                   → Grafana (dashboards)         │
    └─────────────────────────────────────────────────┘

Phase 1: Deploy OWASP ZAP

1.1 Create ZAP Deployment

Create gitops-apps/security/zap/deployment.yaml:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: zap-config
  namespace: sandbox
data:
  # Scan targets (internal service URLs)
  targets: |
    - name: juice-shop
      url: http://juice-shop.sandbox.svc.cluster.local:3000
    - name: dvwa
      url: http://dvwa.sandbox.svc.cluster.local:80

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zap-scanner
  namespace: sandbox
  labels:
    app.kubernetes.io/name: zap-scanner
    environment: sandbox
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: zap-scanner
  template:
    metadata:
      labels:
        app.kubernetes.io/name: zap-scanner
        environment: sandbox
    spec:
      containers:
        - name: zap
          image: zaproxy/zap-stable:latest
          ports:
            - containerPort: 8080
              name: api
            - containerPort: 8090
              name: proxy
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: "2"
              memory: 2Gi
          env:
            - name: ZAP_PORT
              value: "8080"
            - name: ZAP_API_KEY
              valueFrom:
                secretKeyRef:
                  name: zap-api-key
                  key: api-key
          volumeMounts:
            - name: zap-session
              mountPath: /home/zap/.ZAP/session
            - name: zap-reports
              mountPath: /zap/reports
      volumes:
        - name: zap-session
          emptyDir: {}
        - name: zap-reports
          persistentVolumeClaim:
            claimName: zap-reports-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: zap-reports-pvc
  namespace: sandbox
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn-ephemeral
  resources:
    requests:
      storage: 5Gi
---
apiVersion: v1
kind: Service
metadata:
  name: zap-scanner
  namespace: sandbox
spec:
  selector:
    app.kubernetes.io/name: zap-scanner
  ports:
    - name: api
      port: 8080
      targetPort: 8080
    - name: proxy
      port: 8090
      targetPort: 8090

1.2 Create API Key Secret

kubectl create secret generic zap-api-key \
  --namespace sandbox \
  --from-literal=api-key="$(openssl rand -hex 32)" \
  --dry-run=client -o yaml | kubectl apply -f -

1.3 Deploy

kubectl apply -f gitops-apps/security/zap/deployment.yaml

# Wait for ZAP to be ready
kubectl wait --for=condition=available deployment/zap-scanner -n sandbox --timeout=120s

# Verify
kubectl get pods -n sandbox -l app.kubernetes.io/name=zap-scanner

Phase 2: Baseline Scans

2.1 Juice Shop Baseline Scan

# Port-forward ZAP API
kubectl port-forward -n sandbox svc/zap-scanner 8080:8080 &

# Run baseline scan against Juice Shop
kubectl exec -n sandbox deploy/zap-scanner -- \
  zap-cli quick-scan \
    --self-contained \
    --cmd-options "-addoninstallall" \
    -f json \
    -o "-config scanner.strength=INSANE" \
    http://juice-shop.sandbox.svc.cluster.local:3000

# Alternative: Use ZAP API directly
ZAP_API="http://127.0.0.1:8080"
API_KEY=$(kubectl get secret zap-api-key -n sandbox -o jsonpath='{.data.api-key}' | base64 -d)

# Start scan via API
curl -s "${ZAP_API}/JSON/ascan/action/scan/?apikey=${API_KEY}&url=http://juice-shop.sandbox.svc.cluster.local:3000&recurse=true"

2.2 DVWA Baseline Scan

# DVWA requires authentication — configure ZAP context
kubectl exec -n sandbox deploy/zap-scanner -- \
  zap-cli quick-scan \
    --self-contained \
    -f json \
    http://dvwa.sandbox.svc.cluster.local:80

2.3 Review Results

# List alerts
kubectl exec -n sandbox deploy/zap-scanner -- \
  zap-cli report -o /zap/reports/baseline-$(date +%Y%m%d).json -f json

# Copy report locally
kubectl cp sandbox/$(kubectl get pod -n sandbox -l app.kubernetes.io/name=zap-scanner -o jsonpath='{.items[0].metadata.name}'):/zap/reports/ ./zap-reports/

Phase 3: Active Scan (Authenticated)

3.1 Configure Authentication for DVWA

API_KEY=$(kubectl get secret zap-api-key -n sandbox -o jsonpath='{.data.api-key}' | base64 -d)
ZAP_API="http://127.0.0.1:8080"

# Create a new context
CONTEXT_ID=$(curl -s "${ZAP_API}/JSON/context/action/newContext/?apikey=${API_KEY}&contextName=dvwa-scan" | jq -r '.contextId')

# Include in context
curl -s "${ZAP_API}/JSON/context/action/includeInContext/?apikey=${API_KEY}&contextName=dvwa-scan&regex=http://dvwa\.sandbox\.svc\.cluster\.local.*"

# Set authentication method (form-based)
curl -s "${ZAP_API}/JSON/authentication/action/setAuthenticationMethod/?apikey=${API_KEY}&contextId=${CONTEXT_ID}&authMethodName=formBasedAuthentication&authMethodConfigParams=loginUrl=http://dvwa.sandbox.svc.cluster.local/login.php%20username={%25username%25}%26password={%25password%25}%26Login=Login"

# Create the scan user and capture its ID
USER_ID=$(curl -s "${ZAP_API}/JSON/users/action/newUser/?apikey=${API_KEY}&contextId=${CONTEXT_ID}&name=admin" | jq -r '.userId')

curl -s "${ZAP_API}/JSON/users/action/setAuthenticationCredentials/?apikey=${API_KEY}&contextId=${CONTEXT_ID}&userId=${USER_ID}&authCredentialsConfigParams=username=admin&password=password"

# Enable user
curl -s "${ZAP_API}/JSON/users/action/setUserEnabled/?apikey=${API_KEY}&contextId=${CONTEXT_ID}&userId=${USER_ID}&enabled=true"

# Set user for forced mode
curl -s "${ZAP_API}/JSON/forcedUser/action/setForcedUser/?apikey=${API_KEY}&contextId=${CONTEXT_ID}&userId=${USER_ID}"
curl -s "${ZAP_API}/JSON/forcedUser/action/setForcedUserModeEnabled/?apikey=${API_KEY}&enabled=true"

3.2 Run Authenticated Active Scan

# Spider the application first
curl -s "${ZAP_API}/JSON/spider/action/scan/?apikey=${API_KEY}&url=http://dvwa.sandbox.svc.cluster.local&contextName=dvwa-scan"

# Wait for spider to complete
sleep 30

# Run active scan
curl -s "${ZAP_API}/JSON/ascan/action/scan/?apikey=${API_KEY}&url=http://dvwa.sandbox.svc.cluster.local&recurse=true&contextName=dvwa-scan"

Phase 4: Automated Scheduled Scans

4.1 CronJob for Weekly Scans

apiVersion: batch/v1
kind: CronJob
metadata:
  name: zap-weekly-scan
  namespace: sandbox
spec:
  schedule: "0 2 * * 0"  # Weekly Sunday 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: zap-scan
              image: zaproxy/zap-stable:latest
              command:
                - /bin/sh
                - -c
                - |
                  DATE=$(date +%Y%m%d)
                  # Juice Shop baseline scan
                  zap-cli quick-scan --self-contained \
                    -f json \
                    -o "-config scanner.strength=INSANE" \
                    -o "-addoninstallall" \
                    /zap/reports/juiceshop-${DATE}.json \
                    http://juice-shop.sandbox.svc.cluster.local:3000

                  # DVWA baseline scan
                  zap-cli quick-scan --self-contained \
                    -f json \
                    /zap/reports/dvwa-${DATE}.json \
                    http://dvwa.sandbox.svc.cluster.local:80

                  # Parse and push metrics
                  TOTAL_ALERTS=$(jq '[.[].alerts | length] | add' /zap/reports/juiceshop-${DATE}.json 2>/dev/null || echo "0")
                  HIGH_ALERTS=$(jq '[.[] | select(.riskdesc | startswith("High"))] | length' /zap/reports/juiceshop-${DATE}.json 2>/dev/null || echo "0")

                  echo "zap_scan_alerts_total{target=\"juice-shop\",severity=\"total\"} ${TOTAL_ALERTS}" > /tmp/metrics
                  echo "zap_scan_alerts_total{target=\"juice-shop\",severity=\"high\"} ${HIGH_ALERTS}" >> /tmp/metrics

                  # Push to Pushgateway
                  wget --post-file=/tmp/metrics \
                    http://kube-prometheus-stack-prometheus-pushgateway.monitoring.svc.cluster.local:9091/metrics/job/zap-scan

              volumeMounts:
                - name: reports
                  mountPath: /zap/reports
          volumes:
            - name: reports
              persistentVolumeClaim:
                claimName: zap-reports-pvc
          restartPolicy: OnFailure

4.2 CI/CD Triggered Scans

Add to gitops-apps/.gitea/workflows/dast-scan.yaml:

name: DAST Scan

on:
  push:
    branches: [main]
    paths:
      - 'sandbox-apps/**'

jobs:
  zap-baseline:
    name: "🔍 ZAP Baseline Scan"
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run ZAP Baseline Scan
        run: |
          # Mount the workspace so the report is readable by the next step
          docker run -v "$(pwd)":/zap/reports -t zaproxy/zap-stable:latest \
            zap-cli quick-scan --self-contained \
            -f json \
            -o "-addoninstallall" \
            /zap/reports/scan-results.json \
            http://juice-shop.sandbox.svc.cluster.local:3000 || true

      - name: Check for High Alerts
        run: |
          HIGH=$(jq '[.[] | select(.riskdesc | startswith("High"))] | length' scan-results.json)
          if [ "$HIGH" -gt 0 ]; then
            echo "::error::ZAP found ${HIGH} high-severity vulnerabilities"
            exit 1
          fi

      - name: Upload Results
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: zap-scan-results
          path: scan-results.json

Phase 5: Grafana Integration

5.1 Push Scan Logs to Loki

Create a sidecar that ships ZAP reports to Loki:

# Add to ZAP deployment as sidecar
- name: report-shipper
  image: grafana/alloy:latest
  volumeMounts:
    - name: zap-reports
      mountPath: /reports

Or use a simpler approach — push results via curl to Loki after each scan:

# Push ZAP results to Loki
LOKI_URL="http://loki.logging.svc.cluster.local:3100"
REPORT_FILE="/zap/reports/juiceshop-$(date +%Y%m%d).json"

# Convert ZAP JSON to the Loki push format (streams array, nanosecond timestamp as a string)
jq -c --arg ts "$(date +%s%N)" \
  '{streams: [{stream: {job: "zap-scanner", target: "juice-shop"}, values: [[$ts, (. | tostring)]]}]}' \
  "$REPORT_FILE" | curl -X POST "${LOKI_URL}/loki/api/v1/push" -H "Content-Type: application/json" --data-binary @-

5.2 ZAP Grafana Dashboard

Create a dashboard showing:

  • Total vulnerabilities by severity (High/Medium/Low/Info)
  • Vulnerabilities per target application
  • Vulnerability trends over time
  • Top 10 most common vulnerability types
  • OWASP Top 10 coverage (which categories are triggered)
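
These panels can be driven by the zap_scan_alerts_total metric pushed to the Pushgateway in Phase 4. A sketch of the kind of queries involved, assuming a port-forward of the Prometheus service to localhost:9090:

# Total alerts by target application and severity
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (target, severity) (zap_scan_alerts_total)'

# Week-over-week change in high-severity findings
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=zap_scan_alerts_total{severity="high"} - zap_scan_alerts_total{severity="high"} offset 1w'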

5.3 Correlate with Trivy Findings

Create a combined vulnerability dashboard comparing:

  • Trivy (static): Container image vulnerabilities
  • ZAP (dynamic): Runtime application vulnerabilities
  • Falco (runtime): Active security events

This gives a complete view: static analysis + dynamic testing + runtime monitoring.
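
As a sketch, one row of such a combined dashboard could place these three queries side by side (metric names as used elsewhere in these guides; label names may differ in your setup, and Prometheus is assumed port-forwarded to localhost:9090):

# Static: critical CVEs reported by Trivy Operator
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(trivy_image_vulnerabilities{severity="Critical"})'

# Dynamic: high-severity ZAP findings from the latest scan
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(zap_scan_alerts_total{severity="high"})'

# Runtime: Falco events over the last 24 hours
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(increase(falco_events[24h]))'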


Verification

# Verify ZAP is running
kubectl get pods -n sandbox -l app.kubernetes.io/name=zap-scanner

# Run a quick test scan
kubectl exec -n sandbox deploy/zap-scanner -- \
  zap-cli quick-scan --self-contained -f json \
  http://juice-shop.sandbox.svc.cluster.local:3000

# Check scan results
kubectl exec -n sandbox deploy/zap-scanner -- ls -la /zap/reports/

# Copy report to local machine
POD=$(kubectl get pod -n sandbox -l app.kubernetes.io/name=zap-scanner -o jsonpath='{.items[0].metadata.name}')
kubectl cp sandbox/${POD}:/zap/reports/ ./zap-reports/

# View report
cat ./zap-reports/*.json | jq '.[0].alerts[:5]'

# Verify CronJob is scheduled
kubectl get cronjob -n sandbox zap-weekly-scan

Troubleshooting

ZAP scan hangs

# Check ZAP pod resources
kubectl top pod -n sandbox -l app.kubernetes.io/name=zap-scanner
# ZAP is memory-hungry — increase memory limit if needed
# Also check: target application is reachable from sandbox namespace
kubectl exec -n sandbox deploy/zap-scanner -- curl -sI http://juice-shop.sandbox.svc.cluster.local:3000

Scan produces no results

# Ensure spider finds the application pages
kubectl exec -n sandbox deploy/zap-scanner -- \
  zap-cli spider http://juice-shop.sandbox.svc.cluster.local:3000
# Check: target URL is correct, service is running
kubectl get svc -n sandbox

Cannot push to Loki

# Test Loki connectivity from ZAP pod
kubectl exec -n sandbox deploy/zap-scanner -- \
  curl -s http://loki.logging.svc.cluster.local:3100/ready
# Expected: ready

Completion Checklist

  • OWASP ZAP deployed in sandbox namespace
  • ZAP API accessible and functional
  • Baseline scan completed against Juice Shop
  • Baseline scan completed against DVWA
  • Authenticated scan configured for DVWA (form-based login)
  • Active scan tested with authentication
  • Weekly automated scan CronJob scheduled
  • CI/CD triggered DAST scan workflow created
  • ZAP results pushed to Loki for log analysis
  • ZAP metrics pushed to Prometheus Pushgateway
  • Grafana ZAP dashboard created
  • Combined vulnerability dashboard (Trivy + ZAP + Falco) created
  • Scan reports stored on persistent volume (zap-reports-pvc)

Source: docs/guides/19-chaos-engineering.md


Guide 19: Chaos Engineering

Deploy Chaos Mesh to deliberately inject failures and test the resilience of the security and monitoring stack.


Overview

This guide installs Chaos Mesh and runs controlled experiments against the homelab infrastructure. Each experiment tests a specific resilience property: pod recovery, network partition handling, storage degradation, and alert pipeline integrity.

Time Required: ~75 minutes
Prerequisites: Guide 10 (Monitoring Stack) completed

              Chaos Engineering Architecture
    ┌─────────────────────────────────────────────────┐
    │              Chaos Mesh Dashboard                │
    │         (chaos-mesh.homelab.local)               │
    └──────────────────┬──────────────────────────────┘
                       │
    ┌──────────────────┼──────────────────────────────┐
    │              Experiment Library                   │
    │                                                  │
    │  ┌──────────────┐  ┌──────────────┐             │
    │  │  PodChaos    │  │ NetworkChaos │             │
    │  │  kill/fail   │  │  delay/part  │             │
    │  └──────────────┘  └──────────────┘             │
    │  ┌──────────────┐  ┌──────────────┐             │
    │  │ StressChaos  │  │   IOChaos    │             │
    │  │  CPU/mem     │  │  latency/fault│             │
    │  └──────────────┘  └──────────────┘             │
    └──────────────────┬──────────────────────────────┘
                       │ inject failures
    ┌──────────────────┼──────────────────────────────┐
    │              Target Services                     │
    │                                                  │
    │  Falco · Vault · Prometheus · Loki · Tempo      │
    │  Grafana · AlertManager · OTel Collector        │
    └─────────────────────────────────────────────────┘

Phase 1: Install Chaos Mesh

1.1 Create Namespace

kubectl create namespace chaos-mesh
kubectl label namespace chaos-mesh environment=infrastructure

1.2 Install via Helm

helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update

helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/k3s/containerd/containerd.sock \
  --set dashboard.create=true \
  --set dashboard.ingress.enabled=false \
  --set prometheus.create=true \
  --set prometheus.serviceMonitor.enabled=true \
  --set controllerManager.serviceAccount.name=chaos-controller-manager \
  --set dnsServer.create=true \
  --wait

1.3 Verify Installation

kubectl get pods -n chaos-mesh
# Expected:
# chaos-controller-manager   Running
# chaos-daemon               Running (on each node)
# chaos-dashboard            Running
# chaos-dns-server           Running

# Port-forward dashboard for setup
kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333 &
# Access: http://localhost:2333

1.4 Create Ingress for Dashboard

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: chaos-mesh-dashboard
  namespace: chaos-mesh
  annotations:
    cert-manager.io/cluster-issuer: homelab-ca-issuer
    nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - chaos-mesh.homelab.local
      secretName: chaos-mesh-tls
  rules:
    - host: chaos-mesh.homelab.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: chaos-dashboard
                port:
                  number: 2333

Phase 2: Safety Mechanisms

2.1 Create Safety Policy

Chaos Mesh uses ServiceAccount and RBAC to limit what chaos experiments can do. Always set:

  • Duration limits: Every experiment must have an explicit duration field
  • Namespace selectors: Only target specific namespaces
  • Emergency stop: Know how to halt all experiments immediately
# Emergency stop — delete all chaos experiments
kubectl delete networkchaos,podchaos,stresschaos,iochaos,timechaos,dnschaos --all -A

# Or use the dashboard's "Pause All" button

2.2 Namespace Allowlist

# Label namespaces that chaos experiments can target
kubectl label namespace monitoring chaos-ready=true
kubectl label namespace logging chaos-ready=true
kubectl label namespace security chaos-ready=true
kubectl label namespace sandbox chaos-ready=true

# Do NOT label: kube-system, longhorn-system, argocd, cert-manager
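
To confirm which namespaces are opted in before running any experiment:

kubectl get namespaces -l chaos-ready=true
# Expected: monitoring, logging, security, sandbox (and nothing else)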

Phase 3: Experiment 1 — Falco Pod Kill

Tests: Falco self-healing and monitoring continuity

3.1 Define the Experiment

# chaos-experiments/falco-pod-kill.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: falco-pod-kill
  namespace: chaos-mesh
  labels:
    experiment: falco-resilience
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - security
    labelSelectors:
      app.kubernetes.io/name: falco
  duration: "30s"
  # For recurring runs, wrap this experiment in a Schedule resource (see Phase 8)

3.2 Run and Observe

# Apply the experiment
kubectl apply -f chaos-experiments/falco-pod-kill.yaml

# Watch Falco pod get killed and restart
kubectl get pods -n security -l app.kubernetes.io/name=falco -w

# Verify Falco recovers and is functional
# Wait 2 minutes, then:
kubectl exec -n security daemonset/falco -- falco --list-source=syscall

# Check: monitoring still receives Falco events
kubectl logs -n security -l app.kubernetes.io/name=falco --tail=10

# Clean up
kubectl delete -f chaos-experiments/falco-pod-kill.yaml

Phase 4: Experiment 2 — Vault Network Partition

Tests: Application behavior when Vault is unreachable

4.1 Define the Experiment

# chaos-experiments/vault-network-partition.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: vault-partition
  namespace: chaos-mesh
  labels:
    experiment: vault-isolation
spec:
  action: partition
  direction: both
  mode: all
  selector:
    namespaces:
      - monitoring
    labelSelectors:
      app.kubernetes.io/name: prometheus
  target:
    mode: all
    selector:
      namespaces:
        - security
      labelSelectors:
        app: vault
  duration: "60s"

4.2 Run and Observe

kubectl apply -f chaos-experiments/vault-network-partition.yaml

# Observe: applications using Vault secrets should handle the disconnect
# Check for error logs in services that depend on Vault
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus --tail=20 | grep -i vault

# Verify: network partition is active
kubectl describe networkchaos vault-partition -n chaos-mesh

# After 60s, verify connectivity is restored
kubectl exec -n monitoring deploy/kube-prometheus-stack-prometheus -- \
  wget -qO- http://vault.security.svc.cluster.local:8200/v1/sys/health

# Clean up (auto-cleans after duration, or manual)
kubectl delete -f chaos-experiments/vault-network-partition.yaml

Phase 5: Experiment 3 — Monitoring Pipeline Network Delay

Tests: Alert pipeline resilience under network degradation

5.1 Define the Experiment

# chaos-experiments/monitoring-network-delay.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: monitoring-delay
  namespace: chaos-mesh
  labels:
    experiment: alert-pipeline
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - monitoring
    labelSelectors:
      app.kubernetes.io/name: prometheus
  delay:
    latency: "500ms"
    correlation: "50"
    jitter: "100ms"
  duration: "120s"

5.2 Run and Observe

kubectl apply -f chaos-experiments/monitoring-network-delay.yaml

# Observe: Prometheus scrape latency increase
kubectl exec -n monitoring deploy/kube-prometheus-stack-prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/query?query=scrape_duration_seconds' 2>/dev/null | jq .

# Check: alerts still fire (may be delayed)
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &
curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname'

# After 120s, verify latency returns to normal
sleep 120
kubectl exec -n monitoring deploy/kube-prometheus-stack-prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/query?query=scrape_duration_seconds' 2>/dev/null | jq .

Phase 6: Experiment 4 — CPU Stress Test

Tests: Pod eviction and resource handling under load

6.1 Define the Experiment

# chaos-experiments/worker-cpu-stress.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: worker-cpu-stress
  namespace: chaos-mesh
  labels:
    experiment: resource-pressure
spec:
  mode: one
  selector:
    namespaces:
      - monitoring
    labelSelectors:
      app.kubernetes.io/name: prometheus
  stressors:
    cpu:
      workers: 2
      load: 80
  duration: "60s"

6.2 Run and Observe

kubectl apply -f chaos-experiments/worker-cpu-stress.yaml

# Watch CPU usage spike
kubectl top pods -n monitoring

# Check: Prometheus still responds (may be slower)
kubectl exec -n monitoring deploy/kube-prometheus-stack-prometheus -- \
  wget -qO- http://localhost:9090/-/healthy

# Clean up after 60s

Phase 7: Experiment 5 — Longhorn IO Latency

Tests: Storage degradation handling

7.1 Define the Experiment

# chaos-experiments/longhorn-io-latency.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: longhorn-io-delay
  namespace: chaos-mesh
  labels:
    experiment: storage-resilience
spec:
  action: latency
  mode: one
  selector:
    namespaces:
      - monitoring
    labelSelectors:
      app.kubernetes.io/name: prometheus
  delay: "200ms"
  methods:
    - READ
    - WRITE
  path: "/data"
  duration: "60s"

7.2 Run and Observe

kubectl apply -f chaos-experiments/longhorn-io-latency.yaml

# Check: Prometheus TSDB write latency
kubectl exec -n monitoring deploy/kube-prometheus-stack-prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/query?query=prometheus_tsdb_compaction_duration_seconds' 2>/dev/null | jq .

# Check: Longhorn volume health
kubectl get volumes -n longhorn-system

# Clean up after 60s

Phase 8: Scheduled Chaos

8.1 Weekly Game Day Schedule

# chaos-experiments/game-day.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: weekly-game-day
  namespace: chaos-mesh
spec:
  schedule: "0 4 * * 1"  # Monday 4 AM
  historyLimit: 3
  concurrencyPolicy: Forbid
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - monitoring
        - logging
        - security
      # The namespaces listed above form the chaos-ready allowlist from Phase 2
    duration: "30s"

Phase 9: Grafana Dashboard

9.1 Chaos Metrics Dashboard

Create a Grafana dashboard with panels:

  • Active Experiments: Current running chaos experiments (from Chaos Mesh metrics)
  • Pod Recovery Time: Time from pod kill to pod Ready (kube_pod_status_phase metric)
  • Service Availability During Chaos: Target service uptime during experiments
  • Alert Delivery During Chaos: AlertManager alerts fired vs delivered during experiments

Import Chaos Mesh dashboard (Dashboard ID: 16463) or create custom.
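
For the custom panels, queries along these lines are a reasonable starting point (a sketch using kube-state-metrics series already scraped by the monitoring stack, with Prometheus port-forwarded to localhost:9090):

# Pods not in Running phase in the chaos-targeted namespaces (spikes during experiments)
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (namespace) (kube_pod_status_phase{phase!="Running",namespace=~"monitoring|logging|security"})'

# Target service availability during chaos (per scrape job)
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=avg by (job) (up{namespace=~"monitoring|logging|security"})'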


Verification

# Verify Chaos Mesh installation
kubectl get pods -n chaos-mesh

# Run all experiments sequentially (with cleanup between)
for EXP in falco-pod-kill vault-network-partition monitoring-network-delay worker-cpu-stress; do
  echo "=== Running: ${EXP} ==="
  kubectl apply -f chaos-experiments/${EXP}.yaml
  sleep 90
  kubectl delete -f chaos-experiments/${EXP}.yaml
  echo "=== Cleaned up: ${EXP} ==="
  sleep 30
done

# Verify all services are healthy after experiments
kubectl get pods -A | grep -v Running | grep -v Completed

# Check monitoring is still functional
kubectl exec -n monitoring deploy/kube-prometheus-stack-prometheus -- \
  wget -qO- http://localhost:9090/-/healthy

Troubleshooting

Chaos Daemon not starting

# Check containerd socket path (K3s uses non-default path)
kubectl logs -n chaos-mesh -l app.kubernetes.io/component=chaos-daemon
# Fix: set chaosDaemon.socketPath=/run/k3s/containerd/containerd.sock

Experiment stuck and won't stop

# Force delete
kubectl delete podchaos,stresschaos,networkchaos,iochaos <name> -n chaos-mesh --force --grace-period=0
# Or pause via dashboard

Dashboard not accessible

# Check service
kubectl get svc -n chaos-mesh chaos-dashboard
# Port-forward if ingress not configured
kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333

Completion Checklist

  • Chaos Mesh installed in chaos-mesh namespace
  • Dashboard accessible via ingress or port-forward
  • Safety labels applied to target namespaces (chaos-ready=true)
  • Emergency stop procedure documented and tested
  • Experiment 1: Falco pod kill — verified auto-recovery
  • Experiment 2: Vault network partition — verified graceful degradation
  • Experiment 3: Monitoring network delay — verified alert pipeline resilience
  • Experiment 4: CPU stress — verified pod eviction handling
  • Experiment 5: Longhorn IO latency — verified storage degradation handling
  • Weekly game day schedule created
  • Chaos experiment YAML files committed to gitops-apps
  • Grafana chaos dashboard created
  • All services healthy after running all experiments

Source: docs/guides/20-policy-as-code.md


Guide 20: Policy as Code

Implement OPA Gatekeeper for Kubernetes admission control and Conftest for CI pipeline policy enforcement.


Overview

This guide deploys OPA Gatekeeper alongside Kyverno and creates a policy-as-code framework. Gatekeeper handles admission control with Rego-based constraint templates, while Conftest validates manifests in the CI pipeline. Kyverno remains for YAML-native policies — Gatekeeper adds Rego flexibility for complex policies.

Time Required: ~75 minutes
Prerequisites: Guide 08 (Security Tooling), Guide 12 (CI/CD Pipeline Security) completed

              Policy as Code Architecture
    ┌─────────────────────────────────────────────────┐
    │            Policy Sources (Git)                  │
    │                                                  │
    │  policies/                                       │
    │  ├── conftest/     (CI policy checks)            │
    │  ├── gatekeeper/   (K8s admission)               │
    │  └── kyverno/      (K8s admission, YAML-native)  │
    └──────────────────┬──────────────────────────────┘
                       │
          ┌────────────┼────────────┐
          ▼            ▼            ▼
    ┌──────────┐ ┌──────────┐ ┌──────────────┐
    │ Conftest │ │Gatekeeper│ │   Kyverno    │
    │ (CI/CD)  │ │ (K8s API)│ │ (K8s API)    │
    └──────────┘ └──────────┘ └──────────────┘
          │            │            │
          ▼            ▼            ▼
    ┌──────────┐ ┌──────────────────────────┐
    │ Pipeline │ │   Kubernetes Cluster      │
    │ Gate     │ │   (Admission Control)     │
    └──────────┘ └──────────────────────────┘

Kyverno vs Gatekeeper — When to Use Each

| Aspect | Kyverno | OPA Gatekeeper |
|--------|---------|----------------|
| Policy language | YAML (native K8s) | Rego (OPA) |
| Complexity | Simple policies | Complex logic, loops, data joins |
| Learning curve | Low | Medium-High |
| Best for | Labels, limits, image rules | Cross-resource validation, data-driven policies |
| Policy testing | Manual | OPA test framework |
| Audit | Per-policy | Centralized audit |
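
For comparison, the required-labels rule defined later in Phase 2 as a Gatekeeper ConstraintTemplate can be expressed as a YAML-native Kyverno ClusterPolicy in a few lines. A minimal sketch (the policy name and the Audit action are illustrative, not part of this homelab's existing policy set):

cat > kyverno-require-app-label.yaml <<'EOF'
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-app-label
spec:
  validationFailureAction: Audit
  rules:
    - name: check-app-label
      match:
        any:
          - resources:
              kinds: ["Deployment", "StatefulSet", "DaemonSet"]
      validate:
        message: "All workloads must have the app.kubernetes.io/name label"
        pattern:
          metadata:
            labels:
              # "?*" requires the label to exist and be non-empty
              app.kubernetes.io/name: "?*"
EOF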

Phase 1: Install OPA Gatekeeper

1.1 Deploy via Helm

helm repo add gatekeeper https://open-policy-agent.github.io/gatekeeper/charts
helm repo update

helm install gatekeeper gatekeeper/gatekeeper \
  --namespace gatekeeper-system \
  --create-namespace \
  --set auditInterval=60 \
  --set auditFromCache=true \
  --set logLevel=INFO \
  --set emitAdmissionEvents=true \
  --set emitAuditEvents=true \
  --set validatingWebhookTimeoutSeconds=10 \
  --set disabledBuiltins={"http.send"} \
  --wait

1.2 Verify Installation

kubectl get pods -n gatekeeper-system
# Expected: gatekeeper-audit Running, gatekeeper-controller-manager Running

kubectl get crd | grep gatekeeper
# Expected: constrainttemplates.templates.gatekeeper.sh
#           configs.config.gatekeeper.sh
#           constraintpodstatuses.status.gatekeeper.sh

1.3 ArgoCD Application

# gitops-apps/security/gatekeeper/application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gatekeeper
  namespace: argocd
spec:
  project: homelab
  source:
    repoURL: https://git.homelab.local/homelab/gitops-apps.git
    targetRevision: main
    path: security/gatekeeper
  destination:
    server: https://kubernetes.default.svc
    namespace: gatekeeper-system
  syncPolicy:
    automated:
      selfHeal: true
      prune: false
    syncOptions:
      - ServerSideApply=true
      - CreateNamespace=true

Phase 2: Constraint Templates

2.1 Required Labels Template

Create gitops-apps/security/gatekeeper/templates/required-labels.yaml:

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
  annotations:
    description: "Require specific labels on Kubernetes resources"
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
            message:
              type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        violation[{"msg": msg}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := input.parameters.message
        }

2.2 Banned Image Registry Template

Create gitops-apps/security/gatekeeper/templates/banned-registries.yaml:

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sbannedregistries
  annotations:
    description: "Block images from unauthorized registries"
spec:
  crd:
    spec:
      names:
        kind: K8sBannedRegistries
      validation:
        openAPIV3Schema:
          type: object
          properties:
            registries:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sbannedregistries

        violation[{"msg": msg}] {
          container := input.review.object.spec.template.spec.containers[_]
          banned := input.parameters.registries[_]
          startswith(container.image, banned)
          msg := sprintf("Container image <%v> uses banned registry <%v>", [container.image, banned])
        }

        violation[{"msg": msg}] {
          container := input.review.object.spec.template.spec.initContainers[_]
          banned := input.parameters.registries[_]
          startswith(container.image, banned)
          msg := sprintf("Init container image <%v> uses banned registry <%v>", [container.image, banned])
        }

2.3 Required Probes Template

Create gitops-apps/security/gatekeeper/templates/required-probes.yaml:

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredprobes
  annotations:
    description: "Require liveness and readiness probes on containers"
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredProbes
      validation:
        openAPIV3Schema:
          type: object
          properties:
            probeTypes:
              type: array
              items:
                type: string
                enum: ["livenessProbe", "readinessProbe", "startupProbe"]
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredprobes

        violation[{"msg": msg}] {
          probeType := input.parameters.probeTypes[_]
          container := input.review.object.spec.template.spec.containers[_]
          not container[probeType]
          msg := sprintf("Container <%v> missing <%v>", [container.name, probeType])
        }

2.4 Longhorn Storage Class Template

Homelab-specific: require Longhorn storage classes on PVCs.

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: homelabstorageclass
  annotations:
    description: "Require Longhorn storage classes on PVCs"
spec:
  crd:
    spec:
      names:
        kind: HomelabStorageClass
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package homelabstorageclass

        allowed := {"longhorn-critical", "longhorn-default", "longhorn-ephemeral"}

        violation[{"msg": msg}] {
          input.review.kind.kind == "PersistentVolumeClaim"
          sc := input.review.object.spec.storageClassName
          not allowed[sc]
          msg := sprintf("PVC <%v> uses unsupported storageClass <%v>. Use: longhorn-critical, longhorn-default, or longhorn-ephemeral", [input.review.object.metadata.name, sc])
        }

        violation[{"msg": msg}] {
          input.review.kind.kind == "PersistentVolumeClaim"
          not input.review.object.spec.storageClassName
          msg := sprintf("PVC <%v> must specify a storageClassName", [input.review.object.metadata.name])
        }

Phase 3: Constraints (Policies)

3.1 Apply Templates First

kubectl apply -f gitops-apps/security/gatekeeper/templates/

# Verify templates
kubectl get constrainttemplates

3.2 Create Constraints

Create gitops-apps/security/gatekeeper/constraints/:

# gitops-apps/security/gatekeeper/constraints/required-labels.yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-app-labels
spec:
  enforcementAction: dryrun  # Start in dryrun, change to deny after testing
  match:
    kinds:
      - kinds: ["Deployment", "StatefulSet", "DaemonSet"]
    excludedNamespaces:
      - kube-system
      - gatekeeper-system
      - longhorn-system
  parameters:
    labels:
      - "app.kubernetes.io/name"
    message: "All workloads must have app.kubernetes.io/name label"
# gitops-apps/security/gatekeeper/constraints/banned-registries.yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sBannedRegistries
metadata:
  name: ban-dockerhub-latest
spec:
  enforcementAction: dryrun
  match:
    kinds:
      - kinds: ["Deployment", "StatefulSet", "DaemonSet", "Pod"]
    excludedNamespaces:
      - kube-system
      - gatekeeper-system
  parameters:
    registries:
      - "docker.io/library/"  # Ban official Docker Hub images (require registry mirror)
# gitops-apps/security/gatekeeper/constraints/required-probes.yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredProbes
metadata:
  name: require-health-probes
spec:
  enforcementAction: dryrun
  match:
    kinds:
      - kinds: ["Deployment", "StatefulSet"]
    excludedNamespaces:
      - kube-system
      - gatekeeper-system
      - longhorn-system
  parameters:
    probeTypes:
      - "livenessProbe"
      - "readinessProbe"
# gitops-apps/security/gatekeeper/constraints/storage-class.yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: HomelabStorageClass
metadata:
  name: require-longhorn-storage
spec:
  enforcementAction: dryrun
  match:
    kinds:
      - kinds: ["PersistentVolumeClaim"]
    excludedNamespaces:
      - kube-system
      - gatekeeper-system
      - longhorn-system
      - velero

3.3 Apply Constraints

kubectl apply -f gitops-apps/security/gatekeeper/constraints/

# Verify constraints
kubectl get constraints
kubectl describe K8sRequiredLabels require-app-labels

Phase 4: Conftest CI Integration

4.1 Policy Library Structure

policies/
├── conftest/
│   ├── require_labels.rego
│   ├── require_limits.rego
│   ├── disallow_latest.rego
│   ├── require_storage_class.rego
│   └── require_authelia_annotation.rego
├── conftest-tests/       (unit tests)
│   ├── require_labels_test.rego
│   └── disallow_latest_test.rego
├── gatekeeper/
│   ├── templates/
│   └── constraints/
└── kyverno/              (existing)
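
The Conftest policies listed above all follow the same deny/warn shape. As an illustration, one way disallow_latest.rego could look (hypothetical content, shown only to sketch the pattern; the real file may differ):

cat > policies/conftest/disallow_latest.rego <<'EOF'
package main

# Deny any container image that uses the mutable :latest tag
deny[msg] {
    input.kind == "Deployment"
    container := input.spec.template.spec.containers[_]
    endswith(container.image, ":latest")
    msg := sprintf("Container '%s' must not use the ':latest' image tag", [container.name])
}
EOF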

4.2 Authelia Annotation Policy

# policies/conftest/require_authelia_annotation.rego
package main

# Require forward-auth annotation on all ingresses in production
warn[msg] {
    input.kind == "Ingress"
    ns := input.metadata.namespace
    prod_ns := {"services", "monitoring", "security"}
    prod_ns[ns]
    not input.metadata.annotations["nginx.ingress.kubernetes.io/auth-url"]
    msg := sprintf("Ingress '%s' in namespace '%s' must have Authelia forward-auth annotation", [input.metadata.name, ns])
}

4.3 Test Policies with OPA Test Framework

# policies/conftest-tests/require_labels_test.rego
package main

test_pass_with_label {
    allow with input as {
        "kind": "Deployment",
        "metadata": {
            "name": "test-app",
            "namespace": "services",
            "labels": {"app.kubernetes.io/name": "test-app"}
        },
        "spec": {
            "template": {
                "spec": {
                    "containers": [{
                        "name": "test",
                        "image": "test:1.0",
                        "resources": {
                            "limits": {"cpu": "100m", "memory": "128Mi"}
                        }
                    }]
                }
            }
        }
    }
}

test_fail_without_label {
    deny[msg] with input as {
        "kind": "Deployment",
        "metadata": {"name": "test-app", "namespace": "services"}
    }
}
# Run policy unit tests
conftest verify --policy policies/conftest/ --policy policies/conftest-tests/

Phase 5: CI Pipeline Gate

5.1 Conftest Workflow

Add to the security pipeline (from Guide 12):

  # Add to security-pipeline.yaml
  conftest:
    name: "📋 Policy Check"
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install Conftest
        run: |
          CONFTEST_VERSION=$(curl -s https://api.github.com/repos/open-policy-agent/conftest/releases/latest | jq -r '.tag_name' | sed 's/^v//')
          curl -L "https://github.com/open-policy-agent/conftest/releases/download/v${CONFTEST_VERSION}/conftest_${CONFTEST_VERSION}_Linux_x86_64.tar.gz" | tar xz
          sudo mv conftest /usr/local/bin/

      - name: Verify Policy Tests Pass
        run: |
          conftest verify --policy policies/conftest/

      - name: Validate All Manifests
        run: |
          conftest test --policy policies/conftest/ --output table gitops-apps/

      - name: Validate Helm Charts
        run: |
          # If using Helm charts, render and validate
          for chart in charts/*/; do
            helm template "${chart}" | conftest test --policy policies/conftest/ -
          done

Phase 6: Compliance Reporting

6.1 Gatekeeper Audit

Gatekeeper runs periodic audits. View results:

# Check audit results
kubectl get constraints -o yaml | grep -A5 "totalViolations"

# View violations per constraint
kubectl describe K8sRequiredLabels require-app-labels

6.2 Prometheus Metrics

Gatekeeper exposes metrics at :8888/metrics. Create a ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: gatekeeper-metrics
  namespace: gatekeeper-system
spec:
  selector:
    matchLabels:
      app: gatekeeper
  endpoints:
    - port: metrics
      interval: 30s

6.3 Grafana Dashboard

Create a Gatekeeper compliance dashboard showing:

  • Total violations per constraint
  • Constraint enforcement actions (dryrun vs deny)
  • Audit run duration
  • Admission webhook latency

Import OPA Gatekeeper dashboard (Dashboard ID: 16922).
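
The panels above map onto Gatekeeper's exported metrics. A sketch of typical queries (metric names taken from Gatekeeper's metrics endpoint; verify against :8888/metrics on your version, with Prometheus port-forwarded to localhost:9090):

# Violations found by the audit controller, per enforcement action
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (enforcement_action) (gatekeeper_violations)'

# Admission webhook latency (p95)
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum by (le) (rate(gatekeeper_validation_request_duration_seconds_bucket[5m])))'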


Verification

# Verify Gatekeeper installation
kubectl get pods -n gatekeeper-system
kubectl get constrainttemplates

# Verify constraints
kubectl get constraints
kubectl describe K8sRequiredLabels require-app-labels

# Test constraint enforcement (dryrun mode — should log but not block)
kubectl create deployment test-no-label --image=nginx -n services
# Check: violation logged in constraint status
kubectl describe K8sRequiredLabels require-app-labels

# Test Conftest locally
conftest test --policy policies/conftest/ gitops-apps/
conftest verify --policy policies/conftest/

# Switch to deny mode after testing
kubectl patch K8sRequiredLabels require-app-labels \
  --type merge -p '{"spec":{"enforcementAction":"deny"}}'

# Test: should now be blocked
kubectl create deployment test-no-label-2 --image=nginx -n services
# Expected: admission webhook denied the request

# Switch back to dryrun for safety
kubectl patch K8sRequiredLabels require-app-labels \
  --type merge -p '{"spec":{"enforcementAction":"dryrun"}}'

Troubleshooting

Gatekeeper webhook blocking everything

# Temporarily disable webhook
kubectl delete validatingwebhookconfiguration gatekeeper-validating-webhook-configuration
# Re-enable:
helm upgrade gatekeeper gatekeeper/gatekeeper --namespace gatekeeper-system --reuse-values

Constraint violations not showing

# Check audit pod logs
kubectl logs -n gatekeeper-system -l control-plane=audit-controller --tail=50
# Verify auditInterval is set (default 60s)
kubectl get config -n gatekeeper-system config -o yaml

Conftest policy errors

# Debug with trace
conftest test --policy policies/conftest/ --trace gitops-apps/
# Verify Rego syntax
conftest parse gitops-apps/argocd-apps/root-application.yaml

Completion Checklist

  • OPA Gatekeeper installed in gatekeeper-system namespace
  • ConstraintTemplate K8sRequiredLabels created
  • ConstraintTemplate K8sBannedRegistries created
  • ConstraintTemplate K8sRequiredProbes created
  • ConstraintTemplate HomelabStorageClass created (homelab-specific)
  • Constraints deployed in dryrun mode
  • Constraint violations reviewed and acceptable ones documented
  • Conftest policies in policies/conftest/ directory
  • Authelia annotation policy (forward-auth on production ingresses)
  • Policy unit tests pass (conftest verify)
  • CI pipeline includes Conftest gate
  • Gatekeeper ServiceMonitor configured
  • Grafana Gatekeeper dashboard imported
  • All policy files committed to Gitea
  • Documented when to use Kyverno vs Gatekeeper

Source: docs/guides/21-incident-response.md


Guide 21: Incident Response & Alerting

Route Falco runtime alerts through AlertManager to Grafana, create incident runbooks, and configure Wazuh active response for automated remediation.


Overview

This guide builds the complete alert pipeline: Falco detects runtime threats → falcosidekick routes to AlertManager → AlertManager notifies via Grafana and webhook. Includes six incident runbooks and Wazuh active response automation.

Time Required: ~90 minutes
Prerequisites: Guide 08 (Security Tooling), Guide 10 (Monitoring Stack) completed

              Incident Response Pipeline
    ┌─────────────────────────────────────────────────┐
    │              Detection Sources                    │
    │                                                  │
    │  Falco (runtime)    Trivy (vulns)    Kyverno     │
    │  kube-bench (CIS)   cert-manager    Longhorn    │
    └────────┬──────────────┬──────────────┬──────────┘
             │              │              │
             ▼              ▼              ▼
    ┌─────────────────────────────────────────────────┐
    │           falcosidekick (router)                 │
    │    Routes events to multiple outputs             │
    └────────┬──────────────┬──────────────┬──────────┘
             │              │              │
             ▼              ▼              ▼
    ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
    │ AlertManager │ │    Loki      │ │ Wazuh Active │
    │ (alerts)     │ │   (logs)     │ │   Response   │
    └──────┬───────┘ └──────────────┘ └──────────────┘
           │
           ▼
    ┌─────────────────────────────────────────────────┐
    │         Notification Channels                    │
    │  Grafana · Webhook · Slack (optional)            │
    └─────────────────────────────────────────────────┘

Phase 1: Deploy falcosidekick

1.1 Install falcosidekick

helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update

cat > falcosidekick-values.yaml <<'EOF'
config:
  alertmanager:
    hostport: "http://kube-prometheus-stack-alertmanager.monitoring.svc.cluster.local:9093"
    minimumpriority: "warning"
  loki:
    hostport: "http://loki.logging.svc.cluster.local:3100"
    minimumpriority: "informational"

  # Webhook for custom integrations
  webhook:
    address: "http://wazuh-manager.sandbox.svc.cluster.local:55000/webhook"
    minimumpriority: "critical"

  # Slack (optional — requires internet)
  # slack:
  #   webhookurl: "https://hooks.slack.com/services/XXX"
  #   minimumpriority: "warning"

  # Custom fields for all outputs
  customfields:
    environment: "homelab"
    cluster: "k3s-homelab"

resources:
  requests:
    cpu: 50m
    memory: 64Mi
  limits:
    cpu: 200m
    memory: 128Mi
EOF

helm install falcosidekick falcosecurity/falcosidekick \
  --namespace security \
  -f falcosidekick-values.yaml \
  --wait

1.2 Configure Falco to Use sidekick

# Update Falco config to output JSON to falcosidekick
helm upgrade falco falcosecurity/falco \
  --namespace security \
  --reuse-values \
  --set falco.jsonOutput=true \
  --set falco.programOutput.enabled=true \
  --set falco.programOutput.program="curl -s http://falcosidekick.security.svc.cluster.local:2801 -X POST -H 'Content-Type: application/json' -d @-"

1.3 Verify Pipeline

# Trigger a Falco event
kubectl run shell-test --image=alpine -n monitoring -- sh -c "sleep 3600"
kubectl exec -n monitoring shell-test -- sh -c "cat /etc/shadow"

# Check Falco logs
kubectl logs -n security daemonset/falco --tail=10

# Check falcosidekick logs
kubectl logs -n security deploy/falcosidekick --tail=10

# Check AlertManager received the alert
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &
curl -s http://localhost:9093/api/v2/alerts | jq .

# Check Loki received the event
kubectl exec -n monitoring deploy/kube-prometheus-stack-grafana -- \
  wget -qO- 'http://loki.logging.svc.cluster.local:3100/loki/api/v1/query?query={job="falco"}' | jq .

# Clean up test pod
kubectl delete pod shell-test -n monitoring --force

Phase 2: AlertManager Configuration

2.1 Create Alert Routes

Create gitops-apps/monitoring/alertmanager-config.yaml:

apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-config
  namespace: monitoring
type: Opaque
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m

    route:
      group_by: ['alertname', 'namespace', 'severity']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: 'grafana-notifications'

      routes:
        # Critical alerts — immediate notification
        - match:
            severity: critical
          receiver: 'grafana-critical'
          repeat_interval: 15m
          group_wait: 10s

        # Falco security alerts
        - match_re:
            alertname: Falco.*
          receiver: 'grafana-security'
          repeat_interval: 30m

        # Backup failures
        - match_re:
            alertname: VeleroBackup.*|ProxmoxBackup.*
          receiver: 'grafana-critical'

    receivers:
      - name: 'grafana-notifications'
        # Uses Grafana unified alerting — no separate webhook needed
        # Alerts appear in Grafana Alerting UI

      - name: 'grafana-critical'
        # Same as above — severity label routes them

      - name: 'grafana-security'
        # Security-specific alerts from Falco

2.2 Prometheus Alert Rules

Create gitops-apps/monitoring/security-alerts.yaml:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: security-alerts
  namespace: monitoring
spec:
  groups:
    - name: falco.rules
      rules:
        - alert: FalcoRuntimeAlert
          expr: increase(falco_events{priority="Critical"}[5m]) > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Falco critical runtime event detected"
            description: "Falco detected {{ $value }} critical event(s) in the last 5 minutes. Rule: {{ $labels.rule }}"

        - alert: FalcoHighEventRate
          expr: rate(falco_events[5m]) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Unusually high Falco event rate"
            description: "Falco is generating {{ $value }} events/second. This may indicate an active attack."

    - name: trivy.rules
      rules:
        - alert: TrivyCriticalVulnerability
          expr: trivy_image_vulnerabilities{severity="Critical"} > 0
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Critical vulnerability in image {{ $labels.image }}"
            description: "Image {{ $labels.image }} in namespace {{ $labels.namespace }} has {{ $value }} critical vulnerabilities."

    - name: cluster.rules
      rules:
        - alert: NodeDown
          expr: up{job="node-exporter"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.instance }} is down"
            description: "Node {{ $labels.instance }} has been unreachable for 5 minutes."

        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
            description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has restarted {{ $value }} times in 15 minutes."

        - alert: CertificateExpiringSoon
          expr: certmanager_certificate_expiration_timestamp_seconds - time() < 86400 * 14
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Certificate {{ $labels.name }} expires in less than 14 days"

        - alert: LonghornVolumeDegraded
          expr: longhorn_volume_robustness == 2  # 2 = degraded
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Longhorn volume {{ $labels.volume }} is degraded"

Phase 3: Incident Severity Classification

| Level | Name | Response Time | Examples |
|-------|------|---------------|----------|
| SEV1 | Critical | Immediate (< 15 min) | Node down, active intrusion, data loss |
| SEV2 | High | < 1 hour | Critical vulnerability, cert expiring, backup failure |
| SEV3 | Medium | < 4 hours | Warning alert, degraded storage, policy violation |
| SEV4 | Low | Next business day | Info alert, audit finding, non-critical CIS failure |

Phase 4: Incident Runbooks

4.1 Runbook: Unauthorized Shell in Container

## IR-001: Unauthorized Shell Spawned in Container

**Severity:** SEV1 — Critical
**Source:** Falco rule "Terminal shell in container"

### Detection
Falco alert: `Terminal shell in container`
AlertManager: Critical severity
Grafana: Security dashboard → Falco events

### Investigation Steps
1. Identify the affected pod and namespace (from the Falco event):
   kubectl logs -n security daemonset/falco | grep "shell in container"
2. Check who spawned the shell:
   kubectl describe pod <pod-name> -n <namespace>
   kubectl logs <pod-name> -n <namespace> --previous
3. Check ArgoCD sync history (was this deployed recently?):
   argocd app history <app-name>
4. Review Kubernetes audit logs:
   kubectl logs -n kube-system -l component=kube-apiserver --tail=100

### Containment

  1. If unauthorized, isolate the pod:
    kubectl label pod <pod-name> compromised=true -n <namespace>
    # Apply emergency NetworkPolicy to block all ingress and egress
    kubectl apply -n <namespace> -f - <<NETPOL
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: quarantine
    spec:
      podSelector:
        matchLabels:
          compromised: "true"
      policyTypes: ["Ingress", "Egress"]
    NETPOL
  2. Capture forensics:
    kubectl exec <pod-name> -n <namespace> -- ps aux > /tmp/forensics-ps.log
    kubectl exec <pod-name> -n <namespace> -- netstat -tlnp > /tmp/forensics-net.log

### Remediation

  1. Delete the compromised pod
  2. Review the container image for tampering
  3. Check Kyverno audit logs for policy violations
  4. Rotate any credentials the pod had access to (Vault)

### Post-Incident

  • Document in Gitea: docs/incidents/YYYY-MM-DD-IR-001.md
  • Update Falco rules if needed
  • Update Kyverno policies to prevent recurrence

### 4.2 Runbook: Critical Vulnerability Detected

## IR-002: Critical Vulnerability (Trivy)

**Severity:** SEV2 — High
**Source:** Trivy Operator vulnerability scan

### Detection
Trivy alert: `TrivyCriticalVulnerability`
PrometheusRule: Critical vulnerability in image

### Investigation Steps
1. Check Trivy vulnerability reports:
   kubectl get vulnerabilityreports -A
   kubectl describe vulnerabilityreport <report-name> -n <namespace>
2. Identify the vulnerable image and CVE:
   kubectl get vulnerabilityreport -A -o json | \
     jq '.items[] | select(.report.summary.criticalCount > 0)'
3. Check if a fix is available:
   trivy image --severity CRITICAL <image>:<tag>

### Remediation

  1. Update the image to a patched version
  2. If no fix available — add to Grype allowlist with justification
  3. If high risk — consider removing the workload:
    kubectl scale deployment <name> -n <namespace> --replicas=0

### Post-Incident

  • Document: docs/incidents/YYYY-MM-DD-IR-002.md
  • Update CI pipeline to block this image version

### 4.3 Runbook: Node Failure

## IR-003: Node Failure

**Severity:** SEV1 — Critical
**Source:** Prometheus alert `NodeDown`

### Detection
AlertManager: Node {{ instance }} is down
Longhorn: Volume replicas degraded

### Investigation Steps
1. Check Proxmox UI — is the VM running?
2. SSH to the Proxmox host:
   ssh root@192.168.1.11  # vader or sidious
   qm status <VM_ID>
3. Check VM console for errors
4. Check physical hardware: power, network, disk

### Remediation

  1. If VM stopped — restart:
    qm start <VM_ID>
  2. If hardware failure — migrate VMs to healthy node
  3. Wait for K3s node to rejoin:
    kubectl get nodes
  4. Longhorn will auto-rebuild degraded replicas

### Post-Incident

  • Document hardware failure details
  • Check if preventive maintenance is needed

### 4.4 Runbook: Certificate Expiring

## IR-004: Certificate Expiring

**Severity:** SEV2 — High
**Source:** cert-manager PrometheusRule

### Remediation
cert-manager should auto-renew. If it hasn't:
kubectl describe certificate <name> -n <namespace>
# Check for errors in Events section
kubectl logs -n cert-manager -l app.kubernetes.io/name=cert-manager --tail=50
# Force renewal with the cert-manager CLI:
cmctl renew <name> -n <namespace>
# Or delete the certificate's TLS secret to trigger reissuance:
kubectl delete secret <tls-secret-name> -n <namespace>

### 4.5 Runbook: Storage Degradation

## IR-005: Longhorn Volume Degraded

**Severity:** SEV3 — Medium
**Source:** Prometheus alert `LonghornVolumeDegraded`

### Investigation
kubectl get volumes -n longhorn-system
kubectl describe volume <name> -n longhorn-system

### Remediation

Longhorn auto-rebuilds from healthy replicas. If not:

# Trigger rebuild via Longhorn UI or API
# Check node storage health:
ssh root@<node> "lsblk && df -h"

### 4.6 Runbook: Unauthorized Access Attempt

## IR-006: Unauthorized Access (Authelia)

**Severity:** SEV2 — High
**Source:** Authelia logs

### Investigation
kubectl logs -n security deploy/authelia | grep "authentication failed"
# Check source IP — is it from Tailscale VPN or internal?
# If from Tailscale — check who was connected
# If from unexpected source — check pfSense firewall rules

### Remediation

  1. If brute force — block IP via pfSense
  2. If compromised credentials — reset in LLDAP
  3. Review Authelia access control rules

---

## Phase 5: Falco Grafana Dashboard

### 5.1 Falco Metrics Dashboard

Create a Grafana dashboard with panels:

| Panel | Metric | Description |
|-------|--------|-------------|
| Events/sec | `rate(falco_events[5m])` | Falco event throughput |
| Events by priority | `sum(falco_events) by (priority)` | Breakdown by severity |
| Events by rule | `topk(10, sum(falco_events) by (rule))` | Top 10 triggered rules |
| Events by namespace | `sum(falco_events) by (k8s_ns_name)` | Events per namespace |
| Total alerts sent | `increase(falcosidekick_output{status="ok"}[1h])` | Alerts successfully routed |
| Alert delivery failures | `falcosidekick_output{status="error"}` | Failed alert deliveries |

Import Falco dashboard (Dashboard ID: `11922`).

---

## Phase 6: Wazuh Active Response

### 6.1 Configure Active Response

On the Wazuh manager (sandbox namespace), add active response rules:

<!-- /var/ossec/etc/ossec.conf on Wazuh manager -->
<active-response>
  <command>firewall-drop</command>
  <location>local</location>
  <rules_id>100100,100101</rules_id>
  <timeout>3600</timeout>
</active-response>

<active-response>
  <command>disable-account</command>
  <location>local</location>
  <rules_id>100200</rules_id>
  <timeout>1800</timeout>
</active-response>

6.2 Custom Active Response Script

Create /var/ossec/active-response/bin/k8s-isolate-pod.sh:

#!/bin/bash
# Isolate a Kubernetes pod when Wazuh triggers an alert
# Requires kubectl access from Wazuh manager

ACTION=$1
USER=$2
IP=$3
ALERTID=$4
RULEID=$5

KUBECONFIG="/var/ossec/.kube/config"

if [ "$ACTION" = "add" ]; then
    # Block the source IP at pfSense level
    logger "WAZUH AR: Blocking IP $IP (rule $RULEID)"
    # Or apply K8s NetworkPolicy to isolate
fi

if [ "$ACTION" = "delete" ]; then
    logger "WAZUH AR: Unblocking IP $IP (timeout expired)"
fi
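
For the script to actually fire, it also has to be registered as a command and wired to an active-response block on the Wazuh manager. A sketch of the extra ossec.conf stanza (the command name and rule ID 100300 are placeholders, not values defined elsewhere in these guides):

cat >> /var/ossec/etc/ossec.conf <<'EOF'
<ossec_config>
  <command>
    <name>k8s-isolate-pod</name>
    <executable>k8s-isolate-pod.sh</executable>
    <timeout_allowed>yes</timeout_allowed>
  </command>

  <active-response>
    <command>k8s-isolate-pod</command>
    <location>local</location>
    <rules_id>100300</rules_id>
    <timeout>3600</timeout>
  </active-response>
</ossec_config>
EOF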

Phase 7: Alert Silencing

7.1 Planned Maintenance Silencing

# Silence all alerts during maintenance window
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &

# Create silence via API
curl -X POST http://localhost:9093/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {"name": "namespace", "value": "monitoring", "isRegex": false}
    ],
    "startsAt": "2026-04-20T10:00:00Z",
    "endsAt": "2026-04-20T12:00:00Z",
    "createdBy": "admin@homelab.local",
    "comment": "Scheduled maintenance window"
  }'

# List active silences
curl -s http://localhost:9093/api/v2/silences | jq .

Verification

# Verify falcosidekick is running
kubectl get pods -n security -l app.kubernetes.io/name=falcosidekick

# Verify AlertManager is receiving alerts
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &
curl -s http://localhost:9093/api/v2/alerts | jq '.[].status.state'

# Verify PrometheusRules are loaded
kubectl get prometheusrules -n monitoring

# Test Falco alert pipeline
kubectl run test-shell --image=alpine -n monitoring -- sh -c "cat /etc/shadow; sleep 300"
sleep 10
kubectl logs -n security daemonset/falco --tail=5
kubectl logs -n security deploy/falcosidekick --tail=5

# Check Grafana for Falco events
# Navigate to: Explore → {job="falco"} → LogQL

# Verify runbooks exist
ls docs/incidents/ 2>/dev/null || echo "Create docs/incidents/ directory"
mkdir -p docs/incidents

# Clean up
kubectl delete pod test-shell -n monitoring --force

Troubleshooting

falcosidekick not receiving events

kubectl logs -n security daemonset/falco --tail=20
# Check: Falco is outputting JSON (--set falco.jsonOutput=true)
# Check: Program output is configured to curl falcosidekick

AlertManager not routing correctly

# Check AlertManager config
kubectl get secret alertmanager-config -n monitoring -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d
# Check AlertManager logs
kubectl logs -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0 --tail=20

Falco metrics not in Prometheus

# Check ServiceMonitor exists
kubectl get servicemonitor -n security
# Falco metrics endpoint is typically on port 9376
kubectl exec -n security daemonset/falco -- wget -qO- http://localhost:9376/metrics

Completion Checklist

  • falcosidekick deployed in security namespace
  • Falco outputting JSON to falcosidekick via program output
  • falcosidekick routing to AlertManager (warning+ priority)
  • falcosidekick routing to Loki (informational+ priority)
  • AlertManager configured with severity-based routes
  • PrometheusRules for Falco, Trivy, NodeDown, CrashLoop, Certs, Longhorn
  • Incident severity classification documented (SEV1-SEV4)
  • Runbook IR-001: Unauthorized shell in container
  • Runbook IR-002: Critical vulnerability (Trivy)
  • Runbook IR-003: Node failure / pod eviction
  • Runbook IR-004: Certificate expiring
  • Runbook IR-005: Storage degradation
  • Runbook IR-006: Unauthorized access (Authelia)
  • Falco Grafana dashboard created (events/sec, by priority, by rule)
  • Alert silencing procedure documented
  • Wazuh active response configured (firewall-drop)
  • Custom K8s isolation script created
  • End-to-end alert pipeline tested (Falco → sidekick → AlertManager)

Source: docs/guides/22-network-security.md


Guide 22: Advanced Network Security

Replace Flannel with Cilium CNI for eBPF-based networking, Hubble observability, L7 network policies, and transparent encryption.


Overview

This guide migrates the K3s cluster from Flannel (default CNI) to Cilium. Cilium brings eBPF-based datapath, identity-aware security, L7 network policies (HTTP, gRPC), Hubble flow visualization, and WireGuard transparent encryption between nodes.

Caution

This is a breaking change. The migration will briefly disrupt pod networking. Schedule a maintenance window and have a rollback plan ready. Read Phase 7 before starting.

Time Required: ~120 minutes
Prerequisites: Guide 05 (K3s Cluster), Guide 10 (Monitoring Stack) completed

              Cilium Network Security Architecture
    ┌─────────────────────────────────────────────────┐
    │                  Cilium CNI                      │
    │         (eBPF datapath — kernel-level)           │
    │                                                  │
    │  ┌──────────────┐  ┌──────────────────────────┐ │
    │  │   Hubble     │  │ Transparent Encryption   │ │
    │  │ Observability│  │   (WireGuard)            │ │
    │  └──────────────┘  └──────────────────────────┘ │
    │                                                  │
    │  ┌──────────────────────────────────────────────┐│
    │  │         Cilium Network Policies              ││
    │  │  L3/L4 (like K8s NetworkPolicy)              ││
    │  │  L7 (HTTP, gRPC, DNS, Kafka)                 ││
    │  └──────────────────────────────────────────────┘│
    └─────────────────────────────────────────────────┘
                       │
    ┌──────────────────┼──────────────────────────────┐
    │           Node-to-Node Traffic                   │
    │   vader (10.10.10.10) ◄──WireGuard──► sidious   │
    │                          (10.10.10.12)           │
    └─────────────────────────────────────────────────┘
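
The L7 policy layer in the diagram above is what distinguishes Cilium from stock NetworkPolicy: rules can match HTTP methods and paths, not just ports. As a sketch of what that looks like once the migration is done (the app labels, namespace, port, and path here are illustrative, not taken from this homelab):

cat > cilium-l7-example.yaml <<'EOF'
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: backend-allow-get-only
  namespace: sandbox
spec:
  endpointSelector:
    matchLabels:
      app: backend
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:
              - method: "GET"
                path: "/api/.*"
EOF
kubectl apply -f cilium-l7-example.yaml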

Phase 1: Pre-Migration Checklist

1.1 Verify Current State

# Current CNI
kubectl get nodes -o wide
# K3s default: Flannel, iface=cni0

# Current NetworkPolicies
kubectl get networkpolicies -A

# Current pods (baseline for verification)
kubectl get pods -A -o wide > /tmp/pre-migration-pods.txt

# Check Longhorn volumes are healthy
kubectl get volumes -n longhorn-system

1.2 Backup Current State

# Velero backup before migration
velero backup create pre-cilium-migration \
  --include-namespaces '*' \
  --exclude-namespaces velero,kube-system \
  --snapshot-volumes=true \
  --wait

# Export all NetworkPolicies
kubectl get networkpolicies -A -o yaml > /tmp/networkpolicies-backup.yaml

1.3 Prepare K3s for Cilium

On the K3s master (10.10.10.10):

# Stop K3s to reconfigure
ssh rancher@10.10.10.10 "sudo systemctl stop k3s"

# Edit K3s config to disable Flannel
ssh rancher@10.10.10.10 "sudo tee /etc/rancher/k3s/config.yaml" <<'EOF'
cluster-init: true
token: <existing-token>
disable:
  - traefik

# Cilium replaces both Flannel and kube-proxy
disable-network-policy: true
disable-kube-proxy: true

node-ip: 10.10.10.10
advertise-address: 10.10.10.10
flannel-backend: none
write-kubeconfig-mode: "0644"
EOF

On the K3s worker (10.10.10.12):

ssh rancher@10.10.10.12 "sudo systemctl stop k3s-agent"

ssh rancher@10.10.10.12 "sudo tee /etc/rancher/k3s/config.yaml" <<'EOF'
server: https://10.10.10.10:6443
token: <existing-token>
# Agent nodes only take agent options: server, token, and node settings.
# disable, flannel-backend, and write-kubeconfig-mode are server-only flags.
node-ip: 10.10.10.12
EOF

Important

The flannel-backend: none setting is critical. Without it, K3s will recreate Flannel on startup.


Phase 2: Install Cilium

2.1 Clean Up Old CNI

On both K3s nodes:

# Remove old CNI configuration and interfaces
ssh rancher@10.10.10.10 "sudo rm -rf /var/lib/cni/ /etc/cni/net.d/*"
ssh rancher@10.10.10.12 "sudo rm -rf /var/lib/cni/ /etc/cni/net.d/*"

# Remove Flannel interfaces
ssh rancher@10.10.10.10 "sudo ip link delete flannel.1 2>/dev/null; sudo ip link delete cni0 2>/dev/null"
ssh rancher@10.10.10.12 "sudo ip link delete flannel.1 2>/dev/null; sudo ip link delete cni0 2>/dev/null"

2.2 Start K3s (Without Flannel)

# Start master first
ssh rancher@10.10.10.10 "sudo systemctl start k3s"

# Wait for the API server to answer (nodes stay NotReady until a CNI is installed,
# so don't wait on the Ready condition here)
until kubectl get nodes >/dev/null 2>&1; do sleep 5; done

# Start worker
ssh rancher@10.10.10.12 "sudo systemctl start k3s-agent"

# Confirm both nodes have registered — NotReady is expected at this stage
kubectl get nodes

Note

At this point, pods won't have networking. This is expected — Cilium will provide it.
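
A hedged way to confirm the nodes are simply waiting on a CNI (the exact condition text may vary by K3s version):

kubectl describe node k3s-master-01 | grep -i "networkready"
# Expected: NetworkReady=false ... cni plugin not initialized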

2.3 Install Cilium via Helm

helm repo add cilium https://helm.releases.cilium.io/
helm repo update

# Get K3s API server IP
API_SERVER_IP=10.10.10.10

helm install cilium cilium/cilium \
  --namespace kube-system \
  --set kubeProxyReplacement=true \
  --set operator.replicas=1 \
  --set hubble.enabled=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true \
  --set hubble.metrics.enabled="{dns,drop,tcp,flow,icmp,http}" \
  --set encryption.enabled=true \
  --set encryption.type=wireguard \
  --set k8sServiceHost=${API_SERVER_IP} \
  --set k8sServicePort=6443 \
  --set tunnelProtocol=vxlan \
  --set ipam.mode=kubernetes \
  --set operator.rollOutPods=true \
  --set rollOutCiliumPods=true \
  --wait
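
The connectivity test used below runs with the Cilium CLI from the admin workstation, not inside the agent pods. A minimal install sketch (Linux amd64 assumed; adjust for macOS):

CILIUM_CLI_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/cilium-cli/main/stable.txt)
curl -L --remote-name-all https://github.com/cilium/cilium-cli/releases/download/${CILIUM_CLI_VERSION}/cilium-linux-amd64.tar.gz
sudo tar xzvfC cilium-linux-amd64.tar.gz /usr/local/bin
cilium version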

2.4 Verify Cilium

# Check Cilium pods
kubectl get pods -n kube-system -l k8s-app=cilium
# Expected: cilium Running on each node, cilium-operator Running

# Check Cilium status
kubectl exec -n kube-system ds/cilium -- cilium status
# Expected: Host: OK, NodeMonitor: OK, Hubble: OK

# Run the connectivity test with the Cilium CLI from the workstation
# (the test is not part of the agent binary inside the cilium pods)
cilium connectivity test

# Verify WireGuard encryption
kubectl exec -n kube-system ds/cilium -- cilium encrypt status
# Expected: Encryption: Wireguard

Phase 3: Restart Application Pods

3.1 Restart All Pods to Use Cilium Networking

# Restart all pods (they need to join the new CNI)
kubectl rollout restart deployment --all -A
kubectl rollout restart statefulset --all -A
kubectl rollout restart daemonset --all -A

# Wait for all pods to be running
kubectl get pods -A -o wide

# Verify all pods have Cilium-managed IPs
kubectl exec -n kube-system ds/cilium -- cilium endpoint list

3.2 Verify Core Services

# Test DNS
kubectl run dns-test --image=busybox --rm -it --restart=Never -- nslookup kubernetes.default

# Test inter-pod connectivity
kubectl run netshoot --image=nicolaka/netshoot --rm -it --restart=Never -- curl -s http://gitea.services.svc.cluster.local:3000

# Test external connectivity
kubectl run netshoot2 --image=nicolaka/netshoot --rm -it --restart=Never -- curl -sI https://1.1.1.1

Phase 4: Hubble Observability

4.1 Access Hubble UI

# Port-forward Hubble UI
kubectl port-forward -n kube-system svc/hubble-ui 8080:80 &
# Access: http://localhost:8080

4.2 Create Hubble Ingress

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hubble-ui
  namespace: kube-system
  annotations:
    cert-manager.io/cluster-issuer: homelab-ca-issuer
    nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - hubble.homelab.local
      secretName: hubble-tls
  rules:
    - host: hubble.homelab.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: hubble-ui
                port:
                  number: 80

4.3 Hubble CLI

# Install Hubble CLI locally
HUBBLE_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/hubble/main/stable.txt)
curl -L https://github.com/cilium/hubble/releases/download/${HUBBLE_VERSION}/hubble-linux-amd64.tar.gz | tar xz
sudo mv hubble /usr/local/bin/

# Port-forward Hubble API
kubectl port-forward -n kube-system svc/hubble-relay 4245:4245 &

# Observe network flows in real-time
hubble observe --since 1m

# Observe DNS queries
hubble observe --protocol dns --since 5m

# Observe HTTP requests
hubble observe --protocol http --since 5m

# Observe dropped packets
hubble observe --verdict DROPPED --since 5m

# Filter by namespace
hubble observe --namespace monitoring --since 5m

# Filter by label
hubble observe --label app.kubernetes.io/name=prometheus --since 5m

4.4 Hubble Metrics in Grafana

Cilium exposes Prometheus metrics. The Hubble metrics are already configured in the Helm install (hubble.metrics.enabled). Create a ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cilium
  namespace: monitoring
spec:
  selector:
    matchLabels:
      k8s-app: cilium
  namespaceSelector:
    matchNames:
      - kube-system
  endpoints:
    - port: hubble-metrics
      interval: 15s

Import Cilium dashboard (Dashboard ID: 16611) and Hubble dashboard (16612).


Phase 5: Cilium Network Policies

5.1 L3/L4 Policy (Similar to K8s NetworkPolicy)

Replace existing Kyverno-generated NetworkPolicies with Cilium equivalents:

# gitops-apps/security/cilium/default-deny-ingress.yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: monitoring
spec:
  # Selecting every endpoint in the namespace puts them into default-deny;
  # only the explicit allow rules below are then permitted. (An ingressDeny
  # block here would override the allow rule and block same-namespace traffic too.)
  endpointSelector:
    matchLabels: {}
  ingress:
    # Allow from same namespace
    - fromEndpoints:
        - matchLabels:
            io.kubernetes.pod.namespace: monitoring

5.2 L7 Policy — HTTP (Cilium Exclusive)

Allow only specific HTTP methods and paths between services:

# gitops-apps/security/cilium/grafana-to-prometheus.yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: grafana-to-prometheus
  namespace: monitoring
spec:
  endpointSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus
  ingress:
    - fromEndpoints:
        - matchLabels:
            app.kubernetes.io/name: grafana
      toPorts:
        - ports:
            - port: "9090"
          rules:
            http:
              - method: GET
                path: "/api/v1/.*"

5.3 L7 Policy — ArgoCD to Gitea

# gitops-apps/security/cilium/argocd-to-gitea.yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: argocd-to-gitea
  namespace: services
spec:
  endpointSelector:
    matchLabels:
      app.kubernetes.io/name: gitea
  ingress:
    - fromEndpoints:
        - matchLabels:
            io.kubernetes.pod.namespace: argocd
      toPorts:
        - ports:
            - port: "3000"
          rules:
            http:
              - method: GET
                path: "/.*"
              - method: POST
                path: "/api/v1/.*"

5.4 DNS Policy — Restrict External DNS

# Only allow DNS queries to AdGuard and internal names
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: restrict-dns
  namespace: security
spec:
  endpointSelector:
    matchLabels: {}
  egress:
    - toEndpoints:
        - matchLabels:
            io.kubernetes.pod.namespace: services
            app.kubernetes.io/name: adguard
      toPorts:
        - ports:
            - port: "53"
              protocol: UDP
            - port: "53"
              protocol: TCP
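
A quick hedged test of the restriction (the external resolver IP is only an example):

# A pod in the security namespace should not be able to query external resolvers directly
kubectl run dns-deny-test -n security --image=busybox --rm -it --restart=Never -- nslookup example.com 8.8.8.8
# Expected: the query times out, since egress is only allowed to AdGuard on port 53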

5.5 Audit Mode (Test Before Enforcing)

Cilium's policy audit mode is an agent-level setting rather than a per-policy annotation: with audit mode on, violations are logged (visible in Hubble as audit verdicts) but not blocked. Enable it while testing, deploy the deny policy, then turn it off to enforce.

# gitops-apps/security/cilium/audit-deny-egress.yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: audit-deny-egress
  namespace: monitoring
spec:
  endpointSelector:
    matchLabels: {}
  egressDeny:
    - toCIDR:
        - 10.20.20.0/24  # Block egress to sandbox
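
A minimal sketch of toggling audit mode via Helm (the value is named policyAuditMode in current chart versions — verify against your installed chart):

# Enable audit mode cluster-wide while testing
helm upgrade cilium cilium/cilium -n kube-system --reuse-values --set policyAuditMode=true

# Watch for audited (would-be-dropped) flows
hubble observe --namespace monitoring --verdict AUDIT --since 5m

# Disable audit mode again to start enforcing
helm upgrade cilium cilium/cilium -n kube-system --reuse-values --set policyAuditMode=false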

Phase 6: Transparent Encryption

6.1 Verify WireGuard Encryption

# Check encryption status on each node
kubectl exec -n kube-system ds/cilium -- cilium encrypt status

# Verify WireGuard keys
kubectl exec -n kube-system ds/cilium -- cilium encrypt show-keys

6.2 Store WireGuard Keys in Vault (Optional)

# Cilium generates WireGuard keys automatically
# To store in Vault for backup:
kubectl exec -n kube-system ds/cilium -- cat /run/cilium/wg/private.key > /tmp/wg-private.key

vault kv put secret/cilium/wireguard \
  private-key="$(cat /tmp/wg-private.key)"

rm /tmp/wg-private.key

6.3 Verify Encrypted Traffic

# Capture traffic between nodes — should be WireGuard encrypted
# On pve-vader, capture traffic to sidious:
ssh rancher@10.10.10.10 "sudo tcpdump -i any -c 20 host 10.10.10.12"
# Expected: UDP packets on WireGuard port (51871 default)
# Raw TCP payloads should NOT be visible
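
Optionally, inspect the WireGuard interface Cilium creates on each node (the interface is named cilium_wg0; the wg command assumes wireguard-tools is installed on the host):

ssh rancher@10.10.10.10 "ip -d link show cilium_wg0 && sudo wg show cilium_wg0"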

Phase 7: Rollback Procedure

If Cilium migration fails:

7.1 Remove Cilium

helm uninstall cilium -n kube-system

# Clean up Cilium interfaces
ssh rancher@10.10.10.10 "sudo ip link delete cilium_host 2>/dev/null; sudo ip link delete cilium_vxlan 2>/dev/null"
ssh rancher@10.10.10.12 "sudo ip link delete cilium_host 2>/dev/null; sudo ip link delete cilium_vxlan 2>/dev/null"

# Clean up CNI config
ssh rancher@10.10.10.10 "sudo rm -rf /var/lib/cni/ /etc/cni/net.d/*"
ssh rancher@10.10.10.12 "sudo rm -rf /var/lib/cni/ /etc/cni/net.d/*"

7.2 Re-enable Flannel

# Edit K3s config on the master — remove the Cilium-specific settings so Flannel
# and kube-proxy come back on restart
ssh rancher@10.10.10.10 "sudo sed -i '/flannel-backend/d; /disable-kube-proxy/d; /disable-network-policy/d' /etc/rancher/k3s/config.yaml"
ssh rancher@10.10.10.12 "sudo sed -i '/flannel-backend/d' /etc/rancher/k3s/config.yaml"

# Restart K3s
ssh rancher@10.10.10.10 "sudo systemctl restart k3s"
ssh rancher@10.10.10.12 "sudo systemctl restart k3s-agent"

# Wait for nodes
kubectl wait --for=condition=Ready node --all --timeout=120s

# Restart pods
kubectl rollout restart deployment --all -A

7.3 Restore from Velero (Last Resort)

velero restore create rollback-from-cilium \
  --from-backup pre-cilium-migration \
  --wait

Phase 8: Performance Comparison

8.1 Benchmark Before and After

# Run an iperf3 server pod for network performance testing
kubectl run iperf-server --image=networkstatic/iperf3 -- -s

# Pod names are not DNS-resolvable, so grab the server's pod IP
kubectl wait --for=condition=Ready pod/iperf-server --timeout=60s
SERVER_IP=$(kubectl get pod iperf-server -o jsonpath='{.status.podIP}')

# Run a 30-second bandwidth test from a client pod
kubectl run iperf-client --rm -it --restart=Never --image=networkstatic/iperf3 -- -c ${SERVER_IP} -t 30

# Expected results:
# Flannel (VXLAN):  ~8-9 Gbps (on 1Gbps link: ~900 Mbps)
# Cilium (VXLAN):   ~9-10 Gbps (eBPF overhead is lower)
# Cilium (WireGuard): ~7-8 Gbps (encryption overhead)
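
Clean up the benchmark server afterwards (pod name as created above):

kubectl delete pod iperf-server --ignore-not-found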

Verification

# Cilium status
kubectl exec -n kube-system ds/cilium -- cilium status

# Hubble status (Hubble CLI via the relay port-forward from Phase 4.3)
hubble status

# Connectivity test (Cilium CLI, run from the workstation)
cilium connectivity test

# Encryption verification
kubectl exec -n kube-system ds/cilium -- cilium encrypt status

# DNS resolution
kubectl run test-dns --image=busybox --rm -it --restart=Never -- nslookup kubernetes.default

# Pod-to-pod connectivity
kubectl run test-curl --image=curlimages/curl --rm -it --restart=Never -- curl -s http://gitea.services.svc.cluster.local:3000

# Network policies
kubectl get ciliumnetworkpolicies -A

# All pods running
kubectl get pods -A -o wide | grep -v Running

Troubleshooting

Cilium pods not starting

kubectl logs -n kube-system ds/cilium --tail=50
# Common: missing kernel headers (BPF compilation)
# Fix: sudo apt install -y linux-headers-$(uname -r)

Pods can't communicate

# Check Cilium endpoint list
kubectl exec -n kube-system ds/cilium -- cilium endpoint list
# Check Cilium identity list
kubectl exec -n kube-system ds/cilium -- cilium identity list
# Check Hubble flows for drops
hubble observe --verdict DROPPED

WireGuard encryption not working

kubectl exec -n kube-system ds/cilium -- cilium encrypt status
# If "Disabled": check encryption.enabled=true in Helm values
# Verify kernel module: modprobe wireguard

KubeProxy replacement issues

# Cilium replaces kube-proxy — check service routing
kubectl exec -n kube-system ds/cilium -- cilium service list
# If services missing: check k8sServiceHost and k8sServicePort values

Hubble UI shows no flows

# Check Hubble Relay
kubectl logs -n kube-system deploy/hubble-relay --tail=20
# Verify Hubble is enabled on Cilium agents
kubectl exec -n kube-system ds/cilium -- cilium config | grep hubble

Completion Checklist

  • Flannel removed from K3s config (flannel-backend: none)
  • Old CNI config and interfaces cleaned up
  • Cilium installed with kube-proxy replacement
  • All pods restarted and running with Cilium networking
  • DNS resolution working across namespaces
  • Pod-to-pod connectivity verified
  • External connectivity verified
  • Hubble UI accessible (ingress or port-forward)
  • Hubble CLI installed and observing flows
  • Hubble metrics flowing to Prometheus
  • Cilium and Hubble Grafana dashboards imported
  • L3/L4 CiliumNetworkPolicies replacing K8s NetworkPolicies
  • L7 HTTP policies deployed (Grafana→Prometheus, ArgoCD→Gitea)
  • DNS restriction policy deployed
  • Transparent WireGuard encryption enabled and verified
  • Traffic between nodes confirmed encrypted (tcpdump)
  • Velero backup taken before migration
  • Rollback procedure documented and tested
  • Performance benchmark completed (Flannel vs Cilium comparison)
  • All CiliumNetworkPolicies committed to gitops-apps/security/cilium/

Source: docs/guides/23-events-log-analytics.md


Guide 23: Events & Log Analytics

Capture Kubernetes events, enrich logs with structured parsing, and build log-based alerting and anomaly detection on top of Loki and Grafana Alloy.


Overview

Kubernetes events disappear after 1 hour by default. This guide configures Grafana Alloy to persist events and container logs to Loki, adds structured parsing and enrichment, enables log-based alerting via Loki ruler, and builds analytics dashboards for pattern detection.

Time Required: ~75 minutes Prerequisites: Guide 10 (Monitoring Stack) completed

              Events & Log Analytics Pipeline
    ┌─────────────────────────────────────────────────┐
    │              Data Sources                         │
    │                                                  │
    │  K8s Events (etcd, 1hr TTL)                     │
    │  Container Logs (stdout/stderr)                  │
    │  Application Logs (JSON, text)                   │
    └──────────────────┬──────────────────────────────┘
                       ▼
    ┌─────────────────────────────────────────────────┐
    │          Grafana Alloy (already installed)        │
    │                                                  │
    │  loki.source.kubernetes_events  → Loki           │
    │  loki.source.pod_logs           → Loki           │
    │  loki.process stages:                            │
    │    - JSON parsing                                │
    │    - Label extraction (namespace, pod, app)      │
    │    - Timestamp normalization                     │
    │    - Drop noisy logs                             │
    └──────────────────┬──────────────────────────────┘
                       ▼
    ┌─────────────────────────────────────────────────┐
    │               Loki (already installed)            │
    │                                                  │
    │  LogQL queries · Ruler alerts · Analytics        │
    └──────────────────┬──────────────────────────────┘
                       ▼
    ┌─────────────────────────────────────────────────┐
    │          Grafana Dashboards & Alerts              │
    │                                                  │
    │  Error rate · Event timeline · Log volume         │
    │  Anomaly detection · Top errors · Correlation    │
    └─────────────────────────────────────────────────┘

Phase 1: Kubernetes Event Capture

1.1 Configure Alloy to Capture Events

Grafana Alloy already runs as a DaemonSet in the logging namespace. Add a kubernetes_events source to its configuration.

Create gitops-apps/monitoring/alloy-events-config.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: alloy-event-capture
  namespace: logging
data:
  event-capture.alloy: |
    // ── Kubernetes Event Capture ───────────────────────
    loki.source.kubernetes_events "events" {
      job_name   = "kubernetes-events"
      log_format = "json"
      namespaces = []  // empty = all namespaces

      forward_to = [loki.process.event_enrichment.receiver]
    }

    // ── Event Enrichment ───────────────────────────────
    loki.process "event_enrichment" {
      // Extract key fields as labels for querying
      stage.json {
        expressions = {
          reason   = "reason",
          kind     = "involvedObject.kind",
          name     = "involvedObject.name",
          ns       = "involvedObject.namespace",
          severity = "type",
        }
      }

      // Set Loki labels from extracted fields
      stage.labels {
        values = {
          reason   = "",
          kind     = "",
          severity = "",
        }
      }

      // Add a static job label to every event line
      stage.static_labels {
        values = {
          job = "kubernetes-events",
        }
      }

      forward_to = [loki.write.homelab.receiver]
    }

1.2 Update Alloy to Include Event Capture

# The Helm chart renders a single Alloy config file (alloy.configMap.content), and Alloy
# has no plain file-include statement — so the simplest approach is to append the event
# capture blocks to that content. Sketch below: keep your existing logging config and
# paste the blocks from event-capture.alloy after it.
cat > alloy-events-values.yaml <<'EOF'
alloy:
  configMap:
    content: |
      // Existing logging config (pod log collection, loki.write "homelab", ...)
      // ...

      // Event capture blocks from alloy-event-capture (event-capture.alloy):
      // loki.source.kubernetes_events "events" { ... }
      // loki.process "event_enrichment" { ... }
EOF

helm upgrade alloy grafana/alloy \
  --namespace logging \
  --reuse-values \
  -f alloy-events-values.yaml \
  --wait

1.3 Verify Event Capture

# Check Alloy is processing events
kubectl logs -n logging -l app.kubernetes.io/name=alloy --tail=20 | grep kubernetes-events

# Query events in Grafana Explore
# LogQL: {job="kubernetes-events"}
# Or filter: {job="kubernetes-events"} |= "Failed"

# Trigger an event for testing
kubectl run event-test --image=invalid-image-that-does-not-exist -n default
# Wait 30 seconds, then check Loki
# Query: {job="kubernetes-events"} |= "event-test"
kubectl delete pod event-test -n default --force

Phase 2: Structured Log Parsing

2.1 Configure Log Pipeline Stages

Create gitops-apps/monitoring/alloy-log-pipeline-config.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: alloy-log-pipeline
  namespace: logging
data:
  log-pipeline.alloy: |
    // ── Pod Log Collection with Enrichment ─────────────
    discovery.kubernetes "pods" {
      role = "pod"
    }

    loki.source.kubernetes "pod_logs" {
      targets    = discovery.kubernetes.pods.targets
      job_name   = "integrations/kubernetes/pod_logs"
      forward_to = [loki.process.pod_enrichment.receiver]
    }

    // ── Log Enrichment Pipeline ────────────────────────
    loki.process "pod_enrichment" {
      // Stage 1: Parse JSON logs (apps using structured logging)
      stage.json {
        expressions = {
          level     = "",
          msg       = "",
          timestamp = "",
          logger    = "",
          traceID   = "",
          spanID    = "",
        }
      }

      // Stage 2: Set level as label for fast filtering
      stage.labels {
        values = {
          level = "",
        }
      }

      // Stage 3: Kubernetes metadata (namespace, pod, container) is already
      // attached as labels by discovery.kubernetes, so no extra stage is needed here

      // Stage 4: Drop noisy system logs to reduce storage
      stage.match {
        selector = '{namespace="kube-system"} |= "kube-proxy"'
        action   = "drop"
        drop_counter_reason = "noisy-system-logs"
      }

      stage.match {
        selector = '{namespace="longhorn-system"} |~ "instance manager client.*connect"'
        action   = "drop"
        drop_counter_reason = "longhorn-heartbeat-noise"
      }

      // Stage 5: Normalize log levels — force level="error" for plain-text
      // lines that look like errors but carry no level label yet
      stage.match {
        selector = '{level=""} |~ "(?i)error|err|fatal|panic"'
        stage.static_labels {
          values = {
            level = "error",
          }
        }
      }

      stage.static_labels {
        values = {
          cluster = "homelab-k3s",
        }
      }

      forward_to = [loki.write.homelab.receiver]
    }

2.2 Apply the Pipeline

kubectl apply -f gitops-apps/monitoring/alloy-log-pipeline-config.yaml

# Restart Alloy to pick up new config
kubectl rollout restart daemonset -n logging alloy
kubectl rollout status daemonset -n logging alloy
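
A hedged check that the pipeline is attaching the new level label (Loki label-values API; values only appear once matching logs have been ingested):

kubectl exec -n logging loki-0 -- wget -qO- 'http://localhost:3100/loki/api/v1/label/level/values'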

Phase 3: Log-Based Alerting

3.1 Enable Loki Ruler

Loki's ruler component evaluates LogQL expressions and fires alerts. Update the Loki Helm values:

# Ruler config sketch for the grafana/loki chart — "configmap" is not a valid ruler
# storage backend, so use local storage and let the chart's rules sidecar copy
# ConfigMaps labeled `loki_rule` into the rules directory
cat > loki-ruler-values.yaml <<'EOF'
loki:
  rulerConfig:
    alertmanager_url: "http://kube-prometheus-stack-alertmanager.monitoring.svc.cluster.local:9093"
    storage:
      type: local
      local:
        directory: /rules
    rule_path: /tmp/loki-rules-tmp
    ring:
      kvstore:
        store: inmemory

# The chart's rules sidecar watches ConfigMaps labeled `loki_rule`
sidecar:
  rules:
    enabled: true
EOF

helm upgrade loki grafana/loki \
  --namespace logging \
  --reuse-values \
  -f loki-ruler-values.yaml \
  --wait

3.2 Create Log-Based Alert Rules

Create gitops-apps/monitoring/loki-log-alerts.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-ruler-rules
  namespace: logging
  labels:
    loki_rule: ""   # picked up by the Loki rules sidecar
data:
  log-alerts.yaml: |
    groups:
      - name: homelab-log-alerts
        rules:
          # High error rate across any service
          - alert: HighErrorRate
            expr: |
              sum(rate({level="error"}[5m])) by (namespace, job)
              /
              sum(rate({job=~".+"}[5m])) by (namespace, job)
              > 0.1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High error rate in {{ $labels.namespace }}/{{ $labels.job }}"
              description: "Error rate is {{ $value | humanizePercentage }} in the last 5 minutes"

          # Vault seal/unseal events
          - alert: VaultSealEvent
            expr: |
              sum(count_over_time({app="vault"} |= "core: seal" [5m])) > 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Vault sealed unexpectedly"
              description: "Vault in namespace security has been sealed"

          # Pod OOMKilled events
          - alert: PodOOMKilled
            expr: |
              sum(count_over_time({job="kubernetes-events"} |= "OOMKilled" [10m])) by (name) > 0
            for: 1m
            labels:
              severity: warning
            annotations:
              summary: "Pod {{ $labels.name }} was OOMKilled"
              description: "A pod was killed due to out-of-memory. Consider increasing memory limits."

          # Image pull failures
          - alert: ImagePullFailure
            expr: |
              sum(count_over_time({job="kubernetes-events"} |~ "Failed.*ImagePull|ErrImagePull" [10m])) by (name) > 0
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Image pull failure for {{ $labels.name }}"

          # Longhorn volume issues
          - alert: LonghornVolumeError
            expr: |
              sum(count_over_time({namespace="longhorn-system"} |= "error" |~ "volume|replica" [5m])) > 3
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Longhorn volume errors detected"

          # CrashLoopBackOff events
          - alert: CrashLoopBackOffDetected
            expr: |
              sum(count_over_time({job="kubernetes-events"} |= "BackOff" |~ "CrashLoop" [5m])) by (name) > 0
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Pod {{ $labels.name }} in CrashLoopBackOff"

          # Falco high event rate (log-based)
          - alert: FalcoHighEventRate
            expr: |
              sum(rate({namespace="security"} |= "Falco" [5m])) > 5
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Unusually high Falco event rate"
              description: "Falco is generating {{ $value }} events/second via logs"

          # SSL/TLS certificate errors
          - alert: CertificateError
            expr: |
              sum(count_over_time({job=~".+"} |~ "certificate.*error|tls.*handshake.*fail|x509.*cert" [10m])) > 0
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "TLS/certificate errors detected"

kubectl apply -f gitops-apps/monitoring/loki-log-alerts.yaml

# Restart Loki to pick up ruler config
kubectl rollout restart statefulset -n logging loki

Phase 4: Log Analytics Dashboards

4.1 Kubernetes Events Dashboard

Create a Grafana dashboard with these panels:

| Panel | LogQL Query | Description |
|-------|-------------|-------------|
| Events over time | `sum(count_over_time({job="kubernetes-events"}[1h]))` | Event volume trends |
| Events by reason | `topk(10, sum(count_over_time({job="kubernetes-events"} [1h])) by (reason))` | Top event types |
| Events by namespace | `sum(count_over_time({job="kubernetes-events"} [1h])) by (ns)` | Which namespaces generate events |
| Warning events | `sum(count_over_time({job="kubernetes-events"} \|= "Warning" [1h]))` | Warning-level events |
| Failed pods | `count_over_time({job="kubernetes-events"} \|= "Failed" [1h])` | Pod failure events |
| Image pull failures | `count_over_time({job="kubernetes-events"} \|~ "ErrImagePull\|ImagePullBackOff" [24h])` | Image issues |

4.2 Error Analytics Dashboard

| Panel | LogQL Query | Description |
|-------|-------------|-------------|
| Error rate by namespace | `sum(rate({level="error"}[5m])) by (namespace)` | Error distribution |
| Top error messages | `topk(10, sum(count_over_time({level="error"} [1h])) by (msg))` | Most frequent errors |
| Error trend | `sum_over_time({level="error"} [1d])` | Daily error count |
| New errors (not seen before) | Custom query comparing time windows | First-seen errors |
| Error by service | `sum(rate({level="error"}[5m])) by (app)` | Which service errors most |

4.3 Log Volume Anomaly Dashboard

| Panel | LogQL Query | Description |
|-------|-------------|-------------|
| Total log volume | `sum(rate({job=~".+"} [5m]))` | Overall ingestion rate |
| Volume by namespace | `sum(rate({job=~".+"} [5m])) by (namespace)` | Log distribution |
| Volume spike detection | Compare current rate to 1h average | Unusual log surges |
| Dropped logs | `sum(rate(alloy_dropped_log_lines_total[5m])) by (namespace)` | Logs dropped by Alloy |
| Loki ingestion rate | `loki_distributor_lines_received_total` | Lines ingested per second |

4.4 Import via Provisioning

Create gitops-apps/monitoring/grafana-dashboards/events-analytics.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: events-analytics-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  events-analytics.json: |
    {
      "annotations": { "list": [] },
      "title": "Kubernetes Events & Log Analytics",
      "tags": ["homelab", "events", "logs"],
      "timezone": "utc",
      "panels": [
        {
          "title": "Events Over Time",
          "type": "timeseries",
          "datasource": { "type": "loki", "uid": "loki" },
          "targets": [{
            "expr": "sum(count_over_time({job=\"kubernetes-events\"}[$__interval]))",
            "refId": "A"
          }],
          "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }
        },
        {
          "title": "Top Event Reasons",
          "type": "barchart",
          "datasource": { "type": "loki", "uid": "loki" },
          "targets": [{
            "expr": "topk(10, sum(count_over_time({job=\"kubernetes-events\"} [1h])) by (reason))",
            "refId": "A"
          }],
          "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 }
        },
        {
          "title": "Warning Events",
          "type": "timeseries",
          "datasource": { "type": "loki", "uid": "loki" },
          "targets": [{
            "expr": "sum(count_over_time({job=\"kubernetes-events\"} |= \"Warning\" [$__interval])) by (reason)",
            "refId": "A",
            "legendFormat": "{{reason}}"
          }],
          "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 }
        },
        {
          "title": "Error Rate by Namespace",
          "type": "timeseries",
          "datasource": { "type": "loki", "uid": "loki" },
          "targets": [{
            "expr": "sum(rate({level=\"error\"}[5m])) by (namespace)",
            "refId": "A",
            "legendFormat": "{{namespace}}"
          }],
          "gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 }
        },
        {
          "title": "Log Volume by Namespace",
          "type": "timeseries",
          "datasource": { "type": "loki", "uid": "loki" },
          "targets": [{
            "expr": "sum(rate({job=~\".+\"}[5m])) by (namespace)",
            "refId": "A",
            "legendFormat": "{{namespace}}"
          }],
          "gridPos": { "h": 8, "w": 24, "x": 0, "y": 16 }
        }
      ],
      "refresh": "30s",
      "time": { "from": "now-1h", "to": "now" }
    }

kubectl apply -f gitops-apps/monitoring/grafana-dashboards/events-analytics.yaml

# Grafana will auto-import if dashboard sidecar is configured
# Otherwise: Dashboards → Import → paste JSON

Phase 5: Log-to-Trace Correlation

5.1 Configure TraceID Extraction

Add to the Alloy log pipeline:

    // Stage: Extract traceID from structured logs for Tempo correlation
    stage.json {
      expressions = {
        traceID = "traceID",
        spanID  = "spanID",
      }
    }

    // Link to Tempo trace in Grafana
    stage.labels {
      values = {
        traceID = "",
      }
    }

5.2 Configure Grafana Derived Fields

In Grafana, configure the Loki datasource derived fields to link to Tempo:

  1. Connections → Data Sources → Loki
  2. Derived fields → Add:
    • Name: TraceID
    • Regex: traceID=(\w+)
    • Datasource: Tempo
    • URL: leave empty (uses datasource)

This makes trace IDs in logs clickable — jumping directly to the trace in Tempo.
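
If you prefer keeping this in Git rather than clicking through the UI, the derived field can be provisioned with the datasource. A minimal sketch against the kube-prometheus-stack values (release name, datasource names, and UIDs are assumptions — if the Loki datasource is already provisioned elsewhere, add the derivedFields block there instead):

cat > grafana-loki-derived-fields-values.yaml <<'EOF'
grafana:
  additionalDataSources:
    - name: Loki
      type: loki
      uid: loki
      access: proxy
      url: http://loki.logging.svc.cluster.local:3100
      jsonData:
        derivedFields:
          - name: TraceID
            matcherRegex: "traceID=(\\w+)"
            datasourceUid: tempo          # internal link to the Tempo datasource
            url: "$${__value.raw}"        # interpreted as the trace query
EOF

helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --reuse-values \
  -f grafana-loki-derived-fields-values.yaml \
  --wait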


Phase 6: Log Retention & Storage

6.1 Configure Loki Retention

cat > loki-retention-values.yaml <<'EOF'
loki:
  limits_config:
    retention_period: 744h  # 31 days
    max_query_length: 721h
    ingestion_rate_mb: 10
    ingestion_burst_size_mb: 20

  compactor:
    working_directory: /loki/compactor
    compaction_interval: 10m
    retention_enabled: true
    retention_delete_delay: 2h
    delete_request_store: filesystem

  schema_config:
    configs:
      - from: 2024-01-01
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
EOF

helm upgrade loki grafana/loki \
  --namespace logging \
  --reuse-values \
  -f loki-retention-values.yaml \
  --wait

6.2 Monitor Log Storage

# Check Loki storage usage
kubectl exec -n logging loki-0 -- du -sh /loki/
kubectl exec -n logging loki-0 -- du -sh /loki/chunks/ /loki/index/

# Check PVC usage
kubectl exec -n logging loki-0 -- df -h /loki

Phase 7: Useful LogQL Queries

7.1 Event Queries

# All Warning events
{job="kubernetes-events"} |= "Warning"

# Events for a specific pod
{job="kubernetes-events"} | json | name="my-pod"

# Failed scheduling events
{job="kubernetes-events"} |= "FailedScheduling"

# Eviction events
{job="kubernetes-events"} |= "Evicting"

# Recent events sorted by time
{job="kubernetes-events"} | json | line_format "{{.metadata.creationTimestamp}} [{{.type}}] {{.reason}}: {{.message}}"

7.2 Error Queries

# Errors in last hour by service
{level="error"} | json | line_format "{{.logger}}: {{.msg}}"

# Errors with stack traces
{level="error"} |~ "panic|fatal|stack trace"

# HTTP 5xx errors (LogQL requires a stream selector before line filters)
{job=~".+"} |~ "status_code=\"5[0-9][0-9]\"|HTTP 5[0-9][0-9]|\"status\":5[0-9][0-9]"

# Slow requests (>1s)
{job=~".+"} |~ "duration_ms=\"[1-9][0-9]{3,}\"|elapsed.*[0-9]+s"

7.3 Anomaly Detection Queries

# Log volume spike (current 5m rate more than 2x the 1h average — rate() is already
# per-second in both windows, so no extra scaling factor is needed)
sum(rate({job=~".+"} [5m])) / sum(rate({job=~".+"} [1h])) > 2

# New error messages (seen in the last 5m but not in the preceding hour)
sum(count_over_time({level="error"} [5m])) by (msg)
  unless
sum(count_over_time({level="error"} [1h] offset 5m)) by (msg)

Verification

# Verify Alloy is capturing events
kubectl logs -n logging -l app.kubernetes.io/name=alloy --tail=20 | grep "kubernetes-events"

# Verify events in Loki
kubectl exec -n monitoring deploy/kube-prometheus-stack-grafana -- \
  wget -qO- 'http://loki.logging.svc.cluster.local:3100/loki/api/v1/query?query={job="kubernetes-events"}' | jq '.data.result | length'

# Verify log-based alerts are loaded
kubectl get configmap loki-ruler-rules -n logging -o yaml | grep alert

# Verify Loki retention is configured (the /config endpoint returns YAML)
kubectl exec -n logging loki-0 -- wget -qO- 'http://localhost:3100/config' | grep retention_period

# Verify dashboards (Grafana search API)
curl -s "http://admin:admin@10.10.10.10:30090/api/search?query=Events" | jq '.[].title'

# Generate test events and verify capture
kubectl run test-events --image=invalid-image -n default
sleep 30
# Query in Grafana: {job="kubernetes-events"} |= "test-events"
kubectl delete pod test-events -n default --force

Troubleshooting

No events in Loki

# Check that the event capture blocks made it into the config Alloy actually loads
kubectl exec -n logging ds/alloy -- sh -c "grep -r kubernetes_events /etc/alloy/"
# Check Alloy logs for errors
kubectl logs -n logging ds/alloy --tail=50 | grep -i error
# Verify Loki is receiving: check /loki/api/v1/labels

Loki ruler not firing alerts

# Check ruler is enabled
kubectl logs -n logging loki-0 --tail=50 | grep ruler
# Verify rules configmap
kubectl get cm loki-ruler-rules -n logging -o yaml
# Check ruler metrics
kubectl exec -n logging loki-0 -- wget -qO- 'http://localhost:3100/metrics' | grep ruler

Too much log volume

# Identify noisy sources
kubectl exec -n monitoring deploy/kube-prometheus-stack-grafana -- \
  wget -qO- 'http://loki.logging.svc.cluster.local:3100/loki/api/v1/query?query=topk(10,sum(rate({job=~".+"}[1h]))by(namespace))'
# Add more drop stages in Alloy config for noisy namespaces

Grafana not showing Loki data

# Test Loki query directly
kubectl exec -n monitoring deploy/kube-prometheus-stack-grafana -- \
  wget -qO- 'http://loki.logging.svc.cluster.local:3100/ready'
# Expected: ready
# Check datasource URL in Grafana: http://loki.logging.svc.cluster.local:3100

Completion Checklist

  • Grafana Alloy configured with loki.source.kubernetes_events
  • Kubernetes events persisted to Loki (survive past 1hr TTL)
  • Events enriched with labels (reason, kind, severity, namespace)
  • Alloy log pipeline stages configured (JSON parsing, label extraction, noise filtering)
  • Loki ruler enabled with AlertManager integration
  • Log-based alert rules created (error rate, OOMKilled, ImagePull, CrashLoop, Falco, certs)
  • Kubernetes Events dashboard created in Grafana
  • Error Analytics dashboard created in Grafana
  • Log Volume Anomaly dashboard created in Grafana
  • TraceID extraction configured for log-to-trace correlation
  • Grafana derived fields link Loki logs to Tempo traces
  • Loki retention set to 31 days
  • Log storage monitored (PVC usage)
  • LogQL query reference documented
  • All configs committed to gitops-apps

Source: docs/guides/README.md


Implementation Guides

Comprehensive step-by-step guides for deploying your DevSecOps homelab infrastructure.


Guide Index

| Guide | Description | Target Node |
|-------|-------------|-------------|
| 01 - Local Setup | Configure MacBook with TF/Ansible | Local |
| 02 - Proxmox Cluster | Form Cluster & SDN (VXLAN) | Vader (Master) |
| 03 - Terraform | Provision 24/7 VMs & Hack Box | Vader/Sidious |
| 04 - Ansible | Bootstrap OS hardening & K3s Prereqs | All Nodes |
| 05 - K3s Cluster | Deploy Kubernetes (Server/Agent) | Vader/Sidious |
| 06 - Longhorn HA | Distributed Block Storage | K3s Nodes |
| 07 - GitOps Stack | Deploy Gitea & ArgoCD | K3s Cluster |
| 08 - Security Tooling | Vault, Falco, Trivy, Kyverno | K3s Cluster |
| 09 - Red/Blue Team | Deploy Kali & Security Sandboxes | Maul (Hack Box) |
| 10 - LGTM Stack | Loki, Grafana, Tempo, Prometheus | K3s Cluster |
| 11 - Identity & SSO | Authelia & Active Directory | K3s Cluster |
| 12 - CI/CD Pipeline | Gitea Actions, gitleaks, security gates | K3s Cluster |
| 13 - Supply Chain | Cosign, Syft, Grype, Sigstore | K3s Cluster |
| 14 - IaC Security | tfsec, Checkov, Terrascan, Conftest | K3s Cluster |
| 15 - Cert Manager | TLS automation, private CA | K3s Cluster |
| 16 - Backup & DR | Velero, PBS, restore runbooks | K3s Cluster |
| 17 - Compliance | kube-bench, CIS, OpenSCAP | All Nodes |
| 18 - DAST | OWASP ZAP automated scanning | Maul (Sandbox) |
| 19 - Chaos Eng | Chaos Mesh resilience testing | K3s Cluster |
| 20 - Policy as Code | OPA Gatekeeper, Conftest policies | K3s Cluster |
| 21 - Incident Response | Falco→AlertManager, runbooks | K3s Cluster |
| 22 - Network Security | Cilium CNI, Hubble, WireGuard | K3s Cluster |
| 23 - Events & Logs | K8s events capture, Loki ruler, log analytics | K3s Cluster |

IP Address Reference

| Component | Network | IP Address | Host Node |
|-----------|---------|------------|-----------|
| pve-vader | Physical | 192.168.1.11 | Master |
| pve-sidious | Physical | 192.168.1.12 | 24/7 Node |
| pve-maul | Physical | 192.168.1.10 | Hack Box |
| pfSense LAN | VNet1 | 10.10.10.1 | Vader |
| AdGuard Home | VNet1 | 10.10.10.2 | Vader |
| Tailscale | VNet1 | 10.10.10.3 | Vader |
| K3s Master | VNet1 | 10.10.10.10 | Vader |
| K3s Worker 1 | VNet1 | 10.10.10.12 | Sidious |
| Kali Linux | VNet2 | 10.20.20.10 | Maul |

Execution Phase Outcomes

Phase 1: Foundation (Guides 01-04)

Outcome: Proxmox cluster healthy, SDN configured, VMs provisioned via IaC.

Phase 2: Kubernetes (Guides 05-06)

Outcome: HA cluster running with distributed storage backed by physical SATA SSDs.

Phase 3: Platform (Guides 07-08)

Outcome: GitOps engine (ArgoCD) and Secrets (Vault) operational.

Phase 4: Operations (Guides 09-11)

Outcome: Full LGTM observability, isolated security labs, and Single Sign-On (SSO) active.

Phase 5: Shift-Left Security (Guides 12-14)

Outcome: CI/CD pipeline with security gates, image signing and SBOMs, IaC scanning in pipeline.

Phase 6: Infrastructure Hardening (Guides 15-17)

Outcome: Automated TLS certificates, backup/DR with tested restore procedures, CIS compliance.

Phase 7: Advanced Security (Guides 18-20)

Outcome: DAST scanning against vulnerable apps, chaos engineering for resilience, policy-as-code with OPA Gatekeeper.

Phase 8: Response & Network (Guides 21-22)

Outcome: Complete incident response pipeline (Falco→AlertManager→Grafana), Cilium eBPF networking with WireGuard encryption and Hubble observability.

Phase 9: Observability Analytics (Guide 23)

Outcome: Kubernetes events persisted beyond 1hr TTL, log-based alerting via Loki ruler, structured log parsing, and analytics dashboards for pattern detection and anomaly identification.


Source: docs/README.md


DevSecOps Homelab Documentation

Architecture for a 3-node Proxmox and Kubernetes homelab tailored for DevSecOps, GitOps, and Observability.


Architecture Summary

Physical Infrastructure

| Node | Role | Hardware Pool | Workloads |
|------|------|---------------|-----------|
| pve-vader | Primary Master | NVMe + SATA SSD | pfSense, K3s Master, ArgoCD, Vault |
| pve-sidious | 24/7 Worker | NVMe + SATA SSD | K3s Worker, LGTM Stack Persistence |
| pve-maul | Hack Box | NVMe Only | Kali Linux, Red Team Sandboxes |

Key Technologies

  • Hypervisor: Proxmox VE 8.x with SDN (VXLAN Tunneling)
  • Kubernetes: K3s (Master on Vader, Workers on Sidious/Vader)
  • Storage: Longhorn HA (backed by host-replicated Virtual Disks)
  • Observability: LGTM Stack (Loki, Grafana, Tempo, Prometheus)
  • Networking: pfSense (Double NAT Gateway), Tailscale (Admin VPN)

Critical Design Decisions

1. 24/7 Operations via Vader/Sidious

Core networking (pfSense) and the Kubernetes Control Plane are pinned to pve-vader. Along with pve-sidious, these nodes maintain Proxmox quorum and cluster stability 24/7.

2. The Hack Box (Maul)

pve-maul is designated as an optional node. It is isolated at the firewall level (pfSense) to allow for high-risk security testing without compromising the stability of the management infrastructure.

3. Storage HA

To resolve host-path limitations, Longhorn uses virtual disks mapped from the physical SATA SSDs on Vader and Sidious. This ensures data persistence across node reboots.


Next Steps

  1. Physical Preparation: Complete the LVM Thin Pool setup in docs/checklist/storage-checklist.md.
  2. Infrastructure Build: Follow the docs/checklist/implementation-checklist.md starting from Phase 0.