To simulate a professional environment, the gitops-apps repository is structured to separate global infrastructure from individual product environments.
%%{init: {'theme': 'dark'}}%%
flowchart TB
subgraph GiteaOrg["Gitea Organization: homelab"]
subgraph GitopsApps["gitops-apps (Main Repository)"]
A1["argocd-apps/<br/>Root App Manifests"]
A2["production/<br/>prd-alpha, prd-beta"]
A3["development/<br/>dev-gamma, dev-delta"]
A4["infrastructure/<br/>Longhorn, LGTM Stack, Ingress"]
A5["security/<br/>Falco, Trivy, Kyverno"]
end
subgraph IaC["Infrastructure as Code"]
T1["terraform-proxmox/<br/>Proxmox VM Provisioning"]
AN1["ansible-playbooks/<br/>K3s & Service Bootstrap"]
end
end
The "Root" application manages the state of all other applications. This allows you to add a new "Product" just by adding a YAML file to the production/ or development/ folder.
%%{init: {'theme': 'dark'}}%%
flowchart TB
subgraph Root["Root Management"]
RootApp["root-application.yaml<br/>(The Master Sync)"]
end
subgraph Projects["ArgoCD AppProjects"]
InfraP["Project: infrastructure<br/>(Cluster Wide)"]
ProdP["Project: production<br/>(Strict Security)"]
DevP["Project: development<br/>(Audit Security)"]
end
subgraph Apps["Sync Targets"]
InfraP --> Longhorn & LGTMStack & Ingress
ProdP --> ProductAlpha & ProductBeta
DevP --> SandboxApp
end
RootApp --> InfraP & ProdP & DevP
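As an illustration of the pattern above, adding a new "Product" is just committing one Application manifest into the watched folder. A minimal sketch, assuming the Gitea URL, project name, and folder layout shown in these diagrams (adjust to match the real root-application.yaml and projects.yaml):

```yaml
# production/prd-alpha.yaml — hypothetical example of registering a new product
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prd-alpha
  namespace: argocd
spec:
  project: production            # AppProject with the strict security settings
  source:
    repoURL: https://gitea.homelab.local/homelab/gitops-apps.git
    targetRevision: main
    path: production/prd-alpha
  destination:
    server: https://kubernetes.default.svc
    namespace: prd-alpha
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```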
Secrets are never stored in Git. Instead, Git contains a "Secret Reference" that Vault uses to inject real credentials at runtime.
%%{init: {'theme': 'dark'}}%%
flowchart LR
subgraph Git["Git Repo"]
YAML["Deployment YAML<br/>(with Annotations)"]
end
subgraph K8s["Kubernetes Cluster"]
Pod["Application Pod"]
Sidecar["Vault Agent (Sidecar)"]
end
subgraph V["HashiCorp Vault"]
Store[("Encrypted Secrets")]
end
YAML -->|ArgoCD Sync| Pod
Sidecar -->|Auth via K8s ServiceAccount| V
Store -->|Inject as File| Sidecar
Sidecar -->|Mounted Volume| Pod
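In practice the "Secret Reference" is a set of Vault Agent injector annotations on the workload's pod template. A minimal sketch, assuming the homelab-apps role and KV mount from Phase 7; the secret path and file name are illustrative:

```yaml
spec:
  template:
    metadata:
      annotations:
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/role: "homelab-apps"
        # Renders the KV-v2 secret to /vault/secrets/db-creds inside the pod
        vault.hashicorp.com/agent-inject-secret-db-creds: "homelab/data/product-alpha/db"
```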
All diagrams use Mermaid syntax. Render in VS Code with Mermaid extension or view on GitHub.
Since pve-vader and pve-sidious are online 24/7, all critical infrastructure runs on them. pve-maul is used only for ephemeral sandbox testing.
%%{init: {'theme': 'dark'}}%%
flowchart TB
subgraph pveVader["pve-vader (24/7 Master)"]
subgraph VaderVMs["Proxmox VMs/LXCs"]
pfS["pfSense VM"]
DNS["AdGuard Home"]
VPN["Tailscale"]
end
subgraph NodeV["k3s-master-01"]
K3sS["K3s Control Plane"]
Argo["ArgoCD"]
Vault["Vault (Leader)"]
end
end
subgraph pveSidious["pve-sidious (24/7 Node)"]
subgraph NodeS["k3s-worker-01"]
Prom["Prometheus"]
Loki["Loki Storage"]
Tempo["Tempo Tracing"]
Vault2["Vault (Follower)"]
end
end
subgraph pveMaul["pve-maul (Hack Box)"]
subgraph NodeM["k3s-sandbox-01"]
Kali["Kali Linux"]
Vuln["Vulnerable Targets"]
end
end
style pveVader fill:#0d47a1,stroke:#2196f3
style pveSidious fill:#1b5e20,stroke:#4caf50
style pveMaul fill:#4a1c1c,stroke:#ff5252
%%{init: {'theme': 'dark'}}%%
flowchart TB
subgraph Apps["Instrumented Applications"]
ProductA["Product A"]
ProductB["Product B"]
end
subgraph OTel["Telemetry Collection"]
Collector["OTel Collector (Hub)"]
end
subgraph Storage["LGTM Persistence Layer"]
Loki[("Loki (Logs)")]
Prom[("Prometheus (Metrics)")]
Tempo[("Tempo (Traces)")]
end
subgraph Visualization["Visualization"]
Grafana["Grafana Central"]
end
Apps -->|OTLP| Collector
Collector -->|Push| Loki
Collector -->|Push| Tempo
Prom -->|Scrape| Collector
Loki & Prom & Tempo --> Grafana
style OTel fill:#7b1fa2,stroke:#ab47bc
style Storage fill:#1565c0,stroke:#42a5f5
With pve-maul off, the 3-replica HA services (Vault, Longhorn) enter a "Degraded but Available" state.
%%{init: {'theme': 'dark'}}%%
flowchart LR
subgraph Cluster["Primary Lab Cluster"]
Node1["Vader (ON)"]
Node2["Sidious (ON)"]
Node3["Maul (OFF)"]
end
subgraph HAState["HA Service Status (e.g. Vault)"]
Replica1["Replica 1 (Active)"]
Replica2["Replica 2 (Standby)"]
Replica3["Replica 3 (Offline)"]
end
Node1 --- Replica1
Node2 --- Replica2
Node3 -.- Replica3
QuorumNote["2/3 Replicas Online = Quorum Maintained"]
Longhorn does not access the Proxmox host directly. It uses Virtual Disks backed by the physical SATA SSDs.
%%{init: {'theme': 'dark'}}%%
flowchart TB
subgraph HostV["pve-vader"]
SATA_V["SATA SSD"]
ThinV["LVM Thin Pool"]
VDisk_V["Virtual Disk (/dev/sdb)"]
end
subgraph VM_V["k3s-master-01"]
LH_V["Longhorn Engine"]
end
SATA_V --> ThinV --> VDisk_V --> VM_V
VM_V -->|Mount| LH_V
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#1a1a2e', 'primaryTextColor': '#eaeaea', 'primaryBorderColor': '#16213e', 'lineColor': '#0f3460', 'secondaryColor': '#16213e', 'tertiaryColor': '#0f3460'}}}%%
flowchart TB
subgraph Internet["🌐 Internet"]
WAN((WAN))
end
subgraph Home["🏠 Home Network (192.168.1.0/24)"]
Router["AX3000 Router (No Bridge Mode)<br/>192.168.1.1"]
DevMachine["Dev Machine<br/>(MacBook)"]
end
subgraph Proxmox["Proxmox VE Cluster"]
subgraph Vader["pve-vader (24/7 Master)"]
pfSense["pfSense VM (Double NAT)<br/>WAN: 192.168.1.x<br/>LAN: 10.10.10.1"]
AdGuard["AdGuard LXC<br/>DNS: 10.10.10.2"]
Tailscale["Tailscale LXC<br/>VPN: 10.10.10.3"]
K3sMaster["K3s Master VM<br/>10.10.10.10"]
end
subgraph Sidious["pve-sidious (24/7 Node)"]
K3sWorker1["K3s Worker VM<br/>10.10.10.12"]
BlueVM1["Blue Team VMs<br/>Wazuh/ELK"]
end
subgraph Maul["pve-maul (Hack Box - Optional)"]
K3sWorker2["Sandbox K3s Node"]
RedVM1["Red Team VMs<br/>Kali/Targets"]
end
end
WAN --> Router
Router -->|Physical Link| pfSense
Router --> DevMachine
pfSense --> AdGuard
pfSense --> K3sMaster
pfSense --> K3sWorker1
pfSense --> K3sWorker2
pfSense --> RedVM1
pfSense --> BlueVM1
style Vader fill:#0f3460,stroke:#e94560
style Sidious fill:#1a1a2e,stroke:#e94560
style Maul fill:#1a1a2e,stroke:#e94560
%%{init: {'theme': 'dark'}}%%
flowchart TB
subgraph Physical["Actual Hardware (lsblk verified)"]
subgraph VaderHW["pve-vader (Master)"]
V_CPU["Intel i5-8500"]
V_NVMe["477GB NVMe (OS/VMs)"]
V_SATA["238GB SATA SSD (Longhorn)"]
end
subgraph SidiousHW["pve-sidious (Worker)"]
S_CPU["Intel i5-8500"]
S_NVMe["477GB NVMe (OS/VMs)"]
S_SATA["238GB SATA SSD (Longhorn)"]
end
subgraph MaulHW["pve-maul (Hack Box)"]
M_CPU["Intel i5-8500T"]
M_NVMe["238GB NVMe (Limited)"]
end
end
subgraph VirtualVMs["Critical VM Placement"]
pfSenseVM["pfSense (on Vader)"]
K3sServer["K3s Master (on Vader)"]
K3sWorker1["K3s Worker (on Sidious)"]
KaliVM["Kali Linux (on Maul)"]
end
V_NVMe --> pfSenseVM
V_NVMe --> K3sServer
S_NVMe --> K3sWorker1
M_NVMe --> KaliVM
V_SATA -.->|Virtio Disk| K3sServer
S_SATA -.->|Virtio Disk| K3sWorker1
| Source | Destination | Action | Purpose |
|---|---|---|---|
| LAN (10.10.10.0/24) | WAN | ALLOW | Internet access |
| OPT (10.20.20.0/24) | WAN | ALLOW | Sandbox updates |
| OPT (10.20.20.0/24) | LAN (10.10.10.0/24) | BLOCK | Isolate hack box from lab |
| LAN (10.10.10.0/24) | Home (192.168.1.0/24) | BLOCK | Isolate lab from family |
| Tailscale Subnet | LAN | ALLOW | Remote admin access |
Status Legend:
[ ] Not Started | [~] In Progress | [x] Complete | [-] Skipped/N/A
- Verify all 3 nodes have Proxmox VE 8.x installed
- Confirm network connectivity between all nodes
- Verify storage availability on each node:
- pve-vader: 477GB NVMe + 238GB SATA SSD
- pve-sidious: 477GB NVMe + 238GB SATA SSD
- pve-maul: 238GB NVMe (Hack Box - 141GB local-lvm)
- Document MAC addresses and IP assignments
- Configure BIOS/UEFI settings (virtualization enabled, power management)
- Reserve IP addresses on router for static assignments
- Document final IP scheme:
- Proxmox management IPs (192.168.1.x)
- VNet1 Homelab-Net (10.10.10.0/24)
- VNet2 Sandbox-Net (10.20.20.0/24)
- Plan Tailscale subnet router placement
- Install Terraform >= 1.6
- Install Ansible >= 2.15
- Install kubectl
- Install helm
- Install Tailscale client
- Configure SSH keypair for infrastructure access
- Install Proxmox Terraform provider credentials
- Create Proxmox cluster on pve-vader (Master)
pvecm create homelab-cluster --link0 192.168.1.11
- Join pve-sidious to cluster (24/7 Node)
pvecm add 192.168.1.11 --link0 192.168.1.12
- Join pve-maul to cluster (Hack Box)
pvecm add 192.168.1.11 --link0 192.168.1.10
- Verify cluster quorum:
pvecm status (expect Quorate: Yes, Nodes: 3)
- Physical Prep: Create LVM thin pools on vader/sidious SATA SSDs (see storage-checklist.md § Storage Operations)
  - Run wipefs, pvcreate, vgcreate vg-longhorn, lvcreate on both nodes
- Register storage pool in Proxmox UI: Datacenter → Storage → Add → LVM-Thin
  - ID: vg-longhorn | VG: vg-longhorn | Thin Pool: tp-longhorn | Nodes: vader, sidious
  - (Required before Terraform can provision virtual disks from this pool)
- Verify pool visible: pvesm status on each node
- Enable directory storage for ISOs/templates on each node
- Upload Ubuntu 24.04 Cloud-Init image to all nodes (used as VM template)
- Enable Proxmox SDN via UI: Datacenter → SDN → Zones → Add → VXLAN
  - Zone ID: vxlan-zone | MTU: 1450 | Nodes: vader, sidious, maul
- Create VNet1: ID vnet-homelab | Tag: 100 | CIDR: 10.10.10.0/24
- Create VNet2: ID vnet-sandbox | Tag: 200 | CIDR: 10.20.20.0/24
- Apply SDN configuration: pvesh create /cluster/sdn --action apply
- Verify VNets are visible on all nodes: pvesh get /cluster/sdn/vnets
Reference: Guide 01 (Local Setup) | Guide 03 (Terraform) | Guide 04 (Ansible)
- Install required tools (Guide 01): terraform >= 1.6, ansible >= 2.15, kubectl, helm, jq, yq
- Generate SSH keypair: ssh-keygen -t ed25519 -C "homelab" -f ~/.ssh/homelab
- Distribute public key to all Proxmox nodes
- Create Proxmox API token (root@pam → Datacenter → Permissions → API Tokens)
- Configure .envrc with PROXMOX_VE_ENDPOINT, PROXMOX_VE_API_TOKEN, KUBECONFIG
- Verify API access: curl -sk https://192.168.1.11:8006/api2/json/nodes | jq .
- Verify SSH access to all three nodes
- Create file terraform/environments/homelab/providers.tf (provider: bpg/proxmox)
- Create file terraform/environments/homelab/variables.tf
- Create file terraform/environments/homelab/main.tf
- Create reusable VM module: terraform/modules/vm/{main,variables,outputs}.tf
- Run terraform init in terraform/environments/homelab/
- Run terraform validate — expect zero errors
- Run terraform plan — review all resources before applying
- Create file ansible/ansible.cfg
- Create file ansible/inventories/homelab/hosts.yml with all node IPs/users
- Create file ansible/inventories/homelab/group_vars/all.yml
- Install required collections: ansible-galaxy collection install community.general ansible.posix
- Test inventory: ansible all -m ping -i ansible/inventories/homelab/hosts.yml
- Create roles: common, adguard, tailscale, k3s
- Provision VM via Terraform: 2 vCPU, 4GB RAM, 3 NICs (WAN, LAN, OPT)
- Configure LAN interface (10.10.10.1) and OPT (10.20.20.1)
- Set Firewall Rule: BLOCK OPT → LAN (Sandbox isolation)
- Provision AdGuard Home LXC (10.10.10.2)
- Provision Tailscale Subnet Router LXC (10.10.10.3)
- Configure DNS Rewrites for *.homelab.local
- Provision VM via Terraform: 4 vCPU, 8GB RAM, 100GB Disk
- Attach secondary SATA-backed Virtual Disk for Longhorn
- Bootstrap K3s Server with --cluster-init and --disable traefik (see the install sketch after this list)
- BACKUP: Configure automated etcd snapshots to local disk
- Provision Worker VM: 4 vCPU, 8GB RAM, 100GB Disk
- Attach secondary SATA-backed Virtual Disk for Longhorn
- Join worker to the master node
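A minimal sketch of the bootstrap commands behind the items above, using the standard K3s install script; the master IP matches this plan, and the token placeholder is illustrative:

```bash
# On k3s-master-01 — embedded etcd, Traefik disabled (ingress-nginx comes later)
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server --cluster-init --disable traefik" sh -

# Grab the join token from the master
sudo cat /var/lib/rancher/k3s/server/node-token

# On k3s-worker-01 — join the cluster
curl -sfL https://get.k3s.io | K3S_URL=https://10.10.10.10:6443 K3S_TOKEN=<node-token> sh -
```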
Note: Traefik is disabled at install time. ingress-nginx must be installed before any service can be exposed via Ingress.
- Install Ingress-Nginx via Helm (critical — replaces disabled Traefik)
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install ingress-nginx ingress-nginx/ingress-nginx -n ingress-nginx --create-namespace \
  --set controller.service.type=NodePort \
  --set controller.service.nodePorts.http=30080 \
  --set controller.service.nodePorts.https=30443
- Verify ingress controller is Running: kubectl get pods -n ingress-nginx
- Install cert-manager via Helm (TLS automation)
helm repo add jetstack https://charts.jetstack.io
helm install cert-manager jetstack/cert-manager -n cert-manager --create-namespace \
  --set crds.enabled=true
- Verify cert-manager pods are Running: kubectl get pods -n cert-manager
- (Optional, Phase 10+) Deploy cloudflared tunnel pod for public service exposure
Reference: Guide 06 | Guide 05 (K3s)
- Install iSCSI client on all K3s nodes: apt install -y open-iscsi && systemctl enable --now iscsid
- Install Longhorn via Helm:
helm repo add longhorn https://charts.longhorn.io
helm install longhorn longhorn/longhorn -n longhorn-system --create-namespace \
  --set defaultSettings.defaultDataPath=/mnt/longhorn \
  --set defaultSettings.defaultReplicaCount=2 \
  --set defaultSettings.storageMinimalAvailablePercentage=10
- Verify all Longhorn pods Running: kubectl get pods -n longhorn-system
- Format and mount secondary disks inside VMs (see storage-checklist.md § Configure Longhorn):
  mkfs.ext4 /dev/sdb
  mkdir -p /mnt/longhorn
  echo '/dev/sdb /mnt/longhorn ext4 defaults,noatime,nofail 0 2' >> /etc/fstab && mount /mnt/longhorn
- Apply longhorn-node-config.yaml to register disks: kubectl apply -f longhorn-node-config.yaml
Important: longhorn-default (2 replicas) is set as the default StorageClass. With only vader + sidious active, a 3-replica class would immediately show Degraded. Only use longhorn-critical for workloads that truly require 3 replicas and where a degraded state is acceptable when Maul is offline. (A sketch of these StorageClass manifests follows the list below.)
- longhorn-default (2 replicas) — default StorageClass
- longhorn-critical (3 replicas — HA, for Vault/Gitea)
- longhorn-ephemeral (1 replica — cache/temp)
- Verify: kubectl get storageclass
- Verify Longhorn sees SATA disks: kubectl -n longhorn-system get nodes.longhorn.io -o wide
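A sketch of the three StorageClasses referenced above. The provisioner and numberOfReplicas parameter are standard Longhorn CSI settings; the reclaim policy and default-class annotation placement are assumptions:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-default
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "30"
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-critical
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "30"
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-ephemeral
provisioner: driver.longhorn.io
reclaimPolicy: Delete
parameters:
  numberOfReplicas: "1"
```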
Reference: Guide 07
- Add Bitnami Helm repo: helm repo add bitnami https://charts.bitnami.com/bitnami
- Deploy PostgreSQL (namespace: postgresql, storage: longhorn-critical) — see the Helm sketch after this list
- Add Gitea Helm repo: helm repo add gitea-charts https://dl.gitea.com/charts/
- Deploy Gitea (namespace: gitea, storage: longhorn-critical)
- Access Gitea via NodePort and complete initial setup
- Create Gitea organisation homelab and repositories: gitops-apps, terraform-proxmox, ansible-playbooks
- Store Gitea admin credentials in Vault (Phase 7), not in shell history
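A hedged sketch of the two Helm installs; the value keys shown (global.storageClass for the Bitnami chart, persistence/service settings for the Gitea chart) are assumptions to verify against each chart's current values.yaml:

```bash
helm install postgresql bitnami/postgresql -n postgresql --create-namespace \
  --set global.storageClass=longhorn-critical

helm install gitea gitea-charts/gitea -n gitea --create-namespace \
  --set persistence.storageClass=longhorn-critical \
  --set service.http.type=NodePort
```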
- Add Argo Helm repo: helm repo add argo https://argoproj.github.io/argo-helm
- Deploy ArgoCD (namespace: argocd, storage: longhorn-default)
  - Do not use --set server.dev.enabled=true; use a valid bcrypt hash for the admin password
- Access ArgoCD UI, change default admin password
- Connect ArgoCD to Gitea gitops-apps repository via SSH key (preferred) or HTTPS token
- Create repo folder structure: argocd-apps/, infrastructure/, services/, monitoring/, security/
- Commit argocd-apps/root-application.yaml and argocd-apps/projects.yaml to git
- Apply root application: kubectl apply -f gitops-apps/argocd-apps/root-application.yaml
- Verify ArgoCD shows root app as Synced/Healthy
- Add HashiCorp Helm repo: helm repo add hashicorp https://helm.releases.hashicorp.com
- Deploy Vault in standalone mode (not dev mode) with Longhorn-backed storage (longhorn-critical)
- Initialise Vault: vault operator init -key-shares=5 -key-threshold=3 — save unseal keys and root token securely
- Unseal Vault: provide 3 of 5 keys via vault operator unseal
- Enable KV-v2 engine: vault secrets enable -path=homelab kv-v2
- Enable Kubernetes auth: vault auth enable kubernetes
- Create homelab-apps policy and role in Vault (see the sketch after this list)
- Verify: vault status shows Sealed: false, HA Enabled: false (standalone)
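A minimal sketch of the homelab-apps policy and Kubernetes auth role. The KV path matches the homelab mount above; the bound namespaces and TTL are assumptions:

```bash
# Policy: read-only access to the homelab KV-v2 mount
vault policy write homelab-apps - <<'EOF'
path "homelab/data/*" {
  capabilities = ["read"]
}
EOF

# Role: bind Kubernetes service accounts to that policy (namespace list is an assumption)
vault write auth/kubernetes/role/homelab-apps \
  bound_service_account_names="*" \
  bound_service_account_namespaces="production,development" \
  policies="homelab-apps" \
  ttl=1h
```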
- Add Falco Helm repo: helm repo add falcosecurity https://falcosecurity.github.io/charts
- Deploy Falco with eBPF driver (driver.kind=ebpf) — requires kernel ≥ 4.14 with BTF
- Apply custom Falco rules ConfigMap
- Test: trigger a shell-in-container event and confirm Falco logs it (see the sketch after this list)
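A hedged sketch of the Falco install and a quick shell-in-container test; the namespace and label selector are assumptions based on the chart defaults:

```bash
helm install falco falcosecurity/falco -n falco --create-namespace \
  --set driver.kind=ebpf

# Trigger the default "shell spawned in a container" rule and look for the alert
kubectl run falco-test --image=alpine --restart=Never -- sleep 300
kubectl exec falco-test -- sh -c "id"
kubectl logs -n falco -l app.kubernetes.io/name=falco | grep -i shell
kubectl delete pod falco-test
```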
- Add Aqua Helm repo: helm repo add aqua https://aquasecurity.github.io/helm-charts/
- Deploy Trivy Operator (namespace: trivy-system) with in-cluster DB server enabled
- Add Kyverno Helm repo: helm repo add kyverno https://kyverno.github.io/kyverno/
- Deploy Kyverno (namespace: kyverno)
- Apply baseline policies: disallow-privileged, require-limits, disallow-latest-tag (start in audit mode — see the sketch after this list)
- Promote policies to enforce mode after validating no existing workloads violate them
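A minimal sketch of one of the baseline policies (disallow-latest-tag) in audit mode; the layout follows the Kyverno policy library, but treat the exact pattern as an assumption to validate before promoting to enforce:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Audit   # switch to Enforce after validation
  background: true
  rules:
    - name: require-image-tag
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Using a mutable image tag ':latest' is not allowed."
        pattern:
          spec:
            containers:
              - image: "!*:latest"
```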
- Download current Kali Linux installer ISO (not VMware image) and upload to pve-maul
- Provision Kali Linux VM (80GB Disk) via qm create on pve-maul, attached to vnet-sandbox
- Deploy DVWA and Juice Shop in isolated K3s namespaces (add WARNING: not production-safe label)
- Deploy Wazuh (SIEM) in blue-team namespace via Helm — use longhorn-default storage
- Configure Wazuh agents on K3s nodes and Proxmox hosts
- Verify Wazuh dashboard accessible and receiving alerts
Reference: Guide 10
- Add Prometheus community Helm repo: helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
- Deploy kube-prometheus-stack (includes Grafana) — namespace: monitoring
- Add Grafana Helm repo: helm repo add grafana https://grafana.github.io/helm-charts
- Deploy Loki (standalone chart, not deprecated loki-stack) — namespace: logging
- Deploy Promtail DaemonSet for log collection
- Configure Grafana Loki datasource: http://loki.logging.svc.cluster.local:3100
- Deploy Grafana Tempo — namespace: monitoring
- Configure Grafana Tempo datasource: http://tempo.monitoring.svc.cluster.local:3200 (port 3200, not 3100)
- Add OTel Helm repo: helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
- Deploy OpenTelemetry Collector with prometheusremotewrite exporter for metrics (see the sketch after this list)
- Deploy a test app with OTLP instrumentation and verify traces appear in Grafana Tempo
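A hedged sketch of OpenTelemetry Collector Helm values wiring OTLP in, metrics out via prometheusremotewrite, and traces out to Tempo. The service DNS names and the Prometheus remote-write endpoint (which must have the remote-write receiver enabled) are assumptions:

```yaml
mode: deployment
config:
  receivers:
    otlp:
      protocols:
        grpc: {}
        http: {}
  exporters:
    prometheusremotewrite:
      endpoint: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090/api/v1/write
    otlp/tempo:
      endpoint: tempo.monitoring.svc.cluster.local:4317
      tls:
        insecure: true
  service:
    pipelines:
      metrics:
        receivers: [otlp]
        exporters: [prometheusremotewrite]
      traces:
        receivers: [otlp]
        exporters: [otlp/tempo]
```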
- Create a single-node, disposable K3s cluster on the Hack Box
- Use for high-risk learning or malware analysis practice
| Check | Command | Expected Result |
|---|---|---|
| Proxmox Quorum | pvecm status | Quorate: Yes, Nodes: 3 |
| K3s Node Health | kubectl get nodes -o wide | All nodes Ready |
| Longhorn Storage | kubectl -n longhorn-system get nodes.longhorn.io | Both nodes healthy |
| ArgoCD Sync | kubectl get applications -n argocd | All apps Synced/Healthy |
| Vault Status | vault status | Sealed: false |
| Falco Running | kubectl get pods -n falco | All pods Running |
| Network Isolation | SSH to Kali: ping 10.10.10.2 | Request timeout |
| LGTM Traces | Deploy OTel test app | Traces visible in Grafana |
- All health checks above pass
- No Kyverno policy violations in audit logs: kubectl get policyreport -A
- Trivy vulnerability report generated: kubectl get vulnerabilityreports -A
- Weekly: Review security alerts
- Monthly: Update OS packages
- Quarterly: Update Kubernetes version
- PBS Setup: Provision a Proxmox Backup Server VM on pve-maul
  - Allocate 60GB-80GB for the backup datastore
- Configure deduplication and compression
- Set Retention Policy: Keep last 2 days of backups
- Automated Jobs: Schedule nightly backups for all critical VMs (Vader/Sidious)
- K8s Backups: Configure Longhorn recurring snapshots (every 4-8 hours — see the sketch after this list)
- Offsite (Optional): Configure Velero to sync critical metadata to S3/Local Storage
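For the K8s Backups item above, Longhorn recurring snapshots can be declared as a RecurringJob. A minimal sketch — the 6-hour cron, retention count, and group are assumptions within the 4-8 hour window:

```yaml
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: snapshot-every-6h
  namespace: longhorn-system
spec:
  cron: "0 */6 * * *"   # every 6 hours
  task: snapshot
  groups:
    - default            # applies to all volumes in the default group
  retain: 8
  concurrency: 2
```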
Reference: Guide 11
- Choose backend:
  - Option A: Windows Server 2022 VM on pve-vader — promote to Domain Controller (homelab.local)
  - Option B: Deploy lldap via ArgoCD (lightweight, recommended for DevSecOps learning)
- Create a standard user and homelab-admins group in the chosen directory
- Create a service account (e.g., authelia-bind) for Authelia to query LDAP
- Deploy Redis (session store) in the security namespace
  - Authelia requires Redis for session management — do not skip
- Re-use the existing PostgreSQL instance from Phase 6 (add an authelia database)
- Add Authelia Helm repo: helm repo add authelia https://charts.authelia.com
- Create authelia-values.yaml (a configuration sketch follows this list) with:
  - authentication_backend.ldap.url pointing to LLDAP/AD
  - storage.postgres pointing to the existing PostgreSQL
  - session.redis pointing to Redis
  - access_control rules (e.g., one_factor for internal tools, two_factor for ArgoCD/Vault)
- Deploy Authelia in the security namespace: helm install authelia authelia/authelia -n security -f authelia-values.yaml
- Integrate secret values via Vault (do not put passwords in authelia-values.yaml in plaintext)
- Configure ingress-nginx forward-auth annotations on all protected Ingress objects (see the sketch after this list)
- Configure Gitea as an OIDC client of Authelia (client_id, client_secret in Authelia config)
- Configure ArgoCD OIDC login via Authelia
- Protect Grafana, Longhorn UI, and ArgoCD behind Authelia 2FA
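A minimal sketch of the forward-auth annotations referenced above, on one protected Ingress; the Authelia service address and portal hostname are assumptions:

```yaml
metadata:
  annotations:
    nginx.ingress.kubernetes.io/auth-url: "http://authelia.security.svc.cluster.local:9091/api/verify"
    nginx.ingress.kubernetes.io/auth-signin: "https://auth.homelab.local"
    nginx.ingress.kubernetes.io/auth-response-headers: "Remote-User,Remote-Groups,Remote-Name,Remote-Email"
```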
- Open private/incognito window → navigate to https://argocd.homelab.local
- Login and verify access is granted with correct TOTP/LDAP credentials
- Confirm unauthenticated requests are rejected (HTTP 401/302)
Verified hardware output from each node and step-by-step storage operations for preparing Longhorn persistent storage.
pve-maul
root@pve-maul:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
nvme0n1 259:0 0 238.5G 0 disk
├─nvme0n1p1 259:1 0 1007K 0 part
├─nvme0n1p2 259:2 0 1G 0 part /boot/efi
└─nvme0n1p3 259:3 0 237.5G 0 part
├─pve-swap 252:0 0 8G 0 lvm [SWAP]
├─pve-root 252:1 0 69.4G 0 lvm /
├─pve-data_tmeta 252:2 0 1.4G 0 lvm
│ └─pve-data 252:4 0 141.2G 0 lvm
└─pve-data_tdata 252:3 0 141.2G 0 lvm
└─pve-data 252:4 0 141.2G 0 lvm
root@pve-maul:~# fdisk -l
Disk /dev/nvme0n1: 238.47 GiB, 256060514304 bytes, 500118192 sectors
Disk model: ADATA SX8200PNP
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: E61EB1D3-EC90-4BE5-96C2-7F88C8489778
Device Start End Sectors Size Type
/dev/nvme0n1p1 34 2047 2014 1007K BIOS boot
/dev/nvme0n1p2 2048 2099199 2097152 1G EFI System
/dev/nvme0n1p3 2099200 500118158 498018959 237.5G Linux LVM
Disk /dev/mapper/pve-swap: 8 GiB, 8589934592 bytes, 16777216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk /dev/mapper/pve-root: 69.37 GiB, 74482450432 bytes, 145473536 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
root@pve-maul:~# pvesm status
Name Type Status Total Used Available %
local dir active 71017632 3537964 63826448 4.98%
local-lvm lvmthin active 148086784 0 148086784 0.00%
root@pve-maul:~# cat /etc/pve/storage.cfg
dir: local
path /var/lib/vz
content iso,vztmpl,backup
lvmthin: local-lvm
thinpool data
vgname pve
content rootdir,images
root@pve-maul:~# df -h
Filesystem Size Used Avail Use% Mounted on
udev 12G 0 12G 0% /dev
tmpfs 2.4G 1.3M 2.4G 1% /run
/dev/mapper/pve-root 68G 3.4G 61G 6% /
tmpfs 12G 28M 12G 1% /dev/shm
efivarfs 118K 55K 59K 48% /sys/firmware/efi/efivars
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 1.0M 0 1.0M 0% /run/credentials/systemd-journald.service
tmpfs 12G 0 12G 0% /tmp
/dev/nvme0n1p2 1022M 8.8M 1014M 1% /boot/efi
/dev/fuse 128M 16K 128M 1% /etc/pve
tmpfs 1.0M 0 1.0M 0% /run/credentials/getty@tty1.service
tmpfs 2.4G 4.0K 2.4G 1% /run/user/0
root@pve-maul:~#
root@pve-maul:~# lsblk -f
NAME FSTYPE FSVER LABEL UUID FSAVAIL FSUSE% MOUNTPOINTS
nvme0n1
├─nvme0n1p1
├─nvme0n1p2 vfat FAT32 E77D-92ED 1013.2M 1% /boot/efi
└─nvme0n1p3 LVM2_member LVM2 001 vWU0Gw-xMdB-HvB7-Zoca-7UuH-q7ZB-GW1Rpg
├─pve-swap swap 1 ea4070e5-eefe-4d85-b66f-fa13be69926a [SWAP]
├─pve-root ext4 1.0 c3a3ef3d-b6a6-4cd2-85e6-bf756c3d1731 60.9G 5% /
├─pve-data_tmeta
│ └─pve-data
└─pve-data_tdata
  └─pve-data

pve-sidious
root@pve-sidious:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 238.5G 0 disk
└─sda1 8:1 0 238.5G 0 part
nvme0n1 259:0 0 476.9G 0 disk
├─nvme0n1p1 259:1 0 1007K 0 part
├─nvme0n1p2 259:2 0 1G 0 part /boot/efi
└─nvme0n1p3 259:3 0 475G 0 part
├─pve-swap 252:0 0 8G 0 lvm [SWAP]
├─pve-root 252:1 0 96G 0 lvm /
├─pve-data_tmeta 252:2 0 3.5G 0 lvm
│ └─pve-data 252:4 0 347.9G 0 lvm
└─pve-data_tdata 252:3 0 347.9G 0 lvm
└─pve-data 252:4 0 347.9G 0 lvm
root@pve-sidious:~# fdisk-l
-bash: fdisk-l: command not found
root@pve-sidious:~# fdisk -l
Disk /dev/nvme0n1: 476.94 GiB, 512110190592 bytes, 1000215216 sectors
Disk model: Colorful CN600 512GB PRO
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: EAC52235-00AD-4DF6-852C-0750ED95AE21
Device Start End Sectors Size Type
/dev/nvme0n1p1 34 2047 2014 1007K BIOS boot
/dev/nvme0n1p2 2048 2099199 2097152 1G EFI System
/dev/nvme0n1p3 2099200 998244352 996145153 475G Linux LVM
Disk /dev/sda: 238.47 GiB, 256060514304 bytes, 500118192 sectors
Disk model: Colorful SL500 2
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: DEC3BA58-81C3-4088-B0A4-3509FBE7EE5E
Device Start End Sectors Size Type
/dev/sda1 34 500117503 500117470 238.5G Linux filesystem
Disk /dev/mapper/pve-swap: 8 GiB, 8589934592 bytes, 16777216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk /dev/mapper/pve-root: 96 GiB, 103079215104 bytes, 201326592 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
root@pve-sidious:~# pvesm status
Name Type Status Total Used Available %
local dir active 98497780 3575700 89872532 3.63%
local-lvm lvmthin active 364797952 0 364797952 0.00%
root@pve-sidious:~# cat /etc/pve/storage.cfg
dir: local
path /var/lib/vz
content iso,vztmpl,backup
lvmthin: local-lvm
thinpool data
vgname pve
content rootdir,images
root@pve-sidious:~# df -h
Filesystem Size Used Avail Use% Mounted on
udev 16G 0 16G 0% /dev
tmpfs 3.2G 1.2M 3.2G 1% /run
/dev/mapper/pve-root 94G 3.5G 86G 4% /
tmpfs 16G 28M 16G 1% /dev/shm
efivarfs 150K 78K 68K 54% /sys/firmware/efi/efivars
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 1.0M 0 1.0M 0% /run/credentials/systemd-journald.service
tmpfs 16G 0 16G 0% /tmp
/dev/nvme0n1p2 1022M 8.8M 1014M 1% /boot/efi
/dev/fuse 128M 16K 128M 1% /etc/pve
tmpfs 1.0M 0 1.0M 0% /run/credentials/getty@tty1.service
tmpfs 3.2G 4.0K 3.2G 1% /run/user/0
root@pve-sidious:~# lsblk -f
NAME FSTYPE FSVER LABEL UUID FSAVAIL FSUSE% MOUNTPOINTS
sda
└─sda1 ext4 1.0 Files 52e05ce3-e51c-4a86-a4b2-5eacfd9b3096
nvme0n1
├─nvme0n1p1
├─nvme0n1p2 vfat FAT32 4AD4-24E4 1013.2M 1% /boot/efi
└─nvme0n1p3 LVM2_member LVM2 001 TQNOWn-EvuS-AYTt-qOZg-UnRA-iGdf-7GW5Mv
├─pve-swap swap 1 d9e824f3-7c1b-4f1b-ae5d-0fe6949e7849 [SWAP]
├─pve-root ext4 1.0 da7a9d2c-3d52-4fc1-adf9-c543be9dca5d 85.7G 4% /
├─pve-data_tmeta
│ └─pve-data
└─pve-data_tdata
  └─pve-data

pve-vader
root@pve-vader:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 238.5G 0 disk
└─sda1 8:1 0 238.5G 0 part
nvme0n1 259:0 0 476.9G 0 disk
├─nvme0n1p1 259:1 0 1007K 0 part
├─nvme0n1p2 259:2 0 1G 0 part /boot/efi
└─nvme0n1p3 259:3 0 475.9G 0 part
├─pve-swap 252:0 0 8G 0 lvm [SWAP]
├─pve-root 252:1 0 96G 0 lvm /
├─pve-data_tmeta 252:2 0 3.6G 0 lvm
│ └─pve-data 252:4 0 348.8G 0 lvm
└─pve-data_tdata 252:3 0 348.8G 0 lvm
└─pve-data 252:4 0 348.8G 0 lvm
root@pve-vader:~# fdisk -l
Disk /dev/nvme0n1: 476.94 GiB, 512110190592 bytes, 1000215216 sectors
Disk model: Colorful CN600 512GB PRO
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 0905714F-0DFC-4F16-9888-622B297706E0
Device Start End Sectors Size Type
/dev/nvme0n1p1 34 2047 2014 1007K BIOS boot
/dev/nvme0n1p2 2048 2099199 2097152 1G EFI System
/dev/nvme0n1p3 2099200 1000215182 998115983 475.9G Linux LVM
Disk /dev/sda: 238.47 GiB, 256060514304 bytes, 500118192 sectors
Disk model: Colorful SL500 2
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: A2A608B6-6638-48F6-AB47-461241BC3907
Device Start End Sectors Size Type
/dev/sda1 2048 500118158 500116111 238.5G Linux filesystem
Disk /dev/mapper/pve-swap: 8 GiB, 8589934592 bytes, 16777216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk /dev/mapper/pve-root: 96 GiB, 103079215104 bytes, 201326592 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
root@pve-vader:~# pvesm status
Name Type Status Total Used Available %
local dir active 98497780 3573856 89874376 3.63%
local-lvm lvmthin active 365760512 0 365760512 0.00%
root@pve-vader:~# cat /etc/pve/storage.cfg
dir: local
path /var/lib/vz
content iso,vztmpl,backup
lvmthin: local-lvm
thinpool data
vgname pve
content rootdir,images
root@pve-vader:~# df -h
Filesystem Size Used Avail Use% Mounted on
udev 16G 0 16G 0% /dev
tmpfs 3.2G 1.2M 3.2G 1% /run
/dev/mapper/pve-root 94G 3.5G 86G 4% /
tmpfs 16G 28M 16G 1% /dev/shm
efivarfs 150K 75K 71K 52% /sys/firmware/efi/efivars
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 1.0M 0 1.0M 0% /run/credentials/systemd-journald.service
tmpfs 16G 0 16G 0% /tmp
/dev/nvme0n1p2 1022M 8.8M 1014M 1% /boot/efi
/dev/fuse 128M 20K 128M 1% /etc/pve
tmpfs 1.0M 0 1.0M 0% /run/credentials/getty@tty1.service
tmpfs 3.2G 4.0K 3.2G 1% /run/user/0
root@pve-vader:~# lsblk -f
NAME FSTYPE FSVER LABEL UUID FSAVAIL FSUSE% MOUNTPOINTS
sda
└─sda1 ext4 1.0 6bed5c80-5e4c-4fd4-842c-96cf73d40e12
nvme0n1
├─nvme0n1p1
├─nvme0n1p2 vfat FAT32 7ED7-8304 1013.2M 1% /boot/efi
└─nvme0n1p3 LVM2_member LVM2 001 4nPkAt-XcQV-0Upy-odzR-pVkS-jf2P-PV2FRc
├─pve-swap swap 1 d3525fc7-22b0-4d5b-a5b8-e4fc0ec2cc60 [SWAP]
├─pve-root ext4 1.0 142fff43-9bd5-4ec8-8c57-798b66f37277 85.7G 4% /
├─pve-data_tmeta
│ └─pve-data
└─pve-data_tdata
  └─pve-data

pve-vader = 24x7 node (Master node for Proxmox and K3s)
pve-sidious = 24x7 node (Proxmox Quorum + K3s Worker)
pve-maul = Hack Box (Security lab - turned on for practice)
| Node | Role | VMs/LXCs | Disk Source |
|---|---|---|---|
| pve-vader | 24x7 Master | pfSense (32GB), K3s Master (100GB), AdGuard (8GB), Tailscale (4GB) | local-lvm (~415GB free) + SATA SSD (256GB) |
| pve-sidious | 24x7 Worker | K3s Worker 01 (100GB), Longhorn on SATA SSD | local-lvm (~415GB free) + SATA SSD (256GB) |
| pve-maul | Hack Box | Kali Linux (80GB), Security Sandboxes | local-lvm (141GB free) |
To make the SATA SSDs accessible to Longhorn (which runs inside K3s VMs), we will create a dedicated LVM thin pool on each host and attach a virtual disk to the VMs.
# === Run on BOTH pve-vader AND pve-sidious ===
# 1. Clear existing filesystem signatures (Critical)
wipefs -a /dev/sda1
# 2. Initialize as LVM Physical Volume
pvcreate /dev/sda1
# 3. Create a Volume Group on the SATA SSD
vgcreate vg-longhorn /dev/sda1
# 4. Create a Thin Pool
lvcreate -L 230G -T vg-longhorn/tp-longhorn
# 5. In Terraform (Guide 03), we will attach a disk from this pool to the K3s VMs
#    It will appear inside the VM as /dev/sdb

After deploying Longhorn (Phase 4), configure the storage path inside the K3s VM.
# === Run inside K3s Master and Workers ===
# 1. Format the second disk
mkfs.ext4 /dev/sdb
# 2. Mount it
mkdir -p /mnt/longhorn
echo '/dev/sdb /mnt/longhorn ext4 defaults,noatime,nofail 0 2' >> /etc/fstab
mount /mnt/longhorn

Apply node configuration via manifest (using K3s node names):
# longhorn-node-config.yaml
apiVersion: longhorn.io/v1beta2
kind: Node
metadata:
name: k3s-master-01 # Correct K3s node name
namespace: longhorn-system
spec:
disks:
sata-ssd:
path: /mnt/longhorn
storageReserved: 26843545600 # 25 GiB in bytes (Longhorn expects bytes, not a percentage)
tags:
- ssd
- longhorn
---
apiVersion: longhorn.io/v1beta2
kind: Node
metadata:
name: k3s-worker-01 # Correct K3s node name
namespace: longhorn-system
spec:
disks:
sata-ssd:
path: /mnt/longhorn
storageReserved: 26843545600 # 25 GiB in bytes (Longhorn expects bytes, not a percentage)
tags:
- ssd
- longhorn

kubectl apply -f longhorn-node-config.yaml

Caution
Take a Proxmox backup before proceeding. This is a destructive, irreversible operation on a live host. In the Proxmox UI, go to Datacenter → Backup → Backup Now for both pve-vader and pve-sidious and store the backup on a USB drive or pve-maul. Do not skip this step.
WARNING: This requires rebooting into rescue/live mode — a mounted root filesystem cannot be shrunk. Prerequisite: Boot from the Proxmox installer USB in rescue mode, or use a SystemRescue ISO.
# === Boot from Proxmox installer USB, select "Rescue Boot" ===
# Or boot a SystemRescue ISO via Proxmox ISO mount
# Activate LVM volumes
vgchange -ay
# Check filesystem before shrinking
e2fsck -f /dev/mapper/pve-root
# Shrink filesystem AND logical volume in one step (to 40GB)
# --resizefs shrinks the filesystem first, then the LV
lvreduce --resizefs -L 40G /dev/pve/root
# Extend the thin pool to consume freed space (~56GB per node)
lvextend -l +100%FREE /dev/pve/data
# Verify
lvs
# pve-root should show 40G
# pve-data (thin pool) should show ~404GB
# Reboot
exit
reboot

Expected result after shrinking both nodes:
| Node | pve-root | local-lvm (thin pool) |
|---|---|---|
| pve-vader | 40GB | ~404GB |
| pve-sidious | 40GB | ~404GB |
# After reboot, verify
pvesm status
df -h /
lvs

After Longhorn is deployed and configured (Phase 4):
# Verify Longhorn sees the SATA SSD disks on both worker nodes
kubectl -n longhorn-system get nodes.longhorn.io -o wide
# Check disk status
kubectl -n longhorn-system get disks.longhorn.io -A
# Verify storage capacity
kubectl -n longhorn-system get nodes.longhorn.io \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.diskStatus.sata-ssd.storageAvailable}{"\n"}{end}'
# Create a test PVC to verify replica scheduling
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: longhorn-test
namespace: default
spec:
accessModes:
- ReadWriteOnce
storageClassName: longhorn
resources:
requests:
storage: 1Gi
EOF
# Verify volume has 2 replicas across nodes
kubectl -n longhorn-system get volumes.longhorn.io
# Clean up test
kubectl delete pvc longhorn-test -n default

Expected capacity:
- pve-vader: ~236GB (SATA SSD) dedicated to Longhorn
- pve-sidious: ~236GB (SATA SSD) dedicated to Longhorn
- Total replicated capacity: ~472GB with 2 replicas (effective ~236GB usable)
Configure your development machine for infrastructure provisioning and cluster management.
This guide prepares your local macOS/Linux machine with all required tools for deploying and managing the homelab infrastructure.
Time Required: ~15 minutes Prerequisites: macOS or Linux machine, internet connection
| Tool | Version | Purpose |
|---|---|---|
| Terraform | >= 1.6 | Provision Proxmox VMs/LXCs |
| Ansible | >= 2.15 | Bootstrap services |
| kubectl | >= 1.28 | Kubernetes management |
| helm | >= 3.14 | Kubernetes package management |
| Tailscale | Latest | VPN access to homelab |
| direnv | Latest | Auto-load .envrc per project directory |
| jq | Latest | JSON processing |
| yq | Latest | YAML processing |
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Terraform
brew tap hashicorp/tap
brew install hashicorp/tap/terraform
# Ansible
brew install ansible
# kubectl
brew install kubectl
# helm
brew install helm
# Tailscale
brew install --cask tailscale
# direnv (auto-loads .envrc when you cd into the project directory)
brew install direnv
# Add to your shell init file (~/.zshrc or ~/.bashrc):
echo 'eval "$(direnv hook zsh)"' >> ~/.zshrc && source ~/.zshrc
# jq (JSON processor)
brew install jq
# yq (YAML processor)
brew install yq

terraform version
ansible --version
kubectl version --client
helm version
tailscale version
jq --version
yq --version

sudo tailscale up

Follow the browser prompt to authenticate.
sudo apt update && sudo apt upgrade -y

sudo apt install -y curl wget git software-properties-common

wget -O- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install terraform

sudo apt update
sudo apt install -y ansible

curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/

curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up

sudo apt install -y jq
wget https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -O /usr/bin/yq
chmod +x /usr/bin/yq

sudo apt install -y direnv
# Add to ~/.bashrc
echo 'eval "$(direnv hook bash)"' >> ~/.bashrc && source ~/.bashrc

Generate SSH keys for infrastructure access:
# Generate new key pair
ssh-keygen -t ed25519 -C "homelab" -f ~/.ssh/homelab
# Display public key (add to Proxmox nodes)
cat ~/.ssh/homelab.pub

Copy the public key and add it to each Proxmox node:
# On each Proxmox node
mkdir -p ~/.ssh
echo "YOUR_PUBLIC_KEY_HERE" >> ~/.ssh/authorized_keysCreate the project structure:
cd /Volumes/Codex/Projects/homelab
# Create directories
mkdir -p terraform/environments/homelab
mkdir -p ansible/inventories/homelab/group_vars
mkdir -p ansible/playbooks
mkdir -p ansible/roles
mkdir -p gitops-apps/{infrastructure,services,monitoring,security}
# Verify structure
tree -L 2 -d

Create the project environment file and activate it with direnv:
# Create .envrc file
cat > .envrc << 'EOF'
# Proxmox Credentials (bpg/proxmox provider env var names)
export PROXMOX_VE_ENDPOINT="https://192.168.1.11:8006"
export PROXMOX_VE_API_TOKEN="root@pam!terraform=<your-token-here>"
# Kubernetes
export KUBECONFIG="${HOME}/.kube/homelab-config"
# Terraform SSH key (injected into Cloud-Init)
export TF_VAR_ssh_public_key=$(cat ~/.ssh/homelab.pub 2>/dev/null || echo "")
# Tailscale
export TS_AUTHKEY="" # Optional: for automated Tailscale auth
# Git
export GIT_USERNAME="your-gitea-username"
export GIT_EMAIL="your-email@example.com"
EOF
# Allow direnv to load this file (run once per project)
# direnv will then auto-load/unload .envrc whenever you cd in/out of the directory
direnv allow .

Important
.envrc contains secrets. Make sure .envrc is in .gitignore before committing:
grep '.envrc' .gitignore || echo '.envrc' >> .gitignore

- Log into Proxmox web UI: https://192.168.1.11:8006
- Go to Datacenter > Permissions > API Tokens
- Click Add: Select root@pam
- Set Token ID: terraform
- Uncheck Privilege Separation (note: for production use, create a dedicated terraform@pve user with minimal permissions instead of root@pam)
- Click Generate
- Copy the token immediately (it won't be shown again)
# Edit .envrc and set the token
export PROXMOX_VE_API_TOKEN="root@pam!terraform=YOUR_TOKEN_HERE"
# Reload
direnv allow .

curl -s "https://192.168.1.11:8006/api2/json/nodes" \
-H "Authorization: PVEAPIToken=root@pam!terraform=YOUR_TOKEN" \
-k | jq .

Expected output:
{
"data": [
{"node": "vader", "status": "online", ...},
{"node": "sidious", "status": "online", ...},
{"node": "maul", "status": "online", ...}
  ]
}

# Test SSH to each node
ssh -i ~/.ssh/homelab root@192.168.1.11 "hostname"
ssh -i ~/.ssh/homelab root@192.168.1.12 "hostname"
ssh -i ~/.ssh/homelab root@192.168.1.10 "hostname"Configure git for the project:
git config --global user.name "Your Name"
git config --global user.email "your-email@example.com"
git config --global core.sshCommand "ssh -i ~/.ssh/homelab"
# Initialize git repository
git init
git add .
git commit -m "Initial: Project structure"

Create and run verification script:
cat > verify-setup.sh << 'EOF'
#!/bin/bash
echo "🔍 Verifying local environment setup..."
# Check tools
tools=("terraform" "ansible" "kubectl" "helm" "jq" "yq")
for tool in "${tools[@]}"; do
if command -v $tool &> /dev/null; then
echo "✅ $tool: $(command -v $tool)"
else
echo "❌ $tool: NOT FOUND"
fi
done
# Check Tailscale
if tailscale status &> /dev/null; then
echo "✅ Tailscale: Connected"
else
echo "⚠️ Tailscale: Not connected"
fi
# Check SSH key
if [ -f ~/.ssh/homelab ]; then
echo "✅ SSH key exists"
else
echo "❌ SSH key not found"
fi
# Check Proxmox API
if [ -n "$PM_PASS_TOKEN" ]; then
echo "✅ Proxmox token configured"
else
echo "❌ Proxmox token not set"
fi
echo "✨ Setup verification complete!"
EOF
chmod +x verify-setup.sh
./verify-setup.sh# macOS: Rehash brew
hash -r
# Linux: Check PATH
echo $PATH | grep -o "[^:]*"

# Verify the API endpoint is correct
echo $PROXMOX_VE_ENDPOINT
# Test connectivity
ping -c 3 192.168.1.10
# Check if API token is valid
curl -k "https://192.168.1.10:8006/api2/json/version" \
-H "Authorization: PVEAPIToken=root@pam!terraform@homelab=$PM_PASS_TOKEN"# Verify SSH is running on nodes
ssh root@192.168.1.10 "systemctl status ssh"
# Check if key is added to node
ssh -i ~/.ssh/homelab root@192.168.1.10 "cat ~/.ssh/authorized_keys"Once local setup is verified:
➡️ Continue to Guide 02: Proxmox Cluster
- All required tools installed
- SSH key generated and distributed
- Proxmox API token created
- Project directory structure created
- Environment variables configured
- Proxmox API access verified
- SSH access verified to all nodes
- Tailscale connected
Form the Proxmox cluster and configure Software-Defined Networking (SDN) for network isolation.
This guide creates a 3-node Proxmox VE cluster and configures VXLAN-based virtual networks. pve-vader is the primary master.
Time Required: ~20 minutes Prerequisites: All three Proxmox nodes installed and reachable via SSH on the management network (192.168.1.x)
Run on each node (pve-vader, pve-sidious, pve-maul):
Ensure /etc/hosts contains all three hosts:
192.168.1.11 pve-vader.homelab.local pve-vader
192.168.1.12 pve-sidious.homelab.local pve-sidious
192.168.1.10 pve-maul.homelab.local pve-maul
Verify connectivity between nodes:
# Run on pve-vader
ping -c 3 192.168.1.12 # to pve-sidious
ping -c 3 192.168.1.10 # to pve-maul

Note
MTU consideration: VXLAN encapsulates with an overhead of 50 bytes. The physical NICs must have MTU ≥ 1500 (default). If you set NIC MTU to 9000 (jumbo frames), set VNet MTU to 8950 instead of 1450. Do not mix MTU sizes across nodes.
# SSH to pve-vader
ssh root@192.168.1.11
# Create the cluster using the management IP as link0
pvecm create homelab-cluster --link0 192.168.1.11

# SSH to pve-sidious
ssh root@192.168.1.12
# Join the cluster
pvecm add 192.168.1.11 --link0 192.168.1.12

# SSH to pve-maul
ssh root@192.168.1.10
# Join the cluster
pvecm add 192.168.1.11 --link0 192.168.1.10

# Run on pve-vader
pvecm status
# Expected output (key fields):
# Quorate: Yes
# Total votes: 3
# Nodes: 3
# ID Votes Flags Name
# 1 1 M,Ees pve-vader
# 2 1 Ees pve-sidious
# 3 1 Ees pve-maul
# Also verify from the Proxmox UI:
# Datacenter → Cluster → check all 3 nodes show green

- Navigate to Datacenter > SDN > Zones
- Click Add: VXLAN
  - Zone: vxlan-zone
  - MTU: 1450 (must be 50 less than physical NIC MTU)
  - Nodes: Select all 3 nodes
| VNet Name | Tag | CIDR | Purpose |
|---|---|---|---|
| vnet-homelab | 100 | 10.10.10.0/24 | Primary K3s Cluster |
| vnet-sandbox | 200 | 10.20.20.0/24 | Isolated Hack Box |
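If you prefer the CLI over the UI for the zone and VNets, the same objects can be created with pvesh; the parameter names follow the /cluster/sdn API, and the VXLAN peer list (the three management IPs) is an assumption:

```bash
# VXLAN zone spanning all three nodes
pvesh create /cluster/sdn/zones --zone vxlan-zone --type vxlan \
  --peers 192.168.1.11,192.168.1.12,192.168.1.10 --mtu 1450

# The two VNets
pvesh create /cluster/sdn/vnets --vnet vnet-homelab --zone vxlan-zone --tag 100
pvesh create /cluster/sdn/vnets --vnet vnet-sandbox --zone vxlan-zone --tag 200
```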
# Apply SDN configuration to cluster (Proxmox VE 8.x)
pvesh create /cluster/sdn --action apply

# Verify VNets are visible on all nodes
pvesh get /cluster/sdn/vnets
# Expected: both vnet-homelab and vnet-sandbox listed
# Verify on individual node
ssh root@192.168.1.12 "pvesh get /nodes/pve-sidious/network" | grep -E 'vnet|vxlan'Since pve-vader and pve-sidious are online 24/7, they maintain quorum (2/3 votes) automatically even when pve-maul is powered off.
| State | Votes Online | Status |
|---|---|---|
| Vader + Sidious + Maul | 3/3 | Healthy |
| Vader + Sidious | 2/3 | Healthy (Quorum Maintained) |
| Vader Only | 1/3 | Cluster Read-Only (Quorum Lost) |
- /etc/hosts updated on all 3 nodes with correct hostnames
- Cluster created on pve-vader: pvecm create homelab-cluster --link0 192.168.1.11
- pve-sidious joined cluster: pvecm status shows Nodes: 2, Quorate: Yes
- pve-maul joined cluster: pvecm status shows Nodes: 3
- VXLAN zone vxlan-zone created (MTU: 1450)
- VNet vnet-homelab (Tag: 100) created
- VNet vnet-sandbox (Tag: 200) created
- SDN applied: pvesh create /cluster/sdn --action apply
- VNets verified on all nodes: pvesh get /cluster/sdn/vnets
Deploy all VMs and LXC containers using Terraform with the Proxmox provider.
This guide provisions the primary infrastructure. All management and control plane VMs are pinned to pve-vader (Master) and pve-sidious (24/7 Worker).
Time Required: ~30 minutes
Prerequisites: Guide 02 completed, direnv and Terraform installed (Guide 01), .envrc configured with PROXMOX_VE_API_TOKEN
Terraform Provider: bpg/proxmox — actively maintained, supports Cloud-Init and multi-disk VMs.
Important
This step must be completed before running terraform plan. Terraform references vg-longhorn by name and will fail if it isn't registered in Proxmox as a storage backend.
After creating the LVM thin pool on pve-vader and pve-sidious (see storage-checklist.md § Storage Operations):
- Proxmox UI → Datacenter → Storage → Add → LVM-Thin
- Repeat for each node (vader and sidious):
  - ID: vg-longhorn
  - Volume group: vg-longhorn
  - Thin pool: tp-longhorn
  - Nodes: select only the node you're adding it to
- Verify: pvesm status on each node should list vg-longhorn as active
The following files are already present in the repo. Review them before running:
terraform/
├── environments/
│ └── homelab/
│ ├── providers.tf ← bpg/proxmox provider + required_version
│ ├── variables.tf ← all input variables (API token, SSH key, etc.)
│ └── main.tf ← all VM/LXC resource definitions
└── modules/
└── vm/
├── main.tf ← proxmox_virtual_environment_vm resource
├── variables.tf ← module input variables
└── outputs.tf ← vm_id, vm_name, ip_address
| VM Name | Node | vCPU | RAM | Boot Disk | Data Disk | On Boot |
|---|---|---|---|---|---|---|
| pfSense-Primary | vader | 2 | 4GB | 32GB (local-lvm) | — | ✅ |
| k3s-master-01 | vader | 4 | 8GB | 100GB (local-lvm) | 200GB (vg-longhorn) | ✅ |
| k3s-worker-01 | sidious | 4 | 8GB | 100GB (local-lvm) | 200GB (vg-longhorn) | ✅ |
| kali-linux | maul | 4 | 8GB | 80GB (local-lvm) | — | ❌ |
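The 200GB data disks in the table above come from the vg-longhorn LVM-Thin pool. A hedged sketch of how the second disk might be declared in the bpg/proxmox VM resource — the surrounding resource and module wiring in main.tf will differ:

```hcl
resource "proxmox_virtual_environment_vm" "k3s_master" {
  # ... node_name, cpu, memory, cloud-init, boot disk on local-lvm ...

  disk {
    datastore_id = "vg-longhorn" # LVM-Thin pool registered in Proxmox
    interface    = "scsi1"       # shows up as /dev/sdb inside the guest
    size         = 200           # GB
  }
}
```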
| Node | NVMe Committed | SATA Committed | Status |
|---|---|---|---|
| Vader | ~132GB | 200GB | ✅ Stable |
| Sidious | ~100GB | 200GB | ✅ Stable |
| Maul | ~80GB | 0GB | ✅ Stable |
cd terraform/environments/homelab
# Download provider plugins
terraform init
# Expected: "Terraform has been successfully initialized!"
# Provider bpg/proxmox will be downloaded from registry.terraform.io

# Check configuration syntax and references
terraform validate
# Expected: "Success! The configuration is valid."# Set your SSH public key for Cloud-Init injection
export TF_VAR_ssh_public_key=$(cat ~/.ssh/homelab.pub)
# Generate and review the execution plan
terraform plan -out=tfplan
# Review the output carefully:
# + resource "proxmox_virtual_environment_vm" means a VM will be created
# Read through all planned changes before applying

# Apply the plan (will prompt for confirmation)
terraform apply tfplan
# Monitor progress — VM cloning can take 2–5 minutes per VM
# Expected final line: "Apply complete! Resources: N added, 0 changed, 0 destroyed."

# List all VMs via Proxmox API
terraform show
# Or via Proxmox CLI on each node:
ssh root@192.168.1.11 "qm list" # vader VMs
ssh root@192.168.1.12 "qm list" # sidious VMs
ssh root@192.168.1.10 "qm list" # maul VMs
# Verify SSH access to K3s master (may take 60s for cloud-init to complete)
ssh -i ~/.ssh/homelab ubuntu@10.10.10.10 "uname -a"pfSense cannot be fully automated via Terraform — the initial NIC assignment requires the console.
- Open Proxmox UI → pve-vader → VM 100 (pfSense-Primary) → Console
- Complete pfSense installer
- Assign interfaces:
  - vtnet0 → WAN (receives IP from home router via DHCP)
  - vtnet1 → LAN (10.10.10.1/24)
  - vtnet2 → OPT1 (10.20.20.1/24)
- Set firewall rule: Block OPT1 → LAN (sandbox isolation)
- In pfSense web UI (https://10.10.10.1): set DNS to point to AdGuard (10.10.10.2)
- vg-longhorn registered in Proxmox UI as LVM-Thin storage on both vader and sidious
- terraform init completed successfully
- terraform validate returns "Success"
- terraform plan reviewed — no unexpected changes
- terraform apply completed — all VMs created
- qm list on vader shows: pfSense (100), k3s-master-01 (200)
- qm list on sidious shows: k3s-worker-01 (201)
- qm list on maul shows: kali-linux (800) with status stopped
- SSH to k3s-master-01 (ubuntu@10.10.10.10) succeeds
- pfSense initial NIC assignment completed via console
Bootstrap services on provisioned VMs and LXCs using Ansible playbooks.
This guide uses Ansible to configure the base services on all provisioned infrastructure, including K3s prerequisites, AdGuard Home, and Tailscale.
Time Required: ~30 minutes Prerequisites: Guide 03 completed, Ansible installed
Ansible Playbooks
├── adguard.yml # Configure AdGuard Home
├── tailscale.yml # Configure Tailscale
├── k3s-prereqs.yml # K3s prerequisites
└── common.yml # Common configuration
Ansible Roles
├── common/ # Common tasks
├── adguard/ # AdGuard Home
├── tailscale/ # Tailscale VPN
└── k3s/ # K3s Kubernetes
Create ansible/ansible.cfg (this file has already been created — see repo root):
Important
Before running any playbooks, install the required Ansible collections. These provide modules used throughout all roles.
ansible-galaxy collection install community.general ansible.posix

cd /Volumes/Codex/Projects/homelab/ansible
# Test inventory — all nodes must be reachable before proceeding
ansible -i inventories/homelab/hosts.yml all -m ping
# Expected output:
# k3s-master-01 | SUCCESS => {...}
# k3s-worker-01 | SUCCESS => {...}
# adguard | SUCCESS => {...}
# tailscale | SUCCESS => {...}

mkdir -p ansible/roles/common/{tasks,handlers,files,templates}

Create ansible/roles/common/tasks/main.yml:
---
# Common system configuration tasks
- name: Update apt cache
ansible.builtin.apt:
update_cache: true
cache_valid_time: 3600
become: true
- name: Upgrade all packages
ansible.builtin.apt:
upgrade: dist
autoremove: true
autoclean: true
become: true
- name: Install common packages
ansible.builtin.apt:
name:
- curl
- wget
- git
- vim
- htop
- net-tools
- dnsutils
- jq
- ca-certificates
- gnupg
- lsb-release
- python3
- python3-pip
state: present
become: true
- name: Set timezone to UTC
community.general.timezone:
name: UTC
become: true
- name: Configure sysctl settings
ansible.posix.sysctl:
name: "{{ item.name }}"
value: "{{ item.value }}"
state: present
reload: true
loop:
- { name: net.ipv4.ip_forward, value: 1 }
- { name: net.ipv6.conf.all.forwarding, value: 1 }
- { name: net.bridge.bridge-nf-call-iptables, value: 1 }
- { name: net.bridge.bridge-nf-call-ip6tables, value: 1 }
become: true
when: inventory_hostname in groups['k3s_master'] + groups['k3s_workers']
- name: Disable swap
ansible.builtin.command: swapoff -a
become: true
changed_when: false
failed_when: false
- name: Remove swap from fstab
ansible.builtin.lineinfile:
path: /etc/fstab
regexp: '^(.*\sswap\s+sw\s+.*)$'
state: absent
become: true
- name: Load kernel modules
community.general.modprobe:
name: "{{ item }}"
state: present
loop:
- overlay
- br_netfilter
- nf_conntrack
become: true
- name: Persist kernel modules
ansible.builtin.copy:
dest: "/etc/modules-load.d/{{ item }}.conf"
content: "{{ item }}\n"
mode: '0644'
loop:
- overlay
- br_netfilter
become: true
- name: Create homelab user
ansible.builtin.user:
name: homelab
system: true
shell: /bin/bash
home: /opt/homelab
create_home: true
become: true

mkdir -p ansible/roles/adguard/{tasks,handlers,templates}

Create ansible/roles/adguard/tasks/main.yml:
---
# AdGuard Home installation and configuration
- name: Download AdGuard Home
ansible.builtin.get_url:
url: "https://github.com/AdguardTeam/AdGuardHome/releases/latest/download/AdGuardHome_linux_amd64.tar.gz"
dest: /tmp/AdGuardHome.tar.gz
mode: '0644'
become: true
- name: Create AdGuard directory
ansible.builtin.file:
path: /opt/AdGuardHome
state: directory
mode: '0755'
become: true
- name: Extract AdGuard Home
ansible.builtin.unarchive:
src: /tmp/AdGuardHome.tar.gz
dest: /opt/AdGuardHome
remote_src: true
become: true
- name: Install AdGuard Home
ansible.builtin.command: /opt/AdGuardHome/AdGuardHome -s install
args:
chdir: /opt/AdGuardHome
creates: /opt/AdGuardHome/AdGuardHome.service
become: true
notify: restart adguard
- name: Start AdGuard Home
ansible.builtin.systemd:
name: AdGuardHome
state: started
enabled: true
daemon_reload: true
become: true
- name: Wait for AdGuard to be ready
ansible.builtin.wait_for:
port: 3000
delay: 5
timeout: 60
- name: Display AdGuard setup URL
ansible.builtin.debug:
msg: "AdGuard Home available at http://{{ ansible_host }}:3000"Create ansible/roles/adguard/handlers/main.yml:
---
- name: restart adguard
ansible.builtin.systemd:
name: AdGuardHome
state: restarted
become: true

mkdir -p ansible/roles/tailscale/{tasks,handlers}

Create ansible/roles/tailscale/tasks/main.yml:
---
# Tailscale installation and configuration
# Requires: ansible.posix collection (ansible-galaxy collection install ansible.posix)
- name: Create APT keyrings directory
ansible.builtin.file:
path: /etc/apt/keyrings
state: directory
mode: '0755'
become: true
- name: Download Tailscale GPG key
ansible.builtin.get_url:
url: https://pkgs.tailscale.com/stable/ubuntu/noble.nokey.gpg
dest: /etc/apt/keyrings/tailscale.gpg
mode: '0644'
become: true
- name: Add Tailscale APT repository
ansible.builtin.apt_repository:
repo: "deb [signed-by=/etc/apt/keyrings/tailscale.gpg] https://pkgs.tailscale.com/stable/ubuntu noble main"
state: present
filename: tailscale
become: true
- name: Install Tailscale
ansible.builtin.apt:
name:
- tailscale
state: present
update_cache: true
become: true
- name: Start Tailscale daemon
ansible.builtin.systemd:
name: tailscaled
state: started
enabled: true
become: true
- name: Check Tailscale status
ansible.builtin.command: tailscale status
register: tailscale_status
failed_when: false
changed_when: false
- name: Display Tailscale authentication URL
ansible.builtin.debug:
msg: |
Tailscale is not authenticated. Please authenticate manually:
1. SSH to the node: ssh {{ ansible_user }}@{{ ansible_host }}
2. Run: sudo tailscale up --advertise-routes=10.10.10.0/24,10.20.20.0/24
3. Follow the browser prompt
when: tailscale_status.rc != 0

Create ansible/playbooks/common.yml:
---
# Apply common configuration to all hosts
- name: Apply common configuration
hosts: all
become: true
gather_facts: true
roles:
- role: common
tags: common

Create ansible/playbooks/adguard.yml:
---
# Deploy AdGuard Home
- name: Configure AdGuard Home
hosts: network_services
become: true
gather_facts: true
pre_tasks:
- name: Include common role
include_role:
name: common
tags: common
roles:
- role: adguard
tags: adguard

Create ansible/playbooks/tailscale.yml:
---
# Deploy Tailscale
- name: Configure Tailscale
hosts: network_services
become: true
gather_facts: true
pre_tasks:
- name: Include common role
include_role:
name: common
tags: common
roles:
- role: tailscale
tags: tailscale

Create ansible/playbooks/k3s-prereqs.yml:
---
# Prepare K3s nodes
# NOTE: Do NOT install the OS 'containerd' package here.
# K3s ships its own bundled containerd. Installing the OS package creates a
# conflicting socket at /run/containerd/containerd.sock and will cause K3s to fail.
- name: Prepare K3s nodes
hosts: k3s_master, k3s_workers
become: true
gather_facts: true
roles:
- role: common
tags: common
post_tasks:
- name: Install iSCSI client (required by Longhorn)
ansible.builtin.apt:
name:
- open-iscsi
- nfs-common
state: present
update_cache: true
- name: Enable and start iSCSI daemon
ansible.builtin.systemd:
name: iscsid
state: started
enabled: true
- name: Load required kernel modules for K3s networking
community.general.modprobe:
name: "{{ item }}"
state: present
loop:
- br_netfilter
- overlay
- nf_conntrack
- name: Persist kernel modules across reboots
ansible.builtin.copy:
dest: /etc/modules-load.d/k3s.conf
content: |
br_netfilter
overlay
nf_conntrack
mode: '0644'
- name: Configure required sysctl parameters for K3s
ansible.posix.sysctl:
name: "{{ item.key }}"
value: "{{ item.value }}"
state: present
reload: true
loop:
- { key: "net.bridge.bridge-nf-call-iptables", value: "1" }
- { key: "net.bridge.bridge-nf-call-ip6tables", value: "1" }
- { key: "net.ipv4.ip_forward", value: "1" }cd /Volumes/Codex/Projects/homelab/ansible
# Apply to all hosts
ansible-playbook -i inventories/homelab/hosts.yml playbooks/common.yml
# Expected output: All tasks completed successfully

# Deploy to network services
ansible-playbook -i inventories/homelab/hosts.yml playbooks/adguard.yml
# Expected output:
# TASK [adguard : Display AdGuard setup URL]
# ok: [adguard] => {
# "msg": "AdGuard Home available at http://10.10.10.2:3000"
# }
# Deploy to network services
ansible-playbook -i inventories/homelab/hosts.yml playbooks/tailscale.yml
# Authenticate Tailscale manually on the node:
ssh -i ~/.ssh/homelab ubuntu@10.10.10.3
sudo tailscale up --advertise-routes=10.10.10.0/24,10.20.20.0/24
# Prepare K3s nodes
ansible-playbook -i inventories/homelab/hosts.yml playbooks/k3s-prereqs.yml
# Verify iSCSI daemon is running on all K3s nodes
ansible -i inventories/homelab/hosts.yml k3s_master,k3s_workers \
-m shell -a "systemctl status iscsid" --become
# Verify required kernel modules are loaded
ansible -i inventories/homelab/hosts.yml k3s_master,k3s_workers \
-m shell -a "lsmod | grep -E 'br_netfilter|nf_conntrack'" --become# Check AdGuard is running
curl -s http://10.10.10.2:3000
# Should return HTML (AdGuard setup page)
Access in browser: http://10.10.10.2:3000
Complete the setup wizard:
- Create admin password
- Configure upstream DNS (Cloudflare: 1.1.1.1)
- Enable DNS filtering lists
- Set as DHCP DNS server on pfSense
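Before handing DNS duties over via DHCP, it is worth confirming AdGuard actually resolves queries. A quick check from your workstation (assumes dig is installed and AdGuard is listening on 10.10.10.2:53):
# Query AdGuard directly
dig @10.10.10.2 example.com +short
# A domain commonly present on default blocklists should return 0.0.0.0
# (or NXDOMAIN, depending on the blocking mode you chose in the wizard)
dig @10.10.10.2 doubleclick.net +short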
# Check Tailscale status
ssh -i ~/.ssh/homelab ubuntu@10.10.10.3 "sudo tailscale status"
# Should show:
# # Active connections:
# ...
# Verify the OS containerd package is NOT installed (K3s ships its own bundled containerd)
ansible -i inventories/homelab/hosts.yml k3s_master,k3s_workers \
  -m shell -a "dpkg -s containerd && echo 'WARNING: remove the OS containerd package' || echo 'OK: OS containerd package absent'" --become
# Verify kernel modules
ansible -i inventories/homelab/hosts.yml k3s_master,k3s_workers -m shell -a "lsmod | grep -E 'overlay|br_netfilter'" --become
# Verify swap is disabled
ansible -i inventories/homelab/hosts.yml k3s_master,k3s_workers -m shell -a "free -h" --become
Create ansible/inventories/homelab/group_vars/all.yml:
---
# Common variables for all hosts
# Timezone
homelab_timezone: UTC
# DNS servers
homelab_dns_servers:
- 10.10.10.2 # AdGuard Home
- 1.1.1.1 # Cloudflare fallback
# NTP servers
homelab_ntp_servers:
- pool.ntp.org
# Domain
homelab_domain: homelab.local
Create ansible/inventories/homelab/group_vars/k3s.yml:
---
# K3s cluster variables
# K3s version
k3s_version: "v1.28.5+k3s1"
# K3s configuration
k3s_cluster_cidr: "10.42.0.0/16"
k3s_service_cidr: "10.43.0.0/16"
# K3s server URL (for workers)
k3s_server_url: "https://10.10.10.10:6443"
# Disable Traefik (using ingress-nginx instead)
k3s_disable_traefik: true
Issue: UNREACHABLE! => {"failed": true, "msg": "Failed to connect..."}
Solution:
# Test SSH manually
ssh -i ~/.ssh/homelab -vvv ubuntu@10.10.10.10
# Check if host is reachable
ping -c 3 10.10.10.10
# Verify ansible inventory
ansible-inventory -i inventories/homelab/hosts.yml --list
Issue: AdGuard service fails to start
Solution:
# Check service logs
ssh -i ~/.ssh/homelab ubuntu@10.10.10.2 "sudo journalctl -u AdGuardHome -n 50"
# Check if port is already in use
ssh -i ~/.ssh/homelab ubuntu@10.10.10.2 "sudo netstat -tulpn | grep 3000"
# Restart manually
ssh -i ~/.ssh/homelab ubuntu@10.10.10.2 "sudo systemctl restart AdGuardHome"
Issue: Tailscale not authenticated
Solution:
# SSH to the node
ssh -i ~/.ssh/homelab ubuntu@10.10.10.3
# Run authentication command
sudo tailscale up --advertise-routes=10.10.10.0/24,10.20.20.0/24
# Follow browser prompt
# Approve subnet routes in Tailscale admin console
Base services configured:
➡️ Continue to Guide 05: K3s Cluster
- Ansible configuration created
- Inventory verified
- Common role created
- AdGuard role created
- Tailscale role created
- Playbooks created
- Common configuration applied
- AdGuard Home deployed and configured
- Tailscale deployed and authenticated
- K3s prerequisites applied
- All services verified
Deploy a K3s Kubernetes cluster with one master and two worker nodes.
This guide deploys a K3s Kubernetes cluster using Ansible. K3s is a lightweight, certified Kubernetes distribution perfect for edge computing and homelabs.
Time Required: ~45 minutes Prerequisites: Guide 04 completed
K3s Cluster
┌────────────────────────────────────────┐
│ │
│ ┌──────────────┐ ┌─────────────────┐ │
│ │ k3s-master │ │ k3s-worker-01 │ │
│ │ 10.10.10.10 │◄─┤ 10.10.10.11 │ │
│ │ │ │ (pve-vader) │ │
│ │ Control │ │ │ │
│ │ Plane │ │ Workloads │ │
│ └──────────────┘ └─────────────────┘ │
│ │ │
│ │ │
│ ┌─────────────────┐ │
│ │ k3s-worker-02 │ │
│ │ 10.10.10.12 │ │
│ │ (pve-sidious) │ │
│ │ │ │
│ │ Workloads │ │
│ └─────────────────┘ │
└────────────────────────────────────────┘
Pod Network: 10.42.0.0/16 (Flannel)
Service Network: 10.43.0.0/16
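The role and playbooks below assume the inventory groups k3s_master and k3s_workers created in Guide 04. For reference, a minimal hosts.yml sketch matching the IPs in the diagram (adjust to your actual inventory):
all:
  children:
    k3s_master:
      hosts:
        k3s-master-01:
          ansible_host: 10.10.10.10
    k3s_workers:
      hosts:
        k3s-worker-01:
          ansible_host: 10.10.10.11
        k3s-worker-02:
          ansible_host: 10.10.10.12
  vars:
    ansible_user: ubuntu
    ansible_ssh_private_key_file: ~/.ssh/homelab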
mkdir -p ansible/roles/k3s/{tasks,handlers,templates,files}
Create ansible/roles/k3s/tasks/master.yml:
---
# K3s master installation tasks
- name: Create K3s config directory
ansible.builtin.file:
path: /etc/rancher/k3s
state: directory
mode: '0755'
become: true
- name: Download K3s binary
ansible.builtin.get_url:
url: "https://github.com/k3s-io/k3s/releases/download/{{ k3s_version }}/k3s"
dest: /usr/local/bin/k3s
mode: '0755'
owner: root
group: root
become: true
- name: Create K3s systemd service
ansible.builtin.template:
src: k3s.service.j2
dest: /etc/systemd/system/k3s.service
mode: '0644'
become: true
notify: restart k3s
- name: Enable and start K3s
ansible.builtin.systemd:
name: k3s
state: started
enabled: true
daemon_reload: true
become: true
- name: Wait for K3s to be ready
ansible.builtin.wait_for:
port: 6443
delay: 10
timeout: 300
- name: Get K3s node token
ansible.builtin.slurp:
src: /var/lib/rancher/k3s/server/node-token
register: k3s_token
become: true
failed_when: false
- name: Save K3s token to file
ansible.builtin.copy:
content: "{{ k3s_token.content | b64decode }}"
dest: /tmp/k3s-token
mode: '0600'
become: false  # runs on the control machine; root is not needed to write /tmp/k3s-token
delegate_to: localhost
when: k3s_token.content is defined
- name: Display K3s master info
ansible.builtin.debug:
msg:
- "K3s Master deployed successfully!"
- "API Server: https://{{ ansible_host }}:6443"
- "Node token saved to /tmp/k3s-token"Create ansible/roles/k3s/tasks/worker.yml:
---
# K3s worker installation tasks
- name: Download K3s binary
ansible.builtin.get_url:
url: "https://github.com/k3s-io/k3s/releases/download/{{ k3s_version }}/k3s"
dest: /usr/local/bin/k3s
mode: '0755'
owner: root
group: root
become: true
- name: Create K3s config directory
ansible.builtin.file:
path: /etc/rancher/k3s
state: directory
mode: '0755'
become: true
- name: Create K3s agent systemd service
ansible.builtin.template:
src: k3s-agent.service.j2
dest: /etc/systemd/system/k3s-agent.service
mode: '0644'
become: true
notify: restart k3s-agent
- name: Enable and start K3s agent
ansible.builtin.systemd:
name: k3s-agent
state: started
enabled: true
daemon_reload: true
become: true
- name: Wait for node to register
ansible.builtin.pause:
seconds: 30
- name: Display worker node info
ansible.builtin.debug:
msg: "K3s Worker {{ inventory_hostname }} deployed successfully!"Create ansible/roles/k3s/templates/k3s.service.j2:
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
Wants=network-online.target
After=network-online.target
[Service]
Type=notify
EnvironmentFile=-/etc/default/%i
EnvironmentFile=-/etc/sysconfig/%i
EnvironmentFile=-/etc/rancher/k3s/%i.conf
ExecStartPre=/bin/sh -xc '! test -f /etc/rancher/k3s/k3s-lock || exit 1; touch /etc/rancher/k3s/k3s-lock; rm -f /etc/rancher/k3s/k3s-lock'
ExecStart=/usr/local/bin/k3s \
server \
'--tls-san={{ ansible_host }}' \
'--tls-san=k3s-master-01.homelab.local' \
'--cluster-init' \
{% if k3s_disable_traefik %}'--disable=traefik'{% endif %} \
'--disable-cloud-controller' \
'--write-kubeconfig-mode=644' \
'--node-name={{ ansible_hostname }}'
KillMode=process
Delegate=yes
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
Create ansible/roles/k3s/templates/k3s-agent.service.j2:
[Unit]
Description=Lightweight Kubernetes Agent
Documentation=https://k3s.io
Wants=network-online.target
After=network-online.target
[Install]
WantedBy=multi-user.target
[Service]
Type=exec
EnvironmentFile=-/etc/default/%i
EnvironmentFile=-/etc/sysconfig/%i
ExecStart=/usr/local/bin/k3s \
agent \
'--server={{ k3s_server_url }}' \
'--token={{ k3s_token }}' \
'--node-name={{ ansible_hostname }}'
KillMode=process
Delegate=yes
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
Create ansible/roles/k3s/tasks/main.yml:
---
- name: Include master tasks
include_tasks: master.yml
when: inventory_hostname in groups['k3s_master']
tags: k3s_master
- name: Include worker tasks
include_tasks: worker.yml
when: inventory_hostname in groups['k3s_workers']
tags: k3s_worker
Create ansible/roles/k3s/handlers/main.yml:
---
- name: restart k3s
ansible.builtin.systemd:
name: k3s
state: restarted
become: true
- name: restart k3s-agent
ansible.builtin.systemd:
name: k3s-agent
state: restarted
become: true
Create ansible/playbooks/k3s-install.yml:
---
# Deploy K3s Kubernetes cluster
- name: Deploy K3s master
hosts: k3s_master
become: true
gather_facts: true
roles:
- role: k3s
tags: k3s
- name: Get K3s token
hosts: localhost
gather_facts: false
tasks:
- name: Read K3s token
ansible.builtin.slurp:
src: /tmp/k3s-token
register: k3s_token_file
failed_when: false
- name: Set token fact
ansible.builtin.set_fact:
k3s_token: "{{ k3s_token_file.content | b64decode }}"
when: k3s_token_file.content is defined
- name: Deploy K3s workers
hosts: k3s_workers
become: true
gather_facts: true
roles:
- role: k3s
tags: k3s
vars:
k3s_token: "{{ hostvars[groups['k3s_master'][0]]['k3s_token'] | default(hostvars['localhost']['k3s_token']) }}"cd /Volumes/Codex/Projects/homelab/ansible
# Deploy K3s master only
ansible-playbook -i inventories/homelab/hosts.yml playbooks/k3s-install.yml --tags k3s_master
# Expected output:
# TASK [k3s : Display K3s master info]
# ok: [k3s-master-01] => {
# "msg": [
# "K3s Master deployed successfully!",
# "API Server: https://10.10.10.10:6443"
# ]
# }
# Copy kubeconfig from master
ssh -i ~/.ssh/homelab ubuntu@10.10.10.10 "sudo cat /etc/rancher/k3s/k3s.yaml" > ~/.kube/homelab-config
# Fix server address
sed -i '' 's/127.0.0.1/10.10.10.10/g' ~/.kube/homelab-config
# Set KUBECONFIG
export KUBECONFIG=$HOME/.kube/homelab-config
# Test connectivity
kubectl get nodes
# Expected output:
# NAME STATUS ROLES AGE VERSION
# k3s-master-01 Ready control-plane,etcd,master 10s v1.28.5+k3s1
# Deploy K3s workers
ansible-playbook -i inventories/homelab/hosts.yml playbooks/k3s-install.yml --tags k3s_worker
# Expected output:
# TASK [k3s : Display worker node info]
# ok: [k3s-worker-01] => "K3s Worker k3s-worker-01 deployed successfully!"
# ok: [k3s-worker-02] => "K3s Worker k3s-worker-02 deployed successfully!"
# Check all nodes
kubectl get nodes -o wide
# Expected output:
# NAME STATUS ROLES AGE VERSION
# k3s-master-01 Ready control-plane,etcd,master 1m v1.28.5+k3s1
# k3s-worker-01 Ready <none> 30s v1.28.5+k3s1
# k3s-worker-02 Ready <none> 30s v1.28.5+k3s1
# Check system pods
kubectl get pods -n kube-system
# Expected output: All pods Running or Completed
# Download helm binary
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod +x get_helm.sh
./get_helm.sh
# Verify
helm version
# Add common repositories
helm repo add jetstack https://charts.jetstack.io
helm repo add longhorn https://charts.longhorn.io
helm repo add argo https://argoproj.github.io/argo-helm
helm repo add hashicorp https://helm.releases.hashicorp.com
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update
# Install CRDs
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.1/cert-manager.crds.yaml
# Create namespace
kubectl create namespace cert-manager
# Install cert-manager
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--version v1.13.1 \
--set installCRDs=true
# Verify
kubectl get pods -n cert-manager
Already disabled during installation. Verify:
kubectl get pods -n kube-system | grep traefik
# Should return nothing (Traefik is disabled)
# Create test namespace
kubectl create namespace test
# Deploy nginx
kubectl create deployment nginx --image=nginx:latest -n test
# Expose service
kubectl expose deployment nginx --port=80 --type=NodePort -n test
# Get pod info
kubectl get pods -n test -o wide
# Get service info
kubectl get svc -n test
# Get NodePort
NODE_PORT=$(kubectl get svc nginx -n test -o jsonpath='{.spec.ports[0].nodePort}')
# Test access from local machine
curl -I http://10.10.10.10:$NODE_PORT
# Should return: HTTP/1.1 200 OK
# Delete test namespace
kubectl delete namespace test
# Add to shell config
cat >> ~/.zshrc << 'EOF'
# K3s Homelab
export KUBECONFIG=$HOME/.kube/homelab-config
alias k=kubectl
source <(kubectl completion zsh)
complete -F __start_kubectl k
EOF
# Reload shell
source ~/.zshrc
# Test
k get nodes
Issue: K3s service fails to start
Solution:
# Check service logs
ssh -i ~/.ssh/homelab ubuntu@10.10.10.10 "sudo journalctl -u k3s -n 50 --no-pager"
# Check if port 6443 is available
ssh -i ~/.ssh/homelab ubuntu@10.10.10.10 "sudo netstat -tulpn | grep 6443"
# Restart K3s
ssh -i ~/.ssh/homelab ubuntu@10.10.10.10 "sudo systemctl restart k3s"
Issue: Workers stuck in "NotReady" state
Solution:
# Check worker logs
ssh -i ~/.ssh/homelab ubuntu@10.10.10.11 "sudo journalctl -u k3s-agent -n 50 --no-pager"
# Verify token is correct
ssh -i ~/.ssh/homelab ubuntu@10.10.10.10 "sudo cat /var/lib/rancher/k3s/server/node-token"
# Check firewall on master
ssh -i ~/.ssh/homelab ubuntu@10.10.10.10 "sudo iptables -L -n | grep 6443"
# Restart worker agent
ssh -i ~/.ssh/homelab ubuntu@10.10.10.11 "sudo systemctl restart k3s-agent"
Issue: kubectl get nodes returns connection refused
Solution:
# Verify kubeconfig points to correct IP
grep server: ~/.kube/homelab-config
# Should be: server: https://10.10.10.10:6443
# Test API server directly
curl -k https://10.10.10.10:6443/version
# Check if master node is ready
ssh -i ~/.ssh/homelab ubuntu@10.10.10.10 "sudo systemctl status k3s"
K3s cluster deployed and verified:
➡️ Continue to Guide 06: Longhorn Storage
- K3s Ansible role created
- K3s installation playbook created
- K3s master deployed
- Kubeconfig retrieved and configured
- kubectl connectivity verified
- K3s workers deployed
- All nodes in Ready state
- System pods running
- Helm installed
- Helm repositories added
- cert-manager installed
- Test deployment successful
- Test deployment cleaned up
- kubectl autocomplete configured
Install and configure Longhorn for distributed block storage across your K3s cluster.
Longhorn is a lightweight, reliable distributed block storage solution for Kubernetes. This guide installs Longhorn with proper storage classes for different workload types.
Time Required: ~30 minutes Prerequisites: Guide 05 completed, K3s cluster running
Longhorn Storage Cluster
┌────────────────────────────────────────────────┐
│ │
│ Each node contributes 80GB for storage │
│ │
│ ┌────────────┐ ┌────────────┐ ┌───────────┐ │
│ │ k3s- │ │ k3s- │ │ k3s- │ │
│ │ master │ │ worker-1 │ │ worker-2 │ │
│ │ │ │ │ │ │ │
│ │ Engine: │ │ Engine: │ │ Engine: │ │
│ │ Replica Mgr│ │ Replica Mgr│ │ Replica │ │
│ │ 80GB pool │ │ 80GB pool │ │ 80GB pool │ │
│ └────────────┘ └────────────┘ └───────────┘ │
│ │ │ │ │
│ └───────────────┴──────────────┘ │
│ │ │
│ 3-way replication │
│ (for critical data) │
└────────────────────────────────────────────────┘
Storage Classes:
├── longhorn-critical (replicas: 3) - Vault, Gitea
├── longhorn-default (replicas: 2) - Monitoring, Apps
└── longhorn-ephemeral (replicas: 1) - Cache, Temp data
# Check available disk space on each node
kubectl get nodes -o custom-columns=NAME:.metadata.name,DISK:.status.capacity[ephemeral-storage]
# SSH to nodes to verify
ssh -i ~/.ssh/homelab ubuntu@10.10.10.10 "df -h /var/lib/rancher/k3s"
ssh -i ~/.ssh/homelab ubuntu@10.10.10.11 "df -h /var/lib/rancher/k3s"
ssh -i ~/.ssh/homelab ubuntu@10.10.10.12 "df -h /var/lib/rancher/k3s"
# Install iSCSI and NFS clients on all nodes (required by Longhorn)
ansible -i ansible/inventories/homelab/hosts.yml k3s_master,k3s_workers -m shell -a "sudo apt install -y open-iscsi nfs-common" --become
# Start and enable iscsi service
ansible -i ansible/inventories/homelab/hosts.yml k3s_master,k3s_workers -m shell -a "sudo systemctl enable --now iscsid" --become
helm repo add longhorn https://charts.longhorn.io
helm repo update
kubectl create namespace longhorn-system
helm install longhorn longhorn/longhorn \
--namespace longhorn-system \
--set defaultSettings.defaultReplicaCount=2 \
--set defaultSettings.defaultDataPath="/var/lib/longhorn" \
--set defaultSettings.storageMinimalAvailablePercentage=10 \
--set defaultSettings.targetBackupCount=2 \
--set persistence.defaultClassReplicaCount=2 \
--set csi.kubeletRootDir=/var/lib/kubelet \
--wait
# Check Longhorn pods
kubectl get pods -n longhorn-system -w
# Expected output (after ~2-3 minutes):
# NAME READY STATUS RESTARTS AGE
# longhorn-driver-deployer-xxx 1/1 Running 0 2m
# instance-manager-xxx 1/1 Running 0 2m
# engine-image-xxx 1/1 Running 0 2m
# ...
Create file gitops-apps/infrastructure/longhorn-storageclasses.yaml:
---
# Longhorn Storage Classes
# Applied via: kubectl apply -f gitops-apps/infrastructure/longhorn-storageclasses.yaml
#
# Replica strategy for a 2-node active cluster (vader + sidious):
# longhorn-default = 2 replicas ← DEFAULT (survives 1 node loss)
# longhorn-critical = 3 replicas (will show Degraded when Maul is off - acceptable)
# longhorn-ephemeral = 1 replica (cache / scratch data only)
---
# Default Storage Class (2 replicas) — safe on 2-node cluster
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: longhorn-default
annotations:
storageclass.kubernetes.io/is-default-class: "true"
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
numberOfReplicas: "2"
staleReplicaTimeout: "2880"
fromBackup: ""
fsType: "ext4"
dataEngine: "v1"
reclaimPolicy: Delete
volumeBindingMode: Immediate
---
# Critical Storage Class (3 replicas — HA, for Vault and Gitea)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: longhorn-critical
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
numberOfReplicas: "3"
staleReplicaTimeout: "2880"
fromBackup: ""
fsType: "ext4"
dataEngine: "v1"
reclaimPolicy: Delete
volumeBindingMode: Immediate
---
# Ephemeral Storage Class (1 replica — cache/temp only)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: longhorn-ephemeral
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
numberOfReplicas: "1"
staleReplicaTimeout: "2880"
fromBackup: ""
fsType: "ext4"
dataEngine: "v1"
reclaimPolicy: Delete
volumeBindingMode: Immediate
kubectl apply -f gitops-apps/infrastructure/longhorn-storageclasses.yaml
# Verify storage classes
kubectl get storageclass
# Expected output:
# NAME PROVISIONER RECLAIMPOLICY
# longhorn-default (default) driver.longhorn.io Delete
# longhorn-critical driver.longhorn.io Delete
# longhorn-ephemeral driver.longhorn.io Delete
# Port forward to access UI
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8080:80
# Access at: http://localhost:8080
# Note: the Longhorn UI has no built-in authentication by default.
# Keep it behind the port-forward above or an ingress with basic auth; do not expose it directly.
# Set the image pull policy for system-managed components
# (Longhorn Setting resources carry a top-level "value" field, not spec.value)
kubectl patch -n longhorn-system settings.longhorn.io system-managed-components-pods-image-pull-policy \
  --type=merge -p '{"value":"IfNotPresent"}'
# Configure recurring snapshot limit (optional)
kubectl patch -n longhorn-system settings.longhorn.io recurring-job-max \
  --type=merge -p '{"value":"5"}'
# Create test namespace
kubectl create namespace test-storage
# Create test PVC
cat << EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: test-pvc
namespace: test-storage
spec:
accessModes:
- ReadWriteOnce
storageClassName: longhorn-critical
resources:
requests:
storage: 1Gi
EOF
cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: test-pod
namespace: test-storage
spec:
containers:
- name: test
image: nginx:latest
volumeMounts:
- name: data
mountPath: /data
volumes:
- name: data
persistentVolumeClaim:
claimName: test-pvc
EOF
# Check volume status
kubectl get volumes.longhorn.io -n longhorn-system
# Describe volume to see replica distribution
kubectl describe volume -n longhorn-system pvc-xxx
# Expected: replicas distributed across 3 nodes
# Simulate node failure (cordoning)
kubectl cordon k3s-worker-01
# Verify volume remains accessible
kubectl exec -n test-storage test-pod -- ls /data
# Uncordon node
kubectl uncordon k3s-worker-01
# Delete test namespace
kubectl delete namespace test-storage
# Verify all volumes cleaned up
kubectl get volumes.longhorn.io -n longhorn-system
If you have NFS or S3 storage for backups:
# Example: NFS backup target
kubectl patch -n longhorn-system settings.longhorn.io backup-target --type=merge -p '{"value":"nfs://192.168.1.100:/backups/longhorn"}'
# Example: S3 backup target
kubectl patch -n longhorn-system settings.longhorn.io backup-target --type=merge -p '{"value":"s3://backup-bucket@region/"}'
# For S3, create secret
kubectl create secret generic -n longhorn-system longhorn-backup-secret \
--from-literal=AWS_ACCESS_KEY_ID=your-key \
--from-literal=AWS_SECRET_ACCESS_KEY=your-secret \
--from-literal=AWS_ENDPOINTS=https://s3.amazonaws.com
# Update settings
kubectl patch -n longhorn-system settings.longhorn.io backup-target-credential-secret --type=merge -p '{"value":"longhorn-backup-secret"}'
Issue: Longhorn pods stuck in Pending state
Solution:
# Describe pod to see issue
kubectl describe pod -n longhorn-system <pod-name>
# Common issue: node selector mismatch
kubectl get nodes --show-labels
# Fix: Add required labels to nodes
kubectl label nodes k3s-master-01 node.longhorn.io/create-default-disk=true
Issue: PVC stuck in Pending state
Solution:
# Check PVC events
kubectl describe pvc <pvc-name>
# Check Longhorn engine logs
kubectl logs -n longhorn-system -l app=longhorn-manager -f
# Verify node disk space
kubectl get nodes -o custom-columns=NAME:.metadata.name,DISK:.status.capacity[ephemeral-storage]
Issue: Replicas not distributed evenly
Solution:
# Check node disk capacity
kubectl get nodes.longhorn.io -n longhorn-system
# For per-disk capacity details, check the Node page in the Longhorn UI or:
kubectl describe nodes.longhorn.io -n longhorn-system
Longhorn storage installed and configured:
➡️ Continue to Guide 07: GitOps Stack
- Node storage verified
- iSCSI and NFS clients installed
- Longhorn Helm repository added
- Longhorn installed
- All Longhorn pods running
- Storage classes created
- Longhorn UI accessible
- Admin password retrieved
- Test PVC created
- Test pod deployed
- Volume replication verified
- Failover tested
- Test resources cleaned up
- Backup configured (optional)
Deploy a self-hosted Git platform and GitOps controller for declarative infrastructure management.
This guide deploys Gitea (self-hosted Git) and ArgoCD (GitOps controller) to enable Infrastructure as Code workflows with declarative Kubernetes management.
Time Required: ~45 minutes Prerequisites: Guide 06 completed, Longhorn storage available
┌─────────────────────────────────────────────────────────────┐
│ GitOps Flow │
├─────────────────────────────────────────────────────────────┤
│ │
│ Developer Gitea ArgoCD │
│ (local) (Git) (Controller) │
│ │ │ │ │
│ ├─ git push ──────►│ │ │
│ │ ├─ webhook ───────►│ │
│ │ │ │ │
│ │ │◄── fetch ─────────┤ │
│ │ │ │ │
│ │ │ ├─ sync ───────►│
│ │ │ │ K8s │
│ │ │ │ │
│ ┌──┴───────────────────┴───────────────────┴────────┐ │
│ │ kubectl get pods -A │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ Services: │
│ ├── Gitea (git.homelab.local) │
│ ├── ArgoCD (argocd.homelab.local) │
│ └── Repositories: │
│ ├── gitops-apps │
│ ├── gitops-infrastructure │
│ └── ansible-playbooks │
└─────────────────────────────────────────────────────────────┘
kubectl create namespace postgresql
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
# Store passwords in a values file — do NOT inline credentials in shell history
# Create gitops-apps/infrastructure/postgresql-values.yaml (add to .gitignore!)
cat > /tmp/postgresql-values.yaml <<'EOF'
auth:
enablePostgresUser: true
postgresPassword: "<set-a-strong-password>"
database: gitea
username: gitea
password: "<set-a-strong-password>"
primary:
persistence:
enabled: true
storageClass: longhorn-critical
size: 10Gi
EOF
helm install postgresql bitnami/postgresql \
--namespace postgresql \
--create-namespace \
--values /tmp/postgresql-values.yaml \
--wait
# Get PostgreSQL password
export POSTGRES_PASSWORD=$(kubectl get secret -n postgresql postgresql -o jsonpath="{.data.postgres-password}" | base64 -d)
# Get connection details
POSTGRES_HOST="postgresql.postgresql.svc.cluster.local"
POSTGRES_PORT="5432"
POSTGRES_USER="postgres"
POSTGRES_DB="gitea"
echo "PostgreSQL connection string:"
echo "postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@${POSTGRES_HOST}:${POSTGRES_PORT}/${POSTGRES_DB}"kubectl create namespace giteahelm repo add gitea-charts https://dl.gitea.com/charts/
helm repo update
helm install gitea gitea-charts/gitea \
--namespace gitea \
--set gitea.config.server.DOMAIN=git.homelab.local \
--set gitea.config.server.ROOT_URL=https://git.homelab.local \
--set gitea.config.server.SSH_DOMAIN=git.homelab.local \
--set gitea.config.server.SSH_PORT=2222 \
--set gitea.config.database.DB_TYPE=postgres \
--set gitea.config.database.HOST=${POSTGRES_HOST}:${POSTGRES_PORT} \
--set gitea.config.database.NAME=${POSTGRES_DB} \
--set gitea.config.database.USER=${POSTGRES_USER} \
--set gitea.config.database.PASSWD=${POSTGRES_PASSWORD} \
--set gitea.admin.username=admin \
--set gitea.admin.password=ChangeMe!123 \
--set gitea.admin.email=admin@homelab.local \
--set persistence.enabled=true \
--set persistence.storageClass=longhorn-critical \
--set persistence.size=10Gi \
--set service.ssh.type=LoadBalancer \
--set service.ssh.ports.ssh=2222 \
--set ingress.enabled=true \
--set ingress.className=nginx \
--set ingress.hosts[0].host=git.homelab.local \
--set ingress.hosts[0].paths[0].path=/ \
--set ingress.hosts[0].paths[0].pathType=Prefix \
--set ingress.tls=true \
--set ingress.tls[0].hosts[0]=git.homelab.local \
--set ingress.tls[0].secretName=git-homelab-tls \
--wait
If ingress-nginx is not yet installed, expose the service via NodePort for now:
# Patch service to NodePort for now
kubectl patch svc gitea-http -n gitea -p '{"spec":{"type":"NodePort","ports":[{"port":3000,"targetPort":3000,"nodePort":30080}],"selector":{"app.kubernetes.io/name":"gitea"}}}'
# Access Gitea
# http://10.10.10.10:30080
- Open browser: http://10.10.10.10:30080 (or via ingress if configured)
- Login with admin credentials:
  - Username: admin
  - Password: ChangeMe!123
Via Gitea UI:
- Navigate to Organizations > Create Organization
- Name: homelab
- Visibility: Private
- Create
Create the following repositories in the homelab organization:
| Repository | Description | Visibility |
|---|---|---|
| gitops-apps | Application manifests | Private |
| gitops-infrastructure | Infrastructure manifests | Private |
| terraform-proxmox | Terraform code | Private |
| ansible-playbooks | Ansible playbooks | Private |
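If you prefer the API over clicking through the UI, the same repositories can be created with curl once you have the admin token generated in the next step (the endpoint and payload follow the Gitea v1 API; use the NodePort URL instead of git.homelab.local if the ingress is not reachable yet):
# Create each repository under the homelab organization (requires GITEA_TOKEN, see below)
for repo in gitops-apps gitops-infrastructure terraform-proxmox ansible-playbooks; do
  curl -k -X POST "https://git.homelab.local/api/v1/orgs/homelab/repos" \
    -H "Authorization: token ${GITEA_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "{\"name\": \"${repo}\", \"private\": true}"
done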
# Generate personal access token in Gitea:
# User Settings > Applications > Generate Token
# Store token securely
echo "export GITEA_TOKEN=your_token_here" >> ~/.zshrc
source ~/.zshrc
kubectl create namespace argocd
helm install argocd argo/argo-cd \
--namespace argocd \
--set server.service.type=NodePort \
--set server.service.nodePortHttp=30081 \
--set server.service.nodePortHttps=30443 \
--wait
# After install, set a strong admin password via argocd CLI:
# argocd admin initial-password -n argocd ← get the auto-generated initial password
# argocd login 10.10.10.10:30081
# argocd account update-password
# Port forward to access UI
kubectl port-forward -n argocd svc/argocd-server 8080:443
# Access at: https://localhost:8080
# Accept self-signed certificate
# Username: admin
# Password: auto-generated on install — retrieve it with:
#   kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d
Change password on first login.
Via ArgoCD UI:
- Navigate to Settings > Repositories
- Click Connect Repo
- Select Git
- Enter details:
  - Repository URL: https://git.homelab.local/homelab/gitops-apps.git
  - Username: admin
  - Password: <Gitea password or token>
  - Skip server verification: true
- Click Connect
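The same connection can be made non-interactively with the argocd CLI — a sketch, assuming you are already logged in via argocd login as shown in the install step:
# CLI equivalent of the UI steps above
argocd repo add https://git.homelab.local/homelab/gitops-apps.git \
  --username admin \
  --password "$GITEA_TOKEN" \
  --insecure-skip-server-verification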
Create project manifest gitops-apps/argocd-apps/projects.yaml:
---
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
name: homelab
namespace: argocd
spec:
description: Homelab Project
sourceRepos:
- '*'
destinations:
- namespace: '*'
server: https://kubernetes.default.svc
clusterResourceWhitelist:
- group: '*'
kind: '*'
namespaceResourceWhitelist:
- group: '*'
kind: '*'
orphanedResources:
warn: false
Apply the project:
kubectl apply -f gitops-apps/argocd-apps/projects.yaml
Create gitops-apps/argocd-apps/root-application.yaml:
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: root-app
namespace: argocd
finalizers:
- resources-finalizer.argocd.argoproj.io
spec:
project: homelab
source:
repoURL: https://git.homelab.local/homelab/gitops-apps.git
targetRevision: main
path: argocd-apps
destination:
server: https://kubernetes.default.svc
namespace: argocd
syncPolicy:
automated:
prune: true
selfHeal: true
allowEmpty: false
syncOptions:
- CreateNamespace=true
- PruneLast=true
cd /Volumes/Codex/Projects/homelab/gitops-apps
mkdir -p argocd-apps/{infrastructure,services,monitoring,security}
Create argocd-apps/infrastructure/longhorn.yaml:
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: longhorn
namespace: argocd
spec:
project: homelab
source:
repoURL: https://git.homelab.local/homelab/gitops-apps.git
targetRevision: main
path: infrastructure/longhorn
destination:
server: https://kubernetes.default.svc
namespace: longhorn-system
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
# Initialize git repository
cd /Volumes/Codex/Projects/homelab/gitops-apps
git init
git add .
git commit -m "Initial ArgoCD apps structure"
# Add Gitea remote
git remote add origin https://git.homelab.local/homelab/gitops-apps.git
# Push (will prompt for credentials)
git push -u origin main
kubectl apply -f gitops-apps/argocd-apps/root-application.yaml
# Create a test application
cat > gitops-apps/services/test-nginx.yaml << EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: test-nginx
namespace: argocd
spec:
project: homelab
source:
repoURL: https://git.homelab.local/homelab/gitops-apps.git
targetRevision: main
path: apps/test
destination:
server: https://kubernetes.default.svc
namespace: test
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
EOF
# Create test deployment
mkdir -p gitops-apps/apps/test
cat > gitops-apps/apps/test/deployment.yaml << EOF
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx
spec:
replicas: 1
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:latest
ports:
- containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
name: nginx
spec:
selector:
app: nginx
ports:
- port: 80
targetPort: 80
EOF
# Push to git
git add gitops-apps/services/test-nginx.yaml gitops-apps/apps/test/
git commit -m "Add test application"
git push
# Watch ArgoCD sync
kubectl get application -n argocd -w
Issue: Repository connection fails
Solution:
# Verify Gitea is accessible
curl -k https://git.homelab.local/api/v1/version
# Check ArgoCD repo server logs
kubectl logs -n argocd -l app.kubernetes.io/name=argocd-repo-server -f
# Verify credentials
echo "https://admin:password@git.homelab.local" | base64Issue: Gitea can't connect to PostgreSQL
Solution:
# Check PostgreSQL is running
kubectl get pods -n postgresql
# Check Gitea logs
kubectl logs -n gitea -l app.kubernetes.io/name=gitea -f
# Verify connection string
kubectl get secret -n gitea gitea -o jsonpath='{.data\.gitea\.config}' | base64 -d
GitOps stack deployed:
➡️ Continue to Guide 08: Security Tooling
- PostgreSQL deployed
- PostgreSQL connection verified
- Gitea namespace created
- Gitea deployed via Helm
- Gitea accessible via browser
- Admin user created
- Organization created
- Repositories created
- Git credentials configured
- ArgoCD namespace created
- ArgoCD deployed via Helm
- ArgoCD UI accessible
- Admin password retrieved
- ArgoCD connected to Gitea
- ArgoCD projects created
- App of Apps pattern configured
- Test application deployed
- GitOps workflow verified
Deploy comprehensive security tooling including Vault, Falco, Trivy, and Kyverno.
This guide installs a complete security stack for your homelab, covering secrets management, runtime security, vulnerability scanning, and policy enforcement.
Time Required: ~60 minutes Prerequisites: Guide 07 completed, ArgoCD running
Security Stack
┌────────────────────────────────────────────┐
│ │
│ ┌───────────┐ ┌──────────────┐ │
│ │ Vault │ │ Falco │ │
│ │ Secrets │ │ Runtime │ │
│ │ Mgmt │ │ Security │ │
│ └───────────┘ └──────────────┘ │
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Trivy │ │ Kyverno │ │
│ │ Vulnerability│ │ Policy │ │
│ │ Scanner │ │ Engine │ │
│ └──────────────┘ └──────────────┘ │
│ │
│ Features: │
│ - Secrets injection via Vault │
│ - Real-time threat detection (Falco) │
│ - Image and manifest scanning (Trivy) │
│ - Policy enforcement (Kyverno) │
└────────────────────────────────────────────┘
kubectl create namespace vault
helm repo add hashicorp https://helm.releases.hashicorp.com
helm repo update
helm install vault hashicorp/vault \
--namespace vault \
--set server.dev.enabled=false \
--set server.standalone.enabled=true \
--set ui.enabled=true \
--set ui.serviceType=NodePort \
--set ui.serviceNodePort=30820 \
--set injector.enabled=true \
--set server.dataStorage.enabled=true \
--set server.dataStorage.storageClass=longhorn-critical \
--set server.dataStorage.size=5Gi \
--set 'server.standalone.config=storage "file" { path = "/vault/data" }
listener "tcp" {
address = "0.0.0.0:8200"
tls_disable = true
}
api_addr = "http://0.0.0.0:8200"
disable_mlock = true' \
--wait
# Port forward to access Vault
kubectl port-forward -n vault svc/vault 8200:8200 &
export VAULT_ADDR='http://127.0.0.1:8200'
# Initialise Vault (first time only — save ALL output securely!)
vault operator init -key-shares=5 -key-threshold=3
# Unseal using 3 of the 5 unseal keys returned above:
vault operator unseal # repeat 3 times with different keys
# Log in with the root token returned by init:
export VAULT_TOKEN='<root-token-from-init>'
# Verify
vault status
# Enable KV secrets engine
vault secrets enable -path=homelab kv-v2
# Create test secret
vault kv put homelab/test username=admin password=ChangeMe!
# Enable Kubernetes auth
vault auth enable kubernetes
# Configure Kubernetes auth
vault write auth/kubernetes/config \
kubernetes_host="https://kubernetes.default.svc:443"
# Create policy for applications
vault policy write homelab-apps - << EOF
path "homelab/data/*" {
capabilities = ["read"]
}
EOF
# Create role for application
vault write auth/kubernetes/role/homelab-apps \
bound_service_account_names="*" \
bound_service_account_namespaces="*" \
policies=homelab-apps \
ttl=24h
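With the Kubernetes auth role in place, any pod can request the test secret through injector annotations. A minimal sketch (the demo-app Deployment is hypothetical; the annotation names are the standard vault.hashicorp.com agent-injector annotations, and the role and secret path match the ones created above):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
      annotations:
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/role: "homelab-apps"
        vault.hashicorp.com/agent-inject-secret-test: "homelab/data/test"
    spec:
      serviceAccountName: default
      containers:
      - name: app
        image: nginx:1.25
        # The rendered secret is mounted at /vault/secrets/test inside the container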
Create gitops-apps/security/vault.yaml:
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: vault
namespace: argocd
spec:
project: homelab
source:
repoURL: https://git.homelab.local/homelab/gitops-apps.git
targetRevision: main
path: security/vault
destination:
server: https://kubernetes.default.svc
namespace: vault
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
kubectl create namespace falco
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update
helm install falco falcosecurity/falco \
--namespace falco \
--set driver.kind=ebpf \
--set tty=true \
--set falco.jsonOutput=true \
--set falco.jsonIncludeOutputProperty=true \
--set falco.logLevel=info \
--set falco.priority=debug \
--wait
Create config map gitops-apps/security/falco/rules.yaml:
---
apiVersion: v1
kind: ConfigMap
metadata:
name: falco-custom-rules
namespace: falco
data:
homelab_rules.yaml: |
- rule: Homelab Shell in Container
desc: Detect shell spawned in container (homelab-specific)
condition: >
spawned_process
and container
and shell_procs
and not user_expected_shell_spawn
output: >
Shell spawned in container (user=%user.name container_name=%container.name
shell=%proc.name parent=%proc.pname cmdline=%proc.cmdline image=%container.image.repository)
priority: WARNING
tags: [shell, container]
Apply rules:
kubectl apply -f gitops-apps/security/falco/rules.yaml
# Trigger Falco event
kubectl run test-shell --image=nginx:latest --restart=Never -i -- sh -c "whoami"
# Check Falco logs
kubectl logs -n falco -l app.kubernetes.io/name=falco -f
# Cleanup
kubectl delete pod test-shell
kubectl create namespace trivy-system
helm repo add aqua https://aquasecurity.github.io/helm-charts/
helm repo update
helm install trivy-operator aqua/trivy-operator \
--namespace trivy-system \
--set serviceMonitor.enabled=true \
--set trivy.enabled=true \
--set trivy.image.repository=ghcr.io/aquasecurity/trivy \
--set trivy.image.tag=latest \
--set trivy.server.enabled=false \
--set trivy.dbRepository=ghcr.io/aquasecurity/trivy-db \
--set operator.builtInTrivyServer=false \
--wait
# Trivy operator scans deployed workloads automatically; optionally schedule a compliance report:
cat << EOF | kubectl apply -f -
apiVersion: aquasecurity.github.io/v1alpha1
kind: ClusterComplianceReport
metadata:
name: homelab-compliance
spec:
cron: "0 0 * * *"
reportType: summary
format: json
compliance:
checks:
- id: AVD-KSV-0015
severity: HIGH
- id: AVD-KSV-0016
severity: MEDIUM
EOF
# List vulnerability reports
kubectl get vulnerabilityreports -A
# Describe specific report
# Reports are created in the namespace of the scanned workload, not in trivy-system
kubectl describe vulnerabilityreports -n <workload-namespace> <report-name>
kubectl create namespace kyverno
helm repo add kyverno https://kyverno.github.io/kyverno/
helm repo update
helm install kyverno kyverno/kyverno \
--namespace kyverno \
--set replicaCount=1 \
--set initContainer.resources.limits.memory=500Mi \
--wait
Create gitops-apps/security/kyverno/policies.yaml:
---
# Policy: Disallow privileged containers
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: disallow-privileged
spec:
validationFailureAction: enforce
background: true
rules:
- name: validate-privileged
match:
resources:
kinds:
- Pod
validate:
message: "Privileged mode is not allowed"
pattern:
spec:
=(containers):
- =(securityContext):
=(privileged): false
---
# Policy: Require resource limits
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-limits
spec:
validationFailureAction: enforce
background: true
rules:
- name: validate-resources
match:
resources:
kinds:
- Pod
validate:
message: "CPU and memory resource limits are required"
pattern:
spec:
=(containers):
- resources:
limits:
memory: "?*"
cpu: "?*"
---
# Policy: Disallow latest tag
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: disallow-latest-tag
spec:
validationFailureAction: enforce
background: true
rules:
- name: validate-image-tag
match:
resources:
kinds:
- Pod
validate:
message: "Using the ':latest' tag is not allowed"
foreach:
- list: request.object.spec.containers
pattern:
image: "!*:latest"
---
# Policy: Auto-generate default-deny NetworkPolicy for production namespaces
# This generate rule fires when a Namespace with label environment=production is created.
# It creates a default-deny-ingress NetworkPolicy in that namespace automatically.
# Namespaces in the exclusion list are skipped.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: add-default-deny-networkpolicy
annotations:
policies.kyverno.io/title: Add Default-Deny NetworkPolicy
policies.kyverno.io/description: >
Automatically generate a default-deny-ingress NetworkPolicy in every
namespace labelled 'environment: production'. This ensures all ingress
is blocked unless explicitly allowed by another NetworkPolicy.
spec:
rules:
- name: generate-default-deny
match:
any:
- resources:
kinds:
- Namespace
selector:
matchLabels:
environment: production
exclude:
any:
- resources:
namespaces:
- kube-system
- kyverno
- falco
- vault
- argocd
- longhorn-system
- monitoring
- logging
generate:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
name: default-deny-ingress
namespace: "{{request.object.metadata.name}}"
synchronize: true
data:
spec:
podSelector: {}
policyTypes:
- Ingress
Apply policies:
kubectl apply -f gitops-apps/security/kyverno/policies.yaml
# Test privileged container policy (should fail)
kubectl run test-privileged --image=nginx:latest --privileged --restart=Never
# Expected: Error from server (Forbidden): admission webhook denied the request
# Test latest tag policy (should fail)
kubectl run test-latest --image=nginx:latest --restart=Never
# Expected: Error from server (Forbidden): admission webhook denied the request
# Test with a proper image tag (passes the tag policy; note the require-limits policy
# will still reject pods created without resource limits, so expect a denial unless limits are set via a manifest)
kubectl run test-proper --image=nginx:1.25 --restart=Never
# Cleanup (only needed if the pod was admitted)
kubectl delete pod test-proper --ignore-not-found
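The generate policy can be exercised the same way. A quick check, using a throwaway namespace (prod-policy-test is only for this test): a namespace created with the environment=production label should automatically receive the default-deny-ingress NetworkPolicy.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: prod-policy-test
  labels:
    environment: production
EOF
kubectl get networkpolicy -n prod-policy-test
# Expected: default-deny-ingress
kubectl delete namespace prod-policy-test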
kubectl create namespace ingress-nginx
# Add the ingress-nginx Helm repo (not added earlier)
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx \
--namespace ingress-nginx \
--set controller.service.type=NodePort \
--set controller.service.nodePorts.http=30080 \
--set controller.service.nodePorts.https=30443 \
  --wait
# Note: NodePorts 30080/30443 here collide with the temporary Gitea NodePort patch and the
# ArgoCD HTTPS NodePort from Guide 07 — revert or change those before installing ingress-nginx.

| Tool | URL | Credentials |
|---|---|---|
| Vault | http://10.10.10.10:30820 (NodePort) or port-forward to :8200 | Root token from vault operator init |
| Falco Logs | kubectl logs -n falco -l app.kubernetes.io/name=falco | - |
| Trivy Reports | kubectl get vulnerabilityreports -A | - |
| Kyverno Policies | kubectl get clusterpolicies | - |
Create application manifests for all security tools in gitops-apps/security/:
# Create applications directory
mkdir -p gitops-apps/security/{vault,falco,trivy,kyverno}
# Create ArgoCD applications for each security tool
# (Similar to Vault example in Phase 1.5)
Push to Gitea:
git add gitops-apps/security/
git commit -m "Add security tooling applications"
git push
Issue: Vault stuck in pending state
Solution:
# Check Vault logs
kubectl logs -n vault -l app.kubernetes.io/name=vault -f
# Check storage
kubectl get pvc -n vault
# Restart Vault
kubectl rollout restart statefulset vault -n vault
Issue: No events in Falco logs
Solution:
# Check Falco status
kubectl get pods -n falco
# Falco was installed with the eBPF driver, so no 'falco' kernel module will appear in lsmod.
# Confirm the driver/engine from the logs instead:
kubectl logs -n falco -l app.kubernetes.io/name=falco | grep -iE "ebpf|driver"
# Check Falco version (kubectl exec needs a resource name, not a label selector)
kubectl exec -n falco daemonset/falco -- falco --version
Issue: Trivy scans stuck in running state
Solution:
# Check Trivy operator logs
kubectl logs -n trivy-system -l app.kubernetes.io/name=trivy-operator -f
# Check available resources
kubectl top nodes
kubectl top pods -n trivy-system
# Increase the scan timeout (trivy-operator reads it from the trivy-operator-trivy-config ConfigMap)
kubectl patch configmap trivy-operator-trivy-config -n trivy-system \
  --type=merge -p '{"data":{"trivy.timeout":"10m0s"}}'
Security tooling deployed:
➡️ Continue to Guide 09: Red/Blue Team
- Vault namespace created
- Vault deployed
- Vault initialized and unsealed
- KV secrets engine enabled
- Kubernetes auth configured
- Vault policies created
- Falco namespace created
- Falco installed with eBPF driver
- Custom Falco rules applied
- Falco tested with shell event
- Trivy namespace created
- Trivy operator installed
- Scan jobs created
- Vulnerability reports accessible
- Kyverno namespace created
- Kyverno installed
- Baseline policies applied
- Policies tested
- Ingress controller installed
- All tools accessible
- ArgoCD applications created
- Security stack verified
Deploy isolated security sandbox environments for red team (attack) and blue team (defense) exercises.
This guide creates isolated network segments for security testing, including attack tools, vulnerable targets, and defensive monitoring infrastructure.
Time Required: ~45 minutes Prerequisites: Guide 08 completed, pfSense configured with VNet2
┌─────────────────────────────────────────────────────────────┐
│ Security Sandbox Network │
│ (VNet2: 10.20.20.0/24) │
├─────────────────────────────────────────────────────────────┤
│ │
│ Red Team (Attack) Blue Team (Defense) │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Kali Linux │ │ Security │ │
│ │ 10.20.20.10 │ │ Onion │ │
│ │ │ │ (Wazuh) │ │
│ │ Tools: │ │ │ │
│ │ - Metasploit│ │ ELK Stack │ │
│ │ - Burp │ │ IDS/IPS │ │
│ │ - Nmap │ │ Log Analysis │ │
│ └──────────────┘ └──────────────┘ │
│ │ │ │
│ │ Vulnerable Targets │ │
│ │ ┌──────────────────┐ │ │
│ └───►│ Metasploitable 3 │ │ │
│ │ DVWA │ │ │
│ │ Juice Shop │ │ │
│ └──────────────────┘ │ │
│ │ │
│ Firewall Rules: │ │
│ ❌ VNet2 → VNet1 (blocked) │ │
│ ✅ VNet2 → WAN (allowed) │ │
│ ⚠️ VNet1 → VNet2 (restricted) │ │
└─────────────────────────────────────────────────────────────┘
Via Proxmox console:
# Open console for pfsense-router VM
# Default credentials: admin / pfsense
Or via web UI:
http://10.10.10.1
Via pfSense Web UI:
- Navigate to Interfaces > OPT1
- Enable interface
- Configure:
  - Name: SANDBOX
  - IPv4: 10.20.20.1/24
  - Gateway: None
- Save and Apply
Navigate to Firewall > Rules > SANDBOX
Add rules:
| Action | Interface | Source | Destination | Port | Description |
|---|---|---|---|---|---|
| Block | SANDBOX | * | Homelab Net | * | Block access to production |
| Pass | SANDBOX | * | WAN | * | Allow internet |
| Pass | Homelab Net | Mgmt IPs | SANDBOX | * | Allow management access |
Critical: The first rule blocks all traffic from VNet2 to VNet1.
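Once a host exists in the sandbox (the Kali VM in the next step), a quick check from that side confirms the rule is actually enforced — both probes below should fail:
# Run from the Kali VM (10.20.20.10)
nmap -Pn -p 22,6443 10.10.10.10   # K3s master ports should show filtered
ping -c 3 10.10.10.2              # AdGuard on VNet1 should be unreachable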
This requires manual installation via ISO:
# Get the current Kali Linux installer ISO URL
# Check https://www.kali.org/get-kali/#kali-installer-images for the latest version
# Example (update the version before running):
KALI_VERSION=$(curl -s https://cdimage.kali.org/current/ | grep -oP 'kali-linux-[0-9.]+' | head -1)
INSTALLER_ISO="kali-linux-${KALI_VERSION}-installer-amd64.iso"
# Download the installer ISO (NOT the vmware or live variant — Proxmox needs the installer)
wget "https://cdimage.kali.org/current/${INSTALLER_ISO}" -O /tmp/kali.iso
# Copy to pve-maul ISO storage
scp /tmp/kali.iso root@192.168.1.10:/var/lib/vz/template/iso/
# Create VM on pve-maul via CLI, attached to vnet-sandbox (VNet2)
qm create 8000 \
--name kali-linux \
--memory 4096 \
--cores 4 \
--net0 virtio,bridge=vnet-sandbox,tag=200 \
--scsihw virtio-scsi-pci \
--ide2 local:iso/${INSTALLER_ISO},media=cdrom \
--scsi0 local-lvm:80,format=raw \
--boot order=ide2 \
  --ostype l26
# Note: run qm create on the pve-maul host itself; qm does not accept a --node flag
# Start VM and complete Kali installation interactively
qm start 8000
# Access the console in Proxmox UI: Nodes → pve-maul → 8000 (kali-linux) → Console
# After installation completes:
qm stop 8000
# Optional: Convert to template for fast cloning
qm template 8000
Update Terraform to include red team VMs. Add to terraform/environments/homelab/main.tf:
# Red Team Infrastructure
# Kali Linux
module "kali_linux" {
source = "../../modules/vm"
name = "kali-linux"
target_node = "maul"
clone_template = "kali-template"
cores = 4
memory = 8192
cpu_type = "host"
network_bridge = "vnet-sandbox"
network_tag = 200
network_firewall = true
disk_size = "80G"
disk_storage = "local-lvm"
cloudinit_storage = "local-lvm"
ip_address = "10.20.20.10"
gateway = "10.20.20.1"
ssh_public_keys = [var.ssh_public_key]
onboot = false
tags = ["security", "red-team", "sandbox"]
}
# Parrot OS (alternative to Kali)
module "parrot_os" {
source = "../../modules/vm"
name = "parrot-os"
target_node = "maul"
clone_template = "parrot-template"
cores = 4
memory = 8192
cpu_type = "host"
network_bridge = "vnet-sandbox"
network_tag = 200
disk_size = "80G"
disk_storage = "local-lvm"
cloudinit_storage = "local-lvm"
ip_address = "10.20.20.11"
gateway = "10.20.20.1"
ssh_public_keys = [var.ssh_public_key]
onboot = false
tags = ["security", "red-team", "sandbox"]
}
cd terraform/environments/homelab
terraform plan -out=tfplan-redteam
terraform apply tfplan-redteam
The vulnerable targets run as Docker containers on the Kali VM (Metasploitable3 itself is Vagrant-based; see the note in the playbook). Create via Ansible:
Create ansible/playbooks/deploy-targets.yml:
---
# Deploy vulnerable targets for red team practice
# These run on the kali-linux VM inside the sandbox network (10.20.20.0/24)
# They are intentionally vulnerable — NEVER expose these to the internet.
- name: Deploy vulnerable targets
hosts: kali_linux
become: true
gather_facts: true
tasks:
- name: Install Docker and dependencies
ansible.builtin.apt:
name:
- docker.io
- docker-compose-plugin
- python3-docker
state: present
update_cache: true
- name: Start Docker service
ansible.builtin.systemd:
name: docker
state: started
enabled: true
# Option A: Metasploitable2 (Docker-native, always works)
# Container image: tleemcjr/metasploitable2
- name: Pull Metasploitable2 image
community.docker.docker_image:
name: tleemcjr/metasploitable2
source: pull
- name: Run Metasploitable2 container
community.docker.docker_container:
name: metasploitable2
image: tleemcjr/metasploitable2
ports:
- "21:21" # FTP
- "22:22" # SSH
- "80:80" # HTTP
- "3306:3306" # MySQL
restart_policy: unless-stopped
state: started
# Option B: DVWA (Damn Vulnerable Web Application)
- name: Run DVWA container
community.docker.docker_container:
name: dvwa
image: ghcr.io/digininja/dvwa:latest
ports:
- "8080:80"
env:
DB_SERVER: "dvwa-db"
restart_policy: unless-stopped
state: started
- name: Display target info
ansible.builtin.debug:
msg:
- "Metasploitable2: http://{{ ansible_host }} (ports 21, 22, 80, 3306)"
- "DVWA: http://{{ ansible_host }}:8080"
- "NOTE: For rapid7/metasploitable3 (Ubuntu-based), use Vagrant on a local machine:"
- " git clone https://github.com/rapid7/metasploitable3 && cd metasploitable3 && vagrant up"# Deploy DVWA in K3s cluster (isolated namespace)
kubectl create namespace dvwa
cat << EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
name: dvwa
namespace: dvwa
spec:
replicas: 1
selector:
matchLabels:
app: dvwa
template:
metadata:
labels:
app: dvwa
spec:
nodeSelector:
kubernetes.io/hostname: "k3s-worker-01"
containers:
- name: dvwa
image: vulnerables/web-dvwa
ports:
- containerPort: 80
env:
- name: RECAPTCHA_DISABLED
value: "true"
---
apiVersion: v1
kind: Service
metadata:
name: dvwa
namespace: dvwa
spec:
type: NodePort
selector:
app: dvwa
ports:
- port: 80
targetPort: 80
nodePort: 30880
EOF
# Access DVWA
# http://10.10.10.11:30880
# Default credentials: admin / password
kubectl create namespace juice-shop
cat << EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
name: juice-shop
namespace: juice-shop
spec:
replicas: 1
selector:
matchLabels:
app: juice-shop
template:
metadata:
labels:
app: juice-shop
spec:
nodeSelector:
kubernetes.io/hostname: "k3s-worker-01"
containers:
- name: juice-shop
image: bkimminich/juice-shop:latest
ports:
- containerPort: 3000
---
apiVersion: v1
kind: Service
metadata:
name: juice-shop
namespace: juice-shop
spec:
type: NodePort
selector:
app: juice-shop
ports:
- port: 3000
targetPort: 3000
nodePort: 30881
EOF
kubectl create namespace blue-team
helm repo add wazuh https://wazuh.github.io/wazuh-helm
helm repo update
helm install wazuh wazuh/wazuh \
--namespace blue-team \
--set wazuh-manager.enabled=true \
--set wazuh-indexer.enabled=true \
--set wazuh-dashboard.enabled=true \
--set wazuh-manager.persistence.enabled=true \
--set wazuh-manager.persistence.storageClass=longhorn-default \
--set wazuh-manager.persistence.size=20Gi \
--set wazuh-indexer.persistence.enabled=true \
--set wazuh-indexer.persistence.storageClass=longhorn-default \
--set wazuh-indexer.persistence.size=20Gi \
--wait
Access Wazuh Dashboard:
# Port forward
kubectl port-forward -n blue-team svc/wazuh-dashboard 5601:5601
# Access at: https://localhost:5601
# Default credentials: admin / admin
helm repo add elastic https://helm.elastic.co
helm repo update
helm install elasticsearch elastic/elasticsearch \
--namespace blue-team \
--set replicas=1 \
--set minimumMasterNodes=1 \
--set persistence.enabled=true \
--set persistence.storageClass=longhorn-default \
--set persistence.size=10Gi \
--wait
helm install kibana elastic/kibana \
--namespace blue-team \
--set replicas=1 \
--wait
Create gitops-apps/security/network-policies.yaml:
---
# Network policies for sandbox isolation
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: deny-sandbox-to-production
namespace: dvwa
spec:
podSelector: {}
policyTypes:
- Egress
egress:
- to:
- namespaceSelector:
matchLabels:
name: dvwa
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-blue-team-egress
namespace: blue-team
spec:
podSelector: {}
policyTypes:
- Egress
egress:
- to:
- podSelector: {}
- to:
- namespaceSelector: {}
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-blue-team-ingress
namespace: blue-team
spec:
podSelector: {}
policyTypes:
- Ingress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: blue-team
- from:
- namespaceSelector:
matchLabels:
name: argocd
Apply policies:
# Label namespaces
kubectl label namespace dvwa name=dvwa
kubectl label namespace juice-shop name=juice-shop
kubectl label namespace blue-team name=blue-team
# Apply policies
kubectl apply -f gitops-apps/security/network-policies.yaml
# From Kali Linux (should fail)
ssh -i ~/.ssh/homelab root@10.20.20.10 "ping -c 3 10.10.10.2"
# Expected: Packet filtered / timeout
# From production (should fail)
kubectl run test-sandbox --image=nicolaka/netshoot -n test --rm -it --restart=Never -- wget -O- http://10.20.20.10
# Expected: Connection refused
# Access DVWA
curl -I http://10.10.10.11:30880
# Access Juice Shop
curl -I http://10.10.10.11:30881
# Access Metasploitable3 (from Kali)
ssh -i ~/.ssh/homelab root@10.20.20.10 "curl -I http://localhost"
# Access Wazuh dashboard
# Navigate to Discover > wazuh-alerts-*
# Check for alerts from vulnerable targets
# Verify Falco events are being logged
# Delete test pods
kubectl delete pod test-sandbox -n test
# Delete vulnerable targets (when not in use)
kubectl delete namespace dvwa juice-shop
# Shut down red team VMs (when not in use) — qm takes the VMID, not the VM name
qm stop 8000            # kali-linux (VMID from the earlier qm create)
qm stop <parrot-vmid>   # parrot-os
Issue: Sandbox can still access production
Solution:
# Verify network policies
kubectl get networkpolicies -A
# Check pfSense rules
# Firewall > Rules > SANDBOX
# Verify VXLAN tag
ip link show | grep vxlan | grep 200
Issue: Can't access Wazuh UI
Solution:
# Check pod status
kubectl get pods -n blue-team
# Check logs
kubectl logs -n blue-team -l app=wazuh-dashboard -f
# Reset admin password
kubectl exec -it -n blue-team wazuh-dashboard-0 -- bash
# Inside: /usr/share/kibana/bin/kibana-setup-passwords
Red/Blue team infrastructure deployed:
➡️ Continue to Guide 10: Monitoring Stack
- pfSense OPT interface configured
- Firewall rules created (VNet2 → VNet1 blocked)
- Kali Linux template created
- Red team VMs deployed
- Metasploitable3 deployed
- DVWA deployed
- Juice Shop deployed
- Wazuh/ELK deployed
- Blue team namespace created
- Network policies applied
- Network isolation verified
- Vulnerable targets accessible
- Blue team logging verified
- Cleanup procedures documented
Deploy the full LGTM stack (Loki, Grafana, Tempo, Mimir/Metrics) with OpenTelemetry for enterprise-grade observability.
This guide implements a modern observability pipeline. Instead of apps talking directly to databases, everything sends data via the OpenTelemetry (OTel) Protocol to a central collector, which then routes it to the appropriate LGTM component.
Time Required: ~75 minutes Prerequisites: Guide 09 completed
Observability Pipeline
┌──────────────────────────────────────────────────────┐
│ Applications (Production / Development / Sandbox) │
└──────────┬───────────────────┬───────────────────────┘
│ (Metrics/Logs/Traces via OTLP)
▼
┌──────────────────────────────────────────────────────┐
│ OpenTelemetry (OTel) Collector │
│ (Processing, Batching, and Routing) │
└──────────┬────────┬──────────┬───────────────┬───────┘
│ │ │ │
┌──────▼───┐┌───▼────┐┌────▼─────┐ ┌──────▼──────┐
│Prometheus││ Loki ││ Tempo │ │ AlertManager│
│ (Metrics)││ (Logs) ││ (Traces) │ │ (Alerts) │
└──────┬───┘└───┬────┘└────┬─────┘ └──────┬──────┘
│ │ │ │
└────────┴────┬─────┴───────────────┘
▼
┌──────────────────┐
│ Grafana │
│ (Visualization) │
└──────────────────┘
kubectl create namespace monitoring
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClass=longhorn-default \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
--set grafana.adminPassword=admin \
--set grafana.service.type=NodePort \
--set grafana.service.nodePort=30090 \
--wait
Important
grafana/loki-stack is deprecated. Use the standalone grafana/loki chart (SingleBinary mode for homelab) with grafana/alloy as the log collector agent.
kubectl create namespace logging
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install loki grafana/loki \
--namespace logging \
--set loki.commonConfig.replication_factor=1 \
--set loki.storage.type=filesystem \
--set singleBinary.replicas=1 \
--set singleBinary.persistence.enabled=true \
--set singleBinary.persistence.storageClass=longhorn-default \
--set singleBinary.persistence.size=20Gi \
--wait
Alloy replaces the deprecated Promtail as the recommended log shipping agent:
helm install alloy grafana/alloy \
--namespace logging \
--set alloy.clustering.enabled=false \
--set controller.type=daemonset \
--wait
Verify Loki is running:
kubectl get pods -n logging # Expected: loki-0 Running, alloy-* Running on each node
helm install tempo grafana/tempo \
--namespace monitoring \
--set tempo.storage.trace.backend=local \
--set tempo.persistence.enabled=true \
--set tempo.persistence.storageClass=longhorn-default \
--set tempo.persistence.size=20Gi \
--wait
Verify Tempo is running:
kubectl get pods -n monitoring -l app.kubernetes.io/name=tempo
The OTel Collector is the single entry point for all telemetry data. It receives OTLP from applications and routes to the appropriate backend.
Create otel-values.yaml:
mode: deployment
config:
receivers:
otlp:
protocols:
grpc:
endpoint: "0.0.0.0:4317"
http:
endpoint: "0.0.0.0:4318"
exporters:
# Metrics: push to Prometheus via remote_write (Prometheus is pull-based;
# use prometheusremotewrite, NOT the 'prometheus' exporter which is a scrape endpoint only)
prometheusremotewrite:
endpoint: "http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/api/v1/write"
# Logs: push to Loki
loki:
endpoint: "http://loki.logging.svc.cluster.local:3100/loki/api/v1/push"
# Traces: forward to Tempo via OTLP gRPC
otlp/tempo:
endpoint: "tempo.monitoring.svc.cluster.local:4317"
tls:
insecure: true
service:
pipelines:
metrics:
receivers: [otlp]
exporters: [prometheusremotewrite]
logs:
receivers: [otlp]
exporters: [loki]
traces:
receivers: [otlp]
exporters: [otlp/tempo]
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install otel-collector open-telemetry/opentelemetry-collector \
--namespace monitoring \
--create-namespace \
-f otel-values.yaml \
--wait
Enable Prometheus remote_write (required for OTel metrics pipeline):
# Patch kube-prometheus-stack to enable the remote_write receiver
helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --reuse-values \
  --set prometheus.prometheusSpec.enableRemoteWriteReceiver=true
Access Grafana: http://10.10.10.10:30090 (K3s master node)
Add datasources in Grafana UI (Connections → Data Sources → Add new):
| Datasource | Type | URL |
|---|---|---|
| Prometheus | Prometheus | http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090 |
| Loki | Loki | http://loki.logging.svc.cluster.local:3100 |
| Tempo | Tempo | http://tempo.monitoring.svc.cluster.local:3200 ← port 3200, not 3100 |
Caution
Tempo's HTTP API is on port 3200. Port 3100 is Loki's push endpoint. Using the wrong port will cause "No data" in Grafana Explore.
Link traces to logs by configuring Tempo → Derived Fields → Loki in Grafana datasource settings.
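If you manage Grafana datasources declaratively rather than through the UI, a provisioning snippet along these lines can wire up the same correlation. This is a minimal sketch: the tracesToLogsV2 block and the loki UID are assumptions about your Grafana version and datasource naming, so adjust them to match your setup.
# datasources.yaml (Grafana provisioning sketch)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    uid: loki
    url: http://loki.logging.svc.cluster.local:3100
  - name: Tempo
    type: tempo
    url: http://tempo.monitoring.svc.cluster.local:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki        # must match the Loki datasource UID above
        spanStartTimeShift: "-5m"
        spanEndTimeShift: "5m"
        filterByTraceID: true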
# Verify datasource connectivity from Grafana pods
kubectl exec -n monitoring deploy/kube-prometheus-stack-grafana -- \
wget -qO- http://loki.logging.svc.cluster.local:3100/ready
# Expected: "ready"
kubectl exec -n monitoring deploy/kube-prometheus-stack-grafana -- \
wget -qO- http://tempo.monitoring.svc.cluster.local:3200/ready
# Expected: "ready"To replicate an enterprise setup, use Namespace Labels and Resource Quotas to simulate different environments on the same physical hardware:
| Simulated Environment | Namespace Prefix | Kyverno Mode | Storage Class |
|---|---|---|---|
| Production | prd-* | Enforce | longhorn-critical (3 replicas) |
| Development | dev-* | Audit | longhorn-default (2 replicas) |
| Sandbox (Maul) | external | None | local-lvm |
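As a concrete sketch of the pattern in the table, a production namespace can be labeled and capped with a ResourceQuota like the one below; the prd-alpha name and the quota sizes are illustrative, so tune them to your hardware.
# prd-alpha-namespace.yaml (illustrative)
apiVersion: v1
kind: Namespace
metadata:
  name: prd-alpha
  labels:
    environment: production
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: prd-alpha-quota
  namespace: prd-alpha
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    persistentvolumeclaims: "10"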
# Deploy a test telemetry app connected to the OTel Collector
kubectl run telemetry-test -n monitoring \
--image=ghcr.io/open-telemetry/opentelemetry-demo/productcatalogservice:latest \
--env="OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.monitoring.svc.cluster.local:4317" \
--env="OTEL_SERVICE_NAME=telemetry-test"
# Wait ~30 seconds, then check Grafana → Explore → Tempo
kubectl delete pod telemetry-test -n monitoring
kubectl logs -n monitoring -l app.kubernetes.io/name=opentelemetry-collector --tail=50
# Look for "exporter/otlp/tempo" errors- Confirm Tempo pod is
Running:kubectl get pods -n monitoring -l app.kubernetes.io/name=tempo - Verify Grafana datasource URL uses port 3200
kubectl logs -n logging -l app.kubernetes.io/name=alloy --tail=50
- Check Loki is healthy: kubectl get pods -n logging
# Verify remote_write receiver is enabled
kubectl get prometheus -n monitoring -o yaml | grep enableRemoteWriteReceiver
- Prometheus deployed (via kube-prometheus-stack)
- Loki deployed (standalone chart, SingleBinary mode)
- Grafana Alloy DaemonSet running on all nodes (log collection)
- Tempo deployed
- OTel Collector routing: metrics → Prometheus remote_write, logs → Loki, traces → Tempo
- Grafana datasources configured: Prometheus (9090), Loki (3100), Tempo (3200)
- Trace-to-log correlation configured in Grafana Tempo datasource
- Multi-environment namespace simulation configured
Implement enterprise-grade Single Sign-On (SSO) and Multi-Factor Authentication (MFA) across your homelab.
This guide integrates Authelia with an LLDAP (Lightweight LDAP) backend to protect lab services behind a unified login portal. It replicates a "Zero Trust" architecture where every request is authenticated before reaching the application.
Time Required: ~60 minutes
Prerequisites: Guide 07 completed (PostgreSQL and ArgoCD running), Guide 05 (ingress-nginx running)
User
│
▼
Ingress-Nginx ──► (forward-auth check) ──► Authelia (auth.homelab.local)
│ │
│ ┌─────────────────────────────────────────┤
│ │ Session Store (Redis) │
│ │ Storage (PostgreSQL) │
│ │ Identity (LLDAP → LDAP protocol) │
│ └─────────────────────────────────────────┘
│
▼
Protected Service (Grafana / ArgoCD / Gitea)
LLDAP is a lightweight LDAP server with a modern web UI — the recommended option for DevSecOps learning.
Alternative: For enterprise Active Directory experience, provision a Windows Server 2022 VM on
pve-vader, promote it to a Domain Controller (homelab.local), and create an authelia-bind service account. LLDAP requires no Windows licensing, and all steps below work with both.
Create gitops-apps/security/lldap.yaml:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: lldap
namespace: argocd
spec:
project: homelab
source:
repoURL: https://lldap.github.io/charts
chart: lldap
targetRevision: "*"
helm:
values: |
env:
LLDAP_LDAP_PORT: "3890"
LLDAP_HTTP_PORT: "17170"
LLDAP_LDAP_BASE_DN: "dc=homelab,dc=local"
LLDAP_JWT_SECRET:
valueFrom:
secretKeyRef:
name: lldap-secrets
key: jwt-secret
LLDAP_LDAP_USER_PASS:
valueFrom:
secretKeyRef:
name: lldap-secrets
key: admin-password
persistence:
enabled: true
storageClass: longhorn-default
size: 1Gi
service:
type: ClusterIP
destination:
server: https://kubernetes.default.svc
namespace: security
syncPolicy:
automated:
selfHeal: true
syncOptions:
- CreateNamespace=true
kubectl create namespace security
kubectl create secret generic lldap-secrets \
--namespace security \
--from-literal=jwt-secret=$(openssl rand -base64 32) \
--from-literal=admin-password=$(openssl rand -base64 16)
# Store these values in Vault too (recommended):
# vault kv put homelab/lldap jwt-secret=<value> admin-password=<value>
# Port-forward the LLDAP web UI
kubectl port-forward -n security svc/lldap 17170:17170
# Open http://localhost:17170
# Default admin user: admin / (the password from the secret above)
In the LLDAP web UI (a verification query follows the list):
- Create a group: homelab-admins
- Create a group: homelab-users
- Create your primary user and add them to homelab-admins
- Create an Authelia service account: authelia-bind (member of homelab-users only)
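To sanity-check the directory after creating these entries, a quick LDAP query (mirroring the verification step later in this guide) should list both groups. This assumes ldapsearch is available inside the LLDAP pod and uses the admin password from lldap-secrets:
ADMIN_PW=$(kubectl get secret lldap-secrets -n security -o jsonpath='{.data.admin-password}' | base64 -d)
kubectl exec -n security deploy/lldap -- ldapsearch -H ldap://localhost:3890 -x \
  -D "uid=admin,ou=people,dc=homelab,dc=local" -w "${ADMIN_PW}" \
  -b "ou=groups,dc=homelab,dc=local" "(objectClass=*)" cn
# Expected: entries for homelab-admins and homelab-users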
Important
Authelia requires Redis for session storage. Do not skip this step — Authelia will crash-loop without it.
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm install redis bitnami/redis \
--namespace security \
--set auth.enabled=false \
--set replica.replicaCount=1 \
--set master.persistence.enabled=true \
--set master.persistence.storageClass=longhorn-default \
--set master.persistence.size=2Gi \
--wait
# Verify Redis is running
kubectl get pods -n security -l app.kubernetes.io/name=redis
kubectl create secret generic authelia-secrets \
--namespace security \
--from-literal=jwt-secret=$(openssl rand -base64 32) \
--from-literal=session-secret=$(openssl rand -base64 32) \
--from-literal=storage-encryption-key=$(openssl rand -base64 32) \
--from-literal=ldap-password=$(kubectl get secret lldap-secrets -n security \
-o jsonpath='{.data.admin-password}' | base64 -d)
# Exec into the PostgreSQL pod and create the database
kubectl exec -n postgresql -it $(kubectl get pods -n postgresql -l app.kubernetes.io/name=postgresql -o name | head -1) -- \
psql -U postgres -c "CREATE DATABASE authelia;"
Create gitops-apps/security/authelia-values.yaml (add to .gitignore; it contains references to secrets):
domain: homelab.local
authentication_backend:
ldap:
implementation: custom
url: ldap://lldap.security.svc.cluster.local:3890
base_dn: dc=homelab,dc=local
username_attribute: uid
additional_users_dn: ou=people
users_filter: "(&({username_attribute}={input})(objectClass=person))"
additional_groups_dn: ou=groups
groups_filter: "(member={dn})"
group_name_attribute: cn
mail_attribute: mail
display_name_attribute: displayName
user: uid=authelia-bind,ou=people,dc=homelab,dc=local
password:
secret_name: authelia-secrets
secret_key: ldap-password
access_control:
default_policy: deny
rules:
# Allow unauthenticated access to the auth portal itself
- domain: auth.homelab.local
policy: bypass
# Management services require 2FA
- domain:
- argocd.homelab.local
- vault.homelab.local
policy: two_factor
subject: "group:homelab-admins"
# Internal tools require 1FA
- domain:
- grafana.homelab.local
- longhorn.homelab.local
- "*.homelab.local"
policy: one_factor
subject: "group:homelab-users"
session:
name: authelia_session
domain: homelab.local
same_site: lax
expiration: 1h
inactivity: 5m
redis:
host: redis-master.security.svc.cluster.local
port: 6379
storage:
postgres:
host: postgresql.postgresql.svc.cluster.local
port: 5432
database: authelia
schema: public
username: postgres
password:
secret_name: authelia-secrets
secret_key: storage-encryption-key
notifier:
disable_startup_check: true
filesystem:
filename: /tmp/authelia-notifications.txt
identity_providers:
oidc:
hmac_secret:
secret_name: authelia-secrets
secret_key: session-secret
issuer_private_key:
path: /config/oidc.key
clients:
- id: gitea
description: Gitea
secret: "$plaintext$<generate-with: openssl rand -hex 32>"
public: false
authorization_policy: one_factor
redirect_uris:
- https://git.homelab.local/user/oauth2/authelia/callback
scopes: [openid, profile, email, groups]
- id: argocd
description: ArgoCD
secret: "$plaintext$<generate-with: openssl rand -hex 32>"
public: false
authorization_policy: two_factor
redirect_uris:
- https://argocd.homelab.local/auth/callback
scopes: [openid, profile, email, groups]
helm repo add authelia https://charts.authelia.com
helm repo update
helm install authelia authelia/authelia \
--namespace security \
--values gitops-apps/security/authelia-values.yaml \
--set secret.existingSecret=authelia-secrets \
--wait
# Verify
kubectl get pods -n security -l app.kubernetes.io/name=authelia
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: authelia
namespace: security
annotations:
nginx.ingress.kubernetes.io/proxy-body-size: "0"
spec:
ingressClassName: nginx
rules:
- host: auth.homelab.local
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: authelia
port:
number: 80
EOF
Add these annotations to any Ingress object you want to protect with Authelia:
# Add to the Ingress metadata.annotations block of any service
annotations:
nginx.ingress.kubernetes.io/auth-url: "https://auth.homelab.local/api/verify"
nginx.ingress.kubernetes.io/auth-signin: "https://auth.homelab.local/?rd=$scheme://$host$request_uri"
nginx.ingress.kubernetes.io/auth-response-headers: "Remote-User,Remote-Groups,Remote-Email,Remote-Name"
nginx.ingress.kubernetes.io/proxy-buffer-size: "8k"
Example - protecting Grafana:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: grafana
namespace: monitoring
annotations:
nginx.ingress.kubernetes.io/auth-url: "https://auth.homelab.local/api/verify"
nginx.ingress.kubernetes.io/auth-signin: "https://auth.homelab.local/?rd=$scheme://$host$request_uri"
nginx.ingress.kubernetes.io/auth-response-headers: "Remote-User,Remote-Groups,Remote-Email,Remote-Name"
spec:
ingressClassName: nginx
rules:
- host: grafana.homelab.local
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: kube-prometheus-stack-grafana
port:
number: 80
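A quick smoke test of the forward-auth hook, assuming grafana.homelab.local resolves to the ingress controller and TLS is terminated there:
curl -skI https://grafana.homelab.local | head -n 5
# Expected: an HTTP 302 with a Location header pointing at https://auth.homelab.local/?rd=...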
In Gitea web UI → Site Administration → Authentication Sources → Add Authentication Source (a CLI alternative is sketched after the table):
| Field | Value |
|---|---|
| Type | OAuth2 |
| Name | Authelia |
| OAuth2 Provider | OpenID Connect |
| Client ID | gitea |
| Client Secret | (value from authelia-values.yaml) |
| OpenID Connect Auto-Discovery URL | https://auth.homelab.local/.well-known/openid-configuration |
| Additional Scopes | groups |
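If you prefer to script this instead of clicking through the UI, the Gitea admin CLI can create the same authentication source. Treat the exact flags as an assumption against your Gitea version, and note that the CLI usually has to run inside the Gitea pod as the git user:
kubectl exec -n services deploy/gitea -- gitea admin auth add-oauth \
  --name Authelia \
  --provider openidConnect \
  --key gitea \
  --secret "<gitea-client-secret-from-authelia-values.yaml>" \
  --auto-discover-url https://auth.homelab.local/.well-known/openid-configuration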
Patch ArgoCD ConfigMap to add Authelia as an OIDC provider:
kubectl patch configmap argocd-cm -n argocd --type=merge -p '{
"data": {
"oidc.config": "name: Authelia\nissuer: https://auth.homelab.local\nclientID: argocd\nclientSecret: $oidc.authelia.clientSecret\nrequestedScopes: [\"openid\", \"profile\", \"email\", \"groups\"]\n",
"url": "https://argocd.homelab.local"
}
}'
# Store the OIDC client secret in ArgoCD secret
kubectl patch secret argocd-secret -n argocd --type=merge \
-p '{"stringData": {"oidc.authelia.clientSecret": "<your-argocd-client-secret>"}}'
# Map the homelab-admins group to the ArgoCD admin role
kubectl patch configmap argocd-rbac-cm -n argocd --type=merge -p '{
"data": {
"policy.csv": "g, homelab-admins, role:admin\n",
"policy.default": "role:readonly"
}
}'
# 1. Check all security namespace pods are Running
kubectl get pods -n security
# 2. Test LLDAP is responding to LDAP queries
kubectl exec -n security deploy/lldap -- \
ldapsearch -H ldap://localhost:3890 -x -b "dc=homelab,dc=local" -D "uid=admin,ou=people,dc=homelab,dc=local" "(objectClass=person)" uid
# 3. Test Authelia health endpoint
kubectl exec -n security deploy/authelia -- \
wget -qO- http://localhost:9091/api/health
# Expected: {"status":"OK"}End-to-end browser test:
- Open a Private/Incognito window
- Navigate to https://argocd.homelab.local
- Confirm redirect to https://auth.homelab.local
- Log in with your LLDAP user credentials
- Complete TOTP (if two_factor policy applies)
- Confirm redirect back to ArgoCD with your user logged in
- Navigate to https://grafana.homelab.local and confirm the Authelia portal appears
- security namespace created
- LLDAP deployed and accessible via web UI (kubectl port-forward)
- LLDAP users created: primary user + homelab-admins group + authelia-bind service account
- Redis deployed in security namespace
- authelia-secrets Secret created with JWT, session, and storage-encryption keys
- authelia PostgreSQL database created
- Authelia deployed and all pods Running
- auth.homelab.local Ingress created and resolves correctly
- At least one service (Grafana) protected with forward-auth annotations
- OIDC configured in Gitea ("Login with Authelia" button visible)
- OIDC configured in ArgoCD (single sign-on working)
- End-to-end browser test passed (redirect → login → TOTP → access granted)
Secure the software delivery pipeline with Gitea Actions, secret scanning, vulnerability gates, and automated security checks.
This guide builds a shift-left security pipeline using Gitea Actions. Every push and PR triggers automated security scans — secret detection, dependency scanning, container scanning, IaC validation, and Kubernetes manifest checks — before code reaches ArgoCD for deployment.
Time Required: ~90 minutes
Prerequisites: Guide 07 (GitOps Stack) completed
CI/CD Security Pipeline
┌─────────────────────────────────────────────────┐
│ Developer Push / PR │
└──────────────────┬──────────────────────────────┘
▼
┌─────────────────────────────────────────────────┐
│ Gitea Actions Workflow │
│ │
│ Stage 1: Secret Scan (gitleaks) │
│ Stage 2: Lint & Validate (yamllint, kubeval) │
│ Stage 3: IaC Scan (checkov, tfsec) │
│ Stage 4: Dependency Scan (trivy fs) │
│ Stage 5: Container Scan (trivy image) │
│ Stage 6: K8s Manifest Validation (conftest) │
│ Stage 7: Deploy Gate (ArgoCD sync) │
└──────────────────┬──────────────────────────────┘
│ (all gates pass)
▼
┌─────────────────────────────────────────────────┐
│ ArgoCD GitOps Auto-Sync │
│ (only syncs if pipeline succeeded on branch) │
└─────────────────────────────────────────────────┘
Gitea has built-in CI/CD compatible with GitHub Actions syntax.
# Enable actions in Gitea config
# SSH into the Gitea pod or edit via ConfigMap
kubectl edit configmap -n services gitea-config
Add or update in app.ini:
[actions]
ENABLED = true
DEFAULT_ACTIONS_URL = https://gitea.com
Restart Gitea:
kubectl rollout restart deployment -n services gitea
kubectl rollout status deployment -n services gitea
# Via Gitea API
GITEA_TOKEN="your-admin-token"
GITEA_URL="http://gitea.services.svc.cluster.local:3000"
# Enable actions for the homelab repo
curl -X PUT "${GITEA_URL}/api/v1/repos/homelab/gitops-apps/actions/enable" \
-H "Authorization: token ${GITEA_TOKEN}"Or via UI: Repository → Settings → Actions → Enable.
kubectl create namespace cicd
kubectl label namespace cicd environment=cicd
Create runner-values.yaml:
replicaCount: 1
runner:
register: true
config: |
runner:
labels:
- "ubuntu-latest:docker://node:22-bookworm"
- "self-hosted:kubernetes"
# Register with Gitea instance
name: "homelab-runner"
token: "" # Set via --set flag
image:
repository: gitea/act_runner
tag: "0.2.11"
pullPolicy: IfNotPresent
resources:
requests:
cpu: 500m
memory: 256Mi
limits:
cpu: "2"
memory: 1Gi
persistence:
enabled: true
size: 10Gi
storageClass: longhorn-default
securityContext:
runAsNonRoot: true
runAsUser: 1000
readOnlyRootFilesystem: false
env:
GITEA_INSTANCE_URL: "http://gitea.services.svc.cluster.local:3000"
GITEA_RUNNER_LABELS: "ubuntu-latest:docker://node:22-bookworm,self-hosted:kubernetes"
Install runner:
helm repo add gitea-charts https://dl.gitea.io/charts/
helm repo update
# Get runner token from Gitea UI: Site Administration → Actions → Runners → Register New Runner
helm install act-runner gitea-charts/act-runner \
--namespace cicd \
--set runner.token="YOUR_RUNNER_TOKEN" \
-f runner-values.yaml \
--wait
Verify:
kubectl get pods -n cicd
# Expected: act-runner-0 Running
# On macOS workstation (gitleaks, tfsec, and trivy are not pip packages, so install them via brew)
pip3 install pre-commit yamllint
brew install gitleaks tfsec trivy
# Or install everything via brew
brew install pre-commit gitleaks tfsec trivy yamllint
Create .pre-commit-config.yaml in the repository root:
repos:
# Secret detection
- repo: https://github.com/gitleaks/gitleaks
rev: v8.21.2
hooks:
- id: gitleaks
# YAML linting
- repo: https://github.com/adrienverge/yamllint
rev: v1.35.1
hooks:
- id: yamllint
args: ['-d', '{extends: relaxed, rules: {line-length: {max: 120}}}']
# Terraform formatting and validation
- repo: https://github.com/antonbabenko/pre-commit-terraform
rev: v1.96.1
hooks:
- id: terraform_fmt
- id: terraform_validate
- id: terraform_tflint
- id: tfsec
args: ['--force-all-dirs', '--no-color']
# General checks
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v5.0.0
hooks:
- id: trailing-whitespace
- id: end-of-file-fixer
- id: check-yaml
args: ['--unsafe']
- id: check-json
- id: check-merge-conflict
- id: detect-private-key
- id: no-commit-to-branch
args: ['--branch', 'main']
cd /path/to/homelab
pre-commit install
pre-commit install --hook-type pre-push
# Run against all files to test
pre-commit run --all-files
mkdir -p gitops-apps/.gitea/workflows
Create gitops-apps/.gitea/workflows/security-pipeline.yaml:
name: Security Pipeline
on:
push:
branches: [main]
pull_request:
branches: [main]
env:
TRIVY_SEVERITY: "HIGH,CRITICAL"
TRIVY_EXIT_CODE: "1"
TRIVY_DB_REPOSITORY: "ghcr.io/aquasecurity/trivy-db"
jobs:
# ── Stage 1: Secret Scanning ──────────────────────
secret-scan:
name: "🔍 Secret Detection"
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Gitleaks Scan
uses: gitleaks/gitleaks-action@v2
env:
GITLEAKS_LICENSE: "" # Community edition
# ── Stage 2: Lint & Validate ─────────────────────
lint:
name: "📝 Lint & Validate"
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
- name: YAML Lint
run: |
pip install yamllint
yamllint -d '{extends: relaxed, rules: {line-length: {max: 120}}}' .
- name: Validate K8s Manifests
run: |
# Install kubeval
curl -L https://github.com/instrumenta/kubeval/releases/latest/download/kubeval-linux-amd64.tar.gz | tar xz
sudo mv kubeval /usr/local/bin/
find . -name '*.yaml' -o -name '*.yml' | grep -v '.gitea' | xargs kubeval --strict --ignore-missing-schemas
# ── Stage 3: IaC Security Scan ────────────────────
iac-scan:
name: "🏗️ IaC Security Scan"
runs-on: ubuntu-latest
needs: [lint]
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Run Checkov
run: |
pip install checkov
checkov -d . --framework terraform,kubernetes --skip-check CKV_K8S_21 \
--output cli --output junitxml --output-file-path console,checkov-results.xml
continue-on-error: false
- name: Run tfsec (Terraform only)
run: |
curl -L https://github.com/aquasecurity/tfsec/releases/latest/download/tfsec-linux-amd64 -o tfsec
chmod +x tfsec
./tfsec terraform/ --format junit --out tfsec-results.xml
continue-on-error: false
- name: Upload Scan Results
uses: actions/upload-artifact@v4
if: always()
with:
name: iac-scan-results
path: |
checkov-results.xml
tfsec-results.xml
# ── Stage 4: Dependency Scanning ──────────────────
dependency-scan:
name: "📦 Dependency Scan"
runs-on: ubuntu-latest
needs: [lint]
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Install Trivy
run: |
curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh | sh -s -- -b /usr/local/bin
- name: Filesystem Scan
run: |
trivy fs --severity HIGH,CRITICAL --exit-code 1 --format table .
- name: Generate SBOM
run: |
trivy fs --format spdx-json --output sbom.spdx.json .
continue-on-error: true
- name: Upload SBOM
uses: actions/upload-artifact@v4
if: always()
with:
name: dependency-scan-results
path: |
sbom.spdx.json
# ── Stage 5: Kubernetes Manifest Validation ───────
k8s-validate:
name: "☸️ K8s Manifest Validation"
runs-on: ubuntu-latest
needs: [lint]
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Conftest Policy Check
run: |
curl -L https://github.com/open-policy-agent/conftest/releases/latest/download/conftest_linux_amd64.tar.gz | tar xz
sudo mv conftest /usr/local/bin/
conftest test --policy policies/ gitops-apps/
# ── Stage 6: Security Gate Summary ────────────────
security-gate:
name: "🛡️ Security Gate"
runs-on: ubuntu-latest
needs: [secret-scan, iac-scan, dependency-scan, k8s-validate]
if: always()
steps:
- name: Check All Scans Passed
run: |
echo "Secret Scan: ${{ needs.secret-scan.result }}"
echo "IaC Scan: ${{ needs.iac-scan.result }}"
echo "Dependency Scan: ${{ needs.dependency-scan.result }}"
echo "K8s Validate: ${{ needs.k8s-validate.result }}"
if [[ "${{ needs.secret-scan.result }}" == "failure" || \
"${{ needs.iac-scan.result }}" == "failure" || \
"${{ needs.dependency-scan.result }}" == "failure" || \
"${{ needs.k8s-validate.result }}" == "failure" ]]; then
echo "::error::Security gate FAILED — one or more security scans detected issues"
exit 1
fi
echo "✅ All security gates passed"Create gitops-apps/.gitea/workflows/container-pipeline.yaml:
name: Container Security Pipeline
on:
push:
paths:
- 'container-images/**'
- 'Dockerfile*'
jobs:
build-and-scan:
name: "🔨 Build & Scan Container"
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Build Image
run: |
docker build -t homelab-app:${{ github.sha }} .
- name: Install Trivy
run: |
curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh | sh -s -- -b /usr/local/bin
- name: Container Vulnerability Scan
run: |
trivy image --severity HIGH,CRITICAL --exit-code 1 \
--format table \
--ignore-unfixed \
homelab-app:${{ github.sha }}
- name: Container Config Scan
run: |
trivy image --type config --severity HIGH,CRITICAL \
homelab-app:${{ github.sha }}
- name: Generate SBOM
run: |
trivy image --format spdx-json --output container-sbom.spdx.json \
homelab-app:${{ github.sha }}
- name: Generate SARIF Report
run: |
trivy image --format sarif --output trivy-results.sarif \
homelab-app:${{ github.sha }}
- name: Upload Results
uses: actions/upload-artifact@v4
if: always()
with:
name: container-scan-results
path: |
container-sbom.spdx.json
trivy-results.sarif
GITEA_URL="http://gitea.services.svc.cluster.local:3000"
GITEA_TOKEN="your-admin-token"
REPO="homelab/gitops-apps"
# Enable branch protection for main
curl -X PUT "${GITEA_URL}/api/v1/repos/${REPO}/branches/main/protection" \
-H "Authorization: token ${GITEA_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"block_on_official_review_requests": true,
"block_on_outdated_branch": true,
"block_on_rejected_reviews": true,
"dismiss_stale_reviews": true,
"enable_push": false,
"enable_status_check": true,
"required_approvals_count": 1,
"status_check_contexts": [
"Security Pipeline / 🛡️ Security Gate"
],
"required_signed_commits": true
}'
- Navigate to Repository → Settings → Branches
- Under Branch Protection for main:
  - Enable Require Pull Request (no direct pushes)
  - Enable Require Approval (minimum 1 reviewer)
  - Enable Require Signed Commits
  - Enable Require Status Checks: select Security Pipeline / 🛡️ Security Gate
  - Enable Dismiss Stale Reviews
Create .gitleaks.toml in the repository root:
title = "Homelab Gitleaks Configuration"
[extend]
# Use default rules as base
useDefault = true
# Allowlist patterns specific to the homelab
[[rules]]
id = "generic-api-key"
description = "Generic API Key"
regex = '''(?i)(?:key|api|token|secret|password|pwd|pw|auth)['"''\s]*(?::|=|\s+is\s+|->)\s*['"''']?[0-9a-zA-Z\-_.]{20,}'''
[[allowlist]]
description = "Allow Proxmox API URL (not a secret)"
regexes = ['''https://192\.168\.1\.11:8006''']
[[allowlist]]
description = "Allow K3s API server URL"
regexes = ['''https://10\.10\.10\.10:6443''']
[[allowlist]]
description = "Allow homelab domain references"
regexes = ['''homelab\.local''', '''\.svc\.cluster\.local''']
[[allowlist]]
paths = [
'''^\.gitleaks\.toml$''',
'''^docs/.*$''',
'''^\.pre-commit-config\.yaml$'''
]
# Scan entire repo
gitleaks detect --source . --verbose
# Scan with custom config
gitleaks detect --source . --config-path .gitleaks.toml
# Generate report
gitleaks detect --source . --report-format json --report-path gitleaks-report.json
Modify the ArgoCD root Application to only auto-sync when the pipeline passes:
# gitops-apps/argocd-apps/root-application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: root-application
namespace: argocd
annotations:
# Only sync when security pipeline has passed on the commit
argocd.argoproj.io/sync-options: ServerSideApply=true
spec:
project: homelab
source:
repoURL: https://git.homelab.local/homelab/gitops-apps.git
targetRevision: main
path: argocd-apps
destination:
server: https://kubernetes.default.svc
namespace: argocd
syncPolicy:
automated:
selfHeal: true
prune: false
syncOptions:
- ServerSideApply=true
# Retry with backoff on transient failures
retry:
limit: 3
backoff:
duration: 5s
factor: 2
maxDuration: 3m
# Configure ArgoCD notifications for failed syncs
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
name: argocd-notifications-cm
namespace: argocd
data:
trigger.on-sync-failed: |
- description: Application sync failed
send:
- app-sync-failed
when: app.status.operationState.phase in ['Failed', 'Error']
template.app-sync-failed: |
message: |
🔴 Application {{.app.metadata.name}} sync failed.
Health: {{.app.status.health.status}}
Sync Status: {{.app.status.sync.status}}
Error: {{.app.status.operationState.message}}
EOF
Create gitops-apps/infrastructure/cicd/application.yaml:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: act-runner
namespace: argocd
spec:
project: homelab
source:
repoURL: https://git.homelab.local/homelab/gitops-apps.git
targetRevision: main
path: infrastructure/cicd
destination:
server: https://kubernetes.default.svc
namespace: cicd
syncPolicy:
automated:
selfHeal: true
prune: false
syncOptions:
- ServerSideApply=true
- CreateNamespace=true
Create gitops-apps/infrastructure/cicd/runner-deployment.yaml:
---
apiVersion: v1
kind: Secret
metadata:
name: act-runner-token
namespace: cicd
type: Opaque
stringData:
token: "" # Set after registration via: kubectl edit secret act-runner-token -n cicd
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: act-runner-data
namespace: cicd
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: longhorn-default
resources:
requests:
storage: 10Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: act-runner
namespace: cicd
labels:
app.kubernetes.io/name: act-runner
environment: cicd
spec:
replicas: 1
selector:
matchLabels:
app.kubernetes.io/name: act-runner
template:
metadata:
labels:
app.kubernetes.io/name: act-runner
environment: cicd
spec:
containers:
- name: runner
image: gitea/act_runner:0.2.11
env:
- name: GITEA_INSTANCE_URL
value: "http://gitea.services.svc.cluster.local:3000"
- name: GITEA_RUNNER_NAME
value: "homelab-runner"
- name: GITEA_RUNNER_LABELS
value: "ubuntu-latest:docker://node:22-bookworm,self-hosted:kubernetes"
- name: GITEA_RUNNER_REGISTRATION_TOKEN
valueFrom:
secretKeyRef:
name: act-runner-token
key: token
resources:
requests:
cpu: 500m
memory: 256Mi
limits:
cpu: "2"
memory: 1Gi
volumeMounts:
- name: runner-data
mountPath: /data
- name: docker-sock
mountPath: /var/run/docker.sock
- name: workspace
mountPath: /home/runner/_work
volumes:
- name: runner-data
persistentVolumeClaim:
claimName: act-runner-data
- name: docker-sock
hostPath:
path: /var/run/docker.sock
- name: workspace
emptyDir: {}
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: act-runner
namespace: cicd
Note
The runner needs Docker access to execute job containers. If Docker isn't available on K3s nodes, use DinD (Docker-in-Docker) sidecar or switch to kubernetes mode where each job runs as a separate pod.
Create gitops-apps/infrastructure/cicd/runner-dind-deployment.yaml (alternative):
apiVersion: apps/v1
kind: Deployment
metadata:
name: act-runner-dind
namespace: cicd
labels:
app.kubernetes.io/name: act-runner
environment: cicd
spec:
replicas: 1
selector:
matchLabels:
app.kubernetes.io/name: act-runner
template:
metadata:
labels:
app.kubernetes.io/name: act-runner
environment: cicd
spec:
containers:
- name: runner
image: gitea/act_runner:0.2.11
env:
- name: GITEA_INSTANCE_URL
value: "http://gitea.services.svc.cluster.local:3000"
- name: GITEA_RUNNER_NAME
value: "homelab-runner"
- name: GITEA_RUNNER_LABELS
value: "ubuntu-latest:docker://node:22-bookworm,self-hosted:kubernetes"
- name: GITEA_RUNNER_REGISTRATION_TOKEN
valueFrom:
secretKeyRef:
name: act-runner-token
key: token
- name: DOCKER_HOST
value: "tcp://localhost:2376"
- name: DOCKER_CERT_PATH
value: "/certs/client"
- name: DOCKER_TLS_VERIFY
value: "1"
resources:
requests:
cpu: 500m
memory: 256Mi
limits:
cpu: "2"
memory: 1Gi
volumeMounts:
- name: runner-data
mountPath: /data
- name: docker-certs
mountPath: /certs/client
readOnly: true
- name: workspace
mountPath: /home/runner/_work
- name: dind
image: docker:dind
securityContext:
privileged: true
env:
- name: DOCKER_TLS_CERTDIR
value: "/certs"
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: "2"
memory: 2Gi
volumeMounts:
- name: docker-certs
mountPath: /certs/client
- name: docker-storage
mountPath: /var/lib/docker
volumes:
- name: runner-data
persistentVolumeClaim:
claimName: act-runner-data
- name: docker-certs
emptyDir: {}
- name: docker-storage
emptyDir: {}
- name: workspace
emptyDir: {}
After deploying, register the runner with Gitea:
# Get registration token from Gitea admin
# UI: Site Administration → Actions → Runners → Generate Registration Token
# Or API:
GITEA_URL="http://gitea.services.svc.cluster.local:3000"
ADMIN_TOKEN="your-admin-token"
REG_TOKEN=$(curl -s "${GITEA_URL}/api/v1/admin/runners/registration-token" \
-H "Authorization: token ${ADMIN_TOKEN}" | jq -r '.token')
# Update the secret
kubectl create secret generic act-runner-token \
--namespace cicd \
--from-literal=token="${REG_TOKEN}" \
--dry-run=client -o yaml | kubectl apply -f -
# Restart runner to pick up the token
kubectl rollout restart deployment -n cicd act-runner
Store pipeline secrets in Gitea:
REPO="homelab/gitops-apps"
# Registry credentials (for container push/pull)
curl -X PUT "${GITEA_URL}/api/v1/repos/${REPO}/actions/secrets/REGISTRY_USER" \
-H "Authorization: token ${ADMIN_TOKEN}" \
-H "Content-Type: application/json" \
-d '{"data": "admin"}'
curl -X PUT "${GITEA_URL}/api/v1/repos/${REPO}/actions/secrets/REGISTRY_TOKEN" \
-H "Authorization: token ${ADMIN_TOKEN}" \
-H "Content-Type: application/json" \
-d "{\"data\": \"${GITEA_TOKEN}\"}"
# Vault token (for Cosign key retrieval)
curl -X PUT "${GITEA_URL}/api/v1/repos/${REPO}/actions/secrets/VAULT_TOKEN" \
-H "Authorization: token ${ADMIN_TOKEN}" \
-H "Content-Type: application/json" \
-d "{\"data\": \"${VAULT_TOKEN}\"}"Create a ConfigMap for pipeline metrics collection:
apiVersion: v1
kind: ConfigMap
metadata:
name: pipeline-metrics-config
namespace: monitoring
data:
# Prometheus scrape config for Gitea Actions metrics
# Gitea exposes /api/v1/admin/actions/metrics
scrape-config.yaml: |
- job_name: 'gitea-actions'
static_configs:
- targets: ['gitea.services.svc.cluster.local:3000']
metrics_path: '/api/v1/admin/actions/metrics'
bearer_token: '${GITEA_TOKEN}'
Import the following dashboard in Grafana (Dashboard → Import → JSON):
Use the Gitea Actions dashboard ID or create a custom dashboard tracking:
- Pipeline success/failure rate
- Average pipeline duration
- Security scan findings by severity (HIGH/CRITICAL)
- Secret detection events
- IaC scan violations over time
# Verify Gitea Actions is enabled
curl -s "http://gitea.services.svc.cluster.local:3000/api/v1/repos/homelab/gitops-apps/actions/runs" \
-H "Authorization: token ${GITEA_TOKEN}" | jq '.total_count'
# Verify runner is registered
kubectl get pods -n cicd
kubectl logs -n cicd -l app.kubernetes.io/name=act-runner --tail=20
# Test pre-commit hooks
echo "password=supersecret123" > test-secret.yaml
git add test-secret.yaml
git commit -m "test" 2>&1
# Expected: gitleaks should block this commit
rm test-secret.yaml
# Trigger pipeline manually
curl -X POST "${GITEA_URL}/api/v1/repos/homelab/gitops-apps/actions/runs" \
-H "Authorization: token ${GITEA_TOKEN}"
# Check pipeline status
kubectl logs -n cicd -l app.kubernetes.io/name=act-runner --tail=50
kubectl logs -n cicd -l app.kubernetes.io/name=act-runner
# Check: GITEA_INSTANCE_URL is correct, token is valid
# Verify: curl http://gitea.services.svc.cluster.local:3000 from runner pod
kubectl exec -n cicd deploy/act-runner -- curl -s http://gitea.services.svc.cluster.local:3000/api/v1/version
# Run tfsec locally to reproduce
tfsec terraform/ --verbose
# Add exceptions in .tfsec.json for acceptable risks
# Update .gitleaks.toml allowlist
# Test: gitleaks detect --source . --config-path .gitleaks.toml --verbose
# Admin can bypass in Gitea UI or use admin token
# Disable protection temporarily:
curl -X DELETE "${GITEA_URL}/api/v1/repos/${REPO}/branches/main/protection" \
-H "Authorization: token ${GITEA_TOKEN}"- Gitea Actions enabled on the repository
- Act Runner deployed in cicd namespace and registered
- Runner registration token stored in Secret act-runner-token
- Docker-in-Docker sidecar running (or host Docker socket mounted)
- ArgoCD Application for act-runner deployed via GitOps
- Repository secrets configured (REGISTRY_USER, REGISTRY_TOKEN, VAULT_TOKEN)
- Pre-commit hooks installed locally (gitleaks, yamllint, tfsec)
- .pre-commit-config.yaml configured
- Security pipeline workflow created (.gitea/workflows/security-pipeline.yaml)
- Secret scanning (gitleaks) running in pipeline
- IaC scanning (checkov + tfsec) running in pipeline
- Dependency scanning (trivy fs) running in pipeline
- K8s manifest validation (conftest) running in pipeline
- Container pipeline workflow created for image builds
- Branch protection enabled on main (require PR, approvals, status checks, signed commits)
- Gitleaks configuration tuned (.gitleaks.toml)
- Pipeline fails on HIGH/CRITICAL findings
- ArgoCD conditional sync configured
- Grafana security dashboard created
- Pipeline tested end-to-end with a sample push
Implement image signing, SBOM generation, and verification with Cosign, Syft, Grype, and Sigstore to secure the container supply chain.
This guide implements a complete software supply chain security pipeline. Every container image is scanned for vulnerabilities, documented with an SBOM, signed with Cosign, and verified at the Kubernetes admission stage before deployment.
Time Required: ~90 minutes
Prerequisites: Guide 12 (CI/CD Pipeline Security) completed
Software Supply Chain Security
┌─────────────────────────────────────────────┐
│ Container Build Pipeline │
│ │
│ Build → Grype Scan → Syft SBOM → Cosign │
│ Sign │
└──────────────────┬──────────────────────────┘
│
┌────────────┼────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────────┐
│ Registry │ │ SBOM │ │ Transparency │
│ (Gitea) │ │ Store │ │ Log │
└────┬─────┘ └──────────┘ │ (Rekor) │
│ └──────────────┘
▼
┌─────────────────────────────────────────────┐
│ Kubernetes Admission Control │
│ │
│ Kyverno Policy: Verify Cosign Signature │
│ → Only signed images allowed in production │
└─────────────────────────────────────────────┘
# macOS
brew install cosign syft grype
# Verify versions
cosign version
syft version
grype version
The tools will be installed inline in the Gitea Actions workflows. No separate deployment needed.
Gitea has a built-in OCI-compliant container registry.
# Verify container registry is enabled in Gitea config
# Edit via ConfigMap or Gitea admin UI
# In app.ini:
[packages]
ENABLED = true
# Registry URL: git.homelab.local (port 3000)
# Push format: git.homelab.local:3000/homelab/<image>:<tag>
# Login to Gitea container registry
docker login git.homelab.local:3000 -u admin -p "${GITEA_TOKEN}"
# Test push
docker pull alpine:latest
docker tag alpine:latest git.homelab.local:3000/homelab/alpine:latest
docker push git.homelab.local:3000/homelab/alpine:latest
# Verify
curl -s "http://gitea.services.svc.cluster.local:3000/api/v1/packages/homelab" \
-H "Authorization: token ${GITEA_TOKEN}" | jq .# Generate key pair — store private key securely
cosign generate-key-pair
# Files created:
# cosign.key (private key — NEVER commit to git)
# cosign.pub (public key — safe to commit)
# Move private key to Vault
kubectl port-forward -n security svc/vault 8200:8200 &
VAULT_ADDR="http://127.0.0.1:8200"
# Store private key in Vault
vault kv put secret/supply-chain/cosign \
private-key="$(cat cosign.key)"
# Commit public key to repository
cp cosign.pub gitops-apps/security/cosign/cosign.pub
rm cosign.key cosign.pub
# Pull and push image to Gitea registry
export COSIGN_PASSWORD="" # Set if key is password-protected
# Retrieve private key from Vault for signing
vault kv get -field=private-key secret/supply-chain/cosign > /tmp/cosign.key
# Sign the image
cosign sign --key /tmp/cosign.key \
git.homelab.local:3000/homelab/alpine:latest
# Verify the signature
cosign verify --key gitops-apps/security/cosign/cosign.pub \
git.homelab.local:3000/homelab/alpine:latest
# Clean up
rm /tmp/cosign.key
For air-gapped or fully local environments, use key-based signing. For public/internet-connected setups, Sigstore Fulcio provides ephemeral key signing:
# Keyless signing (requires internet for Sigstore)
cosign sign \
--yes \
git.homelab.local:3000/homelab/alpine:latest
# Verify with keyless (checks Rekor transparency log)
cosign verify \
git.homelab.local:3000/homelab/alpine:latest
Note
Keyless signing requires outbound internet to sigstore.dev. For the homelab's isolated network, key-based signing with Vault is the recommended approach.
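As an alternative to pulling the private key out of Vault for every signing run, Cosign can also sign directly against Vault's transit engine through its hashivault KMS provider. A sketch, assuming the transit engine is not yet enabled and using cosign-homelab as the key name:
# One-time setup in Vault
vault secrets enable transit
vault write -f transit/keys/cosign-homelab type=ecdsa-p256
# Sign and verify without the key ever leaving Vault (VAULT_ADDR and VAULT_TOKEN must be set)
cosign sign --key hashivault://cosign-homelab git.homelab.local:3000/homelab/alpine:latest
cosign public-key --key hashivault://cosign-homelab > cosign.pub
cosign verify --key cosign.pub git.homelab.local:3000/homelab/alpine:latest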
# SPDX format
syft git.homelab.local:3000/homelab/alpine:latest -o spdx-json > alpine-sbom.spdx.json
# CycloneDX format
syft git.homelab.local:3000/homelab/alpine:latest -o cyclonedx-json > alpine-sbom.cyclonedx.json
# Table format (human-readable)
syft git.homelab.local:3000/homelab/alpine:latest -o table
# Attach SBOM as OCI artifact to the image in registry
cosign attach sbom --sbom alpine-sbom.spdx.json \
git.homelab.local:3000/homelab/alpine:latest
# Verify SBOM is attached
cosign download sbom git.homelab.local:3000/homelab/alpine:latest
# Create signed attestation for the SBOM
cosign attest --predicate alpine-sbom.spdx.json --type spdx \
--key /tmp/cosign.key \
git.homelab.local:3000/homelab/alpine:latest
# Verify the attestation
cosign verify-attestation --type spdx \
--key gitops-apps/security/cosign/cosign.pub \
git.homelab.local:3000/homelab/alpine:latest
# Scan image directly
grype git.homelab.local:3000/homelab/alpine:latest
# Scan with severity filter
grype git.homelab.local:3000/homelab/alpine:latest --fail-on high
# Output SARIF format
grype git.homelab.local:3000/homelab/alpine:latest -o sarif > grype-results.sarif
# Output JSON for programmatic processing
grype git.homelab.local:3000/homelab/alpine:latest -o json > grype-results.json
# Scan from SBOM file (no registry access needed)
grype sbom:./alpine-sbom.spdx.json
# Useful for air-gapped scanning workflows
grype sbom:./alpine-sbom.spdx.json --fail-on critical
Create .grype.yaml:
# Fail pipeline on these severity levels
fail-on-severity: "high"
# Ignore specific vulnerabilities (with justification)
ignore:
- vulnerability: CVE-2023-XXXXX
fix-state: not-fixed
reason: "No fix available; acceptable risk in homelab"
# Only show fixed vulnerabilities
only-fixed: false
# Registry auth for private registry
registry:
auth:
- authority: git.homelab.local:3000
username: admin
password: "${GITEA_TOKEN}"Create gitops-apps/.gitea/workflows/supply-chain.yaml:
name: Supply Chain Security
on:
push:
branches: [main]
paths:
- 'container-images/**'
- 'Dockerfile*'
env:
REGISTRY: git.homelab.local:3000
IMAGE_NAME: homelab/app
jobs:
build-scan-sign:
name: "🔐 Build → Scan → Sign"
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Install Tools
run: |
# Cosign
curl -L https://github.com/sigstore/cosign/releases/latest/download/cosign-linux-amd64 -o /usr/local/bin/cosign
chmod +x /usr/local/bin/cosign
# Syft
curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | sh -s -- -b /usr/local/bin
# Grype
curl -sSfL https://raw.githubusercontent.com/anchore/grype/main/install.sh | sh -s -- -b /usr/local/bin
- name: Login to Registry
run: |
docker login ${REGISTRY} -u ${{ secrets.REGISTRY_USER }} -p ${{ secrets.REGISTRY_TOKEN }}
- name: Build Image
run: |
docker build -t ${REGISTRY}/${IMAGE_NAME}:${{ github.sha }} .
docker push ${REGISTRY}/${IMAGE_NAME}:${{ github.sha }}
# Step 1: Vulnerability Scan
- name: Grype Vulnerability Scan
run: |
grype ${REGISTRY}/${IMAGE_NAME}:${{ github.sha }} \
--fail-on high \
-o sarif=grype-results.sarif \
-o json=grype-results.json
# Step 2: Generate SBOM
- name: Generate SBOM (SPDX)
run: |
syft ${REGISTRY}/${IMAGE_NAME}:${{ github.sha }} -o spdx-json > sbom.spdx.json
syft ${REGISTRY}/${IMAGE_NAME}:${{ github.sha }} -o cyclonedx-json > sbom.cyclonedx.json
# Step 3: Attach SBOM
- name: Attach SBOM to Image
run: |
cosign attach sbom --sbom sbom.spdx.json \
${REGISTRY}/${IMAGE_NAME}:${{ github.sha }}
# Step 4: Sign Image
- name: Sign Image with Cosign
run: |
# Retrieve signing key from Vault
export VAULT_ADDR="http://vault.security.svc.cluster.local:8200"
export VAULT_TOKEN="${{ secrets.VAULT_TOKEN }}"
vault kv get -field=private-key secret/supply-chain/cosign > /tmp/cosign.key
cosign sign --key /tmp/cosign.key --yes \
${REGISTRY}/${IMAGE_NAME}:${{ github.sha }}
rm -f /tmp/cosign.key
# Step 5: Sign SBOM Attestation
- name: Attest SBOM
run: |
export VAULT_ADDR="http://vault.security.svc.cluster.local:8200"
export VAULT_TOKEN="${{ secrets.VAULT_TOKEN }}"
vault kv get -field=private-key secret/supply-chain/cosign > /tmp/cosign.key
cosign attest --predicate sbom.spdx.json --type spdx \
--key /tmp/cosign.key \
${REGISTRY}/${IMAGE_NAME}:${{ github.sha }}
rm -f /tmp/cosign.key
- name: Upload Artifacts
uses: actions/upload-artifact@v4
if: always()
with:
name: supply-chain-artifacts
path: |
grype-results.sarif
grype-results.json
sbom.spdx.json
sbom.cyclonedx.json
- name: Verify Signature
run: |
cosign verify --key security/cosign/cosign.pub \
${REGISTRY}/${IMAGE_NAME}:${{ github.sha }}
# Add repository secrets for the pipeline
GITEA_URL="http://gitea.services.svc.cluster.local:3000"
REPO="homelab/gitops-apps"
TOKEN="your-admin-token"
curl -X PUT "${GITEA_URL}/api/v1/repos/${REPO}/actions/secrets/REGISTRY_USER" \
-H "Authorization: token ${TOKEN}" \
-H "Content-Type: application/json" \
-d '{"data": "admin"}'
curl -X PUT "${GITEA_URL}/api/v1/repos/${REPO}/actions/secrets/REGISTRY_TOKEN" \
-H "Authorization: token ${TOKEN}" \
-H "Content-Type: application/json" \
-d "{\"data\": \"${GITEA_TOKEN}\"}"Update gitops-apps/security/kyverno/policies.yaml to add signature verification:
---
# Only allow Cosign-signed images in production namespaces
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: verify-image-signatures
annotations:
policies.kyverno.io/title: Verify Image Signatures
policies.kyverno.io/category: Supply Chain Security
policies.kyverno.io/severity: high
policies.kyverno.io/description: >-
Only allow container images that have been signed with the homelab
Cosign key. Prevents deployment of unsigned or tampered images.
spec:
validationFailureAction: Audit # Change to Enforce after testing
background: false
rules:
- name: verify-cosign-signature
match:
any:
- resources:
kinds:
- Pod
namespaces:
- production
- services
- monitoring
- security
exclude:
any:
- resources:
namespaces:
- kube-system
- longhorn-system
- argocd
- kyverno
- falco
- logging
- cicd
verifyImages:
- imageReferences:
- "git.homelab.local:3000/*"
attestors:
- entries:
- keys:
publicKeys: |-
-----BEGIN PUBLIC KEY-----
# Paste contents of cosign.pub here
-----END PUBLIC KEY-----
attestations:
- type: https://spdx.dev/Document
conditions:
- all:
- key: "{{ contents[].SPDXID }}"
operator: NotEquals
value: ""kubectl apply -f gitops-apps/security/kyverno/policies.yaml
# Verify policy is active
kubectl get clusterpolicy verify-image-signatures -o wide
# Deploy an unsigned image - should be logged (audit mode)
kubectl run test-unsigned --image=git.homelab.local:3000/homelab/alpine:latest -n services
# Check Kyverno audit logs
kubectl logs -n kyverno -l app.kubernetes.io/name=kyverno --tail=20
# Deploy a signed image — should pass
# (After signing the image per Phase 3)
kubectl run test-signed --image=git.homelab.local:3000/homelab/alpine:latest -n services
# Switch to Enforce mode when ready
# Edit the policy: validationFailureAction: Enforce
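When the audit logs look clean, the switch to enforcement can also be done as a one-line patch instead of editing the manifest:
kubectl patch clusterpolicy verify-image-signatures --type merge \
  -p '{"spec":{"validationFailureAction":"Enforce"}}'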
Create a CronJob to periodically scan images and push metrics to Prometheus:
apiVersion: batch/v1
kind: CronJob
metadata:
name: supply-chain-scanner
namespace: monitoring
spec:
schedule: "0 */6 * * *" # Every 6 hours
jobTemplate:
spec:
template:
spec:
containers:
- name: scanner
image: anchore/grype:latest
command:
- /bin/sh
- -c
- |
# Scan all running container images in the cluster
IMAGES=$(kubectl get pods -A -o json | jq -r '.items[].spec.containers[].image' | sort -u)
for IMG in $IMAGES; do
echo "Scanning: $IMG"
grype "$IMG" -o json >> /tmp/scan-results.json
done
env:
- name: DOCKER_CONFIG
value: /tmp/.docker
restartPolicy: OnFailure
Create a Grafana dashboard showing:
- Total images scanned vs signed
- Vulnerability count by severity over time
- SBOM coverage (images with/without SBOM)
- Unsigned deployment attempts (from Kyverno audit logs)
- Most vulnerable images (top 10)
Import via Grafana provisioning ConfigMap or UI.
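If you go the provisioning route, kube-prometheus-stack's Grafana sidecar (when enabled) picks up ConfigMaps labeled grafana_dashboard; the dashboard JSON file name below is a placeholder for whatever you export from the UI:
kubectl create configmap supply-chain-dashboard -n monitoring \
  --from-file=supply-chain-dashboard.json
kubectl label configmap supply-chain-dashboard -n monitoring grafana_dashboard="1"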
# Verify Cosign installation
cosign version
# Verify signing key in Vault
vault kv get secret/supply-chain/cosign
# Verify public key in repo
cat gitops-apps/security/cosign/cosign.pub
# Test full pipeline: sign, verify, scan
export IMAGE="git.homelab.local:3000/homelab/alpine:latest"
# Sign
vault kv get -field=private-key secret/supply-chain/cosign > /tmp/cosign.key
cosign sign --key /tmp/cosign.key --yes ${IMAGE}
rm /tmp/cosign.key
# Verify
cosign verify --key gitops-apps/security/cosign/cosign.pub ${IMAGE}
# Generate SBOM
syft ${IMAGE} -o table
# Scan with Grype
grype ${IMAGE} --fail-on critical
# Verify Kyverno policy is active
kubectl get clusterpolicy verify-image-signatures
# Test registry connectivity
curl -v http://gitea.services.svc.cluster.local:3000/v2/
# Check: DNS resolves, port is correct, auth is configured
# Check policy is in audit mode first
kubectl get clusterpolicy verify-image-signatures -o yaml | grep validationFailureAction
# Start with Audit, verify logs, then switch to Enforce
# Grype downloads vulnerability DB from github.com
# For air-gapped: use grype -db <path-to-local-db>
grype db update
# Use CycloneDX (more compact) instead of SPDX
syft ${IMAGE} -o cyclonedx-json > sbom.json
# Or filter to packages only
syft ${IMAGE} -o spdx-json --exclude-patterns="**/test/**" > sbom.json
- Cosign installed and key pair generated
- Private key stored in Vault (secret/supply-chain/cosign)
- Public key committed to gitops-apps/security/cosign/cosign.pub
- Gitea container registry enabled and tested
- Syft generates SBOMs in SPDX and CycloneDX formats
- Grype scans images with HIGH severity gate
- Supply chain CI workflow created (build → scan → SBOM → sign → attest)
- Cosign signs images and attaches SBOM attestations
- Kyverno policy verify-image-signatures deployed in Audit mode
- Unsigned image deployment detected by Kyverno
- Signed image verification passes
- Supply chain artifacts uploaded to Gitea Actions
- Grafana supply chain dashboard created
- CronJob for periodic image scanning deployed
- .grype.yaml configuration committed
Scan Terraform and Kubernetes manifests with tfsec, Checkov, and Terrascan to catch misconfigurations before deployment.
This guide implements shift-left IaC security scanning. Every Terraform plan and Kubernetes manifest is scanned for misconfigurations, compliance violations, and security risks — both locally via pre-commit hooks and in the CI/CD pipeline.
Time Required: ~60 minutes
Prerequisites: Guide 03 (Terraform Infrastructure), Guide 12 (CI/CD Pipeline Security) completed
IaC Security Scanning Pipeline
┌───────────────────────────────────────────────┐
│ Developer Push / PR │
└──────────────────┬────────────────────────────┘
▼
┌───────────────────────────────────────────────┐
│ Pre-Commit Hooks │
│ tfsec · Checkov · Terrascan · Conftest │
└──────────────────┬────────────────────────────┘
│ (local pass)
▼
┌───────────────────────────────────────────────┐
│ Gitea Actions Pipeline │
│ │
│ tfsec → Terraform scanning │
│ Checkov → Terraform + K8s scanning │
│ Terrascan → OPA-based IaC scanning │
│ Conftest → K8s manifest policy checks │
└──────────────────┬────────────────────────────┘
│ (all pass)
▼
┌───────────────────────────────────────────────┐
│ ArgoCD GitOps Deployment │
└───────────────────────────────────────────────┘
# macOS
brew install tfsec checkov terrascan conftest
# Verify installations
tfsec --version
checkov --version
terrascan version
conftest --version
# Run via Docker if preferred
docker run --rm -v $(pwd):/src aquasec/tfsec /src
docker run --rm -v $(pwd):/src bridgecrew/checkov -d /src
docker run --rm -v $(pwd):/src accurics/terrascan scan -d /src
docker run --rm -v $(pwd)/policies:/policies -v $(pwd)/manifests:/manifests openpolicyagent/conftest test /manifests
cd /path/to/homelab
# Scan all Terraform code
tfsec terraform/ --verbose
# Output in JSON format
tfsec terraform/ --format json --out tfsec-results.json
# Output in SARIF (for CI integration)
tfsec terraform/ --format sarif --out tfsec-results.sarif
Create terraform/.tfsec.json:
{
"exclude": [
"GEN001",
"GEN003"
],
"severity_overrides": {
"DS002": "HIGH",
"AWS001": "CRITICAL"
}
}
Add inline suppressions to Terraform code where needed:
# tfsec:ignore:GEN001 Proxmox local-only — no remote state backend needed
terraform {
required_providers {
proxmox = {
source = "bpg/proxmox"
version = "~> 0.73"
}
}
}
# tfsec:ignore:DS002 Using HTTP for internal Proxmox API — no public exposure
provider "proxmox" {
endpoint = "https://192.168.1.11:8006"
insecure = true
}
Create terraform/.tfsec/custom_checks/ directory with custom checks specific to the homelab:
# terraform/.tfsec/custom_checks/require_longhorn_storage.yaml
checks:
- code: HOMELAB001
description: "K3s VMs must use Longhorn storage class"
requiredTypes:
- resource
requiredLabels:
- proxmox_vm_qemu
matchSpec:
name: disk
action: contains
value: "longhorn"
severity: MEDIUM
relatedLinks:
- "https://homelab.local/docs/storage"# Scan Terraform code
checkov -d terraform/ --framework terraform --output cli
# Scan with all frameworks
checkov -d . --output cli
# Output in JSON
checkov -d terraform/ --output json --output-file-path checkov-tf.json
# Output in SARIF
checkov -d terraform/ --output sarif --output-file-path checkov-tf.sarif
# Scan K8s manifests
checkov -d gitops-apps/ --framework kubernetes --output cli
# Scan specific directory
checkov -d gitops-apps/security/ --framework kubernetes
Create .checkov.yaml in the repository root:
# Checkov configuration for homelab
branch: main
# Skip specific checks that don't apply to the homelab
skip-check:
- CKV_K8S_21 # "The default namespace should not be used" — acceptable for system namespaces
- CKV_K8S_38 # "Ensure that Service Account Tokens are only mounted where necessary"
- CKV_TF_1 # "Use HTTPS for Proxmox API" — internal network, self-signed cert
- CKV_TF_2 # "Use remote state" — local state acceptable for homelab
# Soft fail — report but don't fail the pipeline on these
soft-fail-on:
- CKV_K8S_14 # "Image tag should be specified" — some test images use latest
- CKV_K8S_43 # "Image pull policy should be 'Always'"
# Framework-specific settings
framework:
- terraform
- kubernetes
# Output directory for results
output: cli
compact: true
In Terraform files:
# checkov:skip=CKV_TF_1 Internal Proxmox API on isolated network
provider "proxmox" {
endpoint = "https://192.168.1.11:8006"
insecure = true
}
In Kubernetes manifests:
metadata:
annotations:
checkov.io/skip: "CKV_K8S_21=System namespace, CKV_K8S_38=Service account token required for Vault auth"
Create policies/checkov/ directory with custom Python policies:
# policies/checkov/require_longhorn_storage.py
from checkov.common.models.enums import CheckResult
from checkov.terraform.checks.resource.base_resource_check import BaseResourceCheck
class RequireLonghornStorage(BaseResourceCheck):
def __init__(self):
name = "Ensure K8s PVCs use Longhorn storage classes"
id = "CKV_HOMELAB_1"
supported_resources = ["kubernetes_persistent_volume_claim"]
categories = ["storage"]
super().__init__(name=name, id=id, categories=categories, supported_resources=supported_resources)
def scan_resource_conf(self, conf):
storage_class = conf.get("spec", [{}])[0].get("storage_class_name", "")
allowed = ["longhorn-critical", "longhorn-default", "longhorn-ephemeral"]
if storage_class in allowed:
return CheckResult.PASSED
return CheckResult.FAILED
check = RequireLonghornStorage()
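To have Checkov load this custom check, point it at the directory with --external-checks-dir; since CKV_HOMELAB_1 targets the Terraform kubernetes provider's PVC resource, run it against the Terraform code:
checkov -d terraform/ --framework terraform \
  --external-checks-dir policies/checkov/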
# Scan Terraform code
terrascan scan -d terraform/ -t terraform
# Scan Kubernetes manifests
terrascan scan -d gitops-apps/ -t kubernetes
# Output in JSON
terrascan scan -d terraform/ -t terraform -o json > terrascan-results.json
# Output in YAML
terrascan scan -d gitops-apps/ -t kubernetes -o yaml > terrascan-k8s.yaml
Create terrascan-config.toml:
[notifications]
[rules]
# Skip rules not applicable to homelab
skip-rules = [
"AC_K8S_37", # Require resource limits — handled by Kyverno
"AC_K8S_38", # Service account token mounting
]
# Severity threshold
severity = "medium"
# Categories to scan
categories = [
"Security",
"Compliance",
]
Create custom Rego policies in policies/terrascan/:
# policies/terrascan/require_homelab_labels.rego
package custom.kubernetes.require_labels
import future.keywords.in
__rego_metadata__ := {
"id": "HOMELAB_001",
"avd_id": "AVD-HOMELAB-0001",
"title": "Resources must have homelab labels",
"short_code": "require-labels",
"version": "v1.0.0",
"severity": "LOW",
"type": "Kubernetes",
"description": "All Kubernetes resources must have app.kubernetes.io/name label",
"recommended_actions": "Add app.kubernetes.io/name label to resources",
"url": "https://homelab.local/docs/labels",
}
deny[cause] {
resource := input.resource
not has_required_label(resource)
cause := sprintf("Resource '%s' missing required label 'app.kubernetes.io/name'", [resource.metadata.name])
}
has_required_label(resource) {
resource.metadata.labels["app.kubernetes.io/name"]
}
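Terrascan also needs to be pointed at the custom Rego directory; a hedged example (depending on the Terrascan version, --policy-path may replace rather than extend the built-in policy set):
# Scan Kubernetes manifests with the custom homelab policies
terrascan scan -d gitops-apps/ -t kubernetes --policy-path policies/terrascan/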
Create policies/conftest/ directory:
# policies/conftest/require_labels.rego
package main
deny[msg] {
input.kind == "Deployment"
not input.metadata.labels["app.kubernetes.io/name"]
msg := sprintf("Deployment '%s' must have app.kubernetes.io/name label", [input.metadata.name])
}
deny[msg] {
input.kind == "Deployment"
container := input.spec.template.spec.containers[_]
not container.resources.limits.cpu
msg := sprintf("Deployment '%s' containers must have CPU limits", [input.metadata.name])
}
deny[msg] {
input.kind == "Deployment"
container := input.spec.template.spec.containers[_]
not container.resources.limits.memory
msg := sprintf("Deployment '%s' containers must have memory limits", [input.metadata.name])
}
# policies/conftest/disallow_latest.rego
package main
deny[msg] {
input.kind == "Deployment"
container := input.spec.template.spec.containers[_]
endswith(container.image, ":latest")
msg := sprintf("Container '%s' in Deployment '%s' uses :latest tag — use specific version", [container.name, input.metadata.name])
}
# policies/conftest/require_storage_class.rego
package main
import future.keywords.in
deny[msg] {
input.kind == "PersistentVolumeClaim"
not input.spec.storageClassName
msg := sprintf("PVC '%s' must specify a storageClassName — use longhorn-default or longhorn-critical", [input.metadata.name])
}
deny[msg] {
input.kind == "PersistentVolumeClaim"
allowed := {"longhorn-critical", "longhorn-default", "longhorn-ephemeral"}
not input.spec.storageClassName in allowed
msg := sprintf("PVC '%s' uses unsupported storageClass '%s' — use Longhorn classes", [input.metadata.name, input.spec.storageClassName])
}
# Test all K8s manifests against policies
conftest test --policy policies/conftest/ gitops-apps/
# Test specific file
conftest test --policy policies/conftest/ gitops-apps/security/kyverno/policies.yaml
# Output in JSON
conftest test --policy policies/conftest/ --output json gitops-apps/
# Combine with Kubernetes rendering (if using Helm)
helm template my-chart ./chart | conftest test --policy policies/conftest/ -
Create policies/conftest/require_labels_test.rego:
package main
test_deployment_has_label {
count(deny) == 0 with input as {
"kind": "Deployment",
"metadata": {
"name": "test",
"labels": {"app.kubernetes.io/name": "test"}
},
"spec": {"template": {"spec": {"containers": [{"name": "test", "image": "nginx:1.27", "resources": {"limits": {"cpu": "100m", "memory": "128Mi"}}}]}}}
}
}
test_deployment_missing_label {
deny[msg] with input as {
"kind": "Deployment",
"metadata": {"name": "test"}
}
}
# Run policy unit tests
conftest verify --policy policies/conftest/
Create gitops-apps/.gitea/workflows/iac-security.yaml:
name: IaC Security Scan
on:
push:
branches: [main]
pull_request:
branches: [main]
jobs:
tfsec:
name: "🔒 tfsec"
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run tfsec
run: |
curl -L https://github.com/aquasecurity/tfsec/releases/latest/download/tfsec-linux-amd64 -o tfsec
chmod +x tfsec
./tfsec terraform/ --format json --out tfsec-results.json --no-color
- name: Upload Results
uses: actions/upload-artifact@v4
if: always()
with:
name: tfsec-results
path: tfsec-results.json
checkov:
name: "🛡️ Checkov"
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run Checkov
run: |
pip install checkov
checkov -d . --framework terraform,kubernetes \
--output cli \
--output junitxml \
--output-file-path console,checkov-results.xml \
--compact \
--config-file .checkov.yaml
- name: Upload Results
uses: actions/upload-artifact@v4
if: always()
with:
name: checkov-results
path: checkov-results.xml
terrascan:
name: "🔍 Terrascan"
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run Terrascan
run: |
curl -L "$(curl -s https://api.github.com/repos/accurics/terrascan/releases/latest | grep -o -E 'https://.+?_Linux_x86_64.tar.gz')" -o terrascan.tar.gz
tar -xf terrascan.tar.gz terrascan && rm terrascan.tar.gz
./terrascan scan -d . -t terraform,kubernetes -o json > terrascan-results.json
- name: Upload Results
uses: actions/upload-artifact@v4
if: always()
with:
name: terrascan-results
path: terrascan-results.json
conftest:
name: "📋 Conftest"
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run Conftest
run: |
curl -L https://github.com/open-policy-agent/conftest/releases/latest/download/conftest_linux_amd64.tar.gz | tar xz
sudo mv conftest /usr/local/bin/
conftest test --policy policies/conftest/ --output table gitops-apps/
- name: Conftest Verify (Unit Tests)
run: |
conftest verify --policy policies/conftest/
Add IaC scanners to .pre-commit-config.yaml:
repos:
# Terraform security
- repo: https://github.com/antonbabenko/pre-commit-terraform
rev: v1.96.1
hooks:
- id: terraform_fmt
- id: terraform_validate
- id: terraform_tflint
- id: tfsec
args: ['--force-all-dirs', '--no-color', '--config-file', 'terraform/.tfsec.json']
# Checkov
- repo: https://github.com/bridgecrewio/checkov
rev: "3.2.232"
hooks:
- id: checkov
args: ['--directory', '.', '--framework', 'terraform,kubernetes', '--compact', '--quiet']
# YAML linting
- repo: https://github.com/adrienverge/yamllint
rev: v1.35.1
hooks:
- id: yamllint
args: ['-d', '{extends: relaxed, rules: {line-length: {max: 120}}}']
# Store scan results in Gitea Packages as generic artifacts
GITEA_URL="http://gitea.services.svc.cluster.local:3000"
TOKEN="your-token"
# Upload scan results after each pipeline run
for REPORT in tfsec-results.json checkov-results.xml terrascan-results.json; do
curl -X PUT "${GITEA_URL}/api/v1/packages/homelab/generic/iac-reports/${GITHUB_SHA}/${REPORT}" \
-H "Authorization: token ${TOKEN}" \
--data-binary "@${REPORT}"
done
Create a simple script to track scan results over time:
#!/bin/bash
# scripts/track-iac-metrics.sh
# Push IaC scan metrics to Prometheus Pushgateway for Grafana visualization
PUSHGATEWAY="http://kube-prometheus-stack-prometheus-pushgateway.monitoring.svc.cluster.local:9091"
# Count findings by severity
TFSEC_HIGH=$(jq '[.results[] | select(.severity=="HIGH")] | length' tfsec-results.json 2>/dev/null || echo "0")
TFSEC_MED=$(jq '[.results[] | select(.severity=="MEDIUM")] | length' tfsec-results.json 2>/dev/null || echo "0")
TFSEC_LOW=$(jq '[.results[] | select(.severity=="LOW")] | length' tfsec-results.json 2>/dev/null || echo "0")
# Push to Pushgateway
cat <<EOF | curl --data-binary @- ${PUSHGATEWAY}/metrics/job/iac_security
iac_tfsec_findings{severity="high"} ${TFSEC_HIGH}
iac_tfsec_findings{severity="medium"} ${TFSEC_MED}
iac_tfsec_findings{severity="low"} ${TFSEC_LOW}
EOF
# Run all scanners locally and verify they produce results
echo "=== tfsec ==="
tfsec terraform/ --no-color
echo ""
echo "=== Checkov (Terraform) ==="
checkov -d terraform/ --framework terraform --compact
echo ""
echo "=== Checkov (Kubernetes) ==="
checkov -d gitops-apps/ --framework kubernetes --compact
echo ""
echo "=== Terrascan ==="
terrascan scan -d terraform/ -t terraform
echo ""
echo "=== Conftest ==="
conftest test --policy policies/conftest/ gitops-apps/
# Verify config files exist
ls -la .checkov.yaml terraform/.tfsec.json terrascan-config.toml
# Verify policy directories
ls -la policies/conftest/
ls -la policies/checkov/
# Run pre-commit
pre-commit run --all-files
# Ensure Terraform files are valid
cd terraform && terraform init && terraform validate
# tfsec scans valid Terraform — if files have syntax errors, tfsec skips them
# Use --compact flag and configure .checkov.yaml skip-check list
checkov -d . --compact --skip-check CKV_K8S_21,CKV_K8S_38
# Add permanent skips to .checkov.yaml
# Terrascan requires valid syntax — validate files first
terrascan scan -d . -t terraform --verbose
# Use --skip-rules for non-applicable checks
terrascan scan -d . -t terraform --skip-rules="AC_K8S_37"
# Test Rego syntax
conftest parse gitops-apps/argocd-apps/root-application.yaml
# Verify policy with unit tests
conftest verify --policy policies/conftest/
# Debug with trace flag
conftest test --policy policies/conftest/ --trace gitops-apps/
- tfsec installed and scanning Terraform code
- .tfsec.json configuration created with homelab-specific exceptions
- Checkov scanning both Terraform and Kubernetes manifests
- .checkov.yaml configuration created
- Terrascan installed and scanning with OPA policies
- Terrascan config (terrascan-config.toml) created
- Conftest policies created in policies/conftest/
- Policy unit tests pass (conftest verify)
- Custom Rego policies for homelab requirements (labels, storage class)
- IaC scanning pipeline integrated into Gitea Actions
- Pre-commit hooks include all IaC scanners
- False positives suppressed with inline annotations
- Scan results uploaded as CI artifacts
- Metrics pushed to Prometheus Pushgateway
- Grafana IaC security dashboard created
Automate TLS certificate provisioning with cert-manager and a private CA for all homelab services.
This guide sets up cert-manager with a self-signed root CA and a CA issuer to automatically provision TLS certificates for every *.homelab.local service. Certificates are auto-renewed, monitored in Grafana, and trusted across all workstations.
Time Required: ~60 minutes
Prerequisites: Guide 07 (GitOps Stack) completed
Certificate Management Architecture
┌──────────────────────────────────────────────────┐
│ cert-manager │
│ │
│ ┌─────────────┐ ┌──────────────────────────┐ │
│ │ Self-Signed │───>│ CA Issuer │ │
│ │ Root CA │ │ (signed by Root CA) │ │
│ └─────────────┘ └────────────┬─────────────┘ │
│ │ │
│ ┌─────────────┼────────────┐ │
│ ▼ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌───────┐ │
│ │ Gitea │ │ ArgoCD │ │Vault │ │
│ │ TLS crt │ │ TLS crt │ │TLS crt│ │
│ └─────────┘ └─────────┘ └───────┘ │
│ ┌─────────┐ ┌─────────┐ ┌───────┐ │
│ │ Grafana │ │Authelia │ │Longhor│ │
│ │ TLS crt │ │ TLS crt │ │n TLS │ │
│ └─────────┘ └─────────┘ └───────┘ │
└──────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────┐
│ ingress-nginx (TLS Termination) │
│ *.homelab.local → wildcard TLS certificate │
└──────────────────────────────────────────────────┘
cert-manager may already be installed by an earlier guide (it is not bundled with kube-prometheus-stack itself). Verify:
kubectl get pods -n cert-manager
# If not found, install separately:
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--set crds.enabled=true \
--set replicaCount=1 \
--set prometheus.enabled=true \
--set prometheus.servicemonitor.enabled=true \
--set webhook.timeoutSeconds=10 \
--wait
kubectl get crd | grep cert-manager
# Expected: certificates.cert-manager.io
# certificaterequests.cert-manager.io
# challenges.acme.cert-manager.io
# clusterissuers.cert-manager.io
# issuers.cert-manager.io
# orders.acme.cert-manager.io
# Generate Root CA key and certificate
openssl genrsa -out homelab-root-ca.key 4096
openssl req -x509 -new -nodes -key homelab-root-ca.key \
-sha256 -days 3650 \
-subj "/C=US/ST=Homelab/L=Homelab/O=Homelab CA/CN=Homelab Root CA" \
-addext "basicConstraints=critical,CA:TRUE" \
-addext "keyUsage=critical,keyCertSign,cRLSign" \
-out homelab-root-ca.crt
kubectl create namespace security
kubectl create secret tls homelab-root-ca \
--namespace security \
--cert=homelab-root-ca.crt \
--key=homelab-root-ca.key \
--dry-run=client -o yaml | kubectl apply -f -
Create gitops-apps/infrastructure/cert-manager/cluster-issuer.yaml:
---
# Self-signed issuer to bootstrap the Root CA
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: selfsigned-issuer
spec:
selfSigned: {}
---
# Root CA certificate (self-signed, 10-year validity)
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: homelab-root-ca
namespace: security
spec:
isCA: true
commonName: Homelab Root CA
secretName: homelab-root-ca-tls
duration: 87600h # 10 years
renewBefore: 720h # Renew 30 days before expiry
subject:
organizations:
- Homelab
organizationalUnits:
- Certificate Authority
dnsNames:
- homelab-root-ca
issuerRef:
name: selfsigned-issuer
kind: ClusterIssuer
group: cert-manager.io
---
# CA Issuer backed by the Root CA certificate
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: homelab-ca-issuer
spec:
ca:
secretName: homelab-root-ca-tls
kubectl apply -f gitops-apps/infrastructure/cert-manager/cluster-issuer.yaml
# Wait for CA certificate to be issued
kubectl wait --for=condition=Ready certificate/homelab-root-ca -n security --timeout=60s
# Verify issuers
kubectl get clusterissuer
# Expected: selfsigned-issuer Ready
# homelab-ca-issuer Ready
# gitops-apps/infrastructure/cert-manager/wildcard-certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: homelab-wildcard
namespace: security
spec:
secretName: homelab-wildcard-tls
duration: 8760h # 1 year
renewBefore: 360h # Renew 15 days before expiry
subject:
organizations:
- Homelab
dnsNames:
- homelab.local
- "*.homelab.local"
issuerRef:
name: homelab-ca-issuer
kind: ClusterIssuer
group: cert-manager.io
kubectl apply -f gitops-apps/infrastructure/cert-manager/wildcard-certificate.yaml
# Wait for issuance
kubectl wait --for=condition=Ready certificate/homelab-wildcard -n security --timeout=60s
# Verify
kubectl get certificate -n security
kubectl describe certificate homelab-wildcard -n security
Create a TLS secret in the ingress namespace by copying from security:
# gitops-apps/infrastructure/ingress/tls-secret.yaml
apiVersion: v1
kind: Secret
metadata:
name: homelab-wildcard-tls
namespace: ingress-nginx
annotations:
# Tell cert-manager to replicate this secret
replicator.v1.mittwald.de/replicate-to: "ingress-nginx,argocd,services,monitoring,security,logging"
type: kubernetes.io/tls
data: {}
Install cert-manager-trust or use a simple CronJob to sync the wildcard secret to all namespaces:
apiVersion: batch/v1
kind: CronJob
metadata:
name: sync-tls-secret
namespace: security
spec:
schedule: "*/15 * * * *"
jobTemplate:
spec:
template:
spec:
serviceAccountName: secret-syncer
containers:
- name: sync
image: bitnami/kubectl:latest
command:
- /bin/bash
- -c
- |
NAMESPACES="ingress-nginx argocd services monitoring security logging"
for NS in $NAMESPACES; do
kubectl get secret homelab-wildcard-tls -n security -o yaml \
| sed "s/namespace: security/namespace: $NS/" \
| kubectl apply -f -
done
restartPolicy: OnFailure
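The CronJob above runs as a secret-syncer ServiceAccount that is not defined elsewhere in this guide; a minimal RBAC sketch of what it would need (names are assumptions):
apiVersion: v1
kind: ServiceAccount
metadata:
  name: secret-syncer
  namespace: security
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: secret-syncer
rules:
  # Read the source secret and create/update the copies in target namespaces
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: secret-syncer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: secret-syncer
subjects:
  - kind: ServiceAccount
    name: secret-syncer
    namespace: security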
Patch ingress-nginx to use the wildcard certificate as default:
helm upgrade ingress-nginx ingress-nginx/ingress-nginx \
--namespace ingress-nginx \
--reuse-values \
--set controller.extraArgs.default-ssl-certificate="security/homelab-wildcard-tls"
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: gitea
namespace: services
annotations:
cert-manager.io/cluster-issuer: homelab-ca-issuer
nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
spec:
ingressClassName: nginx
tls:
- hosts:
- git.homelab.local
secretName: gitea-tls
rules:
- host: git.homelab.local
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: gitea
port:
number: 3000
Create gitops-apps/infrastructure/cert-manager/service-certificates.yaml:
---
# Gitea
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: gitea-tls
namespace: services
spec:
secretName: gitea-tls
duration: 8760h
renewBefore: 360h
dnsNames:
- git.homelab.local
- gitea.services.svc.cluster.local
issuerRef:
name: homelab-ca-issuer
kind: ClusterIssuer
---
# ArgoCD
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: argocd-tls
namespace: argocd
spec:
secretName: argocd-tls
duration: 8760h
renewBefore: 360h
dnsNames:
- argocd.homelab.local
- argocd-server.argocd.svc.cluster.local
issuerRef:
name: homelab-ca-issuer
kind: ClusterIssuer
---
# Vault
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: vault-tls
namespace: security
spec:
secretName: vault-tls
duration: 8760h
renewBefore: 360h
dnsNames:
- vault.homelab.local
- vault.security.svc.cluster.local
- vault-internal.security.svc.cluster.local
issuerRef:
name: homelab-ca-issuer
kind: ClusterIssuer
---
# Grafana
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: grafana-tls
namespace: monitoring
spec:
secretName: grafana-tls
duration: 8760h
renewBefore: 360h
dnsNames:
- grafana.homelab.local
- kube-prometheus-stack-grafana.monitoring.svc.cluster.local
issuerRef:
name: homelab-ca-issuer
kind: ClusterIssuer
---
# Authelia
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: authelia-tls
namespace: security
spec:
secretName: authelia-tls
duration: 8760h
renewBefore: 360h
dnsNames:
- auth.homelab.local
- authelia.security.svc.cluster.local
issuerRef:
name: homelab-ca-issuer
kind: ClusterIssuer
---
# Longhorn UI
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: longhorn-tls
namespace: longhorn-system
spec:
secretName: longhorn-tls
duration: 8760h
renewBefore: 360h
dnsNames:
- longhorn.homelab.local
- longhorn-frontend.longhorn-system.svc.cluster.local
issuerRef:
name: homelab-ca-issuer
kind: ClusterIssuer
kubectl apply -f gitops-apps/infrastructure/cert-manager/service-certificates.yaml
# Wait for all certificates
kubectl get certificates -A
# Store Root CA cert in Vault for distribution
vault kv put secret/certificates/root-ca \
certificate="$(kubectl get secret homelab-root-ca-tls -n security -o jsonpath='{.data.ca\.crt}' | base64 -d)"
# Retrieve from any machine
vault kv get -field=certificate secret/certificates/root-ca > homelab-ca.crt
# Get the Root CA certificate
kubectl get secret homelab-root-ca-tls -n security -o jsonpath='{.data.ca\.crt}' | base64 -d > homelab-ca.crt
# Add to macOS System Keychain (requires admin password)
sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain homelab-ca.crt
# Verify
security find-certificate -c "Homelab Root CA" /Library/Keychains/System.keychain
# Via Ansible — add to all K3s nodes
cat > ansible/playbooks/trust-ca.yml <<'EOF'
---
- name: Trust Homelab CA on all nodes
hosts: k3s_cluster
become: true
tasks:
- name: Copy CA certificate
ansible.builtin.copy:
src: homelab-ca.crt
dest: /usr/local/share/ca-certificates/homelab-ca.crt
mode: '0644'
- name: Update CA certificates
ansible.builtin.command: update-ca-certificates
changed_when: "'Added' in ca_update.stdout"
register: ca_update
EOF
ansible-playbook -i ansible/inventories/homelab/hosts.yml ansible/playbooks/trust-ca.yml
# Create ConfigMap with CA cert for pods that need it
kubectl create configmap homelab-ca \
--from-file=ca.crt=homelab-ca.crt \
--namespace=security \
--dry-run=client -o yaml | kubectl apply -f -
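A workload consumes this ConfigMap through an ordinary volume mount; a minimal sketch of the relevant pod-spec fragment (container name, image, and mount path are illustrative, and the ConfigMap must exist in the pod's namespace):
spec:
  containers:
    - name: app
      image: example/app:1.0.0
      volumeMounts:
        - name: homelab-ca
          mountPath: /etc/ssl/certs/homelab-ca.crt
          subPath: ca.crt
          readOnly: true
  volumes:
    - name: homelab-ca
      configMap:
        name: homelab-ca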
cert-manager exposes Prometheus metrics. Create a PrometheusRule:
# gitops-apps/monitoring/cert-manager-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: cert-manager-alerts
namespace: monitoring
spec:
groups:
- name: cert-manager
rules:
- alert: CertificateExpirySoon
expr: certmanager_certificate_expiration_timestamp_seconds - time() < 86400 * 30
for: 1h
labels:
severity: warning
annotations:
summary: "Certificate {{ $labels.name }} expires in less than 30 days"
description: "Certificate {{ $labels.name }} in namespace {{ $labels.namespace }} expires on {{ $value | humanizeTimestamp }}"
- alert: CertificateExpiryCritical
expr: certmanager_certificate_expiration_timestamp_seconds - time() < 86400 * 7
for: 1h
labels:
severity: critical
annotations:
summary: "Certificate {{ $labels.name }} expires in less than 7 days"
description: "Certificate {{ $labels.name }} in namespace {{ $labels.namespace }} expires on {{ $value | humanizeTimestamp }}"
- alert: CertificateFailedIssuance
expr: certmanager_certificate_ready_status{condition="False"} == 1
for: 10m
labels:
severity: critical
annotations:
summary: "Certificate {{ $labels.name }} failed to issue"
description: "Certificate {{ $labels.name }} in namespace {{ $labels.namespace }} is not ready"Import the official cert-manager dashboard (Dashboard ID: 20842) or create a custom one showing:
- Certificate status (Ready / Not Ready)
- Time until expiry per certificate
- Issuance success/failure rate
- CA issuer health
# Import via Grafana UI: Dashboards → Import → 20842
Create gitops-apps/infrastructure/cert-manager/application.yaml:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: cert-manager
namespace: argocd
spec:
project: homelab
source:
repoURL: https://git.homelab.local/homelab/gitops-apps.git
targetRevision: main
path: infrastructure/cert-manager
destination:
server: https://kubernetes.default.svc
namespace: cert-manager
syncPolicy:
automated:
selfHeal: true
prune: false
syncOptions:
- ServerSideApply=true
- CreateNamespace=true
# Check cert-manager is running
kubectl get pods -n cert-manager
# Check ClusterIssuers
kubectl get clusterissuer
# Expected: selfsigned-issuer Ready, homelab-ca-issuer Ready
# Check all certificates
kubectl get certificates -A
# All should show Ready: True
# Check certificate details
kubectl describe certificate homelab-wildcard -n security
# Verify TLS on a service
curl -v --cacert homelab-ca.crt https://git.homelab.local
# Expected: TLS handshake succeeds, certificate verified
# Check certificate expiry
kubectl get certificate -A -o custom-columns=NAME:.metadata.name,NS:.metadata.namespace,NOT_AFTER:.status.notAfter,READY:.status.conditions[0].status
# Verify Prometheus metrics
kubectl exec -n monitoring deploy/kube-prometheus-stack-prometheus -- \
wget -qO- 'http://localhost:9090/api/v1/query?query=certmanager_certificate_expiration_timestamp_seconds' | jq .
kubectl describe certificate <name> -n <namespace>
# Check Events section for issuance errors
# Common: ClusterIssuer not Ready, CA secret missing
kubectl describe clusterissuer homelab-ca-issuer
# Check: secret homelab-root-ca-tls exists in security namespace
kubectl get secret homelab-root-ca-tls -n security
# Re-import the CA certificate
# macOS: Keychain Access → System → Certificates → import homelab-ca.crt
# Set trust to "Always Trust"
# Restart browser after import
# Check cert-manager logs
kubectl logs -n cert-manager -l app.kubernetes.io/name=cert-manager --tail=50
# Verify renewBefore is set (default is 2/3 of duration)
- cert-manager installed with CRDs
- Self-signed Root CA generated (10-year validity)
- CA ClusterIssuer (homelab-ca-issuer) created and Ready
- Wildcard certificate *.homelab.local issued
- ingress-nginx configured with default TLS certificate
- TLS certificates issued for: Gitea, ArgoCD, Vault, Grafana, Authelia, Longhorn
- Root CA stored in Vault (secret/certificates/root-ca)
- Root CA trusted on macOS workstation
- Root CA trusted on K3s VMs (via Ansible)
- Certificate expiry alerts configured (30-day warning, 7-day critical)
- cert-manager Grafana dashboard imported
- Secret sync CronJob replicating TLS to all namespaces
- ArgoCD Application for cert-manager deployed
- All services accessible via HTTPS with valid certificates
Implement Velero for Kubernetes backups, Proxmox Backup Server for VM-level snapshots, and documented disaster recovery procedures.
This guide implements a three-tier backup strategy: Kubernetes resources and volumes via Velero, VM-level snapshots via Proxmox Backup Server, and application-specific backups for Gitea and Vault. Includes tested restore procedures and RPO/RTO targets.
Time Required: ~90 minutes
Prerequisites: Guide 06 (Longhorn Storage) completed
Backup & Recovery Architecture
┌─────────────────────────────────────────────────┐
│ Backup Sources │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────────────┐ │
│ │ Velero │ │ PBS │ │ App-Specific │ │
│ │ (K8s) │ │ (VMs) │ │ Vault/Gitea │ │
│ └────┬────┘ └────┬────┘ └───────┬─────────┘ │
└───────┼────────────┼───────────────┼─────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────┐
│ Storage Targets │
│ │
│ ┌──────────────┐ ┌────────────────────────┐ │
│ │ Longhorn NFS │ │ Proxmox Backup Server │ │
│ │ / S3 Bucket │ │ (deduplicated, Zstd) │ │
│ └──────────────┘ └────────────────────────┘ │
└─────────────────────────────────────────────────┘
| Component | RPO (max data loss) | RTO (max downtime) | Backup Frequency |
|---|---|---|---|
| K3s cluster resources | 1 hour | 30 minutes | Hourly |
| Monitoring data (Prometheus/Loki) | 24 hours | 2 hours | Daily |
| Gitea (repos + DB) | 6 hours | 1 hour | Every 6 hours |
| Vault (secrets) | 1 hour | 15 minutes | Hourly |
| Proxmox VMs (full) | 24 hours | 1 hour | Daily |
| Longhorn volumes | 24 hours | 30 minutes | Daily |
Create an NFS share on pve-vader for Velero backups:
# On pve-vader
mkdir -p /mnt/data/velero-backups
# Export via NFS (add to /etc/exports)
echo "/mnt/data/velero-backups 10.10.10.0/24(rw,sync,no_subtree_check,no_root_squash)" >> /etc/exports
exportfs -arv
Or use Longhorn to provision a PVC for backups.
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm repo update
# Create Velero namespace
kubectl create namespace velero
kubectl label namespace velero environment=infrastructure
# Install Velero with NFS storage
cat > velero-values.yaml <<'EOF'
configuration:
backupStorageLocation:
- name: default
provider: aws
bucket: velero
default: true
config:
region: minio
s3ForcePathStyle: true
publicUrl: http://minio.services.svc.cluster.local:9000
credential:
name: velero-cloud-credentials
namespace: velero
volumeSnapshotLocation:
- name: default
provider: aws
config:
region: minio
# Use local storage (NFS via restic/fs-backup)
# Alternative: use filesystem-based backup
initContainers:
- name: velero-plugin-for-aws
image: velero/velero-plugin-for-aws:v1.10.0
imagePullPolicy: IfNotPresent
volumeMounts:
- mountPath: /target
name: plugins
deployNodeAgent: true
metrics:
enabled: true
serviceMonitor:
enabled: true
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
# For local/NFS backup without S3, use restic
# Or configure with local provider
EOF
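The values above also reference a velero-cloud-credentials secret that is not created elsewhere in this guide; a minimal sketch, assuming a MinIO access key and secret (placeholders):
# AWS-style credentials file for the S3-compatible (MinIO) backend
cat > credentials-velero <<'CREDS'
[default]
aws_access_key_id = velero
aws_secret_access_key = <minio-secret-key>
CREDS
kubectl create secret generic velero-cloud-credentials \
  --namespace velero \
  --from-file=cloud=credentials-velero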
For a simpler setup using local filesystem backup:
# Install with restic for volume backup
helm install velero vmware-tanzu/velero \
--namespace velero \
--set configuration.backupStorageLocation[0].name=default \
--set configuration.backupStorageLocation[0].provider=aws \
--set configuration.backupStorageLocation[0].bucket=velero \
--set configuration.backupStorageLocation[0].default=true \
--set configuration.volumeSnapshotLocation[0].name=default \
--set configuration.volumeSnapshotLocation[0].provider=aws \
--set deployNodeAgent=true \
--set metrics.enabled=true \
--set metrics.serviceMonitor.enabled=true \
--set snapshotsEnabled=true \
--wait
kubectl get pods -n velero
# Expected: velero Running, node-agent DaemonSet Running on all nodes
velero version
velero backup-location get
# Expected: default Available
# Hourly: Critical namespaces (Vault, ArgoCD config)
velero schedule create critical-hourly \
--include-namespaces security,argocd \
--include-cluster-scopes=true \
--schedule="0 * * * *" \
--ttl=72h \
--snapshot-volumes=true
# Daily: Full cluster backup
velero schedule create full-daily \
--include-namespaces '*' \
--exclude-namespaces velero,kube-system \
--include-cluster-scopes=true \
--schedule="15 2 * * *" \
--ttl=168h \
--snapshot-volumes=true
# Weekly: Long-term retention
velero schedule create weekly-archive \
--include-namespaces '*' \
--exclude-namespaces velero,kube-system \
--include-cluster-scopes=true \
--schedule="0 3 * * 0" \
--ttl=720h \
--snapshot-volumes=true
velero schedule get
# Expected:
# critical-hourly 0 * * * * 72h
# full-daily 15 2 * * * 168h
# weekly-archive 0 3 * * 0 720h
# Test with a manual backup
velero backup create test-backup \
--include-namespaces monitoring \
--snapshot-volumes=true \
--wait
# Check status
velero backup describe test-backup --details
velero backup logs test-backup
Use this after total cluster loss (all nodes down):
# Step 1: Rebuild K3s cluster (Guide 05)
# Step 2: Install Velero (this guide, Phase 1)
# Step 3: Restore from backup
# List available backups
velero backup get
# Restore full cluster from latest daily backup
velero restore create full-restore \
--from-backup full-daily-<TIMESTAMP> \
--wait
# Monitor restore progress
velero restore describe full-restore --details
Use this for accidental deletion of a namespace:
# List backups containing the namespace
velero backup get -o wide
# Restore specific namespace
velero restore create restore-monitoring \
--from-backup full-daily-<TIMESTAMP> \
--include-namespaces monitoring \
--wait
# Verify restored resources
velero restore describe restore-monitoring --details
# Extract specific resource from backup
velero restore create restore-single-deployment \
--from-backup full-daily-<TIMESTAMP> \
--include-namespaces services \
--include-resources deployments \
--selector app=gitea \
--wait
# Download Proxmox Backup Server ISO
# Create VM via Terraform or Proxmox UI:
# - 2 CPU, 4GB RAM
# - 100GB disk (use NVMe storage)
# - Attached to vnet-homelab (10.10.10.0/24)
# - IP: 10.10.10.5
# Add to Terraform if desired:
# terraform/environments/homelab/main.tf
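If you do manage the PBS VM with Terraform, a hedged sketch using the proxmox_vm_qemu resource already used in this repo (field names follow the Telmate provider 2.x schema; the template and storage names are assumptions):
resource "proxmox_vm_qemu" "pbs" {
  name        = "pbs-01"
  target_node = "pve-vader"
  clone       = "debian-12-template"   # assumed base template
  cores       = 2
  memory      = 4096
  disk {
    size    = "100G"
    type    = "scsi"
    storage = "local-nvme"             # assumed storage pool
  }
  network {
    model  = "virtio"
    bridge = "vnet-homelab"
  }
}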
# After PBS installation, access web UI at https://10.10.10.5:8007
# Create datastore for backups
# UI: Datastore → Add → name: "homelab-backups", path: /mnt/data/homelab-backups
# Create backup user
# UI: Configuration → Access Control → Add user: backup@pbs
# Assign DatastoreBackup role on homelab-backups datastore
On each Proxmox node:
# Add PBS as backup storage
# On pve-vader:
cat >> /etc/pve/storage.cfg <<'EOF'
pbs: pbs-backup
server 10.10.10.5
datastore homelab-backups
username backup@pbs
password <pbs-password>
fingerprint <pbs-fingerprint>
content backup
EOF
# Add to all nodes (or via Proxmox cluster config sync)
# On any Proxmox node:
# Daily backup of critical VMs (Vader)
vzdump 100,200 --mode snapshot --storage pbs-backup \
--compress zstd --mailto root --mailnotification failure \
--schedule "02:00"
# Daily backup of worker VMs (Sidious)
vzdump 201 --mode snapshot --storage pbs-backup \
--compress zstd --schedule "03:00"
# Maul (hack box) — weekly only
vzdump 800 --mode snapshot --storage pbs-backup \
--compress zstd --schedule "Sun 04:00"
# Verify backups
proxmox-backup-client snapshot list --repository backup@pbs@10.10.10.5:homelab-backups
# In PBS UI: Datastore → homelab-backups → Prune Options
# Keep:
# - Last 7 daily backups
# - Last 4 weekly backups
# - Last 3 monthly backups
# Vault uses Raft storage — snapshot the Raft data
kubectl port-forward -n security svc/vault 8200:8200 &
VAULT_ADDR="http://127.0.0.1:8200"
# Create Raft snapshot
vault operator raft snapshot save vault-snapshot-$(date +%Y%m%d).snap
# Automated via CronJob:
cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: CronJob
metadata:
name: vault-backup
namespace: security
spec:
schedule: "0 * * * *"
jobTemplate:
spec:
template:
spec:
containers:
- name: vault-backup
image: hashicorp/vault:latest
command:
- /bin/sh
- -c
- |
export VAULT_ADDR="http://vault.security.svc.cluster.local:8200"
export VAULT_TOKEN="$(cat /vault/token)"
vault operator raft snapshot save /backups/vault-$(date +%Y%m%d-%H%M).snap
volumeMounts:
- name: token
mountPath: /vault
- name: backups
mountPath: /backups
volumes:
- name: token
secret:
secretName: vault-root-token
- name: backups
persistentVolumeClaim:
claimName: vault-backups-pvc
restartPolicy: OnFailure
EOF
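The CronJob reads its token from a vault-root-token secret that this guide does not create elsewhere; a minimal sketch (a dedicated token with only snapshot permissions is preferable to the actual root token):
kubectl create secret generic vault-root-token \
  --namespace security \
  --from-literal=token="<vault-token-with-raft-snapshot-permission>" \
  --dry-run=client -o yaml | kubectl apply -f -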
Create the PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: vault-backups-pvc
namespace: security
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: longhorn-default
resources:
requests:
storage: 5Gi
# Gitea dump (repos + database + config)
cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: CronJob
metadata:
name: gitea-backup
namespace: services
spec:
schedule: "0 */6 * * *"
jobTemplate:
spec:
template:
spec:
containers:
- name: gitea-backup
image: gitea/gitea:latest
command:
- /bin/sh
- -c
- |
/usr/local/bin/gitea dump \
-c /data/gitea/conf/app.ini \
--file /backups/gitea-dump-$(date +%Y%m%d-%H%M).zip
volumeMounts:
- name: gitea-data
mountPath: /data
- name: backups
mountPath: /backups
volumes:
- name: gitea-data
persistentVolumeClaim:
claimName: gitea-data
- name: backups
persistentVolumeClaim:
claimName: gitea-backups-pvc
restartPolicy: OnFailure
EOF
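The job mounts a gitea-backups-pvc claim that is not defined elsewhere in this guide; a minimal sketch mirroring the Vault backup PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gitea-backups-pvc
  namespace: services
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn-default
  resources:
    requests:
      storage: 10Gi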
# Configure Longhorn recurring backups via UI or kubectl
# Settings → General → Default Backup Store
# Set backup target: nfs://10.10.10.5:/mnt/data/longhorn-backups
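# Hedged kubectl alternative to the UI (assumes Longhorn's "backup-target" Setting object):
kubectl -n longhorn-system patch settings.longhorn.io backup-target \
  --type merge -p '{"value": "nfs://10.10.10.5:/mnt/data/longhorn-backups"}'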
# Create recurring backup job for critical volumes
cat <<'EOF' | kubectl apply -f -
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
name: daily-backup
namespace: longhorn-system
spec:
cron: "0 3 * * *"
task: backup
groups:
- default
retain: 7
concurrency: 2
EOF
## DR-001: Total Cluster Loss
### Trigger
- All 3 Proxmox nodes powered off or hardware failure
- Complete data center loss (e.g., power outage + UPS failure)
### Steps
1. Power on pve-vader first, then pve-sidious
2. Wait for Proxmox cluster quorum (2/3 nodes)
3. Restore VMs from PBS:
- Restore pfSense VM (ID 100) from latest backup
- Restore K3s master VM (ID 200) from latest backup
- Restore K3s worker VM (ID 201) from latest backup
4. Wait for K3s cluster to stabilize
5. Install Velero on new cluster
6. Point Velero to backup storage location
7. Restore from latest full-daily backup:
velero restore create full-restore --from-backup full-daily-<LATEST>
8. Verify all namespaces and workloads
9. Restore Vault from Raft snapshot if needed
10. Check all services: Gitea, ArgoCD, Grafana, Authelia
### RTO Target: 2-4 hours
## DR-002: Single Node Failure (pve-vader or pve-sidious)
### Trigger
- One Proxmox node becomes unresponsive
- Hardware failure on a single node
### Steps (if pve-sidious fails):
1. K3s worker pods reschedule to master (if resources allow)
2. Longhorn replicas on failed node become degraded
3. Replace/repair hardware
4. Reboot node — Proxmox rejoins cluster automatically
5. VMs restart with `onboot=yes`
6. Longhorn rebuilds replicas from healthy copies
### Steps (if pve-vader fails):
1. K3s master is down — cluster is read-only
2. Longhorn 2/3 replicas remain (on sidious + virtual disks)
3. Repair/reboot vader
4. K3s master resumes — pods reschedule
5. Longhorn rebuilds vader replicas
### RTO Target: 30-60 minutes (reboot) / 2-4 hours (hardware)
## DR-003: Namespace Deletion
### Trigger
- kubectl delete namespace <name> run accidentally
- ArgoCD sync removes resources incorrectly
### Steps
1. Identify the deleted namespace and timestamp
2. List available backups:
velero backup get
3. Restore from most recent backup:
velero restore create restore-<namespace> \
--from-backup <latest-backup> \
--include-namespaces <namespace> \
--wait
4. Verify restored resources:
kubectl get all -n <namespace>
5. Check ArgoCD sync status
### RTO Target: 15-30 minutes
Velero exposes Prometheus metrics. Import the Velero dashboard (Dashboard ID: 16871) in Grafana.
# gitops-apps/monitoring/backup-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: backup-alerts
namespace: monitoring
spec:
groups:
- name: backup
rules:
- alert: VeleroBackupFailed
expr: velero_backup_attempt_total - velero_backup_success_total > 0
for: 5m
labels:
severity: warning
annotations:
summary: "Velero backup {{ $labels.schedule }} failed"
description: "Backup {{ $labels.backup_name }} has failed. Check Velero logs."
- alert: VeleroBackupStale
expr: time() - velero_backup_last_successful_timestamp_seconds > 86400 * 2
for: 1h
labels:
severity: critical
annotations:
summary: "No successful Velero backup in 48 hours"
description: "Schedule {{ $labels.schedule }} has not had a successful backup in over 48 hours."
- alert: ProxmoxBackupFailed
expr: increase(pbs_backup_failed_total[24h]) > 0
for: 1h
labels:
severity: warning
annotations:
summary: "Proxmox backup job failed"# Test restore procedure monthly
# scripts/monthly-restore-drill.sh
#!/bin/bash
echo "=== Monthly Restore Drill: $(date) ==="
# 1. Create a test namespace with resources
kubectl create namespace drill-test
kubectl run nginx --image=nginx -n drill-test
kubectl expose pod nginx --port=80 -n drill-test
# 2. Wait for backup to capture it
echo "Waiting for next scheduled backup..."
sleep 3600
# 3. Delete the namespace
kubectl delete namespace drill-test
# 4. Restore from Velero
LATEST=$(velero backup get --sort-by=.metadata.creationTimestamp -o json | jq -r '.items[-1].metadata.name')
velero restore create drill-restore --from-backup "$LATEST" --include-namespaces drill-test --wait
# 5. Verify
kubectl get pods -n drill-test
kubectl get svc -n drill-test
# 6. Clean up
kubectl delete namespace drill-test
echo "=== Drill Complete ==="# gitops-apps/infrastructure/velero/application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: velero
namespace: argocd
spec:
project: homelab
source:
repoURL: https://git.homelab.local/homelab/gitops-apps.git
targetRevision: main
path: infrastructure/velero
destination:
server: https://kubernetes.default.svc
namespace: velero
syncPolicy:
automated:
selfHeal: true
prune: false
syncOptions:
- ServerSideApply=true
- CreateNamespace=true
# Velero status
velero version
velero backup-location get
velero schedule get
# Run a manual backup
velero backup create verification-backup \
--include-namespaces security \
--snapshot-volumes=true \
--wait
# Check backup
velero backup describe verification-backup --details
# Test restore
velero restore create verification-restore \
--from-backup verification-backup \
--include-namespaces security \
--wait
velero restore describe verification-restore
# PBS status (from Proxmox node)
proxmox-backup-client snapshot list --repository backup@pbs@10.10.10.5:homelab-backups
# Vault backup
vault operator raft snapshot save /tmp/test-snapshot.snap
ls -la /tmp/test-snapshot.snap
velero backup describe <backup-name> --details
velero backup logs <backup-name>
# Common: storage location unreachable, credentials invalid
kubectl get pods -n velero -l name=node-agent
kubectl logs -n velero -l name=node-agent --tail=50
# Check: node-agent DaemonSet is running on all nodes
# Verify PBS is running
curl -k https://10.10.10.5:8007/api2/json/status
# Check PBS service: systemctl status proxmox-backup-server
# Ensure Vault is unsealed before restore
vault status
# Restore: vault operator raft snapshot restore <snapshot-file>
- Velero installed in velero namespace
- Backup storage location configured and Available
- Node agent (restic) DaemonSet running on all nodes
- Hourly backup schedule for critical namespaces (Vault, ArgoCD)
- Daily full cluster backup schedule
- Weekly long-term retention schedule
- Manual backup and restore tested successfully
- Proxmox Backup Server deployed (VM on vader, 10.10.10.5)
- PBS datastore configured with pruning policy
- Proxmox VM backup jobs scheduled (daily critical, weekly hack box)
- Vault backup CronJob running hourly (Raft snapshots)
- Gitea backup CronJob running every 6 hours
- Longhorn recurring backup job configured
- DR runbooks written: total loss, single node, namespace deletion
- Backup alerts configured in Prometheus
- Velero Grafana dashboard imported
- Monthly restore drill script created
- RPO/RTO targets documented
- ArgoCD Application for Velero deployed
Run CIS benchmarks with kube-bench, harden K3s nodes, and automate compliance scanning with OpenSCAP and Kyverno policies.
This guide hardens every layer of the homelab stack: Proxmox hosts, K3s nodes, and Kubernetes workloads. Uses CIS benchmarks as the compliance standard, with automated scanning and remediation through Ansible, kube-bench, and Kyverno.
Time Required: ~90 minutes
Prerequisites: Guide 05 (K3s Cluster), Guide 08 (Security Tooling) completed
Compliance & Hardening Stack
┌─────────────────────────────────────────────────┐
│ Compliance Layers │
│ │
│ Layer 1: Proxmox Host Hardening │
│ SSH · fail2ban · auditd · sysctl · firewall │
│ │
│ Layer 2: K3s Node Hardening │
│ CIS Benchmark · kube-bench · kernel params │
│ │
│ Layer 3: Kubernetes Workload Policies │
│ Kyverno · Pod Security · RBAC · NetworkPolicy │
│ │
│ Layer 4: Compliance Reporting │
│ OpenSCAP · kube-bench → Grafana dashboards │
└─────────────────────────────────────────────────┘
Create ansible/playbooks/harden-hosts.yml:
---
- name: Harden Proxmox Hosts
hosts: proxmox
become: true
tasks:
- name: Configure SSH daemon
ansible.builtin.lineinfile:
path: /etc/ssh/sshd_config
regexp: "{{ item.regexp }}"
line: "{{ item.line }}"
validate: "sshd -t -f %s"
loop:
- { regexp: "^#?PermitRootLogin", line: "PermitRootLogin prohibit-password" }
- { regexp: "^#?PasswordAuthentication", line: "PasswordAuthentication no" }
- { regexp: "^#?PubkeyAuthentication", line: "PubkeyAuthentication yes" }
- { regexp: "^#?X11Forwarding", line: "X11Forwarding no" }
- { regexp: "^#?MaxAuthTries", line: "MaxAuthTries 3" }
- { regexp: "^#?ClientAliveInterval", line: "ClientAliveInterval 300" }
- { regexp: "^#?ClientAliveCountMax", line: "ClientAliveCountMax 2" }
- { regexp: "^#?LoginGraceTime", line: "LoginGraceTime 30" }
- { regexp: "^#?PermitEmptyPasswords", line: "PermitEmptyPasswords no" }
- { regexp: "^#?AllowAgentForwarding", line: "AllowAgentForwarding no" }
- { regexp: "^#?AllowTcpForwarding", line: "AllowTcpForwarding no" }
notify: Restart SSHD
- name: Configure SSH ciphers
ansible.builtin.blockinfile:
path: /etc/ssh/sshd_config
block: |
Ciphers chacha20-poly1305@openssh.com,aes256-gcm@openssh.com,aes128-gcm@openssh.com
MACs hmac-sha2-512-etm@openssh.com,hmac-sha2-256-etm@openssh.com
KexAlgorithms curve25519-sha256,curve25519-sha256@libssh.org
notify: Restart SSHD
handlers:
- name: Restart SSHD
ansible.builtin.systemd:
name: sshd
state: restarted
Add to ansible/playbooks/harden-hosts.yml:
- name: Install fail2ban
ansible.builtin.apt:
name: fail2ban
state: present
update_cache: true
- name: Configure fail2ban for SSH
ansible.builtin.copy:
dest: /etc/fail2ban/jail.local
content: |
[DEFAULT]
bantime = 3600
findtime = 600
maxretry = 3
backend = systemd
[sshd]
enabled = true
port = ssh
filter = sshd
logpath = /var/log/auth.log
maxretry = 3
bantime = 3600
mode: '0644'
notify: Restart fail2ban
handlers:
- name: Restart fail2ban
ansible.builtin.systemd:
name: fail2ban
state: restarted
- name: Kernel hardening via sysctl
ansible.posix.sysctl:
name: "{{ item.key }}"
value: "{{ item.value }}"
sysctl_set: true
state: present
reload: true
loop:
- { key: "net.ipv4.ip_forward", value: "1" } # Required for Proxmox SDN
- { key: "net.ipv4.conf.all.send_redirects", value: "0" }
- { key: "net.ipv4.conf.default.send_redirects", value: "0" }
- { key: "net.ipv4.conf.all.accept_redirects", value: "0" }
- { key: "net.ipv4.conf.default.accept_redirects", value: "0" }
- { key: "net.ipv4.conf.all.accept_source_route", value: "0" }
- { key: "net.ipv4.conf.default.accept_source_route", value: "0" }
- { key: "net.ipv6.conf.all.accept_redirects", value: "0" }
- { key: "kernel.dmesg_restrict", value: "1" }
- { key: "kernel.kptr_restrict", value: "2" }
- { key: "kernel.unprivileged_bpf_disabled", value: "1" }
- { key: "fs.suid_dumpable", value: "0" }
- { key: "net.core.bpf_jit_harden", value: "2" } - name: Install auditd
ansible.builtin.apt:
name:
- auditd
- audispd-plugins
state: present
- name: Configure auditd rules
ansible.builtin.copy:
dest: /etc/audit/rules.d/homelab.rules
content: |
# Monitor privileged commands
-a always,exit -F arch=b64 -S execve -F euid=0 -F auid>=1000 -F auid!=4294967295 -k privileged
# Monitor SSH config changes
-w /etc/ssh/sshd_config -p wa -k ssh_config
# Monitor sudoers changes
-w /etc/sudoers -p wa -k sudoers
# Monitor Proxmox config changes
-w /etc/pve/ -p wa -k proxmox_config
# Monitor login events
-w /var/log/auth.log -p wa -k logins
# Monitor cron jobs
-w /etc/cron* -p wa -k cron
# Monitor system time changes
-a exit,always -F arch=b64 -S clock_settime -k time_change
mode: '0600'
notify: Restart auditd
handlers:
- name: Restart auditd
ansible.builtin.command: augenrules --load
changed_when: true
ansible-playbook -i ansible/inventories/homelab/hosts.yml ansible/playbooks/harden-hosts.yml
Create ansible/inventories/homelab/group_vars/k3s.yml additions:
# CIS Benchmark compliance settings
k3s_server_args:
- "--kube-apiserver-arg=enable-admission-plugins=NodeRestriction,PodSecurityPolicy"
- "--kube-apiserver-arg=audit-log-path=/var/log/k3s/audit.log"
- "--kube-apiserver-arg=audit-log-maxage=30"
- "--kube-apiserver-arg=audit-log-maxbackup=10"
- "--kube-apiserver-arg=audit-log-maxsize=100"
- "--kube-apiserver-arg=authorization-mode=Node,RBAC"
- "--kube-apiserver-arg=profiling=false"
- "--kube-controller-manager-arg=profiling=false"
- "--kube-scheduler-arg=profiling=false"
- "--kubelet-arg=protect-kernel-defaults=true"
- "--kubelet-arg=rotate-server-certificates=true"
- "--write-kubeconfig-mode=644"Create ansible/playbooks/harden-k3s.yml:
---
- name: Harden K3s Nodes
hosts: k3s_cluster
become: true
tasks:
# Kernel parameters required by CIS
- name: CIS-required kernel parameters
ansible.posix.sysctl:
name: "{{ item.key }}"
value: "{{ item.value }}"
sysctl_set: true
state: present
loop:
- { key: "vm.overcommit_memory", value: "1" }
- { key: "vm.panic_on_oom", value: "0" }
- { key: "kernel.panic", value: "10" }
- { key: "kernel.panic_on_oops", value: "1" }
- { key: "net.ipv4.tcp_max_syn_backlog", value: "12800" }
# Create audit log directory
- name: Create K3s audit log directory
ansible.builtin.file:
path: /var/log/k3s
state: directory
mode: '0700'
# etcd data directory permissions
- name: Secure etcd data directory
ansible.builtin.file:
path: /var/lib/rancher/k3s/server/db/etcd
mode: '0700'
when: "'k3s_master' in group_names"
# K3s config file permissions
- name: Secure K3s config
ansible.builtin.file:
path: /etc/rancher/k3s/config.yaml
mode: '0600'
ignore_errors: true
ansible-playbook -i ansible/inventories/homelab/hosts.yml ansible/playbooks/harden-k3s.yml
# Run as a Job in the cluster
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
name: kube-bench
namespace: monitoring
spec:
template:
spec:
hostPID: true
containers:
- name: kube-bench
image: docker.io/aquasec/kube-bench:latest
command: ["kube-bench", "run", "--targets", "master,node", "--benchmark", "k3s-cis-1.8"]
volumeMounts:
- name: var-lib-etcd
mountPath: /var/lib/etcd
- name: var-lib-kubelet
mountPath: /var/lib/kubelet
- name: etc-systemd
mountPath: /etc/systemd
- name: etc-kubernetes
mountPath: /etc/kubernetes
- name: usr-bin
mountPath: /usr/local/bin
volumes:
- name: var-lib-etcd
hostPath:
path: /var/lib/rancher/k3s/server/db/etcd
- name: var-lib-kubelet
hostPath:
path: /var/lib/kubelet
- name: etc-systemd
hostPath:
path: /etc/systemd
- name: etc-kubernetes
hostPath:
path: /etc/rancher/k3s
- name: usr-bin
hostPath:
path: /usr/local/bin
restartPolicy: Never
nodeSelector:
node-role.kubernetes.io/control-plane: "true"
EOF
# Wait and get results
kubectl wait --for=condition=complete job/kube-bench -n monitoring --timeout=120s
kubectl logs job/kube-bench -n monitoring
apiVersion: batch/v1
kind: CronJob
metadata:
name: kube-bench-weekly
namespace: monitoring
spec:
schedule: "0 3 * * 1" # Weekly Monday 3 AM
jobTemplate:
spec:
template:
spec:
hostPID: true
containers:
- name: kube-bench
image: docker.io/aquasec/kube-bench:latest
command:
- /bin/sh
- -c
- |
kube-bench run --targets master,node --benchmark k3s-cis-1.8 \
--json > /results/kube-bench-$(date +%Y%m%d).json
volumeMounts:
- name: results
mountPath: /results
- name: var-lib-etcd
mountPath: /var/lib/etcd
- name: var-lib-kubelet
mountPath: /var/lib/kubelet
- name: etc-systemd
mountPath: /etc/systemd
- name: etc-kubernetes
mountPath: /etc/kubernetes
volumes:
- name: results
persistentVolumeClaim:
claimName: kube-bench-results
- name: var-lib-etcd
hostPath:
path: /var/lib/rancher/k3s/server/db/etcd
- name: var-lib-kubelet
hostPath:
path: /var/lib/kubelet
- name: etc-systemd
hostPath:
path: /etc/systemd
- name: etc-kubernetes
hostPath:
path: /etc/rancher/k3s
restartPolicy: Never
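The weekly job writes its output to a kube-bench-results claim that is not created elsewhere in this guide; a minimal sketch:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kube-bench-results
  namespace: monitoring
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn-default
  resources:
    requests:
      storage: 1Gi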
Parse kube-bench JSON results and push metrics to Prometheus Pushgateway:
# Script to push kube-bench results to Prometheus
# Note: run kube-bench with --json output (as in the weekly CronJob) so jq can parse it
RESULTS=$(kubectl logs job/kube-bench -n monitoring --tail=-1)
# Extract pass/fail counts
PASS=$(echo "$RESULTS" | jq '[.Controls[].tests[].results[] | select(.status=="PASS")] | length')
FAIL=$(echo "$RESULTS" | jq '[.Controls[].tests[].results[] | select(.status=="FAIL")] | length')
WARN=$(echo "$RESULTS" | jq '[.Controls[].tests[].results[] | select(.status=="WARN")] | length')
INFO=$(echo "$RESULTS" | jq '[.Controls[].tests[].results[] | select(.status=="INFO")] | length')
# Push to Pushgateway
PUSHGATEWAY="http://kube-prometheus-stack-prometheus-pushgateway.monitoring.svc.cluster.local:9091"
cat <<EOF | curl --data-binary @- ${PUSHGATEWAY}/metrics/job/kube-bench
cis_benchmark_results{status="pass"} ${PASS}
cis_benchmark_results{status="fail"} ${FAIL}
cis_benchmark_results{status="warn"} ${WARN}
cis_benchmark_results{status="info"} ${INFO}
EOF
# Add to ansible/playbooks/oscap-scan.yml
- name: Install OpenSCAP
ansible.builtin.apt:
name:
- openscap-scanner
- scap-security-guide
state: present
update_cache: true
- name: Run OpenSCAP CIS scan
ansible.builtin.command: >
oscap xccdf eval
--profile xccdf_org.ssgproject.content_profile_cis
--results /tmp/oscap-results.xml
--report /tmp/oscap-report.html
/usr/share/xml/scap/ssg/content/ssg-ubuntu2204-ds.xml
register: oscap_scan
failed_when: false
changed_when: false
- name: Fetch OpenSCAP report
ansible.builtin.fetch:
src: /tmp/oscap-report.html
dest: "reports/oscap-{{ inventory_hostname }}.html"
flat: true
mkdir -p reports
ansible-playbook -i ansible/inventories/homelab/hosts.yml ansible/playbooks/oscap-scan.yml
# View reports
open reports/oscap-k3s-master-01.html
open reports/oscap-k3s-worker-01.html
Add to gitops-apps/security/kyverno/policies.yaml:
---
# Require app.kubernetes.io labels on all workloads
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-kubernetes-labels
annotations:
policies.kyverno.io/title: Require Kubernetes Standard Labels
policies.kyverno.io/category: Compliance
spec:
validationFailureAction: Audit
rules:
- name: require-app-label
match:
any:
- resources:
kinds:
- Deployment
- StatefulSet
- DaemonSet
validate:
message: "All workloads must have app.kubernetes.io/name label"
pattern:
spec:
template:
metadata:
labels:
app.kubernetes.io/name: "?*"
---
# Require read-only root filesystem
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-readonly-rootfs
annotations:
policies.kyverno.io/title: Require Read-Only Root Filesystem
policies.kyverno.io/category: CIS 5.2.4
spec:
validationFailureAction: Audit
rules:
- name: validate-rootfs
match:
any:
- resources:
kinds:
- Deployment
- StatefulSet
validate:
message: "Containers must use read-only root filesystem (securityContext.readOnlyRootFilesystem=true)"
pattern:
spec:
template:
spec:
containers:
- securityContext:
readOnlyRootFilesystem: true
---
# Disallow hostPath mounts
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: disallow-hostpath
annotations:
policies.kyverno.io/title: Disallow hostPath Volume Mounts
policies.kyverno.io/category: CIS 5.2.5
spec:
validationFailureAction: Audit
rules:
- name: prevent-hostpath
match:
any:
- resources:
kinds:
- Deployment
- StatefulSet
- Pod
validate:
message: "hostPath volume mounts are not allowed"
pattern:
spec:
template:
spec:
=(volumes):
- X(hostPath): "null"
---
# Require resource quotas on production namespaces
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-resource-quota
annotations:
policies.kyverno.io/title: Require ResourceQuota on Production Namespaces
policies.kyverno.io/category: Compliance
spec:
validationFailureAction: Audit
rules:
- name: check-resource-quota
match:
any:
- resources:
kinds:
- Namespace
validate:
message: "Production namespaces must have a ResourceQuota"
deny:
conditions:
all:
- key: "{{ request.object.metadata.labels.environment || '' }}"
operator: Equals
value: "production"
- key: "{{ has_resource_quota }}"
operator: Equals
value: falseDocument CIS controls that are intentionally skipped for the homelab:
| CIS Control | Reason for Skipping |
|---|---|
| 1.2.1 (API server anonymous auth) | K3s requires anonymous auth for health checks |
| 1.4.1 (TLS certificates via CA) | Using self-signed CA (Guide 15) — acceptable for homelab |
| 3.2.1 (etcd encryption) | Not configured — Vault handles secret encryption |
| 4.2.1 (kubelet anonymous auth) | K3s default — acceptable in isolated network |
| 5.1.1 (PodSecurityPolicy) | PSP deprecated in K8s 1.25+; using Kyverno instead |
| 5.3.2 (Seccomp profile) | Using RuntimeDefault via Falco recommendations |
# Verify SSH hardening on Proxmox hosts
ssh root@192.168.1.11 "grep -E 'PermitRootLogin|PasswordAuthentication|X11Forwarding' /etc/ssh/sshd_config"
# Verify fail2ban
ssh root@192.168.1.11 "fail2ban-client status sshd"
# Verify kernel parameters
ssh root@192.168.1.11 "sysctl kernel.kptr_restrict kernel.dmesg_restrict net.core.bpf_jit_harden"
# Verify auditd
ssh root@192.168.1.11 "auditctl -l"
# Run kube-bench
kubectl logs job/kube-bench -n monitoring | grep -E "PASS|FAIL|WARN" | head -30
# Check Kyverno compliance policies
kubectl get clusterpolicy -o wide
# Verify OpenSCAP report exists
ls -la reports/oscap-*.html
# K3s has specific paths — use --benchmark k3s-cis-1.8
# Check kube-bench supports K3s version
kube-bench run --targets master --benchmark k3s-cis-1.8 --debug
# Recovery: connect via Proxmox web console (no SSH needed)
# Fix sshd_config via console:
vi /etc/ssh/sshd_config
systemctl restart sshd
# Check banned IPs
fail2ban-client status sshd
# Unban an IP
fail2ban-client set sshd unbanip <IP>
# Whitelist Tailscale range
# Add to /etc/fail2ban/jail.local: ignoreip = 127.0.0.1/8 100.64.0.0/10
# Check Kyverno webhook
kubectl get validatingwebhookconfiguration -o yaml | grep kyverno
# Check Kyverno logs
kubectl logs -n kyverno -l app.kubernetes.io/name=kyverno --tail=50
- SSH hardened on all Proxmox hosts (key-only auth, no root login, restricted ciphers)
- fail2ban installed and active on all hosts
- Kernel hardening sysctl parameters applied
- auditd configured with homelab-specific rules
- K3s CIS config parameters applied via Ansible
- kube-bench runs successfully against K3s CIS benchmark
- kube-bench weekly CronJob scheduled
- kube-bench results pushed to Prometheus (Grafana dashboard)
- OpenSCAP scans K3s VMs against CIS Ubuntu profile
- OpenSCAP reports generated and reviewed
- Kyverno compliance policies deployed (labels, read-only FS, no hostPath, resource quotas)
- CIS controls not applied are documented with justification
- All hardening playbooks committed to Gitea
Deploy OWASP ZAP to automatically scan vulnerable applications (DVWA, Juice Shop) and integrate results into the security pipeline.
This guide deploys OWASP ZAP as an automated DAST scanner targeting the vulnerable applications already running in the sandbox namespace (DVWA, Juice Shop). ZAP runs scheduled scans and pushes results to Loki and Grafana for centralized security visibility.
Time Required: ~60 minutes
Prerequisites: Guide 09 (Red/Blue Team) completed
DAST Testing Architecture
┌─────────────────────────────────────────────────┐
│ Sandbox Namespace (maul) │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ DVWA │ │ Juice │ │Metasploitable│ │
│ │ │ │ Shop │ │ 2 │ │
│ └─────┬────┘ └─────┬────┘ └──────┬───────┘ │
│ │ │ │ │
│ └──────┬──────┘──────────────┘ │
│ │ (HTTP targets) │
│ ▼ │
│ ┌──────────────┐ │
│ │ OWASP ZAP │ ◄── Scheduled scans │
│ │ (scanner) │ ◄── CI/CD triggered │
│ └──────┬───────┘ │
│ │ │
└───────────────┼──────────────────────────────────┘
▼
┌─────────────────────────────────────────────────┐
│ Results Pipeline │
│ ZAP JSON → Parse → Loki (logs) │
│ → Prometheus (metrics) │
│ → Grafana (dashboards) │
└─────────────────────────────────────────────────┘
Create gitops-apps/security/zap/deployment.yaml:
---
apiVersion: v1
kind: ConfigMap
metadata:
name: zap-config
namespace: sandbox
data:
# Scan targets (internal service URLs)
targets: |
- name: juice-shop
url: http://juice-shop.sandbox.svc.cluster.local:3000
- name: dvwa
url: http://dvwa.sandbox.svc.cluster.local:80
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: zap-scanner
namespace: sandbox
labels:
app.kubernetes.io/name: zap-scanner
environment: sandbox
spec:
replicas: 1
selector:
matchLabels:
app.kubernetes.io/name: zap-scanner
template:
metadata:
labels:
app.kubernetes.io/name: zap-scanner
environment: sandbox
spec:
containers:
- name: zap
image: zaproxy/zap-stable:latest
ports:
- containerPort: 8080
name: api
- containerPort: 8090
name: proxy
resources:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: "2"
memory: 2Gi
env:
- name: ZAP_PORT
value: "8080"
- name: ZAP_API_KEY
valueFrom:
secretKeyRef:
name: zap-api-key
key: api-key
volumeMounts:
- name: zap-session
mountPath: /home/zap/.ZAP/session
- name: zap-reports
mountPath: /zap/reports
volumes:
- name: zap-session
emptyDir: {}
- name: zap-reports
persistentVolumeClaim:
claimName: zap-reports-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: zap-reports-pvc
namespace: sandbox
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: longhorn-ephemeral
resources:
requests:
storage: 5Gi
---
apiVersion: v1
kind: Service
metadata:
name: zap-scanner
namespace: sandbox
spec:
selector:
app.kubernetes.io/name: zap-scanner
ports:
- name: api
port: 8080
targetPort: 8080
- name: proxy
port: 8090
targetPort: 8090
kubectl create secret generic zap-api-key \
--namespace sandbox \
--from-literal=api-key="$(openssl rand -hex 32)" \
--dry-run=client -o yaml | kubectl apply -f -
kubectl apply -f gitops-apps/security/zap/deployment.yaml
# Wait for ZAP to be ready
kubectl wait --for=condition=available deployment/zap-scanner -n sandbox --timeout=120s
# Verify
kubectl get pods -n sandbox -l app.kubernetes.io/name=zap-scanner
# Port-forward ZAP API
kubectl port-forward -n sandbox svc/zap-scanner 8080:8080 &
# Run baseline scan against Juice Shop
kubectl exec -n sandbox deploy/zap-scanner -- \
zap-cli quick-scan \
--self-contained \
--cmd-options "-addoninstallall" \
-f json \
-o "-config scanner.strength=INSANE" \
http://juice-shop.sandbox.svc.cluster.local:3000
# Alternative: Use ZAP API directly
ZAP_API="http://127.0.0.1:8080"
API_KEY=$(kubectl get secret zap-api-key -n sandbox -o jsonpath='{.data.api-key}' | base64 -d)
# Start scan via API
curl -s "${ZAP_API}/JSON/ascan/action/scan/?apikey=${API_KEY}&url=http://juice-shop.sandbox.svc.cluster.local:3000&recurse=true"# DVWA requires authentication — configure ZAP context
# DVWA requires authentication — configure ZAP context
kubectl exec -n sandbox deploy/zap-scanner -- \
zap-cli quick-scan \
--self-contained \
-f json \
http://dvwa.sandbox.svc.cluster.local:80
# List alerts
kubectl exec -n sandbox deploy/zap-scanner -- \
zap-cli report -o /zap/reports/baseline-$(date +%Y%m%d).json -f json
# Copy report locally
kubectl cp sandbox/$(kubectl get pod -n sandbox -l app.kubernetes.io/name=zap-scanner -o jsonpath='{.items[0].metadata.name}'):/zap/reports/ ./zap-reports/
API_KEY=$(kubectl get secret zap-api-key -n sandbox -o jsonpath='{.data.api-key}' | base64 -d)
ZAP_API="http://127.0.0.1:8080"
# Create a new context
CONTEXT_ID=$(curl -s "${ZAP_API}/JSON/context/action/newContext/?apikey=${API_KEY}&contextName=dvwa-scan" | jq -r '.contextId')
# Include in context
curl -s "${ZAP_API}/JSON/context/action/includeInContext/?apikey=${API_KEY}&contextName=dvwa-scan®ex=http://dvwa\.sandbox\.svc\.cluster\.local.*"
# Set authentication method (form-based)
curl -s "${ZAP_API}/JSON/authentication/action/setAuthenticationMethod/?apikey=${API_KEY}&contextId=${CONTEXT_ID}&authMethodName=formBasedAuthentication&authMethodConfigParams=loginUrl=http://dvwa.sandbox.svc.cluster.local/login.php%20username={%25username%25}%26password={%25password%25}%26Login=Login"
# Create a scan user and capture its ID
USER_ID=$(curl -s "${ZAP_API}/JSON/users/action/newUser/?apikey=${API_KEY}&contextId=${CONTEXT_ID}&name=admin" | jq -r '.userId')
curl -s "${ZAP_API}/JSON/users/action/setAuthenticationCredentials/?apikey=${API_KEY}&contextId=${CONTEXT_ID}&userId=${USER_ID}&authCredentialsConfigParams=username=admin&password=password"
# Enable user
curl -s "${ZAP_API}/JSON/users/action/setUserEnabled/?apikey=${API_KEY}&contextId=${CONTEXT_ID}&userId=${USER_ID}&enabled=true"
# Set user for forced mode
curl -s "${ZAP_API}/JSON/forcedUser/action/setForcedUser/?apikey=${API_KEY}&contextId=${CONTEXT_ID}&userId=${USER_ID}"
curl -s "${ZAP_API}/JSON/forcedUser/action/setForcedUserModeEnabled/?apikey=${API_KEY}&enabled=true"# Spider the application first
curl -s "${ZAP_API}/JSON/spider/action/scan/?apikey=${API_KEY}&url=http://dvwa.sandbox.svc.cluster.local&contextName=dvwa-scan"
# Wait for spider to complete
sleep 30
# Run active scan
curl -s "${ZAP_API}/JSON/ascan/action/scan/?apikey=${API_KEY}&url=http://dvwa.sandbox.svc.cluster.local&recurse=true&contextName=dvwa-scan"apiVersion: batch/v1
kind: CronJob
metadata:
name: zap-weekly-scan
namespace: sandbox
spec:
schedule: "0 2 * * 0" # Weekly Sunday 2 AM
jobTemplate:
spec:
template:
spec:
containers:
- name: zap-scan
image: zaproxy/zap-stable:latest
command:
- /bin/sh
- -c
- |
DATE=$(date +%Y%m%d)
# Juice Shop baseline scan
zap-cli quick-scan --self-contained \
-f json \
-o "-config scanner.strength=INSANE" \
-o "-addoninstallall" \
/zap/reports/juiceshop-${DATE}.json \
http://juice-shop.sandbox.svc.cluster.local:3000
# DVWA baseline scan
zap-cli quick-scan --self-contained \
-f json \
/zap/reports/dvwa-${DATE}.json \
http://dvwa.sandbox.svc.cluster.local:80
# Parse and push metrics
TOTAL_ALERTS=$(jq '[.[].alerts | length] | add' /zap/reports/juiceshop-${DATE}.json 2>/dev/null || echo "0")
HIGH_ALERTS=$(jq '[.[] | select(.riskdesc | startswith("High"))] | length' /zap/reports/juiceshop-${DATE}.json 2>/dev/null || echo "0")
echo "zap_scan_alerts_total{target=\"juice-shop\",severity=\"total\"} ${TOTAL_ALERTS}" > /tmp/metrics
echo "zap_scan_alerts_total{target=\"juice-shop\",severity=\"high\"} ${HIGH_ALERTS}" >> /tmp/metrics
# Push to Pushgateway
wget --post-file=/tmp/metrics \
http://kube-prometheus-stack-prometheus-pushgateway.monitoring.svc.cluster.local:9091/metrics/job/zap-scan
volumeMounts:
- name: reports
mountPath: /zap/reports
volumes:
- name: reports
persistentVolumeClaim:
claimName: zap-reports-pvc
restartPolicy: OnFailure
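Rather than waiting for Sunday, the CronJob can be exercised once on demand by spawning a Job from it (the job name below is arbitrary):

```bash
# Trigger the weekly scan immediately as a one-off Job
kubectl create job --from=cronjob/zap-weekly-scan zap-scan-manual -n sandbox
# Follow the scan output
kubectl logs -n sandbox job/zap-scan-manual -f
# Clean up the manual run afterwards
kubectl delete job zap-scan-manual -n sandbox
```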
Add to gitops-apps/.gitea/workflows/dast-scan.yaml:
name: DAST Scan
on:
push:
branches: [main]
paths:
- 'sandbox-apps/**'
jobs:
zap-baseline:
name: "🔍 ZAP Baseline Scan"
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run ZAP Baseline Scan
run: |
# Mount the workspace so the JSON report is visible to the next step
docker run -t -v "$(pwd)":/zap/reports zaproxy/zap-stable:latest \
zap-cli quick-scan --self-contained \
-f json \
-o "-addoninstallall" \
/zap/reports/scan-results.json \
http://juice-shop.sandbox.svc.cluster.local:3000 || true
- name: Check for High Alerts
run: |
HIGH=$(jq '[.[] | select(.riskdesc | startswith("High"))] | length' scan-results.json)
if [ "$HIGH" -gt 0 ]; then
echo "::error::ZAP found ${HIGH} high-severity vulnerabilities"
exit 1
fi
- name: Upload Results
uses: actions/upload-artifact@v4
if: always()
with:
name: zap-scan-results
path: scan-results.json
Create a sidecar that ships ZAP reports to Loki:
# Add to ZAP deployment as sidecar
- name: report-shipper
image: grafana/alloy:latest
volumeMounts:
- name: zap-reports
mountPath: /reports
Or use a simpler approach — push results via curl to Loki after each scan:
# Push ZAP results to Loki
LOKI_URL="http://loki.logging.svc.cluster.local:3100"
REPORT_FILE="/zap/reports/juiceshop-$(date +%Y%m%d).json"
# Convert ZAP JSON to the Loki push format (streams array + nanosecond timestamp)
jq -c --arg ts "$(date +%s%N)" '{streams: [{stream: {job: "zap-scanner", target: "juice-shop"}, values: [[$ts, (. | tostring)]]}]}' \
"$REPORT_FILE" | curl -X POST "${LOKI_URL}/loki/api/v1/push" -H "Content-Type: application/json" --data-binary @-
Create a dashboard showing (example queries follow the list below):
- Total vulnerabilities by severity (High/Medium/Low/Info)
- Vulnerabilities per target application
- Vulnerability trends over time
- Top 10 most common vulnerability types
- OWASP Top 10 coverage (which categories are triggered)
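A minimal sketch of queries that could back these panels, assuming the `zap_scan_alerts_total` metric pushed by the weekly CronJob above and a Prometheus port-forwarded on localhost:9090; the `severity` and `target` labels follow the Pushgateway example earlier:

```bash
PROM="http://localhost:9090"

# Total vulnerabilities by severity
curl -sG "${PROM}/api/v1/query" \
  --data-urlencode 'query=sum by (severity) (zap_scan_alerts_total)'

# Vulnerabilities per target application
curl -sG "${PROM}/api/v1/query" \
  --data-urlencode 'query=sum by (target) (zap_scan_alerts_total{severity="total"})'

# High-severity trend over the last 30 days (one point per day)
curl -sG "${PROM}/api/v1/query_range" \
  --data-urlencode 'query=zap_scan_alerts_total{severity="high"}' \
  --data-urlencode "start=$(date -d '30 days ago' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=86400'
```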
Create a combined vulnerability dashboard comparing:
- Trivy (static): Container image vulnerabilities
- ZAP (dynamic): Runtime application vulnerabilities
- Falco (runtime): Active security events
This gives a complete view: static analysis + dynamic testing + runtime monitoring.
# Verify ZAP is running
kubectl get pods -n sandbox -l app.kubernetes.io/name=zap-scanner
# Run a quick test scan
kubectl exec -n sandbox deploy/zap-scanner -- \
zap-cli quick-scan --self-contained -f json \
http://juice-shop.sandbox.svc.cluster.local:3000
# Check scan results
kubectl exec -n sandbox deploy/zap-scanner -- ls -la /zap/reports/
# Copy report to local machine
POD=$(kubectl get pod -n sandbox -l app.kubernetes.io/name=zap-scanner -o jsonpath='{.items[0].metadata.name}')
kubectl cp sandbox/${POD}:/zap/reports/ ./zap-reports/
# View report
cat ./zap-reports/*.json | jq '.[0].alerts[:5]'
# Verify CronJob is scheduled
kubectl get cronjob -n sandbox zap-weekly-scan
# Check ZAP pod resources
kubectl top pod -n sandbox -l app.kubernetes.io/name=zap-scanner
# ZAP is memory-hungry — increase memory limit if needed
# Also check: target application is reachable from sandbox namespace
kubectl exec -n sandbox deploy/zap-scanner -- curl -sI http://juice-shop.sandbox.svc.cluster.local:3000
# Ensure spider finds the application pages
kubectl exec -n sandbox deploy/zap-scanner -- \
zap-cli spider http://juice-shop.sandbox.svc.cluster.local:3000
# Check: target URL is correct, service is running
kubectl get svc -n sandbox
# Test Loki connectivity from ZAP pod
kubectl exec -n sandbox deploy/zap-scanner -- \
curl -s http://loki.logging.svc.cluster.local:3100/ready
# Expected: ready
- OWASP ZAP deployed in sandbox namespace
- ZAP API accessible and functional
- Baseline scan completed against Juice Shop
- Baseline scan completed against DVWA
- Authenticated scan configured for DVWA (form-based login)
- Active scan tested with authentication
- Weekly automated scan CronJob scheduled
- CI/CD triggered DAST scan workflow created
- ZAP results pushed to Loki for log analysis
- ZAP metrics pushed to Prometheus Pushgateway
- Grafana ZAP dashboard created
- Combined vulnerability dashboard (Trivy + ZAP + Falco) created
- Scan reports stored on persistent volume (zap-reports-pvc)
Deploy Chaos Mesh to deliberately inject failures and test the resilience of the security and monitoring stack.
This guide installs Chaos Mesh and runs controlled experiments against the homelab infrastructure. Each experiment tests a specific resilience property: pod recovery, network partition handling, storage degradation, and alert pipeline integrity.
Time Required: ~75 minutes Prerequisites: Guide 10 (Monitoring Stack) completed
Chaos Engineering Architecture
┌─────────────────────────────────────────────────┐
│ Chaos Mesh Dashboard │
│ (chaos-mesh.homelab.local) │
└──────────────────┬──────────────────────────────┘
│
┌──────────────────┼──────────────────────────────┐
│ Experiment Library │
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ PodChaos │ │ NetworkChaos │ │
│ │ kill/fail │ │ delay/part │ │
│ └──────────────┘ └──────────────┘ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ StressChaos │ │ IOChaos │ │
│ │ CPU/mem │ │ latency/fault│ │
│ └──────────────┘ └──────────────┘ │
└──────────────────┬──────────────────────────────┘
│ inject failures
┌──────────────────┼──────────────────────────────┐
│ Target Services │
│ │
│ Falco · Vault · Prometheus · Loki · Tempo │
│ Grafana · AlertManager · OTel Collector │
└─────────────────────────────────────────────────┘
kubectl create namespace chaos-mesh
kubectl label namespace chaos-mesh environment=infrastructure
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
helm install chaos-mesh chaos-mesh/chaos-mesh \
--namespace chaos-mesh \
--set chaosDaemon.runtime=containerd \
--set chaosDaemon.socketPath=/run/k3s/containerd/containerd.sock \
--set dashboard.create=true \
--set dashboard.ingress.enabled=false \
--set prometheus.create=true \
--set prometheus.serviceMonitor.enabled=true \
--set controllerManager.serviceAccount.name=chaos-controller-manager \
--set dnsServer.create=true \
--wait
kubectl get pods -n chaos-mesh
# Expected:
# chaos-controller-manager Running
# chaos-daemon Running (on each node)
# chaos-dashboard Running
# chaos-dns-server Running
# Port-forward dashboard for setup
kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333 &
# Access: http://localhost:2333
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: chaos-mesh-dashboard
namespace: chaos-mesh
annotations:
cert-manager.io/cluster-issuer: homelab-ca-issuer
nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
spec:
ingressClassName: nginx
tls:
- hosts:
- chaos-mesh.homelab.local
secretName: chaos-mesh-tls
rules:
- host: chaos-mesh.homelab.local
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: chaos-dashboard
port:
number: 2333
Chaos Mesh uses ServiceAccount and RBAC to limit what chaos experiments can do. Always set:
- Duration limits: Every experiment must have an explicit `duration` field
- Namespace selectors: Only target specific namespaces
- Emergency stop: Know how to halt all experiments immediately
# Emergency stop — delete all chaos experiments
kubectl delete networkchaos,podchaos,stresschaos,iochaos,timechaos,dnschaos --all -A
# Or use the dashboard's "Pause All" button# Label namespaces that chaos experiments can target
kubectl label namespace monitoring chaos-ready=true
kubectl label namespace logging chaos-ready=true
kubectl label namespace security chaos-ready=true
kubectl label namespace sandbox chaos-ready=true
# Do NOT label: kube-system, longhorn-system, argocd, cert-manager
Tests: Falco self-healing and monitoring continuity
# chaos-experiments/falco-pod-kill.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: falco-pod-kill
namespace: chaos-mesh
labels:
experiment: falco-resilience
spec:
action: pod-kill
mode: one
selector:
namespaces:
- security
labelSelectors:
app.kubernetes.io/name: falco
scheduler:
cron: "@every 10m"
duration: "30s"# Apply the experiment
kubectl apply -f chaos-experiments/falco-pod-kill.yaml
# Watch Falco pod get killed and restart
kubectl get pods -n security -l app.kubernetes.io/name=falco -w
# Verify Falco recovers and is functional
# Wait 2 minutes, then:
kubectl exec -n security daemonset/falco -- falco --list-source=syscall
# Check: monitoring still receives Falco events
kubectl logs -n security -l app.kubernetes.io/name=falco --tail=10
# Clean up
kubectl delete -f chaos-experiments/falco-pod-kill.yaml
Tests: Application behavior when Vault is unreachable
# chaos-experiments/vault-network-partition.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: vault-partition
namespace: chaos-mesh
labels:
experiment: vault-isolation
spec:
action: partition
direction: both
mode: all
selector:
namespaces:
- monitoring
labelSelectors:
app.kubernetes.io/name: prometheus
external:
targets:
- mode: all
selector:
namespaces:
- security
labelSelectors:
app: vault
duration: "60s"kubectl apply -f chaos-experiments/vault-network-partition.yaml
# Observe: applications using Vault secrets should handle the disconnect
# Check for error logs in services that depend on Vault
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus --tail=20 | grep -i vault
# Verify: network partition is active
kubectl describe networkchaos vault-partition -n chaos-mesh
# After 60s, verify connectivity is restored
kubectl exec -n monitoring deploy/kube-prometheus-stack-prometheus -- \
wget -qO- http://vault.security.svc.cluster.local:8200/v1/sys/health
# Clean up (auto-cleans after duration, or manual)
kubectl delete -f chaos-experiments/vault-network-partition.yaml
Tests: Alert pipeline resilience under network degradation
# chaos-experiments/monitoring-network-delay.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: monitoring-delay
namespace: chaos-mesh
labels:
experiment: alert-pipeline
spec:
action: delay
mode: all
selector:
namespaces:
- monitoring
labelSelectors:
app.kubernetes.io/name: prometheus
delay:
latency: "500ms"
correlation: "50"
jitter: "100ms"
duration: "120s"kubectl apply -f chaos-experiments/monitoring-network-delay.yaml
# Observe: Prometheus scrape latency increase
kubectl exec -n monitoring deploy/kube-prometheus-stack-prometheus -- \
wget -qO- 'http://localhost:9090/api/v1/query?query=scrape_duration_seconds' 2>/dev/null | jq .
# Check: alerts still fire (may be delayed)
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &
curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname'
# After 120s, verify latency returns to normal
sleep 120
kubectl exec -n monitoring deploy/kube-prometheus-stack-prometheus -- \
wget -qO- 'http://localhost:9090/api/v1/query?query=scrape_duration_seconds' 2>/dev/null | jq .
Tests: Pod eviction and resource handling under load
# chaos-experiments/worker-cpu-stress.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
name: worker-cpu-stress
namespace: chaos-mesh
labels:
experiment: resource-pressure
spec:
mode: one
selector:
namespaces:
- monitoring
labelSelectors:
app.kubernetes.io/name: prometheus
stressors:
cpu:
workers: 2
load: 80
duration: "60s"kubectl apply -f chaos-experiments/worker-cpu-stress.yaml
# Watch CPU usage spike
kubectl top pods -n monitoring
# Check: Prometheus still responds (may be slower)
kubectl exec -n monitoring deploy/kube-prometheus-stack-prometheus -- \
wget -qO- http://localhost:9090/-/healthy
# Clean up after 60s
Tests: Storage degradation handling
# chaos-experiments/longhorn-io-latency.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
name: longhorn-io-delay
namespace: chaos-mesh
labels:
experiment: storage-resilience
spec:
action: latency
mode: one
selector:
namespaces:
- monitoring
labelSelectors:
app.kubernetes.io/name: prometheus
delay: "200ms"
methods:
- READ
- WRITE
path: "/data"
duration: "60s"kubectl apply -f chaos-experiments/longhorn-io-latency.yaml
# Check: Prometheus TSDB write latency
kubectl exec -n monitoring deploy/kube-prometheus-stack-prometheus -- \
wget -qO- 'http://localhost:9090/api/v1/query?query=prometheus_tsdb_compaction_duration_seconds' 2>/dev/null | jq .
# Check: Longhorn volume health
kubectl get volumes -n longhorn-system
# Clean up after 60s
# chaos-experiments/game-day.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
name: weekly-game-day
namespace: chaos-mesh
spec:
schedule: "0 4 * * 1" # Monday 4 AM
historyLimit: 3
concurrencyPolicy: Forbid
type: PodChaos
podChaos:
action: pod-kill
mode: one
selector:
namespaces:
- monitoring
- logging
- security
labelSelectors:
chaos-ready: "true"
duration: "30s"Create a Grafana dashboard with panels:
- Active Experiments: Current running chaos experiments (from Chaos Mesh metrics)
- Pod Recovery Time: Time from pod kill to pod Ready (kube_pod_status_phase metric)
- Service Availability During Chaos: Target service uptime during experiments
- Alert Delivery During Chaos: AlertManager alerts fired vs delivered during experiments
Import Chaos Mesh dashboard (Dashboard ID: 16463) or create custom.
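As a rough sketch of the "Pod Recovery Time" and "Service Availability During Chaos" panels, the queries below lean on kube-state-metrics and the `up` series already scraped by kube-prometheus-stack; exact label names depend on your scrape configuration, so treat them as starting points:

```bash
PROM="http://localhost:9090"

# Pods currently not Ready in chaos-targeted namespaces (spikes during pod-kill experiments)
curl -sG "${PROM}/api/v1/query" \
  --data-urlencode 'query=sum by (namespace) (kube_pod_status_ready{condition="false", namespace=~"security|monitoring|logging"})'

# Scrape-level availability of targets in the security namespace over the last hour
curl -sG "${PROM}/api/v1/query_range" \
  --data-urlencode 'query=avg_over_time(up{namespace="security"}[5m])' \
  --data-urlencode "start=$(date -d '1 hour ago' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=60'
```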
# Verify Chaos Mesh installation
kubectl get pods -n chaos-mesh
# Run all experiments sequentially (with cleanup between)
for EXP in falco-pod-kill vault-network-partition monitoring-network-delay worker-cpu-stress; do
echo "=== Running: ${EXP} ==="
kubectl apply -f chaos-experiments/${EXP}.yaml
sleep 90
kubectl delete -f chaos-experiments/${EXP}.yaml
echo "=== Cleaned up: ${EXP} ==="
sleep 30
done
# Verify all services are healthy after experiments
kubectl get pods -A | grep -v Running | grep -v Completed
# Check monitoring is still functional
kubectl exec -n monitoring deploy/kube-prometheus-stack-prometheus -- \
wget -qO- http://localhost:9090/-/healthy
# Check containerd socket path (K3s uses non-default path)
kubectl logs -n chaos-mesh -l app.kubernetes.io/component=chaos-daemon
# Fix: set chaosDaemon.socketPath=/run/k3s/containerd/containerd.sock
# Force delete
kubectl delete podchaos,stresschaos,networkchaos,iochaos <name> -n chaos-mesh --force --grace-period=0
# Or pause via dashboard
# Check service
kubectl get svc -n chaos-mesh chaos-dashboard
# Port-forward if ingress not configured
kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333
- Chaos Mesh installed in `chaos-mesh` namespace
- Dashboard accessible via ingress or port-forward
- Safety labels applied to target namespaces (`chaos-ready=true`)
- Emergency stop procedure documented and tested
- Experiment 1: Falco pod kill — verified auto-recovery
- Experiment 2: Vault network partition — verified graceful degradation
- Experiment 3: Monitoring network delay — verified alert pipeline resilience
- Experiment 4: CPU stress — verified pod eviction handling
- Experiment 5: Longhorn IO latency — verified storage degradation handling
- Weekly game day schedule created
- Chaos experiment YAML files committed to gitops-apps
- Grafana chaos dashboard created
- All services healthy after running all experiments
Implement OPA Gatekeeper for Kubernetes admission control and Conftest for CI pipeline policy enforcement.
This guide deploys OPA Gatekeeper alongside Kyverno and creates a policy-as-code framework. Gatekeeper handles admission control with Rego-based constraint templates, while Conftest validates manifests in the CI pipeline. Kyverno remains for YAML-native policies — Gatekeeper adds Rego flexibility for complex policies.
Time Required: ~75 minutes Prerequisites: Guide 08 (Security Tooling), Guide 12 (CI/CD Pipeline Security) completed
Policy as Code Architecture
┌─────────────────────────────────────────────────┐
│ Policy Sources (Git) │
│ │
│ policies/ │
│ ├── conftest/ (CI policy checks) │
│ ├── gatekeeper/ (K8s admission) │
│ └── kyverno/ (K8s admission, YAML-native) │
└──────────────────┬──────────────────────────────┘
│
┌────────────┼────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────────┐
│ Conftest │ │Gatekeeper│ │ Kyverno │
│ (CI/CD) │ │ (K8s API)│ │ (K8s API) │
└──────────┘ └──────────┘ └──────────────┘
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────────────────────┐
│ Pipeline │ │ Kubernetes Cluster │
│ Gate │ │ (Admission Control) │
└──────────┘ └──────────────────────────┘
| Aspect | Kyverno | OPA Gatekeeper |
|---|---|---|
| Policy language | YAML (native K8s) | Rego (OPA) |
| Complexity | Simple policies | Complex logic, loops, data joins |
| Learning curve | Low | Medium-High |
| Best for | Labels, limits, image rules | Cross-resource validation, data-driven policies |
| Policy testing | Manual | OPA test framework |
| Audit | Per-policy | Centralized audit |
helm repo add gatekeeper https://open-policy-agent.github.io/gatekeeper/charts
helm repo update
helm install gatekeeper gatekeeper/gatekeeper \
--namespace gatekeeper-system \
--create-namespace \
--set auditInterval=60 \
--set auditFromCache=true \
--set logLevel=INFO \
--set emitAdmissionEvents=true \
--set emitAuditEvents=true \
--set validatingWebhookTimeoutSeconds=10 \
--set disabledBuiltins={"http.send"} \
--wait
kubectl get pods -n gatekeeper-system
# Expected: gatekeeper-audit Running, gatekeeper-controller-manager Running
kubectl get crd | grep gatekeeper
# Expected: constrainttemplates.templates.gatekeeper.sh
# configs.config.gatekeeper.sh
# constraintpodstatuses.status.gatekeeper.sh
# gitops-apps/security/gatekeeper/application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: gatekeeper
namespace: argocd
spec:
project: homelab
source:
repoURL: https://git.homelab.local/homelab/gitops-apps.git
targetRevision: main
path: security/gatekeeper
destination:
server: https://kubernetes.default.svc
namespace: gatekeeper-system
syncPolicy:
automated:
selfHeal: true
prune: false
syncOptions:
- ServerSideApply=true
- CreateNamespace=true
Create gitops-apps/security/gatekeeper/templates/required-labels.yaml:
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
name: k8srequiredlabels
annotations:
description: "Require specific labels on Kubernetes resources"
spec:
crd:
spec:
names:
kind: K8sRequiredLabels
validation:
openAPIV3Schema:
type: object
properties:
labels:
type: array
items:
type: string
message:
type: string
targets:
- target: admission.k8s.gatekeeper.sh
rego: |
package k8srequiredlabels
violation[{"msg": msg}] {
provided := {label | input.review.object.metadata.labels[label]}
required := {label | label := input.parameters.labels[_]}
missing := required - provided
count(missing) > 0
msg := input.parameters.message
}
Create gitops-apps/security/gatekeeper/templates/banned-registries.yaml:
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
name: k8sbannedregistries
annotations:
description: "Block images from unauthorized registries"
spec:
crd:
spec:
names:
kind: K8sBannedRegistries
validation:
openAPIV3Schema:
type: object
properties:
registries:
type: array
items:
type: string
targets:
- target: admission.k8s.gatekeeper.sh
rego: |
package k8sbannedregistries
violation[{"msg": msg}] {
container := input.review.object.spec.template.spec.containers[_]
banned := input.parameters.registries[_]
startswith(container.image, banned)
msg := sprintf("Container image <%v> uses banned registry <%v>", [container.image, banned])
}
violation[{"msg": msg}] {
container := input.review.object.spec.template.spec.initContainers[_]
banned := input.parameters.registries[_]
startswith(container.image, banned)
msg := sprintf("Init container image <%v> uses banned registry <%v>", [container.image, banned])
}
Create gitops-apps/security/gatekeeper/templates/required-probes.yaml:
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
name: k8srequiredprobes
annotations:
description: "Require liveness and readiness probes on containers"
spec:
crd:
spec:
names:
kind: K8sRequiredProbes
validation:
openAPIV3Schema:
type: object
properties:
probeTypes:
type: array
items:
type: string
enum: ["livenessProbe", "readinessProbe", "startupProbe"]
targets:
- target: admission.k8s.gatekeeper.sh
rego: |
package k8srequiredprobes
violation[{"msg": msg}] {
probeType := input.parameters.probeTypes[_]
container := input.review.object.spec.template.spec.containers[_]
not container[probeType]
msg := sprintf("Container <%v> missing <%v>", [container.name, probeType])
}
Homelab-specific: require Longhorn storage classes on PVCs.
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
name: homelabstorageclass
annotations:
description: "Require Longhorn storage classes on PVCs"
spec:
crd:
spec:
names:
kind: HomelabStorageClass
targets:
- target: admission.k8s.gatekeeper.sh
rego: |
package homelabstorageclass
allowed := {"longhorn-critical", "longhorn-default", "longhorn-ephemeral"}
violation[{"msg": msg}] {
input.review.kind.kind == "PersistentVolumeClaim"
sc := input.review.object.spec.storageClassName
not allowed[sc]
msg := sprintf("PVC <%v> uses unsupported storageClass <%v>. Use: longhorn-critical, longhorn-default, or longhorn-ephemeral", [input.review.object.metadata.name, sc])
}
violation[{"msg": msg}] {
input.review.kind.kind == "PersistentVolumeClaim"
not input.review.object.spec.storageClassName
msg := sprintf("PVC <%v> must specify a storageClassName", [input.review.object.metadata.name])
}
kubectl apply -f gitops-apps/security/gatekeeper/templates/
# Verify templates
kubectl get constrainttemplates
Create gitops-apps/security/gatekeeper/constraints/:
# gitops-apps/security/gatekeeper/constraints/required-labels.yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
name: require-app-labels
spec:
enforcementAction: dryrun # Start in dryrun, change to deny after testing
match:
kinds:
- kinds: ["Deployment", "StatefulSet", "DaemonSet"]
excludedNamespaces:
- kube-system
- gatekeeper-system
- longhorn-system
parameters:
labels:
- "app.kubernetes.io/name"
message: "All workloads must have app.kubernetes.io/name label"# gitops-apps/security/gatekeeper/constraints/banned-registries.yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sBannedRegistries
metadata:
name: ban-dockerhub-latest
spec:
enforcementAction: dryrun
match:
kinds:
- kinds: ["Deployment", "StatefulSet", "DaemonSet", "Pod"]
excludedNamespaces:
- kube-system
- gatekeeper-system
parameters:
registries:
- "docker.io/library/" # Ban official Docker Hub images (require registry mirror)# gitops-apps/security/gatekeeper/constraints/required-probes.yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredProbes
metadata:
name: require-health-probes
spec:
enforcementAction: dryrun
match:
kinds:
- kinds: ["Deployment", "StatefulSet"]
excludedNamespaces:
- kube-system
- gatekeeper-system
- longhorn-system
parameters:
probeTypes:
- "livenessProbe"
- "readinessProbe"# gitops-apps/security/gatekeeper/constraints/storage-class.yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: HomelabStorageClass
metadata:
name: require-longhorn-storage
spec:
enforcementAction: dryrun
match:
kinds:
- kinds: ["PersistentVolumeClaim"]
excludedNamespaces:
- kube-system
- gatekeeper-system
- longhorn-system
- velero
kubectl apply -f gitops-apps/security/gatekeeper/constraints/
# Verify constraints
kubectl get constraints
kubectl describe K8sRequiredLabels require-app-labels
policies/
├── conftest/
│ ├── require_labels.rego
│ ├── require_limits.rego
│ ├── disallow_latest.rego
│ ├── require_storage_class.rego
│ └── require_authelia_annotation.rego
├── conftest-tests/ (unit tests)
│ ├── require_labels_test.rego
│ └── disallow_latest_test.rego
├── gatekeeper/
│ ├── templates/
│ └── constraints/
└── kyverno/ (existing)
# policies/conftest/require_authelia_annotation.rego
package main
# Require forward-auth annotation on all ingresses in production
warn[msg] {
input.kind == "Ingress"
ns := input.metadata.namespace
prod_ns := {"services", "monitoring", "security"}
ns in prod_ns
not input.metadata.annotations["nginx.ingress.kubernetes.io/auth-url"]
msg := sprintf("Ingress '%s' in namespace '%s' must have Authelia forward-auth annotation", [input.metadata.name, ns])
}
# policies/conftest-tests/require_labels_test.rego
package main
test_pass_with_label {
allow with input as {
"kind": "Deployment",
"metadata": {
"name": "test-app",
"namespace": "services",
"labels": {"app.kubernetes.io/name": "test-app"}
},
"spec": {
"template": {
"spec": {
"containers": [{
"name": "test",
"image": "test:1.0",
"resources": {
"limits": {"cpu": "100m", "memory": "128Mi"}
}
}]
}
}
}
}
}
test_fail_without_label {
deny[msg] with input as {
"kind": "Deployment",
"metadata": {"name": "test-app", "namespace": "services"}
}
}
# Run policy unit tests
conftest verify --policy policies/conftest/ --policy policies/conftest-tests/
Add to the security pipeline (from Guide 12):
# Add to security-pipeline.yaml
conftest:
name: "📋 Policy Check"
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Install Conftest
run: |
curl -L https://github.com/open-policy-agent/conftest/releases/latest/download/conftest_linux_amd64.tar.gz | tar xz
sudo mv conftest /usr/local/bin/
- name: Verify Policy Tests Pass
run: |
conftest verify --policy policies/conftest/
- name: Validate All Manifests
run: |
conftest test --policy policies/conftest/ --output table gitops-apps/
- name: Validate Helm Charts
run: |
# If using Helm charts, render and validate
for chart in charts/*/; do
helm template "${chart}" | conftest test --policy policies/conftest/ -
done
Gatekeeper runs periodic audits. View results:
# Check audit results
kubectl get constraints -o yaml | grep -A5 "totalViolations"
# View violations per constraint
kubectl describe K8sRequiredLabels require-app-labels
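For a quick per-constraint summary without scrolling through raw YAML, the audit results can be condensed with jq (the `status.totalViolations` field is populated by the audit controller after each run):

```bash
# One line per constraint: kind/name and current violation count
kubectl get constraints -o json | jq -r '
  .items[] | "\(.kind)/\(.metadata.name): \(.status.totalViolations // 0) violations"'
```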
Gatekeeper exposes metrics at :8888/metrics. Create a ServiceMonitor:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: gatekeeper-metrics
namespace: gatekeeper-system
spec:
selector:
matchLabels:
app: gatekeeper
endpoints:
- port: metrics
interval: 30s
Create a Gatekeeper compliance dashboard showing:
- Total violations per constraint
- Constraint enforcement actions (dryrun vs deny)
- Audit run duration
- Admission webhook latency
Import OPA Gatekeeper dashboard (Dashboard ID: 16922).
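A sketch of queries that could back these panels. Gatekeeper metric names vary between releases, so first list what your installation actually exposes; `gatekeeper_violations` and `gatekeeper_audit_duration_seconds` are assumed names here, and the Prometheus endpoint is assumed port-forwarded on 9090:

```bash
# Inspect the metric names Gatekeeper exposes (they can differ by version)
kubectl -n gatekeeper-system port-forward deploy/gatekeeper-audit 8888:8888 &
curl -s http://localhost:8888/metrics | grep -E '^gatekeeper_' | cut -d'{' -f1 | sort -u

# Illustrative panel queries
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum by (enforcement_action) (gatekeeper_violations)'
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.95, rate(gatekeeper_audit_duration_seconds_bucket[1h]))'
```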
# Verify Gatekeeper installation
kubectl get pods -n gatekeeper-system
kubectl get constrainttemplates
# Verify constraints
kubectl get constraints
kubectl describe K8sRequiredLabels require-app-labels
# Test constraint enforcement (dryrun mode — should log but not block)
kubectl run test-no-label --image=nginx -n services
# Check: violation logged in constraint status
kubectl describe K8sRequiredLabels require-app-labels
# Test Conftest locally
conftest test --policy policies/conftest/ gitops-apps/
conftest verify --policy policies/conftest/
# Switch to deny mode after testing
kubectl patch K8sRequiredLabels require-app-labels \
--type merge -p '{"spec":{"enforcementAction":"deny"}}'
# Test: should now be blocked
kubectl run test-no-label-2 --image=nginx -n services
# Expected: admission webhook denied the request
# Switch back to dryrun for safety
kubectl patch K8sRequiredLabels require-app-labels \
--type merge -p '{"spec":{"enforcementAction":"dryrun"}}'# Temporarily disable webhook
kubectl delete validatingwebhookconfiguration gatekeeper-validating-webhook-configuration
# Re-enable:
helm upgrade gatekeeper gatekeeper/gatekeeper --namespace gatekeeper-system --reuse-values
# Check audit pod logs
kubectl logs -n gatekeeper-system -l control-plane=audit-controller --tail=50
# Verify auditInterval is set (default 60s)
kubectl get config -n gatekeeper-system config -o yaml
# Debug with trace
conftest test --policy policies/conftest/ --trace gitops-apps/
# Verify Rego syntax
conftest parse gitops-apps/argocd-apps/root-application.yaml
- OPA Gatekeeper installed in `gatekeeper-system` namespace
- ConstraintTemplate `K8sRequiredLabels` created
- ConstraintTemplate `K8sBannedRegistries` created
- ConstraintTemplate `K8sRequiredProbes` created
- ConstraintTemplate `HomelabStorageClass` created (homelab-specific)
- Constraints deployed in dryrun mode
- Constraint violations reviewed and acceptable ones documented
- Conftest policies in `policies/conftest/` directory
- Authelia annotation policy (forward-auth on production ingresses)
- Policy unit tests pass (`conftest verify`)
- CI pipeline includes Conftest gate
- Gatekeeper ServiceMonitor configured
- Grafana Gatekeeper dashboard imported
- All policy files committed to Gitea
- Documented when to use Kyverno vs Gatekeeper
Route Falco runtime alerts through AlertManager to Grafana, create incident runbooks, and configure Wazuh active response for automated remediation.
This guide builds the complete alert pipeline: Falco detects runtime threats → falcosidekick routes to AlertManager → AlertManager notifies via Grafana and webhook. Includes six incident runbooks and Wazuh active response automation.
Time Required: ~90 minutes Prerequisites: Guide 08 (Security Tooling), Guide 10 (Monitoring Stack) completed
Incident Response Pipeline
┌─────────────────────────────────────────────────┐
│ Detection Sources │
│ │
│ Falco (runtime) Trivy (vulns) Kyverno │
│ kube-bench (CIS) cert-manager Longhorn │
└────────┬──────────────┬──────────────┬──────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────┐
│ falcosidekick (router) │
│ Routes events to multiple outputs │
└────────┬──────────────┬──────────────┬──────────┘
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ AlertManager │ │ Loki │ │ Wazuh Active │
│ (alerts) │ │ (logs) │ │ Response │
└──────┬───────┘ └──────────────┘ └──────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ Notification Channels │
│ Grafana · Webhook · Slack (optional) │
└─────────────────────────────────────────────────┘
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update
cat > falcosidekick-values.yaml <<'EOF'
config:
alertmanager:
hostport: "http://kube-prometheus-stack-alertmanager.monitoring.svc.cluster.local:9093"
minimumpriority: "warning"
loki:
hostport: "http://loki.logging.svc.cluster.local:3100"
minimumpriority: "informational"
# Webhook for custom integrations
webhook:
address: "http://wazuh-manager.sandbox.svc.cluster.local:55000/webhook"
minimumpriority: "critical"
# Slack (optional — requires internet)
# slack:
# webhookurl: "https://hooks.slack.com/services/XXX"
# minimumpriority: "warning"
# Custom fields for all outputs
customfields:
environment: "homelab"
cluster: "k3s-homelab"
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 200m
memory: 128Mi
EOF
helm install falcosidekick falcosecurity/falcosidekick \
--namespace security \
-f falcosidekick-values.yaml \
--wait
# Update Falco config to output JSON to falcosidekick
helm upgrade falco falcosecurity/falco \
--namespace security \
--reuse-values \
--set falco.jsonOutput=true \
--set falco.programOutput.enabled=true \
--set falco.programOutput.program="curl -s http://falcosidekick.security.svc.cluster.local:2801 -X POST -H 'Content-Type: application/json' -d @-"
# Trigger a Falco event
kubectl run shell-test --image=alpine -n monitoring -- sh -c "sleep 3600"
kubectl exec -n monitoring shell-test -- sh -c "cat /etc/shadow"
# Check Falco logs
kubectl logs -n security daemonset/falco --tail=10
# Check falcosidekick logs
kubectl logs -n security deploy/falcosidekick --tail=10
# Check AlertManager received the alert
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &
curl -s http://localhost:9093/api/v2/alerts | jq .
# Check Loki received the event
kubectl exec -n monitoring deploy/kube-prometheus-stack-grafana -- \
wget -qO- 'http://loki.logging.svc.cluster.local:3100/loki/api/v1/query?query={job="falco"}' | jq .
# Clean up test pod
kubectl delete pod shell-test -n monitoring --force
Create gitops-apps/monitoring/alertmanager-config.yaml:
apiVersion: v1
kind: Secret
metadata:
name: alertmanager-config
namespace: monitoring
type: Opaque
stringData:
alertmanager.yaml: |
global:
resolve_timeout: 5m
route:
group_by: ['alertname', 'namespace', 'severity']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'grafana-notifications'
routes:
# Critical alerts — immediate notification
- match:
severity: critical
receiver: 'grafana-critical'
repeat_interval: 15m
group_wait: 10s
# Falco security alerts
- match_re:
alertname: Falco.*
receiver: 'grafana-security'
repeat_interval: 30m
# Backup failures
- match_re:
alertname: VeleroBackup.*|ProxmoxBackup.*
receiver: 'grafana-critical'
receivers:
- name: 'grafana-notifications'
# Uses Grafana unified alerting — no separate webhook needed
# Alerts appear in Grafana Alerting UI
- name: 'grafana-critical'
# Same as above — severity label routes them
- name: 'grafana-security'
# Security-specific alerts from Falco
Create gitops-apps/monitoring/security-alerts.yaml:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: security-alerts
namespace: monitoring
spec:
groups:
- name: falco.rules
rules:
- alert: FalcoRuntimeAlert
expr: increase(falco_events{priority="Critical"}[5m]) > 0
for: 1m
labels:
severity: critical
annotations:
summary: "Falco critical runtime event detected"
description: "Falco detected {{ $value }} critical event(s) in the last 5 minutes. Rule: {{ $labels.rule }}"
- alert: FalcoHighEventRate
expr: rate(falco_events[5m]) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "Unusually high Falco event rate"
description: "Falco is generating {{ $value }} events/second. This may indicate an active attack."
- name: trivy.rules
rules:
- alert: TrivyCriticalVulnerability
expr: trivy_image_vulnerabilities{severity="Critical"} > 0
for: 1h
labels:
severity: warning
annotations:
summary: "Critical vulnerability in image {{ $labels.image }}"
description: "Image {{ $labels.image }} in namespace {{ $labels.namespace }} has {{ $value }} critical vulnerabilities."
- name: cluster.rules
rules:
- alert: NodeDown
expr: up{job="node-exporter"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Node {{ $labels.instance }} is down"
description: "Node {{ $labels.instance }} has been unreachable for 5 minutes."
- alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
for: 15m
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has restarted {{ $value }} times in 15 minutes."
- alert: CertificateExpiringSoon
expr: certmanager_certificate_expiration_timestamp_seconds - time() < 86400 * 14
for: 1h
labels:
severity: warning
annotations:
summary: "Certificate {{ $labels.name }} expires in less than 14 days"
- alert: LonghornVolumeDegraded
expr: longhorn_volume_status{state="degraded"} == 1
for: 10m
labels:
severity: warning
annotations:
summary: "Longhorn volume {{ $labels.volume }} is degraded"| Level | Name | Response Time | Examples |
|---|---|---|---|
| SEV1 | Critical | Immediate (< 15 min) | Node down, active intrusion, data loss |
| SEV2 | High | < 1 hour | Critical vulnerability, cert expiring, backup failure |
| SEV3 | Medium | < 4 hours | Warning alert, degraded storage, policy violation |
| SEV4 | Low | Next business day | Info alert, audit finding, non-critical CIS failure |
## IR-001: Unauthorized Shell Spawned in Container
**Severity:** SEV1 — Critical
**Source:** Falco rule "Terminal shell in container"
### Detection
Falco alert: `Terminal shell in container`
AlertManager: Critical severity
Grafana: Security dashboard → Falco events
### Investigation Steps
1. Identify the affected pod and namespace:
```bash
# From Falco event
kubectl logs -n security daemonset/falco | grep "shell in container"
- Check who spawned the shell:
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
- Check ArgoCD sync history (was this deployed recently?):
argocd app history <app-name>
- Review Kubernetes audit logs:
kubectl logs -n kube-system -l component=kube-apiserver --tail=100
- If unauthorized — isolate the pod:
kubectl label pod <pod-name> compromised=true -n <namespace>
# Apply emergency NetworkPolicy to block egress
kubectl apply -f - <<NETPOL
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine
spec:
  podSelector:
    matchLabels:
      compromised: "true"
  policyTypes: ["Ingress", "Egress"]
NETPOL
- Capture forensics:
kubectl exec <pod-name> -n <namespace> -- ps aux > /tmp/forensics-ps.log
kubectl exec <pod-name> -n <namespace> -- netstat -tlnp > /tmp/forensics-net.log
- Delete the compromised pod
- Review the container image for tampering
- Check Kyverno audit logs for policy violations
- Rotate any credentials the pod had access to (Vault)
- Document in Gitea: `docs/incidents/YYYY-MM-DD-IR-001.md`
- Update Falco rules if needed
- Update Kyverno policies to prevent recurrence
### 4.2 Runbook: Critical Vulnerability Detected
```markdown
## IR-002: Critical Vulnerability (Trivy)
**Severity:** SEV2 — High
**Source:** Trivy Operator vulnerability scan
### Detection
Trivy alert: `TrivyCriticalVulnerability`
PrometheusRule: Critical vulnerability in image
### Investigation Steps
1. Check Trivy vulnerability reports:
```bash
kubectl get vulnerabilityreports -A
kubectl describe vulnerabilityreport <report-name> -n <namespace>
- Identify the vulnerable image and CVE:
kubectl get vulnerabilityreport -A -o json | \
  jq '.items[] | select(.report.summary.criticalCount > 0)'
- Check if a fix is available:
trivy image --severity CRITICAL <image>:<tag>
- Update the image to a patched version
- If no fix available — add to Grype allowlist with justification
- If high risk — consider removing the workload:
kubectl scale deployment <name> -n <namespace> --replicas=0
- Document: `docs/incidents/YYYY-MM-DD-IR-002.md`
- Update CI pipeline to block this image version
### 4.3 Runbook: Node Failure
```markdown
## IR-003: Node Failure
**Severity:** SEV1 — Critical
**Source:** Prometheus alert `NodeDown`
### Detection
AlertManager: Node {{ instance }} is down
Longhorn: Volume replicas degraded
### Investigation Steps
1. Check Proxmox UI — is the VM running?
2. SSH to the Proxmox host:
```bash
ssh root@192.168.1.11 # vader or sidious
qm status <VM_ID>
- Check VM console for errors
- Check physical hardware: power, network, disk
- If VM stopped — restart:
qm start <VM_ID>
- If hardware failure — migrate VMs to healthy node
- Wait for K3s node to rejoin:
kubectl get nodes
- Longhorn will auto-rebuild degraded replicas
- Document hardware failure details
- Check if preventive maintenance is needed
### 4.4 Runbook: Certificate Expiring
```markdown
## IR-004: Certificate Expiring
**Severity:** SEV2 — High
**Source:** cert-manager PrometheusRule
### Remediation
cert-manager should auto-renew. If it hasn't:
```bash
kubectl describe certificate <name> -n <namespace>
# Check for errors in Events section
kubectl logs -n cert-manager -l app.kubernetes.io/name=cert-manager --tail=50
# Force renewal by deleting the issued TLS secret (cert-manager reissues it),
# or use the cmctl plugin: cmctl renew <name> -n <namespace>
kubectl delete secret <tls-secret-name> -n <namespace>
### 4.5 Runbook: Storage Degradation
```markdown
## IR-005: Longhorn Volume Degraded
**Severity:** SEV3 — Medium
**Source:** Prometheus alert `LonghornVolumeDegraded`
### Investigation
```bash
kubectl get volumes -n longhorn-system
kubectl describe volume <name> -n longhorn-system
Longhorn auto-rebuilds from healthy replicas. If not:
# Trigger rebuild via Longhorn UI or API
# Check node storage health:
ssh root@<node> "lsblk && df -h"
### 4.6 Runbook: Unauthorized Access Attempt
```markdown
## IR-006: Unauthorized Access (Authelia)
**Severity:** SEV2 — High
**Source:** Authelia logs
### Investigation
```bash
kubectl logs -n security deploy/authelia | grep "authentication failed"
# Check source IP — is it from Tailscale VPN or internal?
# If from Tailscale — check who was connected
# If from unexpected source — check pfSense firewall rules
- If brute force — block IP via pfSense
- If compromised credentials — reset in LLDAP
- Review Authelia access control rules
---
## Phase 5: Falco Grafana Dashboard
### 5.1 Falco Metrics Dashboard
Create a Grafana dashboard with panels:
| Panel | Metric | Description |
|-------|--------|-------------|
| Events/sec | `rate(falco_events[5m])` | Falco event throughput |
| Events by priority | `falco_events by (priority)` | Breakdown by severity |
| Events by rule | `topk(10, falco_events by (rule))` | Top 10 triggered rules |
| Events by namespace | `falco_events by (k8s_ns_name)` | Events per namespace |
| Total alerts sent | `increase(falcosidekick_output{status="ok"}[1h])` | Alerts successfully routed |
| Alert delivery failures | `falcosidekick_output{status="error"}` | Failed alert deliveries |
Import Falco dashboard (Dashboard ID: `11922`).
---
## Phase 6: Wazuh Active Response
### 6.1 Configure Active Response
On the Wazuh manager (sandbox namespace), add active response rules:
```xml
<!-- /var/ossec/etc/ossec.conf on Wazuh manager -->
<active-response>
<command>firewall-drop</command>
<location>local</location>
<rules_id>100100,100101</rules_id>
<timeout>3600</timeout>
</active-response>
<active-response>
<command>disable-account</command>
<location>local</location>
<rules_id>100200</rules_id>
<timeout>1800</timeout>
</active-response>
Create /var/ossec/active-response/bin/k8s-isolate-pod.sh:
#!/bin/bash
# Isolate a Kubernetes pod when Wazuh triggers an alert
# Requires kubectl access from Wazuh manager
ACTION=$1
USER=$2
IP=$3
ALERTID=$4
RULEID=$5
KUBECONFIG="/var/ossec/.kube/config"
if [ "$ACTION" = "add" ]; then
# Block the source IP at pfSense level
logger "WAZUH AR: Blocking IP $IP (rule $RULEID)"
# Or apply K8s NetworkPolicy to isolate
fi
if [ "$ACTION" = "delete" ]; then
logger "WAZUH AR: Unblocking IP $IP (timeout expired)"
fi
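A sketch of the "apply K8s NetworkPolicy to isolate" branch mentioned above, assuming the Wazuh manager's kubeconfig can list and label pods and that the label-based quarantine NetworkPolicy from runbook IR-001 exists in the target namespace; all names here are illustrative:

```bash
#!/bin/bash
# Hypothetical helper: map the offending source IP back to a pod and quarantine it
IP=$1
export KUBECONFIG="/var/ossec/.kube/config"

# Find the pod that currently owns this IP (namespace/name)
POD=$(kubectl get pods -A --field-selector "status.podIP=${IP}" \
  -o jsonpath='{.items[0].metadata.namespace}/{.items[0].metadata.name}' 2>/dev/null)

if [ -n "$POD" ]; then
  NS=${POD%%/*}; NAME=${POD##*/}
  # The quarantine NetworkPolicy selects pods labeled compromised=true
  kubectl label pod "$NAME" -n "$NS" compromised=true --overwrite
  logger "WAZUH AR: quarantined pod ${POD} (source IP ${IP})"
else
  logger "WAZUH AR: no pod found for IP ${IP}, nothing to quarantine"
fi
```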
# Silence all alerts during maintenance window
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &
# Create silence via API
curl -X POST http://localhost:9093/api/v2/silences \
-H "Content-Type: application/json" \
-d '{
"matchers": [
{"name": "namespace", "value": "monitoring", "isRegex": false}
],
"startsAt": "2026-04-20T10:00:00Z",
"endsAt": "2026-04-20T12:00:00Z",
"createdBy": "admin@homelab.local",
"comment": "Scheduled maintenance window"
}'
# List active silences
curl -s http://localhost:9093/api/v2/silences | jq .
# Verify falcosidekick is running
kubectl get pods -n security -l app.kubernetes.io/name=falcosidekick
# Verify AlertManager is receiving alerts
kubectl port-forward -n monitoring svc/kube-prometheus-stack-alertmanager 9093:9093 &
curl -s http://localhost:9093/api/v2/alerts | jq '.[].status.state'
# Verify PrometheusRules are loaded
kubectl get prometheusrules -n monitoring
# Test Falco alert pipeline
kubectl run test-shell --image=alpine -n monitoring -- sh -c "cat /etc/shadow; sleep 300"
sleep 10
kubectl logs -n security daemonset/falco --tail=5
kubectl logs -n security deploy/falcosidekick --tail=5
# Check Grafana for Falco events
# Navigate to: Explore → {job="falco"} → LogQL
# Verify runbooks exist
ls docs/incidents/ 2>/dev/null || echo "Create docs/incidents/ directory"
mkdir -p docs/incidents
# Clean up
kubectl delete pod test-shell -n monitoring --force
kubectl logs -n security daemonset/falco --tail=20
# Check: Falco is outputting JSON (--set falco.jsonOutput=true)
# Check: Program output is configured to curl falcosidekick
# Check AlertManager config
kubectl get secret alertmanager-config -n monitoring -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d
# Check AlertManager logs
kubectl logs -n monitoring alertmanager-kube-prometheus-stack-alertmanager-0 --tail=20
# Check ServiceMonitor exists
kubectl get servicemonitor -n security
# Falco metrics endpoint is typically on port 9376
kubectl exec -n security daemonset/falco -- wget -qO- http://localhost:9376/metrics
- falcosidekick deployed in `security` namespace
- Falco outputting JSON to falcosidekick via program output
- falcosidekick routing to AlertManager (warning+ priority)
- falcosidekick routing to Loki (informational+ priority)
- AlertManager configured with severity-based routes
- PrometheusRules for Falco, Trivy, NodeDown, CrashLoop, Certs, Longhorn
- Incident severity classification documented (SEV1-SEV4)
- Runbook IR-001: Unauthorized shell in container
- Runbook IR-002: Critical vulnerability (Trivy)
- Runbook IR-003: Node failure / pod eviction
- Runbook IR-004: Certificate expiring
- Runbook IR-005: Storage degradation
- Runbook IR-006: Unauthorized access (Authelia)
- Falco Grafana dashboard created (events/sec, by priority, by rule)
- Alert silencing procedure documented
- Wazuh active response configured (firewall-drop)
- Custom K8s isolation script created
- End-to-end alert pipeline tested (Falco → sidekick → AlertManager)
Replace Flannel with Cilium CNI for eBPF-based networking, Hubble observability, L7 network policies, and transparent encryption.
This guide migrates the K3s cluster from Flannel (default CNI) to Cilium. Cilium brings eBPF-based datapath, identity-aware security, L7 network policies (HTTP, gRPC), Hubble flow visualization, and WireGuard transparent encryption between nodes.
Caution
This is a breaking change. The migration will briefly disrupt pod networking. Schedule a maintenance window and have a rollback plan ready. Read Phase 7 before starting.
Time Required: ~120 minutes Prerequisites: Guide 05 (K3s Cluster), Guide 10 (Monitoring Stack) completed
Cilium Network Security Architecture
┌─────────────────────────────────────────────────┐
│ Cilium CNI │
│ (eBPF datapath — kernel-level) │
│ │
│ ┌──────────────┐ ┌──────────────────────────┐ │
│ │ Hubble │ │ Transparent Encryption │ │
│ │ Observability│ │ (WireGuard) │ │
│ └──────────────┘ └──────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────┐│
│ │ Cilium Network Policies ││
│ │ L3/L4 (like K8s NetworkPolicy) ││
│ │ L7 (HTTP, gRPC, DNS, Kafka) ││
│ └──────────────────────────────────────────────┘│
└─────────────────────────────────────────────────┘
│
┌──────────────────┼──────────────────────────────┐
│ Node-to-Node Traffic │
│ vader (10.10.10.10) ◄──WireGuard──► sidious │
│ (10.10.10.12) │
└─────────────────────────────────────────────────┘
# Current CNI
kubectl get nodes -o wide
# K3s default: Flannel, iface=cni0
# Current NetworkPolicies
kubectl get networkpolicies -A
# Current pods (baseline for verification)
kubectl get pods -A -o wide > /tmp/pre-migration-pods.txt
# Check Longhorn volumes are healthy
kubectl get volumes -n longhorn-system
# Velero backup before migration
velero backup create pre-cilium-migration \
--include-namespaces '*' \
--exclude-namespaces velero,kube-system \
--snapshot-volumes=true \
--wait
# Export all NetworkPolicies
kubectl get networkpolicies -A -o yaml > /tmp/networkpolicies-backup.yaml
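Before touching the CNI, confirm the safety net actually exists; checking the Velero backup status and the exported NetworkPolicy file takes a few seconds:

```bash
# Confirm the backup completed and the NetworkPolicy export is non-empty
velero backup get pre-cilium-migration
velero backup describe pre-cilium-migration --details | head -n 20
wc -l /tmp/networkpolicies-backup.yaml
```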
On the K3s master (10.10.10.10):
# Stop K3s to reconfigure
ssh rancher@10.10.10.10 "sudo systemctl stop k3s"
# Edit K3s config to disable Flannel
ssh rancher@10.10.10.10 "sudo tee /etc/rancher/k3s/config.yaml" <<'EOF'
cluster-init: true
token: <existing-token>
disable:
- traefik
- flannel
# Disable kube-proxy — Cilium replaces it
disable-kube-proxy: true
node-ip: 10.10.10.10
advertise-address: 10.10.10.10
flannel-backend: none
write-kubeconfig-mode: "0644"
EOF
On the K3s worker (10.10.10.12):
ssh rancher@10.10.10.12 "sudo systemctl stop k3s-agent"
ssh rancher@10.10.10.12 "sudo tee /etc/rancher/k3s/config.yaml" <<'EOF'
server: https://10.10.10.10:6443
token: <existing-token>
disable:
- traefik
- flannel
flannel-backend: none
node-ip: 10.10.10.12
write-kubeconfig-mode: "0644"
EOF
Important
The flannel-backend: none setting is critical. Without it, K3s will recreate Flannel on startup.
On both K3s nodes:
# Remove old CNI configuration and interfaces
ssh rancher@10.10.10.10 "sudo rm -rf /var/lib/cni/ /etc/cni/net.d/*"
ssh rancher@10.10.10.12 "sudo rm -rf /var/lib/cni/ /etc/cni/net.d/*"
# Remove Flannel interfaces
ssh rancher@10.10.10.10 "sudo ip link delete flannel.1 2>/dev/null; sudo ip link delete cni0 2>/dev/null"
ssh rancher@10.10.10.12 "sudo ip link delete flannel.1 2>/dev/null; sudo ip link delete cni0 2>/dev/null"# Start master first
ssh rancher@10.10.10.10 "sudo systemctl start k3s"
# Wait for API server
kubectl wait --for=condition=Ready node/k3s-master-01 --timeout=120s
# Start worker
ssh rancher@10.10.10.12 "sudo systemctl start k3s-agent"
# Wait for worker
kubectl wait --for=condition=Ready node/k3s-worker-01 --timeout=120s
Note
At this point, pods won't have networking. This is expected — Cilium will provide it.
helm repo add cilium https://helm.releases.cilium.io/
helm repo update
# Get K3s API server IP
API_SERVER_IP=10.10.10.10
helm install cilium cilium/cilium \
--namespace kube-system \
--set kubeProxyReplacement=true \
--set operator.replicas=1 \
--set hubble.enabled=true \
--set hubble.relay.enabled=true \
--set hubble.ui.enabled=true \
--set hubble.metrics.enabled="{dns,drop,tcp,flow,icmp,http}" \
--set encryption.enabled=true \
--set encryption.type=wireguard \
--set k8sServiceHost=${API_SERVER_IP} \
--set k8sServicePort=6443 \
--set tunnelProtocol=vxlan \
--set ipam.mode=kubernetes \
--set operator.rollOutPods=true \
--set rollOutCiliumPods=true \
--wait
# Check Cilium pods
kubectl get pods -n kube-system -l k8s-app=cilium
# Expected: cilium Running on each node, cilium-operator Running
# Check Cilium status
kubectl exec -n kube-system ds/cilium -- cilium status
# Expected: Host: OK, NodeMonitor: OK, Hubble: OK
# Run connectivity test
kubectl exec -n kube-system ds/cilium -- cilium connectivity test
# Verify WireGuard encryption
kubectl exec -n kube-system ds/cilium -- cilium encrypt status
# Expected: Encryption: Wireguard
# Restart all pods (they need to join the new CNI)
kubectl rollout restart deployment --all -A
kubectl rollout restart statefulset --all -A
kubectl rollout restart daemonset --all -A
# Wait for all pods to be running
kubectl get pods -A -o wide
# Verify all pods have Cilium-managed IPs
kubectl exec -n kube-system ds/cilium -- cilium endpoint list
# Test DNS
kubectl run dns-test --image=busybox --rm -it --restart=Never -- nslookup kubernetes.default
# Test inter-pod connectivity
kubectl run netshoot --image=nicolaka/netshoot --rm -it --restart=Never -- curl -s http://gitea.services.svc.cluster.local:3000
# Test external connectivity
kubectl run netshoot2 --image=nicolaka/netshoot --rm -it --restart=Never -- curl -sI https://1.1.1.1
# Port-forward Hubble UI
kubectl port-forward -n kube-system svc/hubble-ui 8080:80 &
# Access: http://localhost:8080
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: hubble-ui
namespace: kube-system
annotations:
cert-manager.io/cluster-issuer: homelab-ca-issuer
nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
spec:
ingressClassName: nginx
tls:
- hosts:
- hubble.homelab.local
secretName: hubble-tls
rules:
- host: hubble.homelab.local
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: hubble-ui
port:
number: 80
# Install Hubble CLI locally
HUBBLE_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/hubble/main/stable.txt)
curl -L https://github.com/cilium/hubble/releases/download/${HUBBLE_VERSION}/hubble-linux-amd64.tar.gz | tar xz
sudo mv hubble /usr/local/bin/
# Port-forward Hubble API
kubectl port-forward -n kube-system svc/hubble-relay 4245:4245 &
# Observe network flows in real-time
hubble observe --since 1m
# Observe DNS queries
hubble observe --type l7 --dns --since 5m
# Observe HTTP requests
hubble observe --type l7 --http --since 5m
# Observe dropped packets
hubble observe --type trace --trace-type drop --since 5m
# Filter by namespace
hubble observe --namespace monitoring --since 5m
# Filter by label
hubble observe --label app.kubernetes.io/name=prometheus --since 5m
Cilium exposes Prometheus metrics. The Hubble metrics are already configured in the Helm install (hubble.metrics.enabled). Create a ServiceMonitor:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: cilium
namespace: monitoring
spec:
selector:
matchLabels:
k8s-app: cilium
namespaceSelector:
matchNames:
- kube-system
endpoints:
- port: hubble-metrics
interval: 15s
Import Cilium dashboard (Dashboard ID: 16611) and Hubble dashboard (16612).
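If you prefer to script the import instead of clicking through the UI, both dashboards can be pulled from grafana.com and pushed through the Grafana HTTP API. A minimal sketch, assuming the NodePort URL and admin credentials used elsewhere in these guides, and that you remap datasource inputs afterwards if Grafana prompts for them:
# Download the Cilium (16611) and Hubble (16612) dashboards and import them via the Grafana API
for id in 16611 16612; do
  curl -s "https://grafana.com/api/dashboards/${id}/revisions/latest/download" -o "/tmp/${id}.json"
  curl -s -X POST "http://admin:admin@10.10.10.10:30090/api/dashboards/db" \
    -H "Content-Type: application/json" \
    -d "{\"dashboard\": $(cat /tmp/${id}.json), \"overwrite\": true}"
done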
Replace existing Kyverno-generated NetworkPolicies with Cilium equivalents. Before applying the policies below, take stock of what they will supersede.
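A quick way to do that, with an example cleanup step to run only after the Cilium policies are verified (the per-namespace delete is illustrative, not a required step):
# List the NetworkPolicies currently enforced (including Kyverno-generated ones)
kubectl get networkpolicies -A
# Once the CiliumNetworkPolicies below are confirmed working, remove the old ones per namespace, e.g.:
kubectl delete networkpolicy --all -n monitoring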
# gitops-apps/security/cilium/default-deny-ingress.yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
name: default-deny-ingress
namespace: monitoring
spec:
endpointSelector:
matchLabels: {}
# Selecting every endpoint with only the allow rule below makes all other ingress default-deny
# (an explicit ingressDeny rule would take precedence and also block the allowed traffic)
ingress:
# Allow from same namespace
- fromEndpoints:
- matchLabels:
io.kubernetes.pod.namespace: monitoring
Allow only specific HTTP methods and paths between services:
# gitops-apps/security/cilium/grafana-to-prometheus.yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
name: grafana-to-prometheus
namespace: monitoring
spec:
endpointSelector:
matchLabels:
app.kubernetes.io/name: prometheus
ingress:
- fromEndpoints:
- matchLabels:
app.kubernetes.io/name: grafana
toPorts:
- ports:
- port: "9090"
rules:
http:
- method: GET
path: "/api/v1/.*"# gitops-apps/security/cilium/argocd-to-gitea.yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
name: argocd-to-gitea
namespace: services
spec:
endpointSelector:
matchLabels:
app.kubernetes.io/name: gitea
ingress:
- fromEndpoints:
- matchLabels:
io.kubernetes.pod.namespace: argocd
toPorts:
- ports:
- port: "3000"
rules:
http:
- method: GET
path: "/.*"
- method: POST
path: "/api/v1/.*"# Only allow DNS queries to AdGuard and internal names
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
name: restrict-dns
namespace: security
spec:
endpointSelector:
matchLabels: {}
egress:
- toEndpoints:
- matchLabels:
io.kubernetes.pod.namespace: services
app.kubernetes.io/name: adguard
toPorts:
- ports:
- port: "53"
protocol: UDP
- port: "53"
protocol: TCP
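The policy above only pins the destination and port. To also restrict which names may be resolved, Cilium's DNS-aware L7 rules can be layered on top. A minimal sketch, assuming the internal domains shown are the ones actually in use:
# gitops-apps/security/cilium/restrict-dns-names.yaml (example, adjust patterns)
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: restrict-dns-names
  namespace: security
spec:
  endpointSelector:
    matchLabels: {}
  egress:
    - toEndpoints:
        - matchLabels:
            io.kubernetes.pod.namespace: services
            app.kubernetes.io/name: adguard
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              # Only these name patterns may be resolved
              - matchPattern: "*.homelab.local"
              - matchPattern: "*.cluster.local"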
# CNP in audit mode — log but don't block
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
name: audit-deny-egress
namespace: monitoring
annotations:
# Audit mode: log violations but don't block
io.cilium.policy.audit: "true"
spec:
endpointSelector:
matchLabels: {}
egressDeny:
- toCIDR:
- 10.20.20.0/24 # Block egress to sandbox
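With the policy in audit mode, violations are logged instead of dropped, so you can watch for would-be denials in Hubble before turning on enforcement. A quick sketch (the AUDIT verdict filter name may vary slightly between Hubble versions):
# Flows that the audit-mode policy would have denied
hubble observe --namespace monitoring --verdict AUDIT --since 10m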
# Check encryption status on each node
kubectl exec -n kube-system ds/cilium -- cilium encrypt status
# Verify WireGuard keys
kubectl exec -n kube-system ds/cilium -- cilium encrypt show-keys
# Cilium generates WireGuard keys automatically
# To store in Vault for backup:
kubectl exec -n kube-system ds/cilium -- cat /run/cilium/wg/private.key > /tmp/wg-private.key
vault kv put secret/cilium/wireguard \
private-key="$(cat /tmp/wg-private.key)"
rm /tmp/wg-private.key
# Capture traffic between nodes — should be WireGuard encrypted
# On pve-vader, capture traffic to sidious:
ssh rancher@10.10.10.10 "sudo tcpdump -i any -c 20 host 10.10.10.12"
# Expected: UDP packets on WireGuard port (51871 default)
# Raw TCP payloads should NOT be visible
If Cilium migration fails:
helm uninstall cilium -n kube-system
# Clean up Cilium interfaces
ssh rancher@10.10.10.10 "sudo ip link delete cilium_host 2>/dev/null; sudo ip link delete cilium_vxlan 2>/dev/null"
ssh rancher@10.10.10.12 "sudo ip link delete cilium_host 2>/dev/null; sudo ip link delete cilium_vxlan 2>/dev/null"
# Clean up CNI config
ssh rancher@10.10.10.10 "sudo rm -rf /var/lib/cni/ /etc/cni/net.d/*"
ssh rancher@10.10.10.12 "sudo rm -rf /var/lib/cni/ /etc/cni/net.d/*"# Edit K3s config on master — remove flannel-backend: none
ssh rancher@10.10.10.10 "sudo sed -i '/flannel-backend/d' /etc/rancher/k3s/config.yaml"
ssh rancher@10.10.10.12 "sudo sed -i '/flannel-backend/d' /etc/rancher/k3s/config.yaml"
# Restart K3s
ssh rancher@10.10.10.10 "sudo systemctl restart k3s"
ssh rancher@10.10.10.12 "sudo systemctl restart k3s-agent"
# Wait for nodes
kubectl wait --for=condition=Ready node --all --timeout=120s
# Restart pods
kubectl rollout restart deployment --all -A
# Full restore from the pre-migration Velero backup (if needed)
velero restore create rollback-from-cilium \
--from-backup pre-cilium-migration \
--wait
# Network performance test with iperf3
# Start an iperf3 server pod and expose it as a Service so the client can reach it by name
kubectl run iperf-server --image=networkstatic/iperf3 --port=5201 -- -s
kubectl expose pod iperf-server --port=5201 --name=iperf-server
# Run the bandwidth test from a client pod (30 seconds)
kubectl run iperf-client --image=networkstatic/iperf3 --rm -it --restart=Never -- -c iperf-server -t 30
# Clean up when finished
kubectl delete pod iperf-server && kubectl delete svc iperf-server
# Expected results:
# Flannel (VXLAN): ~8-9 Gbps (on 1Gbps link: ~900 Mbps)
# Cilium (VXLAN): ~9-10 Gbps (eBPF overhead is lower)
# Cilium (WireGuard): ~7-8 Gbps (encryption overhead)
# Cilium status
kubectl exec -n kube-system ds/cilium -- cilium status
# Hubble status
kubectl exec -n kube-system deploy/hubble-relay -- hubble status
# Connectivity test
kubectl exec -n kube-system ds/cilium -- cilium connectivity test
# Encryption verification
kubectl exec -n kube-system ds/cilium -- cilium encrypt status
# DNS resolution
kubectl run test-dns --image=busybox --rm -it --restart=Never -- nslookup kubernetes.default
# Pod-to-pod connectivity
kubectl run test-curl --image=curlimages/curl --rm -it --restart=Never -- curl -s http://gitea.services.svc.cluster.local:3000
# Network policies
kubectl get ciliumnetworkpolicies -A
# All pods running
kubectl get pods -A -o wide | grep -v Running
# If Cilium pods fail to start, check the agent logs
kubectl logs -n kube-system ds/cilium --tail=50
# Common: missing kernel headers (BPF compilation)
# Fix: sudo apt install -y linux-headers-$(uname -r)
# Check Cilium endpoint list
kubectl exec -n kube-system ds/cilium -- cilium endpoint list
# Check Cilium identity list
kubectl exec -n kube-system ds/cilium -- cilium identity list
# Check Hubble flows for drops
hubble observe --type trace --trace-type drop
kubectl exec -n kube-system ds/cilium -- cilium encrypt status
# If "Disabled": check encryption.enabled=true in Helm values
# Verify kernel module: modprobe wireguard
# Cilium replaces kube-proxy — check service routing
kubectl exec -n kube-system ds/cilium -- cilium service list
# If services missing: check k8sServiceHost and k8sServicePort values
# Check Hubble Relay
kubectl logs -n kube-system deploy/hubble-relay --tail=20
# Verify Hubble is enabled on Cilium agents
kubectl exec -n kube-system ds/cilium -- cilium config | grep hubble
- Flannel removed from K3s config (flannel-backend: none)
- Old CNI config and interfaces cleaned up
- Cilium installed with kube-proxy replacement
- All pods restarted and running with Cilium networking
- DNS resolution working across namespaces
- Pod-to-pod connectivity verified
- External connectivity verified
- Hubble UI accessible (ingress or port-forward)
- Hubble CLI installed and observing flows
- Hubble metrics flowing to Prometheus
- Cilium and Hubble Grafana dashboards imported
- L3/L4 CiliumNetworkPolicies replacing K8s NetworkPolicies
- L7 HTTP policies deployed (Grafana→Prometheus, ArgoCD→Gitea)
- DNS restriction policy deployed
- Transparent WireGuard encryption enabled and verified
- Traffic between nodes confirmed encrypted (tcpdump)
- Velero backup taken before migration
- Rollback procedure documented and tested
- Performance benchmark completed (Flannel vs Cilium comparison)
- All CiliumNetworkPolicies committed to gitops-apps/security/cilium/
Capture Kubernetes events, enrich logs with structured parsing, and build log-based alerting and anomaly detection on top of Loki and Grafana Alloy.
Kubernetes events disappear after 1 hour by default. This guide configures Grafana Alloy to persist events and container logs to Loki, adds structured parsing and enrichment, enables log-based alerting via Loki ruler, and builds analytics dashboards for pattern detection.
Time Required: ~75 minutes. Prerequisites: Guide 10 (Monitoring Stack) completed.
Events & Log Analytics Pipeline
┌─────────────────────────────────────────────────┐
│ Data Sources │
│ │
│ K8s Events (etcd, 1hr TTL) │
│ Container Logs (stdout/stderr) │
│ Application Logs (JSON, text) │
└──────────────────┬──────────────────────────────┘
▼
┌─────────────────────────────────────────────────┐
│ Grafana Alloy (already installed) │
│ │
│ loki.source.kubernetes_events → Loki │
│ loki.source.pod_logs → Loki │
│ loki.process stages: │
│ - JSON parsing │
│ - Label extraction (namespace, pod, app) │
│ - Timestamp normalization │
│ - Drop noisy logs │
└──────────────────┬──────────────────────────────┘
▼
┌─────────────────────────────────────────────────┐
│ Loki (already installed) │
│ │
│ LogQL queries · Ruler alerts · Analytics │
└──────────────────┬──────────────────────────────┘
▼
┌─────────────────────────────────────────────────┐
│ Grafana Dashboards & Alerts │
│ │
│ Error rate · Event timeline · Log volume │
│ Anomaly detection · Top errors · Correlation │
└─────────────────────────────────────────────────┘
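Before adding persistence, it is worth confirming how little history the API server keeps on its own; anything older than the event TTL (1 hour by default) is simply gone:
# Show the events that still exist right now, oldest first
kubectl get events -A --sort-by='.lastTimestamp' | head -n 20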
Grafana Alloy already runs as a DaemonSet in the logging namespace. Add a kubernetes_events source to its configuration.
Create gitops-apps/monitoring/alloy-events-config.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
name: alloy-event-capture
namespace: logging
data:
event-capture.alloy: |
// ── Kubernetes Event Capture ───────────────────────
loki.source.kubernetes_events "events" {
job_name = "kubernetes-events"
log_format = "json"
namespaces = [] // empty = all namespaces
forward_to = [loki.process.event_enrichment.receiver]
}
// ── Event Enrichment ───────────────────────────────
loki.process "event_enrichment" {
// Extract key fields as labels for querying
stage.json {
expressions = {
reason = "reason",
kind = "involvedObject.kind",
name = "involvedObject.name",
ns = "involvedObject.namespace",
severity = "type",
}
}
// Set Loki labels from extracted fields
stage.labels {
values = {
reason = "",
kind = "",
severity = "",
}
}
// Add namespace as external label
stage.static_labels {
values = {
job = "kubernetes-events",
}
}
forward_to = [loki.write.homelab.receiver]
}
# If Alloy is deployed via Helm, add the config via values
cat > alloy-events-values.yaml <<'EOF'
extraConfigmapMounts:
- name: alloy-event-capture
configMap: alloy-event-capture
mountPath: /etc/alloy/event-capture.alloy
subPath: event-capture.alloy
alloy:
configMap:
content: |
// Include default logging config
// Include event capture config
import "event-capture" "/etc/alloy/event-capture.alloy"
// Existing logging config would go here
// ...
EOF
helm upgrade alloy grafana/alloy \
--namespace logging \
--reuse-values \
-f alloy-events-values.yaml \
--wait
# Check Alloy is processing events
kubectl logs -n logging -l app.kubernetes.io/name=alloy --tail=20 | grep kubernetes-events
# Query events in Grafana Explore
# LogQL: {job="kubernetes-events"}
# Or filter: {job="kubernetes-events"} |= "Failed"
# Trigger an event for testing
kubectl run event-test --image=invalid-image-that-does-not-exist -n default
# Wait 30 seconds, then check Loki
# Query: {job="kubernetes-events"} |= "event-test"
kubectl delete pod event-test -n default --force
Create gitops-apps/monitoring/alloy-log-pipeline-config.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
name: alloy-log-pipeline
namespace: logging
data:
log-pipeline.alloy: |
// ── Pod Log Collection with Enrichment ─────────────
discovery.kubernetes "pods" {
role = "pod"
}
loki.source.kubernetes "pod_logs" {
targets = discovery.kubernetes.pods.targets
job_name = "integrations/kubernetes/pod_logs"
forward_to = [loki.process.pod_enrichment.receiver]
}
// ── Log Enrichment Pipeline ────────────────────────
loki.process "pod_enrichment" {
// Stage 1: Parse JSON logs (apps using structured logging)
stage.json {
expressions = {
level = "",
msg = "",
timestamp = "",
logger = "",
traceID = "",
spanID = "",
}
}
// Stage 2: Set level as label for fast filtering
stage.labels {
values = {
level = "",
}
}
// Stage 3: Extract K8s metadata from pod labels
stage.kubernetes {
// Already enriched by discovery.kubernetes
}
// Stage 4: Drop noisy system logs to reduce storage
stage.match {
selector = '{namespace="kube-system"} |= "kube-proxy"'
action = "drop"
drop_counter_reason = "noisy-system-logs"
}
stage.match {
selector = '{namespace="longhorn-system"} |~ "instance manager client.*connect"'
action = "drop"
drop_counter_reason = "longhorn-heartbeat-noise"
}
// Stage 5: Normalize log levels
stage.match {
selector = '{level=""} |~ "(?i)error|err|fatal|panic"'
stage.json {
expressions = {level = "error"}
}
}
stage.static_labels {
values = {
cluster = "homelab-k3s",
}
}
forward_to = [loki.write.homelab.receiver]
}
kubectl apply -f gitops-apps/monitoring/alloy-log-pipeline-config.yaml
# Restart Alloy to pick up new config
kubectl rollout restart daemonset -n logging alloy
kubectl rollout status daemonset -n logging alloy
Loki's ruler component evaluates LogQL expressions and fires alerts. Update the Loki Helm values:
cat > loki-ruler-values.yaml <<'EOF'
loki:
ruler:
enabled: true
alertmanager_url: "http://kube-prometheus-stack-alertmanager.monitoring.svc.cluster.local:9093"
storage:
type: configmap
configmap:
name: loki-ruler-rules
rule_path: /loki/rules
ring:
kvstore:
store: inmemory
EOF
helm upgrade loki grafana/loki \
--namespace logging \
--reuse-values \
-f loki-ruler-values.yaml \
--wait
Create gitops-apps/monitoring/loki-log-alerts.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
name: loki-ruler-rules
namespace: logging
data:
log-alerts.yaml: |
groups:
- name: homelab-log-alerts
rules:
# High error rate across any service
- alert: HighErrorRate
expr: |
sum(rate({level="error"}[5m])) by (namespace, job)
/
sum(rate({job=~".+"}[5m])) by (namespace, job)
> 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "High error rate in {{ $labels.namespace }}/{{ $labels.job }}"
description: "Error rate is {{ $value | humanizePercentage }} in the last 5 minutes"
# Vault seal/unseal events
- alert: VaultSealEvent
expr: |
sum(count_over_time({app="vault"} |= "core: seal" [5m])) > 0
for: 1m
labels:
severity: critical
annotations:
summary: "Vault sealed unexpectedly"
description: "Vault in namespace security has been sealed"
# Pod OOMKilled events
- alert: PodOOMKilled
expr: |
sum(count_over_time({job="kubernetes-events"} |= "OOMKilled" [10m])) by (name) > 0
for: 1m
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.name }} was OOMKilled"
description: "A pod was killed due to out-of-memory. Consider increasing memory limits."
# Image pull failures
- alert: ImagePullFailure
expr: |
sum(count_over_time({job="kubernetes-events"} |~ "Failed.*ImagePull|ErrImagePull" [10m])) by (name) > 0
for: 5m
labels:
severity: warning
annotations:
summary: "Image pull failure for {{ $labels.name }}"
# Longhorn volume issues
- alert: LonghornVolumeError
expr: |
sum(count_over_time({namespace="longhorn-system"} |= "error" |~ "volume|replica" [5m])) > 3
for: 5m
labels:
severity: warning
annotations:
summary: "Longhorn volume errors detected"
# CrashLoopBackOff events
- alert: CrashLoopBackOffDetected
expr: |
sum(count_over_time({job="kubernetes-events"} |= "BackOff" |~ "CrashLoop" [5m])) by (name) > 0
for: 5m
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.name }} in CrashLoopBackOff"
# Falco high event rate (log-based)
- alert: FalcoHighEventRate
expr: |
sum(rate({namespace="security"} |= "Falco" [5m])) > 5
for: 5m
labels:
severity: warning
annotations:
summary: "Unusually high Falco event rate"
description: "Falco is generating {{ $value }} events/second via logs"
# SSL/TLS certificate errors
- alert: CertificateError
expr: |
sum(count_over_time({job=~".+"} |~ "certificate.*error|tls.*handshake.*fail|x509.*cert" [10m])) > 0
for: 5m
labels:
severity: warning
annotations:
summary: "TLS/certificate errors detected"kubectl apply -f gitops-apps/monitoring/loki-log-alerts.yaml
# Restart Loki to pick up ruler config
kubectl rollout restart statefulset -n logging loki
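After Loki restarts, confirm the ruler actually loaded the rule group. A quick sketch using the ruler's rules endpoint on the single-binary pod:
# List rule groups known to the Loki ruler
kubectl exec -n logging loki-0 -- wget -qO- 'http://localhost:3100/loki/api/v1/rules'
# Expected: the homelab-log-alerts group defined above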
Create a Grafana dashboard with these panels:
| Panel | LogQL Query | Description |
|---|---|---|
| Events over time | `sum(count_over_time({job="kubernetes-events"}[1h]))` | Event volume trends |
| Events by reason | `topk(10, sum(count_over_time({job="kubernetes-events"} [1h])) by (reason))` | Top event types |
| Events by namespace | `sum(count_over_time({job="kubernetes-events"} [1h])) by (ns)` | Which namespaces generate events |
| Warning events | `sum(count_over_time({job="kubernetes-events"} \|= "Warning" [1h]))` | Warning-level events |
| Failed pods | `count_over_time({job="kubernetes-events"} \|= "Failed" [1h])` | Pod failure events |
| Image pull failures | `count_over_time({job="kubernetes-events"} \|~ "ErrImagePull\|ImagePullBackOff" [24h])` | Image issues |
| Panel | LogQL Query | Description |
|---|---|---|
| Error rate by namespace | `sum(rate({level="error"}[5m])) by (namespace)` | Error distribution |
| Top error messages | `topk(10, sum(count_over_time({level="error"} [1h])) by (msg))` | Most frequent errors |
| Error trend | `sum(count_over_time({level="error"} [1d]))` | Daily error count |
| New errors (not seen before) | Custom query comparing time windows | First-seen errors |
| Error by service | `sum(rate({level="error"}[5m])) by (app)` | Which service errors most |
| Panel | LogQL Query | Description |
|---|---|---|
| Total log volume | `sum(rate({job=~".+"} [5m]))` | Overall ingestion rate |
| Volume by namespace | `sum(rate({job=~".+"} [5m])) by (namespace)` | Log distribution |
| Volume spike detection | Compare current rate to 1h average | Unusual log surges |
| Dropped logs | `sum(rate(alloy_dropped_log_lines_total[5m])) by (namespace)` | Logs dropped by Alloy |
| Loki ingestion rate | `loki_distributor_lines_received_total` | Lines ingested per second |
Create gitops-apps/monitoring/grafana-dashboards/events-analytics.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
name: events-analytics-dashboard
namespace: monitoring
labels:
grafana_dashboard: "1"
data:
events-analytics.json: |
{
"annotations": { "list": [] },
"title": "Kubernetes Events & Log Analytics",
"tags": ["homelab", "events", "logs"],
"timezone": "utc",
"panels": [
{
"title": "Events Over Time",
"type": "timeseries",
"datasource": { "type": "loki", "uid": "loki" },
"targets": [{
"expr": "sum(count_over_time({job=\"kubernetes-events\"}[$__interval]))",
"refId": "A"
}],
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }
},
{
"title": "Top Event Reasons",
"type": "barchart",
"datasource": { "type": "loki", "uid": "loki" },
"targets": [{
"expr": "topk(10, sum(count_over_time({job=\"kubernetes-events\"} [1h])) by (reason))",
"refId": "A"
}],
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 }
},
{
"title": "Warning Events",
"type": "timeseries",
"datasource": { "type": "loki", "uid": "loki" },
"targets": [{
"expr": "sum(count_over_time({job=\"kubernetes-events\"} |= \"Warning\" [$__interval])) by (reason)",
"refId": "A",
"legendFormat": "{{reason}}"
}],
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 }
},
{
"title": "Error Rate by Namespace",
"type": "timeseries",
"datasource": { "type": "loki", "uid": "loki" },
"targets": [{
"expr": "sum(rate({level=\"error\"}[5m])) by (namespace)",
"refId": "A",
"legendFormat": "{{namespace}}"
}],
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 }
},
{
"title": "Log Volume by Namespace",
"type": "timeseries",
"datasource": { "type": "loki", "uid": "loki" },
"targets": [{
"expr": "sum(rate({job=~\".+\"}[5m])) by (namespace)",
"refId": "A",
"legendFormat": "{{namespace}}"
}],
"gridPos": { "h": 8, "w": 24, "x": 0, "y": 16 }
}
],
"refresh": "30s",
"time": { "from": "now-1h", "to": "now" }
}
kubectl apply -f gitops-apps/monitoring/grafana-dashboards/events-analytics.yaml
# Grafana will auto-import if dashboard sidecar is configured
# Otherwise: Dashboards → Import → paste JSON
Add to the Alloy log pipeline:
// Stage: Extract traceID from structured logs for Tempo correlation
stage.json {
expressions = {
traceID = "traceID",
spanID = "spanID",
}
}
// Link to Tempo trace in Grafana
stage.labels {
values = {
traceID = "",
}
}
In Grafana, configure the Loki datasource derived fields to link to Tempo:
- Connections → Data Sources → Loki
- Derived fields → Add:
  - Name: TraceID
  - Regex: traceID=(\w+)
  - Datasource: Tempo
  - URL: leave empty (uses datasource)
This makes trace IDs in logs clickable — jumping directly to the trace in Tempo.
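If the Loki datasource is provisioned declaratively rather than edited in the UI, the same derived field can be expressed in provisioning YAML. A minimal sketch, assuming your Tempo datasource has the UID tempo:
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki.logging.svc.cluster.local:3100
    jsonData:
      derivedFields:
        # Matches traceID=<id> in log lines and links it to the Tempo datasource
        - name: TraceID
          matcherRegex: "traceID=(\\w+)"
          datasourceUid: tempo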
cat > loki-retention-values.yaml <<'EOF'
loki:
limits_config:
retention_period: 744h # 31 days
max_query_length: 721h
ingestion_rate_mb: 10
ingestion_burst_size_mb: 20
compactor:
working_directory: /loki/compactor
compaction_interval: 10m
retention_enabled: true
retention_delete_delay: 2h
delete_request_store: filesystem
schema_config:
configs:
- from: 2024-01-01
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: loki_index_
period: 24h
EOF
helm upgrade loki grafana/loki \
--namespace logging \
--reuse-values \
-f loki-retention-values.yaml \
--wait
# Check Loki storage usage
kubectl exec -n logging loki-0 -- du -sh /loki/
kubectl exec -n logging loki-0 -- du -sh /loki/chunks/ /loki/index/
# Check PVC usage
kubectl exec -n logging loki-0 -- df -h /loki
# All Warning events
{job="kubernetes-events"} |= "Warning"
# Events for a specific pod
{job="kubernetes-events"} | json | name="my-pod"
# Failed scheduling events
{job="kubernetes-events"} |= "FailedScheduling"
# Eviction events
{job="kubernetes-events"} |= "Evicting"
# Recent events sorted by time
{job="kubernetes-events"} | json | line_format "{{.metadata.creationTimestamp}} [{{.type}}] {{.reason}}: {{.message}}"
# Errors in last hour by service
{level="error"} | json | line_format "{{.logger}}: {{.msg}}"
# Errors with stack traces
{level="error"} |~ "panic|fatal|stack trace"
# HTTP 5xx errors
|~ "status_code=\"5[0-9][0-9]\"|HTTP 5[0-9][0-9]|\"status\":5[0-9][0-9]"
# Slow requests (>1s)
|~ "duration_ms=\"[1-9][0-9]{3,}\"|elapsed.*[0-9]+s"
# Log volume spike (compare to 1h average)
sum(rate({job=~".+"} [5m])) / (sum(rate({job=~".+"} [1h])) * 12) > 2
# New error messages (not seen in previous hour)
sum(count_over_time({level="error"} [5m])) by (msg)
and on (msg)
sum(count_over_time({level="error"} [5m] offset 1h])) by (msg) == 0
# Verify Alloy is capturing events
kubectl logs -n logging -l app.kubernetes.io/name=alloy --tail=20 | grep "kubernetes-events"
# Verify events in Loki
kubectl exec -n monitoring deploy/kube-prometheus-stack-grafana -- \
wget -qO- 'http://loki.logging.svc.cluster.local:3100/loki/api/v1/query?query={job="kubernetes-events"}' | jq '.data.result | length'
# Verify log-based alerts are loaded
kubectl get configmap loki-ruler-rules -n logging -o yaml | grep alert
# Verify Loki retention is configured
kubectl exec -n logging loki-0 -- wget -qO- 'http://localhost:3100/loki/api/v1/status/config' | jq '.limits_config.retention_period'
# Verify dashboards
curl -s "http://admin:admin@10.10.10.10:30090/api/dashboards/uid" | jq '.[].title' | grep -i event
# Generate test events and verify capture
kubectl run test-events --image=invalid-image -n default
sleep 30
# Query in Grafana: {job="kubernetes-events"} |= "test-events"
kubectl delete pod test-events -n default --force
# Check Alloy event capture config
kubectl exec -n logging ds/alloy -- cat /etc/alloy/event-capture.alloy
# Check Alloy logs for errors
kubectl logs -n logging ds/alloy --tail=50 | grep -i error
# Verify Loki is receiving: check /loki/api/v1/labels
# Check ruler is enabled
kubectl logs -n logging loki-0 --tail=50 | grep ruler
# Verify rules configmap
kubectl get cm loki-ruler-rules -n logging -o yaml
# Check ruler metrics
kubectl exec -n logging loki-0 -- wget -qO- 'http://localhost:3100/metrics' | grep ruler
# Identify noisy sources
kubectl exec -n monitoring deploy/kube-prometheus-stack-grafana -- \
wget -qO- 'http://loki.logging.svc.cluster.local:3100/loki/api/v1/query?query=topk(10,sum(rate({job=~".+"}[1h]))by(namespace))'
# Add more drop stages in Alloy config for noisy namespaces
# Test Loki query directly
kubectl exec -n monitoring deploy/kube-prometheus-stack-grafana -- \
wget -qO- 'http://loki.logging.svc.cluster.local:3100/ready'
# Expected: ready
# Check datasource URL in Grafana: http://loki.logging.svc.cluster.local:3100
- Grafana Alloy configured with loki.source.kubernetes_events
- Kubernetes events persisted to Loki (survive past 1hr TTL)
- Events enriched with labels (reason, kind, severity, namespace)
- Alloy log pipeline stages configured (JSON parsing, label extraction, noise filtering)
- Loki ruler enabled with AlertManager integration
- Log-based alert rules created (error rate, OOMKilled, ImagePull, CrashLoop, Falco, certs)
- Kubernetes Events dashboard created in Grafana
- Error Analytics dashboard created in Grafana
- Log Volume Anomaly dashboard created in Grafana
- TraceID extraction configured for log-to-trace correlation
- Grafana derived fields link Loki logs to Tempo traces
- Loki retention set to 31 days
- Log storage monitored (PVC usage)
- LogQL query reference documented
- All configs committed to gitops-apps
Comprehensive step-by-step guides for deploying your DevSecOps homelab infrastructure.
| Guide | Description | Target Node |
|---|---|---|
| 01 - Local Setup | Configure MacBook with TF/Ansible | Local |
| 02 - Proxmox Cluster | Form Cluster & SDN (VXLAN) | Vader (Master) |
| 03 - Terraform | Provision 24/7 VMs & Hack Box | Vader/Sidious |
| 04 - Ansible Bootstrap | OS hardening & K3s Prereqs | All Nodes |
| 05 - K3s Cluster | Deploy Kubernetes (Server/Agent) | Vader/Sidious |
| 06 - Longhorn | HA Distributed Block Storage | K3s Nodes |
| 07 - GitOps Stack | Deploy Gitea & ArgoCD | K3s Cluster |
| 08 - Security Tooling | Vault, Falco, Trivy, Kyverno | K3s Cluster |
| 09 - Red/Blue Team | Deploy Kali & Security Sandboxes | Maul (Hack Box) |
| 10 - LGTM Stack | Loki, Grafana, Tempo, Prometheus | K3s Cluster |
| 11 - Identity & SSO | Authelia & Active Directory | K3s Cluster |
| 12 - CI/CD Pipeline | Gitea Actions, gitleaks, security gates | K3s Cluster |
| 13 - Supply Chain | Cosign, Syft, Grype, Sigstore | K3s Cluster |
| 14 - IaC Security | tfsec, Checkov, Terrascan, Conftest | K3s Cluster |
| 15 - Cert Manager | TLS automation, private CA | K3s Cluster |
| 16 - Backup & DR | Velero, PBS, restore runbooks | K3s Cluster |
| 17 - Compliance | kube-bench, CIS, OpenSCAP | All Nodes |
| 18 - DAST | OWASP ZAP automated scanning | Maul (Sandbox) |
| 19 - Chaos Eng | Chaos Mesh resilience testing | K3s Cluster |
| 20 - Policy as Code | OPA Gatekeeper, Conftest policies | K3s Cluster |
| 21 - Incident Response | Falco→AlertManager, runbooks | K3s Cluster |
| 22 - Network Security | Cilium CNI, Hubble, WireGuard | K3s Cluster |
| 23 - Events & Logs | K8s events capture, Loki ruler, log analytics | K3s Cluster |
| Component | Network | IP Address | Host / Role |
|---|---|---|---|
| pve-vader | Physical | 192.168.1.11 | Master |
| pve-sidious | Physical | 192.168.1.12 | 24/7 Node |
| pve-maul | Physical | 192.168.1.10 | Hack Box |
| pfSense LAN | VNet1 | 10.10.10.1 | Vader |
| AdGuard Home | VNet1 | 10.10.10.2 | Vader |
| Tailscale | VNet1 | 10.10.10.3 | Vader |
| K3s Master | VNet1 | 10.10.10.10 | Vader |
| K3s Worker 1 | VNet1 | 10.10.10.12 | Sidious |
| Kali Linux | VNet2 | 10.20.20.10 | Maul |
Outcome: Proxmox cluster healthy, SDN configured, VMs provisioned via IaC.
Outcome: HA cluster running with distributed storage backed by physical SATA SSDs.
Outcome: GitOps engine (ArgoCD) and Secrets (Vault) operational.
Outcome: Full LGTM observability, isolated security labs, and Single Sign-On (SSO) active.
Outcome: CI/CD pipeline with security gates, image signing and SBOMs, IaC scanning in pipeline.
Outcome: Automated TLS certificates, backup/DR with tested restore procedures, CIS compliance.
Outcome: DAST scanning against vulnerable apps, chaos engineering for resilience, policy-as-code with OPA Gatekeeper.
Outcome: Complete incident response pipeline (Falco→AlertManager→Grafana), Cilium eBPF networking with WireGuard encryption and Hubble observability.
Outcome: Kubernetes events persisted beyond 1hr TTL, log-based alerting via Loki ruler, structured log parsing, and analytics dashboards for pattern detection and anomaly identification.
Architecture for a 3-node Proxmox and Kubernetes homelab tailored for DevSecOps, GitOps, and Observability.
| Node | Role | Hardware Pool | Workloads |
|---|---|---|---|
| pve-vader | Primary Master | NVMe + SATA SSD | pfSense, K3s Master, ArgoCD, Vault |
| pve-sidious | 24/7 Worker | NVMe + SATA SSD | K3s Worker, LGTM Stack Persistence |
| pve-maul | Hack Box | NVMe Only | Kali Linux, Red Team Sandboxes |
- Hypervisor: Proxmox VE 8.x with SDN (VXLAN Tunneling)
- Kubernetes: K3s (Master on Vader, Workers on Sidious/Vader)
- Storage: Longhorn HA (backed by host-replicated Virtual Disks)
- Observability: LGTM Stack (Loki, Grafana, Tempo, Prometheus)
- Networking: pfSense (Double NAT Gateway), Tailscale (Admin VPN)
Core networking (pfSense) and the Kubernetes Control Plane are pinned to pve-vader. Along with pve-sidious, these nodes maintain Proxmox quorum and cluster stability 24/7.
pve-maul is designated as an optional node. It is isolated at the firewall level (pfSense) to allow for high-risk security testing without compromising the stability of the management infrastructure.
To resolve host-path limitations, Longhorn uses virtual disks mapped from the physical SATA SSDs on Vader and Sidious. This ensures data persistence across node reboots.
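For example, a dedicated Longhorn data disk can be carved out of the SATA SSD thin pool and attached to a K3s VM. A sketch with placeholder values (VMID 110, storage pool ssd-thin, and 200 GiB are assumptions, not values from this lab):
# Allocate a 200 GiB virtual disk from the SSD LVM-thin pool and attach it to the VM as scsi1
qm set 110 --scsi1 ssd-thin:200
# Inside the VM, format and mount the new disk where Longhorn expects its data (e.g. /var/lib/longhorn)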
- Physical Preparation: Complete the LVM Thin Pool setup in docs/checklist/storage-checklist.md.
- Infrastructure Build: Follow the docs/checklist/implementation-checklist.md starting from Phase 0.