@jjo
Created June 1, 2026 14:32

UM Cloud — Optimization Pass (late May 2026)

Two-day stabilization + scale-up campaign covering Ceph health, the OpenStack control plane (rook-ceph, RabbitMQ, MariaDB, memcached, Horizon, nova/neutron/cinder), and a reproducible load-test harness. End state: cluster healthy, 40 concurrent VM creates verified at 2m10s–2m17s with 0 errors, no MessagingTimeouts in the control plane.


1. Ceph — recovery + resilience

Acute fires

  • rabbitmq-rabbitmq-0 stuck mount (CSI op-lock leftover from um-kros-03 hard reboot). Fix: restart csi-rbdplugin DaemonSet pod on the node + csi-rbdplugin-provisioner rollout.
  • PG 1.1d stuck peering for ~54 min on acting [10,7,6] (replicapool, blocking rabbitmq RBD image). Fix: ceph pg repeer 1.1d + restart osd.6 (um-kros-05).
  • PG 3.11 stuck peering for cinder.volumes pool — osd.10 slow ops. Fix: restart osd.10 (um-kros-07). Surfaced the deeper issue (next section).
  • um-kros-01 OSDs back: node had been cordoned. osd.0 + osd.26 live on /dev/sda4 + /dev/sdb4 (system raid is sda2+sdb2/md0). Uncordoning brought both back into the cluster.
  • mon.v off-quorum on um-kros-01 — same uncordon brought it back, restoring 3/3 quorum.
  • 3 daemon crashes archived (ceph crash archive-all).

Lasting changes

  • kubeai-cephfs pools flipped from size=1 to size=2 min_size=1 (kubeai-cephfs-metadata, kubeai-cephfs-data0). Eliminates single-OSD data-loss exposure for the AI workload.
  • mon_data_avail_warn lowered cluster-wide from 30% to 15%. Mons sit on 45 GB ext4 with <150 MB data — the 30% default was overly conservative.
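For reference, the two lasting changes map onto standard ceph CLI calls. The sketch below only builds and prints the command lines; it executes nothing. Setting the threshold at `global` scope is an assumption about how "cluster-wide" was applied, and the function name is hypothetical.

```python
import shlex


def ceph_resilience_commands(pools, size=2, min_size=1, mon_warn_pct=15):
    """Return ceph CLI invocations as argv lists. Nothing is executed here."""
    cmds = []
    for pool in pools:
        cmds.append(["ceph", "osd", "pool", "set", pool, "size", str(size)])
        cmds.append(["ceph", "osd", "pool", "set", pool, "min_size", str(min_size)])
    # Assumption: the warning threshold lives in the central config at global scope.
    cmds.append(["ceph", "config", "set", "global",
                 "mon_data_avail_warn", str(mon_warn_pct)])
    return cmds


if __name__ == "__main__":
    pools = ["kubeai-cephfs-metadata", "kubeai-cephfs-data0"]
    for argv in ceph_resilience_commands(pools):
        print(shlex.join(argv))
```

Because the cluster runs under rook-ceph, the matching CephFilesystem spec would presumably need the same replica settings, otherwise the operator may reconcile the pools back.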

Disk cleanup on mon hosts

Node        Before    After  Reclaimed
um-kros-04  90% used  60%    ~13 GB
um-kros-03  71% used  62%    ~4 GB
um-kros-01  76% used  74%    ~1 GB

Removed: stale mon-{a..w} directories (only the active mon per host kept), etcd.bak, kubebet.bak-kros-v1, old etcd tarballs, journal vacuum to 7 d, kube-audit rotated logs.


2. Docker zfs storage driver pressure

zfs list was burning a full core on um-kros-01, um-kros-03, um-kros-04 continuously. Root cause: docker's zfs storage driver shells out to zfs list -r -t all -Hp ... on every container event; ~1000–1200 datasets per node made each call O(N) and slow.

Action: docker system prune -af --filter "until=240h" plus builder prune on each node.

Node        Datasets (before → after)  Reclaimed
um-kros-01  1119 → 361                 33.8 GB
um-kros-03  1264 → 472                 36.4 GB
um-kros-04  982 → 417                  17.5 GB

CPU pressure cleared immediately; load on um-kros-03 dropped from ~5 to <1.
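Since the cost of each zfs list call grows with the dataset count, a small check on that count could flag when a prune is due. A minimal sketch, assuming the input is the text produced by `zfs list -H -o name -t all` (one dataset per line); the threshold of 500 is a hypothetical value, not one taken from this cluster.

```python
def count_datasets(zfs_list_output: str) -> int:
    """Count non-empty lines, i.e. datasets, in `zfs list -H -o name -t all` output."""
    return sum(1 for line in zfs_list_output.splitlines() if line.strip())


def needs_prune(dataset_count: int, threshold: int = 500) -> bool:
    """True when the dataset count exceeds the (hypothetical) alert threshold."""
    return dataset_count > threshold


if __name__ == "__main__":
    # Counts from the table above: before and after the prune on um-kros-01.
    for label, count in (("before", 1119), ("after", 361)):
        print(label, count, "prune needed" if needs_prune(count) else "ok")
```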

Long-term: migrate dockerd storage driver away from zfs (overlay2 over a zfs dataset is fine) — separate effort, documented but not executed.


3. RabbitMQ — Tier-2 tuning + safety fixes

Existing setup: single STS replica rabbitmq-rabbitmq-0 in openstack ns, RabbitMQ 3.9.0 on Ceph RBD (768Mi PVC), Erlang VM previously pinned to +S 1:1 (single scheduler).

Patch

os/configs/rabbitmq-perf.cm-patch.yaml (committed) + make apply-rabbitmq-tuning:

rabbitmq-env.conf:
  SERVER_ADDITIONAL_ERL_ARGS="+S 4:4 +A 128 +K true +sbwt none +sbwtdcpu none +sbwtdio none"

rabbitmq.conf additions:
  log.default.level                  = warning
  vm_memory_high_watermark.relative  = 0.6
  disk_free_limit.absolute           = 50MB
  collect_statistics_interval        = 30000

Bugs caught during tuning

  • disk_free_limit.relative = 1.0 (initial attempt) means "require 1× total system RAM free on the volume", which is impossible on a 768 Mi PVC. Result: permanent disk alarm → all publishers blocked → every nova-compute lost heartbeat at the same wall-clock instant → hypervisor list showed all hosts down → openstack server list took 3m51s → horizon hit apache proxy timeout → 504s. Fixed by switching to disk_free_limit.absolute = 50MB.
  • queue_index_max_journal_entries and mnesia.dump_log_write_threshold aren't valid keys in 3.9's sysctl-format rabbitmq.conf (would need advanced.config). Removed.
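The disk-alarm bug reduces to one comparison: free space can never exceed the volume's capacity, so a limit larger than the volume can never be satisfied. A worked sketch follows; the 64 GiB RAM figure is a hypothetical stand-in (the node's real RAM is not given here), and any value above 768 MiB leads to the same conclusion.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def relative_limit_bytes(total_ram_bytes: int, ratio: float) -> int:
    """Free-disk limit implied by disk_free_limit.relative = ratio."""
    return int(total_ram_bytes * ratio)


def alarm_can_clear(volume_capacity_bytes: int, limit_bytes: int) -> bool:
    """The alarm can clear only if the volume is at least as large as the limit."""
    return volume_capacity_bytes >= limit_bytes


if __name__ == "__main__":
    pvc = 768 * MIB
    assumed_ram = 64 * GIB  # hypothetical node RAM
    print("relative=1.0:", alarm_can_clear(pvc, relative_limit_bytes(assumed_ram, 1.0)))
    # RabbitMQ reads "MB" as 10**6 bytes, so 50MB is 50,000,000 bytes.
    print("absolute=50MB:", alarm_can_clear(pvc, 50 * 10 ** 6))
```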

Verified

  • erlang:system_info(schedulers_online) → 4 (was 1)
  • erlang:system_info(thread_pool_size) → 128 (was 64 default)
  • Auth grant on vhost / for the rabbitmq admin user added by the perf-test script.

4. memcached — relocate + de-SPOF

  • Single pod was on um-kros-04 (the disk-pressured / mon-host node) and momentarily refused connections during our work, causing keystone token-validation cascade-fail → Horizon 504s.
  • Killed, rescheduled onto um-kros-03. Verified RTT ~21 ms from a nova-api pod.
  • Scaled to 2 replicas in the Helm values.

5. ImagePullBackOff — kubernetes-entrypoint schema-v1 footgun

quay.io/airshipit/kubernetes-entrypoint:v1.0.0 is Docker manifest schema-v1, which modern dockerd refuses to pull from a remote registry. Every openstack-helm chart uses this image as its dep_check init container. A node missing the local cache blocks init containers → nova-scheduler stuck Init:0/3 → MessagingTimeout on every select_destinations → every new VM lands in ERROR (the ddb10561-… case investigated).

Fix: os/scripts/preload-dep-check-image.sh + make preload-dep-check-image — finds a CP node that has the image cached, docker save | docker loads it onto every other CP node tagged openstack-control-plane=enabled. Idempotent. Should be re-run after a fresh node joins or a docker prune.
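The selection logic the script is described as following (pick one node that has the image, copy to every node that lacks it, do nothing when all have it) can be sketched as below. This is not the script itself: the node names are hypothetical, and the ssh transport in `transfer_command` is an assumption, since the real transfer mechanism is not shown.

```python
IMAGE = "quay.io/airshipit/kubernetes-entrypoint:v1.0.0"


def plan_preload(cached_by_node):
    """Given {node: has_image}, return (source_node, nodes_missing_the_image)."""
    have = sorted(n for n, cached in cached_by_node.items() if cached)
    if not have:
        raise RuntimeError("no node has the image cached; nothing to copy from")
    missing = sorted(n for n, cached in cached_by_node.items() if not cached)
    return have[0], missing


def transfer_command(source, target, image=IMAGE):
    # Assumption: nodes are reachable over ssh (not confirmed by the source).
    return f"ssh {source} docker save {image} | ssh {target} docker load"


if __name__ == "__main__":
    # Hypothetical node names and cache state.
    source, targets = plan_preload({"cp-1": False, "cp-2": True, "cp-3": False})
    for target in targets:
        print(transfer_command(source, target))
```

Running the plan a second time, once every node reports the image as cached, yields an empty target list, which is what makes the operation idempotent.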


6. Horizon — WSGI concurrency

The Apache WSGIDaemonProcess line baked into the image was:

WSGIDaemonProcess horizon-http processes=5 threads=1

5 concurrent Django requests per pod total. With 3 pods → 15 cluster-wide. The /project/instances/ view fires per-row polling for every visible server (we have 226), instantly saturating the WSGI pool and queuing the rest. Horizon CPU sat idle at <4% while users saw lag.
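The saturation arithmetic can be checked in a few lines. This sketch makes one simplifying assumption not stated above: every visible row has one polling request in flight at the same moment, which is the worst case.

```python
def wsgi_capacity(processes: int, threads: int, pods: int) -> int:
    """Concurrent requests the WSGI daemon pool can serve across all pods."""
    return processes * threads * pods


if __name__ == "__main__":
    capacity = wsgi_capacity(processes=5, threads=1, pods=3)
    visible_servers = 226
    queued = max(0, visible_servers - capacity)
    print(f"capacity={capacity} queued={queued}")
```

The same function applies to any processes/threads pair, so candidate settings can be compared before touching the Apache config.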

Fix: patched the horizon-etc Secret (the apache config lives there as 000-default.conf, not in a ConfigMap):

processes=8 threads=15        # 120 concurrent per pod, 360 cluster-wide

Plus deployment-level patch for the readiness probe:

readinessProbe.timeoutSeconds = 5     # was 1 — cold start of 8 WSGI procs exceeded 1s
readinessProbe.initialDelaySeconds = 30

Mid-fix gotcha: an initial kubectl patch secret used the wrong jsonpath escape and wiped the apache config to empty; recovered by reading it back from a still-running pod and re-patching. Worth doing the change in the chart values long-term so it survives helm upgrade.
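One way to make this kind of Secret edit harder to get wrong is to build the patch programmatically and refuse empty content. The sketch below uses a JSON Patch (`kubectl patch --type=json`) rather than whatever escaping the original attempt relied on; the helper name is hypothetical. Secret `data` values must be base64-encoded, and in a JSON Pointer path only `~` and `/` need escaping, so the dot in `000-default.conf` is safe as-is.

```python
import base64
import json


def secret_key_patch(key: str, content: str) -> str:
    """Build a JSON Patch document that replaces one key under a Secret's data."""
    if not content.strip():
        raise ValueError("refusing to write empty content into the Secret")
    # JSON Pointer (RFC 6901) escaping: "~" becomes "~0", "/" becomes "~1".
    pointer_key = key.replace("~", "~0").replace("/", "~1")
    encoded = base64.b64encode(content.encode("utf-8")).decode("ascii")
    return json.dumps(
        [{"op": "replace", "path": f"/data/{pointer_key}", "value": encoded}]
    )


if __name__ == "__main__":
    conf = "WSGIDaemonProcess horizon-http processes=8 threads=15\n"
    print(secret_key_patch("000-default.conf", conf))
```

The printed document could then be passed to something like `kubectl -n openstack patch secret horizon-etc --type=json -p "$PATCH"` (the namespace is assumed to match the rest of the deployment).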


7. Scale-up for ~40 concurrent VM peak (replicas)

Cluster was originally sized with most services at 1–2 replicas + workers = 1 inside each WSGI pod. 40 simultaneous incoming API calls would have serialized through 2 nova-api slots and 1 nova-scheduler.

Live + chart values updated (pod.replicas block in os/configs/os-charts-values.tmpl.yaml)

Service           Before  After
nova-api-osapi    2       4
nova-scheduler    1       3
nova-conductor    2       2
placement-api     1       2
cinder-volume     1       2
cinder-scheduler  1       2
memcached         1       2

Note: the shared values file applies any given pod.replicas.<role> to every chart that defines that role. Documented side effects: pod.replicas.scheduler=3 also bumps cinder-scheduler; pod.replicas.api=2 bumps placement / glance / heat / cinder / keystone where consumed (all already ≥2).

Quick-win #2 (not yet applied)

Bump osapi_compute_workers, metadata_workers, neutron workers, cinder workers from 1 → 4 inside oslo configs. Effective concurrency per pod goes from 1 to 4 → another 4× ingress headroom. Useful if traffic doubles from current peak.
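The effect of replicas and workers on API ingress can be worked through with the document's own model, in which each worker handles one request at a time. A small sketch for nova-api-osapi, with function names chosen for illustration:

```python
import math


def api_slots(replicas: int, workers_per_pod: int) -> int:
    """Requests served concurrently under a one-request-per-worker model."""
    return replicas * workers_per_pod


def waves(concurrent_calls: int, slots: int) -> int:
    """How many rounds it takes to drain a burst of simultaneous calls."""
    return math.ceil(concurrent_calls / slots)


if __name__ == "__main__":
    scenarios = (
        ("original (2 replicas, 1 worker)", 2, 1),
        ("after scale-up (4 replicas, 1 worker)", 4, 1),
        ("with Quick-win #2 (4 replicas, 4 workers)", 4, 4),
    )
    for label, replicas, workers in scenarios:
        slots = api_slots(replicas, workers)
        print(f"{label}: {slots} slots, {waves(40, slots)} waves for 40 calls")
```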


8. Load-test harness — reproducible benchmarks

RabbitMQ — rabbitmq-perf-test driven from an in-cluster Pod

os/scripts/rabbitmq-loadtest.sh + Makefile targets rabbitmq-loadtest-{baseline,realistic,stress}.

Auto-scrapes admin creds from secret/rabbitmq-admin-user, grants vhost / permissions, spawns a one-shot Pod with pivotalrabbitmq/perf-test:latest, writes results to /tmp/rmq-loadtest-<scenario>-<ts>.log.

Numbers (post-Tier-2, mid Ceph recovery)

Scenario                        Producers/Consumers  Persistent?  avg msg/s  confirm p99  e2e p99
baseline                        1/1                  no           16,037     23 ms        262 ms
realistic (worst-case durable)  4/4                  yes          1,011      1.45 s       1.0 s
stress                          16/16                no           13,839     1.73 s*      1.69 s*

* stress latency dominated by --confirm 1000 batching, not rabbit. Throughput plateau at ~14–16k msg/s = single-node ceiling.

Verified every openstack vhost (nova/neutron/keystone/cinder/glance/heat) declares only durable=false queues. So the persistent-message penalty doesn't apply to real traffic. Real openstack RPC load is in the low hundreds of msg/s — 30–300× headroom.
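The headroom range is a simple ratio of the measured ceiling to the real load. The inputs below are assumptions chosen to illustrate one reading that reproduces the quoted range: a 15k msg/s ceiling (midpoint of the 14–16k plateau) and a real load between 50 and 500 msg/s. The actual RPC rate was not measured here.

```python
def headroom(ceiling_msgs_per_s: float, load_msgs_per_s: float) -> float:
    """How many times the current load fits under the throughput ceiling."""
    return ceiling_msgs_per_s / load_msgs_per_s


if __name__ == "__main__":
    ceiling = 15_000  # assumed midpoint of the 14-16k msg/s plateau
    for load in (500, 50):  # hypothetical bounds for "low hundreds of msg/s"
        print(f"load={load} msg/s -> {headroom(ceiling, load):.0f}x headroom")
```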

VM creation — Terraform/OpenTofu

os/terraform/vm-loadtest/ (main.tf + variables.tf + versions.tf) + helper scripts:

  • os/scripts/vm-loadtest-quota.sh — bumps nova + neutron admin quotas, neutron via direct API PUT (CLI's check_limit parameter rejected by wallaby neutron).
  • os/scripts/vm-loadtest-report.sh — polls servers, reports ACTIVE/ERROR + fault messages.

Makefile targets: vm-loadtest-quota, vm-loadtest-init, vm-loadtest-apply, vm-loadtest-report, vm-loadtest-destroy, vm-loadtest (chains the first three).

make vm-loadtest N=40

Result

  • 40 / 40 ACTIVE, 0 ERROR
  • Per-VM end-to-end 2 m 10 s – 2 m 17 s (7 s spread)
  • No MessagingTimeout in nova/neutron/keystone/cinder/placement/glance logs during the window
  • Bottleneck now lives inside the VM (cloud-init), not in the control plane
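The pass/fail check behind a result like this can be sketched as a tally over `openstack server list -f json` output, whose rows carry "Name" and "Status" fields. This is an illustration, not the contents of vm-loadtest-report.sh; the "loadtest-" prefix and the sample rows are hypothetical.

```python
import json
from collections import Counter


def summarize(server_list_json: str, name_prefix: str) -> Counter:
    """Tally server statuses for servers whose name starts with name_prefix."""
    servers = json.loads(server_list_json)
    return Counter(
        s.get("Status", "UNKNOWN")
        for s in servers
        if s.get("Name", "").startswith(name_prefix)
    )


def all_active(counts: Counter, expected: int) -> bool:
    """True when every expected server is ACTIVE and none is in ERROR."""
    return counts.get("ACTIVE", 0) == expected and counts.get("ERROR", 0) == 0


if __name__ == "__main__":
    sample = json.dumps([
        {"Name": "loadtest-0", "Status": "ACTIVE"},
        {"Name": "loadtest-1", "Status": "ERROR"},
        {"Name": "unrelated-vm", "Status": "ACTIVE"},
    ])
    counts = summarize(sample, "loadtest-")
    print(dict(counts), "pass" if all_active(counts, 2) else "fail")
```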

9. Repo deltas (durable)

/home/jjo/src/juanjo/um-cloud/kros-v2/os/:

Path                                 Purpose
configs/rabbitmq-perf.cm-patch.yaml  Tier-2 rabbit ConfigMap patch
configs/os-charts-values.tmpl.yaml   pod.replicas block added
scripts/preload-dep-check-image.sh   Side-load schema-v1 image to CP nodes
scripts/rabbitmq-loadtest.sh         In-cluster perf-test harness
scripts/vm-loadtest-quota.sh         Admin quota bump (nova + neutron direct API)
scripts/vm-loadtest-report.sh        Polls + reports VM outcomes
terraform/vm-loadtest/main.tf        Concurrent VM creation manifest
terraform/vm-loadtest/variables.tf   n, name_prefix, image/flavor/net names
terraform/vm-loadtest/versions.tf    provider pin + cloud = openstack_helm
Makefile                             apply-rabbitmq-tuning, preload-dep-check-image, rabbitmq-loadtest-{baseline,realistic,stress}, vm-loadtest{,-quota,-init,-apply,-report,-destroy}

Plus a planning doc ~/Downloads/um-rabbitmq-local.md for the eventual rabbit→local-storage migration (deemed unnecessary for current throughput, kept for SPOF mitigation).


10. Open items

  • um-kros-01 nova-compute in CrashLoopBackOff (EHOSTUNREACH to keystone-api ClusterIP). Pre-existing networking/iptables gap — separate from anything load-related. Either fix kube-proxy/CNI on that node or take nova-compute off um-kros-01 entirely if it was never intended as a hypervisor.
  • dockerd zfs → overlay2 migration plan (drained nodes, daemon.json switch, image re-pull). Avoid future zfs list CPU storms.
  • kubernetes-entrypoint image: prune-resistant only as long as preload-dep-check-image.sh runs on every new CP node. Long-term, upgrade the chart dep_check reference to an OCI-v2 image.
  • workers = 1 in oslo configs everywhere — Quick Win #2; bump to 4 if peak grows.
  • Rabbit → local storage migration plan archived at ~/Downloads/um-rabbitmq-local.md. Throughput gain currently marginal but mitigates the Ceph-coupling SPOF.
  • Horizon WSGI changes live in Secret/horizon-etc and a kubectl patch — both revert on helm upgrade. Mirror to chart values for durability.
  • Ceph crash log clean; backfill complete; cluster HEALTH_WARN only because of um-kros-07 OSDs intentionally out.

11. Headline metrics, before → after

Metric                                Before                                    After
RabbitMQ msg/s (transient, 1P/1C)     ~3-5k (estimated, +S 1:1)                 16 k
RabbitMQ schedulers online            1                                         4
RabbitMQ async IO threads             64 (default)                              128
Horizon WSGI concurrency (cluster)    15                                        360
openstack server list (admin)         3 m 51 s                                  ~5–8 s
40 concurrent VM creates              not measured; expected MessagingTimeouts  40 / 40 ACTIVE, 0 error, 2:10–2:17
Ceph pools at size=1 (kubeai-cephfs)  2                                         0
Mons low-on-disk warnings             3 (d, n, v)                               0 (threshold + cleanup)
OSDs down due to slow-ops blocks      3 (6, 10, 16)                             0 (osd.1+10 deliberately out)
CP nodes missing dep_check image      2/3                                       0 (preload script idempotent)
memcached SPOF                        yes (1 pod on troubled node)              2 replicas, on different node

12. Follow-up optimizations (manual kubectl patches)

Horizon (additional tuning)

  • WSGI raised from processes=5 threads=1 → processes=12 threads=2 (via direct edit of Secret/horizon-etc).
  • Session backend temporarily changed to django.contrib.sessions.backends.cache → reverted to cached_db after causing repeated re-authentication.
  • Resource requests and limits added to the Deployment (unlike MariaDB below, this block does set limits):
    resources:
      requests:
        cpu: "100m"
        memory: "512Mi"
      limits:
        cpu: "500m"
        memory: "1Gi"

MariaDB (mariadb-server StatefulSet)

  • First component to receive resource requests in this follow-up pass.
  • Generous requests, no limits (as requested):
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
  • Pod is now Burstable QoS. Current usage remains low (~24m / ~700Mi), leaving significant headroom.

These changes were applied via targeted kubectl patch (no Helm upgrades) to avoid destabilizing the control plane during the load-test window.
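The QoS outcome of these resource blocks follows from the Kubernetes classification rule: Guaranteed only when every container has CPU and memory limits equal to its requests, BestEffort when nothing is set, Burstable otherwise. A simplified sketch of that rule is below; it compares quantity strings literally (so "1Gi" and "1024Mi" would count as different) and is not the kubelet's implementation.

```python
def qos_class(containers):
    """Classify a pod from a list of {"requests": {...}, "limits": {...}} dicts."""
    if not any(c.get("requests") or c.get("limits") for c in containers):
        return "BestEffort"
    for c in containers:
        limits = c.get("limits") or {}
        # A request that is omitted defaults to the corresponding limit.
        requests = {**limits, **(c.get("requests") or {})}
        for resource in ("cpu", "memory"):
            if resource not in limits or requests[resource] != limits[resource]:
                return "Burstable"
    return "Guaranteed"


if __name__ == "__main__":
    mariadb = [{"requests": {"cpu": "500m", "memory": "1Gi"}}]
    horizon = [{"requests": {"cpu": "100m", "memory": "512Mi"},
                "limits": {"cpu": "500m", "memory": "1Gi"}}]
    print("mariadb:", qos_class(mariadb))
    print("horizon:", qos_class(horizon))
```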
