UM Cloud — Optimization Pass (late May 2026)

Two-day stabilization + scale-up campaign covering Ceph health, the OpenStack control plane (rook-ceph, RabbitMQ, MariaDB, memcached, Horizon, nova/neutron/cinder), and a reproducible load-test harness. End state: cluster healthy, 40 concurrent VM creates verified at 2m10s–2m17s with 0 errors, no MessagingTimeouts in the control plane.

1. Ceph — recovery + resilience

Acute fires

rabbitmq-rabbitmq-0 stuck mount (CSI op-lock leftover from um-kros-03 hard reboot). Fix: restart csi-rbdplugin DaemonSet pod on the node + csi-rbdplugin-provisioner rollout.
PG 1.1d stuck peering for ~54 min on acting [10,7,6] (replicapool, blocking rabbitmq RBD image). Fix: ceph pg repeer 1.1d + restart osd.6 (um-kros-05).
PG 3.11 stuck peering for cinder.volumes pool — osd.10 slow ops. Fix: restart osd.10 (um-kros-07). Surfaced the deeper issue (next section).
um-kros-01 OSDs back: node had been cordoned. osd.0 + osd.26 live on /dev/sda4 + /dev/sdb4 (system raid is sda2+sdb2/md0). Uncordoning brought both back into the cluster.
mon.v off-quorum on um-kros-01 — same uncordon brought it back, restoring 3/3 quorum.
3 daemon crashes archived (ceph crash archive-all).

Lasting changes

kubeai-cephfs pools flipped from size=1 to size=2 min_size=1 (kubeai-cephfs-metadata, kubeai-cephfs-data0). Eliminates single-OSD data-loss exposure for the AI workload.
mon_data_avail_warn lowered cluster-wide from 30% to 15%. Mons sit on 45 GB ext4 with <150 MB data — the 30% default was overly conservative.
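
As a rough illustration of why the size=1 pools were the bigger risk, the sketch below models what happens to a replicated placement group when some of the OSDs holding it are lost. The function name and state labels are hypothetical and only approximate the states Ceph reports; this is a simplified model, not Ceph's actual peering logic.

```python
def pool_state(size: int, min_size: int, osds_lost: int) -> str:
    """Classify a replicated PG after losing `osds_lost` of the OSDs that hold it.

    Simplified model: `size` replicas exist, I/O needs at least `min_size` of them.
    """
    surviving = size - osds_lost
    if surviving <= 0:
        return "no replica left"   # unavailable; lost for good if the OSD never returns
    if surviving < min_size:
        return "inactive"          # replicas exist but client I/O is blocked
    if surviving < size:
        return "degraded"          # I/O continues while Ceph re-replicates
    return "healthy"


for size, min_size in [(1, 1), (2, 1)]:
    state = pool_state(size, min_size, osds_lost=1)
    print(f"size={size} min_size={min_size}: one OSD lost -> {state}")
```

With size=2 min_size=1 a single OSD failure leaves the pool degraded but writable; the trade-off is that min_size=1 still accepts writes while only one copy exists.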

Disk cleanup on mon hosts

Node	Before	After	Reclaimed
um-kros-04	90% used	60%	~13 GB
um-kros-03	71% used	62%	~4 GB
um-kros-01	76% used	74%	~1 GB

Removed: stale mon-{a..w} directories (only the active mon per host kept), etcd.bak, kubebet.bak-kros-v1, old etcd tarballs, journal vacuum to 7 d, kube-audit rotated logs.
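
The interaction between the lowered threshold and the cleanup can be checked with a small sketch. It assumes the "used" percentages in the table refer to the filesystem holding the mon data store, and that the warning fires when available space is at or below mon_data_avail_warn; the function name is hypothetical.

```python
def mon_disk_warns(used_pct: float, avail_warn_pct: float) -> bool:
    """True if available space (100 - used) is at or below the warn threshold."""
    return (100.0 - used_pct) <= avail_warn_pct


# (used % before cleanup, used % after cleanup), taken from the table above
hosts = {
    "um-kros-04": (90, 60),
    "um-kros-03": (71, 62),
    "um-kros-01": (76, 74),
}

for host, (before, after) in hosts.items():
    print(
        host,
        "before@30%:", mon_disk_warns(before, 30),
        "after@30%:", mon_disk_warns(after, 30),
        "after@15%:", mon_disk_warns(after, 15),
    )
```

Under these assumptions um-kros-01 would still warn at the 30% default even after cleanup (26% available), which is why both the threshold change and the cleanup were needed to clear all three warnings.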

2. Docker zfs storage driver pressure

zfs list was burning a full core on um-kros-01, um-kros-03, um-kros-04 continuously. Root cause: docker's zfs storage driver shells out to zfs list -r -t all -Hp ... on every container event; ~1000–1200 datasets per node made each call O(N) and slow.

Action: docker system prune -af --filter "until=240h" plus builder prune on each node.

Node	Datasets before → after	Reclaimed
um-kros-01	1119 → 361	33.8 GB
um-kros-03	1264 → 472	36.4 GB
um-kros-04	982 → 417	17.5 GB

CPU pressure cleared immediately; load on um-kros-03 dropped from ~5 to <1.
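
Since each zfs list call scales with the number of datasets, the per-event cost should drop roughly in proportion to the dataset count. A minimal sketch of that arithmetic, using the figures from the table (the linear-cost assumption is a simplification):

```python
def reduction_pct(before: int, after: int) -> float:
    """Percentage of datasets removed by the prune."""
    return 100.0 * (before - after) / before


datasets = {
    "um-kros-01": (1119, 361),
    "um-kros-03": (1264, 472),
    "um-kros-04": (982, 417),
}

for node, (before, after) in datasets.items():
    print(f"{node}: {before - after} datasets removed, "
          f"{reduction_pct(before, after):.1f}% fewer to list per container event")
```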

Long-term: migrate dockerd storage driver away from zfs (overlay2 over a zfs dataset is fine) — separate effort, documented but not executed.

3. RabbitMQ — Tier-2 tuning + safety fixes

Existing setup: single STS replica rabbitmq-rabbitmq-0 in openstack ns, RabbitMQ 3.9.0 on Ceph RBD (768Mi PVC), Erlang VM previously pinned to +S 1:1 (single scheduler).

Patch

os/configs/rabbitmq-perf.cm-patch.yaml (committed) + make apply-rabbitmq-tuning:

rabbitmq-env.conf:
  SERVER_ADDITIONAL_ERL_ARGS="+S 4:4 +A 128 +K true +sbwt none +sbwtdcpu none +sbwtdio none"

rabbitmq.conf additions:
  log.default.level                  = warning
  vm_memory_high_watermark.relative  = 0.6
  disk_free_limit.absolute           = 50MB
  collect_statistics_interval        = 30000

Bugs caught during tuning

disk_free_limit.relative = 1.0 (initial attempt) means "require 1× total system RAM free on the volume" — impossible on a 768 Mi PVC. Result: permanent disk alarm → all publishers blocked → every nova-compute lost heartbeat at the same wall-clock instant → hypervisor list showed all hosts down → openstack server list took 3m51s → horizon hit apache proxy timeout → 504s. Fixed by switching to disk_free_limit.absolute = 50MB.
queue_index_max_journal_entries and mnesia.dump_log_write_threshold aren't valid keys in 3.9's sysctl-format rabbitmq.conf (would need advanced.config). Removed.
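
The disk_free_limit failure mode is easy to reproduce on paper. The sketch below compares the two styles of limit against the 768 Mi PVC; the 64 GiB node RAM figure is a hypothetical placeholder (the real value is not stated here), and the helper names are illustrative, not RabbitMQ APIs.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def disk_free_limit(total_ram_bytes, relative=None, absolute_bytes=None):
    """Return the free-space floor in bytes for either style of limit."""
    if relative is not None:
        return int(relative * total_ram_bytes)  # relative = multiple of total RAM
    if absolute_bytes is not None:
        return absolute_bytes
    raise ValueError("set relative or absolute_bytes")


def disk_alarm(free_bytes, limit_bytes):
    """The alarm is raised when free space drops below the limit."""
    return free_bytes < limit_bytes


NODE_RAM = 64 * GIB   # hypothetical node RAM, for illustration only
PVC_SIZE = 768 * MIB  # PVC size from the setup described above

relative_limit = disk_free_limit(NODE_RAM, relative=1.0)
absolute_limit = disk_free_limit(NODE_RAM, absolute_bytes=50 * 1000 ** 2)  # 50MB

# Even a completely empty volume cannot satisfy the relative limit.
print("relative=1.0 alarm on empty PVC:", disk_alarm(PVC_SIZE, relative_limit))
# With 200 MiB used, the absolute limit is comfortably met.
print("absolute=50MB alarm:", disk_alarm(PVC_SIZE - 200 * MIB, absolute_limit))
```

Any node with more RAM than the PVC has capacity hits the same permanent alarm, so the relative form is only safe when the data volume is larger than system memory.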

Verified

erlang:system_info(schedulers_online) → 4 (was 1)
erlang:system_info(thread_pool_size) → 128 (was 64 default)
Auth grant on vhost / for the rabbitmq admin user added by the perf-test script.

4. memcached — relocate + de-SPOF

Single pod was on um-kros-04 (the disk-pressured / mon-host node) and momentarily refused connections during our work, causing keystone token-validation cascade-fail → Horizon 504s.
Killed, rescheduled onto um-kros-03. Verified RTT ~21 ms from a nova-api pod.
Scaled to 2 replicas in the Helm values.

5. ImagePullBackOff — `kubernetes-entrypoint` schema-v1 footgun

quay.io/airshipit/kubernetes-entrypoint:v1.0.0 is Docker manifest schema-v1, which modern dockerd refuses to pull from a remote registry. Every openstack-helm chart uses this image as dep_check init container. A node missing the local cache blocks init containers → nova-scheduler stuck Init:0/3 → MessagingTimeout on every select_destinations → every new VM lands in ERROR (the ddb10561-… case investigated).

Fix: os/scripts/preload-dep-check-image.sh + make preload-dep-check-image. The script finds a CP node that has the image cached, then streams it with docker save | docker load onto every other CP node labeled openstack-control-plane=enabled. Idempotent. Should be re-run after a fresh node joins or after a docker prune.
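
The selection logic described above can be sketched as a pure function. This is not the script itself, only an illustration of the idempotent plan it has to compute; the node names and the function name are hypothetical.

```python
from __future__ import annotations


def plan_preload(has_image: dict[str, bool]) -> tuple[str | None, list[str]]:
    """Pick one source node that has the image and list the nodes that still need it.

    Returns (None, []) when every node already has the image, so re-running is a no-op.
    """
    sources = sorted(node for node, cached in has_image.items() if cached)
    targets = sorted(node for node, cached in has_image.items() if not cached)
    if not targets:
        return None, []
    if not sources:
        raise RuntimeError("no control-plane node has the image cached")
    return sources[0], targets


# Hypothetical cache state: one node has the image, two do not.
state = {"cp-a": True, "cp-b": False, "cp-c": False}
source, targets = plan_preload(state)
print("copy from", source, "to", targets)

# After a successful run every node has the image and the plan is empty.
print(plan_preload({node: True for node in state}))
```

The error branch matters in practice: if a prune removes the image from every CP node, there is no longer a source to copy from and the image cannot be re-pulled from the registry.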

6. Horizon — WSGI concurrency

The Apache WSGIDaemonProcess line baked into the image was:

WSGIDaemonProcess horizon-http processes=5 threads=1

→ 5 concurrent Django requests per pod total. With 3 pods → 15 cluster-wide. The /project/instances/ view fires per-row polling for every visible server (we have 226), instantly saturating the WSGI pool and queuing the rest. Horizon CPU sat idle at <4% while users saw lag.

Fix: patched the horizon-etc Secret (the apache config lives there as 000-default.conf, not in a ConfigMap):

processes=8 threads=15        # 120 concurrent per pod, 360 cluster-wide
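
The concurrency figures follow directly from processes × threads × pods. A small sketch, assuming one in-flight request per visible server row when the instances view polls (the helper names are hypothetical):

```python
def wsgi_slots(processes: int, threads: int, pods: int = 1) -> int:
    """Concurrent request slots offered by mod_wsgi daemon processes."""
    return processes * threads * pods


def queued_requests(in_flight: int, slots: int) -> int:
    """Requests that have to wait because every slot is busy."""
    return max(0, in_flight - slots)


VISIBLE_SERVERS = 226  # rows polled by /project/instances/
PODS = 3

before = wsgi_slots(5, 1, PODS)
after = wsgi_slots(8, 15, PODS)

print("before:", before, "slots,", queued_requests(VISIBLE_SERVERS, before), "queued")
print("after: ", after, "slots,", queued_requests(VISIBLE_SERVERS, after), "queued")
```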

Plus deployment-level patch for the readiness probe:

readinessProbe.timeoutSeconds = 5     # was 1 — cold start of 8 WSGI procs exceeded 1s
readinessProbe.initialDelaySeconds = 30

Mid-fix gotcha: an initial kubectl patch secret used the wrong jsonpath escape and wiped the apache config to empty; recovered by reading it back from a still-running pod and re-patching. Worth doing the change in the chart values long-term so it survives helm upgrade.
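
One way to avoid the escaping problem is to build the patch body programmatically and refuse to emit it when the content is empty. The sketch below produces a JSON merge patch keyed by the Secret data key, so no path escaping is involved; the function name is hypothetical and the sample config line is only a placeholder for the full apache config.

```python
import base64
import json


def build_secret_patch(key: str, content: str) -> str:
    """Return a JSON merge-patch body that sets one key of a Secret's data.

    Refuses empty content so a bad read cannot blank out the config.
    """
    if not content.strip():
        raise ValueError(f"refusing to write empty content to {key!r}")
    encoded = base64.b64encode(content.encode("utf-8")).decode("ascii")
    return json.dumps({"data": {key: encoded}})


# Placeholder content; in practice this would be the complete 000-default.conf.
apache_conf = "WSGIDaemonProcess horizon-http processes=8 threads=15\n"
patch = build_secret_patch("000-default.conf", apache_conf)
print(patch)
```

The resulting string could then be passed to kubectl patch secret horizon-etc --type merge -p, assuming the Secret lives in the namespace the current context points at.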

7. Scale-up for ~40 concurrent VM peak (replicas)

Cluster was originally sized with most services at 1–2 replicas + workers = 1 inside each WSGI pod. 40 simultaneous incoming API calls would have serialized through 2 nova-api slots and 1 nova-scheduler.

Live + chart values updated (`pod.replicas` block in `os/configs/os-charts-values.tmpl.yaml`)

Service	Before	After
`nova-api-osapi`	2	4
`nova-scheduler`	1	3
`nova-conductor`	2	2
`placement-api`	1	2
`cinder-volume`	1	2
`cinder-scheduler`	1	2
`memcached`	1	2

Note: the shared values file applies any given pod.replicas.<role> to every chart that defines that role. Documented side effects: pod.replicas.scheduler=3 also bumps cinder-scheduler; pod.replicas.api=2 bumps placement / glance / heat / cinder / keystone where consumed (all already ≥2).
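
The fan-out behaviour in the note can be pictured as a lookup from role name to every chart that defines that role. In the sketch below the chart-to-role mapping is illustrative: only the api and scheduler relationships come from the note above, and the other role names are hypothetical.

```python
def charts_affected(role: str, chart_roles: dict) -> list:
    """List the charts whose pod.replicas block contains the given role."""
    return sorted(chart for chart, roles in chart_roles.items() if role in roles)


# Illustrative mapping; role names other than "api" and "scheduler" are hypothetical.
CHART_ROLES = {
    "nova": {"osapi", "scheduler", "conductor"},
    "cinder": {"api", "scheduler", "volume"},
    "placement": {"api"},
    "glance": {"api"},
    "heat": {"api"},
    "keystone": {"api"},
}

for role in ("scheduler", "api"):
    print(f"pod.replicas.{role} applies to:", charts_affected(role, CHART_ROLES))
```

Running a check like this before changing a shared key makes the side effects explicit instead of discovering them after the rollout.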

Quick-win #2 (not yet applied)

Bump osapi_compute_workers, metadata_workers, neutron workers, cinder workers from 1 → 4 inside oslo configs. Effective concurrency per pod goes from 1 to 4 → another 4× ingress headroom. Useful if traffic doubles from current peak.

8. Load-test harness — reproducible benchmarks

RabbitMQ — `rabbitmq-perf-test` driven from an in-cluster Pod

os/scripts/rabbitmq-loadtest.sh + Makefile targets rabbitmq-loadtest-{baseline,realistic,stress}.

Auto-scrapes admin creds from secret/rabbitmq-admin-user, grants vhost / permissions, spawns a one-shot Pod with pivotalrabbitmq/perf-test:latest, writes results to /tmp/rmq-loadtest-<scenario>-<ts>.log.

Numbers (post-Tier-2, mid Ceph recovery)

Scenario	Producers/Consumers	Persistent?	avg msg/s	confirm p99	e2e p99
baseline	1/1	no	16,037	23 ms	262 ms
realistic (worst-case durable)	4/4	yes	1,011	1.45 s	1.0 s
stress	16/16	no	13,839	1.73 s*	1.69 s*

* stress latency dominated by --confirm 1000 batching, not rabbit. Throughput plateau at ~14–16k msg/s = single-node ceiling.

Verified every openstack vhost (nova/neutron/keystone/cinder/glance/heat) declares only durable=false queues. So the persistent-message penalty doesn't apply to real traffic. Real openstack RPC load is in the low hundreds of msg/s — 30–300× headroom.

VM creation — Terraform/OpenTofu

os/terraform/vm-loadtest/ (main.tf + variables.tf + versions.tf) + helper scripts:

os/scripts/vm-loadtest-quota.sh — bumps nova + neutron admin quotas, neutron via direct API PUT (CLI's check_limit parameter rejected by wallaby neutron).
os/scripts/vm-loadtest-report.sh — polls servers, reports ACTIVE/ERROR + fault messages.

Makefile targets: vm-loadtest-quota, vm-loadtest-init, vm-loadtest-apply, vm-loadtest-report, vm-loadtest-destroy, vm-loadtest (chains the first three).

make vm-loadtest N=40

Result

40 / 40 ACTIVE, 0 ERROR
Per-VM end-to-end 2 m 10 s – 2 m 17 s (7 s spread)
No MessagingTimeout in nova/neutron/keystone/cinder/placement/glance logs during the window
Bottleneck now lives inside the VM (cloud-init), not in the control plane
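
The kind of summary the report step produces can be sketched as follows. This is not the contents of vm-loadtest-report.sh; the record layout and function name are hypothetical, and the per-VM timings are synthetic values chosen only to span the observed 2 m 10 s to 2 m 17 s range.

```python
from collections import Counter


def summarize(servers):
    """Tally server states and compute the spread of time-to-ACTIVE in seconds."""
    counts = Counter(s["status"] for s in servers)
    active = [s["seconds"] for s in servers if s["status"] == "ACTIVE"]
    return {
        "active": counts["ACTIVE"],
        "error": counts["ERROR"],
        "spread_s": max(active) - min(active) if active else 0,
        "faults": [s.get("fault", "") for s in servers if s["status"] == "ERROR"],
    }


# Synthetic records: 40 VMs, all ACTIVE, finishing between 130 s and 137 s.
servers = [
    {"name": f"vm-{i:02d}", "status": "ACTIVE", "seconds": 130 + i % 8}
    for i in range(40)
]

print(summarize(servers))
```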

9. Repo deltas (durable)

/home/jjo/src/juanjo/um-cloud/kros-v2/os/:

Path	Purpose
`configs/rabbitmq-perf.cm-patch.yaml`	Tier-2 rabbit ConfigMap patch
`configs/os-charts-values.tmpl.yaml`	`pod.replicas` block added
`scripts/preload-dep-check-image.sh`	Side-load schema-v1 image to CP nodes
`scripts/rabbitmq-loadtest.sh`	In-cluster perf-test harness
`scripts/vm-loadtest-quota.sh`	Admin quota bump (nova + neutron direct API)
`scripts/vm-loadtest-report.sh`	Polls + reports VM outcomes
`terraform/vm-loadtest/main.tf`	Concurrent VM creation manifest
`terraform/vm-loadtest/variables.tf`	`n`, `name_prefix`, image/flavor/net names
`terraform/vm-loadtest/versions.tf`	provider pin + `cloud = openstack_helm`
`Makefile`	`apply-rabbitmq-tuning`, `preload-dep-check-image`, `rabbitmq-loadtest-{baseline,realistic,stress}`, `vm-loadtest{,-quota,-init,-apply,-report,-destroy}`

Plus a planning doc ~/Downloads/um-rabbitmq-local.md for the eventual rabbit→local-storage migration (deemed unnecessary for current throughput, kept for SPOF mitigation).

10. Open items

um-kros-01 nova-compute in CrashLoopBackOff (EHOSTUNREACH to keystone-api ClusterIP). Pre-existing networking/iptables gap — separate from anything load-related. Either fix kube-proxy/CNI on that node or take nova-compute off um-kros-01 entirely if it was never intended as a hypervisor.
dockerd zfs → overlay2 migration plan (drained nodes, daemon.json switch, image re-pull). Avoid future zfs list CPU storms.
kubernetes-entrypoint image: prune-resistant only as long as preload-dep-check-image.sh runs on every new CP node and after every prune. Long-term, point the chart dep_check reference at an image published with a schema-v2/OCI manifest.
workers = 1 in oslo configs everywhere — Quick Win #2; bump to 4 if peak grows.
Rabbit → local storage migration plan archived at ~/Downloads/um-rabbitmq-local.md. Throughput gain currently marginal but mitigates the Ceph-coupling SPOF.
Horizon WSGI changes live in Secret/horizon-etc and a kubectl patch — both revert on helm upgrade. Mirror to chart values for durability.
Ceph crash log clean; backfill complete; cluster HEALTH_WARN only because of um-kros-07 OSDs intentionally out.

11. Headline metrics, before → after

Metric	Before	After
RabbitMQ msg/s (transient, 1P/1C)	~3-5k (estimated, +S 1:1)	16 k
RabbitMQ schedulers online	1	4
RabbitMQ async IO threads	64 (default)	128
Horizon WSGI concurrency (cluster)	15	360
`openstack server list` (admin)	3 m 51 s	~5–8 s
40 concurrent VM creates	not measured; expected MessagingTimeouts	40 / 40 ACTIVE, 0 error, 2:10–2:17
Ceph pools at size=1 (kubeai-cephfs)	2	0
Mons low-on-disk warnings	3 (d, n, v)	0 (threshold + cleanup)
OSDs down due to slow-ops blocks	3 (6, 10, 16)	0 (osd.1+10 deliberately out)
CP nodes missing `dep_check` image	2/3	0 (preload script idempotent)
memcached SPOF	yes (1 pod on troubled node)	2 replicas, on different node

12. Follow-up optimizations (manual kubectl patches)

Horizon (additional tuning)

WSGI raised from processes=5 threads=1 → processes=12 threads=2 (via direct edit of Secret/horizon-etc).
Session backend temporarily changed to django.contrib.sessions.backends.cache → reverted to cached_db after causing repeated re-authentication.

Resource requests and limits added to the Deployment:

resources:
  requests:
    cpu: "100m"
    memory: "512Mi"
  limits:
    cpu: "500m"
    memory: "1Gi"

MariaDB (`mariadb-server` StatefulSet)

First component to receive resource requests in this follow-up pass.

Generous requests, no limits:

resources:
  requests:
    cpu: "500m"
    memory: "1Gi"

Pod is now Burstable QoS. Current usage remains low (~24m / ~700Mi), leaving significant headroom.
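
The Burstable classification follows from the Kubernetes QoS rules, which can be sketched as below. The function is a simplified model (it compares quantity strings literally and ignores init containers), not the kubelet's implementation.

```python
def qos_class(containers):
    """Approximate the Kubernetes QoS class from per-container requests and limits.

    Guaranteed: every container has cpu and memory limits, with requests equal to
    limits (unset requests default to the limit). BestEffort: nothing set at all.
    Everything else is Burstable. Quantities are compared as literal strings.
    """
    anything_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            anything_set = True
        for resource in ("cpu", "memory"):
            if resource not in limits:
                guaranteed = False
            elif requests.get(resource, limits[resource]) != limits[resource]:
                guaranteed = False
    if not anything_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# Requests only, as in the MariaDB patch above.
mariadb = [{"requests": {"cpu": "500m", "memory": "1Gi"}}]
print("mariadb-server:", qos_class(mariadb))
```

A pod only becomes Guaranteed when limits are set and match the requests, so requests without limits always land in Burstable.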

These changes were applied via targeted kubectl patch (no Helm upgrades) to avoid destabilizing the control plane during the load-test window.

jjo/um-optimized-May2026.md
