Two-day stabilization + scale-up campaign covering Ceph health, the OpenStack control plane (rook-ceph, RabbitMQ, MariaDB, memcached, Horizon, nova/neutron/cinder), and a reproducible load-test harness. End state: cluster healthy, 40 concurrent VM creates verified at 2m10s–2m17s with 0 errors, no MessagingTimeouts in the control plane.
- `rabbitmq-rabbitmq-0` stuck mount (CSI op-lock leftover from the um-kros-03 hard reboot). Fix: restart the `csi-rbdplugin` DaemonSet pod on the node + `csi-rbdplugin-provisioner` rollout.
- PG `1.1d` stuck peering for ~54 min on acting `[10,7,6]` (replicapool, blocking the rabbitmq RBD image). Fix: `ceph pg repeer 1.1d` + restart `osd.6` (um-kros-05).
- PG `3.11` stuck peering for the cinder.volumes pool, caused by `osd.10` slow ops. Fix: restart `osd.10` (um-kros-07). Surfaced the deeper issue (next section).
- um-kros-01 OSDs back: node had been cordoned. `osd.0` + `osd.26` live on `/dev/sda4` + `/dev/sdb4` (system RAID is `sda2` + `sdb2` / `md0`). Uncordoning brought both back into the cluster.
- mon.v off-quorum on um-kros-01: the same uncordon brought it back, restoring 3/3 quorum.
- 3 daemon crashes archived (`ceph crash archive-all`).
- kubeai-cephfs pools flipped from `size=1` to `size=2 min_size=1` (`kubeai-cephfs-metadata`, `kubeai-cephfs-data0`). Eliminates single-OSD data-loss exposure for the AI workload.

`mon_data_avail_warn` lowered cluster-wide from 30% to 15%. Mons sit on 45 GB ext4 with <150 MB data, so the 30% default was overly conservative.

| Node | Before | After | Reclaimed |
|---|---|---|---|
| um-kros-04 | 90% used | 60% | ~13 GB |
| um-kros-03 | 71% used | 62% | ~4 GB |
| um-kros-01 | 76% used | 74% | ~1 GB |
Removed: stale mon-{a..w} directories (only the active mon per host kept), etcd.bak, kubebet.bak-kros-v1, old etcd tarballs, journal vacuum to 7 d, kube-audit rotated logs.
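
As a rough cross-check of why both the cleanup and the lower threshold were needed, the sketch below replays the table against the two thresholds. It assumes the "used" percentages refer to the filesystem that holds the mon data, and that the warning fires when available space is at or below `mon_data_avail_warn`; the function name is illustrative only.

```python
# Percent of the filesystem in use, copied from the table above.
USED_BEFORE = {"um-kros-04": 90, "um-kros-03": 71, "um-kros-01": 76}
USED_AFTER = {"um-kros-04": 60, "um-kros-03": 62, "um-kros-01": 74}


def mons_warning(used_pct, warn_threshold_pct):
    """Return hosts whose available space is at or below the threshold.

    Assumption: the warning triggers when available % <= mon_data_avail_warn.
    """
    return sorted(
        host
        for host, used in used_pct.items()
        if (100 - used) <= warn_threshold_pct
    )


for label, used in (("before cleanup", USED_BEFORE), ("after cleanup", USED_AFTER)):
    for threshold in (30, 15):
        print(label, f"threshold={threshold}%", mons_warning(used, threshold))
```

Under these assumptions, cleanup alone would still leave um-kros-01 (26% available) warning at the 30% default, and the lower threshold alone would still leave um-kros-04 (10% available) warning before cleanup.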
zfs list was burning a full core on um-kros-01, um-kros-03, um-kros-04 continuously. Root cause: docker's zfs storage driver shells out to zfs list -r -t all -Hp ... on every container event; ~1000–1200 datasets per node made each call O(N) and slow.
Action: docker system prune -af --filter "until=240h" plus builder prune on each node.
| Node | Datasets before → after | Reclaimed |
|---|---|---|
| um-kros-01 | 1119 → 361 | 33.8 GB |
| um-kros-03 | 1264 → 472 | 36.4 GB |
| um-kros-04 | 982 → 417 | 17.5 GB |
CPU pressure cleared immediately; load on um-kros-03 dropped from ~5 to <1.
Long-term: migrate dockerd storage driver away from zfs (overlay2 over a zfs dataset is fine) — separate effort, documented but not executed.
Existing setup: single STS replica rabbitmq-rabbitmq-0 in openstack ns, RabbitMQ 3.9.0 on Ceph RBD (768Mi PVC), Erlang VM previously pinned to +S 1:1 (single scheduler).
os/configs/rabbitmq-perf.cm-patch.yaml (committed) + make apply-rabbitmq-tuning:
rabbitmq-env.conf:
SERVER_ADDITIONAL_ERL_ARGS="+S 4:4 +A 128 +K true +sbwt none +sbwtdcpu none +sbwtdio none"
rabbitmq.conf additions:
log.default.level = warning
vm_memory_high_watermark.relative = 0.6
disk_free_limit.absolute = 50MB
collect_statistics_interval = 30000
`disk_free_limit.relative = 1.0` (initial attempt) means "require 1× total system RAM free on the volume", which is impossible on a 768 Mi PVC. Result: permanent disk alarm → all publishers blocked → every `nova-compute` lost heartbeat at the same wall-clock instant → `hypervisor list` showed all hosts `down` → `openstack server list` took 3m51s → horizon hit apache proxy timeout → 504s. Fixed by switching to `disk_free_limit.absolute = 50MB`.

`queue_index_max_journal_entries` and `mnesia.dump_log_write_threshold` aren't valid keys in 3.9's sysctl-format `rabbitmq.conf` (would need `advanced.config`). Removed.

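
To make the `disk_free_limit.relative` failure mode concrete, here is a minimal sketch of the comparison involved. The 64 GiB RAM figure is a hypothetical placeholder (the source does not state how much RAM RabbitMQ sees), and `MB` is taken as 10^6 bytes; the helper names are illustrative.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def relative_limit(total_ram_bytes, factor):
    """disk_free_limit.relative = factor: factor x total RAM must be free."""
    return int(total_ram_bytes * factor)


def disk_alarm(volume_free_bytes, limit_bytes):
    """The disk alarm is raised while free space is below the limit."""
    return volume_free_bytes < limit_bytes


PVC_SIZE = 768 * MIB   # from the text: 768Mi PVC
TOTAL_RAM = 64 * GIB   # hypothetical value, not from the source

# Even a completely empty volume cannot satisfy relative = 1.0.
print(disk_alarm(PVC_SIZE, relative_limit(TOTAL_RAM, 1.0)))

# The absolute limit only requires 50 MB of free space.
print(disk_alarm(PVC_SIZE, 50 * 1000 ** 2))
```

The first check holds for any RAM size above 768 MiB, which is why the alarm could never clear on this volume.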
- `erlang:system_info(schedulers_online)` → 4 (was 1)
- `erlang:system_info(thread_pool_size)` → 128 (was 64 default)
- Auth grant on vhost `/` for the `rabbitmq` admin user added by the perf-test script.
- Single pod was on `um-kros-04` (the disk-pressured / mon-host node) and momentarily refused connections during our work, causing keystone token-validation cascade-fail → Horizon 504s.
- Killed, rescheduled onto `um-kros-03`. Verified RTT ~21 ms from a nova-api pod.
- Scaled to 2 replicas in the Helm values.
quay.io/airshipit/kubernetes-entrypoint:v1.0.0 is Docker manifest schema-v1, which modern dockerd refuses to pull from a remote registry. Every openstack-helm chart uses this image as dep_check init container. A node missing the local cache blocks init containers → nova-scheduler stuck Init:0/3 → MessagingTimeout on every select_destinations → every new VM lands in ERROR (the ddb10561-… case investigated).
Fix: os/scripts/preload-dep-check-image.sh + make preload-dep-check-image — finds a CP node that has the image cached, docker save | docker loads it onto every other CP node tagged openstack-control-plane=enabled. Idempotent. Should be re-run after a fresh node joins or a docker prune.
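
The script itself is not reproduced here. The following sketch only illustrates the selection logic described above (pick one node that has the image, copy to the ones that do not) and why a re-run is a no-op. The function and the node names are hypothetical.

```python
def plan_preload(cache_state):
    """Pick a source node and list the nodes that still need the image.

    cache_state maps node name -> True if the image is cached locally.
    Returns (source, targets). An empty targets list means nothing to do,
    which is what makes repeated runs harmless.
    """
    have = sorted(node for node, cached in cache_state.items() if cached)
    missing = sorted(node for node, cached in cache_state.items() if not cached)
    if not missing:
        return (have[0] if have else None, [])
    if not have:
        raise RuntimeError("no node has the image cached; nothing to copy from")
    return have[0], missing


# Hypothetical control-plane nodes: one has the image, two do not.
state = {"cp-a": True, "cp-b": False, "cp-c": False}
print(plan_preload(state))

# After the copy, every node has the image and the plan is empty.
print(plan_preload({node: True for node in state}))
```

Each (source, target) pair would then correspond to one `docker save | docker load` transfer.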
The Apache WSGIDaemonProcess line baked into the image was:
WSGIDaemonProcess horizon-http processes=5 threads=1
→ 5 concurrent Django requests per pod total. With 3 pods → 15 cluster-wide. The /project/instances/ view fires per-row polling for every visible server (we have 226), instantly saturating the WSGI pool and queuing the rest. Horizon CPU sat idle at <4% while users saw lag.
Fix: patched the horizon-etc Secret (the apache config lives there as 000-default.conf, not in a ConfigMap):
processes=8 threads=15 # 120 concurrent per pod, 360 cluster-wide
Plus deployment-level patch for the readiness probe:
readinessProbe.timeoutSeconds = 5 # was 1 — cold start of 8 WSGI procs exceeded 1s
readinessProbe.initialDelaySeconds = 30
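
The concurrency figures above follow from processes × threads × pods. The sketch below reproduces them and, as a rough illustration, estimates how many "waves" the instances view would need if each of the 226 visible servers triggers one polling request at the same time (that one-request-per-row model is an assumption).

```python
import math


def wsgi_concurrency(processes, threads, pods):
    """Return (per-pod, cluster-wide) concurrent request slots."""
    per_pod = processes * threads
    return per_pod, per_pod * pods


def polling_waves(visible_rows, cluster_slots):
    """Batches needed to serve one polling request per row."""
    return math.ceil(visible_rows / cluster_slots)


_, old_slots = wsgi_concurrency(processes=5, threads=1, pods=3)
_, new_slots = wsgi_concurrency(processes=8, threads=15, pods=3)

print(old_slots, polling_waves(226, old_slots))
print(new_slots, polling_waves(226, new_slots))
```

With the original settings the 226 rows need 16 waves through 15 slots; with the patched settings they fit in a single wave.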
Mid-fix gotcha: an initial kubectl patch secret used the wrong jsonpath escape and wiped the apache config to empty; recovered by reading it back from a still-running pod and re-patching. Worth doing the change in the chart values long-term so it survives helm upgrade.
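
One way to reduce the risk of that kind of escaping mistake is to generate the patch document programmatically instead of hand-writing a path expression. The sketch below builds a JSON Patch for a single Secret key. The helper names are illustrative, and the one-line config body is a placeholder: a real patch must carry the full contents of `000-default.conf`.

```python
import base64
import json


def escape_json_pointer(token):
    """Escape one JSON Pointer segment (RFC 6901): '~' -> '~0', '/' -> '~1'."""
    return token.replace("~", "~0").replace("/", "~1")


def secret_replace_patch(key, new_text):
    """Build a JSON Patch that replaces one key under a Secret's .data."""
    if not new_text.strip():
        # Guard against the failure mode described above.
        raise ValueError("refusing to write an empty config")
    encoded = base64.b64encode(new_text.encode("utf-8")).decode("ascii")
    path = "/data/" + escape_json_pointer(key)
    return [{"op": "replace", "path": path, "value": encoded}]


# Placeholder body, NOT the full Apache config.
conf = "WSGIDaemonProcess horizon-http processes=8 threads=15\n"
print(json.dumps(secret_replace_patch("000-default.conf", conf)))
```

The printed document can be passed to `kubectl patch secret horizon-etc --type=json -p '<output>'` (add the namespace as appropriate). Dots in the key need no escaping in a JSON Pointer, and the empty-content guard refuses to blank the file.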
Cluster was originally sized with most services at 1–2 replicas + `workers = 1` inside each WSGI pod. 40 simultaneous incoming API calls would have serialized through 2 nova-api slots and 1 nova-scheduler.
| Service | Before | After |
|---|---|---|
| `nova-api-osapi` | 2 | 4 |
| `nova-scheduler` | 1 | 3 |
| `nova-conductor` | 2 | 2 |
| `placement-api` | 1 | 2 |
| `cinder-volume` | 1 | 2 |
| `cinder-scheduler` | 1 | 2 |
| `memcached` | 1 | 2 |
Note: the shared values file applies any given pod.replicas.<role> to every chart that defines that role. Documented side effects: pod.replicas.scheduler=3 also bumps cinder-scheduler; pod.replicas.api=2 bumps placement / glance / heat / cinder / keystone where consumed (all already ≥2).
Bump osapi_compute_workers, metadata_workers, neutron workers, cinder workers from 1 → 4 inside oslo configs. Effective concurrency per pod goes from 1 to 4 → another 4× ingress headroom. Useful if traffic doubles from current peak.
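
The headroom arithmetic for nova-api can be sketched as replicas × workers per pod, assuming each worker serves one request at a time (a simplification; the function name is illustrative).

```python
def request_slots(replicas, workers_per_pod):
    """Concurrent request slots, assuming one in-flight request per worker."""
    return replicas * workers_per_pod


original = request_slots(replicas=2, workers_per_pod=1)    # as first sized
scaled_out = request_slots(replicas=4, workers_per_pod=1)  # after replica bump
with_workers = request_slots(replicas=4, workers_per_pod=4)  # workers 1 -> 4

print(original, scaled_out, with_workers)
print(with_workers // scaled_out)  # multiplier from the worker bump alone
```

Under that assumption, nova-api goes from 2 slots to 4 with the replica change, and to 16 if the worker bump is applied on top.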
os/scripts/rabbitmq-loadtest.sh + Makefile targets rabbitmq-loadtest-{baseline,realistic,stress}.
Auto-scrapes admin creds from secret/rabbitmq-admin-user, grants vhost / permissions, spawns a one-shot Pod with pivotalrabbitmq/perf-test:latest, writes results to /tmp/rmq-loadtest-<scenario>-<ts>.log.
| Scenario | Producers/Consumers | Persistent? | avg msg/s | confirm p99 | e2e p99 |
|---|---|---|---|---|---|
| baseline | 1/1 | no | 16,037 | 23 ms | 262 ms |
| realistic (worst-case durable) | 4/4 | yes | 1,011 | 1.45 s | 1.0 s |
| stress | 16/16 | no | 13,839 | 1.73 s* | 1.69 s* |
* stress latency dominated by --confirm 1000 batching, not rabbit. Throughput plateau at ~14–16k msg/s = single-node ceiling.
Verified every openstack vhost (nova/neutron/keystone/cinder/glance/heat) declares only durable=false queues. So the persistent-message penalty doesn't apply to real traffic. Real openstack RPC load is in the low hundreds of msg/s — 30–300× headroom.
os/terraform/vm-loadtest/ (main.tf + variables.tf + versions.tf) + helper scripts:
- `os/scripts/vm-loadtest-quota.sh`: bumps nova + neutron admin quotas, neutron via direct API PUT (the CLI's `check_limit` parameter is rejected by wallaby neutron).
- `os/scripts/vm-loadtest-report.sh`: polls servers, reports ACTIVE/ERROR + fault messages.
Makefile targets: vm-loadtest-quota, vm-loadtest-init, vm-loadtest-apply, vm-loadtest-report, vm-loadtest-destroy, vm-loadtest (chains the first three).
make vm-loadtest N=40
- 40 / 40 ACTIVE, 0 ERROR
- Per-VM end-to-end 2 m 10 s – 2 m 17 s (7 s spread)
- No `MessagingTimeout` in nova/neutron/keystone/cinder/placement/glance logs during the window
- Bottleneck now lives inside the VM (cloud-init), not in the control plane
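
The ACTIVE/ERROR tally that the report step produces can be sketched as follows. This is not the contents of `vm-loadtest-report.sh`; the record shape (`name`, `status`, `fault.message`) and the server names are assumptions made for illustration.

```python
from collections import Counter


def summarize(servers):
    """Count servers by status and collect fault messages for ERROR ones."""
    counts = Counter(s.get("status", "UNKNOWN") for s in servers)
    faults = {
        s["name"]: (s.get("fault") or {}).get("message", "")
        for s in servers
        if s.get("status") == "ERROR"
    }
    return counts, faults


def is_done(counts, expected):
    """Polling can stop once every expected server is ACTIVE or ERROR."""
    return counts["ACTIVE"] + counts["ERROR"] >= expected


# Hypothetical snapshot matching the reported outcome: 40 of 40 ACTIVE.
snapshot = [{"name": f"loadtest-{i}", "status": "ACTIVE"} for i in range(40)]
counts, faults = summarize(snapshot)
print(counts["ACTIVE"], counts["ERROR"], is_done(counts, 40), faults)
```

Treating ERROR as a terminal state alongside ACTIVE keeps the poll loop from waiting forever on a failed create.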
/home/jjo/src/juanjo/um-cloud/kros-v2/os/:
| Path | Purpose |
|---|---|
| `configs/rabbitmq-perf.cm-patch.yaml` | Tier-2 rabbit ConfigMap patch |
| `configs/os-charts-values.tmpl.yaml` | `pod.replicas` block added |
| `scripts/preload-dep-check-image.sh` | Side-load schema-v1 image to CP nodes |
| `scripts/rabbitmq-loadtest.sh` | In-cluster perf-test harness |
| `scripts/vm-loadtest-quota.sh` | Admin quota bump (nova + neutron direct API) |
| `scripts/vm-loadtest-report.sh` | Polls + reports VM outcomes |
| `terraform/vm-loadtest/main.tf` | Concurrent VM creation manifest |
| `terraform/vm-loadtest/variables.tf` | `n`, `name_prefix`, image/flavor/net names |
| `terraform/vm-loadtest/versions.tf` | provider pin + `cloud = openstack_helm` |
| `Makefile` | `apply-rabbitmq-tuning`, `preload-dep-check-image`, `rabbitmq-loadtest-{baseline,realistic,stress}`, `vm-loadtest{,-quota,-init,-apply,-report,-destroy}` |
Plus a planning doc ~/Downloads/um-rabbitmq-local.md for the eventual rabbit→local-storage migration (deemed unnecessary for current throughput, kept for SPOF mitigation).
- um-kros-01 nova-compute in CrashLoopBackOff (EHOSTUNREACH to `keystone-api` ClusterIP). Pre-existing networking/iptables gap, separate from anything load-related. Either fix kube-proxy/CNI on that node or take nova-compute off um-kros-01 entirely if it was never intended as a hypervisor.
- dockerd zfs → overlay2 migration plan (drained nodes, daemon.json switch, image re-pull). Avoids future `zfs list` CPU storms.
- kubernetes-entrypoint image: prune-resistant only as long as `preload-dep-check-image.sh` runs on every new CP node. Long-term, upgrade the chart `dep_check` reference to an OCI-v2 image.
- `workers = 1` in oslo configs everywhere (Quick Win #2); bump to 4 if peak grows.
- Rabbit → local storage migration plan archived at `~/Downloads/um-rabbitmq-local.md`. Throughput gain currently marginal but mitigates the Ceph-coupling SPOF.
- Horizon WSGI changes live in `Secret/horizon-etc` and a kubectl patch; both revert on `helm upgrade`. Mirror to chart values for durability.
- Ceph crash log clean; backfill complete; cluster `HEALTH_WARN` only because of um-kros-07 OSDs intentionally out.
| Metric | Before | After |
|---|---|---|
| RabbitMQ msg/s (transient, 1P/1C) | ~3-5k (estimated, +S 1:1) | 16 k |
| RabbitMQ schedulers online | 1 | 4 |
| RabbitMQ async IO threads | 64 (default) | 128 |
| Horizon WSGI concurrency (cluster) | 15 | 360 |
| `openstack server list` (admin) | 3 m 51 s | ~5–8 s |
| 40 concurrent VM creates | not measured; expected MessagingTimeouts | 40 / 40 ACTIVE, 0 error, 2:10–2:17 |
| Ceph pools at size=1 (kubeai-cephfs) | 2 | 0 |
| Mons low-on-disk warnings | 3 (d, n, v) | 0 (threshold + cleanup) |
| OSDs down due to slow-ops blocks | 3 (6, 10, 16) | 0 (osd.1+10 deliberately out) |
| CP nodes missing `dep_check` image | 2/3 | 0 (preload script idempotent) |
| memcached SPOF | yes (1 pod on troubled node) | 2 replicas, on different node |
- WSGI raised from `processes=5 threads=1` → `processes=12 threads=2` (via direct edit of `Secret/horizon-etc`).
- Session backend temporarily changed to `django.contrib.sessions.backends.cache` → reverted to `cached_db` after causing repeated re-authentication.
- Resource requests added to Deployment (generous, no limits): `resources: {requests: {cpu: "100m", memory: "512Mi"}, limits: {cpu: "500m", memory: "1Gi"}}`
- First component to receive resource requests in this follow-up pass.
- Generous requests, no limits (as requested): `resources: {requests: {cpu: "500m", memory: "1Gi"}}`
- Pod is now `Burstable` QoS. Current usage remains low (~24m / ~700Mi), leaving significant headroom.
These changes were applied via targeted kubectl patch (no Helm upgrades) to avoid destabilizing the control plane during the load-test window.