This document explains every file and directory in this repository and maps each one to the corresponding deliverable from the senior MLOps/Infrastructure Engineer challenge prompt.
The prompt asked for the design and implementation of a Zero-Downtime Deployment and Replication Health System for a healthcare product (eFiche) deployed across Rwanda — a centralized Kigali primary and Raspberry Pi edge nodes at rural clinics. Every deliverable in the prompt has a corresponding artifact in this repo.
The project index. It maps each prompt deliverable to the file that implements it and provides runnable instructions for testing locally (spin up Redis, stub the Postgres replica, call /replication-health, run unit tests). It also documents what is real vs. stubbed — critical in a system where a live Postgres replica is not always available during development.
This directory is the direct replacement and improvement of the ops_agent/monitor.py snippet given in the prompt.
The main FastAPI service. It addresses two deliverables from the prompt.
Deliverable 2 — Replication monitoring endpoint that detects lag trend, not just current lag.
The original monitor.py only returned {"lag_seconds": ...} — a snapshot with no context. The new /replication-health endpoint does significantly more:
- Queries `pg_last_xact_replay_timestamp()` on the Postgres replica (same SQL as the original, but via the `asyncpg` async driver).
- Pushes every reading into Redis using `LPUSH` + `LTRIM` to maintain a rolling window of the last 5 readings (`REDIS_HISTORY_KEY = "replication:lag:history"`).
- `classify_trend()` compares the oldest vs. newest reading in that window:
  - Lag grew by ≥ 2 seconds → `"growing"`
  - Lag shrank by ≥ 2 seconds → `"recovering"`
  - Otherwise → `"stable"`

  This directly solves the prompt's complaint that the original endpoint only told you the current value, not whether the situation is getting better or worse.
- `is_degraded()` flags a boolean `True` if the newest reading is more than 10 seconds higher than the reading three slots back, a faster signal for a replica that is falling behind dangerously.
- The response returns all five fields: `lag_seconds`, `trend`, `degraded`, `last_checked`, `history`.
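The two rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in the repository: it assumes the history is passed oldest-first, that fewer readings than a rule needs yields the non-alarming result, and that "three slots back" means index `-4` of a chronological list. The constant names are made up for the example.

```python
from typing import Sequence

TREND_THRESHOLD_S = 2.0   # from the text: a change of >= 2 s flips the trend
DEGRADED_JUMP_S = 10.0    # from the text: a jump of > 10 s marks degradation
DEGRADED_LOOKBACK = 3     # from the text: compare with the reading three slots back


def classify_trend(history: Sequence[float]) -> str:
    """Compare the oldest and newest readings (history assumed oldest-first)."""
    if len(history) < 2:
        return "stable"  # assumption: too little data is reported as stable
    delta = history[-1] - history[0]
    if delta >= TREND_THRESHOLD_S:
        return "growing"
    if delta <= -TREND_THRESHOLD_S:
        return "recovering"
    return "stable"


def is_degraded(history: Sequence[float]) -> bool:
    """True when the newest reading jumped more than 10 s over three slots."""
    if len(history) <= DEGRADED_LOOKBACK:
        return False  # assumption: not enough readings to compare
    return history[-1] - history[-1 - DEGRADED_LOOKBACK] > DEGRADED_JUMP_S
```

Keeping both rules as pure functions over a list means they can be unit-tested without Redis or Postgres running.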
Why Redis? The edge nodes are Raspberry Pis with 2 GB RAM and intermittent connectivity. Redis is lightweight, in-memory, and acts as a local circular buffer to hold lag history across requests without needing a persistent DB write.
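The circular-buffer behaviour of `LPUSH` followed by `LTRIM` can be sketched as follows. The helper `record_lag` and the `InMemoryListStore` class are hypothetical names introduced only for this example. The store is a stand-in that mimics the three list commands used here so the sketch runs without a Redis server; an async Redis client exposing `lpush`, `ltrim`, and `lrange` could be passed in its place.

```python
import asyncio

REDIS_HISTORY_KEY = "replication:lag:history"  # key name taken from the text
WINDOW_SIZE = 5                                # rolling window of 5 readings


async def record_lag(client, lag_seconds: float) -> list[float]:
    """Push one reading, trim to the window, return readings oldest-first."""
    await client.lpush(REDIS_HISTORY_KEY, lag_seconds)
    await client.ltrim(REDIS_HISTORY_KEY, 0, WINDOW_SIZE - 1)
    raw = await client.lrange(REDIS_HISTORY_KEY, 0, -1)
    # LPUSH places the newest reading at index 0, so reverse for oldest-first.
    return [float(value) for value in reversed(raw)]


class InMemoryListStore:
    """Hypothetical stand-in covering only the calls record_lag makes."""

    def __init__(self) -> None:
        self._lists: dict[str, list[str]] = {}

    async def lpush(self, key, *values):
        items = self._lists.setdefault(key, [])
        for value in values:
            items.insert(0, str(value))
        return len(items)

    async def ltrim(self, key, start, end):
        # Only non-negative indexes are handled, which is all the sketch needs.
        self._lists[key] = self._lists.get(key, [])[start:end + 1]
        return True

    async def lrange(self, key, start, end):
        items = self._lists.get(key, [])
        stop = None if end == -1 else end + 1
        return items[start:stop]


async def demo() -> list[float]:
    store = InMemoryListStore()
    window: list[float] = []
    for lag in [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]:
        window = await record_lag(store, lag)
    return window


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

After seven pushes only the five most recent readings remain, which is why memory use stays constant on a 2 GB node.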
The DATABASE_URL and REDIS_URL are read from environment variables, connecting this directly to the config validation deliverable — those env vars must exist at deploy time.
A local development helper. Because a live Postgres replica streaming WAL is not always available in a local dev environment, this file uses FastAPI's dependency_overrides to swap out the real get_db dependency with a FakeSession that returns a hardcoded timedelta(seconds=4.2). You point uvicorn at dev_stub_runner:app instead of app:app. This does not touch any production code path.
Dependencies for the service:
| Package | Purpose |
|---|---|
| `fastapi`, `uvicorn` | Web server |
| `sqlalchemy`, `asyncpg` | Async Postgres driver for querying the replica |
| `redis` | Async Redis client for the lag history window |
| `pandas`, `scikit-learn` | Present because the service also carries a `/predict` stub (ML inference placeholder) |
| `pytest` | Unit test gate in CI |
Unit tests for the classify_trend() function. This maps directly to Deliverable 3 — an improved CI pipeline that catches classes of bugs the current pipeline misses. The prompt's original CI had no Python tests at all. These two tests verify:
- A lag series increasing from 1.1 s to 4.2 s over five readings is classified as `"growing"`.
- A lag series decreasing from 15.0 s to 4.1 s is classified as `"recovering"`.
These are regression guards: if someone refactors classify_trend() and breaks the comparison logic (e.g., flips a sign or changes the threshold), the CI pipeline fails before any broken code reaches production and misclassifies an escalating replication problem as stable.
Deliverable 1 — safe migration strategy for live production data under constant write load. The filename encodes the date (April 26, 2026) in standard ordered-migration format.
The prompt included a dangerous one-liner:
`ALTER TABLE visit_invoices ADD COLUMN billing_status VARCHAR(20) NOT NULL DEFAULT 'pending';`

That single statement on a 3.1 million row table with ~3,000 visits/day:
- Takes a heavy table lock → stalls writes → clinical incident.
- Generates a WAL burst touching every row on PostgreSQL versions that rewrite the table for this statement (before 11) → replicas fall behind.
- Has no recovery checkpoint if it fails partway through.
The file replaces this with a three-step strategy:
- Step 1: Non-blocking schema prep

  `ADD COLUMN IF NOT EXISTS billing_status VARCHAR(20)` (nullable, no default). This is a metadata-only operation in modern PostgreSQL. No table rewrite, minimal lock window. The application keeps writing immediately.
- Step 2: Batched backfill

  A `DO $$ ... LOOP` that updates rows in batches of 10,000 via `ctid` (physical row pointer, no secondary index needed), sleeping 50 ms between batches. This spreads IO and WAL across time, so replicas stay in sync. The backfill is re-entrant: it only touches rows `WHERE billing_status IS NULL`, so it can be paused and resumed safely.
- Step 3: Defensive constraint enforcement

  Adds `CHECK ... NOT VALID` first, then runs `VALIDATE CONSTRAINT` (which fails closed if any nulls remain from a partial backfill), then sets the column default and `NOT NULL`. The validation step is the safety gate the prompt described as missing.
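The re-entrant batching idea from Step 2 can be illustrated with a small Python sketch. It is not the migration file itself. It swaps in SQLite (via the standard-library `sqlite3` module) as a stand-in for PostgreSQL and selects batches by an assumed integer `id` column rather than `ctid`, so treat it only as a demonstration of the loop shape: update a bounded batch of `NULL` rows, commit, pause, and stop when nothing is left.

```python
import sqlite3
import time


def backfill_in_batches(conn, batch_size=10_000, pause_s=0.05) -> int:
    """Fill NULL billing_status rows batch by batch; returns batches written."""
    batches = 0
    while True:
        cur = conn.execute(
            """
            UPDATE visit_invoices
               SET billing_status = 'pending'
             WHERE id IN (SELECT id
                            FROM visit_invoices
                           WHERE billing_status IS NULL
                           LIMIT ?)
            """,
            (batch_size,),
        )
        conn.commit()  # commit per batch so finished work survives an interruption
        if cur.rowcount == 0:
            return batches  # nothing left to fill: safe to re-run at any time
        batches += 1
        time.sleep(pause_s)  # spread write load out over time


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE visit_invoices (id INTEGER PRIMARY KEY, billing_status TEXT)"
    )
    conn.executemany(
        "INSERT INTO visit_invoices (id) VALUES (?)", [(i,) for i in range(1, 26)]
    )
    conn.commit()
    batches = backfill_in_batches(conn, batch_size=10, pause_s=0.0)
    remaining = conn.execute(
        "SELECT COUNT(*) FROM visit_invoices WHERE billing_status IS NULL"
    ).fetchone()[0]
    return batches, remaining
```

Because each pass only selects rows that are still `NULL`, calling the function a second time after an interruption simply continues where the previous run stopped.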
Deliverable 3 — an improved CI pipeline. The prompt showed a GitLab CI pipeline with four problems:
- Lint only covered PHP; no Python quality gate.
- No migration safety check — a broken SQL script would reach production.
- No config validation — a missing env var causes the app to crash on first request, not at deploy time.
- No unit tests — the trend classification logic could be silently broken.
The GitHub Actions workflow builds and deploys the Docker image to Kubernetes (using GCR and kubectl apply). The docs/submission_analysis.md section c details exactly which four new jobs were designed to be added on top of this base pipeline:
| New CI Job | What it runs | Incidents prevented |
|---|---|---|
| `migration_safety_check` | Boots throwaway Postgres, applies SQL file, verifies no NULLs remain | Runtime migration failures, constraints applied too early, broken backfill scripts |
| `config_validation` | Shell assertions that `DATABASE_URL`, `REDIS_URL`, and split tokens exist | App crash-loops on missing env vars, dashboard starting in insecure fallback mode |
| `python_lint` | `ruff check app` | Broken imports, dead code paths, accidental debug leftovers |
| `unit_tests` | `pytest -q app/tests` | Trend misclassification after refactor, alert logic regressions |
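The `config_validation` job is described as shell assertions. The same fail-fast check can be sketched in Python for illustration. The function names are hypothetical, and the assumption that "split tokens" refers to `OPS_METRICS_TOKEN` and `OPS_ADMIN_TOKEN` is taken from the security memo described later in this document, not from the CI file.

```python
import os
from typing import Mapping, Sequence

# Assumed list of required settings; adjust to match the real deployment.
REQUIRED_VARS = (
    "DATABASE_URL",
    "REDIS_URL",
    "OPS_METRICS_TOKEN",
    "OPS_ADMIN_TOKEN",
)


def missing_vars(
    environ: Mapping[str, str], required: Sequence[str] = REQUIRED_VARS
) -> list[str]:
    """Return required names that are unset or blank, in declaration order."""
    return [name for name in required if not environ.get(name, "").strip()]


def main() -> int:
    """Exit code for a CI step: 0 when everything is set, 1 otherwise."""
    missing = missing_vars(os.environ)
    if missing:
        print("missing required settings: " + ", ".join(missing))
        return 1
    return 0
```

Treating a blank value as missing catches the case where a variable is declared in the deploy config but never given a value.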
Declares a Kubernetes Deployment for the FastAPI service. It runs 3 replicas, meaning rolling updates keep at least some pods alive at all times — this is how zero-downtime is achieved on the Kigali central server side. Kubernetes' default rolling update strategy brings up new pods, waits for readiness, then terminates old ones. Container port is 8080, matching what uvicorn listens on.
A LoadBalancer Service that exposes port 80 externally and forwards traffic to port 8080 on the pods. This is the entry point for the ops dashboard or any internal client calling /replication-health. The selector: app: ml-model must match the labels in the Deployment — it is the coupling point between the two manifests.
Important context from the prompt: The Raspberry Pi edge nodes explicitly run Docker Compose, not Kubernetes (see `design_document.md`). The k8s manifests are for the Kigali central server only. That distinction is architecturally significant: 2 GB RAM and intermittent connectivity make Kubernetes impractical at the edge.
Provisions a Google Kubernetes Engine (GKE) cluster with 2 nodes on e2-medium machines. This is the infrastructure where the k8s/ manifests get applied. The cluster (mlops-cluster) in us-central1 serves as a cloud analogue of the Kigali central infrastructure.
Declares three input variables: project_id, region (defaulting to us-central1), and cluster_name (defaulting to mlops-prod-cluster). These allow the same Terraform config to be applied to different environments without code changes.
Exports the cluster name and region after terraform apply so downstream tools (e.g., a CI step running gcloud container clusters get-credentials) can reference them without hardcoding values.
These are the written deliverables the prompt explicitly required. The code alone is not sufficient — the prompt asked for analysis and architectural reasoning alongside implementation.
Deliverable 1 — written portions a–d.
| Section | Content |
|---|---|
| a | Why the naive one-line ALTER TABLE is dangerous (lock risk, WAL burst, no recovery checkpoint) and how the three-step migration in sql/ addresses each problem |
| b | Root cause and recovery procedure for the ghost column WAL replay error (column "billing_status" of relation "visit_invoices" does not exist) — including exactly where data loss can occur if slots are dropped prematurely |
| c | CI gap analysis: four missing check categories and one new CI job designed for each |
| d | 190-word internal memo on replacing the single OPS_API_KEY with a two-token model (directly addresses Deliverable 4) |
Deliverable 3 (a–c) — deployment and infrastructure design.
| Section | Content |
|---|---|
| a | RPi edge node deployment strategy: pull-based model (nodes poll for release manifest every 5 min with jitter, no push dependency), staged container swap with smoke checks, pre-flight gate that blocks deployment if replica lag exceeds 1800 s or local schema is behind |
| b | End-to-end migration safety process in a replication context: phase 0 prechecks, phase 1 schema-add on primary, phase 2 batched backfill with monitoring, phase 3 constraint enforcement; migration_state dashboard field; two feature flags (write_billing_status, use_billing_status_for_logic) |
| c | What not to automate in V1: auto-triggering destructive schema sync, auto-reinitializing a lagging replica, auto-rolling-back on a single failed healthcheck, and auto-restarting the DB container are all dangerous in a healthcare/clinic context despite being technically feasible |
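The pull-based polling and pre-flight gate from row a can be sketched as below. This is an illustration under assumptions, not the design document's implementation: the `NodeState` fields, the integer schema versions, and the 20% jitter fraction are hypothetical. Only the 5 minute interval and the 1800 s lag limit come from the table.

```python
import random
from dataclasses import dataclass

POLL_INTERVAL_S = 300      # "every 5 min" from the table
MAX_REPLICA_LAG_S = 1800   # lag limit from the table
JITTER_FRACTION = 0.2      # hypothetical: spread polls by +/- 20%


@dataclass(frozen=True)
class NodeState:
    """Hypothetical snapshot of what an edge node knows before deploying."""
    replica_lag_s: float
    local_schema_version: int
    required_schema_version: int


def next_poll_delay(rng: random.Random) -> float:
    """Seconds until the next manifest poll, jittered so nodes do not sync up."""
    jitter = rng.uniform(-JITTER_FRACTION, JITTER_FRACTION)
    return POLL_INTERVAL_S * (1 + jitter)


def preflight_blockers(state: NodeState) -> list[str]:
    """Reasons to refuse a deployment; an empty list means the gate is open."""
    blockers = []
    if state.replica_lag_s > MAX_REPLICA_LAG_S:
        blockers.append("replica lag above limit")
    if state.local_schema_version < state.required_schema_version:
        blockers.append("local schema behind release")
    return blockers
```

Returning the list of reasons instead of a bare boolean gives the dashboard something concrete to display when a node declines a release.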
Deliverable 4 — security posture for the ops dashboard.
The prompt showed a single OPS_API_KEY authorizing both read-only metrics and destructive actions. This memo proposes:
- `OPS_METRICS_TOKEN` for read-only endpoints.
- `OPS_ADMIN_TOKEN` for mutating/destructive endpoints.
- Mandatory `X-Action-Confirm: true` header for destructive actions.
- Immutable audit logs (timestamp, action, token class, source IP, request ID).
- Quarterly rotation and immediate rotation on team membership change.
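The two-token rule can be sketched as a small authorization function. This is a sketch under assumptions, not the memo's implementation: the `X-Ops-Token` header name and the `authorize` function are hypothetical, and letting the admin token also read metrics is an assumption. Only the two token classes and the `X-Action-Confirm: true` header come from the list above.

```python
import hmac
from typing import Mapping


def _matches(presented: str, expected: str) -> bool:
    """Constant-time comparison; an unset expected token never matches."""
    if not expected:
        return False
    return hmac.compare_digest(presented.encode(), expected.encode())


def authorize(
    headers: Mapping[str, str],
    destructive: bool,
    metrics_token: str,
    admin_token: str,
) -> tuple[bool, str]:
    """Return (allowed, reason). Header keys are matched exactly as given."""
    presented = headers.get("X-Ops-Token", "")  # hypothetical header name
    if destructive:
        if not _matches(presented, admin_token):
            return False, "admin token required"
        if headers.get("X-Action-Confirm", "").lower() != "true":
            return False, "missing X-Action-Confirm: true"
        return True, "admin"
    # Assumption: the admin token may also call read-only endpoints.
    if _matches(presented, metrics_token) or _matches(presented, admin_token):
        return True, "read"
    return False, "valid token required"
```

The returned reason string is the kind of value that could be written to the audit log alongside the token class.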
The justification explicitly argues against jumping straight to full OIDC/SSO/RBAC — at a 4-engineer team size that adds disproportionate implementation burden without equivalent risk reduction. Full per-user authN/authZ is called out as the likely next step, not the current one.
A serialized scikit-learn model binary. The /predict endpoint in app.py is a placeholder stub, so this file represents an ML model that could be served. In the context of the assignment the primary deliverables are the ops/replication side; the inference path is background infrastructure.
| Prompt Deliverable | Primary Artifacts |
|---|---|
| Safe migration strategy for live writes | sql/20260426_add_billing_status_safe.sql, docs/submission_analysis.md §a |
| Replication monitoring with lag trend | app/app.py (/replication-health, classify_trend, is_degraded), app/tests/test_replication_health.py |
| Improved CI pipeline | .github/workflows/deploy.yml, docs/submission_analysis.md §c |
| Ops dashboard security (4-engineer team) | docs/internal_security_memo.md, docs/submission_analysis.md §d |
| RPi edge deployment (2 GB RAM, intermittent) | docs/design_document.md §a–c |
| Infrastructure provisioning | terraform/, k8s/ |
| Ghost column WAL replay recovery | docs/submission_analysis.md §b |