@gsingal
Last active April 25, 2026 07:42
QA Smoke Deterministic Reset Foundation — Strategic Plan for team review

QA Smoke Deterministic Reset Foundation

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Problem & Why Now ← Dimension 1

Every smoke run inherits state from prior runs because QA server state is never fully reset between runs. The team has been patching symptoms for months — ownership guards for stale listing.json (PR #4283), silent-skip fallbacks when the demo user is fraud-flagged (PR #4279), test.skip() cascades when ephemeral lister state is wrong, content tests that accept "any of three states" as passing (#3655 Playwright specs). Each workaround makes the suite weaker: it proves fewer things, accepts more failure modes as "green," and hides real regressions under fake passes.

The 2026-04-21 smoke audit made the cost concrete: 4 of Megan's 7 dispositioned findings were not real app bugs but cross-run state drift. The 3 remaining required app-code fixes in PR #4279 — but those fixes only unblock the workarounds; they don't remove the class of bug. The next run-state drift (subscription left active, room left under_review, user left blocked, verification code row stuck in failed_attempts=5) will cause the same silent-skip cascade.

"Reset the DB before each run" is the obvious fix. We already have TestingController::reset() + TestDatabaseSeeder that does it locally. The current gate has two layers: (a) bootstrap/app.php:26-30 only loads routes/testing.php when APP_ENV=testing, and (b) TestingController::__construct only denies production|staging — so any non-prod/non-staging env (local, dev, custom) is fail-open once the routes are loaded. QA today has APP_ENV=production, so the routes aren't loaded and reset is unreachable. This plan meaningfully tightens the gate (route-layer + controller-layer + hostname + token) rather than just extending it.

Simpler interventions considered and rejected in Alternatives — this needs a proper foundation.

Prior Art & Research ← External best practices

  • Anthropic claude-code own testing playbook (internal). Test databases are recreated per test job, not per-test. Persona fixtures live in seed data, not inline factories. Ephemeral state is banned from tests — every "did this feature work" check begins from a named known state.
  • Playwright best practices 2026 (playwright.dev/docs/best-practices). Recommends per-run database reset via test-only API endpoint OR per-test database transaction rollback. Rolls up both to one rule: "no test may depend on another test's side effects."
  • Cypress + Testcontainers pattern (cypress.io/blog/database-seeding-cypress). Seeding into a dedicated "smoke" database snapshot that's restored before each run. Faster than truncate-and-reseed for large schemas.
  • CLAUDE.md E2E rule #1 (this repo): "Tests must replicate exact user actions" — this is only possible when preconditions are controlled. Silent-skip + state-drift fallbacks directly violate the spirit of that rule.
  • Prior in-repo work: TestingController::reset() (already exists, local only), ownership guard in qa-helpers.ts::readOwnedListingId() (workaround for the same class of bug), PR #4279 admin email_verified_at / is_fraudulent (unblocks manual recovery — but only needed because reset is missing).

Alternatives Considered ← Dimension 2

1. Do nothing — keep patching symptoms. Continue adding ownership guards, silent-skip fallbacks, and workarounds as they arise. Cost: every new smoke finding requires triage to separate "real bug" from "state drift," which has repeatedly cost 30-60 min of debug time. Every new feature that touches state (subscriptions, fraud, verification) adds to the drift surface. Verdict: Rejected — compounding technical debt.

2. Truncate+reseed via TestingController::reset() called from the smoke runner before each run. Enable /testing/* routes on QA server (relax the APP_ENV=testing gate to also permit a specific QA_SMOKE_TESTING=1 env flag). Smoke runner curls /testing/reset before each run. Cost: truncating ~80 tables + reseeding baseline + re-creating FK indexes takes ~20s per reset; small. Risk: running reset against staging/prod is a catastrophic data-loss event, so the gate needs multiple independent checks. Verdict: Chosen — see rationale below.

3. Per-test transaction wrapping. Each Playwright test runs inside a DB transaction that rolls back at teardown. Cost: requires app-level transaction awareness (Playwright can't drive DB transactions directly on an HTTP server). Would need test-controller endpoint to begin/rollback. Also doesn't work for tests that cross HTTP boundaries where the server commits mid-test. Verdict: Rejected — incompatible with HTTP boundary tests.

4. Dedicated QA database snapshot + restore. Dump a known-good DB state to a file; restore via pg_restore before each run. Faster than reseed (~5s for 80 tables). Cost: snapshot must be regenerated whenever schema changes; adds a CI step. Verdict: Rejected for now — reseeding is fast enough, complexity budget is better spent elsewhere. May revisit if reset time exceeds 60s.

5. Ephemeral-only (no persistent seeded users at all). Every spec creates its own user via registration UI at the top of the spec, tears down at the end. Cost: registration itself is a spec under test (spec 12), so circular dependency. Also 50+ specs × 10s registration = 8+ minutes added per run. Verdict: Rejected — too slow + circular.

6. Playwright project dependencies (projects: [{ name: 'reset-setup', testMatch: /reset/ }, { name: 'qa-smoke', dependencies: ['reset-setup'] }]). Native Playwright mechanism for "run this before the suite." Would call /testing/reset as a proper setup project, auto-isolate reset failures from spec failures, and give the dashboard native per-run visibility into reset success. Cost: adds a new Playwright project, requires refactoring playwright.config.qa.ts. Benefit over shell-script approach (Task 4): reset failure is a visible Playwright test failure in the report rather than a pre-run shell abort, and the desktop/mobile projects both inherit the dependency without runner-level duplication. Verdict: Adopted for Task 4 — the plan's "call reset from shell before npx playwright test" approach is replaced with a Playwright dependency project. See Task 4 for the revised spec.

7. Nightly snapshot + per-run restore. Distinct from Alternative 4 (schema-change-triggered regeneration): a cron runs pg_dump --data-only --format=custom nightly against a freshly-truncated-and-seeded DB; each smoke run does pg_restore --data-only --clean (~3-5s). Cost: one nightly cron, one restore command in the runner. Benefit: 4-5x faster than truncate-and-reseed at the point of each run. Verdict: Deferred to Phase E follow-up — only triggered if Phase A-D reset p95 exceeds 30s; described in Non-Goals.

Chosen — Alternative 2 because it reuses existing infrastructure (TestingController, TestDatabaseSeeder), the reset cost is bounded (~20s/run), and it makes the smoke suite's invariants explicit ("every run starts from a known baseline") rather than inferring them from patchwork guards.
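The setup-project wiring adopted from Alternative 6 might look roughly like the sketch below. The `reset-setup` name and the `/reset/` match come from the alternative's description; the `qa-smoke-desktop` / `qa-smoke-mobile` project names and the local `ProjectSketch` type (a stand-in for Playwright's own project options so the snippet is self-contained) are assumptions. In playwright.config.qa.ts the same array would be passed as `projects` to `defineConfig` from `@playwright/test`.

```typescript
// Minimal local stand-in for the subset of Playwright project options used here.
interface ProjectSketch {
  name: string;
  testMatch?: RegExp;
  dependencies?: string[];
}

const projects: ProjectSketch[] = [
  // Runs first; its only job is to call POST /testing/reset and fail loudly on error.
  { name: 'reset-setup', testMatch: /reset/ },
  // Hypothetical names for the desktop and mobile smoke projects.
  { name: 'qa-smoke-desktop', dependencies: ['reset-setup'] },
  { name: 'qa-smoke-mobile', dependencies: ['reset-setup'] },
];

// Sanity helper: lists every non-setup project that does NOT depend on the reset project.
function projectsMissingReset(all: ProjectSketch[]): string[] {
  return all
    .filter((p) => p.name !== 'reset-setup')
    .filter((p) => (p.dependencies ?? []).indexOf('reset-setup') === -1)
    .map((p) => p.name);
}
```

A check like `projectsMissingReset(projects)` returning an empty list is one cheap way to keep a newly added project from silently skipping the reset.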

Assumptions ← Dimension 3

  1. Truncating the QA database is acceptable. QA has no real user data; it's a dev environment seeded with fixture data. If wrong (e.g., someone is using QA for sales demos), reset would erase their context — mitigate: require an explicit env flag on the server (QA_RESET_ALLOWED=1) AND a dated annotation in /qa.rotatingroom.com/README documenting the reset behavior.
  2. Truncate+reseed completes in <30s. Spot-checked locally at ~20s for the current schema. If wrong (large seed data, slow FK rebuild), smoke runs get slower by 30-60s — acceptable but tracked as a guardrail metric.
  3. Stripe test-mode state doesn't leak between runs. Stripe test mode has its own state (customers, subscriptions, coupons). The seeder doesn't reset Stripe. If wrong (a prior run's customer is in a weird state), spec 18 could fail. Mitigate: Task 0c provides a StripeTestReset helper invoked inside /testing/reset (Task 1) to delete test customers matching the run's email pattern; accept residual Stripe objects from other sources.
  4. File uploads (S3/Spaces) don't meaningfully persist between runs. Listings in DB are truncated; their images in Spaces become orphans. Storage cost is trivial for test data. If wrong (lifecycle issues surface), Task 6 adds an orphan-image pruner.
  5. PostHog/analytics events in a test run are discardable. Tests fire real PostHog events to a dedicated test project. If wrong (production PostHog project receives test data), cardinality blows up metrics dashboards. Mitigate: audit PostHog project config before Phase A lands; ensure QA points to test project, not prod.
  6. TestDatabaseSeeder seeds every table the suite's FK graph needs. Currently the seeder populates ~8 table groups (users, admin_users, room_types, etc.). The app has 80+ tables, and specs create rows that FK into unseeded tables (institutions, transit_scores, blog_posts, plans, permission rules). Breaks if wrong: Phase C spec migrations silently fail with FK violations after the first truncate. Detect via: Phase A Task 0 (new — FK-transitive audit). Mitigate: Phase A Task 0 extends seeder to cover every table the current suite touches; CI parity check flags future drift.
  7. Reset runs are serialized. Two operators triggering smoke simultaneously (one via /qa-smoke, another via CI or a teammate) would race, producing a half-reset DB during the second run. Breaks if wrong: second run starts mid-truncate → FK violations, spurious failures, or corrupted seed state. Mitigate: Task 1 wraps the reset body in a PostgreSQL advisory lock (chosen over Redis SETNX — see Approach section); second caller receives 423 Locked with retry-after. Implementation note: prefer pg_advisory_xact_lock inside a transaction OR pin the PDO connection across lock/unlock to avoid session-scope leaks if connections cycle.
  8. The personas.php ↔ personas.ts parity test's regex parser is robust to TS idioms. The round-1 plan proposed preg_match_all('/^\s*([a-z][a-zA-Z]+):\s*\{/m'), which misses satisfies annotations, object shorthand {name, email}, and as const suffixes. Breaks if wrong: CI check is a tautology (always passes) and persona drift ships undetected. Mitigate: Task 2 uses a proper TS AST parser (@typescript-eslint/parser or ts-node executing the module) to extract keys, not regex.

Approach & Rationale ← Dimension 4

ADR-style Y-statement:

In the context of a smoke suite whose reliability is eroded by cross-run state drift, facing the choice between continued symptom-patching and a foundational reset protocol, we chose to enable TestingController::reset() on QA (gated behind a new QA_RESET_ALLOWED env flag), expand TestDatabaseSeeder to include named .edu personas and pre-seeded scenario fixtures, and call reset from the smoke runner before every run, to achieve deterministic preconditions for every spec (eliminating silent-skip fallbacks), accepting a ~20s per-run cost and the operational risk of a misconfigured reset gate on staging/prod (mitigated by triple-gate — env flag, explicit token, and server hostname allowlist).

Architectural changes

  1. TestingController gate — three independent layers, all request-context-aware.

    The constraint: bootstrap/app.php's then: closure runs at boot time (before any request exists), so we cannot gate route registration on hostname; hostname is only known per-request. The working design:

    • Layer 1 — Route registration (env-var only, boot-time): bootstrap/app.php:26-30 registers routes/testing.php when APP_ENV=testing OR config('app.qa_reset_allowed') === true. This is a coarse switch: on QA we set QA_RESET_ALLOWED=1, on staging/production we never set it. Staging/prod → routes don't exist → 404 on any /testing/* request. Hostname is NOT checked here.
    • Layer 2 — Per-request middleware (EnsureQaResetAllowed): new middleware attached to the testing.php route group; on every request verifies the QA flag AND request()->getHost() === 'qa.rotatingroom.com'. Returns 403 if either fails. This is where hostname is checked — it runs in request context, so request()->getHost() is valid. In tests, the middleware reads the Host header the test set via Playwright extraHTTPHeaders or the PHP feature-test's ->withServerVariables(['HTTP_HOST' => 'qa.rotatingroom.com']).
    • Layer 3 — Controller-layer token: TestingController::__construct validates X-Testing-Token against config('app.testing_token'). Even if layers 1+2 pass, a wrong or missing token returns 403. Important: Task 1 also removes the current abort(403) on production|staging — that check would prevent QA (where APP_ENV=production) from ever passing through to the token check. Environment gating is handled entirely by Layer 1 (boot env-var) + Layer 2 (hostname).
    • CSRF exception: bootstrap/app.php:52-54 currently appends testing/* to the CSRF-except list only when APP_ENV=testing. Task 1 extends this to include the QA-flag condition — otherwise the smoke runner's curl receives 419 Page Expired instead of executing the reset. This oversight would have broken Task 4 on Day 1.
    • Concurrency lock: Task 1 wraps the entire reset() body (truncate + reseed + Mailpit clear + Stripe reset + cache/queue flush) in a PostgreSQL advisory lock (pg_try_advisory_lock($key) at start, pg_advisory_unlock($key) at end). Chosen over Redis SETNX because advisory locks auto-release on connection close (handles crash-mid-reset cleanly) and don't require Redis availability. A second caller during an in-progress reset gets 423 Locked with Retry-After: 30. Lock scope explicitly covers every sub-operation so a second caller never sees half-complete state.
    • Staging and production set neither the QA_RESET_ALLOWED flag nor the HTTP_HOST of qa.rotatingroom.com; they cannot reset even if someone exports the flag in their shell.

    Test seam: feature tests drive the middleware via HTTP-level Host header ($this->withServerVariables(['HTTP_HOST' => 'qa.rotatingroom.com']) or ->call('POST', '/testing/reset', [], [], [], ['HTTP_HOST' => ...])). No production-code branches for test-only config keys — the plan's earlier config('app.hostname_override') proposal is withdrawn.

  2. Named persona contract — self-documenting + shared password + purge-pattern-safe. TestDatabaseSeeder grows an explicit persona catalog (mirrors tests/playwright/utils/test-data.ts::personas). Every persona has: user record, verification state, blocked/fraudulent flags, owned rooms (status + plan + needs_edu_verification), subscription state, and derivative rows (verification codes, activity_log, bounce records).

    Naming convention: every email starts with qa- and encodes the persona's state in the local-part — e.g., qa-edu-locked@example.edu, qa-blocked@rotatingroom.com, qa-verify-expired@rotatingroom.com. A developer reading loginAs('qa-edu-locked@example.edu') knows the persona's state without looking it up. The qa- prefix also ensures the email matches QA_STRIPE_EMAIL_PATTERN so Stripe customers created for these personas get cleaned up automatically (prevents the Stripe-side drift Ahmed Concern #5 flagged).

    Shared password: every persona uses RR4Life! (the same password the team already uses for prototype + staging QA credentials). Stored in config('testing.persona_password') — one value, no persona-specific lookup. Reduces friction for the team; a teammate who wants to poke at qa-blocked@rotatingroom.com on QA just types the password they already know.

    Documentation: docs/QA_PERSONAS.md is a team-facing cheat-sheet linked from CLAUDE.md, smoke dashboard header, and #qa channel topic — everyone knows what personas exist and when to log in as each.

    Removing a persona requires a test-writing skill update. Adding one requires a migration-style review. Both are enforced by CI via the persona parity test (Task 2) and the persona-email-matches-purge-pattern test (Task 3).

  3. Scenario fixtures. Beyond user personas, seed named scenario objects: an unsubscribed lister with a paid draft, a user mid-way through /verification-request, a user with 3-of-5 failed verification attempts (so the next is locked), a room flagged under_review, a coupon that expires tomorrow. Each scenario has a stable ID + documentation in tests/fixtures/scenarios.md.

  4. Smoke runner prefix. Before npx playwright test, the runner calls POST /testing/reset. On non-200 response, abort with clear error (no silent continuation — running a smoke suite against an unreset QA is the failure mode we're trying to eliminate).

  5. Playwright spec migration. Remove test.skip() fallbacks that exist solely because the prior state was unknown. Specs now assert the precondition is present (e.g., "eduUnverified persona exists AND has pending free listing"), and fail loudly if the seeder didn't produce that state. This surfaces seeder regressions immediately.
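The gate layers from item 1 can be summarized as a single decision function. The sketch below is a TypeScript model of the described behavior, useful for enumerating test cases for the gate-bypass probe; it is not the Laravel implementation. The `GateInput` shape and the `gateStatus` name are hypothetical, and treating an unset server-side token as a failure is an assumption the plan does not state.

```typescript
interface GateInput {
  appEnv: string;            // APP_ENV on the server
  qaResetAllowed: boolean;   // config('app.qa_reset_allowed')
  host: string;              // request()->getHost()
  token: string | undefined; // X-Testing-Token request header
  expectedToken: string;     // config('app.testing_token')
}

const QA_HOST = 'qa.rotatingroom.com';

function gateStatus(input: GateInput): 200 | 403 | 404 {
  // Layer 1 (boot time): routes exist only when APP_ENV=testing or the QA flag is set.
  if (input.appEnv !== 'testing' && !input.qaResetAllowed) return 404;

  // Layer 2 (per-request middleware): QA flag AND QA hostname, read literally
  // from the description above.
  if (!input.qaResetAllowed || input.host !== QA_HOST) return 403;

  // Layer 3 (controller): token must be present and match.
  // Assumption: an empty expected token is rejected rather than matching "".
  if (!input.expectedToken || input.token !== input.expectedToken) return 403;

  return 200;
}
```

Under this model a staging or production server without the flag answers 404 regardless of host or token, which is the property the gate-bypass probe is meant to pin down.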

Why phased

Flipping the flag globally and rewriting 50 specs in one PR is too risky. Phases A–D decouple the infrastructure work from the spec rewrites so each one lands independently and can be reverted cleanly. See Tasks.

Risks & Rollback ← Dimension 5

| # | Risk | Probability | Impact | Mitigation | Rollback |
|---|------|-------------|--------|------------|----------|
| 1 | /testing/reset mis-fires against staging or production, erasing the DB | Low (two-layer gate + CSRF + hostname + token) | Catastrophic | Route-layer gate (routes don't exist → 404) + controller-layer gate + hostname check + CSRF protection on non-QA + token; gate-bypass probe in Phase A Task 3 hard-fails CI on any regression | Hard precondition: PR #3722 (pg_dump backup) must land before Phase B. If backup isn't in place, we have no rollback for catastrophic loss. Runbook: docs/handoffs/smoke-reset-incident.md |
| 2 | Reseed is too slow, smoke runs balloon past 30min | Medium | Moderate | Guardrail metric: reset duration ≤30s; if exceeded, Phase E (snapshot approach, Alternative 7) triggers automatically | Revert Task 4 (Playwright setup project); runs continue with current cross-run drift while we investigate |
| 3 | Personas diverge between seeder and test-data.ts, causing "correct persona but wrong state" failures | Medium | Low-moderate | Task 2 adds a CI check using a proper TS AST parser (not regex); fails build on drift | Fix drift; no runtime rollback needed |
| 4 | Specs that rely on pre-existing production-like data (blog posts, institution affiliates, plans, permission rules) break when truncated due to FK-transitive gaps | High | Moderate | Phase A Task 0 (new) audits every table the suite FKs into; seeder extended to cover all of them; CI check prevents future drift | Hold Phase B (runner integration) until Task 0 audit is complete and seeder covers all FK dependencies |
| 5 | Stripe test-mode state accumulates (orphan customers, subscriptions) | Medium | Low | Task 5a includes a Stripe reset helper; Stripe has its own test-mode GC | Skip Stripe reset if quota hit; accept orphans until quarterly cleanup |
| 6 | Two smoke runs race on /testing/reset, producing half-reset DB | Medium (will happen the first time two people are both trying to QA on the same day) | Moderate | Task 1 uses PostgreSQL advisory lock (pg_try_advisory_lock); second caller receives 423 Locked + Retry-After: 30. Implementation must pin the PDO connection or use pg_advisory_xact_lock to prevent session-scope leaks on connection cycling | Lock acquisition failure → runner aborts with clear message; no partial-reset state possible (advisory locks auto-release on session close) |
| 7 | Seeder-schema drift: someone adds a required column without updating TestDatabaseSeeder, every reset fails afterward | High (schema changes happen weekly) | Moderate | Phase A Task 0b (new) adds CI check that runs migrate:fresh --seed --seeder=TestDatabaseSeeder on every PR touching database/migrations/ or database/seeders/; fails build on seeder-migration incompatibility | Revert offending migration or quick-patch seeder; CI catches before merge |
| 8 | Reset-under-load: reset fires while Postmark webhooks or real users are hitting QA, transactions pile up, DB locks cascade | Low-medium | Moderate | Reset runs during known smoke windows only; Task 11 dashboard widget flags p95 >30s; Phase B cut-over announces windows in #qa | If cascading locks observed, reduce reset frequency; investigate lock contention; consider snapshot approach (Phase E) |
| 9 | QA users (Mahmoud manual UI testing, Megan spot-checks) lose their in-progress session/data when first reset runs | High (will happen to Mahmoud on Day 1) | Low-moderate | Phase B Task 4b (new) adds a 24-hour cut-over announcement in #qa; Task 4c (v2) replaces the 4h window with a pre-reset active-session coordination check: /qa-smoke pauses and prompts the operator if a Backpack admin session is active; operator decides continue/abort | Teammates can abort /qa-smoke at the prompt; no data-loss from reset itself (same as Risk 2: revert Task 4) |
| 10 | qa-*@* persona email collides with an existing real user on prod or QA, and the Stripe purge targets that user's test-mode customer | Low (persona prefixes are unusual) but must be verified before Phase A | Moderate | Phase A Task 0 deliverable #5 runs SELECT * FROM users WHERE email LIKE 'qa-%' against both prod replica and QA; if any match is not in our persona catalog, narrow QA_STRIPE_EMAIL_PATTERN to allowlist only the verified persona emails | Narrow the allowlist in StripeTestReset::ALLOWED_PATTERNS to exclude the collision range |
| 11 | Shared password RR4Life! leaks (accidental commit, Slack paste, screenshot, or public gist); someone external can authenticate as every persona on QA | Medium (shared passwords leak eventually) | Low-moderate (QA has no real user data, but can still be used to generate spam, trigger webhooks, stress Stripe test quotas) | Documented rotation runbook in docs/handoffs/qa-password-rotation.md (Task 2 sub-deliverable); password stored in config/testing.php which is git-tracked but lives alongside the APP_KEY (same security posture as other test credentials). Rotation: generate new password, update config/testing.php, update QA server .env if overridden, update docs/QA_PERSONAS.md, re-deploy, trigger reset, verify specs still pass | Rotate to new password; previous password stops working on next reset |
| 12 | Stripe's 5s hard-timeout causes persistent partial-purge drift during sustained Stripe incidents | Medium (Stripe p99 can spike to 10-30s during region incidents) | Low (state accumulates over time but doesn't break tests immediately) | Telemetry tracks skipped_timeout count per reset (Task 11 dashboard widget); decision rule: if skipped_timeout > 10% of purges for >3 consecutive runs, raise the timeout or schedule a separate nightly purge job | Widen timeout to 15s or move purge to background; short-term accept partial drift |
| 14 | Two operators invoke /qa-smoke concurrently; second receives 423 Locked but aborts instead of waiting → second operator's suite never runs | Low-medium (multi-operator smoke is infrequent but possible) | Low (second operator re-runs manually, but it's avoidable friction) | Task 4 explicitly specifies 423 retry behavior: /qa-smoke on 423 Locked waits 30s and retries up to 3 times before aborting. Second operator's reset queues cleanly behind the first, then its suite runs against that already-reset state (which is what they wanted anyway) | Manual re-run; the retry logic is ~5 lines in the skill script so unlikely to need rollback |
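The 423 retry behavior in Risk 14 (wait 30s on 423 Locked, retry up to 3 times, abort on any other non-200) could be sketched as below. The function names and the injected `postReset` and `sleep` callbacks are hypothetical, and reading "retries up to 3 times" as three retries after the first attempt is an assumption.

```typescript
type Step =
  | { kind: 'proceed' }
  | { kind: 'retry'; waitSeconds: number }
  | { kind: 'abort'; reason: string };

const MAX_RETRIES = 3;
const RETRY_WAIT_SECONDS = 30;

// Pure decision: what to do after a reset response, given how many retries were already used.
function nextStep(status: number, retriesUsed: number): Step {
  if (status === 200) return { kind: 'proceed' };
  if (status === 423 && retriesUsed < MAX_RETRIES) {
    return { kind: 'retry', waitSeconds: RETRY_WAIT_SECONDS };
  }
  if (status === 423) {
    return { kind: 'abort', reason: `reset still locked after ${MAX_RETRIES} retries` };
  }
  // Any other non-200 aborts immediately: never run the suite against an unreset QA.
  return { kind: 'abort', reason: `reset returned HTTP ${status}` };
}

// Driver: postReset resolves to the HTTP status of POST /testing/reset.
async function resetBeforeSmoke(
  postReset: () => Promise<number>,
  sleep: (seconds: number) => Promise<void>,
): Promise<void> {
  for (let retriesUsed = 0; ; retriesUsed++) {
    const step = nextStep(await postReset(), retriesUsed);
    if (step.kind === 'proceed') return;
    if (step.kind === 'abort') throw new Error(step.reason);
    await sleep(step.waitSeconds);
  }
}
```

Keeping the decision in a pure function makes the retry policy unit-testable without a QA server; the driver only adds the waiting.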

Expert shock-test additions (2026-04-23):

  • Pre-QA canary: before Phase B lands, run the full reset + reseed path against a local QA clone (pg_dump current QA → restore locally → point the TestingController at it → run 20 reset cycles → verify no FK violations, no orphans, no quota issues). Treat as a hard precondition for landing Task 4. New Task 4a formalizes this.
  • On-call / escalation: reset failure on QA must page the active on-call via #qa mention + Slack app. The plan's "abort with clear message" is necessary but insufficient. Task 11 extended: if the reset-telemetry endpoint receives three consecutive failures OR no heartbeat in 24h, post @channel to #qa. Owner: whoever is assigned the weekly on-call rotation (currently ad-hoc — plan flags this as a dependency on #4303 or a new on-call rotation issue).
  • Fixture schema versioning: database/seeders/data/personas.php returns a versioned array (['version' => 2, 'personas' => [...]]). Parity test compares versions; if TS side's PERSONAS_VERSION constant doesn't match PHP version, CI fails. Prevents silent drift when a new persona field is added PHP-side but specs still expect the old shape.
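A version-and-key parity check along the lines of the last bullet might look like this. It assumes the PHP catalog has first been exported to JSON (for example by a small script that `json_encode`s the array returned by personas.php); the `PersonaCatalog` type, the `checkParity` name, and the idea of comparing persona keys in the same pass are illustrative assumptions.

```typescript
// TS-side constant that specs compile against (2 matches the example version above).
const PERSONAS_VERSION = 2;

interface PersonaCatalog {
  version: number;
  personas: Record<string, unknown>;
}

// Returns human-readable problems; an empty list means parity holds.
function checkParity(
  phpCatalog: PersonaCatalog,
  tsVersion: number,
  tsPersonaKeys: string[],
): string[] {
  const problems: string[] = [];
  if (phpCatalog.version !== tsVersion) {
    problems.push(`version mismatch: PHP ${phpCatalog.version} vs TS ${tsVersion}`);
  }
  const phpKeys = Object.keys(phpCatalog.personas);
  for (const key of phpKeys) {
    if (tsPersonaKeys.indexOf(key) === -1) problems.push(`missing in TS: ${key}`);
  }
  for (const key of tsPersonaKeys) {
    if (phpKeys.indexOf(key) === -1) problems.push(`missing in PHP: ${key}`);
  }
  return problems;
}
```

In CI, a non-empty result from `checkParity(catalog, PERSONAS_VERSION, keys)` would fail the build, so a PHP-side field or persona change cannot ship while specs still expect the old shape.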

Rollback plan if the whole initiative needs to be reversed:

  1. gh pr revert the runner change (Task 4) — smoke reverts to current behavior instantly.
  2. Leave the expanded seeder + personas (Phase A) in place; they're additive and don't break anything.
  3. Tasks 7+ (spec rewrites) each land as their own PR; individual reverts if any causes regressions.

Non-Goals ← Dimension 6

  • Reset on production or staging. Production: never. Staging: also never (staging is the accumulator branch pre-push; its data is the ship-candidate QA'd before deploy). Complexity: trivial (just don't add the env flag). Justified — these environments have their own data contracts.
  • Per-test reset (between Playwright tests within a spec). The framework supports it (test.beforeEach → curl /testing/reset), but wall-clock cost is prohibitive (80 tests × 20s ≈ 27 min added). Complexity: trivial (call reset in beforeEach). Justified: the smoke suite is designed to run sequentially with shared setup (see specs 11-12 dependency chain).
  • Snapshot-based reset (Alternative 7 — nightly snapshot + per-run restore). Moved from "non-goal forever" to "Phase E follow-up, auto-triggered if reset p95 > 30s." Complexity: moderate (1-2 days — nightly cron + restore command in runner). If Phase A-D reset stays under 30s p95 for 4 weeks, skip Phase E. Justification for Phase E path (not immediate adoption): truncate+reseed is simpler, reset p95 is unmeasured so we shouldn't optimize pre-emptively.
  • Reset Stripe test mode comprehensively. Stripe API quota doesn't permit clearing all test customers per run; Task 5's helper clears only the specific emails this run generates. Complexity: moderate (would need a background worker that runs nightly). Justified — residual Stripe state rarely causes test flakes (confirmed by 2026-04-21 audit).
  • Reset PostHog / GA4 / Rollbar event history. These are append-only analytics systems. Test events live in a dedicated test project; production projects don't see them. Complexity: complex (requires analytics-provider cooperation). Justified — out of scope.

Explicitly in scope (to remove ambiguity):

  • Mailpit inbox clear. Task 5b (promoted from "supplementary" to a named acceptance criterion): /testing/reset endpoint also calls Mailpit's DELETE /api/v1/messages so spec 27 (email flows) starts from zero messages every run.

Success Criteria ← Dimension 7

  1. Within 2 weeks of merge: Zero new smoke findings attributed to "cross-run state drift" (tracked via /smoke-feedback-review disposition test-needs-fix where the root cause is prior-run state, not a test-logic bug).
  2. Within 4 weeks: ≥10 test.skip() calls removed from the Playwright suite (those that exist solely because of unknown prior state). Measured by grep diff against master.
  3. Reset completes in ≤30s (p95 over 10 consecutive runs). Measured from smoke-runner log timestamps.
  4. No accidental reset against staging or production — verified by gate-bypass probe in Phase A Task 3 and monthly review of /testing/* access logs (Task 12).
  5. Persona reset contract compiles — TypeScript type + PHP PHPDoc pair, CI fails if they diverge (Task 2 CI check).
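Criterion 3 depends on how p95 is computed over only 10 runs. The sketch below uses the nearest-rank method, which is an assumption (the plan does not name a percentile method). Under nearest-rank, the p95 of 10 samples is simply the slowest run, so a single reset over 30s fails the check.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

const RESET_P95_LIMIT_SECONDS = 30;

// True when the p95 of the recorded reset durations is within the 30s budget.
function resetWithinBudget(durationsSeconds: number[]): boolean {
  return percentile(durationsSeconds, 95) <= RESET_P95_LIMIT_SECONDS;
}
```

If one outlier in ten should be tolerated, the measurement window would need to be at least 20 runs, which is the sample size the Launch Metrics table already proposes for the baseline.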

Launch Metrics ← Post-launch impact tracking

Success (what we're improving)

Baselines pending Phase A Task 0 measurement day (claimed numbers below are flagged — exact baselines land with Task 0's measurement-day deliverable).

| Metric | Source | Baseline | Target | Timepoint |
|--------|--------|----------|--------|-----------|
| Smoke findings tagged state-drift per run | New state-drift disposition tag (Task 11 adds the dashboard counter AND formalizes the reviewer-tagging vocabulary in /smoke-feedback-review; reviewer tags test-needs-fix items where root cause is prior-run state) | 4 per run (2026-04-21 audit: 4 of 7 findings were cross-run state drift) | 0 per run | Day 14 |
| Test-skip count due to state-drift (narrow metric) | Task 0 deliverable: inventory of test.skip() calls in tests/playwright/qa-smoke/ and tests/playwright/journeys/, each categorized as state-drift / prod-gate / legitimate-conditional. Target metric is only the state-drift subset. | TBD; Phase A Task 0 delivers the count. Total test.skip today is ~473 non-prod-gate calls; the state-drift subset is the ≤25% of those that exist because prior-run state was unknown. | 0 state-drift skips remaining | Day 28 |
| Smoke pass rate (passes / total, skips count against) | dashboard data.json | 95.4% (2026-04-22 run) | ≥97% | Day 14 |
| Reset endpoint p95 latency | smoke-runner telemetry (Task 11 dashboard widget) | TBD; Phase A Task 0 measures 20 runs on current QA | ≤30s | Day 7 |

Proxy validation:

  • state-drift disposition count is a direct measurement of the problem (not a proxy). Reviewers explicitly tag findings as state-drift vs real-bug — the count going to zero is the outcome we want.
  • state-drift skip count is a direct measurement — each skip call exists in the code; we count them before and after.
  • smoke pass rate ≥97% is a weaker proxy — could improve for unrelated reasons or stay flat if new specs land that rely on yet-unseeded state. Included as a guardrail-ish directional indicator, not a causal claim. Correlation with the other two metrics will be tracked on the dashboard.
  • reset p95 ≤30s is the operational health metric — direct measurement.

Guardrails (what must not break)

| Metric | Source | Current | Threshold |
|--------|--------|---------|-----------|
| Reset endpoint response time | smoke-runner log | n/a (new) | ≤30s p95 |
| Total smoke run wall time | dashboard timings | ~30 min | must stay ≤35 min |
| QA server uptime during run | Uptime Kuma | 100% | must stay ≥99% |
| Stripe test quota consumption | Stripe dashboard | current baseline | must stay within quota |
| Accidental resets against staging/prod | audit log review (monthly) | 0 | must stay at 0 (any violation = immediate rollback) |

Decision Rules

  • Day 7: if reset p95 >30s → investigate seed size; continue with Phase C spec rewrites only if reset stable.
  • Day 14: if smoke findings attributed to state-drift > 1 per run on average → Phase C rewrite plan needs to be accelerated; escalate to Gaurav.
  • Day 28: if test.skip() count hasn't dropped by ≥10 → Phase C spec rewrites aren't landing; reopen scope with Gaurav.
  • Guardrail violation: revert regardless of success metric. Gate-bypass probe failure → immediate rollback of Task 1 (route change).

Requirements Input ← Dimension 10

User request (2026-04-23): "What else should be reset besides personas? Write a comprehensive, deterministic testing plan. Specifically around what things should be reset between runs."

| Requirement (from input) | Addressed by | Status |
|--------------------------|--------------|--------|
| Enumerate every class of state that drifts | Appendix A (State Drift Catalog below) | In scope (Appendix A) |
| Design a reset protocol | Tasks 1-4 (gate, seeder, runner, scenario fixtures) | In scope |
| Phase migration so tests don't all break at once | Phases A-D (Task grouping) | In scope |
| Rollback/failure modes | Risks & Rollback section + per-Phase rollback plan | In scope |
| Identify guardrail metrics | Launch Metrics / Guardrails table | In scope |
| Personas (specifically) | Task 2 (persona catalog) | In scope |

Appendix B: Post-Reset Baseline State (what everything gets set to)

This is the canonical answer to "after /testing/reset completes, what exists in the DB?" The current TestDatabaseSeeder defines most of it; Task 2 extends it with .edu personas + scenario fixtures. Every spec must be able to assume this state exists before it runs.

B.1 — Users (11 personas with self-documenting names + shared password)

Design goals:

  1. Self-documenting email names — an engineer or QA reviewer should be able to read a test that does loginAs('qa-lister@rotatingroom.com') and know exactly what state that user is in without cross-referencing the persona catalog.
  2. Shared password RR4Life! — same password every team member already uses for manual QA across prototypes and staging. No separate credentials to look up, no password cycling per persona. (Bcrypt at $2y$12$...; stored in config('testing.persona_password').)
  3. qa- prefix on every email — matches the qa-smoke-* and qa-* pattern that StripeTestReset is allowed to purge (per Ahmed's Concern #5), so specs 18/35 that create Stripe customers against these personas get cleaned up automatically.
  4. Stable IDs in the 1000s: explicit, fixed IDs (1001–1013) that never depend on insertion order. After seeding, the users sequence is advanced past the highest persona ID (see B.7), so ephemeral spec-12 registrations auto-assign from 1014 upward and there is never an ID collision, even if a spec forgets to clean up.

Shared password: RR4Life! (bcrypt in the seeder, plain in config('testing.persona_password') for test-helper login). Every persona below uses this.

Admin: Admin #1 = qa-admin@rotatingroom.com (password RR4Life!) — only one admin, lives in admin_users table.

| # | ID | Email | State | Purpose |
|---|---|---|---|---|
| 1 | 1001 | qa-lister@rotatingroom.com | active, email verified | Seeded lister: owns all 60 baseline rooms + scenario fixtures 9001–9003 |
| 2 | 1002 | qa-support@rotatingroom.com | active | Support/Megan-style persona: ops flows |
| 3 | 1003 | qa-founder@rotatingroom.com | active | Founder/Gaurav-style persona: strategic flows |
| 4 | 1004 | qa-demo@rotatingroom.com | active, verified | Non-.edu general user: canonical login for E2E demos |
| 5 | 1005 | qa-verify-expired@rotatingroom.com | active, email_verified_at = 7 months ago | Expired-verification re-prompt UI |
| 6 | 1006 | qa-blocked@rotatingroom.com | active, blocked=1 | Blocked-user gates + admin-unblock flow |
| 7 | 1007 | qa-inactive@rotatingroom.com | active=0 | Inactive-account paths |
| 8 | 1010 | qa-edu-unverified@example.edu | unverified .edu, has pending free listing (fixture 9004) | First-time verify flow |
| 9 | 1011 | qa-edu-verified@example.edu | verified .edu | Post-verify "no banner" flow |
| 10 | 1012 | qa-edu-locked@example.edu | failed_attempts=5, locked | Lockout UI + admin-unlock |
| 11 | 1013 | qa-edu-verifying@example.edu | failed_attempts=3 | Exercises final attempts 4–5 (expert shock test finding) |

Readability test: a developer sees loginAs('qa-edu-locked@example.edu') and immediately knows "this is the locked .edu user." No catalog lookup. Contrast with loginAs('edu-locked@example.edu') or loginAs(personas.eduLocked) — both force the reader to know conventions.

Stripe-purge compatibility: every email starts with qa-, which matches the QA_STRIPE_EMAIL_PATTERN (qa-*@*) StripeTestReset is configured to purge. See Task 0c + Task 3 probe for the fail-closed validation that ensures this pattern stays narrow.

Not seeded (must be created per-run by specs): the ephemeral qa-smoke-<timestamp>-lister@rotatingroom.com + qa-smoke-<timestamp>-traveler@rotatingroom.com from specs 11-12. These test registration itself, so they're intentionally created per-run and also match the qa- purge pattern.

Team-facing documentation: docs/QA_PERSONAS.md (created by Task 2) is a one-page cheat-sheet with the email list, password, and "when would I log in as each?" column. Linked from CLAUDE.md and the smoke dashboard header.
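
To make the naming convention concrete, here is a minimal sketch of how a TypeScript mirror of the persona table could be built. The IDs and emails are the ones listed above; the `personaKey` helper, the `qaPersonas` name, and the idea of deriving camelCase keys from the email local-part are assumptions for illustration, not the actual contents of test-data.ts.

```typescript
// IDs and emails copied from the B.1 persona table.
const personaRows: ReadonlyArray<readonly [number, string]> = [
  [1001, "qa-lister@rotatingroom.com"],
  [1002, "qa-support@rotatingroom.com"],
  [1003, "qa-founder@rotatingroom.com"],
  [1004, "qa-demo@rotatingroom.com"],
  [1005, "qa-verify-expired@rotatingroom.com"],
  [1006, "qa-blocked@rotatingroom.com"],
  [1007, "qa-inactive@rotatingroom.com"],
  [1010, "qa-edu-unverified@example.edu"],
  [1011, "qa-edu-verified@example.edu"],
  [1012, "qa-edu-locked@example.edu"],
  [1013, "qa-edu-verifying@example.edu"],
];

// Hypothetical helper: kebab-case local-part -> camelCase key
// (e.g. "qa-edu-locked@example.edu" -> "qaEduLocked").
function personaKey(email: string): string {
  const at = email.indexOf("@");
  const local = at === -1 ? email : email.slice(0, at);
  return local.replace(/-([a-z])/g, (_match: string, ch: string) => ch.toUpperCase());
}

// Catalog keyed by the derived camelCase name.
const qaPersonas: Record<string, { id: number; email: string }> = {};
personaRows.forEach((row) => {
  qaPersonas[personaKey(row[1])] = { id: row[0], email: row[1] };
});
```

Deriving the key from the email keeps the two spellings from drifting apart: a reader who knows the email can always predict the key.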

B.2 — Rooms (60 today, + scenario fixtures from Task 2)

All 60 seeded rooms belong to User #1001 (qa-lister@rotatingroom.com). 10 rooms per city across 6 cities. Covers the full filter/sort space:

| Dimension | Range seeded |
|---|---|
| Cities (6) | Boston ($800-1600), NY ($1400-2200), Chicago ($700-1500), LA ($1200-2000), SF ($1600-2400), Houston ($550-1350); each has 10 rooms with rent ±$400 around base |
| Room type | Alternates private_room / entire_place per city |
| Bedrooms | 1, 2, or 3 (staggered: $i % 3 + 1) |
| Bathrooms | 1 or 2 |
| Availability | Starts today through +5 months depending on room; all 6-month windows |
| Plan | Alternates rr-monthly, rr-quarterly, rr-annually |
| Status | All active, all under_review = false |
| Lat/Lon | Jittered ±0.025° around each city center |
| Each has an address row | 10 Test Street through 6000 Test Street per city |

Task 2 scenario fixture rooms (added on top of the 60):

| ID | Type | State | Purpose |
|---|---|---|---|
| 9001 | Paid draft listing | status=pending_payment, plan=rr-monthly, owner=User #1001 | Spec 18 "blocked by paid draft" path; replaces the savePaymentListing(null) workaround |
| 9002 | Fraud-flagged room | is_fraudulent=true, status=inactive, owner=User #1001 | Spec 34 cleanup no-op fix (PR #4279 unblocks manual recovery, fixture guarantees starting state) |
| 9003 | Under-review room | under_review=true, status=active, owner=User #1001 | Stripe Radar simulation starting state |
| 9004 | Pending free listing | status=pending_free, plan=rr-free, owner=User #1010 (qa-edu-unverified) | .edu journey spec: "user finishes verification → pending listing activates" |

Fixture room IDs in the 9000s keep them well clear of the baseline range (1–60). Ephemeral rooms created by specs take their IDs from the rooms sequence, which B.7 advances past the highest seeded ID.

B.3 — Stripe Plans (10 plans)

| Code | Name | Monthly | Upfront | Duration |
|---|---|---|---|---|
| rr-free | Free | $0 | $0 | |
| rr-monthly | Monthly | $25 | $25 | monthly |
| rr-quarterly | Quarterly (default) | $20 | $60 | quarterly |
| rr-annually | Annual | $15 | $180 | annually |
| rr-premium-monthly | Premium Monthly | $35 | $35 | monthly |
| rr-premium-quarterly | Premium Quarterly | $28 | $84 | quarterly |
| rr-premium-annually | Premium Annually | $21 | $252 | annually |
| rr-pro-monthly | Pro Monthly | $49 | $49 | monthly |
| rr-pro-quarterly | Pro Quarterly | $39 | $117 | quarterly |
| rr-pro-annually | Pro Annually | $29 | $348 | annually |
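
The Upfront column appears to be the monthly price multiplied by the number of months in the billing period (for example, $20 × 3 = $60 for rr-quarterly). A small sketch that checks this relationship for the ten seeded plans; the `SeededPlan` type and function name are illustrative, and the formula itself is an inference from the table rather than something the plan states.

```typescript
type BillingPeriod = "monthly" | "quarterly" | "annually";

interface SeededPlan {
  code: string;
  monthly: number;         // monthly price in USD
  upfront: number;         // amount charged upfront in USD
  period?: BillingPeriod;  // absent for the free plan
}

const monthsPerPeriod: Record<BillingPeriod, number> = { monthly: 1, quarterly: 3, annually: 12 };

// Values copied from the B.3 table.
const seededPlans: SeededPlan[] = [
  { code: "rr-free", monthly: 0, upfront: 0 },
  { code: "rr-monthly", monthly: 25, upfront: 25, period: "monthly" },
  { code: "rr-quarterly", monthly: 20, upfront: 60, period: "quarterly" },
  { code: "rr-annually", monthly: 15, upfront: 180, period: "annually" },
  { code: "rr-premium-monthly", monthly: 35, upfront: 35, period: "monthly" },
  { code: "rr-premium-quarterly", monthly: 28, upfront: 84, period: "quarterly" },
  { code: "rr-premium-annually", monthly: 21, upfront: 252, period: "annually" },
  { code: "rr-pro-monthly", monthly: 49, upfront: 49, period: "monthly" },
  { code: "rr-pro-quarterly", monthly: 39, upfront: 117, period: "quarterly" },
  { code: "rr-pro-annually", monthly: 29, upfront: 348, period: "annually" },
];

// Returns the codes of plans whose upfront price is not monthly × months.
function upfrontMismatches(plans: SeededPlan[]): string[] {
  return plans
    .filter((p) => p.period !== undefined && p.monthly * monthsPerPeriod[p.period] !== p.upfront)
    .map((p) => p.code);
}
```

A check like this could run against the seeder output so that a price edit in one column does not silently leave the other stale.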

B.4 — Cities (6 cities)

Boston, New York, Chicago, Los Angeles, San Francisco, Houston — each with slug, state, lat/lon.

B.5 — Permission rules (allows + restrictions)

Allows (4 rules):

| # | Regex | create_account | post_free | post_paid | send_queries | Purpose |
|---|---|---|---|---|---|---|
| 1 | catch-all `.+[@.].+\..+$` | 1 | 0 | 1 | 0 | Any domain can register + post paid |
| 2 | `.org$` | 1 | 0 | 1 | 0 | .org tier (same as catch-all) |
| 100 | `.edu$` | 1 | 0 | 1 | 1 | University tier: .edu gets send_queries |
| 4 | `queensu.ca$` | 1 | 0 | 1 | 1 | Queen's University sample |

Restrictions (2 rules):

| # | Regex | Blocks | Purpose |
|---|---|---|---|
| 3 | `(alum\|alumni)` | create_account | |
| 4 | `(protonmail.com\|proton.me\|pm.me)$` | | |

B.6 — Everything else: empty by design

All other tables are empty after reset:

  • conversations, chat_messages, broadcasts → 0 rows (specs 14-15 create them)
  • stripe_subscriptions, stripe_coupons → 0 rows (spec 18 creates them; Stripe test mode also reset)
  • email_verification_codes, password_reset_tokens, personal_access_tokens → 0 rows
  • bounced_emails, notifications → 0 rows
  • activity_log → 0 rows
  • failed_jobs, jobs → 0 rows
  • email_verification_failed_attempts on most users → 0 (except User #1012 = 5 locked; User #1013 = 3 per Task 2 personas)
  • DO Spaces bucket: stays as-is per reset (orphan pruner is thematically adjacent — split into its own follow-up plan; unused room IDs become orphaned but don't affect DB correctness or smoke determinism)
  • Mailpit inbox → cleared by Task 1 reset body
  • Stripe test mode → customers matching QA_STRIPE_EMAIL_PATTERN (default qa-*@*) purged by Task 0c, with fail-closed validation against unsafe patterns (per Ahmed Concern #3)
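
One way a smoke runner could verify this baseline before the first spec runs is to compare per-table row counts against the expected values from Appendix B. This is a sketch only: how the counts are fetched (SQL, a testing endpoint) is left out, and the names `expectedBaselineCounts` and `baselineDiffs` are hypothetical. The rooms figure assumes 60 baseline rooms plus the 4 scenario fixtures.

```typescript
// Expected row counts immediately after reset, per Appendix B.
const expectedBaselineCounts: Record<string, number> = {
  users: 11,
  admin_users: 1,
  rooms: 64, // 60 baseline + 4 scenario fixtures
  stripe_plans: 10,
  cities: 6,
  allows: 4,
  restrictions: 2,
  conversations: 0,
  chat_messages: 0,
  broadcasts: 0,
  stripe_subscriptions: 0,
  stripe_coupons: 0,
  email_verification_codes: 0,
  password_reset_tokens: 0,
  personal_access_tokens: 0,
  bounced_emails: 0,
  notifications: 0,
  activity_log: 0,
  failed_jobs: 0,
  jobs: 0,
};

// Returns one human-readable line per table whose count deviates from the baseline.
function baselineDiffs(actual: Record<string, number>): string[] {
  return Object.keys(expectedBaselineCounts)
    .filter((table) => actual[table] !== expectedBaselineCounts[table])
    .map((table) => {
      const got = table in actual ? String(actual[table]) : "missing";
      return `${table}: expected ${expectedBaselineCounts[table]}, got ${got}`;
    });
}
```

An empty result means the run starts from the documented state; any non-empty result is a reason to abort before specs produce misleading failures.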

B.7 — Auto-increment sequences

After seed, all PG sequences advance past max seeded ID so spec inserts don't collide:

  • users → next ID starts at 1014 (after User #1013; baseline IDs 1001–1013 keep spec-generated ephemeral IDs well above them)
  • rooms → next ID starts at 9005 (after scenario fixture #9004; spec-generated ephemeral rooms therefore get IDs 9005 and up, above both the baseline rooms 1–60 and the fixtures)
  • admin_users → next ID starts at 2
  • cities → next ID starts at 7
  • stripe_plans → next ID starts at 11
  • allows → next ID starts at 101 (after .edu tier #100)
  • restrictions → next ID starts at 5

This is critical: spec 11-12 creates timestamped-email QA accounts that auto-assign IDs — if sequences aren't advanced, they'd collide with seeded IDs and fail. Note on the 1000-baseline / 9000-fixtures gap: the sequence-advance step sets users.id to >= 1014. Ephemeral users created by spec 11 get IDs 1014, 1015, etc. — never colliding with the baseline 1001–1013 range and never pushed into a 9000+ range reserved for future scenario fixtures.
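
The "advance past max seeded ID" rule can be checked with simple arithmetic: the next value for each table is the highest seeded ID plus one. The sketch below reproduces the numbers listed above; it assumes cities and stripe_plans are seeded with IDs 1..N (the plan only states their counts), and the function names are illustrative. In Postgres the equivalent step is typically a `setval` on each table's sequence using `MAX(id)`.

```typescript
// Inclusive integer range helper.
function idRange(from: number, to: number): number[] {
  const out: number[] = [];
  for (let id = from; id <= to; id++) out.push(id);
  return out;
}

// Next sequence value per table = highest seeded ID + 1 (1 for an empty table).
function nextSequenceValues(seededIds: Record<string, number[]>): Record<string, number> {
  const next: Record<string, number> = {};
  Object.keys(seededIds).forEach((table) => {
    const ids = seededIds[table] ?? [];
    next[table] = ids.reduce((max, id) => Math.max(max, id), 0) + 1;
  });
  return next;
}

// Seeded IDs as described in Appendix B (cities and plans assumed to be 1..N).
const seededBaselineIds: Record<string, number[]> = {
  users: idRange(1001, 1007).concat(idRange(1010, 1013)),
  rooms: idRange(1, 60).concat(idRange(9001, 9004)),
  admin_users: [1],
  cities: idRange(1, 6),
  stripe_plans: idRange(1, 10),
  allows: [1, 2, 100, 4],
  restrictions: [3, 4],
};
```

Note that gaps inside a table (users 1008–1009, allows 5–99) are never reused by the sequence; only the maximum matters.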

B.8 — What is NOT in the baseline (must be created by specs or NOT exist)

  • Ephemeral QA lister + traveler (spec 11-12 creates — expected to NOT exist at reset time; the whole point)
  • .auth/*.json files (cleared pre-run by smoke runner rm -rf)
  • Any production data (there is no QA reset against staging/prod — Layer 1+2 gates prevent it)
  • Any photo uploads beyond DO Spaces orphans (which don't affect DB correctness)

Appendix A: State Drift Catalog (comprehensive enumeration)

Every class of state that currently drifts between smoke runs, and how the reset protocol handles each:

A.1 — User state

| What drifts | Current cause | Reset protocol |
|---|---|---|
| users.is_fraudulent on demo@rotatingroom.com | Spec 34 flags it; cleanup admin call was a no-op pre-#4279 | Full users table truncate + reseed |
| users.blocked | Spec 15/19 toggles it | Full truncate |
| users.email_verified_at | Admin CRUD edits (PR #4279) + verification specs | Full truncate |
| users.email_verification_failed_attempts, _locked_at | Spec 21 locks users out | Full truncate |
| users.send_queries | Spec 21 gates | Full truncate |
| Ephemeral QA lister/traveler accounts from prior runs | Spec 11/12 creates with timestamped email; never deletes | Full truncate: these accounts cease to exist between runs |
| users.is_sent_queries_disabled | Moderation flag changed by spec 15/19 | Full truncate |

A.2 — Listing / Room state

| What drifts | Current cause | Reset protocol |
|---|---|---|
| rooms.status (active → inactive → pending) | Specs 13, 22, 34 change status | Full truncate |
| rooms.is_fraudulent | Spec 34, Stripe radar hooks | Full truncate |
| rooms.under_review | TestingController::setRoomUnderReview | Full truncate |
| rooms.needs_edu_verification | Spec 13/21 | Full truncate |
| rooms.plan_id | Specs 18, 22 (plan changes) | Full truncate |
| Orphan rooms owned by deleted ephemeral listers | Spec 13 creates; previous lister removed but rooms remain | Full truncate: orphans gone |

A.3 — Subscription / Billing state

| What drifts | Current cause | Reset protocol |
|---|---|---|
| stripe_subscriptions.stripe_status (active, past_due, canceled) | Specs 18, 35 | Full truncate (DB side) |
| stripe_subscriptions.ends_at, trial_ends_at | Spec 18 | Full truncate |
| Past_due + retry state | Spec 35 | Full truncate |
| Stripe test-mode customers (remote) | Spec 18 creates; never deletes | Task 0c: StripeTestReset helper deletes customers matching the configured email pattern (default qa-*@*); accept orphans from other sources |
| Stripe coupons created by spec 19 | Manual test | Hands-off: coupons expire; quarterly cleanup |

A.4 — Messaging / Conversation state

| What drifts | Current cause | Reset protocol |
|---|---|---|
| conversations rows (renter → lister) | Spec 14 creates inquiries | Full truncate |
| chat_messages | Spec 14 | Full truncate |
| Duplicate-conversation detection window (1-month, per PR #4264) | PR #4264 | Full truncate: conversations gone means no duplicate detection fires |
| broadcasts + broadcast-recipient join rows | Spec 15 | Full truncate of both broadcasts and broadcast_recipients (or whatever the join-table is per schema) |

A.5 — Verification / Fraud / Auth state

| What drifts | Current cause | Reset protocol |
|---|---|---|
| email_verification_codes rows | Spec 21 + verification flow | Full truncate |
| failed_attempts, locked_at | Spec 21 lockout test | Full truncate |
| Verification documents uploaded | Spec 21 | Full truncate of DB-side record; physical file cleanup handled by the DO Spaces pruner follow-up plan (see Appendix A.6) |
| bounced_emails (Postmark webhook state) | Spec 27 webhook simulation | Full truncate |
| activity_log entries for fraud_flag_cleared, etc. | Admin actions, PR #4279 | Full truncate (but note: activity_log is a large production table; observers that read it may behave differently after truncate, so the Task 0 audit flags any reader) |
| password_reset_tokens | Spec 27 password reset flow; don't auto-clean | Full truncate |
| personal_access_tokens | Any Laravel Sanctum-based test | Full truncate |
| Notifications (notifications table: Laravel queued email/Slack) | Any notifying action | Full truncate |

A.6 — Filesystem / Client state

| What drifts | Current cause | Reset protocol |
|---|---|---|
| tests/playwright/.auth/accounts.json | Ephemeral accounts from specs 11-12 | Already cleared pre-run (rm -rf tests/playwright/.auth) |
| tests/playwright/.auth/listing.json | Spec 13 | Already cleared pre-run |
| tests/playwright/.auth/payment-listing.json | Spec 18 | Already cleared pre-run |
| test-results/ screenshots + traces | Playwright | Already cleared pre-run |
| DO Spaces (S3) uploaded photos | Spec 13/22/42 | Orphan-image pruner (split into its own follow-up plan, see B.6): runs weekly, not per-run |
| Browser state (cookies, localStorage) | Playwright | Isolated per test context; no reset needed |

A.7 — External-service state

| What drifts | Current cause | Reset protocol |
|---|---|---|
| Mailpit inbox (QA) | All email sends | Task 1 reset body: calls Mailpit's own clear API (DELETE /api/v1/messages) |
| Rollbar errors logged | Any test-triggered error | Append-only; not reset per-run. Category-filter in Rollbar review. |
| PostHog events | Analytics-emitting specs | Append-only on test project; cardinality contained by project isolation. |
| Slack notifications (e.g., new subscription) | Webhook specs | Accept as noise in #dev-feed; quarterly cleanup |
| Cloudflare cache | Affected by geo-lookup specs | TTL-based; not reset per-run |

A.8 — Schema / Config / Infrastructure state

| What drifts | Current cause | Reset protocol |
|---|---|---|
| migrations table | Deploys | Never truncate (Task 1 adds this to $skipTables) |
| Cache (Redis) | Any request | php artisan cache:clear at reset time (Task 1) |
| Session data (Redis-backed sessions) | Authenticated request | Flushed with the cache at reset time only if sessions share the cache's Redis connection; if they use a separate connection or database, cache:clear leaves them in place and an explicit flush is needed |
| Queue jobs (Redis) | Mailer queue, analytics | Task 1: flush queue via php artisan queue:clear |
| failed_jobs table (Horizon/queue failures) | Any job failure, including test-induced failures | Full truncate; otherwise failed_jobs accumulates noise across runs |
| jobs table (if queue driver is database) | Queued actions | Full truncate |
| Config cache | Any deploy | Not reset per-run; only on deploy |
| Full-text search vectors (rooms.search_vector, other tsvector columns) | Any room CRUD; Scout updates these via observers | Task 1 addendum: after truncate+reseed, run php artisan search:reindex OR explicitly trigger observers during seed so tsvectors populate. Without this, search specs return empty results from the fresh seeded rooms. |
| Scout index (external provider, if configured) | Room CRUD | php artisan scout:flush "App\Models\Room" + php artisan scout:import "App\Models\Room" at reset (or rely on Scout driver's own reset) |

Tasks ← Dimension 8

Phase A — Infrastructure (must land first, no spec changes)

Task 0: Measurement Day (1-day spike, no code changes)

Deliverables (published as docs/handoffs/2026-04-24-smoke-reset-baseline.md):

  1. Reset-duration baseline: run TestingController::reset() locally 20 times; publish p50/p95/p99.
  2. test.skip inventory: full grep -rn 'test\.skip' tests/playwright/qa-smoke/ tests/playwright/journeys/ with each call categorized as:
    • state-drift (exists because prior-run state was unknown — target for removal)
    • prod-gate (exists because the spec shouldn't run against production — keep)
    • legitimate-conditional (exists because the test genuinely doesn't apply to every run — keep)
  3. FK-transitive audit: query information_schema.table_constraints for every FK the suite transitively depends on; cross-reference against TestDatabaseSeeder's seedUsers, seedAdmins, seedRoomTypes, etc. Produce a gap list of tables the suite FKs into but the seeder doesn't populate (e.g., institutions, permission rules, plans).
  4. Rebaseline Success Metrics in the Launch Metrics table with actual numbers.
  5. Prod & QA qa-*@* collision check (per v2 review):
    -- Run against both prod read-replica AND current QA DB
    SELECT COUNT(*) AS qa_prefix_users, MIN(created_at) AS oldest, MAX(created_at) AS newest
      FROM users WHERE email LIKE 'qa-%';
    SELECT email, created_at FROM users WHERE email LIKE 'qa-%' ORDER BY created_at LIMIT 20;
    If any prod users match qa-*@*, the StripeTestReset default pattern cannot be qa-*@* — narrow it. If any QA users match but aren't in our persona list, investigate before Phase A Task 1.
  6. Existing Stripe test-mode customer survey: via Stripe CLI stripe customers list --limit 100 | grep -E 'qa-|smoke'. Count pre-existing customers the pattern would delete; confirm they're all test data (not accidentally in live mode).
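
For deliverable 1, the p50/p95/p99 figures can be computed from the 20 timing samples with the nearest-rank method. The sketch below is one reasonable definition, not necessarily the one the dashboard uses; the function name is illustrative.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentileNearestRank(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = samples.slice().sort((a, b) => a - b);
  // Integer arithmetic first to avoid floating-point surprises in the rank.
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1] as number;
}
```

With only 20 runs, nearest-rank p95 is the 19th-slowest sample and p99 is simply the slowest one, so the p99 figure in the baseline is effectively "worst observed" and worth labeling as such.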

Exit criterion: the plan's Launch Metrics table has concrete numbers (not TBDs) AND the qa-*@* collision check produces an explicit branch decision:

  • Branch A (no collisions): zero prod-DB matches for email LIKE 'qa-%' → StripeTestReset::ALLOWED_PATTERNS default is qa-*@*; proceed to Phase A Task 1.
  • Branch B (collisions present): any prod-DB matches exist → default narrows to qa-smoke-*@* (the narrower pattern that excludes qa-* human signups). Each pre-existing prod qa-% user is documented in the handoff with email, signup date, and inferred owner. Proceed to Phase A Task 1 with the narrower pattern.

"No unexpected matches" without this explicit branching would be ambiguous — Branch B defines the concrete alternative rather than blocking on "investigate further." Phase A Task 1 depends on one of these two branches being selected in the handoff doc.

Commit: docs(smoke): Task 0 measurement-day baseline for deterministic reset
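
The Branch A / Branch B exit criterion reduces to a single decision on the prod collision count. A minimal sketch, with a hypothetical function name, of how the handoff script could record that decision explicitly instead of leaving it to interpretation:

```typescript
type StripeDefaultPattern = "qa-*@*" | "qa-smoke-*@*";

interface BranchDecision {
  branch: "A" | "B";
  pattern: StripeDefaultPattern;
}

// prodQaPrefixMatches: result of SELECT COUNT(*) ... WHERE email LIKE 'qa-%' on prod.
function selectDefaultStripePattern(prodQaPrefixMatches: number): BranchDecision {
  if (!Number.isInteger(prodQaPrefixMatches) || prodQaPrefixMatches < 0) {
    throw new Error("collision count must be a non-negative integer");
  }
  return prodQaPrefixMatches === 0
    ? { branch: "A", pattern: "qa-*@*" }
    : { branch: "B", pattern: "qa-smoke-*@*" };
}
```

Rejecting a non-integer or negative count keeps a failed query (for example, a null coerced to NaN) from being read as "no collisions".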


Task 0b: CI check — seeder-migration parity

Files:

  • Create: .github/workflows/seeder-migration-check.yml

Step 1: on every PR touching database/migrations/ or database/seeders/, run php artisan migrate:fresh --seed --seeder="Database\Seeders\TestDatabaseSeeder" --force against a fresh Postgres in CI. Note: --seed enables seeding and --seeder=<FQCN> specifies which seeder class; --seed=<name> (one token) is invalid Laravel syntax. If the seeder fails (missing column default, FK violation, etc.), fail the check.

Step 2: verify that Laravel surfaces a seeder failure as a non-zero exit. Laravel's default behavior wraps seed exceptions in \RuntimeException; in some versions the seeder swallows exceptions and exits 0. Add a post-step check that prints the seeded user count, php artisan tinker --execute='echo DB::selectOne("SELECT count(*) c FROM users")->c . PHP_EOL;', and have the step fail the job when the printed count is 0. The tinker command only prints the count, so the surrounding CI step must do the comparison. This catches silent seeder aborts.

Exit criterion: the check passes on master today; fails loudly on a deliberately-broken test migration (probe commit in the PR).

Commit: ci: seeder-migration parity check to prevent reset breakage


Task 0c: Stripe test-mode reset helper (moved from Phase C)

Files:

  • Create: app/Services/Testing/StripeTestReset.php
  • Create: config/testing.php (new — holds stripe_email_pattern, persona_password; referenced by Tasks 0c, 1, 2, 3, 4c — without this file, every config('testing.*') call returns null and the fail-closed guards throw on every invocation)
  • Modify: app/Http/Controllers/TestingController.php (call from reset() when QA_RESET_ALLOWED=1)

Deletes Stripe test-mode customers whose email matches a configured pattern. Pattern source: config('testing.stripe_email_pattern') (default qa-*@*). Service exposes purgeTestCustomers() with no args; callers configure the pattern via env (QA_STRIPE_EMAIL_PATTERN). Accepts orphans from other sources. Gated behind reset endpoint — only callable from within TestingController::reset().

Fail-closed pattern validation — literal allowlist (Ahmed Concern #3, refined via plan review v2). Earlier drafts used a regex-blacklist approach (reject patterns matching "dangerous" shapes). The v2 review flagged this as over-engineered: every blacklist is a test of "did I think of every bad pattern?" whereas an allowlist asks "is this exactly one of my known-good patterns?" — strictly safer, ~25 lines less code, no edge-case attack surface.

StripeTestReset's constructor validates the pattern against a hard-coded allowlist:

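namespace App\Services\Testing;

use InvalidArgumentException;

final class StripeTestReset {
    private const ALLOWED_PATTERNS = [
        'qa-*@*',              // default: matches all QA personas + ephemerals
        'qa-smoke-*@*',        // narrower: just spec-11/12 ephemerals
        'qa-smoketest@*',      // single existing spec user
        'qa-proplan-test@*',   // single existing spec user
    ];

    // Declared explicitly: dynamic properties are deprecated as of PHP 8.2.
    private string $pattern;

    // The container cannot autowire a bare string, so callers must pass the pattern
    // explicitly or a service provider must bind it from config('testing.stripe_email_pattern').
    public function __construct(string $pattern) {
        if (!in_array($pattern, self::ALLOWED_PATTERNS, true)) {
            throw new InvalidArgumentException(
                "Stripe email pattern '{$pattern}' is not in the allowlist. " .
                "Adding a new pattern requires a PR to this file; this is intentional, " .
                "to force review of any widening of Stripe purge scope."
            );
        }
        $this->pattern = $pattern;
    }
}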

Adding a new pattern requires editing this file (code review + tests run). A misconfigured env var gets rejected loudly (HTTP 500 on reset). No regex parsing, no wildcard semantics to defend — the check is set-membership, which is trivial to reason about.
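
The allowlist only decides which patterns are legal; the purge still has to decide which customer emails a legal pattern covers. The plan does not spell out the wildcard semantics, so the sketch below assumes the simplest reading: `*` matches any run of characters and everything else is literal. Function names are illustrative.

```typescript
// Mirror of the PHP allowlist, for illustration only.
const allowedStripePatterns: readonly string[] = [
  "qa-*@*",
  "qa-smoke-*@*",
  "qa-smoketest@*",
  "qa-proplan-test@*",
];

// Converts a glob with `*` wildcards into an anchored regular expression.
function globToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters except `*`
    .replace(/\*/g, ".*");                 // `*` becomes "any run of characters"
  return new RegExp(`^${escaped}$`);
}

// Fail-closed: a pattern outside the allowlist is rejected before any matching happens.
function emailMatchesPattern(email: string, pattern: string): boolean {
  if (allowedStripePatterns.indexOf(pattern) === -1) {
    throw new Error(`pattern not allowlisted: ${pattern}`);
  }
  return globToRegExp(pattern).test(email);
}
```

Under this reading, `qa-*@*` covers every persona and every `qa-smoke-<timestamp>-…` ephemeral, while `qa-smoke-*@*` covers only the ephemerals, which is why the narrower pattern leaves persona-owned Stripe customers in place.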

Stripe API hard-timeout (Ahmed Concern #4): each call into Stripe's API (list, delete) is wrapped in a 5-second hard cap. On timeout, the helper:

  • logs a warning with the customer count not yet purged
  • returns a partial-success result ['purged' => N, 'skipped_timeout' => M]
  • does NOT throw — the rest of the reset body (which still holds the advisory lock) completes normally, the advisory lock is released, and residual Stripe customers get cleaned up next run
  • "Fail-open on Stripe, fail-closed on gate" is the deliberate asymmetry: missing a Stripe purge is annoying, but holding the DB lock for a multi-second Stripe tail blocks the entire smoke queue.

Structured result:

public function purgeTestCustomers(): array {
    // Shape of the result; the numbers below are illustrative, not real counts.
    return [
        'pattern_used' => $this->pattern,
        'purged' => 12,
        'skipped_timeout' => 0,
        'duration_ms' => 847,
    ];
}

TestingController::reset() includes this result in its JSON response so the smoke dashboard can surface Stripe-reset health.

Rationale for moving to Phase A: this is infrastructure (belongs with the reset contract), not a per-spec migration.

Commit: feat(testing): Stripe test-mode reset helper bundled with DB reset


Task 1: Three-layer gate + advisory lock on reset endpoint

Files:

  • Modify: bootstrap/app.php:26-30 (Layer 1 — env-var-only route registration)
  • Modify: bootstrap/app.php:52-54 (CSRF exception extended to QA flag)
  • Create: app/Http/Middleware/EnsureQaResetAllowed.php (Layer 2 — per-request hostname check)
  • Modify: routes/testing.php (attach middleware to group)
  • Modify: app/Http/Controllers/TestingController.php:13-19 (remove the production|staging abort). Layer 3 becomes token check only; environment gating is handled by Layer 1 (boot env-var) and Layer 2 (middleware hostname). Keeping the constructor abort on production would make the whole stack unreachable on QA (where APP_ENV=production).
  • Modify: app/Http/Controllers/TestingController.php:24-54 (advisory lock, Mailpit clear, Stripe reset call, cache/queue flush)
  • Add import: use Illuminate\Support\Facades\Http; to TestingController.php (needed for Mailpit clear call)
  • Create: tests/Feature/TestingResetGateTest.php (all gate tests via HTTP Host header — no production test seams)

Step 1: Write failing tests (gate behavior via HTTP Host header).

// tests/Feature/TestingResetGateTest.php
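public function test_reset_blocked_on_wrong_hostname(): void {
    // FUNCTIONAL RISKS: unintended reset on staging/prod; Layer 2 hostname gate must block.
    config(['app.testing_token' => 'valid', 'app.qa_reset_allowed' => true]);

    // PHPUnit boots with APP_ENV=testing, which the Layer 2 middleware treats as a
    // local/CI bypass. Switch the container env so the hostname check is exercised.
    $this->app['env'] = 'production';

    // Use an absolute URL: the test client derives HTTP_HOST from the URL's host,
    // so an HTTP_HOST server variable on its own would be overwritten.
    $response = $this->withHeaders(['X-Testing-Token' => 'valid'])
        ->postJson('http://staging.rotatingroom.com/testing/reset');

    $response->assertStatus(403);
    $this->assertDatabaseHas('users', ['email' => 'ahmed@rotatingroom.com']); // not wiped
}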

public function test_reset_allowed_on_qa_host_with_flag_and_token(): void {
    config(['app.testing_token' => 'valid', 'app.qa_reset_allowed' => true]);

    $response = $this->withHeaders(['X-Testing-Token' => 'valid'])
        ->withServerVariables(['HTTP_HOST' => 'qa.rotatingroom.com'])
        ->postJson('/testing/reset');

    $response->assertStatus(200);
}

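public function test_reset_blocked_without_flag(): void {
    // Under PHPUnit (APP_ENV=testing) Layer 1 always registers the routes, so a missing
    // flag is rejected by the Layer 2 middleware with 403. The 404 case (routes never
    // registered) only occurs when the app boots outside testing without QA_RESET_ALLOWED=1.
    config(['app.testing_token' => 'valid', 'app.qa_reset_allowed' => false]);

    $response = $this->withHeaders(['X-Testing-Token' => 'valid'])
        ->withServerVariables(['HTTP_HOST' => 'qa.rotatingroom.com'])
        ->postJson('/testing/reset');

    $response->assertStatus(403);
}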

public function test_reset_blocked_with_bad_token(): void {
    config(['app.testing_token' => 'valid', 'app.qa_reset_allowed' => true]);

    $response = $this->withHeaders(['X-Testing-Token' => 'wrong'])
        ->withServerVariables(['HTTP_HOST' => 'qa.rotatingroom.com'])
        ->postJson('/testing/reset');

    $response->assertStatus(403);
}

public function test_concurrent_reset_returns_423_locked(): void {
    // Hold pg_advisory_lock in a separate DB connection; second caller must return 423.
    // See Task 1 Step 3 implementation for lock key.
    config(['app.testing_token' => 'valid', 'app.qa_reset_allowed' => true]);
    $lockKey = crc32('qa-testing-reset');

    DB::connection('primary-raw')->select('SELECT pg_advisory_lock(?)', [$lockKey]);
    try {
        $response = $this->withHeaders(['X-Testing-Token' => 'valid'])
            ->withServerVariables(['HTTP_HOST' => 'qa.rotatingroom.com'])
            ->postJson('/testing/reset');
        $response->assertStatus(423);
        $this->assertEquals('30', $response->headers->get('Retry-After'));
    } finally {
        DB::connection('primary-raw')->select('SELECT pg_advisory_unlock(?)', [$lockKey]);
    }
}

Step 2: Run tests → FAIL (gate + lock not implemented).

Step 3: Implement all three layers + advisory lock.

Layer 1 — bootstrap route registration (env-var only, boot time):

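// bootstrap/app.php:26-30
// env() is read at boot. Once `php artisan config:cache` has run, the .env file is not
// loaded, so QA_RESET_ALLOWED must be set as a real process-level environment variable.
then: function () {
    if (env('APP_ENV') === 'testing' || env('QA_RESET_ALLOWED') === '1') {
        Route::middleware(['web', \App\Http\Middleware\EnsureQaResetAllowed::class])
            ->group(base_path('routes/testing.php'));
    }
},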

CSRF exception (bootstrap/app.php:52-54):

if (env('APP_ENV') === 'testing' || env('QA_RESET_ALLOWED') === '1') {
    $middleware->validateCsrfTokens(except: ['testing/*']);
}

Layer 2 — per-request hostname middleware:

// app/Http/Middleware/EnsureQaResetAllowed.php
public function handle(Request $request, Closure $next): mixed {
    $allowed = config('app.qa_reset_allowed') === true;
    $onQaHost = $request->getHost() === 'qa.rotatingroom.com'
                || app()->environment('testing'); // local/CI bypass

    if (!$allowed || !$onQaHost) {
        abort(403, 'Testing endpoints are not available for this host.');
    }

    return $next($request);
}

Layer 3 — controller with advisory lock + expanded reset:

// app/Http/Controllers/TestingController.php
public function reset(Request $request): JsonResponse {
    $this->validateToken($request);

    $lockKey = crc32('qa-testing-reset');
    $acquired = DB::selectOne('SELECT pg_try_advisory_lock(?) AS got', [$lockKey])->got;
    if (!$acquired) {
        return response()
            ->json(['error' => 'reset in progress'], 423)
            ->header('Retry-After', '30');
    }

    try {
        Schema::disableForeignKeyConstraints();
        $skipTables = ['migrations', 'spatial_ref_sys', 'geometry_columns', 'geography_columns', 'raster_columns', 'raster_overviews'];
        foreach (Schema::getTableListing() as $t) {
            if (!in_array($t, $skipTables, true)) DB::table($t)->truncate();
        }
        Schema::enableForeignKeyConstraints();

        Artisan::call('db:seed', ['--class' => 'Database\\Seeders\\TestDatabaseSeeder', '--force' => true]);
        Artisan::call('cache:clear');
        Artisan::call('queue:clear');

        // Mailpit clear (A.7)
        if ($url = config('services.mailpit.url')) {
            try { Http::timeout(3)->delete("{$url}/api/v1/messages"); }
            catch (\Throwable $e) { /* non-fatal: Mailpit being unreachable shouldn't block reset */ }
        }

        // Stripe test reset (Task 0c integration point).
        // Hard 5s cap per Stripe API call inside the helper; fail-open on timeout
        // (see Ahmed Concern #4 — a Stripe tail-latency hang would otherwise hold
        // the advisory lock for seconds and block the queued smoke runs).
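        // Pass the pattern explicitly: the container cannot autowire a bare string argument.
        $stripeResult = app()->make(\App\Services\Testing\StripeTestReset::class, [
            'pattern' => (string) config('testing.stripe_email_pattern'),
        ])->purgeTestCustomers();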

        return response()->json([
            'status' => 'reset_complete',
            'stripe' => $stripeResult,  // {pattern_used, purged, skipped_timeout, duration_ms}
        ]);
    } finally {
        DB::statement('SELECT pg_advisory_unlock(?)', [$lockKey]);
    }
}

Lock-scope rationale (addresses Ahmed Concern #4): the advisory lock covers the full reset body, including the Stripe purge, but StripeTestReset::purgeTestCustomers() internally enforces a 5s hard cap per API call and returns partial-success on timeout (it never throws). This keeps atomicity: a second caller sees the full post-reset state, not half of it. It also bounds the worst-case lock hold to (DB truncate + reseed + Mailpit + Stripe timeout fallback) ≈ 20-25s + 5s = ≤30s p95, assuming the helper stops purging at the first timed-out call; the cap is per call, so a long run of slow-but-successful calls is not covered by it. Moving Stripe outside the lock would be simpler but re-introduces the race: a second caller could see the DB in its reset state while Stripe still has prior-run customers. The plan accepts the broader lock scope plus the per-call timeout as the right tradeoff.
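
On the caller side, the 423 + Retry-After contract implies a small amount of logic in the smoke runner. The sketch below shows one way to interpret a reset response as a pure function; the `ResetOutcome` type, the function name, and the 30-second fallback when the header is missing are assumptions, not part of the plan's runner.

```typescript
type ResetOutcome =
  | { kind: "done" }
  | { kind: "retry"; delayMs: number }
  | { kind: "abort"; reason: string };

// Maps the reset endpoint's HTTP status (and Retry-After header, if any) to the
// runner's next step. Anything other than 200 or 423 aborts the smoke run.
function interpretResetResponse(httpStatus: number, retryAfterHeader: string | null): ResetOutcome {
  if (httpStatus === 200) return { kind: "done" };
  if (httpStatus === 423) {
    const seconds = retryAfterHeader === null ? NaN : Number(retryAfterHeader);
    // Fall back to 30s when the header is absent or not a positive number.
    const delayMs = Number.isFinite(seconds) && seconds > 0 ? seconds * 1000 : 30000;
    return { kind: "retry", delayMs };
  }
  return { kind: "abort", reason: `unexpected reset status ${httpStatus}` };
}
```

Treating 403 and 404 as hard aborts rather than retries matters here: both indicate a gate misconfiguration, and retrying would only delay the failure.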

Remove the fabricated hostname_override code path from any earlier plan draft — test seam is HTTP Host header only, no production-code branches.
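
The combined behavior of the three layers can be summarized as a decision table. The following sketch models it as a pure function so the expected status for each combination is easy to see; it is a simplification that treats the boot-time env flag and the `app.qa_reset_allowed` config value as one boolean, and the names are illustrative.

```typescript
interface GateInput {
  appEnv: string;              // APP_ENV at boot
  qaResetAllowedFlag: boolean; // QA_RESET_ALLOWED=1 (boot flag and config value, assumed equal)
  requestHost: string;         // Host of the incoming request
  tokenValid: boolean;         // X-Testing-Token matches
}

// Expected HTTP status for POST /testing/reset given the three gate layers.
function gateStatus(g: GateInput): number {
  const isTesting = g.appEnv === "testing";
  // Layer 1: routes are registered only for APP_ENV=testing or when the flag is set.
  if (!isTesting && !g.qaResetAllowedFlag) return 404;
  // Layer 2: flag must be on, and the host must be QA (testing env bypasses the host check).
  const onQaHost = g.requestHost === "qa.rotatingroom.com" || isTesting;
  if (!g.qaResetAllowedFlag || !onQaHost) return 403;
  // Layer 3: token check.
  if (!g.tokenValid) return 403;
  return 200;
}
```

One consequence visible in the table: under APP_ENV=testing the routes always exist, so a missing flag shows up as 403 from Layer 2, and 404 is only observable when the app boots outside testing without the flag.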

Step 4: Run tests → PASS (all 5 gate tests + reset body).

Step 5: Commit. feat(testing): three-layer gate + advisory lock + expanded reset body


Task 2: Expand TestDatabaseSeeder with persona catalog + scenario fixtures

Files:

  • Modify: database/seeders/TestDatabaseSeeder.php (add seedEduPersonas, seedScenarios)
  • Create: database/seeders/data/personas.php (canonical persona list)
  • Modify: tests/playwright/utils/test-data.ts (mirror persona catalog, add .edu variants)
  • Create: tests/Feature/PersonaCatalogParityTest.php (CI drift check)
  • Create: tests/fixtures/scenarios.md (human-readable scenario documentation)

Step 1: Write parity test by evaluating the TS module, not regexing it (per Assumption 8).

Instead of regexing the TS file (fragile against as const, satisfies, nested objects, object shorthand), run a small Node helper that imports the module and emits its persona keys as JSON. The PHP test invokes the helper via Symfony Process, using the array-argument constructor so no shell parsing is involved (Symfony Process is already used elsewhere in the codebase), and compares JSON arrays.

// tests/Feature/PersonaCatalogParityTest.php
public function test_php_and_ts_persona_lists_match(): void {
    // FUNCTIONAL RISKS: persona drift between PHP seeder and TS test-data causes
    // "correct persona but wrong state" failures; CI must block drift.
    $phpPersonas = collect(include database_path('seeders/data/personas.php'))->keys()->sort()->values()->all();

    $process = new \Symfony\Component\Process\Process(['npx', '--yes', 'tsx', base_path('tests/utils/extract-persona-keys.mjs')]);
    $process->run();
    $output = trim($process->getOutput());
    $this->assertNotEmpty($output, 'TS persona extractor returned empty output: ' . $process->getErrorOutput());
    $tsPersonas = json_decode($output, true);
    $this->assertIsArray($tsPersonas, 'TS extractor output was not valid JSON: ' . $output);

    $this->assertEquals($phpPersonas, $tsPersonas, 'PHP persona list must match TS persona list');
}

Create tests/utils/extract-persona-keys.mjs:

// tests/utils/extract-persona-keys.mjs
import { personas } from '../playwright/utils/test-data.ts';
console.log(JSON.stringify(Object.keys(personas).sort()));

CI installs tsx via npm install -g tsx or uses npx --yes tsx per-invocation. This is robust against the TS idioms listed above because the module is actually evaluated (tsx transpiles it and Node runs it), not parsed with a regex. Symfony Process is used rather than PHP's shell functions because it's the project's standard (safer argument passing, existing dependency).

Step 2: Run → FAIL (file doesn't exist).

Step 3: Create database/seeders/data/personas.php — self-documenting names, shared password, qa- prefix on every email so StripeTestReset's purge pattern catches them (Ahmed Concern #5):

<?php
// Every persona email starts with `qa-` to match QA_STRIPE_EMAIL_PATTERN (`qa-*@*`)
// and the broader "recognizable QA user" convention (per 2026-04-24 decision).
// Password for every persona: `RR4Life!` (stored in config('testing.persona_password')).
// Keys are camelCase for TS parity; emails are kebab-case for URL/email readability.
return [
    // --- Stable non-.edu users ---
    'qaLister' => [
        'id' => 1001, 'email' => 'qa-lister@rotatingroom.com',
        'email_verified_at' => 'now', 'active' => 1,
        'owns_baseline_rooms' => true,  // owner of all 60 rooms + scenario fixtures 9001-9003
    ],
    'qaSupport' => [
        'id' => 1002, 'email' => 'qa-support@rotatingroom.com',
        'active' => 1,
    ],
    'qaFounder' => [
        'id' => 1003, 'email' => 'qa-founder@rotatingroom.com',
        'active' => 1,
    ],
    'qaDemo' => [
        'id' => 1004, 'email' => 'qa-demo@rotatingroom.com',
        'email_verified_at' => 'now', 'active' => 1,
    ],
    'qaVerifyExpired' => [
        'id' => 1005, 'email' => 'qa-verify-expired@rotatingroom.com',
        'email_verified_at' => '-7 months', 'active' => 1,
    ],
    'qaBlocked' => [
        'id' => 1006, 'email' => 'qa-blocked@rotatingroom.com',
        'active' => 1, 'blocked' => 1,
    ],
    'qaInactive' => [
        'id' => 1007, 'email' => 'qa-inactive@rotatingroom.com',
        'active' => 0,
    ],
    // --- .edu verification personas ---
    'qaEduUnverified' => [
        'id' => 1010, 'email' => 'qa-edu-unverified@example.edu',
        'email_verified_at' => null, 'active' => 1,
        'owns_pending_free_listing' => 9004,  // scenario fixture room ID
    ],
    'qaEduVerified' => [
        'id' => 1011, 'email' => 'qa-edu-verified@example.edu',
        'email_verified_at' => 'now', 'active' => 1,
    ],
    'qaEduLocked' => [
        'id' => 1012, 'email' => 'qa-edu-locked@example.edu',
        'email_verification_failed_attempts' => 5,
        'email_verification_locked_at' => 'now',
    ],
    'qaEduVerifying' => [
        'id' => 1013, 'email' => 'qa-edu-verifying@example.edu',
        'email_verification_failed_attempts' => 3,  // exercises attempts 4-5
    ],
];

Step 3b: Create database/seeders/data/scenarios.php — non-user fixtures (room-level state):

<?php
return [
    'paidDraftListing' => ['room_id' => 9001, 'owner_id' => 1001, 'plan' => 'rr-monthly', 'status' => 'pending_payment'],
    'fraudFlaggedRoom' => ['room_id' => 9002, 'owner_id' => 1001, 'is_fraudulent' => true, 'status' => 'inactive'],
    'underReviewRoom' => ['room_id' => 9003, 'owner_id' => 1001, 'under_review' => true, 'status' => 'active'],
    'eduPendingFreeListing' => ['room_id' => 9004, 'owner_id' => 1010, 'plan' => 'rr-free', 'status' => 'pending_free'],
];

Step 3c: Add persona-email-matches-purge-pattern invariant (Ahmed Concern #5 — Task 3 probe extension):

// tests/Feature/PersonaEmailPurgePatternTest.php
public function test_every_persona_email_matches_stripe_purge_pattern(): void {
    // FUNCTIONAL RISKS: if a persona email doesn't match QA_STRIPE_EMAIL_PATTERN,
    // specs 18/35 create Stripe customers for them that never get cleaned up,
    // re-creating the exact Stripe-side drift this plan is trying to eliminate.
    $personas = include database_path('seeders/data/personas.php');
    $pattern = config('testing.stripe_email_pattern', 'qa-*@*');
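    // Convert wildcard pattern to regex (same intent as StripeTestReset). Escape dots BEFORE
    // expanding "*": str_replace applies its pairs in order, so the reverse order would
    // re-escape the "." it had just inserted and turn "qa-*@*" into "qa-\.*@\.*".
    $regex = '/^' . str_replace(['.', '*'], ['\.', '.*'], $pattern) . '$/i';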

    foreach ($personas as $key => $p) {
        $this->assertMatchesRegularExpression(
            $regex, $p['email'],
            "Persona '{$key}' email '{$p['email']}' does not match Stripe purge pattern '{$pattern}'. " .
            "Specs that create Stripe customers for this persona will leak test-mode data across runs."
        );
    }
}
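The wildcard-to-regex conversion is order-sensitive: literal characters have to be escaped before `*` is expanded, otherwise the `.` inside `.*` gets escaped too. A minimal TypeScript sketch of the same idea for the Playwright side follows. It assumes `*` is the only wildcard, and `wildcardToRegExp` is a hypothetical helper name, not something StripeTestReset is known to expose.

```typescript
// Hypothetical helper: converts a purge pattern such as "qa-*@*" into a RegExp.
// Assumption: "*" is the only wildcard; every other character is literal.
function wildcardToRegExp(pattern: string): RegExp {
  const escaped = pattern
    // Step 1: escape every regex metacharacter, including "." and "*".
    .replace(/[.*+?^${}()|[\]\\]/g, "\\$&")
    // Step 2: turn each escaped "*" back into "match anything".
    .replace(/\\\*/g, ".*");
  return new RegExp(`^${escaped}$`, "i");
}

const purge = wildcardToRegExp("qa-*@*");
console.log(purge.test("qa-edu-locked@example.edu")); // true
console.log(purge.test("demo@rotatingroom.com"));     // false
```

Escaping first and expanding second keeps a literal dot in a pattern from matching arbitrary characters.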

Step 3d: Create docs/QA_PERSONAS.md — team-facing cheat-sheet:

# QA Personas — canonical test users on the QA server

All passwords: `RR4Life!` (same as prototype/staging QA credentials).
All reset to this state before every `/qa-smoke` run.

| Email | ID | State | When would I log in as this user? |
|-------|----|----|------|
| qa-lister@rotatingroom.com | 1001 | verified, owns 60 baseline rooms | "I want to see the lister dashboard / edit listing flows" |
| qa-demo@rotatingroom.com | 1004 | verified, non-.edu | "I want to test as a regular authenticated user" |
| qa-edu-unverified@example.edu | 1010 | unverified .edu with pending free listing | "I want to test the first-time .edu verification flow" |
| qa-edu-verified@example.edu | 1011 | verified .edu | "I want to test post-verification state" |
| qa-edu-locked@example.edu | 1012 | 5 failed attempts, locked | "I want to test the lockout UI" |
| qa-edu-verifying@example.edu | 1013 | 3 failed attempts | "I want to test attempts 4-5 of verification" |
| qa-blocked@rotatingroom.com | 1006 | blocked | "I want to test blocked-user paths" |
| qa-verify-expired@rotatingroom.com | 1005 | verified 7mo ago | "I want to test re-verification prompt" |
| qa-inactive@rotatingroom.com | 1007 | inactive | "I want to test inactive-account paths" |
| qa-support@rotatingroom.com | 1002 | active | "I want to test ops/support flows" |
| qa-founder@rotatingroom.com | 1003 | active | "I want to test founder/strategic flows" |
| qa-admin@rotatingroom.com (admin) | admin #1 | active | "I want to log in as Backpack admin" |

**Linked from:** `CLAUDE.md` (QA section), smoke dashboard header, `#qa` channel topic.

Any change that adds or removes a persona must update this doc in the same PR. The parity test from Step 1 only compares persona keys between personas.php and test-data.ts; it does not read this file.

Step 4: Modify TestDatabaseSeeder::run() to iterate the catalog:

$catalog = include database_path('seeders/data/personas.php');
foreach ($catalog as $key => $config) {
    $this->seedPersona($key, $config);
}

Step 5: Mirror into tests/playwright/utils/test-data.ts. Keys and emails must match personas.php exactly (the parity test enforces key parity; per Codex v2 P1 finding). Every email carries the qa- prefix to match the Stripe purge pattern. Password comes from the shared constant, not per-persona:

// tests/playwright/utils/test-data.ts
export const PERSONA_PASSWORD = 'RR4Life!';  // shared across all personas; matches config('testing.persona_password')

export const personas = {
  qaLister:        { email: 'qa-lister@rotatingroom.com',         password: PERSONA_PASSWORD, name: 'QA Lister', canSendQueries: false },
  qaSupport:       { email: 'qa-support@rotatingroom.com',        password: PERSONA_PASSWORD, name: 'QA Support', canSendQueries: false },
  qaFounder:       { email: 'qa-founder@rotatingroom.com',        password: PERSONA_PASSWORD, name: 'QA Founder', canSendQueries: false },
  qaDemo:          { email: 'qa-demo@rotatingroom.com',           password: PERSONA_PASSWORD, name: 'QA Demo', canSendQueries: false },
  qaVerifyExpired: { email: 'qa-verify-expired@rotatingroom.com', password: PERSONA_PASSWORD, name: 'QA Verify Expired', canSendQueries: false },
  qaBlocked:       { email: 'qa-blocked@rotatingroom.com',        password: PERSONA_PASSWORD, name: 'QA Blocked', canSendQueries: false },
  qaInactive:      { email: 'qa-inactive@rotatingroom.com',       password: PERSONA_PASSWORD, name: 'QA Inactive', canSendQueries: false },
  qaEduUnverified: { email: 'qa-edu-unverified@example.edu',      password: PERSONA_PASSWORD, name: 'QA Edu Unverified', canSendQueries: true },
  qaEduVerified:   { email: 'qa-edu-verified@example.edu',        password: PERSONA_PASSWORD, name: 'QA Edu Verified', canSendQueries: true },
  qaEduLocked:     { email: 'qa-edu-locked@example.edu',          password: PERSONA_PASSWORD, name: 'QA Edu Locked', canSendQueries: true },
  qaEduVerifying:  { email: 'qa-edu-verifying@example.edu',       password: PERSONA_PASSWORD, name: 'QA Edu Verifying', canSendQueries: true },
} as const;

Step 6: Run parity test → PASS.

Step 6a: Post-seed state verification test (orchestrator shock-test addition). Beyond the parity test (keys match between PHP and TS) and the CI migration-seeder test (seeder runs without errors), add a concrete assertion suite that the seeded state is exactly right:

// tests/Feature/SeededStateTest.php
public function test_seeded_users_exist_with_correct_state(): void {
    // FUNCTIONAL RISKS: seeder typo (wrong ID, missing field, wrong flag) silently
    // corrupts baseline state; every downstream spec inherits the corruption.
    $this->artisan('db:seed', ['--class' => 'Database\\Seeders\\TestDatabaseSeeder', '--force' => true]);

    // Spot-check each persona by exact state
    $this->assertDatabaseHas('users', ['id' => 1010, 'email' => 'qa-edu-unverified@example.edu', 'email_verified_at' => null, 'active' => 1]);
    $this->assertDatabaseHas('users', ['id' => 1012, 'email' => 'qa-edu-locked@example.edu', 'email_verification_failed_attempts' => 5]);
    $this->assertDatabaseHas('users', ['id' => 1006, 'email' => 'qa-blocked@rotatingroom.com', 'blocked' => 1]);
    // ... all 11 personas + admin
    $this->assertDatabaseHas('admin_users', ['email' => 'qa-admin@rotatingroom.com']);

    // Scenario fixture rooms
    $this->assertDatabaseHas('rooms', ['id' => 9001, 'user_id' => 1001, 'status' => 'pending_payment']);
    $this->assertDatabaseHas('rooms', ['id' => 9002, 'user_id' => 1001, 'is_fraudulent' => true]);
    $this->assertDatabaseHas('rooms', ['id' => 9004, 'user_id' => 1010]);

    // Baseline rooms count
    $this->assertEquals(60, \App\Models\Room::where('id', '<=', 60)->count());

    // Sequence advance
    $this->assertGreaterThanOrEqual(1014, DB::selectOne("SELECT nextval('users_id_seq') AS n")->n);
}

Runs in the same CI slot as the migration-seeder check. Catches seeder typos before they reach QA.

Step 6b: Verify admin seeding. qa-admin@rotatingroom.com is the Backpack admin for QA (appears in Appendix B.1). It lives in the admin_users table, not users, so it's not part of personas.php. TestDatabaseSeeder::seedAdmins() must still run and produce this exact email. Task 2 explicitly checks this and updates the seeder if the existing email differs from qa-admin@rotatingroom.com (per the naming convention).

Step 6c: Sequence advancement. After seeding, advance PG sequences past max seeded ID so spec inserts don't collide:

DB::statement("SELECT setval('users_id_seq', 1013)");        // next insert → 1014
DB::statement("SELECT setval('rooms_id_seq', 9004)");        // next insert → 9005
// plus admin_users, cities, stripe_plans, allows, restrictions per Appendix B.7

Step 6d: Team-facing documentation. In addition to docs/QA_PERSONAS.md (from Step 3d), Task 2 links the cheat-sheet from every discovery surface an engineer would hit:

  • CLAUDE.md Testing section — add a "QA Personas" subsection with link to docs/QA_PERSONAS.md
  • #qa Slack channel topic — update via Slack API to reference the cheat-sheet URL
  • Smoke dashboard header (tests/playwright/qa-smoke/dashboard-golden.html) — add a small "QA personas: [link]" header note
  • .claude/skills/qa-smoke/SKILL.md — reference under a new "Known personas" subsection

Without these links, the cheat-sheet rots from day 1 — a new engineer who doesn't know it exists will never find it.

Step 6e: Password rotation runbook. Create docs/handoffs/qa-password-rotation.md covering the rotation procedure (referenced by Risk 11):

  1. Generate new password (21+ chars, same entropy class as current prototype/staging QA credentials).
  2. Update config/testing.php's persona_password value.
  3. If QA server .env overrides TESTING_PERSONA_PASSWORD, update it there too.
  4. Update docs/QA_PERSONAS.md header.
  5. Re-deploy to QA.
  6. Trigger /testing/reset to re-seed with new bcrypt.
  7. Run a small smoke subset (npx playwright test --grep qa-demo) to verify login still works.
  8. Slack #qa with the new password (or reference to the secrets manager where it's stored).

Step 7: Commit. feat(testing): persona catalog with .edu variants + CI parity check + discoverability links


Task 3: Gate-bypass + Stripe-pattern + persona-parity adversarial probes

Files:

  • Create: tests/Feature/TestingResetGateBypassTest.php (gate layers)
  • Create: tests/Feature/StripeTestResetPatternGuardTest.php (Ahmed Concern #3 — pattern validation)
  • Create: tests/Feature/PersonaEmailPurgePatternTest.php (Ahmed Concern #5 — persona ↔ purge invariant; already referenced in Task 2 but the probe suite is this task's responsibility)

Coverage:

  1. Gate-bypass probes: every combination of missing gate condition against a simulated staging/prod environment. Hard-blocks any request that shouldn't succeed (see Task 1 test file — Task 3 extends it with fuzzing over unexpected combos).

  2. Stripe pattern guard (Ahmed Concern #3; allowlist implementation per v2 review): StripeTestReset::ALLOWED_PATTERNS is a hard-coded 4-entry set (qa-*@*, qa-smoke-*@*, qa-smoketest@*, qa-proplan-test@*). Test matrix covers membership, not pattern shape:

    • Allowed (pass): each of the 4 entries verbatim — qa-*@*, qa-smoke-*@*, qa-smoketest@*, qa-proplan-test@*.
    • Rejected (throw InvalidArgumentException): null, "", " ", "*", "*@*", "@*", ".+", "qa-*@example.com" (not in allowlist — reject even though qa- prefix present), "qa-smoke-*@example.com" (same — narrow variant not in list), "edu-*@*" (no qa- prefix).
    • Test asserts HTTP response is 500 (not 200) when endpoint is called with a rejected pattern. The "add a new pattern requires a PR" invariant is enforced here: any pattern outside the allowlist fails loudly at construction, and widening the allowlist requires editing StripeTestReset.php.
  3. Persona-email purge-pattern invariant (new, Ahmed Concern #5): the PersonaEmailPurgePatternTest described in Task 2 Step 3c — every seeded persona email must match the default QA_STRIPE_EMAIL_PATTERN. Prevents the failure mode where a persona added later doesn't match the purge pattern and accumulates Stripe customers across runs.
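The membership-not-shape rule from item 2 can be sketched as below. The four entries are copied from the plan; the function name, the error type, and the TypeScript rendering are illustrative assumptions, since the real guard lives in StripeTestReset.php.

```typescript
// Sketch of membership-based validation (names are hypothetical).
const ALLOWED_PATTERNS: ReadonlySet<string> = new Set([
  "qa-*@*",
  "qa-smoke-*@*",
  "qa-smoketest@*",
  "qa-proplan-test@*",
]);

function assertAllowedPattern(pattern: unknown): string {
  // Exact string membership only: no trimming, no case folding, no shape check.
  if (typeof pattern !== "string" || !ALLOWED_PATTERNS.has(pattern)) {
    throw new RangeError(`Stripe purge pattern not in allowlist: ${JSON.stringify(pattern)}`);
  }
  return pattern;
}

// A narrower-looking variant is still rejected because it is not in the set.
try {
  assertAllowedPattern("qa-*@example.com");
} catch (err) {
  console.log((err as Error).message);
}
```

Because the check is a set lookup, widening the accepted patterns means editing the set itself, which is the "requires a PR" invariant described above.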

Implementation blocker (flagged in Round 3 review): the EnsureQaResetAllowed middleware contains a || app()->environment('testing') bypass that makes the hostname check a no-op during PHPUnit feature tests. If Task 3 naively calls withServerVariables while APP_ENV=testing, the middleware passes via the bypass rather than the hostname check — so the adversarial probe tests nothing. Resolution options:

  1. Mock the environment in each probe test: $this->app->detectEnvironment(fn () => 'production') before the request. Asserts the hostname gate fires in a non-testing env.
  2. Add a narrow config flag, config('app.testing_host_bypass') === true, defaulting to false and explicitly set to true only in tests that need the bypass (gate tests in Task 1 set it to false).
  3. Drop the bypass entirely and require all feature tests to set HTTP_HOST=qa.rotatingroom.com when calling /testing/reset. Simplest — no production-code branch.

Preferred: Option 3 (simplest, no prod-code branch). Update Task 1's EnsureQaResetAllowed middleware to remove the testing bypass, and update any existing feature test that hits /testing/reset to set the Host header.

Commit: test(testing): adversarial gate-bypass probe; drop testing-env bypass


Phase B — Smoke runner integration

Task 4: Call reset from smoke runner before each run

Files:

  • Modify: .claude/skills/qa-smoke/SKILL.md section 2 ("Clean previous state")
  • Modify: scripts/smoke/run.sh (if exists) or inline in skill

Step 1: Add reset step to skill — QA-only, gated on $TARGET (Codex P1 finding).

/qa-smoke is a multi-environment entry point (qa, staging, production, local). Reset MUST NOT run against any env except QA. Hard-gate the reset call on $TARGET:

# Section 2.1: reset target database to baseline state — QA ONLY.
if [ "$TARGET" != "qa" ]; then
  echo "Skipping reset: target is '$TARGET' (reset only runs against qa)."
  echo "  - staging/production: reset is blocked by multiple gates AND is catastrophic; never automate."
  echo "  - local: developer manages their own DB reset via 'php artisan migrate:fresh --seed --seeder=Database\\Seeders\\TestDatabaseSeeder --force'."
else
  echo "Resetting QA database..."
  RESET_START=$(date +%s)

  # Test curl's exit status directly in the `if`. Any intervening command (date, echo,
  # assignment) would overwrite $? before a separate check could read it.
  if ! curl -sf -X POST "https://qa.rotatingroom.com/testing/reset" \
    -H "X-Testing-Token: $QA_TESTING_TOKEN" \
    -H "Content-Type: application/json" \
    --max-time 60; then
    echo "ERROR: Reset failed — aborting smoke run. Running against un-reset QA defeats the purpose."
    exit 2
  fi

  RESET_END=$(date +%s)
  RESET_DURATION=$(( RESET_END - RESET_START ))
  echo "Reset complete in ${RESET_DURATION}s"

  # Telemetry: log reset duration to the smoke dashboard (non-blocking)
  curl -s -X POST "http://localhost:3456/api/smoke/reset-telemetry" \
    -H 'Content-Type: application/json' \
    -d "{\"env\":\"$TARGET\",\"durationSec\":$RESET_DURATION}" --max-time 5 2>/dev/null || true
fi

Note on if ! curl pattern: curl's exit status is tested directly by the if, with nothing in between. Any intervening command (date, echo, assignment) would overwrite $?, making a later [ $? -ne 0 ] test meaningless: it would check the last command's status, not curl's. That stale-$? check is the fail-open pattern Codex Round 3 flagged in the Round-2 draft of the plan.

423 Locked retry behavior (per Risk 14, v2 review): when /testing/reset returns 423 Locked (another operator's reset is in flight), the skill script retries up to 3 times with 30s backoff before aborting — the second operator's suite queues cleanly behind the first rather than failing immediately. Pseudocode (uses explicit status-code check rather than -sf, which has cross-version quirks around -w output when --fail triggers):

RESET_ATTEMPTS=0
MAX_ATTEMPTS=3
while true; do
  HTTP_CODE=$(curl -s -o /tmp/reset-out.json -w '%{http_code}' -X POST \
    "https://qa.rotatingroom.com/testing/reset" \
    -H "X-Testing-Token: $QA_TESTING_TOKEN" \
    -H "Content-Type: application/json" \
    --max-time 60)

  if [ "$HTTP_CODE" -ge 200 ] && [ "$HTTP_CODE" -lt 300 ]; then
    break  # success — 2xx
  fi

  if [ "$HTTP_CODE" = "423" ] && [ "$RESET_ATTEMPTS" -lt "$MAX_ATTEMPTS" ]; then
    RESET_ATTEMPTS=$(( RESET_ATTEMPTS + 1 ))
    echo "Reset in progress (another operator); waiting 30s (attempt $RESET_ATTEMPTS/$MAX_ATTEMPTS)..."
    sleep 30
    continue
  fi

  echo "ERROR: Reset failed with HTTP $HTTP_CODE — aborting."
  cat /tmp/reset-out.json 2>/dev/null
  exit 2
done
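The branch logic of the loop above can be isolated as a pure function, which makes the retry budget easy to unit-test without a network. This is a sketch only: `decideResetAction` and its return values are hypothetical names, and the defaults mirror the 3-retry budget in the shell version.

```typescript
type ResetAction = "success" | "retry" | "abort";

// Decide what the runner should do after one call to /testing/reset.
function decideResetAction(httpStatus: number, retriesSoFar: number, maxRetries = 3): ResetAction {
  if (httpStatus >= 200 && httpStatus < 300) return "success"; // any 2xx
  if (httpStatus === 423 && retriesSoFar < maxRetries) return "retry"; // another reset in flight
  return "abort"; // anything else, or 423 after the retry budget is spent
}

// Walk a sequence of responses: three 423s, then a 200.
const statuses = [423, 423, 423, 200];
let retries = 0;
for (const status of statuses) {
  const action = decideResetAction(status, retries);
  console.log(status, action);
  if (action !== "retry") break;
  retries += 1;
}
```

With the default budget, a fourth consecutive 423 aborts, so the run makes at most four requests in total.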

Commit: feat(qa-smoke): call reset from smoke runner before each run; retry on 423, abort on other failures


Task 4a: Pre-QA canary (local clone validation)

Files:

  • Create: docs/handoffs/2026-04-XX-smoke-reset-canary.md (canary results — populated when Task 4a runs)

Steps:

  1. pg_dump current QA DB to a local rotatingroom_qa_canary database. Do NOT truncate QA.
  2. Point a local Laravel instance at the canary DB (temporary .env.canary or connection override).
  3. Run /testing/reset 20 times against the canary. Record: p50/p95/p99, any FK violations, any orphan rows, any Stripe quota warnings, Mailpit reachability.
  4. If any iteration fails, fix before proceeding to Phase B Task 4.

Exit criterion: 20 clean iterations on canary; handoff doc published with metrics.

Commit: docs(smoke): local canary results for reset loop


Task 4b: Cut-over announcement

Files:

  • Create: docs/handoffs/2026-04-XX-smoke-reset-cutover.md (populated with actual date when Phase B ships)

Before Task 4 lands, post in #qa:

"Heads up — starting [date], every smoke run resets the QA database to a clean seeded state. If you're doing manual UI testing on QA, please (a) finish your session before [time], or (b) ping this thread to request a reset-free window."

Track acknowledgements from Mahmoud + Megan before landing Task 4.

Commit: docs(qa): cut-over announcement for deterministic reset


Task 4c: Pre-run active-session coordination (replaces original 4h window)

Files:

  • Modify: .claude/skills/qa-smoke/SKILL.md (add pre-reset coordination step before reset call)
  • Modify: app/Http/Controllers/TestingController.php (add a read-only activeSessions action that returns active Backpack admin sessions + recent login activity)

Why this is simpler than the original 4h window (per Megan's feedback + Gaurav direction): the original plan proposed an operator-settable QA_RESET_ALLOWED=0 window that auto-expires after 4 hours. Megan correctly pointed out that QA review sessions can bleed into meetings, support calls, or next-day follow-ups — a hard 4h cap creates false urgency. A scheduled auto-resume is also a new moving part to fail.

Simpler replacement: reset only runs when someone invokes /qa-smoke. No scheduled window, no auto-expire, no env-flag toggling. The skill adds a pre-reset coordination check:

# New step in qa-smoke skill, before calling /testing/reset:
# If the check itself fails (network error or HTTP >= 400), abort rather than reset blind.
if ! ACTIVE=$(curl -sf --max-time 15 -H "X-Testing-Token: $QA_TESTING_TOKEN" \
  "https://qa.rotatingroom.com/testing/active-sessions"); then
  echo "ERROR: could not query active sessions on QA; aborting before reset."
  exit 3
fi

if echo "$ACTIVE" | jq -e '.backpack_admin_active == true' >/dev/null; then
  echo "⚠️  Active Backpack admin session on QA (likely someone doing manual review)."
  echo "    Last activity: $(echo "$ACTIVE" | jq -r '.last_admin_activity')"
  echo ""
  echo "Options:"
  echo "  1. Continue anyway (their session gets wiped)"
  echo "  2. Abort and ping #qa to coordinate"
  read -r -p "Choice [1/2]: " CHOICE
  # Anything other than an explicit "1" aborts, so a typo or empty input never wipes a session.
  [ "$CHOICE" = "1" ] || { echo "Aborted. Coordinate in #qa."; exit 3; }
fi

Why this works: we already coordinate for any shared-resource action on QA. The pre-run check surfaces the conflict at the moment it matters (not via a timed window that may or may not still be active), gives the operator an explicit choice, and costs nothing when QA is idle (the fast path is the common path — backpack_admin_active == false).

The active-sessions endpoint: read-only, returns JSON {backpack_admin_active: bool, last_admin_activity: iso8601, recent_logins: [...]}. Gated by the same 3-layer stack as /testing/reset (Task 1) — same token, same hostname, same env flag. No destructive side effects, but the information it returns is sensitive, so gate it the same way.
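As an illustration of the response contract, the sketch below types the JSON shape described above and turns it into a go/no-go decision. The field names come from the plan; the interface name, the `preResetDecision` helper, and the choice to prompt on any malformed body are assumptions made for this example.

```typescript
// Field names follow the JSON shape described above; everything else is assumed.
interface ActiveSessions {
  backpack_admin_active: boolean;
  last_admin_activity: string | null; // ISO 8601 timestamp
  recent_logins: unknown[];           // element shape is not specified in the plan
}

type PreResetDecision = "proceed" | "prompt-operator";

function preResetDecision(body: unknown): PreResetDecision {
  // Fail safe: anything that is not a well-formed "idle" response prompts a human.
  if (typeof body !== "object" || body === null) return "prompt-operator";
  const active = (body as Partial<ActiveSessions>).backpack_admin_active;
  return active === false ? "proceed" : "prompt-operator";
}

const idle: ActiveSessions = { backpack_admin_active: false, last_admin_activity: null, recent_logins: [] };
console.log(preResetDecision(idle));                            // proceed
console.log(preResetDecision({ backpack_admin_active: true })); // prompt-operator
```

Treating only an explicit `false` as idle keeps the fast path cheap while ensuring a broken or partial response never silently authorizes a reset.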

Commit: feat(qa-smoke): pre-reset active-session coordination (replaces 4h window)


Phase C — Spec migration (multiple PRs, one per spec cluster)

Task 5: Migrate Spec 34 (fraud-and-moderation) to use fraudFlaggedRoom fixture

Replace adminUpdateUser(demoId, {is_fraudulent: 0}) cleanup call (currently a no-op) with a reliance on the reset contract — the fixture guarantees the exact starting state, and no cleanup is needed because the next run's reset handles it.

Commit: test(smoke): migrate spec 34 to use fraudFlaggedRoom fixture

Task 6: Migrate journey edu-* specs to use .edu personas

Replace all seededDemo logins in tests/playwright/journeys/edu-*.spec.ts with qaEduUnverified / qaEduVerified / qaEduLocked as each scenario demands. Remove the test.skip() fallbacks that existed for the unknown-state case.

Commit: test(journeys): use .edu personas; remove silent-skip fallbacks

Task 7: Migrate spec 13/21 (verification flow) to use the qaEduLocked persona

Spec 21's lockout test currently races the 5-failed-attempt counter; with the qaEduLocked persona pre-seeded to email_verification_failed_attempts=5, the test begins from the locked state deterministically.

Commit: test(smoke): spec 21 uses qaEduLocked persona for deterministic lockout

Task 8: Migrate spec 18 (Stripe payment plans) to rely on reset contract

Spec 18 switches from ad-hoc payment-listing.json cleanup to relying on the reset contract. The reset contract comprises Task 1's /testing/reset endpoint + Task 0c's StripeTestReset helper (the endpoint calls the helper internally — see Task 1 Step 3 for the integration point). Remove savePaymentListing(null) workaround.

Commit: test(smoke): spec 18 relies on reset contract; remove null-marker workaround

Task 9: Remove ownership-guard workaround from qa-helpers.ts

Once specs rely on seeded fixtures instead of inheriting ephemeral listings, the ownership-guard helper probeValidRoomId in tests/playwright/utils/qa-helpers.ts:59 becomes redundant. It is the only such helper on master; getListingIdOwnedByCurrentLister and readOwnedListingId were added on branch tests/smoke-fixes-26-35-42-ownership-guard in PR #4283. If that PR has merged by the time this task runs, include those two in the deprecation; if not, deprecate only probeValidRoomId. Mark the helper(s) @deprecated; specs 35/42/etc. switch to fixture-based room IDs from the persona catalog and scenario fixtures. Remove the helper(s) entirely once no specs import them (measured via CI grep).

Commit: refactor(smoke): deprecate ownership-guard; specs use seeded room fixtures


Phase D — Observability

Shock-test deferrals (noted here, not blocking Phase A–C):

  • Reset latency under real QA load is unmeasured. Task 0's 20-run local baseline is a floor. QA carries webhook traffic, Horizon queues, cron jobs — advisory lock contention + real Horizon row-locks could push p95 higher. Week 1 post-armament, Task 11's telemetry widget should flag if observed p95 > local p95 × 2 ("QA is significantly slower than predicted — investigate lock contention / Horizon drain").
  • Automated state-drift detector beyond reviewer tagging. Success metric #1 relies on reviewers explicitly tagging state-drift. An independent check — weekly cron that compares current QA users count against baseline (should be exactly 11 + any ephemeral users created since the most recent reset, all with created_at > last-reset-time) — catches drift that slips past reviewers. Deferred to post-launch as a Task 12 extension.
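The user-count check in the second deferral could look roughly like this. It is a sketch under assumptions: the 11 IDs are taken from personas.php, while `UserRow`, `detectUserDrift`, and the report shape are hypothetical names, and the input rows would come from whatever query the cron job runs.

```typescript
interface UserRow {
  id: number;
  createdAt: Date;
}

// The 11 persona IDs from personas.php; treated here as the whole baseline.
const BASELINE_USER_IDS: ReadonlySet<number> = new Set([
  1001, 1002, 1003, 1004, 1005, 1006, 1007, 1010, 1011, 1012, 1013,
]);

interface DriftReport {
  missingBaselineIds: number[]; // personas that should exist but do not
  staleUserIds: number[];       // non-baseline users that predate the last reset
}

function detectUserDrift(users: UserRow[], lastResetAt: Date): DriftReport {
  const present = new Set(users.map((u) => u.id));
  const missingBaselineIds = Array.from(BASELINE_USER_IDS).filter((id) => !present.has(id));
  // A non-baseline user is acceptable only if it was created after the last reset.
  const staleUserIds = users
    .filter((u) => !BASELINE_USER_IDS.has(u.id) && u.createdAt.getTime() <= lastResetAt.getTime())
    .map((u) => u.id);
  return { missingBaselineIds, staleUserIds };
}
```

An empty report on both fields corresponds to the "exactly 11 plus post-reset ephemerals" condition; anything else is drift worth flagging.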

Task 10: Weekly orphan-image pruner for DO Spaces — split into separate plan (issue TBD)

Status (per v2 review): Task 10 is thematically adjacent but NOT on the determinism critical path. The 7 safeguards it requires (dry-run arming, count/percentage caps, soft-delete inclusion, primary-DB read, weekly Slack summary, adversarial probes, DO Spaces versioning precondition) plus a 3-week arming runbook amount to a 1-2 day sub-project that would pull this plan's center of gravity toward Spaces cleanup.

Decision: spin Task 10 out into its own plan/issue, owned by whoever picks up the Spaces-orphan work. This plan ships Phases A–D without Task 10; the orphan-image cleanup work happens in parallel or after.

Rationale: Ahmed's Concern #6 raised the safeguards in the context of this plan, but the safeguards ARE the plan for Task 10 — they're the whole surface area. Keeping Task 10 here compounds scope; splitting it addresses the concern at the right level (its own review cycle) without dragging this plan.

Issue to file: "Weekly DO Spaces orphan-image pruner — production-safe rollout" (references this plan's Appendix A.6 for the state class and Ahmed's 7 safeguards). All design details previously under Task 10 move to that new plan.

Design details (retained here for reference until split issue lands):

Files:

  • Create: app/Console/Commands/PruneOrphanListingImages.php
  • Create: config/spaces_prune.php (dedicated config with hard bounds, never read from arbitrary env)
  • Modify: routes/console.php (schedule weekly, onOneServer() + withoutOverlapping())
  • Create: tests/Feature/PruneOrphanListingImagesTest.php (adversarial — empty-room, replica-lag, soft-deleted scenarios)

Why this task needs production-grade safeguards (Ahmed Concern #6): a bug in the orphan-detection join (missed relation, read-replica lag, soft-deleted rooms not considered) could delete live listing images at scale. Unlike Task 0c's Stripe helper (which operates on Stripe test-mode — low blast radius), this command operates on production DO Spaces — real user photos, not a test sandbox. The destructive-action surface is large enough that the command needs layered defenses before it's trusted to run unattended.

Seven safeguards:

  1. Dry-run by default. SPACES_PRUNE_EXECUTE env defaults to false. Without the explicit opt-in, the command logs "WOULD DELETE: N objects" but calls no delete APIs. First 3 runs must be dry-run; the Slack summary shows what it would have deleted and an operator reviews before arming. Documented in docs/handoffs/spaces-prune-armament.md.

  2. Hard deletion cap per run. config('spaces_prune.max_deletions_per_run') = 100 (hard-coded default, not env-driven). If the orphan count exceeds the cap, the command aborts and posts to #qa with "REFUSED: N orphans detected, cap is M. Investigate before raising the cap." Legitimate orphan volume from a healthy smoke cycle is <20/week.

  3. Percentage cap. Refuse to delete more than 5% of total objects under listings/* in a single run. Catches the failure mode where the room set is empty (e.g., test database accidentally used as source) and everything looks orphaned. Both the count cap AND the percentage cap must be under their limits; either alone is insufficient.

  4. Include soft-deleted rooms in the "still in use" set. Room::withTrashed()->pluck('id') — a soft-deleted listing may be restored and still need its images. An orphan is ONLY an S3 object whose room ID (parsed from path prefix) isn't in the combined active + soft-deleted set.

  5. Force primary-DB read. Run the room-ID query on the write connection (for example Eloquent's onWriteConnection() or the query builder's useWritePdo(), or equivalent). Replica lag of even a few seconds could make freshly-created rooms look orphaned on the replica while their images are already in S3.

  6. Weekly #qa summary. After every run (dry OR armed), post: "Spaces prune: N orphans detected, M deleted, K bytes freed, {dry-run|armed}." Silent destructive jobs rot — visibility keeps them honest. If no summary posts, something's wrong with the scheduled run.

  7. Adversarial probe test. PruneOrphanListingImagesTest covers:

    • test_aborts_when_rooms_table_is_empty — refuses to treat empty-table as "all orphaned"
    • test_includes_soft_deleted_rooms_in_still_in_use_set — does not delete images for soft-deleted listings
    • test_count_cap_halts_at_threshold — at 101 orphans with cap=100, aborts + posts to Slack
    • test_percentage_cap_halts_at_threshold — at 5.1% of listings/* orphaned, aborts
    • test_dry_run_makes_no_delete_calls — asserts zero S3 delete calls when SPACES_PRUNE_EXECUTE=false
    • test_posts_slack_summary_every_run — both dry-run and armed runs post summaries
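Safeguards 2 and 3, plus the empty-table probe, reduce to one decision function. The sketch below shows that logic in TypeScript for illustration only (the command itself is PHP); `checkPruneCaps`, its types, and the message wording are hypothetical, and the 100 / 5% defaults are the values stated above.

```typescript
interface PruneCounts {
  orphanCount: number;    // objects under listings/* with no matching room
  totalObjects: number;   // all objects under listings/*
  knownRoomCount: number; // active + soft-deleted rooms read from the primary
}

type PruneVerdict = { proceed: true } | { proceed: false; reason: string };

function checkPruneCaps(counts: PruneCounts, maxDeletions = 100, maxPercent = 5): PruneVerdict {
  if (counts.knownRoomCount === 0) {
    return { proceed: false, reason: "rooms table is empty; refusing to treat everything as orphaned" };
  }
  if (counts.orphanCount > maxDeletions) {
    return { proceed: false, reason: `REFUSED: ${counts.orphanCount} orphans detected, cap is ${maxDeletions}` };
  }
  // Integer comparison avoids floating-point edge cases at exactly 5%.
  if (counts.orphanCount * 100 > counts.totalObjects * maxPercent) {
    return { proceed: false, reason: `REFUSED: orphans exceed ${maxPercent}% of listings/*` };
  }
  return { proceed: true };
}

console.log(checkPruneCaps({ orphanCount: 12, totalObjects: 4000, knownRoomCount: 60 }));
```

Both caps must pass independently: a small bucket can trip the percentage cap well below 100 orphans, and a large bucket can trip the count cap well below 5%.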

Arming sequence (runbook, not code):

  • Week 1: ships with SPACES_PRUNE_EXECUTE=false. Scheduled run posts weekly summary. Operator reviews the "would delete" counts for 3 consecutive weeks.
  • Week 4+: if summaries are sane (small orphan counts matching expected smoke churn, no false positives for soft-deleted rooms), operator sets SPACES_PRUNE_EXECUTE=true. Summary continues weekly.
  • Any week the count exceeds the cap: command aborts + #qa alert. Operator investigates before next run.

Rollback: SPACES_PRUNE_EXECUTE=false instantly reverts to dry-run. No data recovery needed in normal operation. If catastrophic delete ever happens (shouldn't — count+percentage caps prevent it): DO Spaces supports versioning + 30-day recovery if versioning was enabled. Arming precondition: verify DO Spaces versioning is enabled on the bucket before switching to SPACES_PRUNE_EXECUTE=true.

Commit: feat(cleanup): weekly Spaces orphan pruner with dry-run, count cap, percentage cap, and weekly Slack summary

Task 11: Reset-duration guardrail + state-drift disposition vocabulary

Files:

  • Modify: ~/dev-server/api/smoke.js (add /api/smoke/reset-telemetry endpoint)
  • Modify: tests/playwright/qa-smoke/dashboard-golden.html (p95 widget + state-drift counter)
  • Modify: .claude/skills/smoke-feedback-review/SKILL.md (add state-drift disposition)
  • Modify: the smoke-feedback dashboard UI (~/dev-server/sites/smoketest/... or labs equivalent) to accept the new tag

Deliverable 1 — reset telemetry: smoke runner POSTs {durationSec, env, runKey} after each reset; widget shows p50/p95/p99 over last 10 runs; alert posts to #qa if p95 > 30s.
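
A minimal sketch of the widget's arithmetic, assuming nearest-rank percentiles (the plan does not say which percentile method the dashboard uses). The payload fields durationSec, env, and runKey come from the deliverable; the function names are hypothetical.

```typescript
// Payload shape from the plan; everything else here is illustrative.
interface ResetTelemetry {
  durationSec: number;
  env: string;
  runKey: string;
}

interface ResetSummary {
  p50: number;
  p95: number;
  p99: number;
  alert: boolean; // true when p95 exceeds the threshold
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) {
    throw new Error("percentile requires at least one sample");
  }
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function summarizeResets(
  samples: ResetTelemetry[],
  windowSize = 10,
  alertThresholdSec = 30,
): ResetSummary {
  // Only the most recent runs feed the widget.
  const recent = samples.slice(-windowSize).map((s) => s.durationSec);
  const p95 = percentile(recent, 95);
  return {
    p50: percentile(recent, 50),
    p95,
    p99: percentile(recent, 99),
    alert: p95 > alertThresholdSec,
  };
}
```

With a 10-run window and nearest-rank percentiles, p95 and p99 both equal the slowest run, so a single slow reset is enough to trip the 30s alert. An interpolating percentile or a larger window would soften that.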

Deliverable 2 — state-drift disposition: add a new disposition option alongside existing confirmed / test-needs-fix / not-a-bug. Reviewer documentation: "Use state-drift when the finding exists because prior-run state leaked into this run (not a test-logic bug, not an app bug). Examples: demo user still fraud-flagged from spec 34; listing.json owned by a prior run's ephemeral lister; subscription left past_due from spec 18." Dashboard counts state-drift per run and exposes the metric consumed by Success Metric #1.
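
One way the per-run counter could look, as a sketch only. The four disposition values are the ones named above; the Finding shape and the function name are assumptions for illustration.

```typescript
// The four dispositions named in the plan.
type Disposition = "confirmed" | "test-needs-fix" | "not-a-bug" | "state-drift";

// Hypothetical shape of a dispositioned finding.
interface Finding {
  runKey: string;
  disposition: Disposition;
}

// Counts state-drift findings per run. Runs with no state-drift findings
// still appear with a count of 0 so the metric shows the improvement.
function countStateDriftPerRun(findings: Finding[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const finding of findings) {
    const current = counts.get(finding.runKey) ?? 0;
    counts.set(
      finding.runKey,
      finding.disposition === "state-drift" ? current + 1 : current,
    );
  }
  return counts;
}
```

Recording zero-drift runs explicitly matters for the baseline: a run that is missing from the map is indistinguishable from a run that was never reviewed.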

Deliverable 3 — historical backfill: Task 11 retrospectively codes the last 4 smoke audits (2026-04-07, 2026-04-14, 2026-04-21, plus one post-Phase-A run) with dispositions so the baseline has a distribution, not a single point. Adds the distribution to docs/handoffs/2026-04-24-smoke-reset-baseline.md.

Commit: feat(smoke-dashboard): reset-p95 widget + state-drift disposition + historical baseline

Task 12: Monthly gate-audit review

Add to /post-deploy-verify skill: a monthly check of access logs for /testing/* endpoints against staging/prod hostnames. Any hit → P1 alert.
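
The audit reduces to a filter over parsed access-log entries. The sketch below assumes the logs are already parsed into host/path pairs; the entry shape, the function name, and the hostnames used in the usage note are placeholders, not the real staging/prod hosts.

```typescript
// Hypothetical parsed access-log entry.
interface AccessLogEntry {
  host: string;
  path: string;
  status: number;
}

// Returns every request that reached a /testing/* path on a guarded host.
// Any non-empty result corresponds to the "any hit" P1 condition.
function findGateViolations(
  entries: AccessLogEntry[],
  guardedHosts: string[],
): AccessLogEntry[] {
  const guarded = guardedHosts.map((h) => h.toLowerCase());
  return entries.filter((entry) => {
    const onGuardedHost = guarded.includes(entry.host.toLowerCase());
    // Match /testing and /testing/..., but not lookalikes such as /testing-docs.
    const isTestingPath = entry.path === "/testing" || entry.path.startsWith("/testing/");
    return onGuardedHost && isTestingPath;
  });
}
```

For example, `findGateViolations(entries, ["staging.example.com", "www.example.com"])` flags a hit regardless of response status, since even a 403 on those hosts means something tried the endpoint.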

Commit: chore(audit): monthly gate-audit for testing endpoints


Phase ordering & dependencies

Phase A (infra) — Tasks 0, 0b, 0c, 1, 2, 3
  ↓ (measurement → CI checks → Stripe helper → gate → personas → gate-probe)
  ↓ Task 0 delivers before Task 1 starts (metric baselines)
  ↓ PR #3722 (pg_dump backup) MUST land before Phase B (catastrophic-loss rollback precondition)
Phase B (runner) — Tasks 4, 4b, 4c
  ↓ (Playwright setup project + cut-over announcement + reset-free-window escape hatch)
Phase C (specs) — Tasks 5, 6, 7, 8, 9 (can parallelize; each a separate PR)
  ↓ (specs rely on new determinism)
Phase D (observability) — Tasks 11, 12 (can run concurrently with Phase C; Task 10 split into a separate follow-up plan — see Task 10 placeholder)
Phase E (follow-up, conditional) — nightly snapshot approach from Alternative 7
  ↓ (triggered if Phase A-D reset p95 exceeds 30s for 2 consecutive weeks)
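
The Phase E trigger can be stated as a small check over weekly p95 values. This sketch reads "exceeds 30s for 2 consecutive weeks" as a strict comparison on back-to-back weeks; the function name and the idea of one p95 value per week are assumptions.

```typescript
// weeklyP95Sec holds one reset-p95 value per week, oldest first (assumed input shape).
function shouldTriggerPhaseE(
  weeklyP95Sec: number[],
  thresholdSec = 30,
  consecutiveWeeks = 2,
): boolean {
  let streak = 0;
  for (const p95 of weeklyP95Sec) {
    // A week at or under the threshold resets the streak.
    streak = p95 > thresholdSec ? streak + 1 : 0;
    if (streak >= consecutiveWeeks) {
      return true;
    }
  }
  return false;
}
```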

Branch Strategy

  • Feature branch: feature/smoke-deterministic-reset (from master)
  • Sub-branches:
    • feature/smoke-deterministic-reset/task-0-baseline — Phase A Task 0 (measurement day, docs-only)
    • feature/smoke-deterministic-reset/task-0b-ci — Phase A Task 0b (seeder-migration CI)
    • feature/smoke-deterministic-reset/task-0c-stripe — Phase A Task 0c (Stripe reset helper)
    • feature/smoke-deterministic-reset/task-1-gate — Phase A Task 1 (triple-gate + CSRF + mutex)
    • feature/smoke-deterministic-reset/task-2-personas — Phase A Task 2 (persona catalog)
    • feature/smoke-deterministic-reset/task-3-probe — Phase A Task 3 (gate-bypass adversarial probe)
    • feature/smoke-deterministic-reset/task-4-runner — Phase B Task 4 (Playwright setup project)
    • feature/smoke-deterministic-reset/task-4b-cutover — Phase B Task 4b (announcement)
    • feature/smoke-deterministic-reset/task-4c-window — Phase B Task 4c (reset-free window)
    • Phase C sub-branches: one per spec cluster (5-9)
    • Phase D sub-branches: one per observability task (11-12; Task 10 shipped via its own follow-up plan on a separate branch)

Merge order (hard preconditions):

  1. Task 0 (baseline) must land before Task 1 — metrics need real numbers.
  2. PR #3722 (pg_dump backup) must land before Phase B — catastrophic-loss rollback.
  3. Task 0c (Stripe) merges independently of gate work; no ordering constraint with Task 1.
  4. Tasks 1, 2, 3 must all pass CI before Phase B lands.
  5. Task 4b (cut-over announcement) posts ≥24h before Task 4 merges.
  6. Phase C can start once Phase B is on staging and one clean run observed.

Orchestrator: main session on feature/smoke-deterministic-reset manages sub-branch merges.

Each sub-branch passes /iterative-review before merging into the feature branch. Feature branch passes one final /ready-for-review cycle before PR to master.

@majones919

Confirmed review.

A lot of the current smoke tests I'm re-testing are a result of the QA environment constantly changing, so this in theory will help reduce the extra noise added by those false positives.

Only slight concern: the ability to pause the 4h reset if we're in the middle of a manual review. Not all of our QA time frames align, and they can stretch across meetings and support calls or bleed into the next day. But I think it's mostly doable.

@AhmedEssamElNaggar

I don't think this is correct. Every time we push something to QA, we create a fresh copy of the master branch and pick the feature or bugfix branches that we need to deploy to prod, then push the new deployment branch to QA, so QA always gets reset. The whole point of this doc is based on an incorrect assumption.

@mahmoudessam7

Reviewed. Sounds like a good, well-thought-out plan.

@gsingal

gsingal commented Apr 23, 2026

Author

@ApxSnowflake — I want to make sure we're lined up on what state persists vs resets on deploy. I queried QA DB directly just now (2026-04-23, a few hours after the a0e61717e deploy at 18:02 UTC):

Evidence of cross-run state persistence on QA:

| Query | Result | Would look like this if deploys reset DB |
| --- | --- | --- |
| Total users in users table | 96,111 (oldest 2020-02-12, newest today) | ~7 baseline users |
| Ephemeral qa-smoke* / qa-lister* / qa-traveler* users | 42 accounts, oldest from 2026-03-30 | 0 (ephemeral pattern would only exist mid-run) |
| qa-smoke users created on 2026-04-11 alone | 19 accounts still present 12 days later | 0 |
| demo@rotatingroom.com | Single row, created_at = 2023-03-14 | Row would have today's created_at |
| Orphan rooms (owner not in baseline 7) | 44,695 rooms | 0 |
| email_verification_codes rows | 2,555 rows, oldest 2025-09-08 | 0 |
| Users with email_verification_failed_attempts > 0 | 75 users in locked/semi-locked state | 0 |
| subscriptions table | 16,858 rows from 2020-02-13 to today | Baseline only |

I think the confusion is that code deploys and DB resets are separate. Forge's deploy script runs php artisan migrate (additive — applies new migrations, preserves existing rows) rather than php artisan migrate:fresh --seed (destructive — drops + reseeds). The deployment branch you build contains new code, but QA's database has been continuously accumulating since 2020.

So the plan's foundational premise (state persists across runs) holds — and the failure modes we've been patching (PR #4283 ownership guard for stale listing.json, PR #4279 admin unflag for is_fraudulent, silent-skip cascades in .edu specs) are all downstream of this.

Does this match what you see on your side? Happy to hop on a call or share a terminal if you want to reproduce the queries.

@gsingal

gsingal commented Apr 23, 2026

Author

@majones919 — great point on the 4-hour cap, and honestly the whole auto-resuming window is over-engineered now that I think about it.

Simpler design: reset only happens when someone runs /qa-smoke. No scheduled window, no auto-resume, no cap to calibrate. If you're mid-review and I kick off a smoke run, the smoke skill now shows a pre-reset check (e.g., "there's an active Backpack admin session on QA — continue and wipe it, or hold?"). You and I coordinate like we would for any other shared-resource action.

This removes the whole "window expiry" class of problem you flagged (bleed into next day, expanding calls, meetings). The tradeoff is smoke runs need a one-line #qa ping before kicking off — but we already do that in practice, and it's a smaller cost than engineering an extendable timed window.

I'll fold this into the plan (Task 4c gets replaced with a "pre-run coordination check in /qa-smoke" instead of a 4-hour scheduled window). Let me know if that works.

@AhmedEssamElNaggar

I see your point. If we're talking about the database, then yes, we don't reset it often.

@AhmedEssamElNaggar

Solid plan — it attacks the actual root cause instead of the symptoms, reuses existing infrastructure, and the multi-layer gate plus phased rollout show the author respects how catastrophic a misfire against prod would be.

@AhmedEssamElNaggar

One hard invariant worth adding to Task 0c: StripeTestReset should reject unsafe patterns at construction time. If QA_STRIPE_EMAIL_PATTERN is null, empty, *, *@*, or anything that doesn't contain the hard-coded qa-smoke- prefix, the helper throws and the reset endpoint returns 500; it never silently treats the value as "match all." Otherwise a forgotten env var on first deploy could wipe the entire Stripe test-mode environment (manual QA, dev experiments, other harnesses). Task 3's adversarial probe should include null/empty/wildcard pattern cases.
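
A minimal sketch of that invariant, written in TypeScript purely for illustration (the real helper lives in the Laravel app). The function name is hypothetical, and the sketch interprets "contain the prefix" strictly as "start with the prefix", which is an assumption.

```typescript
// The prefix is hard-coded on purpose: it must not come from configuration.
const REQUIRED_PREFIX = "qa-smoke-";

// Hypothetical validator; throws instead of ever falling back to "match all".
function validatePurgePattern(pattern: string | null | undefined): string {
  const trimmed = (pattern ?? "").trim();
  if (trimmed === "") {
    throw new Error("QA_STRIPE_EMAIL_PATTERN is unset or empty; refusing to purge");
  }
  // Bare wildcards such as "*" and "*@*" fail this check as well,
  // because they do not start with the required prefix.
  if (!trimmed.startsWith(REQUIRED_PREFIX)) {
    throw new Error(`pattern "${trimmed}" does not start with "${REQUIRED_PREFIX}"; refusing to purge`);
  }
  return trimmed;
}
```

Calling the validator once at construction time means a misconfigured environment fails on the first reset attempt rather than on the first destructive Stripe call.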

@AhmedEssamElNaggar

Concern on Task 1: TestingController::reset() holds a Postgres advisory lock around the entire body — truncate, reseed, Mailpit clear, Stripe API calls, cache/queue flush. Stripe's API has a long tail (occasional multi-second hangs); if a Stripe call stalls, the advisory lock stays held until PHP's request timeout fires, blocking every subsequent smoke run queued behind it. The 30s p95 target assumes Stripe behaves — one bad roundtrip breaks the budget and cascades. Suggest either (a) strict per-step timeout on the Stripe call (e.g. 5s hard cap, fail-open with a logged warning), or (b) move Stripe cleanup outside the advisory lock so DB reset can release the lock independently.
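
Option (a) can be sketched as a generic timeout wrapper that converts both a stall and a failure into a logged warning. This is a TypeScript illustration of the pattern, not the PHP controller code; the names are hypothetical, and the wrapper only stops waiting, it does not cancel the underlying request.

```typescript
// Result of one reset step: either a value or a warning to log.
type StepResult<T> =
  | { ok: true; value: T }
  | { ok: false; warning: string };

// Waits at most `ms` for `work`. On timeout or rejection it resolves with a
// warning instead of throwing, so the caller can continue and release its lock.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<StepResult<T>> {
  return new Promise<StepResult<T>>((resolve) => {
    const timer = setTimeout(() => {
      resolve({ ok: false, warning: `${label} timed out after ${ms}ms` });
    }, ms);
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve({ ok: true, value });
      },
      (err) => {
        clearTimeout(timer);
        resolve({ ok: false, warning: `${label} failed: ${String(err)}` });
      },
    );
  });
}
```

With a 5s cap, the Stripe step would be wrapped as `withTimeout(stripeCleanup(), 5000, "stripe cleanup")`, where `stripeCleanup` stands in for whatever performs the API calls. Failing open trades completeness for availability: Stripe-side leftovers from that run have to be picked up by the next reset.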

@AhmedEssamElNaggar

Concern linking Task 0c + Task 2: the Stripe reset helper only deletes customers matching QA_STRIPE_EMAIL_PATTERN (e.g. qa-smoke-*), but the plan never pins that the seeded personas (Users #1–#13 in the persona catalog) use emails matching that prefix. If they don't, their Stripe customers get created by specs 18/35 and never cleaned up, accumulating across runs — re-creating the exact Stripe-side drift the plan is trying to eliminate. Suggest: (a) require all seeded personas in Task 2 to use the qa-smoke- prefix, and (b) extend Task 3's adversarial probe to assert every persona email matches the pattern that StripeTestReset will purge.
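
Suggestion (b) amounts to an invariant check over the persona list. A minimal sketch, assuming the purge pattern is a simple glob where * matches any run of characters; the function names and the first example email in the usage note are made up for illustration.

```typescript
// Converts a simple glob ("*" = any run of characters) into an anchored,
// case-insensitive regular expression. Other regex metacharacters are escaped.
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&").replace(/\*/g, ".*");
  return new RegExp(`^${escaped}$`, "i");
}

// Returns the persona emails that the purge pattern would NOT match.
// An empty result means every persona's Stripe customer is eligible for cleanup.
function personasMissingFromPurge(personaEmails: string[], purgeGlob: string): string[] {
  const matcher = globToRegExp(purgeGlob);
  return personaEmails.filter((email) => !matcher.test(email));
}
```

For example, `personasMissingFromPurge(["qa-smoke-lister@example.com", "demo@rotatingroom.com"], "qa-smoke-*")` returns only the demo address, which is exactly the kind of persona whose Stripe customer would accumulate across runs.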

@AhmedEssamElNaggar

Concern on Task 10 (DO Spaces orphan pruner): this task destructively acts on live production S3 storage based on a DB join, but gets one task line with no safeguards — unlike the Stripe helper in Task 0c, which correctly has pattern validation, quota limits, and fail-open behavior. A bug in the join (missed relation, read-replica lag, soft-deleted rooms not considered) could delete live listing images at scale.

Suggest mirroring the Stripe helper's safeguards:

  1. Dry-run by default — log-only until SPACES_PRUNE_EXECUTE=1 is set; compare "would delete" output against reality for the first few runs before arming.
  2. Hard deletion cap per run (e.g. 100 objects) — abort and alert #qa if exceeded; legitimate weekly orphan volume is small.
  3. Percentage cap — refuse to delete more than ~5% of objects under listings/* in a single run; catches the "empty room set → everything looks orphaned" failure mode.
  4. Include soft-deleted rooms in the join — soft-deleted listings may be restored and still need their images.
  5. Force primary DB read — replica lag could make freshly-created rooms look orphaned.
  6. Weekly summary to #qa with deletion counts — silent destructive jobs rot; visibility keeps them honest.
  7. Adversarial probe (Task 3) — add an empty-rooms scenario to verify the pruner refuses rather than nukes everything.

@gsingal

gsingal commented Apr 25, 2026

Author

v2 plan posted — all of yesterday's feedback addressed + 1 round of additional iterative review (3 Claude rounds + Codex challenge + expert shock test).

Quick map of what changed since the version you reviewed:

Addressing @ApxSnowflake's concerns:

  1. StripeTestReset now uses a literal allowlist (4 hard-coded patterns), not a regex-blacklist. Adding a new pattern requires editing the file → forces code review. Strictly safer than the original regex-validation design.
  2. ✅ Stripe API calls inside reset() get a 5s hard timeout + fail-open; partial-success result returned. The advisory lock can no longer be held by a Stripe tail-latency hang.
  3. ✅ Persona emails now all have qa- prefix (qa-edu-locked@example.edu, qa-blocked@rotatingroom.com, etc.) — matches the Stripe purge pattern. New PersonaEmailPurgePatternTest enforces the invariant.
  4. ✅ Task 10 (DO Spaces orphan pruner) split out into its own follow-up plan — your 7 safeguards are correct but they ARE the plan for that work; keeping it here was scope creep. This plan ships Phases A–D without Task 10.

Addressing @majones919's feedback:

  • ✅ The 4-hour reset-free window is gone. Replaced with pre-run active-session coordination: when someone runs /qa-smoke, the skill checks for an active Backpack admin session first and prompts the operator to continue/abort. No timed window, no expiry, no auto-resume — coordinate at the moment it matters.

New from @gsingal direction:

  • ✅ Persona naming convention: every email is self-documenting (qa-edu-locked@example.edu, qa-blocked@rotatingroom.com) so a developer reads the email and knows the state.
  • ✅ Shared password: every persona uses RR4Life! (same as team's prototype/staging QA credentials) — no separate creds to look up.
  • docs/QA_PERSONAS.md cheat-sheet linked from CLAUDE.md, smoke dashboard header, #qa topic.

Found in the additional review rounds (would have shipped with bugs without these):

  • 5 new risks added covering shared-password leak, persona-email collision with prod, Stripe timeout drift, cap false-positives, concurrent-operator queueing
  • config/testing.php was referenced everywhere but never explicitly created in any task's Files list — fixed
  • Task 4 was POSTing to QA reset on every smoke run including /qa-smoke staging (would have wiped QA when running against staging) — now per-env gated
  • Task 4 gained retry-on-423 logic so concurrent operators queue instead of aborting
  • Post-seed state verification test (catches seeder typos before they reach QA)
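
The retry-on-423 behaviour can be sketched as follows. HTTP 423 is the "Locked" status; the function name, the attempt limit, and the fixed delay are assumptions, and the HTTP call is injected as a function returning a status code so the sketch stays transport-agnostic.

```typescript
// Retries the reset request while the server answers 423 (Locked), i.e. while
// another operator's reset is still in progress. Any other status is returned
// to the caller immediately. If the lock never clears, 423 is returned.
async function resetWithRetry(
  postReset: () => Promise<number>, // performs the request, resolves to the HTTP status
  maxAttempts: number,
  delayMs: number,
): Promise<number> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const status = await postReset();
    if (status !== 423) {
      return status;
    }
    if (attempt < maxAttempts) {
      // Wait before asking again so queued operators do not hammer the endpoint.
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return 423;
}
```

A bounded attempt count keeps a stuck lock from hanging the runner forever; the caller can treat a final 423 as a hard failure and surface it to the operator.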

Final ratings:

  • Iterative review: 7.5 → 8.5 → 9.5/10 → APPROVE
  • Codex adversarial challenge: 3 P1/P2 found and fixed (per-env gate, TS persona mirror, allowlist test matrix)
  • Expert shock test: 1 addition (post-seed verification), 2 deferrals to Phase D

Ready to start Phase A (Task 0 = measurement day, no code changes — safe first step). Estimated full execution: 2–3 weeks across A→B→C→D.

Plan branch: docs/smoke-reset-plan. Updated gist above is the canonical v2-final.
