gsingal/2026-04-23-smoke-deterministic-reset.md

Last active April 25, 2026 07:42

QA Smoke Deterministic Reset Foundation — Strategic Plan for team review

QA Smoke Deterministic Reset Foundation

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Problem & Why Now ← Dimension 1

Every smoke run inherits state from prior runs because QA server state is never fully reset between runs. The team has been patching symptoms for months — ownership guards for stale listing.json (PR #4283), silent-skip fallbacks when the demo user is fraud-flagged (PR #4279), test.skip() cascades when ephemeral lister state is wrong, content tests that accept "any of three states" as passing (#3655 Playwright specs). Each workaround makes the suite weaker: it proves fewer things, accepts more failure modes as "green," and hides real regressions under fake passes.

The 2026-04-21 smoke audit made the cost concrete: 4 of Megan's 7 dispositioned findings were not real app bugs but cross-run state drift. The 3 remaining required app-code fixes in PR #4279 — but those fixes only unblock the workarounds; they don't remove the class of bug. The next run-state drift (subscription left active, room left under_review, user left blocked, verification code row stuck in failed_attempts=5) will cause the same silent-skip cascade.

"Reset the DB before each run" is the obvious fix. We already have TestingController::reset() + TestDatabaseSeeder that does it locally. The current gate has two layers: (a) bootstrap/app.php:26-30 only loads routes/testing.php when APP_ENV=testing, and (b) TestingController::__construct only denies production|staging — so any non-prod/non-staging env (local, dev, custom) is fail-open once the routes are loaded. QA today has APP_ENV=production, so the routes aren't loaded and reset is unreachable. This plan meaningfully tightens the gate (route-layer + controller-layer + hostname + token) rather than just extending it.

Simpler interventions are considered and rejected in Alternatives; this needs a proper foundation.

Prior Art & Research ← External best practices

Anthropic claude-code own testing playbook (internal). Test databases are recreated per test job, not per-test. Persona fixtures live in seed data, not inline factories. Ephemeral state is banned from tests — every "did this feature work" check begins from a named known state.
Playwright best practices 2026 (playwright.dev/docs/best-practices). Recommends per-run database reset via test-only API endpoint OR per-test database transaction rollback. Rolls up both to one rule: "no test may depend on another test's side effects."
Cypress + Testcontainers pattern (cypress.io/blog/database-seeding-cypress). Seeding into a dedicated "smoke" database snapshot that's restored before each run. Faster than truncate-and-reseed for large schemas.
CLAUDE.md E2E rule #1 (this repo): "Tests must replicate exact user actions" — this is only possible when preconditions are controlled. Silent-skip + state-drift fallbacks directly violate the spirit of that rule.
Prior in-repo work: TestingController::reset() (already exists, local only), ownership guard in qa-helpers.ts::readOwnedListingId() (workaround for the same class of bug), PR #4279 admin email_verified_at / is_fraudulent (unblocks manual recovery — but only needed because reset is missing).

Alternatives Considered ← Dimension 2

1. Do nothing — keep patching symptoms. Continue adding ownership guards, silent-skip fallbacks, and workarounds as they arise. Cost: every new smoke finding requires triage to separate "real bug" from "state drift," which has repeatedly cost 30-60 min of debug time. Every new feature that touches state (subscriptions, fraud, verification) adds to the drift surface. Verdict: Rejected — compounding technical debt.

2. Truncate+reseed via TestingController::reset() called from the smoke runner before each run. Enable /testing/* routes on QA server (relax the APP_ENV=testing gate to also permit a specific QA_SMOKE_TESTING=1 env flag). Smoke runner curls /testing/reset before each run. Cost: truncating ~80 tables + reseeding baseline + re-creating FK indexes takes ~20s per reset; small. Risk: running reset against staging/prod is a catastrophic data-loss event, so the gate needs multiple independent checks. Verdict: Chosen — see rationale below.

3. Per-test transaction wrapping. Each Playwright test runs inside a DB transaction that rolls back at teardown. Cost: requires app-level transaction awareness (Playwright can't drive DB transactions directly on an HTTP server). Would need test-controller endpoint to begin/rollback. Also doesn't work for tests that cross HTTP boundaries where the server commits mid-test. Verdict: Rejected — incompatible with HTTP boundary tests.

4. Dedicated QA database snapshot + restore. Dump a known-good DB state to a file; restore via pg_restore before each run. Faster than reseed (~5s for 80 tables). Cost: snapshot must be regenerated whenever schema changes; adds a CI step. Verdict: Rejected for now — reseeding is fast enough, complexity budget is better spent elsewhere. May revisit if reset time exceeds 60s.

5. Ephemeral-only (no persistent seeded users at all). Every spec creates its own user via registration UI at the top of the spec, tears down at the end. Cost: registration itself is a spec under test (spec 12), so circular dependency. Also 50+ specs × 10s registration = 8+ minutes added per run. Verdict: Rejected — too slow + circular.

6. Playwright project dependencies (projects: [{ name: 'reset-setup', testMatch: /reset/ }, { name: 'qa-smoke', dependencies: ['reset-setup'] }]). Native Playwright mechanism for "run this before the suite." Would call /testing/reset as a proper setup project, auto-isolate reset failures from spec failures, and give the dashboard native per-run visibility into reset success. Cost: adds a new Playwright project, requires refactoring playwright.config.qa.ts. Benefit over shell-script approach (Task 4): reset failure is a visible Playwright test failure in the report rather than a pre-run shell abort, and the desktop/mobile projects both inherit the dependency without runner-level duplication. Verdict: Adopted for Task 4 — the plan's "call reset from shell before npx playwright test" approach is replaced with a Playwright dependency project. See Task 4 for the revised spec.

7. Nightly snapshot + per-run restore. Distinct from Alternative 4 (schema-change-triggered regeneration): a cron runs pg_dump --data-only --format=custom nightly against a freshly-truncated-and-seeded DB; each smoke run does pg_restore --data-only --clean (~3-5s). Cost: one nightly cron, one restore command in the runner. Benefit: 4-5x faster than truncate-and-reseed at the point of each run. Verdict: Deferred to Phase E follow-up — only triggered if Phase A-D reset p95 exceeds 30s; described in Non-Goals.

Chosen — Alternative 2 because it reuses existing infrastructure (TestingController, TestDatabaseSeeder), the reset cost is bounded (~20s/run), and it makes the smoke suite's invariants explicit ("every run starts from a known baseline") rather than inferring them from patchwork guards.
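The dependency wiring from Alternative 6 can be sketched as plain data. The snippet below uses a local stand-in type instead of importing Playwright so that it runs on its own; in playwright.config.qa.ts the resulting array would go into the `projects` field. The `reset-setup` name comes from the alternative above, while the desktop/mobile project names, the `reset.setup.ts` file name, and the `buildSmokeProjects` helper are assumptions for illustration.

```typescript
// Local stand-in for the subset of Playwright's project options used here.
// In the real config these objects would go in the `projects` array.
interface SmokeProject {
  name: string;
  testMatch?: RegExp;
  dependencies?: string[];
}

const RESET_PROJECT = 'reset-setup';

// Every spec project depends on the reset project, so a failed reset is
// reported as a failed setup test and the dependent specs do not run.
function buildSmokeProjects(specProjectNames: string[]): SmokeProject[] {
  const reset: SmokeProject = { name: RESET_PROJECT, testMatch: /reset\.setup\.ts$/ };
  const specs = specProjectNames.map(
    (name): SmokeProject => ({ name, dependencies: [RESET_PROJECT] }),
  );
  return [reset, ...specs];
}

// Hypothetical project names; the plan only says desktop and mobile both inherit.
const smokeProjects = buildSmokeProjects(['qa-smoke-desktop', 'qa-smoke-mobile']);
```

Building the list from one helper keeps the dependency declared in a single place, which is the "no runner-level duplication" benefit the alternative describes.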

Assumptions ← Dimension 3

Truncating the QA database is acceptable. QA has no real user data; it's a dev environment seeded with fixture data. If wrong (e.g., someone is using QA for sales demos), reset would erase their context — mitigate: require an explicit env flag on the server (QA_RESET_ALLOWED=1) AND a dated annotation in /qa.rotatingroom.com/README documenting the reset behavior.
Truncate+reseed completes in <30s. Spot-checked locally at ~20s for the current schema. If wrong (large seed data, slow FK rebuild), smoke runs get slower by 30-60s — acceptable but tracked as a guardrail metric.
Stripe test-mode state doesn't leak between runs. Stripe test mode has its own state (customers, subscriptions, coupons). The seeder doesn't reset Stripe. If wrong (a prior run's customer is in a weird state), spec 18 could fail. Mitigate: Task 0c provides a StripeTestReset helper invoked inside /testing/reset (Task 1) to delete test customers matching the run's email pattern; accept residual Stripe objects from other sources.
File uploads (S3/Spaces) don't meaningfully persist between runs. Listings in DB are truncated; their images in Spaces become orphans. Storage cost is trivial for test data. If wrong (lifecycle issues surface), Task 6 adds an orphan-image pruner.
PostHog/analytics events in a test run are discardable. Tests fire real PostHog events to a dedicated test project. If wrong (production PostHog project receives test data), cardinality blows up metrics dashboards. Mitigate: audit PostHog project config before Phase A lands; ensure QA points to test project, not prod.
TestDatabaseSeeder seeds every table the suite's FK graph needs. Currently the seeder populates ~8 table groups (users, admin_users, room_types, etc.). The app has 80+ tables, and specs create rows that FK into unseeded tables (institutions, transit_scores, blog_posts, plans, permission rules). Breaks if wrong: Phase C spec migrations silently fail with FK violations after the first truncate. Detect via: Phase A Task 0 (new — FK-transitive audit). Mitigate: Phase A Task 0 extends seeder to cover every table the current suite touches; CI parity check flags future drift.
Reset runs are serialized. Two operators triggering smoke simultaneously (one via /qa-smoke, another via CI or a teammate) would race, producing a half-reset DB during the second run. Breaks if wrong: second run starts mid-truncate → FK violations, spurious failures, or corrupted seed state. Mitigate: Task 1 wraps the reset body in a PostgreSQL advisory lock (chosen over Redis SETNX; see Approach section); the second caller receives 423 Locked with a Retry-After header. Implementation note: prefer a transaction-scoped lock (pg_try_advisory_xact_lock, the non-blocking form; plain pg_advisory_xact_lock waits for the lock and so could never return 423) OR pin the PDO connection across lock/unlock to avoid session-scope leaks if connections cycle.
personas.php ↔ personas.ts parity test's regex parser is robust to TS idioms. The round-1 plan proposed preg_match_all('/^\s*([a-z][a-zA-Z]+):\s*\{/m'), which misses satisfies annotations, object shorthand {name, email}, and as const suffixes. Breaks if wrong: CI check is a tautology (always passes) and persona drift ships undetected. Mitigate: Task 2 uses a proper TS AST parser (@typescript-eslint/parser or ts-node executing the module) to extract keys, not regex.
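The regex weakness in the last assumption is easy to reproduce. The sketch below ports the round-1 pattern to a JavaScript RegExp and runs it over a made-up personas.ts snippet; the sample source and the `extractKeysWithRegex` helper are hypothetical, and only the object-shorthand case is demonstrated.

```typescript
// The round-1 pattern, ported from PHP preg_match_all to a JS RegExp.
const KEY_PATTERN = /^\s*([a-z][a-zA-Z]+):\s*\{/gm;

// Hypothetical personas.ts content; `blockedUser` uses object shorthand.
const sampleSource = [
  'const blockedUser = { email: "qa-blocked@rotatingroom.com" };',
  'export const personas = {',
  '  lister: { email: "qa-lister@rotatingroom.com" },',
  '  blockedUser,',
  '} as const;',
].join('\n');

function extractKeysWithRegex(source: string): string[] {
  const keys: string[] = [];
  KEY_PATTERN.lastIndex = 0; // reset state held by the global flag
  let match = KEY_PATTERN.exec(source);
  while (match !== null) {
    keys.push(match[1]);
    match = KEY_PATTERN.exec(source);
  }
  return keys;
}

const regexKeys = extractKeysWithRegex(sampleSource);
// `blockedUser` is a real persona key but has no "key: {" line, so it is missed.
const missedShorthand = regexKeys.indexOf('blockedUser') === -1;
```

A parity check built on this extraction would compare an incomplete key list, which is why Task 2 moves to a real TS parser.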

Approach & Rationale ← Dimension 4

ADR-style Y-statement:

In the context of a smoke suite whose reliability is eroded by cross-run state drift, facing the choice between continued symptom-patching and a foundational reset protocol, we chose to enable TestingController::reset() on QA (gated behind a new QA_RESET_ALLOWED env flag), expand TestDatabaseSeeder to include named .edu personas and pre-seeded scenario fixtures, and call reset from the smoke runner before every run, to achieve deterministic preconditions for every spec (eliminating silent-skip fallbacks), accepting a ~20s per-run cost and the operational risk of a misconfigured reset gate on staging/prod (mitigated by triple-gate — env flag, explicit token, and server hostname allowlist).

Architectural changes

TestingController gate: three independent layers, one evaluated at boot time and two in request context.

Constraint: the then: closure in bootstrap/app.php runs at boot time (before any request exists), so we cannot gate route registration on hostname; hostname is only known per request. The working design:
- Layer 1 — Route registration (env-var only, boot-time): bootstrap/app.php:26-30 registers routes/testing.php when APP_ENV=testing OR config('app.qa_reset_allowed') === true. This is a coarse switch: on QA we set QA_RESET_ALLOWED=1, on staging/production we never set it. Staging/prod → routes don't exist → 404 on any /testing/* request. Hostname is NOT checked here.
- Layer 2 — Per-request middleware (EnsureQaResetAllowed): new middleware attached to the testing.php route group; on every request verifies the QA flag AND request()->getHost() === 'qa.rotatingroom.com'. Returns 403 if either fails. This is where hostname is checked — it runs in request context, so request()->getHost() is valid. In tests, the middleware reads the Host header the test set via Playwright extraHTTPHeaders or the PHP feature-test's ->withServerVariables(['HTTP_HOST' => 'qa.rotatingroom.com']).
- Layer 3 — Controller-layer token: TestingController::__construct validates X-Testing-Token against config('app.testing_token'). Even if layers 1+2 pass, a wrong or missing token returns 403. Important: Task 1 also removes the current abort(403) on production|staging — that check would prevent QA (where APP_ENV=production) from ever passing through to the token check. Environment gating is handled entirely by Layer 1 (boot env-var) + Layer 2 (hostname).
- CSRF exception: bootstrap/app.php:52-54 currently appends testing/* to the CSRF-except list only when APP_ENV=testing. Task 1 extends this to include the QA-flag condition — otherwise the smoke runner's curl receives 419 Page Expired instead of executing the reset. This oversight would have broken Task 4 on Day 1.
- Concurrency lock: Task 1 wraps the entire reset() body (truncate + reseed + Mailpit clear + Stripe reset + cache/queue flush) in a PostgreSQL advisory lock (pg_try_advisory_lock($key) at start, pg_advisory_unlock($key) at end). Chosen over Redis SETNX because advisory locks auto-release on connection close (handles crash-mid-reset cleanly) and don't require Redis availability. A second caller during an in-progress reset gets 423 Locked with Retry-After: 30. Lock scope explicitly covers every sub-operation so a second caller never sees half-complete state.
- Staging and production neither set the QA_RESET_ALLOWED flag nor serve requests whose Host is qa.rotatingroom.com, so they cannot reset even if someone exports the flag in their shell.
Test seam: feature tests drive the middleware via HTTP-level Host header ($this->withServerVariables(['HTTP_HOST' => 'qa.rotatingroom.com']) or ->call('POST', '/testing/reset', [], [], [], ['HTTP_HOST' => ...])). No production-code branches for test-only config keys — the plan's earlier config('app.hostname_override') proposal is withdrawn.
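The three layers can be modelled as one pure function to make the expected status codes explicit. This is a TypeScript model of the decision logic only, not the Laravel implementation; the `GateInput` shape and `evaluateResetGate` name are assumptions, and the statuses follow the layer descriptions above.

```typescript
interface GateInput {
  appEnv: string;            // APP_ENV on the server
  qaResetAllowed: boolean;   // QA_RESET_ALLOWED flag
  host: string;              // request Host header
  token: string | undefined; // X-Testing-Token header, if sent
  expectedToken: string;     // value of config('app.testing_token')
}

const QA_HOST = 'qa.rotatingroom.com';

// Returns the HTTP status a /testing/reset request would get from the layers.
function evaluateResetGate(input: GateInput): 200 | 403 | 404 {
  // Layer 1 (boot time): routes exist only for the testing env or with the flag.
  const routesRegistered = input.appEnv === 'testing' || input.qaResetAllowed;
  if (!routesRegistered) return 404;

  // Layer 2 (per-request middleware): flag AND hostname must both hold.
  if (!input.qaResetAllowed || input.host !== QA_HOST) return 403;

  // Layer 3 (controller): token must be present and match.
  if (input.token === undefined || input.token !== input.expectedToken) return 403;

  return 200;
}
```

A table-driven feature test could assert the same matrix against the real endpoint: no flag gives 404, wrong host or token gives 403, and only the full combination gives 200.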
Named persona contract — self-documenting + shared password + purge-pattern-safe. TestDatabaseSeeder grows an explicit persona catalog (mirrors tests/playwright/utils/test-data.ts::personas). Every persona has: user record, verification state, blocked/fraudulent flags, owned rooms (status + plan + needs_edu_verification), subscription state, and derivative rows (verification codes, activity_log, bounce records).

Naming convention: every email starts with qa- and encodes the persona's state in the local-part — e.g., qa-edu-locked@example.edu, qa-blocked@rotatingroom.com, qa-verify-expired@rotatingroom.com. A developer reading loginAs('qa-edu-locked@example.edu') knows the persona's state without looking it up. The qa- prefix also ensures the email matches QA_STRIPE_EMAIL_PATTERN so Stripe customers created for these personas get cleaned up automatically (prevents the Stripe-side drift Ahmed Concern #5 flagged).

Shared password: every persona uses RR4Life! (the same password the team already uses for prototype + staging QA credentials). Stored in config('testing.persona_password') — one value, no persona-specific lookup. Reduces friction for the team; a teammate who wants to poke at qa-blocked@rotatingroom.com on QA just types the password they already know.

Documentation: docs/QA_PERSONAS.md is a team-facing cheat-sheet linked from CLAUDE.md, smoke dashboard header, and #qa channel topic — everyone knows what personas exist and when to log in as each.

Removing a persona requires a test-writing skill update. Adding one requires a migration-style review. Both are enforced by CI via the persona parity test (Task 2) and the persona-email-matches-purge-pattern test (Task 3).
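The persona-email-matches-purge-pattern check could look roughly like the following. The real value of QA_STRIPE_EMAIL_PATTERN is not shown in this plan, so `ASSUMED_PURGE_PATTERN` is a guess at its shape; the `Persona` interface is a reduced, hypothetical version of the catalog entry, and the sample rows reuse IDs and emails from Appendix B.

```typescript
// Reduced, hypothetical persona shape; the real catalog carries more fields.
interface Persona {
  id: number;
  email: string;
  active: boolean;
  blocked: boolean;
}

// Assumed shape of QA_STRIPE_EMAIL_PATTERN; the real value is not in the plan.
const ASSUMED_PURGE_PATTERN = /^qa-[a-z0-9-]+@[a-z0-9.-]+$/;

const samplePersonas: Persona[] = [
  { id: 1001, email: 'qa-lister@rotatingroom.com', active: true, blocked: false },
  { id: 1006, email: 'qa-blocked@rotatingroom.com', active: true, blocked: true },
  { id: 1007, email: 'qa-inactive@rotatingroom.com', active: false, blocked: false },
];

// Returns the emails the purge would NOT cover; an empty list means the
// naming convention and the purge pattern agree.
function emailsOutsidePurgePattern(personas: Persona[], pattern: RegExp): string[] {
  return personas.map((p) => p.email).filter((email) => !pattern.test(email));
}

const uncovered = emailsOutsidePurgePattern(samplePersonas, ASSUMED_PURGE_PATTERN);
```

Failing CI when `uncovered` is non-empty would catch a persona added without the qa- prefix before its Stripe customers start to accumulate.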
Scenario fixtures. Beyond user personas, seed named scenario objects: an unsubscribed lister with a paid draft, a user mid-way through /verification-request, a user with 3-of-5 failed verification attempts (so the next is locked), a room flagged under_review, a coupon that expires tomorrow. Each scenario has a stable ID + documentation in tests/fixtures/scenarios.md.
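Smoke runner prefix. Before any spec runs, the runner calls POST /testing/reset (via the Playwright setup project adopted in Alternative 6, which replaces a shell call ahead of npx playwright test). On a non-200 response (after any 423 Locked retries are exhausted), abort with a clear error; there is no silent continuation, because running a smoke suite against an unreset QA is the failure mode we're trying to eliminate.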
Playwright spec migration. Remove test.skip() fallbacks that exist solely because the prior state was unknown. Specs now assert the precondition is present (e.g., "eduUnverified persona exists AND has pending free listing"), and fail loudly if the seeder didn't produce that state. This surfaces seeder regressions immediately.
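One way to express "assert the precondition, fail loudly" is a small helper that throws where a spec used to call test.skip(). This is a minimal sketch: `SeededPersonaState`, `requirePrecondition`, and the `pendingFreeListings` field are hypothetical, and how a spec would fetch the seeded state is left out.

```typescript
// Hypothetical view of what the seeder produced for one persona.
interface SeededPersonaState {
  key: string;
  pendingFreeListings: number;
}

// Throws instead of skipping, so a seeder regression shows up as a red test.
function requirePrecondition(
  personas: SeededPersonaState[],
  key: string,
  check: (p: SeededPersonaState) => boolean,
  description: string,
): SeededPersonaState {
  const found = personas.filter((p) => p.key === key)[0];
  if (found === undefined) {
    throw new Error(`Precondition failed: persona "${key}" was not seeded`);
  }
  if (!check(found)) {
    throw new Error(`Precondition failed for "${key}": ${description}`);
  }
  return found;
}
```

A spec would call it first, for example `requirePrecondition(state, 'eduUnverified', (p) => p.pendingFreeListings > 0, 'has a pending free listing')`, and only then drive the UI.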

Why phased

Flipping the flag globally and rewriting 50 specs in one PR is too risky. Phases A–D decouple the infrastructure work from the spec rewrites so each one lands independently and can be reverted cleanly. See Tasks.

Risks & Rollback ← Dimension 5

#	Risk	Probability	Impact	Mitigation	Rollback
1	`/testing/reset` mis-fires against staging or production, erasing the DB	Low (two-layer gate + CSRF + hostname + token)	Catastrophic	Route-layer gate (routes don't exist → 404) + controller-layer gate + hostname check + CSRF protection on non-QA + token; gate-bypass probe in Phase A Task 3 hard-fails CI on any regression	Hard precondition: PR #3722 (pg_dump backup) must land before Phase B. If backup isn't in place, we have no rollback for catastrophic loss. Runbook: `docs/handoffs/smoke-reset-incident.md`
2	Reseed is too slow, smoke runs balloon past 30min	Medium	Moderate	Guardrail metric: reset duration ≤30s; if exceeded, Phase E (snapshot approach — Alternative 7) triggers automatically	Revert Task 4 (Playwright setup project); runs continue with current cross-run drift while we investigate
3	Personas diverge between seeder and `test-data.ts`, causing "correct persona but wrong state" failures	Medium	Low-moderate	Task 2 adds a CI check using a proper TS AST parser (not regex); fails build on drift	Fix drift; no runtime rollback needed
4	Specs that rely on pre-existing production-like data (blog posts, institution affiliates, plans, permission rules) break when truncated due to FK-transitive gaps	High	Moderate	Phase A Task 0 (new) audits every table the suite FKs into; seeder extended to cover all of them; CI check prevents future drift	Hold Phase B (runner integration) until Task 0 audit is complete and seeder covers all FK dependencies
5	Stripe test-mode state accumulates (orphan customers, subscriptions)	Medium	Low	Task 5a includes a Stripe reset helper; Stripe has its own test-mode GC	Skip Stripe reset if quota hit; accept orphans until quarterly cleanup
6	Two smoke runs race on `/testing/reset`, producing half-reset DB	Medium (will happen the first time two people are both trying to QA on the same day)	Moderate	Task 1 uses PostgreSQL advisory lock (`pg_try_advisory_lock`); second caller receives 423 Locked + `Retry-After: 30`. Implementation must pin the PDO connection or use the transaction-scoped `pg_try_advisory_xact_lock` (the non-blocking variant; `pg_advisory_xact_lock` would wait instead of returning 423) to prevent session-scope leaks on connection cycling	Lock acquisition failure → runner aborts with clear message; no partial-reset state possible (advisory locks auto-release on session close)
7	Seeder-schema drift: someone adds a required column without updating `TestDatabaseSeeder`, every reset fails afterward	High (schema changes happen weekly)	Moderate	Phase A Task 0b (new) adds CI check that runs `migrate:fresh --seed=TestDatabaseSeeder` on every PR touching `database/migrations/` or `database/seeders/`; fails build on seeder-migration incompatibility	Revert offending migration or quick-patch seeder; CI catches before merge
8	Reset-under-load: reset fires while Postmark webhooks or real users are hitting QA, transactions pile up, DB locks cascade	Low-medium	Moderate	Reset runs during known smoke windows only; Task 11 dashboard widget flags p95 >30s; Phase B cut-over announces windows in #qa	If cascading locks observed, reduce reset frequency; investigate lock contention; consider snapshot approach (Phase E)
9	QA users (Mahmoud manual UI testing, Megan spot-checks) lose their in-progress session/data when first reset runs	High (will happen to Mahmoud on Day 1)	Low-moderate	Phase B Task 4b (new) adds a 24-hour cut-over announcement in `#qa`; Task 4c (v2) replaces the 4h window with a pre-reset active-session coordination check — `/qa-smoke` pauses and prompts the operator if a Backpack admin session is active; operator decides continue/abort	Teammates can abort `/qa-smoke` at the prompt; no data-loss from reset itself (same as Risk 2 — revert Task 4)
10	`qa-@` persona email collides with an existing real user on prod or QA, and the Stripe purge targets that user's test-mode customer	Low (persona prefixes are unusual) but must be verified before Phase A	Moderate	Phase A Task 0 deliverable #5 runs `SELECT * FROM users WHERE email LIKE 'qa-%'` against both prod replica and QA; if any match is not in our persona catalog, narrow `QA_STRIPE_EMAIL_PATTERN` to allowlist only the verified persona emails	Narrow the allowlist in `StripeTestReset::ALLOWED_PATTERNS` to exclude the collision range
11	Shared password `RR4Life!` leaks (accidental commit, Slack paste, screenshot, or public gist) — someone external can authenticate as every persona on QA	Medium (shared passwords leak eventually)	Low-moderate (QA has no real user data, but can still be used to generate spam, trigger webhooks, stress Stripe test quotas)	Documented rotation runbook in `docs/handoffs/qa-password-rotation.md` (Task 2 sub-deliverable); password stored in `config/testing.php` which is git-tracked but lives alongside the `APP_KEY` — same security posture as other test credentials. Rotation: generate new password, update `config/testing.php`, update QA server `.env` if overridden, update `docs/QA_PERSONAS.md`, re-deploy, trigger reset, verify specs still pass	Rotate to new password; previous password stops working on next reset
12	Stripe's 5s hard-timeout causes persistent partial-purge drift during sustained Stripe incidents	Medium (Stripe p99 can spike to 10-30s during region incidents)	Low (state accumulates over time but doesn't break tests immediately)	Telemetry tracks `skipped_timeout` count per reset (Task 11 dashboard widget); decision rule: if `skipped_timeout` > 10% of purges for >3 consecutive runs, raise the timeout or schedule a separate nightly purge job	Widen timeout to 15s or move purge to background; short-term accept partial drift
14	Two operators invoke `/qa-smoke` concurrently; second receives 423 Locked but aborts instead of waiting → second operator's suite never runs	Low-medium (multi-operator smoke is infrequent but possible)	Low (second operator re-runs manually, but it's avoidable friction)	Task 4 explicitly specifies 423 retry behavior: `/qa-smoke` on 423 Locked waits 30s and retries up to 3 times before aborting. Second operator's reset queues cleanly behind the first, then its suite runs against that already-reset state (which is what they wanted anyway)	Manual re-run; the retry logic is ~5 lines in the skill script so unlikely to need rollback
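The retry behaviour in Risks 6 and 14 reduces to a small decision function. The sketch below keeps it pure so the policy can be unit-tested without a server; `nextResetAction` and the constant names are hypothetical, while the limits (3 retries, 30s wait) are the ones stated in the table.

```typescript
type ResetAction = 'proceed' | 'retry' | 'abort';

const MAX_LOCK_RETRIES = 3;    // from Risk 14: retry up to 3 times
const RETRY_WAIT_SECONDS = 30; // matches Retry-After: 30

// `retriesSoFar` counts retries already made for this run (0 on the first call).
function nextResetAction(status: number, retriesSoFar: number): ResetAction {
  if (status === 200) return 'proceed';
  if (status === 423 && retriesSoFar < MAX_LOCK_RETRIES) return 'retry';
  return 'abort'; // any other status, or the lock never cleared
}

// Worst-case wait before giving up on a held lock.
const maxWaitSeconds = MAX_LOCK_RETRIES * RETRY_WAIT_SECONDS; // 90
```

With these limits a second operator waits at most 90 seconds, which comfortably covers a reset that meets the ≤30s p95 target.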

Expert shock-test additions (2026-04-23):

Pre-QA canary: before Phase B lands, run the full reset + reseed path against a local QA clone (pg_dump current QA → restore locally → point the TestingController at it → run 20 reset cycles → verify no FK violations, no orphans, no quota issues). Treat as a hard precondition for landing Task 4. New Task 4a formalizes this.
On-call / escalation: reset failure on QA must page the active on-call via #qa mention + Slack app. The plan's "abort with clear message" is necessary but insufficient. Task 11 extended: if the reset-telemetry endpoint receives three consecutive failures OR no heartbeat in 24h, post @channel to #qa. Owner: whoever is assigned the weekly on-call rotation (currently ad-hoc — plan flags this as a dependency on #4303 or a new on-call rotation issue).
Fixture schema versioning: database/seeders/data/personas.php returns a versioned array (['version' => 2, 'personas' => [...]]). Parity test compares versions; if TS side's PERSONAS_VERSION constant doesn't match PHP version, CI fails. Prevents silent drift when a new persona field is added PHP-side but specs still expect the old shape.
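The versioned parity check can be sketched as a comparison of two already-extracted catalogs. Extraction itself (running personas.php, parsing personas.ts) is out of scope here; `PersonaCatalog` and `parityErrors` are hypothetical names, and only the version and the key sets are compared.

```typescript
// What each side exposes after extraction: its version and its persona keys.
interface PersonaCatalog {
  version: number;
  personaKeys: string[];
}

// Returns human-readable drift messages; an empty list means parity holds.
function parityErrors(php: PersonaCatalog, ts: PersonaCatalog): string[] {
  const errors: string[] = [];
  if (php.version !== ts.version) {
    errors.push(`version mismatch: PHP ${php.version} vs TS ${ts.version}`);
  }
  const missingInTs = php.personaKeys.filter((k) => ts.personaKeys.indexOf(k) === -1);
  const missingInPhp = ts.personaKeys.filter((k) => php.personaKeys.indexOf(k) === -1);
  if (missingInTs.length > 0) errors.push(`missing in TS: ${missingInTs.join(', ')}`);
  if (missingInPhp.length > 0) errors.push(`missing in PHP: ${missingInPhp.join(', ')}`);
  return errors;
}
```

Checking the version separately from the keys catches the case described above, where a field is added on the PHP side without any persona being added or removed.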

Rollback plan if the whole initiative needs to be reversed:

gh pr revert the runner change (Task 4) — smoke reverts to current behavior instantly.
Leave the expanded seeder + personas (Phase A) in place; they're additive and don't break anything.
Tasks 7+ (spec rewrites) each land as their own PR; individual reverts if any causes regressions.

Non-Goals ← Dimension 6

Reset on production or staging. Production: never. Staging: also never (staging is the accumulator branch pre-push; its data is the ship-candidate QA'd before deploy). Complexity: trivial (just don't add the env flag). Justified — these environments have their own data contracts.
Per-test reset (between Playwright tests within a spec). The framework supports it (test.beforeEach → curl /testing/reset), but wall-clock cost is prohibitive (80 tests × 20s = 27 min added). Complexity: trivial (call reset in beforeEach). Justified — smoke suite is designed to run sequentially with shared setup (see specs 11-12 dependency chain).
Snapshot-based reset (Alternative 7 — nightly snapshot + per-run restore). Moved from "non-goal forever" to "Phase E follow-up, auto-triggered if reset p95 > 30s." Complexity: moderate (1-2 days — nightly cron + restore command in runner). If Phase A-D reset stays under 30s p95 for 4 weeks, skip Phase E. Justification for Phase E path (not immediate adoption): truncate+reseed is simpler, reset p95 is unmeasured so we shouldn't optimize pre-emptively.
Reset Stripe test mode comprehensively. Stripe API quota doesn't permit clearing all test customers per run; Task 5's helper clears only the specific emails this run generates. Complexity: moderate (would need a background worker that runs nightly). Justified — residual Stripe state rarely causes test flakes (confirmed by 2026-04-21 audit).
Reset PostHog / GA4 / Rollbar event history. These are append-only analytics systems. Test events live in a dedicated test project; production projects don't see them. Complexity: complex (requires analytics-provider cooperation). Justified — out of scope.

Explicitly in scope (to remove ambiguity):

Mailpit inbox clear. Task 5b (promoted from "supplementary" to a named acceptance criterion): /testing/reset endpoint also calls Mailpit's DELETE /api/v1/messages so spec 27 (email flows) starts from zero messages every run.

Success Criteria ← Dimension 7

Within 2 weeks of merge: Zero new smoke findings attributed to "cross-run state drift" (tracked via /smoke-feedback-review disposition test-needs-fix where the root cause is prior-run state, not a test-logic bug).
Within 4 weeks: ≥10 test.skip() calls removed from the Playwright suite (those that exist solely because of unknown prior state). Measured by grep diff against master.
Reset completes in ≤30s (p95 over 10 consecutive runs). Measured from smoke-runner log timestamps.
No accidental reset against staging or production — verified by gate-bypass probe in Phase A Task 3 and monthly review of /testing/* access logs (Task 12).
Persona reset contract compiles — TypeScript type + PHP PHPDoc pair, CI fails if they diverge (Task 2 CI check).
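The p95 criterion depends on which percentile definition is used, so it helps to pin one down. The sketch below uses the nearest-rank method, which is an assumption (the plan does not name a method), and the durations are made-up numbers.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// the samples are at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical reset durations in seconds for 10 consecutive runs.
const resetDurations = [19, 21, 20, 18, 22, 20, 19, 23, 21, 34];
const resetP95 = percentile(resetDurations, 95);
const withinTarget = resetP95 <= 30;
```

With only 10 samples the nearest-rank p95 is simply the slowest run, so a single 34s reset fails the criterion here; a larger window or an interpolating method would be more forgiving.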

Launch Metrics ← Post-launch impact tracking

Success (what we're improving)

Baselines pending Phase A Task 0 measurement day (claimed numbers below are flagged — exact baselines land with Task 0's measurement-day deliverable).

Metric	Source	Baseline	Target	Timepoint
Smoke findings tagged `state-drift` per run	New `state-drift` disposition tag (Task 11 adds the dashboard counter AND formalizes the reviewer-tagging vocabulary in `/smoke-feedback-review`; reviewer tags `test-needs-fix` items where root cause is prior-run state)	4 per run (2026-04-21 audit: 4 of 7 findings were cross-run state drift)	0 per run	Day 14
Test-skip count due to state-drift (narrow metric)	Task 0 deliverable: inventory of `test.skip()` calls in `tests/playwright/qa-smoke/` and `tests/playwright/journeys/`, each categorized as `state-drift` / `prod-gate` / `legitimate-conditional`. Target metric is only the `state-drift` subset.	TBD — Phase A Task 0 delivers the count. Total `test.skip` today is ~473 non-prod-gate calls; the state-drift subset is the ≤25% of those that exist because prior-run state was unknown.	0 state-drift skips remaining	Day 28
Smoke pass rate (passes / total, skips count against)	dashboard data.json	95.4% (2026-04-22 run)	≥97%	Day 14
Reset endpoint p95 latency	smoke-runner telemetry (Task 11 dashboard widget)	TBD — Phase A Task 0 measures 20 runs on current QA	≤30s	Day 7

Proxy validation:

state-drift disposition count is a direct measurement of the problem (not a proxy). Reviewers explicitly tag findings as state-drift vs real-bug — the count going to zero is the outcome we want.
state-drift skip count is a direct measurement — each skip call exists in the code; we count them before and after.
smoke pass rate ≥97% is a weaker proxy — could improve for unrelated reasons or stay flat if new specs land that rely on yet-unseeded state. Included as a guardrail-ish directional indicator, not a causal claim. Correlation with the other two metrics will be tracked on the dashboard.
reset p95 ≤30s is the operational health metric — direct measurement.

Guardrails (what must not break)

Metric	Source	Current	Threshold
Reset endpoint response time	smoke-runner log	n/a (new)	≤30s p95
Total smoke run wall time	dashboard timings	~30 min	must stay ≤35 min
QA server uptime during run	Uptime Kuma	100%	must stay ≥99%
Stripe test quota consumption	Stripe dashboard	current baseline	must stay within quota
Accidental resets against staging/prod	audit log review (monthly)	0	must stay at 0 (any violation = immediate rollback)

Decision Rules

Day 7: if reset p95 >30s → investigate seed size; continue with Phase C spec rewrites only if reset stable.
Day 14: if smoke findings attributed to state-drift > 1 per run on average → Phase C rewrite plan needs to be accelerated; escalate to Gaurav.
Day 28: if test.skip() count hasn't dropped by ≥10 → Phase C spec rewrites aren't landing; reopen scope with Gaurav.
Guardrail violation: revert regardless of success metric. Gate-bypass probe failure → immediate rollback of Task 1 (gate and route change).
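Written as code, the decision rules become a checklist that a dashboard job could evaluate. This is a rough sketch under assumed names: `LaunchMetrics`, `decide`, and the returned strings are hypothetical, and the thresholds are the ones listed above.

```typescript
interface LaunchMetrics {
  resetP95Seconds: number;
  stateDriftFindingsPerRun: number; // average per run
  skipsRemoved: number;             // test.skip() calls removed since baseline
  guardrailViolated: boolean;
}

// Guardrails are checked first because they override every success metric.
function decide(day: 7 | 14 | 28, m: LaunchMetrics): string {
  if (m.guardrailViolated) return 'revert';
  if (day === 7 && m.resetP95Seconds > 30) return 'investigate seed size; hold Phase C';
  if (day === 14 && m.stateDriftFindingsPerRun > 1) return 'accelerate Phase C; escalate';
  if (day === 28 && m.skipsRemoved < 10) return 'reopen scope';
  return 'continue';
}
```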

Requirements Input ← Dimension 10

User request (2026-04-23): "What else should be reset besides personas? Write a comprehensive, deterministic testing plan. Specifically around what things should be reset between runs."

Requirement (from input)	Addressed by	Status
Enumerate every class of state that drifts	Appendix A (State Drift Catalog below)	In scope — Appendix A
Design a reset protocol	Tasks 1-4 (gate, seeder, runner, scenario fixtures)	In scope
Phase migration so tests don't all break at once	Phases A-D (Task grouping)	In scope
Rollback/failure modes	Risks & Rollback section + per-Phase rollback plan	In scope
Identify guardrail metrics	Launch Metrics / Guardrails table	In scope
Personas (specifically)	Task 2 (persona catalog)	In scope

Appendix B: Post-Reset Baseline State (what everything gets set to)

This is the canonical answer to "after /testing/reset completes, what exists in the DB?" The current TestDatabaseSeeder defines most of it; Task 2 extends it with .edu personas + scenario fixtures. Every spec must be able to assume this state exists before it runs.

B.1 — Users (11 personas with self-documenting names + shared password)

Design goals:

Self-documenting email names — an engineer or QA reviewer should be able to read a test that does loginAs('qa-lister@rotatingroom.com') and know exactly what state that user is in without cross-referencing the persona catalog.
Shared password RR4Life! — same password every team member already uses for manual QA across prototypes and staging. No separate credentials to look up, no password cycling per persona. (Bcrypt at $2y$12$...; stored in config('testing.persona_password').)
qa- prefix on every email — matches the qa-smoke-* and qa-* pattern that StripeTestReset is allowed to purge (per Ahmed's Concern #5), so specs 18/35 that create Stripe customers against these personas get cleaned up automatically.
Stable IDs in the 1000s: keeps baseline personas below the range that ephemeral spec-12 registrations use (after reset the sequence auto-assigns from 1014, per B.7), so there's never an ID collision even if a spec forgets to clean up.

Shared password: RR4Life! (bcrypt in the seeder, plain in config('testing.persona_password') for test-helper login). Every persona below uses this.

Admin: Admin #1 = qa-admin@rotatingroom.com (password RR4Life!) — only one admin, lives in admin_users table.

#	ID	Email	State	Purpose
1	1001	`qa-lister@rotatingroom.com`	active, email verified	Seeded lister: owns all 60 baseline rooms + 3 scenario fixtures (9001-9003); fixture 9004 belongs to User #1010
2	1002	`qa-support@rotatingroom.com`	active	Support/Megan-style persona — ops flows
3	1003	`qa-founder@rotatingroom.com`	active	Founder/Gaurav-style persona — strategic flows
4	1004	`qa-demo@rotatingroom.com`	active, verified	Non-`.edu` general user — canonical login for E2E demos
5	1005	`qa-verify-expired@rotatingroom.com`	active, `email_verified_at = 7 months ago`	Expired-verification re-prompt UI
6	1006	`qa-blocked@rotatingroom.com`	active, `blocked=1`	Blocked-user gates + admin-unblock flow
7	1007	`qa-inactive@rotatingroom.com`	`active=0`	Inactive-account paths
8	1010	`qa-edu-unverified@example.edu`	unverified `.edu`, has pending free listing	First-time verify flow
9	1011	`qa-edu-verified@example.edu`	verified `.edu`	Post-verify "no banner" flow
10	1012	`qa-edu-locked@example.edu`	`failed_attempts=5`, locked	Lockout UI + admin-unlock
11	1013	`qa-edu-verifying@example.edu`	`failed_attempts=3`	Exercises final attempts 4–5 (expert shock test finding)

Readability test: a developer sees loginAs('qa-edu-locked@example.edu') and immediately knows "this is the locked .edu user." No catalog lookup. Contrast with loginAs('edu-locked@example.edu') or loginAs(personas.eduLocked) — both force the reader to know conventions.

Stripe-purge compatibility: every email starts with qa-, which matches the QA_STRIPE_EMAIL_PATTERN (qa-*@*) StripeTestReset is configured to purge. See Task 0c + Task 3 probe for the fail-closed validation that ensures this pattern stays narrow.

Not seeded (must be created per-run by specs): the ephemeral qa-smoke-<timestamp>-lister@rotatingroom.com + qa-smoke-<timestamp>-traveler@rotatingroom.com from specs 11-12. These test registration itself, so they're intentionally created per-run and also match the qa- purge pattern.
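To make the purge-pattern claim concrete, here is a minimal sketch of glob matching against persona and ephemeral emails. It assumes `*` is the only wildcard and that matching is anchored at both ends; how `StripeTestReset` actually evaluates the pattern is not specified here, so `globToRegExp`, `matchesPattern`, `ephemeralEmail` and the numeric timestamp format are all hypothetical.

```typescript
// Convert a glob such as "qa-*@*" into an anchored RegExp.
// Only "*" is a wildcard; every other character is matched literally.
function globToRegExp(glob: string): RegExp {
  const escaped = glob
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp(`^${escaped}$`);
}

function matchesPattern(email: string, glob: string): boolean {
  return globToRegExp(glob).test(email);
}

// Hypothetical builder for the per-run addresses created by specs 11-12.
function ephemeralEmail(role: "lister" | "traveler", timestamp: number): string {
  return `qa-smoke-${timestamp}-${role}@rotatingroom.com`;
}

const seededEmails = [
  "qa-lister@rotatingroom.com",
  "qa-demo@rotatingroom.com",
  "qa-edu-locked@example.edu",
];
const ephemeralEmails = [
  ephemeralEmail("lister", 1776900000),
  ephemeralEmail("traveler", 1776900000),
];

// The default pattern covers both groups.
const allMatchDefault = [...seededEmails, ...ephemeralEmails].every((e) =>
  matchesPattern(e, "qa-*@*"),
);
// The narrower pattern covers only the ephemerals.
const anySeededMatchesNarrow = seededEmails.some((e) => matchesPattern(e, "qa-smoke-*@*"));
const allEphemeralsMatchNarrow = ephemeralEmails.every((e) => matchesPattern(e, "qa-smoke-*@*"));
```

Under these assumptions the default pattern purges Stripe customers for seeded personas and ephemerals alike, while `qa-smoke-*@*` would leave seeded-persona customers in place.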

Team-facing documentation: docs/QA_PERSONAS.md (created by Task 2) is a one-page cheat-sheet with the email list, password, and "when would I log in as each?" column. Linked from CLAUDE.md and the smoke dashboard header.

B.2 — Rooms (60 today, + scenario fixtures from Task 2)

All 60 seeded rooms belong to User #1001 (qa-lister@rotatingroom.com). 10 rooms per city across 6 cities. Covers the full filter/sort space:

Dimension	Range seeded
Cities (6)	Boston ($800-1600), NY ($1400-2200), Chicago ($700-1500), LA ($1200-2000), SF ($1600-2400), Houston ($550-1350) — each has 10 rooms with rent ±$400 around base
Room type	Alternates `private_room` / `entire_place` per city
Bedrooms	1, 2, or 3 (staggered: `$i % 3 + 1`)
Bathrooms	1 or 2
Availability	Start dates range from today to +5 months, depending on room; every window is 6 months long
Plan	Rotates through `rr-monthly`, `rr-quarterly`, `rr-annually`
Status	All `active`, all `under_review = false`
Lat/Lon	Jittered ±0.025° around each city center
Each has an address row	10 Test Street through 6000 Test Street per city

Task 2 scenario fixture rooms (added on top of the 60):

ID	Type	State	Purpose
9001	Paid draft listing	`status=pending_payment`, plan=`rr-monthly`, owner=User #1001	Spec 18 "blocked by paid draft" path — replaces the `savePaymentListing(null)` workaround
9002	Fraud-flagged room	`is_fraudulent=true`, status=`inactive`, owner=User #1001	Spec 34 cleanup no-op fix (PR #4279 unblocks manual recovery, fixture guarantees starting state)
9003	Under-review room	`under_review=true`, status=`active`, owner=User #1001	Stripe Radar simulation starting state
9004	Pending free listing	status=`pending_free`, plan=`rr-free`, owner=User #1010 (`qa-edu-unverified`)	`.edu` journey spec: "user finishes verification → pending listing activates"

Fixture room IDs in the 9000s keep them clear of the baseline range (1–60). Because the rooms sequence is advanced past the highest seeded ID (see B.7), rooms created by specs start at 9005, above the fixtures.

B.3 — Stripe Plans (10 plans)

Code	Name	Monthly	Upfront	Duration
`rr-free`	Free	$0	$0	—
`rr-monthly`	Monthly	$25	$25	monthly
`rr-quarterly`	Quarterly (default)	$20	$60	quarterly
`rr-annually`	Annual	$15	$180	annually
`rr-premium-monthly`	Premium Monthly	$35	$35	monthly
`rr-premium-quarterly`	Premium Quarterly	$28	$84	quarterly
`rr-premium-annually`	Premium Annually	$21	$252	annually
`rr-pro-monthly`	Pro Monthly	$49	$49	monthly
`rr-pro-quarterly`	Pro Quarterly	$39	$117	quarterly
`rr-pro-annually`	Pro Annually	$29	$348	annually
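The upfront column appears to be the monthly rate multiplied by the billing period in months. The sketch below checks that reading against the table; the months-per-period mapping (1, 3, 12) is an assumption inferred from the plan names, not something the seeder is known to store.

```typescript
interface PaidPlan {
  code: string;
  monthly: number; // USD per month
  upfront: number; // USD charged per billing period
  months: number; // assumed period length: monthly=1, quarterly=3, annually=12
}

const paidPlans: PaidPlan[] = [
  { code: "rr-monthly", monthly: 25, upfront: 25, months: 1 },
  { code: "rr-quarterly", monthly: 20, upfront: 60, months: 3 },
  { code: "rr-annually", monthly: 15, upfront: 180, months: 12 },
  { code: "rr-premium-monthly", monthly: 35, upfront: 35, months: 1 },
  { code: "rr-premium-quarterly", monthly: 28, upfront: 84, months: 3 },
  { code: "rr-premium-annually", monthly: 21, upfront: 252, months: 12 },
  { code: "rr-pro-monthly", monthly: 49, upfront: 49, months: 1 },
  { code: "rr-pro-quarterly", monthly: 39, upfront: 117, months: 3 },
  { code: "rr-pro-annually", monthly: 29, upfront: 348, months: 12 },
];

function expectedUpfront(plan: PaidPlan): number {
  return plan.monthly * plan.months;
}

// Codes of plans whose upfront price does not equal monthly * months.
const upfrontMismatches = paidPlans
  .filter((plan) => expectedUpfront(plan) !== plan.upfront)
  .map((plan) => plan.code);
```

A check like this could sit beside the seeder so that a price edit which breaks the relationship is caught before specs assert on checkout totals.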

B.4 — Cities (6 cities)

Boston, New York, Chicago, Los Angeles, San Francisco, Houston — each with slug, state, lat/lon.

B.5 — Permission rules (allows + restrictions)

Allows (4 rules):

#	Regex	create_account	post_paid	send_queries	Purpose
1	catch-all `.+[@.].+\..+$`	1	1	0	Any domain can register + post paid
2	`.org$`	1	1	0	.org tier (same as catch-all)
100	`.edu$`	1	1	1	University tier — `.edu` gets send_queries
4	`queensu.ca$`	1	1	1	Queen's University sample

Restrictions (2 rules):

#	Regex	Blocks	Purpose
3	`(alum|alumni)`	create_account
4	`(protonmail.com|proton.me|pm.me)$`

B.6 — Everything else: empty by design

All other tables are empty after reset:

conversations, chat_messages, broadcasts → 0 rows (specs 14-15 create them)
stripe_subscriptions, stripe_coupons → 0 rows (spec 18 creates them; Stripe test mode also reset)
email_verification_codes, password_reset_tokens, personal_access_tokens → 0 rows
bounced_emails, notifications → 0 rows
activity_log → 0 rows
failed_jobs, jobs → 0 rows
email_verification_failed_attempts on most users → 0 (except User #1012 = 5 locked; User #1013 = 3 per Task 2 personas)
DO Spaces bucket: left untouched by reset. The orphan pruner is thematically adjacent but split into its own follow-up plan; photos for room IDs that no longer exist become orphaned, which doesn't affect DB correctness or smoke determinism.
Mailpit inbox → cleared by Task 1 reset body
Stripe test mode → customers matching QA_STRIPE_EMAIL_PATTERN (default qa-*@*) purged by Task 0c, with fail-closed validation against unsafe patterns (per Ahmed Concern #3)

B.7 — Auto-increment sequences

After seed, all PG sequences advance past max seeded ID so spec inserts don't collide:

users → next ID starts at 1014 (after User #1013; baseline IDs 1001–1013 keep spec-generated ephemeral IDs well above them)
rooms → next ID starts at 9005 (after scenario fixture #9004; spec-generated ephemeral rooms therefore get 9005 and up, and IDs 61–9000 stay unused)
admin_users → next ID starts at 2
cities → next ID starts at 7
stripe_plans → next ID starts at 11
allows → next ID starts at 101 (after .edu tier #100)
restrictions → next ID starts at 5

This is critical: spec 11-12 creates timestamped-email QA accounts that auto-assign IDs — if sequences aren't advanced, they'd collide with seeded IDs and fail. Note on the 1000-baseline / 9000-fixtures gap: the sequence-advance step sets users.id to >= 1014. Ephemeral users created by spec 11 get IDs 1014, 1015, etc. — never colliding with the baseline 1001–1013 range and never pushed into a 9000+ range reserved for future scenario fixtures.
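The sequence-advance rule can be sketched as "next value = highest seeded ID + 1". The snippet below applies that to the seeded user and room IDs listed in this appendix; `nextSequenceValue` and `collidesWithSeeded` are hypothetical helper names, and the real step would run against the Postgres sequences rather than in TypeScript.

```typescript
// Next auto-assigned ID once the sequence has been advanced past every seeded row.
function nextSequenceValue(seededIds: number[]): number {
  if (seededIds.length === 0) return 1;
  return Math.max(...seededIds) + 1;
}

// True if an ID handed out by the sequence would clash with a seeded row.
function collidesWithSeeded(seededIds: number[], candidateId: number): boolean {
  return seededIds.includes(candidateId);
}

const seededUserIds = [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1010, 1011, 1012, 1013];
const seededRoomIds = [
  ...Array.from({ length: 60 }, (_, i) => i + 1), // baseline rooms 1..60
  9001, 9002, 9003, 9004, // scenario fixtures
];

const nextUserId = nextSequenceValue(seededUserIds); // 1014
const nextRoomId = nextSequenceValue(seededRoomIds); // 9005

// Without the advance, a sequence still at 1 would hand out an ID that baseline room #1 already uses.
const unadvancedRoomCollision = collidesWithSeeded(seededRoomIds, 1);
```

The gaps inside the seeded ranges (for example user IDs 1008 and 1009) are never reused, because the sequence only moves forward from the maximum.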

B.8 — What is NOT in the baseline (must be created by specs or NOT exist)

Ephemeral QA lister + traveler (spec 11-12 creates — expected to NOT exist at reset time; the whole point)
.auth/*.json files (cleared pre-run by smoke runner rm -rf)
Any production data (there is no QA reset against staging/prod — Layer 1+2 gates prevent it)
Any photo uploads beyond DO Spaces orphans (which don't affect DB correctness)

Appendix A: State Drift Catalog (comprehensive enumeration)

Every class of state that currently drifts between smoke runs, and how the reset protocol handles each:

A.1 — User state

What drifts	Current cause	Reset protocol
`users.is_fraudulent` on `demo@rotatingroom.com`	Spec 34 flags it; cleanup admin call was a no-op pre-#4279	Full `users` table truncate + reseed
`users.blocked`	Spec 15/19 toggles it	Full truncate
`users.email_verified_at`	Admin CRUD edits (PR #4279) + verification specs	Full truncate
`users.email_verification_failed_attempts`, `_locked_at`	Spec 21 locks users out	Full truncate
`users.send_queries`	Spec 21 gates	Full truncate
Ephemeral QA lister/traveler accounts from prior runs	Spec 11/12 creates with timestamped email; never deletes	Full truncate — these accounts cease to exist between runs
`users.is_sent_queries_disabled`	Moderation flag changed by spec 15/19	Full truncate

A.2 — Listing / Room state

What drifts	Current cause	Reset protocol
`rooms.status` (active → inactive → pending)	Specs 13, 22, 34 change status	Full truncate
`rooms.is_fraudulent`	Spec 34, Stripe radar hooks	Full truncate
`rooms.under_review`	`TestingController::setRoomUnderReview`	Full truncate
`rooms.needs_edu_verification`	Spec 13/21	Full truncate
`rooms.plan_id`	Specs 18, 22 (plan changes)	Full truncate
Orphan rooms owned by deleted ephemeral listers	Spec 13 creates; previous lister removed but rooms remain	Full truncate — orphans gone

A.3 — Subscription / Billing state

What drifts	Current cause	Reset protocol
`stripe_subscriptions.stripe_status` (`active`, `past_due`, `canceled`)	Specs 18, 35	Full truncate (DB side)
`stripe_subscriptions.ends_at`, `trial_ends_at`	Spec 18	Full truncate
Past_due + retry state	Spec 35	Full truncate
Stripe test-mode customers (remote)	Spec 18 creates; never deletes	Task 0c: `StripeTestReset` helper deletes customers matching the configured email pattern (`QA_STRIPE_EMAIL_PATTERN`); accept orphans from other sources
Stripe coupons created by spec 19	Manual test	Hands-off — coupons expire; quarterly cleanup

A.4 — Messaging / Conversation state

What drifts	Current cause	Reset protocol
`conversations` rows (renter → lister)	Spec 14 creates inquiries	Full truncate
`chat_messages`	Spec 14	Full truncate
Duplicate-conversation detection window (1-month, per PR #4264)	PR #4264	Full truncate — conversations gone means no duplicate detection fires
`broadcasts` + broadcast-recipient join rows	Spec 15	Full truncate of both `broadcasts` and `broadcast_recipients` (or whatever the join-table is per schema)

A.5 — Verification / Fraud / Auth state

What drifts	Current cause	Reset protocol
`email_verification_codes` rows	Spec 21 + verification flow	Full truncate
`failed_attempts`, `locked_at`	Spec 21 lockout test	Full truncate
Verification documents uploaded	Spec 21	Full truncate of DB-side record; physical file cleanup handled by the DO Spaces pruner follow-up plan (see Appendix A.6)
`bounced_emails` (Postmark webhook state)	Spec 27 webhook simulation	Full truncate
`activity_log` entries for fraud_flag_cleared, etc.	Admin actions, PR #4279	Full truncate (but note: activity_log is a large production table; observers that read it may behave differently after truncate — Task 0 audit flags any reader)
`password_reset_tokens`	Spec 27 password reset flow; tokens are not auto-cleaned	Full truncate
`personal_access_tokens`	Any Laravel Sanctum-based test	Full truncate
Notifications (`notifications` table — Laravel queued email/Slack)	Any notifying action	Full truncate

A.6 — Filesystem / Client state

What drifts	Current cause	Reset protocol
`tests/playwright/.auth/accounts.json`	Ephemeral accounts from specs 11-12	Already cleared pre-run (`rm -rf tests/playwright/.auth`)
`tests/playwright/.auth/listing.json`	Spec 13	Already cleared pre-run
`tests/playwright/.auth/payment-listing.json`	Spec 18	Already cleared pre-run
`test-results/` screenshots + traces	Playwright	Already cleared pre-run
DO Spaces (S3) uploaded photos	Spec 13/22/42	Follow-up plan: orphan-image pruner that runs weekly, not per-run
Browser state (cookies, localStorage)	Playwright	Isolated per test context — no reset needed

A.7 — External-service state

What drifts	Current cause	Reset protocol
Mailpit inbox (QA)	All email sends	Task 1 reset body: `DELETE /api/v1/messages` against Mailpit (its own clear API); non-fatal if Mailpit is unreachable
Rollbar errors logged	Any test-triggered error	Append-only; not reset per-run. Category-filter in Rollbar review.
PostHog events	Analytics-emitting specs	Append-only on `test` project; cardinality contained by project isolation.
Slack notifications (e.g., new subscription)	Webhook specs	Accept as noise in `#dev-feed`; quarterly cleanup
Cloudflare cache	Affected by geo-lookup specs	TTL-based; not reset per-run

A.8 — Schema / Config / Infrastructure state

What drifts	Current cause	Reset protocol
`migrations` table	Deploys	Never truncate (Task 1 adds this to `$skipTables`)
Cache (Redis)	Any request	`php artisan cache:clear` at reset time (Task 1)
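Session data (Redis-backed sessions)	Authenticated request	Flushed at reset time. `cache:clear` only empties the cache store, so this holds only if sessions share that Redis connection; otherwise Task 1 needs an explicit session flush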
Queue jobs (Redis)	Mailer queue, analytics	Task 1: flush queue via `php artisan queue:clear`
`failed_jobs` table (Horizon/queue failures)	Any job failure, including test-induced failures	Full truncate — otherwise failed_jobs accumulates noise across runs
`jobs` table (if queue driver is database)	Queued actions	Full truncate
Config cache	Any deploy	Not reset per-run; only on deploy
Full-text search vectors (`rooms.search_vector`, other tsvector columns)	Any room CRUD; Scout updates these via observers	Task 1 addendum: after truncate+reseed, run `php artisan search:reindex` OR explicitly trigger observers during seed so tsvectors populate. Without this, search specs return empty results from the fresh seeded rooms.
Scout index (external provider, if configured)	Room CRUD	`php artisan scout:flush "App\Models\Room"` + `php artisan scout:import "App\Models\Room"` at reset (or rely on Scout driver's own reset)

Tasks ← Dimension 8

Phase A — Infrastructure (must land first, no spec changes)

Task 0: Measurement Day (1-day spike, no code changes)

Deliverables (published as docs/handoffs/2026-04-24-smoke-reset-baseline.md):

Reset-duration baseline: run TestingController::reset() locally 20 times; publish p50/p95/p99.
test.skip inventory: full grep -rn 'test\.skip' tests/playwright/qa-smoke/ tests/playwright/journeys/ with each call categorized as:
- state-drift (exists because prior-run state was unknown — target for removal)
- prod-gate (exists because the spec shouldn't run against production — keep)
- legitimate-conditional (exists because the test genuinely doesn't apply to every run — keep)
FK-transitive audit: query information_schema.table_constraints for every FK the suite transitively depends on; cross-reference against TestDatabaseSeeder's seedUsers, seedAdmins, seedRoomTypes, etc. Produce a gap list of tables the suite FKs into but the seeder doesn't populate (e.g., institutions, permission rules, plans).
Rebaseline Success Metrics in the Launch Metrics table with actual numbers.
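For the reset-duration baseline, a small percentile helper is enough to turn 20 timings into p50/p95/p99. This is a minimal sketch using the nearest-rank definition; the plan does not say which percentile method the dashboard uses, so treat that choice and the sample values as assumptions.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Illustrative timings in seconds for 20 local reset runs (made-up values).
const resetSeconds = Array.from({ length: 20 }, (_, i) => i + 1);

const baseline = {
  p50: percentile(resetSeconds, 50),
  p95: percentile(resetSeconds, 95),
  p99: percentile(resetSeconds, 99),
};
```

With only 20 samples, p99 under nearest-rank is simply the slowest run, which is worth stating in the handoff so nobody reads it as a stable tail estimate.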

Prod & QA qa-*@* collision check (per v2 review):

-- Run against both prod read-replica AND current QA DB
SELECT COUNT(*) AS qa_prefix_users, MIN(created_at) AS oldest, MAX(created_at) AS newest
  FROM users WHERE email LIKE 'qa-%';
SELECT email, created_at FROM users WHERE email LIKE 'qa-%' ORDER BY created_at LIMIT 20;

If any prod users match qa-*@*, the StripeTestReset default pattern cannot be qa-*@* — narrow it. If any QA users match but aren't in our persona list, investigate before Phase A Task 1.

Existing Stripe test-mode customer survey: via Stripe CLI stripe customers list --limit 100 | grep -E 'qa-|smoke'. Count pre-existing customers the pattern would delete; confirm they're all test data (not accidentally in live mode).

Exit criterion: the plan's Launch Metrics table has concrete numbers (not TBDs) AND the qa-*@* collision check produces an explicit branch decision:

Branch A (no collisions): zero prod-DB matches for email LIKE 'qa-%' → StripeTestReset::ALLOWED_PATTERNS default is qa-*@*; proceed to Phase A Task 1.
Branch B (collisions present): any prod-DB matches exist → default narrows to qa-smoke-*@* (the narrower pattern that excludes qa-* human signups). Each pre-existing prod qa-% user is documented in the handoff with email, signup date, and inferred owner. Proceed to Phase A Task 1 with the narrower pattern.

"No unexpected matches" without this explicit branching would be ambiguous — Branch B defines the concrete alternative rather than blocking on "investigate further." Phase A Task 1 depends on one of these two branches being selected in the handoff doc.
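The branch decision reduces to one input: the number of prod users whose email matches `qa-%`. A sketch of that selection follows; `selectBranch` and the `BranchDecision` shape are hypothetical names used only to illustrate the rule.

```typescript
type StripePattern = "qa-*@*" | "qa-smoke-*@*";

interface BranchDecision {
  branch: "A" | "B";
  defaultPattern: StripePattern;
  // Branch B requires each pre-existing prod qa-% user to be listed in the handoff.
  mustDocumentProdUsers: boolean;
}

function selectBranch(prodQaPrefixMatches: number): BranchDecision {
  if (!Number.isInteger(prodQaPrefixMatches) || prodQaPrefixMatches < 0) {
    throw new Error("match count must be a non-negative integer");
  }
  if (prodQaPrefixMatches === 0) {
    return { branch: "A", defaultPattern: "qa-*@*", mustDocumentProdUsers: false };
  }
  return { branch: "B", defaultPattern: "qa-smoke-*@*", mustDocumentProdUsers: true };
}
```

Both outcomes let Task 1 proceed, which is the point of defining Branch B up front.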

Commit: docs(smoke): Task 0 measurement-day baseline for deterministic reset

Task 0b: CI check — seeder-migration parity

Files:

Create: .github/workflows/seeder-migration-check.yml

Step 1: on every PR touching database/migrations/ or database/seeders/, run php artisan migrate:fresh --seed --seeder="Database\Seeders\TestDatabaseSeeder" --force against a fresh Postgres in CI. Note: --seed enables seeding and --seeder=<FQCN> specifies which seeder class; --seed=<name> (one token) is invalid Laravel syntax. If the seeder fails (missing column default, FK violation, etc.), fail the check.

Step 2: verify Laravel surfaces seeder failure as non-zero exit. Laravel's default behavior wraps seed exceptions in \RuntimeException; in some versions the seeder swallows exceptions and exits 0. Add a post-step check: php artisan tinker --execute='echo DB::selectOne("SELECT count(*) c FROM users")->c . PHP_EOL;'. The command only prints the count, so the CI step must compare the output and exit non-zero when it is 0; that catches silent seeder aborts.

Exit criterion: the check passes on master today; fails loudly on a deliberately-broken test migration (probe commit in the PR).

Commit: ci: seeder-migration parity check to prevent reset breakage

Task 0c: Stripe test-mode reset helper (moved from Phase C)

Files:

Create: app/Services/Testing/StripeTestReset.php
Create: config/testing.php (new — holds stripe_email_pattern, persona_password; referenced by Tasks 0c, 1, 2, 3, 4c — without this file, every config('testing.*') call returns null and the fail-closed guards throw on every invocation)
Modify: app/Http/Controllers/TestingController.php (call from reset() when QA_RESET_ALLOWED=1)

Deletes Stripe test-mode customers whose email matches a configured pattern. Pattern source: config('testing.stripe_email_pattern') (default qa-*@*). Service exposes purgeTestCustomers() with no args; callers configure the pattern via env (QA_STRIPE_EMAIL_PATTERN). Accepts orphans from other sources. Gated behind reset endpoint — only callable from within TestingController::reset().

Fail-closed pattern validation — literal allowlist (Ahmed Concern #3, refined via plan review v2). Earlier drafts used a regex-blacklist approach (reject patterns matching "dangerous" shapes). The v2 review flagged this as over-engineered: every blacklist is a test of "did I think of every bad pattern?" whereas an allowlist asks "is this exactly one of my known-good patterns?" — strictly safer, ~25 lines less code, no edge-case attack surface.

StripeTestReset's constructor validates the pattern against a hard-coded allowlist:

namespace App\Services\Testing;

use InvalidArgumentException;

final class StripeTestReset {
    private const ALLOWED_PATTERNS = [
        'qa-*@*',              // default: matches all QA personas + ephemerals
        'qa-smoke-*@*',        // narrower: just spec-11/12 ephemerals
        'qa-smoketest@*',      // single existing spec user
        'qa-proplan-test@*',   // single existing spec user
    ];

    // $pattern is declared through constructor promotion; assigning an undeclared
    // $this->pattern would create a dynamic property, which PHP 8.2+ deprecates.
    public function __construct(private readonly string $pattern) {
        if (!in_array($pattern, self::ALLOWED_PATTERNS, true)) {
            throw new InvalidArgumentException(
                "Stripe email pattern '{$pattern}' is not in the allowlist. " .
                "Adding a new pattern requires a PR to this file (intentional, " .
                "to force review of any widening of Stripe purge scope)."
            );
        }
    }
}

Adding a new pattern requires editing this file (code review + tests run). A misconfigured env var gets rejected loudly (HTTP 500 on reset). No regex parsing, no wildcard semantics to defend — the check is set-membership, which is trivial to reason about.

Stripe API hard-timeout (Ahmed Concern #4): each call into Stripe's API (list, delete) is wrapped in a 5-second hard cap. On timeout, the helper:

logs a warning with the customer count not yet purged
returns a partial-success result ['purged' => N, 'skipped_timeout' => M]
does NOT throw — the rest of the reset body (which still holds the advisory lock) completes normally, the advisory lock is released, and residual Stripe customers get cleaned up next run
"Fail-open on Stripe, fail-closed on gate" is the deliberate asymmetry: missing a Stripe purge is annoying, but holding the DB lock for a multi-second Stripe tail blocks the entire smoke queue.
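The partial-success behaviour can be sketched independently of Stripe. In the TypeScript illustration below, `deleteCustomer` is a hypothetical callback that hides the real API call and its 5-second cap and reports `"timeout"` instead of throwing; stopping at the first timeout is an assumption about the helper, consistent with "customer count not yet purged" above but not confirmed by the plan.

```typescript
type CallOutcome = "ok" | "timeout";

interface PurgeResult {
  purged: number;
  skipped_timeout: number;
}

// Purge customers one by one and return a partial result on the first timeout.
// Never throws for a timeout, so the caller can finish the reset and release the lock.
function purgeWithPartialSuccess(
  customerIds: string[],
  deleteCustomer: (id: string) => CallOutcome,
): PurgeResult {
  let purged = 0;
  for (const id of customerIds) {
    if (deleteCustomer(id) === "timeout") {
      // Everything not yet purged, including this customer, is left for the next run.
      return { purged, skipped_timeout: customerIds.length - purged };
    }
    purged += 1;
  }
  return { purged, skipped_timeout: 0 };
}

// Example: the third call times out, so two customers are purged and two are skipped.
const partial = purgeWithPartialSuccess(["cus_a", "cus_b", "cus_c", "cus_d"], (id) =>
  id === "cus_c" ? "timeout" : "ok",
);
```

Returning counts rather than throwing is what lets the reset response report Stripe health without failing the whole reset.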

Structured result:

public function purgeTestCustomers(): array {
    // Illustrative values showing the result shape; the real method computes them.
    return [
        'pattern_used' => $this->pattern,
        'purged' => 12,
        'skipped_timeout' => 0,
        'duration_ms' => 847,
    ];
}

TestingController::reset() includes this result in its JSON response so the smoke dashboard can surface Stripe-reset health.

Rationale for moving to Phase A: this is infrastructure (belongs with the reset contract), not a per-spec migration.

Commit: feat(testing): Stripe test-mode reset helper bundled with DB reset

Task 1: Three-layer gate + advisory lock on reset endpoint

Files:

Modify: bootstrap/app.php:26-30 (Layer 1 — env-var-only route registration)
Modify: bootstrap/app.php:52-54 (CSRF exception extended to QA flag)
Create: app/Http/Middleware/EnsureQaResetAllowed.php (Layer 2 — per-request hostname check)
Modify: routes/testing.php (attach middleware to group)
Modify: app/Http/Controllers/TestingController.php:13-19 — remove the production|staging abort. Layer 3 becomes token check only; environment gating is handled by Layer 1 (boot env-var) and Layer 2 (middleware hostname). Keeping the constructor abort on production would make the whole stack unreachable on QA (where APP_ENV=production).
Modify: app/Http/Controllers/TestingController.php:24-54 (advisory lock, Mailpit clear, Stripe reset call, cache/queue flush)
Add import: use Illuminate\Support\Facades\Http; to TestingController.php (needed for Mailpit clear call)
Create: tests/Feature/TestingResetGateTest.php (all gate tests via HTTP Host header — no production test seams)

Step 1: Write failing tests (gate behavior via HTTP Host header).

// tests/Feature/TestingResetGateTest.php
public function test_reset_blocked_on_wrong_hostname(): void {
    // FUNCTIONAL RISKS: unintended reset on staging/prod; Layer 2 hostname gate must block.
    config(['app.testing_token' => 'valid', 'app.qa_reset_allowed' => true]);

    $response = $this->withHeaders(['X-Testing-Token' => 'valid'])
        ->withServerVariables(['HTTP_HOST' => 'staging.rotatingroom.com'])
        ->postJson('/testing/reset');

    $response->assertStatus(403);
    $this->assertDatabaseHas('users', ['email' => 'ahmed@rotatingroom.com']); // not wiped
}

public function test_reset_allowed_on_qa_host_with_flag_and_token(): void {
    config(['app.testing_token' => 'valid', 'app.qa_reset_allowed' => true]);

    $response = $this->withHeaders(['X-Testing-Token' => 'valid'])
        ->withServerVariables(['HTTP_HOST' => 'qa.rotatingroom.com'])
        ->postJson('/testing/reset');

    $response->assertStatus(200);
}

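public function test_reset_blocked_without_flag(): void {
    // Under APP_ENV=testing the routes are always registered (Layer 1), so flipping the
    // config flag at runtime cannot reproduce the boot-time 404. Layer 2 rejects with 403.
    // The 404 path (flag absent at boot on QA) has to be probed against a real deployment.
    config(['app.testing_token' => 'valid', 'app.qa_reset_allowed' => false]);

    $response = $this->withHeaders(['X-Testing-Token' => 'valid'])
        ->withServerVariables(['HTTP_HOST' => 'qa.rotatingroom.com'])
        ->postJson('/testing/reset');

    $response->assertStatus(403);
}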

public function test_reset_blocked_with_bad_token(): void {
    config(['app.testing_token' => 'valid', 'app.qa_reset_allowed' => true]);

    $response = $this->withHeaders(['X-Testing-Token' => 'wrong'])
        ->withServerVariables(['HTTP_HOST' => 'qa.rotatingroom.com'])
        ->postJson('/testing/reset');

    $response->assertStatus(403);
}

public function test_concurrent_reset_returns_423_locked(): void {
    // Hold pg_advisory_lock in a separate DB connection; second caller must return 423.
    // See Task 1 Step 3 implementation for lock key.
    config(['app.testing_token' => 'valid', 'app.qa_reset_allowed' => true]);
    $lockKey = crc32('qa-testing-reset');

    DB::connection('primary-raw')->select('SELECT pg_advisory_lock(?)', [$lockKey]);
    try {
        $response = $this->withHeaders(['X-Testing-Token' => 'valid'])
            ->withServerVariables(['HTTP_HOST' => 'qa.rotatingroom.com'])
            ->postJson('/testing/reset');
        $response->assertStatus(423);
        $this->assertEquals('30', $response->headers->get('Retry-After'));
    } finally {
        DB::connection('primary-raw')->select('SELECT pg_advisory_unlock(?)', [$lockKey]);
    }
}

Step 2: Run tests → FAIL (gate + lock not implemented).

Step 3: Implement all three layers + advisory lock.

Layer 1 — bootstrap route registration (env-var only, boot time):

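// bootstrap/app.php:26-30
// Note: once `php artisan config:cache` has run, env() only sees real process
// environment variables, not .env entries, so QA_RESET_ALLOWED must be exported
// in the server environment for this check to see it.
then: function () {
    if (env('APP_ENV') === 'testing' || env('QA_RESET_ALLOWED') === '1') {
        Route::middleware(['web', \App\Http\Middleware\EnsureQaResetAllowed::class])
            ->group(base_path('routes/testing.php'));
    }
},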

CSRF exception (bootstrap/app.php:52-54):

if (env('APP_ENV') === 'testing' || env('QA_RESET_ALLOWED') === '1') {
    $middleware->validateCsrfTokens(except: ['testing/*']);
}

Layer 2 — per-request hostname middleware:

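// app/Http/Middleware/EnsureQaResetAllowed.php
public function handle(Request $request, Closure $next): mixed {
    $allowed = config('app.qa_reset_allowed') === true;
    $host = $request->getHost();
    // Local/CI bypass is limited to loopback hosts (assumes feature tests default to
    // localhost). An unconditional environment('testing') bypass would let
    // test_reset_blocked_on_wrong_hostname through with a 200.
    $onQaHost = $host === 'qa.rotatingroom.com'
                || (app()->environment('testing') && in_array($host, ['localhost', '127.0.0.1'], true));

    if (!$allowed || !$onQaHost) {
        abort(403, 'Testing endpoints are not available for this host.');
    }

    return $next($request);
}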

Layer 3 — controller with advisory lock + expanded reset:

// app/Http/Controllers/TestingController.php
public function reset(Request $request): JsonResponse {
    $this->validateToken($request);

    $lockKey = crc32('qa-testing-reset');
    $acquired = DB::selectOne('SELECT pg_try_advisory_lock(?) AS got', [$lockKey])->got;
    if (!$acquired) {
        return response()
            ->json(['error' => 'reset in progress'], 423)
            ->header('Retry-After', '30');
    }

    try {
        Schema::disableForeignKeyConstraints();
        $skipTables = ['migrations', 'spatial_ref_sys', 'geometry_columns', 'geography_columns', 'raster_columns', 'raster_overviews'];
        foreach (Schema::getTableListing() as $t) {
            if (!in_array($t, $skipTables)) DB::table($t)->truncate();
        }
        Schema::enableForeignKeyConstraints();

        Artisan::call('db:seed', ['--class' => 'Database\\Seeders\\TestDatabaseSeeder', '--force' => true]);
        Artisan::call('cache:clear');
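        // --force: QA runs with APP_ENV=production, where queue:clear asks for confirmation.
        Artisan::call('queue:clear', ['--force' => true]);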

        // Mailpit clear (A.7)
        if ($url = config('services.mailpit.url')) {
            try { Http::timeout(3)->delete("{$url}/api/v1/messages"); }
            catch (Throwable $e) { /* non-fatal — Mailpit unreachable shouldn't block reset */ }
        }

        // Stripe test reset (Task 0c integration point).
        // Hard 5s cap per Stripe API call inside the helper; fail-open on timeout
        // (see Ahmed Concern #4 — a Stripe tail-latency hang would otherwise hold
        // the advisory lock for seconds and block the queued smoke runs).
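        // Pattern is passed explicitly: the container cannot autowire the scalar
        // constructor argument unless a binding supplies it. A missing config value
        // becomes '' and is rejected by the allowlist (fail-closed).
        $stripeResult = app(\App\Services\Testing\StripeTestReset::class, [
            'pattern' => (string) config('testing.stripe_email_pattern'),
        ])->purgeTestCustomers();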

        return response()->json([
            'status' => 'reset_complete',
            'stripe' => $stripeResult,  // {pattern_used, purged, skipped_timeout, duration_ms}
        ]);
    } finally {
        DB::statement('SELECT pg_advisory_unlock(?)', [$lockKey]);
    }
}

Lock-scope rationale (addresses Ahmed Concern #4): the advisory lock covers the full reset body including the Stripe purge, but StripeTestReset::purgeTestCustomers() internally enforces a 5s hard cap per API call and returns partial-success on timeout (never throws). This keeps atomicity: a second caller sees the full post-reset state, not half of it. It also bounds the lock hold to roughly (DB truncate + reseed + Mailpit) ≈ 20-25s plus one 5s Stripe timeout, for a target of ≤30s p95. That bound assumes the helper stops at the first timed-out call; because the cap is per call, a run of slow calls that each finish just under 5s could still exceed it, so a total-duration budget for the purge may be worth adding. Moving Stripe outside the lock would be simpler but re-introduces the race: a second caller could see the DB in its reset state while Stripe still has prior-run customers. The plan accepts the broader lock scope plus per-call timeout as the right tradeoff.
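The lock-hold arithmetic can be written out as a tiny budget function. This is a sketch only: `worstCaseLockHoldMs` and its field names are hypothetical, and the 25s figure for the DB and Mailpit portion is the upper end of the estimate above rather than a measured number (Task 0 is meant to supply that).

```typescript
interface LockBudgetInput {
  dbAndMailpitMs: number; // truncate + reseed + cache/queue flush + Mailpit clear
  stripeCallCapMs: number; // hard cap applied to each Stripe API call
  stripeCallsAtCap: number; // how many Stripe calls are allowed to run up to the cap
}

// Upper bound on how long the advisory lock is held for one reset.
function worstCaseLockHoldMs(input: LockBudgetInput): number {
  return input.dbAndMailpitMs + input.stripeCallCapMs * input.stripeCallsAtCap;
}

// One capped Stripe call: matches the 30s target.
const singleTimeout = worstCaseLockHoldMs({
  dbAndMailpitMs: 25_000,
  stripeCallCapMs: 5_000,
  stripeCallsAtCap: 1,
});

// Three slow calls in a row: the per-call cap alone no longer keeps the hold under 30s.
const threeSlowCalls = worstCaseLockHoldMs({
  dbAndMailpitMs: 25_000,
  stripeCallCapMs: 5_000,
  stripeCallsAtCap: 3,
});
```

The second case shows why the number of Stripe calls that may approach the cap matters as much as the cap itself.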

Remove the fabricated hostname_override code path from any earlier plan draft — test seam is HTTP Host header only, no production-code branches.
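Taken together, the layers and the lock give each request exactly one outcome. The sketch below models that ordering for a QA-style deployment (it deliberately ignores the local/CI testing-environment path); `ResetRequest` and `expectedStatus` are hypothetical names, and the status codes are the ones asserted in the gate tests.

```typescript
interface ResetRequest {
  flagSetAtBoot: boolean; // QA_RESET_ALLOWED=1 when the app booted
  host: string;
  tokenValid: boolean;
  lockHeld: boolean; // another reset currently holds the advisory lock
}

function expectedStatus(req: ResetRequest): 200 | 403 | 404 | 423 {
  // Layer 1: without the flag the routes are never registered.
  if (!req.flagSetAtBoot) return 404;
  // Layer 2: per-request hostname check.
  if (req.host !== "qa.rotatingroom.com") return 403;
  // Layer 3: token check in the controller.
  if (!req.tokenValid) return 403;
  // Advisory lock: a concurrent reset is told to retry later.
  if (req.lockHeld) return 423;
  return 200;
}

const happyPath: ResetRequest = {
  flagSetAtBoot: true,
  host: "qa.rotatingroom.com",
  tokenValid: true,
  lockHeld: false,
};
```

Because each check returns early, a request to the wrong host never reaches token validation or the lock, which is what keeps staging and prod untouched even if a token leaks.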

Step 4: Run tests → PASS (all 5 gate tests + reset body).

Step 5: Commit. feat(testing): three-layer gate + advisory lock + expanded reset body

Task 2: Expand TestDatabaseSeeder with persona catalog + scenario fixtures

Files:

Modify: database/seeders/TestDatabaseSeeder.php (add seedEduPersonas, seedScenarios)
Create: database/seeders/data/personas.php (canonical persona list)
Modify: tests/playwright/utils/test-data.ts (mirror persona catalog, add .edu variants)
Create: tests/Feature/PersonaCatalogParityTest.php (CI drift check)
Create: tests/fixtures/scenarios.md (human-readable scenario documentation)

Step 1: Write parity test — AST-based, not regex (per Assumption 8).

Instead of regexing the TS file (fragile against as const, satisfies, nested objects, object shorthand), run a small Node helper that imports the module and emits its persona keys as JSON. The PHP test invokes the helper via Symfony Process in its array-argument form (no shell parsing; Symfony Process is already used elsewhere in the codebase) and compares JSON arrays.

// tests/Feature/PersonaCatalogParityTest.php
public function test_php_and_ts_persona_lists_match(): void {
    // FUNCTIONAL RISKS: persona drift between PHP seeder and TS test-data causes
    // "correct persona but wrong state" failures; CI must block drift.
    $phpPersonas = collect(include database_path('seeders/data/personas.php'))->keys()->sort()->values()->all();

    $process = new \Symfony\Component\Process\Process(['npx', '--yes', 'tsx', base_path('tests/utils/extract-persona-keys.mjs')]);
    $process->run();
    $output = trim($process->getOutput());
    $this->assertNotEmpty($output, 'TS persona extractor returned empty output: ' . $process->getErrorOutput());
    $tsPersonas = json_decode($output, true);
    $this->assertIsArray($tsPersonas, 'TS extractor output was not valid JSON: ' . $output);

    $this->assertEquals($phpPersonas, $tsPersonas, 'PHP persona list must match TS persona list');
}

Create tests/utils/extract-persona-keys.mjs:

// tests/utils/extract-persona-keys.mjs
import { personas } from '../playwright/utils/test-data.ts';
console.log(JSON.stringify(Object.keys(personas).sort()));

CI installs tsx via npm install -g tsx or uses npx --yes tsx per-invocation. This is robust against every TS idiom because the module is actually evaluated by Node through the tsx loader, not parsed with a regex. Symfony Process is used rather than PHP's shell functions because it's the project's standard (safer argument passing, existing dependency).
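When the parity test fails, a plain "arrays differ" message is hard to act on. A small diff helper, sketched below, reports which keys are missing on each side; `diffPersonaKeys` is a hypothetical name and the key lists are a subset chosen for illustration, not the full catalog.

```typescript
interface KeyDiff {
  missingInTs: string[]; // present in the PHP catalog, absent from test-data.ts
  missingInPhp: string[]; // present in test-data.ts, absent from the PHP catalog
}

function diffPersonaKeys(phpKeys: string[], tsKeys: string[]): KeyDiff {
  const php = new Set(phpKeys);
  const ts = new Set(tsKeys);
  return {
    missingInTs: phpKeys.filter((k) => !ts.has(k)).sort(),
    missingInPhp: tsKeys.filter((k) => !php.has(k)).sort(),
  };
}

// Illustrative subset: the TS side has not yet picked up `qaEduUnverified`
// and still carries a stale `demoUser` key.
const drift = diffPersonaKeys(
  ["qaLister", "qaDemo", "qaBlocked", "qaEduUnverified"],
  ["qaLister", "qaDemo", "qaBlocked", "demoUser"],
);
```

The extractor could print this diff on mismatch so the CI failure names the persona to add or remove.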

Step 2: Run → FAIL (file doesn't exist).

Step 3: Create database/seeders/data/personas.php — self-documenting names, shared password, qa- prefix on every email so StripeTestReset's purge pattern catches them (Ahmed Concern #5):

<?php
// Every persona email starts with `qa-` to match QA_STRIPE_EMAIL_PATTERN (`qa-*@*`)
// and the broader "recognizable QA user" convention (per 2026-04-24 decision).
// Password for every persona: `RR4Life!` (stored in config('testing.persona_password')).
// Keys are camelCase for TS parity; emails are kebab-case for URL/email readability.
return [
    // --- Stable non-.edu users ---
    'qaLister' => [
        'id' => 1001, 'email' => 'qa-lister@rotatingroom.com',
        'email_verified_at' => 'now', 'active' => 1,
        'owns_baseline_rooms' => true,  // owner of all 60 rooms + scenario fixtures 9001-9003
    ],
    'qaSupport' => [
        'id' => 1002, 'email' => 'qa-support@rotatingroom.com',
        'active' => 1,
    ],
    'qaFounder' => [
        'id' => 1003, 'email' => 'qa-founder@rotatingroom.com',
        'active' => 1,
    ],
    'qaDemo' => [
        'id' => 1004, 'email' => 'qa-demo@rotatingroom.com',
        'email_verified_at' => 'now', 'active' => 1,
    ],
    'qaVerifyExpired' => [
        'id' => 1005, 'email' => 'qa-verify-expired@rotatingroom.com',
        'email_verified_at' => '-7 months', 'active' => 1,
    ],
    'qaBlocked' => [
        'id' => 1006, 'email' => 'qa-blocked@rotatingroom.com',
        'active' => 1, 'blocked' => 1,
    ],
    'qaInactive' => [
        'id' => 1007, 'email' => 'qa-inactive@rotatingroom.com',
        'active' => 0,
    ],
    // --- .edu verification personas ---
    'qaEduUnverified' => [
        'id' => 1010, 'email' => 'qa-edu-unverified@example.edu',
        'email_verified_at' => null, 'active' => 1,
        'owns_pending_free_listing' => 9004,  // scenario fixture room ID
    ],
    'qaEduVerified' => [
        'id' => 1011, 'email' => 'qa-edu-verified@example.edu',
        'email_verified_at' => 'now', 'active' => 1,
    ],
    'qaEduLocked' => [
        'id' => 1012, 'email' => 'qa-edu-locked@example.edu',
        'email_verification_failed_attempts' => 5,
        'email_verification_locked_at' => 'now',
    ],
    'qaEduVerifying' => [
        'id' => 1013, 'email' => 'qa-edu-verifying@example.edu',
        'email_verification_failed_attempts' => 3,  // exercises attempts 4-5
    ],
];

Step 3b: Create database/seeders/data/scenarios.php — non-user fixtures (room-level state):

<?php
return [
    'paidDraftListing' => ['room_id' => 9001, 'owner_id' => 1001, 'plan' => 'rr-monthly', 'status' => 'pending_payment'],
    'fraudFlaggedRoom' => ['room_id' => 9002, 'owner_id' => 1001, 'is_fraudulent' => true, 'status' => 'inactive'],
    'underReviewRoom' => ['room_id' => 9003, 'owner_id' => 1001, 'under_review' => true, 'status' => 'active'],
    'eduPendingFreeListing' => ['room_id' => 9004, 'owner_id' => 1010, 'plan' => 'rr-free', 'status' => 'pending_free'],
];

Step 3c: Add persona-email-matches-purge-pattern invariant (Ahmed Concern #5 — Task 3 probe extension):

// tests/Feature/PersonaEmailPurgePatternTest.php
public function test_every_persona_email_matches_stripe_purge_pattern(): void {
    // FUNCTIONAL RISKS: if a persona email doesn't match QA_STRIPE_EMAIL_PATTERN,
    // specs 18/35 create Stripe customers for them that never get cleaned up,
    // re-creating the exact Stripe-side drift this plan is trying to eliminate.
    $personas = include database_path('seeders/data/personas.php');
    $pattern = config('testing.stripe_email_pattern', 'qa-*@*');
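    // Convert wildcard pattern to regex (StripeTestReset must use the same logic).
    // Escape with preg_quote() first, then expand the escaped `\*` into `.*`.
    // Replacing `*` and then `.` in one str_replace() call would also escape the
    // dot of the freshly inserted `.*`, producing a regex that matches nothing useful.
    $regex = '/^' . str_replace('\*', '.*', preg_quote($pattern, '/')) . '$/i';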

    foreach ($personas as $key => $p) {
        $this->assertMatchesRegularExpression(
            $regex, $p['email'],
            "Persona '{$key}' email '{$p['email']}' does not match Stripe purge pattern '{$pattern}'. " .
            "Specs that create Stripe customers for this persona will leak test-mode data across runs."
        );
    }
}
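The wildcard-to-regex conversion is the fragile part of this invariant, so it may help to see it isolated. The following is a minimal TypeScript sketch of the same idea, not the project's StripeTestReset code; the function name `wildcardToRegExp` is hypothetical. Order matters in any such conversion: escaping `.` after expanding `*` to `.*` would also escape the freshly inserted dot.

```typescript
// Hypothetical helper (not from the codebase): turns a '*' wildcard pattern
// such as 'qa-*@*' into an anchored, case-insensitive RegExp.
function wildcardToRegExp(pattern: string): RegExp {
  // 1. Escape every regex metacharacter, including '*' and '.'.
  const escaped = pattern.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // 2. Only then turn the escaped wildcard '\*' into '.*'.
  const expanded = escaped.replace(/\\\*/g, ".*");
  return new RegExp("^" + expanded + "$", "i");
}

// Usage: check a persona email against the default purge pattern.
const purgePattern = wildcardToRegExp("qa-*@*");
const listerMatches = purgePattern.test("qa-lister@rotatingroom.com");
```

Escaping first and expanding second keeps literal dots in a pattern (for example in a fixed domain) from silently becoming "any character".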

Step 3d: Create docs/QA_PERSONAS.md — team-facing cheat-sheet:

# QA Personas — canonical test users on the QA server

All passwords: `RR4Life!` (same as prototype/staging QA credentials).
All reset to this state before every `/qa-smoke` run.

| Email | ID | State | When would I log in as this user? |
|-------|----|----|------|
| qa-lister@rotatingroom.com | 1001 | verified, owns 60 baseline rooms | "I want to see the lister dashboard / edit listing flows" |
| qa-demo@rotatingroom.com | 1004 | verified, non-.edu | "I want to test as a regular authenticated user" |
| qa-edu-unverified@example.edu | 1010 | unverified .edu with pending free listing | "I want to test the first-time .edu verification flow" |
| qa-edu-verified@example.edu | 1011 | verified .edu | "I want to test post-verification state" |
| qa-edu-locked@example.edu | 1012 | 5 failed attempts, locked | "I want to test the lockout UI" |
| qa-edu-verifying@example.edu | 1013 | 3 failed attempts | "I want to test attempts 4-5 of verification" |
| qa-blocked@rotatingroom.com | 1006 | blocked | "I want to test blocked-user paths" |
| qa-verify-expired@rotatingroom.com | 1005 | verified 7mo ago | "I want to test re-verification prompt" |
| qa-inactive@rotatingroom.com | 1007 | inactive | "I want to test inactive-account paths" |
| qa-support@rotatingroom.com | 1002 | active | "I want to test ops/support flows" |
| qa-founder@rotatingroom.com | 1003 | active | "I want to test founder/strategic flows" |
| qa-admin@rotatingroom.com (admin) | admin #1 | active | "I want to log in as Backpack admin" |

**Linked from:** `CLAUDE.md` (QA section), smoke dashboard header, `#qa` channel topic.

Task 2 ensures this doc is updated any time a persona is added/removed (parity test fails the build otherwise).

Step 4: Modify TestDatabaseSeeder::run() to iterate the catalog:

$catalog = include database_path('seeders/data/personas.php');
foreach ($catalog as $key => $config) {
    $this->seedPersona($key, $config);
}

Step 5: Mirror into tests/playwright/utils/test-data.ts — keys and emails must match personas.php exactly (parity test enforces it; per Codex v2 P1 finding). Every email carries the qa- prefix to match the Stripe purge pattern. Password comes from the shared constant, not per-persona:

// tests/playwright/utils/test-data.ts
export const PERSONA_PASSWORD = 'RR4Life!';  // shared across all personas; matches config('testing.persona_password')

export const personas = {
  qaLister:        { email: 'qa-lister@rotatingroom.com',         password: PERSONA_PASSWORD, name: 'QA Lister', canSendQueries: false },
  qaSupport:       { email: 'qa-support@rotatingroom.com',        password: PERSONA_PASSWORD, name: 'QA Support', canSendQueries: false },
  qaFounder:       { email: 'qa-founder@rotatingroom.com',        password: PERSONA_PASSWORD, name: 'QA Founder', canSendQueries: false },
  qaDemo:          { email: 'qa-demo@rotatingroom.com',           password: PERSONA_PASSWORD, name: 'QA Demo', canSendQueries: false },
  qaVerifyExpired: { email: 'qa-verify-expired@rotatingroom.com', password: PERSONA_PASSWORD, name: 'QA Verify Expired', canSendQueries: false },
  qaBlocked:       { email: 'qa-blocked@rotatingroom.com',        password: PERSONA_PASSWORD, name: 'QA Blocked', canSendQueries: false },
  qaInactive:      { email: 'qa-inactive@rotatingroom.com',       password: PERSONA_PASSWORD, name: 'QA Inactive', canSendQueries: false },
  qaEduUnverified: { email: 'qa-edu-unverified@example.edu',      password: PERSONA_PASSWORD, name: 'QA Edu Unverified', canSendQueries: true },
  qaEduVerified:   { email: 'qa-edu-verified@example.edu',        password: PERSONA_PASSWORD, name: 'QA Edu Verified', canSendQueries: true },
  qaEduLocked:     { email: 'qa-edu-locked@example.edu',          password: PERSONA_PASSWORD, name: 'QA Edu Locked', canSendQueries: true },
  qaEduVerifying:  { email: 'qa-edu-verifying@example.edu',       password: PERSONA_PASSWORD, name: 'QA Edu Verifying', canSendQueries: true },
} as const;
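Because the catalog is declared `as const`, specs can derive a key type from it so that a typo in a persona name fails at compile time instead of at login. A minimal sketch, assuming an abbreviated two-entry copy of the catalog; `PersonaKey`, `credentialsFor`, and `personasMissingQaPrefix` are hypothetical names, not existing helpers:

```typescript
// Abbreviated copy for illustration only; the real object lives in test-data.ts.
const PERSONA_PASSWORD = "RR4Life!";
const personas = {
  qaDemo: { email: "qa-demo@rotatingroom.com", password: PERSONA_PASSWORD, name: "QA Demo", canSendQueries: false },
  qaEduLocked: { email: "qa-edu-locked@example.edu", password: PERSONA_PASSWORD, name: "QA Edu Locked", canSendQueries: true },
} as const;

// Union of valid persona keys, derived from the object itself.
type PersonaKey = keyof typeof personas;

// Hypothetical helper: credentialsFor("qaEduLokced") would not compile.
function credentialsFor(key: PersonaKey): { email: string; password: string } {
  const { email, password } = personas[key];
  return { email, password };
}

// Hypothetical TS-side mirror of the qa- prefix rule: returns offending keys.
function personasMissingQaPrefix(): string[] {
  return (Object.keys(personas) as PersonaKey[]).filter(
    (key) => personas[key].email.indexOf("qa-") !== 0,
  );
}
```

The PHP-side purge-pattern test remains the authoritative check; this only gives spec authors earlier feedback.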

Step 6: Run parity test → PASS.

Step 6a: Post-seed state verification test (orchestrator shock-test addition). Beyond the parity test (keys match between PHP and TS) and the CI migration-seeder test (seeder runs without errors), add a concrete assertion suite that the seeded state is exactly right:

// tests/Feature/SeededStateTest.php
public function test_seeded_users_exist_with_correct_state(): void {
    // FUNCTIONAL RISKS: seeder typo (wrong ID, missing field, wrong flag) silently
    // corrupts baseline state; every downstream spec inherits the corruption.
    $this->artisan('db:seed', ['--class' => 'Database\\Seeders\\TestDatabaseSeeder', '--force' => true]);

    // Spot-check each persona by exact state
    $this->assertDatabaseHas('users', ['id' => 1010, 'email' => 'qa-edu-unverified@example.edu', 'email_verified_at' => null, 'active' => 1]);
    $this->assertDatabaseHas('users', ['id' => 1012, 'email' => 'qa-edu-locked@example.edu', 'email_verification_failed_attempts' => 5]);
    $this->assertDatabaseHas('users', ['id' => 1006, 'email' => 'qa-blocked@rotatingroom.com', 'blocked' => 1]);
    // ... all 11 personas + admin
    $this->assertDatabaseHas('admin_users', ['email' => 'qa-admin@rotatingroom.com']);

    // Scenario fixture rooms
    $this->assertDatabaseHas('rooms', ['id' => 9001, 'user_id' => 1001, 'status' => 'pending_payment']);
    $this->assertDatabaseHas('rooms', ['id' => 9002, 'user_id' => 1001, 'is_fraudulent' => true]);
    $this->assertDatabaseHas('rooms', ['id' => 9004, 'user_id' => 1010]);

    // Baseline rooms count
    $this->assertEquals(60, \App\Models\Room::where('id', '<=', 60)->count());

    // Sequence advance (note: nextval() itself consumes one value from the sequence)
    $this->assertGreaterThanOrEqual(1014, \Illuminate\Support\Facades\DB::selectOne("SELECT nextval('users_id_seq') AS n")->n);
}

Runs in the same CI slot as the migration-seeder check. Catches seeder typos before they reach QA.

Step 6b: Verify admin seeding. qa-admin@rotatingroom.com is the Backpack admin for QA (appears in Appendix B.1). It lives in the admin_users table, not users, so it's not part of personas.php. TestDatabaseSeeder::seedAdmins() must still run and produce this exact email. Task 2 explicitly checks this and updates the seeder if the existing email differs from qa-admin@rotatingroom.com (per the naming convention).

Step 6c: Sequence advancement. After seeding, advance PG sequences past max seeded ID so spec inserts don't collide:

DB::statement("SELECT setval('users_id_seq', 1013)");        // next insert → 1014
DB::statement("SELECT setval('rooms_id_seq', 9004)");        // next insert → 9005
// plus admin_users, cities, stripe_plans, allows, restrictions per Appendix B.7

Step 6d: Team-facing documentation. In addition to docs/QA_PERSONAS.md (from Step 3d), Task 2 links the cheat-sheet from every discovery surface an engineer would hit:

CLAUDE.md Testing section — add a "QA Personas" subsection with link to docs/QA_PERSONAS.md
#qa Slack channel topic — update via Slack API to reference the cheat-sheet URL
Smoke dashboard header (tests/playwright/qa-smoke/dashboard-golden.html) — add a small "QA personas: [link]" header note
.claude/skills/qa-smoke/SKILL.md — reference under a new "Known personas" subsection

Without these links, the cheat-sheet rots from day 1 — a new engineer who doesn't know it exists will never find it.

Step 6e: Password rotation runbook. Create docs/handoffs/qa-password-rotation.md covering the rotation procedure (referenced by Risk 11):

Generate new password (21+ chars, same entropy class as current prototype/staging QA credentials).
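Update config/testing.php's persona_password value and the PERSONA_PASSWORD constant in tests/playwright/utils/test-data.ts (the two must stay identical, or every spec login fails after the next reset).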
If QA server .env overrides TESTING_PERSONA_PASSWORD, update it there too.
Update docs/QA_PERSONAS.md header.
Re-deploy to QA.
Trigger /testing/reset to re-seed with new bcrypt.
Run a small smoke subset (npx playwright test --grep qa-demo) to verify login still works.
Slack #qa with the new password (or reference to the secrets manager where it's stored).

Step 7: Commit. feat(testing): persona catalog with .edu variants + CI parity check + discoverability links

Task 3: Gate-bypass + Stripe-pattern + persona-parity adversarial probes

Files:

Create: tests/Feature/TestingResetGateBypassTest.php (gate layers)
Create: tests/Feature/StripeTestResetPatternGuardTest.php (Ahmed Concern #3 — pattern validation)
Create: tests/Feature/PersonaEmailPurgePatternTest.php (Ahmed Concern #5 — persona ↔ purge invariant; already referenced in Task 2 but the probe suite is this task's responsibility)

Coverage:

Gate-bypass probes: every combination of missing gate condition against a simulated staging/prod environment. Hard-blocks any request that shouldn't succeed (see Task 1 test file — Task 3 extends it with fuzzing over unexpected combos).
Stripe pattern guard (Ahmed Concern #3; allowlist implementation per v2 review): StripeTestReset::ALLOWED_PATTERNS is a hard-coded 4-entry set (qa-*@*, qa-smoke-*@*, qa-smoketest@*, qa-proplan-test@*). Test matrix covers membership, not pattern shape:
- Allowed (pass): each of the 4 entries verbatim — qa-*@*, qa-smoke-*@*, qa-smoketest@*, qa-proplan-test@*.
- Rejected (throw InvalidArgumentException): null, "", " ", "*", "*@*", "@*", ".+", "qa-*@example.com" (not in allowlist — reject even though qa- prefix present), "qa-smoke-*@example.com" (same — narrow variant not in list), "edu-*@*" (no qa- prefix).
- Test asserts HTTP response is 500 (not 200) when endpoint is called with a rejected pattern. The "add a new pattern requires a PR" invariant is enforced here: any pattern outside the allowlist fails loudly at construction, and widening the allowlist requires editing StripeTestReset.php.
Persona-email purge-pattern invariant (new, Ahmed Concern #5): the PersonaEmailPurgePatternTest described in Task 2 Step 3c — every seeded persona email must match the default QA_STRIPE_EMAIL_PATTERN. Prevents the failure mode where a persona added later doesn't match the purge pattern and accumulates Stripe customers across runs.
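The "membership, not pattern shape" rule in the Stripe pattern guard is easy to get subtly wrong, so here is a language-neutral sketch of it in TypeScript. The four allowlist entries are quoted from the plan; the function name `assertAllowedPattern` and the use of a plain `Error` (standing in for PHP's InvalidArgumentException) are assumptions for illustration.

```typescript
// Entries quoted from the plan's StripeTestReset::ALLOWED_PATTERNS description.
const ALLOWED_PATTERNS: readonly string[] = [
  "qa-*@*",
  "qa-smoke-*@*",
  "qa-smoketest@*",
  "qa-proplan-test@*",
];

// Hypothetical guard: accepts a pattern only if it is verbatim in the allowlist.
// It never inspects the pattern's shape, so "qa-*@example.com" is rejected
// even though it looks narrower than an allowed entry.
function assertAllowedPattern(pattern: unknown): string {
  if (typeof pattern !== "string" || ALLOWED_PATTERNS.indexOf(pattern) === -1) {
    throw new Error("Stripe purge pattern " + JSON.stringify(pattern) + " is not in the allowlist");
  }
  return pattern;
}
```

Exact-match membership means widening the purge scope always requires a code change, which is the invariant the test matrix is meant to pin down.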

Implementation blocker (flagged in Round 3 review): the EnsureQaResetAllowed middleware contains a || app()->environment('testing') bypass that makes the hostname check a no-op during PHPUnit feature tests. If Task 3 naively calls withServerVariables while APP_ENV=testing, the middleware passes via the bypass rather than the hostname check — so the adversarial probe tests nothing. Resolution options:

Mock the environment in each probe test: $this->app->detectEnvironment(fn () => 'production') before the request. Asserts the hostname gate fires in a non-testing env.
Add a narrow config — config('app.testing_host_bypass') === true — defaulting to false, explicitly set to true only in tests that need the bypass (gate tests in Task 1 set it to false).
Drop the bypass entirely and require all feature tests to set HTTP_HOST=qa.rotatingroom.com when calling /testing/reset. Simplest — no production-code branch.

Preferred: Option 3 (simplest, no prod-code branch). Update Task 1's EnsureQaResetAllowed middleware to remove the testing bypass, and update any existing feature test that hits /testing/reset to set the Host header.

Commit: test(testing): adversarial gate-bypass probe; drop testing-env bypass

Phase B — Smoke runner integration

Task 4: Call reset from smoke runner before each run

Files:

Modify: .claude/skills/qa-smoke/SKILL.md section 2 ("Clean previous state")
Modify: scripts/smoke/run.sh (if exists) or inline in skill

Step 1: Add reset step to skill — QA-only, gated on $TARGET (Codex P1 finding).

/qa-smoke is a multi-environment entry point (qa, staging, production, local). Reset MUST NOT run against any env except QA. Hard-gate the reset call on $TARGET:

# Section 2.1: reset target database to baseline state — QA ONLY.
if [ "$TARGET" != "qa" ]; then
  echo "Skipping reset: target is '$TARGET' (reset only runs against qa)."
  echo "  - staging/production: reset is blocked by multiple gates AND is catastrophic; never automate."
  echo "  - local: developer manages their own DB reset via 'php artisan migrate:fresh --seed --seeder=Database\\Seeders\\TestDatabaseSeeder --force'."
else
  echo "Resetting QA database..."
  RESET_START=$(date +%s)

  # Test curl's exit status directly in the `if`: any intervening command (date, echo,
  # assignment) would overwrite $? before a later check could read it.
  if ! curl -sf -X POST "https://qa.rotatingroom.com/testing/reset" \
    -H "X-Testing-Token: $QA_TESTING_TOKEN" \
    -H "Content-Type: application/json" \
    --max-time 60; then
    echo "ERROR: Reset failed — aborting smoke run. Running against un-reset QA defeats the purpose."
    exit 2
  fi

  RESET_END=$(date +%s)
  RESET_DURATION=$(( RESET_END - RESET_START ))
  echo "Reset complete in ${RESET_DURATION}s"

  # Telemetry: log reset duration to the smoke dashboard (non-blocking)
  curl -s -X POST "http://localhost:3456/api/smoke/reset-telemetry" \
    -H 'Content-Type: application/json' \
    -d "{\"env\":\"$TARGET\",\"durationSec\":$RESET_DURATION}" --max-time 5 2>/dev/null || true
fi

Note on if ! curl pattern: the reset-failure check is intentionally the first thing that runs after curl returns. Any intervening commands (date, echo, assignment) would overwrite $?, making the subsequent [ $? -ne 0 ] test meaningless — it would check the last command's status, not curl's. This is the fail-open pattern Codex Round 3 flagged in the Round-2 draft of the plan.

423 Locked retry behavior (per Risk 14, v2 review): when /testing/reset returns 423 Locked (another operator's reset is in flight), the skill script retries up to 3 times with 30s backoff before aborting — the second operator's suite queues cleanly behind the first rather than failing immediately. Pseudocode (uses explicit status-code check rather than -sf, which has cross-version quirks around -w output when --fail triggers):

RESET_ATTEMPTS=0
MAX_ATTEMPTS=3
while true; do
  HTTP_CODE=$(curl -s -o /tmp/reset-out.json -w '%{http_code}' -X POST \
    "https://qa.rotatingroom.com/testing/reset" \
    -H "X-Testing-Token: $QA_TESTING_TOKEN" \
    -H "Content-Type: application/json" \
    --max-time 60)

  if [ "$HTTP_CODE" -ge 200 ] && [ "$HTTP_CODE" -lt 300 ]; then
    break  # success — 2xx
  fi

  if [ "$HTTP_CODE" = "423" ] && [ "$RESET_ATTEMPTS" -lt "$MAX_ATTEMPTS" ]; then
    RESET_ATTEMPTS=$(( RESET_ATTEMPTS + 1 ))
    echo "Reset in progress (another operator); waiting 30s (attempt $RESET_ATTEMPTS/$MAX_ATTEMPTS)..."
    sleep 30
    continue
  fi

  echo "ERROR: Reset failed with HTTP $HTTP_CODE — aborting."
  cat /tmp/reset-out.json 2>/dev/null
  exit 2
done
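The retry policy in the loop above (2xx succeeds, 423 retries while retries remain, anything else aborts) can also be isolated as a pure decision function, which makes the 423 semantics unit-testable without a network. This is a sketch under stated assumptions: `nextResetAction` and `resetWithRetry` are hypothetical names, and the request and wait steps are injected callbacks rather than real HTTP or timer calls.

```typescript
type ResetAction = "done" | "retry" | "abort";

// Pure decision mirroring the shell loop's three branches.
function nextResetAction(httpStatus: number, retriesUsed: number, maxRetries: number): ResetAction {
  if (httpStatus >= 200 && httpStatus < 300) return "done";
  if (httpStatus === 423 && retriesUsed < maxRetries) return "retry";
  return "abort";
}

// Hypothetical synchronous driver. `post` performs one reset request and returns
// the HTTP status; `wait` blocks for the backoff. Returns the number of retries used.
function resetWithRetry(
  post: () => number,
  wait: (ms: number) => void,
  maxRetries: number = 3,
  backoffMs: number = 30000,
): number {
  let retriesUsed = 0;
  for (;;) {
    const status = post();
    const action = nextResetAction(status, retriesUsed, maxRetries);
    if (action === "done") return retriesUsed;
    if (action === "abort") throw new Error("Reset failed with HTTP " + status);
    retriesUsed += 1;
    wait(backoffMs);
  }
}
```

With the defaults, a permanently locked endpoint produces four requests and three waits before the abort, matching "retries up to 3 times". An async variant would await both callbacks but keep the same decision function.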

Commit: feat(qa-smoke): call reset via Playwright setup project; retry on 423, abort on other failures

Task 4a: Pre-QA canary (local clone validation)

Files:

Create: docs/handoffs/2026-04-XX-smoke-reset-canary.md (canary results — populated when Task 4a runs)

Steps:

pg_dump current QA DB to a local rotatingroom_qa_canary database. Do NOT truncate QA.
Point a local Laravel instance at the canary DB (temporary .env.canary or connection override).
Run /testing/reset 20 times against the canary. Record: p50/p95/p99, any FK violations, any orphan rows, any Stripe quota warnings, Mailpit reachability.
If any iteration fails, fix before proceeding to Phase B Task 4.

Exit criterion: 20 clean iterations on canary; handoff doc published with metrics.

Commit: docs(smoke): local canary results for reset loop

Task 4b: Cut-over announcement

Files:

Create: docs/handoffs/2026-04-XX-smoke-reset-cutover.md (populated with actual date when Phase B ships)

Before Task 4 lands, post in #qa:

"Heads up — starting [date], every smoke run resets the QA database to a clean seeded state. If you're doing manual UI testing on QA, please (a) finish your session before [time], or (b) ping this thread to request a reset-free window."

Track acknowledgements from Mahmoud + Megan before landing Task 4.

Commit: docs(qa): cut-over announcement for deterministic reset

Task 4c: Pre-run active-session coordination (replaces original 4h window)

Files:

Modify: .claude/skills/qa-smoke/SKILL.md (add pre-reset coordination step before reset call)
Modify: app/Http/Controllers/TestingController.php (add a read-only activeSessions action that returns active Backpack admin sessions + recent login activity)

Why this is simpler than the original 4h window (per Megan's feedback + Gaurav direction): the original plan proposed an operator-settable QA_RESET_ALLOWED=0 window that auto-expires after 4 hours. Megan correctly pointed out that QA review sessions can bleed into meetings, support calls, or next-day follow-ups — a hard 4h cap creates false urgency. A scheduled auto-resume is also a new moving part to fail.

Simpler replacement: reset only runs when someone invokes /qa-smoke. No scheduled window, no auto-expire, no env-flag toggling. The skill adds a pre-reset coordination check:

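# New step in qa-smoke skill, before calling /testing/reset.
# Fail closed: if the check itself errors, abort rather than reset blind.
ACTIVE=$(curl -sf --max-time 10 -H "X-Testing-Token: $QA_TESTING_TOKEN" \
  "https://qa.rotatingroom.com/testing/active-sessions") || {
  echo "ERROR: could not read active sessions; aborting before reset."; exit 2; }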

if echo "$ACTIVE" | jq -e '.backpack_admin_active == true' >/dev/null; then
  echo "⚠️  Active Backpack admin session on QA — likely someone doing manual review."
  echo "    Last activity: $(echo "$ACTIVE" | jq -r '.last_admin_activity')"
  echo ""
  echo "Options:"
  echo "  1. Continue anyway (their session gets wiped)"
  echo "  2. Abort and ping #qa to coordinate"
  read -r -p "Choice [1/2]: " CHOICE
  [ "$CHOICE" = "2" ] && { echo "Aborted. Coordinate in #qa."; exit 3; }
fi

Why this works: we already coordinate for any shared-resource action on QA. The pre-run check surfaces the conflict at the moment it matters (not via a timed window that may or may not still be active), gives the operator an explicit choice, and costs nothing when QA is idle (the fast path is the common path — backpack_admin_active == false).

The active-sessions endpoint: read-only, returns JSON {backpack_admin_active: bool, last_admin_activity: iso8601, recent_logins: [...]}. Gated by the same 3-layer stack as /testing/reset (Task 1) — same token, same hostname, same env flag. No destructive side effects, but the information it returns is sensitive, so gate it the same way.
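A consumer of this response has to decide what to do when the body is missing or malformed. The sketch below shows one fail-closed reading, assuming the JSON shape quoted above; the type and function names (`ActiveSessions`, `preResetDecision`) are hypothetical, and treating an unreadable response as "abort" is a design assumption rather than something the plan specifies.

```typescript
// Shape quoted from the plan; nullability of last_admin_activity is an assumption.
interface ActiveSessions {
  backpack_admin_active: boolean;
  last_admin_activity: string | null; // ISO 8601 timestamp
  recent_logins: unknown[];
}

type PreResetDecision = "proceed" | "prompt" | "abort";

// Hypothetical helper: maps the raw response body (null if the request failed)
// to the next step. Anything unparseable aborts instead of resetting blind.
function preResetDecision(rawBody: string | null): PreResetDecision {
  if (rawBody === null) return "abort";
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawBody);
  } catch (e) {
    return "abort";
  }
  if (typeof parsed !== "object" || parsed === null) return "abort";
  const active = (parsed as Partial<ActiveSessions>).backpack_admin_active;
  if (typeof active !== "boolean") return "abort";
  return active ? "prompt" : "proceed";
}
```

The idle fast path stays a single boolean check, while a broken endpoint can no longer be mistaken for "nobody is logged in".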

Commit: feat(qa-smoke): pre-reset active-session coordination (replaces 4h window)

Phase C — Spec migration (multiple PRs, one per spec cluster)

Task 5: Migrate Spec 34 (fraud-and-moderation) to use `fraudFlaggedRoom` fixture

Remove the adminUpdateUser(demoId, {is_fraudulent: 0}) cleanup call (currently a no-op) and rely on the reset contract instead: the fixture guarantees the exact starting state, and no per-spec cleanup is needed because the next run's reset handles it.

Commit: test(smoke): migrate spec 34 to use fraudFlaggedRoom fixture

Task 6: Migrate journey `edu-*` specs to use `.edu` personas

Replace all seededDemo logins in tests/playwright/journeys/edu-*.spec.ts with qaEduUnverified / qaEduVerified / qaEduLocked (the catalog keys from Task 2) as each scenario demands. Remove test.skip() fallbacks that existed for the unknown-state case.

Commit: test(journeys): use .edu personas; remove silent-skip fallbacks

Task 7: Migrate spec 13/21 (verification flow) to use `qaEduLocked` fixture

Spec 21's lockout test currently races the 5-failed-attempt counter; with the qaEduLocked persona pre-seeded to email_verification_failed_attempts=5 and email_verification_locked_at set, the test starts from the locked state deterministically.

Commit: test(smoke): spec 21 uses qaEduLocked fixture for deterministic lockout

Task 8: Migrate spec 18 (Stripe payment plans) to rely on reset contract

Spec 18 switches from ad-hoc payment-listing.json cleanup to relying on the reset contract. The reset contract comprises Task 1's /testing/reset endpoint + Task 0c's StripeTestReset helper (the endpoint calls the helper internally — see Task 1 Step 3 for the integration point). Remove savePaymentListing(null) workaround.

Commit: test(smoke): spec 18 relies on reset contract; remove null-marker workaround

Task 9: Remove ownership-guard workaround from `qa-helpers.ts`

Once specs rely on seeded fixtures instead of inheriting ephemeral listings, the ownership-guard helper probeValidRoomId in tests/playwright/utils/qa-helpers.ts:59 becomes redundant (it's the only such helper on master; getListingIdOwnedByCurrentLister and readOwnedListingId were added on branch tests/smoke-fixes-26-35-42-ownership-guard in PR #4283 — if that PR has merged by the time this task runs, include those in the deprecation too; if not, just probeValidRoomId). Mark @deprecated; specs 35/42/etc. switch to fixture-based room IDs from the persona catalog. Remove the helper(s) entirely once no specs import them (measured via CI grep).

Commit: refactor(smoke): deprecate ownership-guard; specs use seeded room fixtures

Phase D — Observability

Shock-test deferrals (noted here, not blocking Phase A–C):

Reset latency under real QA load is unmeasured. Task 0's 20-run local baseline is a floor. QA carries webhook traffic, Horizon queues, cron jobs — advisory lock contention + real Horizon row-locks could push p95 higher. Week 1 post-armament, Task 11's telemetry widget should flag if observed p95 > local p95 × 2 ("QA is significantly slower than predicted — investigate lock contention / Horizon drain").
Automated state-drift detector beyond reviewer tagging. Success metric #1 relies on reviewers explicitly tagging state-drift. An independent check — weekly cron that compares current QA users count against baseline (should be exactly 11 + any ephemeral users created since the most recent reset, all with created_at > last-reset-time) — catches drift that slips past reviewers. Deferred to post-launch as a Task 12 extension.
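The drift rule described above (exactly the 11 personas, plus ephemeral users that were all created after the most recent reset) is simple enough to sketch. The persona IDs are quoted from personas.php; the row shape `UserRow` and the function name `findDriftedUsers` are hypothetical, and how the cron job would obtain the rows and the last-reset timestamp is left open.

```typescript
interface UserRow {
  id: number;
  createdAt: Date;
}

// Persona IDs as listed in database/seeders/data/personas.php.
const PERSONA_IDS: readonly number[] = [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1010, 1011, 1012, 1013];

// Hypothetical detector: reports personas that are missing and non-persona
// users that predate the last reset (state that survived a reset it should not have).
function findDriftedUsers(
  users: UserRow[],
  lastResetAt: Date,
): { missingPersonas: number[]; staleUsers: number[] } {
  const present: { [id: number]: boolean } = {};
  for (const u of users) present[u.id] = true;

  const missingPersonas = PERSONA_IDS.filter((id) => !present[id]);
  const staleUsers = users
    .filter((u) => PERSONA_IDS.indexOf(u.id) === -1 && u.createdAt.getTime() <= lastResetAt.getTime())
    .map((u) => u.id);

  return { missingPersonas, staleUsers };
}
```

Both lists being empty corresponds to the "11 + ephemeral users created since the last reset" condition; anything else is a candidate state-drift alert.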

Task 10: Weekly orphan-image pruner for DO Spaces — split into separate plan (issue TBD)

Status (per v2 review): Task 10 is thematically adjacent but NOT on the determinism critical path. The 7 safeguards it requires (dry-run arming, count/percentage caps, soft-delete inclusion, primary-DB read, weekly Slack summary, adversarial probes, DO Spaces versioning precondition) plus a 3-week arming runbook amount to a 1-2 day sub-project that would pull this plan's center of gravity toward Spaces cleanup.

Decision: spin Task 10 out into its own plan/issue, owned by whoever picks up the Spaces-orphan work. This plan ships Phases A–D without Task 10; the orphan-image cleanup work happens in parallel or after.

Rationale: Ahmed's Concern #6 raised the safeguards in the context of this plan, but the safeguards ARE the plan for Task 10 — they're the whole surface area. Keeping Task 10 here compounds scope; splitting it addresses the concern at the right level (its own review cycle) without dragging this plan.

Issue to file: "Weekly DO Spaces orphan-image pruner — production-safe rollout" (references this plan's Appendix A.6 for the state class and Ahmed's 7 safeguards). All design details previously under Task 10 move to that new plan.

Design details (retained here for reference until split issue lands):

Expand original Task 10 design

Files:

Create: app/Console/Commands/PruneOrphanListingImages.php
Create: config/spaces_prune.php (dedicated config with hard bounds, never read from arbitrary env)
Modify: routes/console.php (schedule weekly, onOneServer() + withoutOverlapping())
Create: tests/Feature/PruneOrphanListingImagesTest.php (adversarial — empty-room, replica-lag, soft-deleted scenarios)

Why this task needs production-grade safeguards (Ahmed Concern #6): a bug in the orphan-detection join (missed relation, read-replica lag, soft-deleted rooms not considered) could delete live listing images at scale. Unlike Task 0c's Stripe helper (which operates on Stripe test-mode — low blast radius), this command operates on production DO Spaces — real user photos, not a test sandbox. The destructive-action surface is large enough that the command needs layered defenses before it's trusted to run unattended.

Seven safeguards:

Dry-run by default. SPACES_PRUNE_EXECUTE env defaults to false. Without the explicit opt-in, the command logs "WOULD DELETE: N objects" but calls no delete APIs. First 3 runs must be dry-run; the Slack summary shows what it would have deleted and an operator reviews before arming. Documented in docs/handoffs/spaces-prune-armament.md.
Hard deletion cap per run. config('spaces_prune.max_deletions_per_run') = 100 (hard-coded default in config/spaces_prune.php, not env-driven). If the orphan count exceeds the cap, the command aborts and posts to #qa: "REFUSED: N orphans detected, cap is M. Investigate before raising the cap." Legitimate orphan volume from a healthy smoke cycle is <20/week.
Percentage cap. Refuse to delete more than 5% of total objects under listings/* in a single run. Catches the failure mode where the room set is empty (e.g., test database accidentally used as source) and everything looks orphaned. Both the count cap AND the percentage cap must be under their limits; either alone is insufficient.
Include soft-deleted rooms in the "still in use" set. Room::withTrashed()->pluck('id') — a soft-deleted listing may be restored and still need its images. An orphan is ONLY an S3 object whose room ID (parsed from path prefix) isn't in the combined active + soft-deleted set.
Force primary-DB read. Run the room-ID query on the write connection, for example via Eloquent's Room::onWriteConnection() or useWritePdo() on the query builder (or equivalent). Replica lag of even a few seconds could make freshly-created rooms look orphaned on the replica while their images are already in S3.
Weekly #qa summary. After every run (dry OR armed), post: "Spaces prune: N orphans detected, M deleted, K bytes freed, {dry-run|armed}." Silent destructive jobs rot — visibility keeps them honest. If no summary posts, something's wrong with the scheduled run.
Adversarial probe test. PruneOrphanListingImagesTest covers:
- test_aborts_when_rooms_table_is_empty — refuses to treat empty-table as "all orphaned"
- test_includes_soft_deleted_rooms_in_still_in_use_set — does not delete images for soft-deleted listings
- test_count_cap_halts_at_threshold — at 101 orphans with cap=100, aborts + posts to Slack
- test_percentage_cap_halts_at_threshold — at 5.1% of listings/* orphaned, aborts
- test_dry_run_makes_no_delete_calls — asserts zero S3 delete calls when SPACES_PRUNE_EXECUTE=false
- test_posts_slack_summary_every_run — both dry-run and armed runs post summaries
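The interaction between the count cap, the percentage cap, and the empty-table guard can be sketched as a single decision function. This is an illustrative TypeScript sketch, not the PHP command; `checkPruneCaps` and its parameter names are hypothetical, and the defaults (100 deletions, 5%) are the values quoted in the safeguards above. Integer arithmetic is used for the percentage so the 5% boundary is exact.

```typescript
type PruneDecision = { allowed: true } | { allowed: false; reason: string };

// Hypothetical decision helper: every guard must pass before any delete runs.
function checkPruneCaps(
  orphanCount: number,
  totalObjects: number, // objects under listings/*
  knownRoomCount: number, // active + soft-deleted rooms
  maxDeletions: number = 100,
  maxPercent: number = 5,
): PruneDecision {
  if (knownRoomCount === 0) {
    return { allowed: false, reason: "rooms table is empty; refusing to treat everything as orphaned" };
  }
  if (orphanCount > maxDeletions) {
    return { allowed: false, reason: "REFUSED: " + orphanCount + " orphans detected, cap is " + maxDeletions };
  }
  // orphanCount / totalObjects > maxPercent / 100, without floating point.
  if (orphanCount * 100 > totalObjects * maxPercent) {
    return { allowed: false, reason: "REFUSED: orphans exceed " + maxPercent + "% of listings/* objects" };
  }
  return { allowed: true };
}
```

For example, 51 orphans out of 1,000 objects passes the count cap but trips the percentage cap, which is the case the count cap alone would miss on a small bucket.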

Arming sequence (runbook, not code):

Week 1: ships with SPACES_PRUNE_EXECUTE=false. Scheduled run posts weekly summary. Operator reviews the "would delete" counts for 3 consecutive weeks.
Week 4+: if summaries are sane (small orphan counts matching expected smoke churn, no false positives for soft-deleted rooms), operator sets SPACES_PRUNE_EXECUTE=true. Summary continues weekly.
Any week the count exceeds the cap: command aborts + #qa alert. Operator investigates before next run.

Rollback: SPACES_PRUNE_EXECUTE=false instantly reverts to dry-run. No data recovery needed in normal operation. If catastrophic delete ever happens (shouldn't — count+percentage caps prevent it): DO Spaces supports versioning + 30-day recovery if versioning was enabled. Arming precondition: verify DO Spaces versioning is enabled on the bucket before switching to SPACES_PRUNE_EXECUTE=true.

Commit: feat(cleanup): weekly Spaces orphan pruner with dry-run, count cap, percentage cap, and weekly Slack summary

Task 11: Reset-duration guardrail + `state-drift` disposition vocabulary

Files:

Modify: ~/dev-server/api/smoke.js (add /api/smoke/reset-telemetry endpoint)
Modify: tests/playwright/qa-smoke/dashboard-golden.html (p95 widget + state-drift counter)
Modify: .claude/skills/smoke-feedback-review/SKILL.md (add state-drift disposition)
Modify: the smoke-feedback dashboard UI (~/dev-server/sites/smoketest/... or labs equivalent) to accept the new tag

Deliverable 1 — reset telemetry: smoke runner POSTs {durationSec, env, runKey} after each reset; widget shows p50/p95/p99 over last 10 runs; alert posts to #qa if p95 > 30s.

Deliverable 2 — state-drift disposition: add a new disposition option alongside existing confirmed / test-needs-fix / not-a-bug. Reviewer documentation: "Use state-drift when the finding exists because prior-run state leaked into this run (not a test-logic bug, not an app bug). Examples: demo user still fraud-flagged from spec 34; listing.json owned by a prior run's ephemeral lister; subscription left past_due from spec 18." Dashboard counts state-drift per run and exposes the metric consumed by Success Metric #1.

Deliverable 3 — historical backfill: Task 11 retrospectively codes the last 4 smoke audits (2026-04-07, 2026-04-14, 2026-04-21, plus one post-Phase-A run) with dispositions so the baseline has a distribution, not a single point. Adds the distribution to docs/handoffs/2026-04-24-smoke-reset-baseline.md.

Commit: feat(smoke-dashboard): reset-p95 widget + state-drift disposition + historical baseline

Task 12: Monthly gate-audit review

Add to /post-deploy-verify skill: a monthly check of access logs for /testing/* endpoints against staging/prod hostnames. Any hit → P1 alert.
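
As a rough sketch of what the monthly check could look like, assuming access logs have already been parsed into structured entries: `AccessLogEntry`, `findGateViolations`, and the hostnames in the usage note are all hypothetical, and the real log format and hostnames are not specified in this plan.

```typescript
// Hypothetical sketch: flags any request to /testing/* on a protected hostname.
interface AccessLogEntry {
  timestamp: string; // e.g. ISO 8601, as emitted by whatever log parser is used
  host: string;
  path: string;
}

function findGateViolations(
  entries: AccessLogEntry[],
  protectedHosts: string[]
): AccessLogEntry[] {
  const hosts = protectedHosts.map((h) => h.toLowerCase());
  return entries.filter((entry) => {
    const onProtectedHost = hosts.indexOf(entry.host.toLowerCase()) !== -1;
    // Match the /testing prefix exactly, not lookalikes such as /testing-notes.
    const isTestingPath = entry.path === "/testing" || entry.path.indexOf("/testing/") === 0;
    return onProtectedHost && isTestingPath;
  });
}
```

Any non-empty result from something like `findGateViolations(entries, ["staging.example.com", "www.example.com"])` would map to the P1 alert described above.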

Commit: chore(audit): monthly gate-audit for testing endpoints

Phase ordering & dependencies

Phase A (infra) — Tasks 0, 0b, 0c, 1, 2, 3
  ↓ (measurement → CI checks → Stripe helper → gate → personas → gate-probe)
  ↓ Task 0 delivers before Task 1 starts (metric baselines)
  ↓ PR #3722 (pg_dump backup) MUST land before Phase B (catastrophic-loss rollback precondition)
Phase B (runner) — Tasks 4, 4b, 4c
  ↓ (Playwright setup project + cut-over announcement + reset-free-window escape hatch)
Phase C (specs) — Tasks 5, 6, 7, 8, 9 (can parallelize; each a separate PR)
  ↓ (specs rely on new determinism)
Phase D (observability) — Tasks 11, 12 (can run concurrently with Phase C; Task 10 split into a separate follow-up plan — see Task 10 placeholder)
Phase E (follow-up, conditional) — nightly snapshot approach from Alternative 7
  ↓ (triggered if Phase A-D reset p95 exceeds 30s for 2 consecutive weeks)
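
The Phase E trigger above reduces to a streak check over weekly p95 values. A minimal sketch, assuming one p95 figure per week in chronological order; `shouldTriggerPhaseE` is a hypothetical name, and only the 30s threshold and the 2-week streak come from the plan.

```typescript
// Hypothetical sketch: true once reset p95 exceeds the threshold for N consecutive weeks.
function shouldTriggerPhaseE(
  weeklyP95Sec: number[],
  thresholdSec: number = 30,
  consecutiveWeeks: number = 2
): boolean {
  let streak = 0;
  for (const p95 of weeklyP95Sec) {
    // A week at or under the threshold resets the streak.
    streak = p95 > thresholdSec ? streak + 1 : 0;
    if (streak >= consecutiveWeeks) {
      return true;
    }
  }
  return false;
}
```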

Branch Strategy

Feature branch: feature/smoke-deterministic-reset (from master)
Sub-branches:
- feature/smoke-deterministic-reset/task-0-baseline — Phase A Task 0 (measurement day, docs-only)
- feature/smoke-deterministic-reset/task-0b-ci — Phase A Task 0b (seeder-migration CI)
- feature/smoke-deterministic-reset/task-0c-stripe — Phase A Task 0c (Stripe reset helper)
- feature/smoke-deterministic-reset/task-1-gate — Phase A Task 1 (triple-gate + CSRF + mutex)
- feature/smoke-deterministic-reset/task-2-personas — Phase A Task 2 (persona catalog)
- feature/smoke-deterministic-reset/task-3-probe — Phase A Task 3 (gate-bypass adversarial probe)
- feature/smoke-deterministic-reset/task-4-runner — Phase B Task 4 (Playwright setup project)
- feature/smoke-deterministic-reset/task-4b-cutover — Phase B Task 4b (announcement)
- feature/smoke-deterministic-reset/task-4c-window — Phase B Task 4c (reset-free window)
- Phase C sub-branches: one per spec cluster (5-9)
- Phase D sub-branches: one per observability task (11-12; Task 10 shipped via its own follow-up plan on a separate branch)

Merge order (hard preconditions):

Task 0 (baseline) must land before Task 1 — metrics need real numbers.
PR #3722 (pg_dump backup) must land before Phase B — catastrophic-loss rollback.
Task 0c (Stripe) merges independently of gate work; no ordering constraint with Task 1.
Tasks 1, 2, 3 must all pass CI before Phase B lands.
Task 4b (cut-over announcement) posts ≥24h before Task 4 merges.
Phase C can start once Phase B is on staging and one clean run has been observed.

Orchestrator: main session on feature/smoke-deterministic-reset manages sub-branch merges.

Each sub-branch passes /iterative-review before merging into the feature branch. Feature branch passes one final /ready-for-review cycle before PR to master.

majones919 commented Apr 23, 2026

Confirmed review.

A lot of the current smoke tests I'm re-testing are a result of the QA environment constantly changing, so this in theory will help reduce the extra noise added by those false positives.

Only slight concern: the ability to pause the 4h reset if we're in the middle of a manual review. Not all of our QA time frames align, and they can expand over meetings and support calls or bleed into the next day. But I think it's mostly doable.

AhmedEssamElNaggar commented Apr 23, 2026

I don't think this is correct. Every time we push something to QA, we create a fresh copy of the master branch and pick the feature or bugfix branches that we need to deploy to prod, then push the new deployment branch to QA, so QA always gets reset. The whole point of this doc is based on an incorrect assumption.

mahmoudessam7 commented Apr 23, 2026

Reviewed. Sounds like a good, well-thought-out plan.

gsingal commented Apr 23, 2026

Author

@ApxSnowflake — I want to make sure we're lined up on what state persists vs resets on deploy. I queried QA DB directly just now (2026-04-23, a few hours after the a0e61717e deploy at 18:02 UTC):

Evidence of cross-run state persistence on QA:

| Query | Result | Would look like this if deploys reset DB |
|---|---|---|
| Total users in `users` table | 96,111 (oldest 2020-02-12, newest today) | ~7 baseline users |
| Ephemeral `qa-smoke` / `qa-lister` / `qa-traveler*` users | 42 accounts, oldest from 2026-03-30 | 0 (ephemeral pattern would only exist mid-run) |
| `qa-smoke` users created on 2026-04-11 alone | 19 accounts still present 12 days later | 0 |
| `demo@rotatingroom.com` | Single row, `created_at = 2023-03-14` | Row would have today's `created_at` |
| Orphan rooms (owner not in baseline 7) | 44,695 rooms | 0 |
| `email_verification_codes` rows | 2,555 rows, oldest 2025-09-08 | 0 |
| Users with `email_verification_failed_attempts > 0` | 75 users in locked/semi-locked state | 0 |
| `subscriptions` table | 16,858 rows from 2020-02-13 to today | Baseline only |

I think the confusion is that code deploys and DB resets are separate. Forge's deploy script runs php artisan migrate (additive — applies new migrations, preserves existing rows) rather than php artisan migrate:fresh --seed (destructive — drops + reseeds). The deployment branch you build contains new code, but QA's database has been continuously accumulating since 2020.

So the plan's foundational premise (state persists across runs) holds — and the failure modes we've been patching (PR #4283 ownership guard for stale listing.json, PR #4279 admin unflag for is_fraudulent, silent-skip cascades in .edu specs) are all downstream of this.

Does this match what you see on your side? Happy to hop on a call or share a terminal if you want to reproduce the queries.

gsingal commented Apr 23, 2026

Author

@majones919 — great point on the 4-hour cap, and honestly the whole auto-resuming window is over-engineered now that I think about it.

Simpler design: reset only happens when someone runs /qa-smoke. No scheduled window, no auto-resume, no cap to calibrate. If you're mid-review and I kick off a smoke run, the smoke skill now shows a pre-reset check (e.g., "there's an active Backpack admin session on QA — continue and wipe it, or hold?"). You and I coordinate like we would for any other shared-resource action.

This removes the whole "window expiry" class of problem you flagged (bleed into next day, expanding calls, meetings). The tradeoff is smoke runs need a one-line #qa ping before kicking off — but we already do that in practice, and it's a smaller cost than engineering an extendable timed window.

I'll fold this into the plan (Task 4c gets replaced with a "pre-run coordination check in /qa-smoke" instead of a 4-hour scheduled window). Let me know if that works.

AhmedEssamElNaggar commented Apr 24, 2026

I see your point. If we're talking about the database, then yes, we don't reset it often.

AhmedEssamElNaggar commented Apr 24, 2026

Solid plan — it attacks the actual root cause instead of the symptoms, reuses existing infrastructure, and the multi-layer gate plus phased rollout show the author respects how catastrophic a misfire against prod would be.

AhmedEssamElNaggar commented Apr 24, 2026

One hard invariant worth adding to Task 0c: StripeTestReset should reject unsafe patterns at construction time. If QA_STRIPE_EMAIL_PATTERN is null, empty, *, *@*, or anything that doesn't contain the hard-coded qa-smoke- prefix, the helper throws and the reset endpoint returns 500; it never silently treats the pattern as "match all." Otherwise a forgotten env var on first deploy could wipe the entire Stripe test-mode environment (manual QA, dev experiments, other harnesses). Task 3's adversarial probe should include null/empty/wildcard pattern cases.
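
A minimal sketch of that construction-time check, written in TypeScript purely for illustration (the real helper would live in the PHP codebase). `validateStripeEmailPattern`, `REQUIRED_PREFIX`, and `FORBIDDEN_PATTERNS` are hypothetical names; the rejected values are the ones listed above.

```typescript
// Hypothetical sketch: fail closed on any pattern that could match every customer.
const REQUIRED_PREFIX = "qa-smoke-";
const FORBIDDEN_PATTERNS = ["*", "*@*"];

function validateStripeEmailPattern(pattern: string | null | undefined): string {
  if (pattern === null || pattern === undefined) {
    throw new Error("QA_STRIPE_EMAIL_PATTERN is not set");
  }
  const trimmed = pattern.trim();
  if (trimmed === "") {
    throw new Error("QA_STRIPE_EMAIL_PATTERN is empty");
  }
  if (FORBIDDEN_PATTERNS.indexOf(trimmed) !== -1) {
    throw new Error("QA_STRIPE_EMAIL_PATTERN is a match-all wildcard: " + trimmed);
  }
  // Anything without the hard-coded prefix is rejected, never widened.
  if (trimmed.indexOf(REQUIRED_PREFIX) === -1) {
    throw new Error("QA_STRIPE_EMAIL_PATTERN must contain " + REQUIRED_PREFIX);
  }
  return trimmed;
}
```

Because every failure path throws rather than returning a default, a missing env var surfaces as an error at the reset endpoint instead of a broad delete.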

AhmedEssamElNaggar commented Apr 24, 2026

Concern on Task 1: TestingController::reset() holds a Postgres advisory lock around the entire body — truncate, reseed, Mailpit clear, Stripe API calls, cache/queue flush. Stripe's API has a long tail (occasional multi-second hangs); if a Stripe call stalls, the advisory lock stays held until PHP's request timeout fires, blocking every subsequent smoke run queued behind it. The 30s p95 target assumes Stripe behaves — one bad roundtrip breaks the budget and cascades. Suggest either (a) strict per-step timeout on the Stripe call (e.g. 5s hard cap, fail-open with a logged warning), or (b) move Stripe cleanup outside the advisory lock so DB reset can release the lock independently.

AhmedEssamElNaggar commented Apr 24, 2026

Concern linking Task 0c + Task 2: the Stripe reset helper only deletes customers matching QA_STRIPE_EMAIL_PATTERN (e.g. qa-smoke-*), but the plan never pins that the seeded personas (Users #1–#13 in the persona catalog) use emails matching that prefix. If they don't, their Stripe customers get created by specs 18/35 and never cleaned up, accumulating across runs — re-creating the exact Stripe-side drift the plan is trying to eliminate. Suggest: (a) require all seeded personas in Task 2 to use the qa-smoke- prefix, and (b) extend Task 3's adversarial probe to assert every persona email matches the pattern that StripeTestReset will purge.

AhmedEssamElNaggar commented Apr 24, 2026

Concern on Task 10 (DO Spaces orphan pruner): this task destructively acts on live production S3 storage based on a DB join, but gets one task line with no safeguards — unlike the Stripe helper in Task 0c, which correctly has pattern validation, quota limits, and fail-open behavior. A bug in the join (missed relation, read-replica lag, soft-deleted rooms not considered) could delete live listing images at scale.

Suggest mirroring the Stripe helper's safeguards:

Dry-run by default — log-only until SPACES_PRUNE_EXECUTE=1 is set; compare "would delete" output against reality for the first few runs before arming.
Hard deletion cap per run (e.g. 100 objects) — abort and alert #qa if exceeded; legitimate weekly orphan volume is small.
Percentage cap — refuse to delete more than ~5% of objects under listings/* in a single run; catches the "empty room set → everything looks orphaned" failure mode.
Include soft-deleted rooms in the join — soft-deleted listings may be restored and still need their images.
Force primary DB read — replica lag could make freshly-created rooms look orphaned.
Weekly summary to #qa with deletion counts — silent destructive jobs rot; visibility keeps them honest.
Adversarial probe (Task 3) — add an empty-rooms scenario to verify the pruner refuses rather than nukes everything.

gsingal commented Apr 25, 2026

Author

v2 plan posted — all of yesterday's feedback addressed + 1 round of additional iterative review (3 Claude rounds + Codex challenge + expert shock test).

Quick map of what changed since the version you reviewed:

Addressing @ApxSnowflake's concerns:

✅ StripeTestReset now uses a literal allowlist (4 hard-coded patterns), not a regex-blacklist. Adding a new pattern requires editing the file → forces code review. Strictly safer than the original regex-validation design.
✅ Stripe API calls inside reset() get a 5s hard timeout + fail-open; partial-success result returned. The advisory lock no longer can be held by a Stripe tail-latency hang.
✅ Persona emails now all have qa- prefix (qa-edu-locked@example.edu, qa-blocked@rotatingroom.com, etc.) — matches the Stripe purge pattern. New PersonaEmailPurgePatternTest enforces the invariant.
✅ Task 10 (DO Spaces orphan pruner) split out into its own follow-up plan — your 7 safeguards are correct but they ARE the plan for that work; keeping it here was scope creep. This plan ships Phases A–D without Task 10.

Addressing @majones919's feedback:

✅ The 4-hour reset-free window is gone. Replaced with pre-run active-session coordination: when someone runs /qa-smoke, the skill checks for an active Backpack admin session first and prompts the operator to continue/abort. No timed window, no expiry, no auto-resume — coordinate at the moment it matters.

New from @gsingal direction:

✅ Persona naming convention: every email is self-documenting (qa-edu-locked@example.edu, qa-blocked@rotatingroom.com) so a developer reads the email and knows the state.
✅ Shared password: every persona uses RR4Life! (same as team's prototype/staging QA credentials) — no separate creds to look up.
✅ docs/QA_PERSONAS.md cheat-sheet linked from CLAUDE.md, smoke dashboard header, #qa topic.

Found in the additional review rounds (would have shipped with bugs without these):

5 new risks added covering shared-password leak, persona-email collision with prod, Stripe timeout drift, cap false-positives, concurrent-operator queueing
config/testing.php was referenced everywhere but never explicitly created in any task's Files list — fixed
Task 4 was POSTing to QA reset on every smoke run including /qa-smoke staging (would have wiped QA when running against staging) — now per-env gated
Task 4 retry-on-423 logic so concurrent operators queue instead of abort
Post-seed state verification test (catches seeder typos before they reach QA)

Final ratings:

Iterative review: 7.5 → 8.5 → 9.5/10 → APPROVE
Codex adversarial challenge: 3 P1/P2 found and fixed (per-env gate, TS persona mirror, allowlist test matrix)
Expert shock test: 1 addition (post-seed verification), 2 deferrals to Phase D

Ready to start Phase A (Task 0 = measurement day, no code changes — safe first step). Estimated full execution: 2–3 weeks across A→B→C→D.

Plan branch: docs/smoke-reset-plan. Updated gist above is the canonical v2-final.
