PR #2148 — Records-page perf refactor: plan + measured baseline (2026-05-12)
title Records-query perf — Phase 0 baseline measurements
type baseline
domain records-table
created 2026-05-12
owner florian
target_workspace ws=08cdaa0c-6a2c-4d1e-95a2-e08a9fbb9f3e (pk=510 "Wellapp test max", 1,245 active companies)
env local Docker stack — Hasura v2.x on PostgreSQL 17.9 (staging-ISO dump restored 2026-05-12 00:15Z)
related_plan docs/features/records-table/records-query-perf-refactor-plan-2026-05-12.md

Phase 0 baseline — measured numbers

TL;DR

The 8s /v1/records/query is NOT a Hasura problem. Hasura itself runs the records.companies GraphQL in 2.1s when called directly. The remaining ~5.8s is API-side overhead — specifically, traced-http.helper.ts logging the Hasura response body through a sanitize-then-truncate pipeline that re-walks and re-stringifies the ~100KB payload 3–4 times per call.

Numbers

End-to-end (as observed in production-shape API logs)

Layer                                                       Time
Express middleware / auth                                   ~30ms
resolveWorkspaceScope (parallel)                            ~22ms
DataViewsService.query → Hasura (measured by API client)    7,915ms
JSON response serialize + send                              ~80ms
────────────────────────────────────────────────────────────────────
Total observed                                              8,024ms

Hasura, called directly via curl

Same GraphQL body, same variables, same workspace, same row count. Bypasses the API.

baseline (all relations)     min=2,134ms   avg=2,322ms   bytes=113,176
warm hit after 19 variants   min=2,411ms   (negligible cache effect)

Gap = ~5,800ms in the API process between axios.request() returning and res.json() flushing.

Per-relation cost (Hasura side) — Δmin from drop

company_people              246ms   (people=163k rows, indexed)
invoices                    228ms   (60k rows)
relations                   209ms   (company_relations — small, but issuer+receiver joins)
workspace                   172ms   (computed-field per row via 'workspace.object')
payment_means               170ms
sourceWorkspaceConnector    168ms
accounts                    162ms
web_links                   105ms   (16KB payload, largest by bytes)
company_locations            80ms
company_phones               77ms
─────────────────────────────────
Sum top 10 (of 19)        1,617ms   ≈ 76% of 2,134ms baseline

Smaller contributors: cards 62ms, subscriptions 62ms, company_financial 60ms, object (computed) 53ms, checks 47ms, journal_entry_lines 38ms, targetWorkspaceConnector 35ms, primary_contact 17ms, company_emails 8ms.

Note: checks / journal_entry_lines / company_categories are empty tables in this staging-ISO snapshot; their cost is the empty-aggregate roundtrip only (~40–70ms each). On a populated workspace these costs would grow.

Response-bytes by relation (Δbytes from drop)

web_links              16,047 bytes
object (computed)      11,819 bytes    (per-row jsonb {pk, name, domain, logo_url, ...})
workspace              11,550 bytes    (workspace.object computed per row)
company_locations       8,789 bytes
company_emails          7,458 bytes
company_media           5,950 bytes
company_phones          4,553 bytes
primary_contact         4,187 bytes
company_people          2,595 bytes

Full response ≈ 113KB.

Plan-cache cold-vs-warm

first call    2,134ms
last call     2,411ms   (after 19 different variants — plan cache likely cold-by-shape)

The plan cache is not a meaningful contributor locally. Discard Phase 4 (cache warmup) from the refactor plan.


Root cause of the 5.8s gap — traced-http.helper.ts

The API's outbound HTTP client (used for every Hasura call) logs request + response bodies via Winston. For each call:

logIncoming(service, status, url, duration, response.data, response.headers, ...)
//   → sanitize(responseBody)        // recursive walk over entire nested object
//   → safeSerialize(...)            // JSON.stringify pass #1 (try/catch test)
//   → truncateBody(...)             // JSON.stringify pass #2 (length check)
//   → winston format/print          // JSON.stringify pass #3 (console transport)

sanitize() walks every property of every nested object in the response looking for secret|password|token|api_key|apikey|authorization|access_token|client_secret. For a 50-row × 20-relation × 2-3-level-deep response, this is tens of thousands of recursive function calls + property descriptor lookups. V8 doesn't optimize this well — it scales superlinearly with depth.
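
For reference, a minimal sketch of the walk as described (a reconstruction, not the actual source — names and exact behaviour may differ):

const SECRET_KEY_RE = /secret|password|token|api_key|apikey|authorization|access_token|client_secret/i;

// Illustrative reconstruction of the sanitize() hot path.
function sanitize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(sanitize);            // one call per element
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, child] of Object.entries(value)) {            // one pass per nested object
      out[key] = SECRET_KEY_RE.test(key) ? "[REDACTED]" : sanitize(child);
    }
    return out;
  }
  return value;
}
// 50 rows × ~20 relations × 2–3 levels of nesting means tens of thousands of
// recursive calls plus a full deep copy of the ~100KB response, per log line.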

sanitize runs before truncateBody, so the full ~100KB object is walked even though the log line is then truncated to 2,000 characters; nearly all of the sanitize work is thrown away.

The Winston console transport then re-stringifies the result.

Measured contribution: ~5.8s for a single ~110KB response. Reproduces on every records query.

The same pattern applies to every other outbound call (Plaid, Attio, etc.), but those return <10KB, so the overhead is invisible. Only the Hasura records query trips it.


Implications for the refactor plan

The original plan assumed Hasura was the bottleneck. Reality:

Phase 0 — measure
    original: gate for the rest        revised: DONE — this doc
Phase 0.5 — fix traced-http.helper.ts
    original: not in the plan          revised: NEW, highest priority. ~5.8s win locally, ~0 in prod
Phase 1 — view-driven query budget
    original: biggest win              revised: still high value. ~1s win on the Hasura side by dropping unused relations
Phase 2 — collapse heavy relations into computed_fields
    original: high                     revised: medium. company_people→people→media is 246ms — measurable but not catastrophic. Worth it if col-18 ships the primary_contact swap regardless
Phase 3 — per-relation limit: 5
    original: medium                   revised: low. Sum of array-relation bytes ≈ 50KB; capping at 5 saves maybe 20–30KB on populated workspaces. Bigger gains in cell-level paint than total request time
Phase 4 — Hasura plan-cache warmup
    original: conditional              revised: drop entirely. Cache cold/warm delta is in the noise

Phase 0.5 — proposed fix shape (smallest single-file change)

apps/api/src/helpers/traced-http.helper.ts:

  1. Stringify the body first with a size cap, before sanitize():

    function truncateBody(body: unknown): unknown {
      if (body == null) return body;
      const str = typeof body === "string" ? body : JSON.stringify(body);
      if (str.length > MAX_BODY_LOG_LENGTH) {
        return str.substring(0, MAX_BODY_LOG_LENGTH) + `... [truncated, total ${str.length} chars]`;
      }
      return body;
    }

    Then in logIncoming:

    // Truncate FIRST; only sanitize if the body is small enough to be worth logging in full
    const truncated = truncateBody(responseBody);
    const final = typeof truncated === "string"
      ? truncated                                             // already a slice; nothing to sanitize
      : safeSerialize(sanitize(truncated as Record<string, unknown>), "body");

    Same in logOutgoing.

  2. OR add an environment-driven body-log threshold:

    const HTTP_LOG_BODY_MAX_BYTES = parseInt(process.env.HTTP_LOG_BODY_MAX_BYTES ?? "10000", 10);

    When the body length exceeds the threshold, log only { bytes: <n>, truncated: true } — never sanitize or serialize. Default 10KB in dev, ~unlimited in prod (where Cloud Logging is async).

Either change is ~15 LoC. Both keep prod logging intact (small bodies still get full sanitized log entries).

Acceptance for Phase 0.5

  • /v1/records/query p50 drops from ~8s to ~2.5s on local Docker stack (matching direct-Hasura baseline + small overhead)
  • Sensitive-key redaction still works for small bodies (Plaid token, Stripe webhook, etc.)
  • Prod log volume / shape unchanged for typical responses

Phase 0.5 result — measured 2026-05-12

Shipped in apps/api/src/helpers/traced-http.helper.ts (single file, ~30 LoC added):

  • New prepareBodyForLog() helper. Single JSON.stringify size probe; bodies over HTTP_LOG_BODY_MAX_BYTES (default 10,000) log as { bytes: N, omitted: true }. (Reconstruction sketched below.)
  • Replaces inline truncateBody(sanitize(...)) in both logOutgoing and logIncoming.
  • safeSerialize no longer called in the inbound path (kept in tracedSdkCall only).
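
For reference, a plausible shape for the helper, reconstructed from the bullets above (the merged implementation may differ; sanitize is the existing redaction helper):

const HTTP_LOG_BODY_MAX_BYTES = parseInt(process.env.HTTP_LOG_BODY_MAX_BYTES ?? "10000", 10);

// Reconstruction from the description above, not the merged source.
function prepareBodyForLog(body: unknown): unknown {
  if (body == null) return body;
  const probe = typeof body === "string" ? body : JSON.stringify(body);  // the single size probe
  if (probe.length > HTTP_LOG_BODY_MAX_BYTES) {
    return { bytes: probe.length, omitted: true };   // oversized: log the shape only
  }
  if (typeof body === "string") return body;         // small string body, no keys to redact
  return sanitize(body as Record<string, unknown>);  // small bodies keep full redaction
}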

3× curl /v1/records/query post-fix, same workspace (d8e8e830), same column set, same limit 50:

run 1   2,376ms   176,705 bytes downloaded
run 2   2,360ms   176,705
run 3   2,266ms   176,705

API log per-request:

⬇️ 200 hasura http://localhost:8081/v1/graphql  (2,282ms)
⬅️ POST /v1/records/query 200                   (2,340ms)

Delta: 8,024ms → 2,340ms — 71% reduction, 5,684ms saved per records query.

API overhead now ~58ms (was ~109ms). The 2,282ms Hasura call matches the direct-curl baseline (2,134ms ± noise — small drift from cold connection / different column set).

No behavioural regression observed. Records page renders identically; F-07 scope chip still resolves; PR-2132 +N navigation still works.


Files / artefacts

  • Profile script: /tmp/profile-records-query.mjs (saved this session; promote to scripts/profile-records-query.mjs in PR-R1)
  • Full tmux buffer with captured GraphQL bodies: /tmp/well-dev-buffer.txt (this session only)
  • Migration applied this session: apps/api/src/database/migrations/Migration20260512015500_records_page_sort_indexes.ts — already applied to the developer's local DB; ready to ship as a separate small PR

Open questions

  1. Does prod actually log Hasura response bodies, or does Winston route through @google-cloud/logging-winston (async, batched)? If prod is async, Phase 0.5 only helps local dev. Action: check config/logger.ts and confirm.
  2. Was this overhead worse before PR #2113 added +28 -4 lines to traced-http.helper.ts? Quick git log -p check on the file.
  3. Should the env knob default differ between dev and prod, or just use one knob driven by NODE_ENV?
──────────────────────────────────────────────────────────────────

title Records-query perf refactor — Companies first
type plan
domain records-table
created 2026-05-12
owner florian
authoring_source live profiling on local staging-ISO DB
gate post-alpha — get /records/companies to "feels instant" on the alpha workspace
related_docs
docs/features/records-table/horizontal-companies-cleanup.md
docs/features/records-table/qa-companies-2026-05-06.md
docs/features/records-table/qa-fix-plan-2026-05-07.md
related_prs_shipped
#2113 — F-06 + col 4 computed + display_type + view backfill (backend)
#2114 — EntityObject + LIST_OF_COMPOSITE + sort fallback (frontend)
#2131 — Hasura artifact regen
#2132 — +N overflow → filtered list nav
#2133 — spec-conformance drift test
#2134 — F-07 WorkspaceScopeChip
#2135 — TZ-aware datetime
#2142 — responsive list-of-composite maxVisible
#2146 — workspace-scope null/copy fix
locked_decisions
Junction direction (per-entity `{entity}_workspace_connectors`, Bastien lock from PR #2030)
Backend-owns editability via `overrides.yml editable`
logo.dev token hardcoded
Wave C composite editability deferred (locked by spec-conformance invariant test)

Records-query perf refactor — plan

This plan refactors the read path of /records/companies (and by extension /records/<root>) so first-paint feels instant on the alpha workspace. No schema changes. No junction-direction changes. No new cell components. All work is in the GraphQL shape, the API query-builder, and the data-layer hooks.


0. Baseline (measured 2026-05-12 on local staging-ISO DB)

Single POST /v1/records/query for ws=08cdaa0c-… (pk=510 "Wellapp test max", 1,245 active companies):

Layer                                      Time
Express middleware / auth                  ~30ms
resolveWorkspaceScope (parallel)           ~22ms
DataViewsService.query → Hasura            7,915ms
JSON response serialize / send             ~80ms
──────────────────────────────────────────────────
Total                                      8,024ms

Hasura request body (decoded from well-dev tmux):

  • One GraphQL query, 34 columns, 50 row limit
  • ~20 array_relations per row: invoices, accounts, payment_means, company_people→people→media, company_workspace_connectors→workspace_connector→connector, cards, company_categories→category, checks, company_emails→email, journal_entry_lines, company_locations→location, company_media→media, company_phones→phone, relations→target_company, subscriptions, + a few smaller
  • 7 computed_fields per row: object, primary_contact, primary_location, primary_media, targetWorkspaceConnector, invoices_unified, related_companies
  • Total leaf field fetches per row ≈ 60+. With 50 rows ≈ 3,000+ leaf fetches per request.

The companies sort path is already an Index Scan using idx_companies_workspace_created_active (0.10ms, added by Migration20260512015500_records_page_sort_indexes). Sort is not the bottleneck.

The cost is in Hasura's jsonb_agg cascade over the array_relations and the resulting JSON serialization. Local Docker Hasura and a cold plan cache make it worse than production, but the shape of the cost is the same order of magnitude regardless.


1. North star

Metric                                                Today      Target
/v1/records/query p50 (50 rows, ws=510 local)         8.0s       < 1.5s
/v1/records/query p95 (cold)                          ~10s       < 2.5s
/v1/records/field-preview p50 (cold)                  ~12s       < 3.0s
Records page visible cell content (LCP-equivalent)    n/a        < 2.0s local, < 4.0s staging cold
Spec-conformance test (PR #2133)                      214/214    214/214 (unchanged)
F-01..F-07 alpha signals                              green      green (unchanged)

Non-goals:

  • No schema migrations (other than purely additive computed-field functions if used)
  • No change to Hasura junction direction
  • No new cell components
  • No removal of any spec'd column

2. Phase 0 — measure (no code change)

Goal: replace hypotheses with per-relation numbers so phases 1–3 are evidence-driven.

Steps (command / artefact; a script sketch follows the list):

  • Capture exact GraphQL body: tmux capture-pane -t well-dev -p -S -3000 | grep -A 1 'graphql.*workspace_id' | ... (already captured in /tmp/records-query.graphql during this session)
  • Run with Hasura explain: curl -X POST http://localhost:8081/v1/graphql/explain -H "x-hasura-admin-secret: dev_admin_secret_123" -d @body.json
  • Per-relation timing: curl http://localhost:8081/v1/graphql -d @body.json -w "%{time_total}" × 3, then drop each array_relation one at a time
  • JSON response size: curl ... -o /dev/null -w "%{size_download}"
  • Plan-cache warm vs cold: hit the identical request twice and compare
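
The per-relation timing step is mechanical enough to script. A sketch of the drop-one loop in TypeScript (in the spirit of the scripts/profile-records-query.ts planned for PR-R1), assuming one pre-edited GraphQL body per variant under /tmp/variants/ — the file layout and names here are hypothetical, and the saved /tmp/profile-records-query.mjs may differ:

import { readFileSync, readdirSync } from "node:fs";

const HASURA_URL = "http://localhost:8081/v1/graphql";
const ADMIN_SECRET = "dev_admin_secret_123";

async function timeQuery(body: string): Promise<{ ms: number; bytes: number }> {
  const start = performance.now();
  const res = await fetch(HASURA_URL, {
    method: "POST",
    headers: { "content-type": "application/json", "x-hasura-admin-secret": ADMIN_SECRET },
    body,
  });
  const buf = await res.arrayBuffer(); // drain the body so timing covers serialization + transfer
  return { ms: performance.now() - start, bytes: buf.byteLength };
}

// /tmp/variants holds baseline.json plus one drop-<relation>.json per array_relation
for (const file of readdirSync("/tmp/variants")) {
  const body = readFileSync(`/tmp/variants/${file}`, "utf8");
  const runs = [await timeQuery(body), await timeQuery(body), await timeQuery(body)];
  const min = Math.min(...runs.map((r) => r.ms));
  console.log(file, `min=${min.toFixed(0)}ms`, `bytes=${runs[0].bytes}`);
}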

Deliverable: docs/features/records-table/records-query-perf-baseline-2026-05-12.md with:

  • Ordered list of array_relation cost contributions (ms + bytes)
  • Computed-field cost (each function isolated)
  • Plan-cache cold/warm delta
  • Output: top-3 cost contributors

Why this comes first: the rest of the plan is wrong if H2 ("company_people → people → media is the dominant cost") turns out to be false. Half the phases below depend on knowing which relation costs what.


3. Phase 1 — view-driven query budget

Problem: today the API/Hasura query asks for every relation referenced by any of the 34 spec'd columns, regardless of whether the user's active workspace_views.columns actually render them. The records page lets users hide columns, but the network query doesn't shrink.

Fix: the Hasura query builder must emit fragments for the active view's columns only, not for the entire COMPANIES_DEFAULT_FIELDS set.

Backend

  • apps/api/src/services/data-views/ query builder:

    • Take the activeView.columns[] (already in the request as fields[][]) and translate it to a minimal GraphQL selection set
    • Drop array_relations whose root is not referenced by any active column
    • Drop computed_fields whose .object is not referenced
    • Keep scalar columns always (cheap)
  • Backwards-compat: when the request omits fields entirely (e.g. legacy callsites, useRecordsFieldPreview warmup), fall back to today's full set behind a feature-flag-free env knob (DATA_VIEW_QUERY_BUDGET=on default in dev; on in prod once verified).

Frontend

  • The active view's columns[] is already the source of truth for what the table renders. Make sure the useRecordsInfiniteQuery request payload includes the resolved column list — it does today; verify the wiring still works after the backend reshape. A sketch of the backend pruning follows.
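
A sketch of the pruning rule under assumed names (ColumnSpec and buildSelectionRoots are hypothetical; the real data-views builder has its own types):

// Hypothetical shapes — the point is the pruning rule, not the exact API.
type ColumnSpec = {
  key: string;            // column id as it appears in the request's fields[][]
  relationRoot?: string;  // e.g. "company_people"; undefined for scalar columns
  computedField?: string; // e.g. "primary_contact"
};

function buildSelectionRoots(allColumns: ColumnSpec[], activeFields: string[]): Set<string> {
  const active = new Set(activeFields);
  const roots = new Set<string>();
  for (const col of allColumns) {
    if (!active.has(col.key)) continue;                  // hidden column: emit nothing for it
    if (col.relationRoot) roots.add(col.relationRoot);
    if (col.computedField) roots.add(col.computedField);
  }
  return roots;
}

The emitted GraphQL then includes only the array_relation / computed_field fragments whose root is in the returned set; scalar columns bypass the check entirely (always cheap, always kept).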

Acceptance

  • For a view with 12 visible columns (typical user state), the API request body's fields array has 12 entries.
  • Hasura GraphQL emitted contains only the array_relations / computed_fields needed by those 12.
  • Default view (34 columns) keeps today's full graph — no regression.
  • Spec-conformance test (PR #2133) still passes (it asserts on the canonical 34-column matrix, not on what is sent over the wire).

4. Phase 2 — collapse high-cost relations into computed_fields

Hypothesis to verify in phase 0: company_people → people → media is the dominant single relation. If true, replace it with a primary_contact computed_field that returns the jsonb { person_id, full_name, avatar_url } directly — same pattern as the existing company_primary_media / company_primary_contact SQL functions in Migration20251007135956.

Two angles, run only if phase 0 says they pay off:

4.1. primary_contact already exists — wire it

core_api.company_primary_contact(company_row, hasura_session) is already defined and exposed as companies.primary_contact (computed_field). The records-table currently fetches company_people → people → media for the visible chip in col 18, which is the same job.

  • Switch col 18 rendering to consume primary_contact instead of the array_relation
  • composite_avatar_fullname cell reads data.companies[*].primary_contact.{full_name,avatar_url} instead of walking company_people[0].people.full_name + .people.media.url (read swap sketched below)
  • The full people-list popover (+N chip overflow) still hits the dedicated /records/people?filter[companies_pk]=<id> page (PR #2132) — that page loads the real array_relation on demand
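
A sketch of the cell-side read swap (types and prop names hypothetical; the real cell lives under apps/web):

// Before: walk company_people[0].people (+ .media.url). After: read the computed_field jsonb.
type PrimaryContact = { person_id: string; full_name: string; avatar_url: string | null };

function contactChipProps(company: { primary_contact: PrimaryContact | null }) {
  if (!company.primary_contact) return null;             // no contact: render an empty cell
  const { full_name, avatar_url } = company.primary_contact;
  return { label: full_name, avatarSrc: avatar_url ?? undefined };
}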

4.2. Optional: introduce companies.connectors_summary computed_field

If company_workspace_connectors → workspace_connector → connector shows up in phase 0 as a top cost, replicate the pattern:

  • New SQL function core_api.company_connectors_summary(row, session) returning jsonb[] of { direction, status, connector_name, connector_service_id } for the company's top-3 connectors
  • Hasura computed_field declared in core_api_companies.yaml
  • Frontend cell switches from walking the array to reading the summary jsonb
  • Same +N nav contract: overflow → dedicated connectors page filtered by company

Acceptance (per relation collapsed):

  • Network request payload size drops by ≥30%
  • GraphQL execution drops by ≥30% on the affected query
  • Visual output identical — same chip rendered, same +N count, same nav target

5. Phase 3 — per-relation limit budget

Hasura supports relation(limit: N, order_by: …) at the GraphQL level. The records table only ever renders maxVisible chips per cell (1–4, depending on column width, per PR #2142). Asking for the unbounded array_relation is wasted work.

  • Default limit: 5 on every list-of-composite array_relation in the records query (covers the widest case, N=4 at 220px+, plus 1 as an overflow probe); fragment shape sketched after this list
  • The frontend already calls +N nav for overflow; the in-cell render only ever needs N+1 items
  • Apply at the query-builder level so it works uniformly for companies/people/invoices/transactions/accounts

Constraint: aggregate counts can NOT be derived from a limited array (the +N badge needs the total). Either:

  • Keep a separate <rel>_aggregate { aggregate { count } } field per cell (cheap; just a COUNT, no rows)
  • Or rely on the field-preview pipeline already shipping aggregates
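
A sketch of the fragment the builder would emit per list-of-composite relation (the order_by column is an assumption, and <rel>_aggregate assumes the aggregate field is exposed in Hasura metadata):

// Bounded rows for the visible chips plus a cheap aggregate count for the +N badge.
const companyPeopleSelection = /* GraphQL */ `
  company_people(limit: 5, order_by: { created_at: desc }) {
    people { pk full_name }
  }
  company_people_aggregate {
    aggregate { count }   # total for the +N badge; fetches no rows
  }
`;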

Acceptance:

  • Each array_relation in the records-query payload is bounded by limit: 5
  • +N badge still shows accurate total
  • Per-relation jsonb payload size drops proportionally to (avg arr length / 5)

6. Phase 4 — Hasura plan-cache warmup (optional, defer until 1–3 measured)

The first /records/companies after a Hasura process restart is the slowest. Each unique query shape compiles a fresh plan; subsequent identical-shape queries hit the cache.

  • Add a one-time warmup on API boot that issues the default-view companies query against a known empty workspace (so it doesn't pollute auth/state, just primes the plan cache)
  • Same for the other default record roots
  • No user-visible behaviour change; pure warmup

Run only if phase 0 confirms plan-cache cold is a >30% contributor on first hit. Otherwise skip.
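
If it does survive the gate (the measured baseline above suggests it won't), a minimal warmup shape, all names hypothetical:

const RECORD_ROOTS = ["companies", "people", "invoices", "transactions", "accounts"];

// Hypothetical boot hook: prime Hasura's plan cache without blocking or failing API boot.
export async function warmHasuraPlanCache(
  runDefaultQuery: (root: string, workspaceId: string) => Promise<unknown>,
): Promise<void> {
  const ws = process.env.WARMUP_WORKSPACE_ID; // a known-empty/system workspace; opt-in
  if (!ws) return;
  await Promise.allSettled(RECORD_ROOTS.map((root) => runDefaultQuery(root, ws)));
}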


7. PR breakdown

PR      Phase              Scope                                             Files
PR-R1   0                  Baseline measurement doc + tooling script         docs/features/records-table/records-query-perf-baseline-2026-05-12.md, scripts/profile-records-query.ts
PR-R2   1                  Backend query-builder honours fields[]            apps/api/src/services/data-views/*
PR-R3   2.1                Switch col 18 to primary_contact computed_field   apps/web/.../custom-cell.tsx, apps/web/.../cell/composite/*
PR-R4   2.2 (conditional)  connectors_summary computed_field                 new SQL migration + Hasura YAML + frontend cell
PR-R5   3                  Per-relation limit budget                         apps/api/src/services/data-views/query-builder.*
PR-R6   4 (conditional)    Plan-cache warmup                                 apps/api/src/index.ts, new services/data-views/warmup.ts

PR-R1 lands first (data-driven gate for the rest). PR-R2 lands second (biggest expected win regardless of phase-0 results). PR-R3–R5 land in parallel after PR-R2 merges. PR-R6 only if phase-0 says cache is the issue.


8. Out of scope

  • Schema migrations beyond optional new computed_field SQL functions
  • Junction direction changes (Bastien-locked in PR #2030 / W19 memory)
  • Backend↔frontend editability contract (locked in PR #2114 second pass — backend overrides.yml editable is SoT)
  • New cell components (Wave B/C still deferred)
  • F-06 logo backfill (separate item, see /tmp/2026-05-12-handoff §13)
  • PR-P1 URL validation (Maxime #22/#23 — separate, blocked on product decision)
  • Feature flags (Well is staging-only — no flags)
  • Hour/day estimates — dependency order + scope size only

9. Risk register

  • Risk: phase-1 active-fields shaping breaks a downstream caller that expects the full graph.
    Mitigation: audit useRecordsInfiniteQuery + useRecordsFieldPreview + chat-bridge data consumers; covered by the spec-conformance test (PR #2133) for the canonical 34 columns.
  • Risk: phase-2 primary_contact swap regresses col-18 chip visuals.
    Mitigation: visual diff via Storybook story (PR-P3 ships the EntityObject story scaffold; reuse it).
  • Risk: phase-3 limit truncates the array and the +N count breaks.
    Mitigation: the <rel>_aggregate { count } field continues to ship for the badge.
  • Risk: Hasura plan-cache warmup pollutes metrics.
    Mitigation: warmup only runs against an empty/system workspace, never user workspaces.
  • Risk: local Docker Hasura is not representative of prod.
    Mitigation: re-measure on staging Cloud Run Hasura before claiming victory; the staging-mirror DB is identical, the runtime envelope differs.
  • Risk: refactor lands during a release freeze.
    Mitigation: none active per current memory (no merge freezes recorded).

10. Brainstorm-MCP critique placeholder

To be filled by mcp__brainstorm__brainstorm (mode: api, participate: true, style: redteam, rounds: 2) before PR-R1 opens. Capture verdict + accepted critiques in a section below this one. Pass the baseline numbers (phase 0 output) as context.

Run after phase 0 measurement lands, not before — the critique needs real numbers to attack.


11. References

  • Live profiling (this session, 2026-05-12 local):
    • Single records-query trace: ~8s total, 7,915ms in Hasura
    • Indexes audit applied: Migration20260512015500_records_page_sort_indexes
    • Workspace-scope resolver bug fixed (knex parameter style + unnest array binding)
  • Bastien locks:
    • Migration20260504240000_drop_target_workspace_connector_pk
    • Migration20260505200000_record_workspace_connectors
  • Maxime QA: docs/features/records-table/qa-fix-index-2026-05-07.md
  • Memory: a12-pr-strategy, w19-connector-junction-direction, no-feature-flags-staging-only, no-time-estimates, no-haiku-subagents