@KyleAMathews
Created February 10, 2026 17:22
electric-sql/electric community PR reviews & issue investigations (2026-02-10)

Now I have the complete picture. Let me write the investigation report.

Summary

The Electric container enters an unrecoverable error state after restart, emitting "Stack not ready after 5000ms. Reason: Timeout waiting for shape data to be loaded." A simple docker compose restart doesn't fix it because the persisted shape data survives the restart; only docker compose down && docker compose up (which discards the persisted shape data) resolves the issue. All API requests get 503 responses indefinitely.

Root cause analysis

★ Insight ───────────────────────────────────── The initialization chain is essentially a relay race — each process must hand off readiness to the next. If any runner drops the baton, the whole stack stalls permanently with no recovery path. ─────────────────────────────────────────────────

The error originates from a specific readiness condition (shape_log_collector_ready) that is never satisfied. Here's the chain:

  1. On startup, ShapeCache (shape_cache.ex:189) runs handle_continue(:wait_for_restore) which:

    • Calls PublicationManager.wait_for_restore() (blocks via GenServer.call with :infinity timeout until the RelationTracker finishes restoring publication filters from persisted shapes)
    • Then calls ShapeLogCollector.mark_as_ready() (shape_log_collector.ex:76-81, also :infinity timeout)
    • mark_as_ready is the only call site that triggers StatusMonitor.mark_shape_log_collector_ready (status_monitor.ex:101)
  2. The StatusMonitor (status_monitor.ex:262-273) holds incoming API requests for up to stack_ready_timeout (5000ms, configured in config.ex:66). When the timeout fires (status_monitor.ex:297-303), it calls timeout_message() (status_monitor.ex:349-382) which checks which condition is still false — in this case shape_log_collector_ready — producing the error message.

  3. The blocking scenario: During wait_for_restore, the ShapeCache calls restore_shape_and_dependencies (shape_cache.ex:309-345) for each persisted shape. If any consumer fails to start (start_shape returns :error at line 298-302), it logs the error and calls ShapeCleaner.remove_shape, but wait_for_restore itself doesn't fail — it just takes longer. The more likely scenario is that PublicationManager.wait_for_restore blocks because the RelationTracker's handle_continue(:restore_relations) (relation_tracker.ex:152-191) takes a long time — it iterates all persisted shapes, adds them to publication filters, then calls update_publication_if_necessary which issues SQL ALTER PUBLICATION commands against Postgres. If Postgres is slow to respond or the connection isn't fully ready, this blocks indefinitely.

  4. Why restart doesn't help: On restart, the persistent SQLite database (ShapeDb) and file storage still contain all the shapes. The same expensive restore cycle begins again, hitting the same timeout. The 5000ms stack_ready_timeout is a per-request timeout, not a startup timeout — the init chain itself has no overall deadline and runs with :infinity timeouts. So the system is stuck: the init takes >5s, every API request times out at 5s, but the init never gives up.

Key root cause: There is no timeout or error handling on the initialization chain itself. The wait_for_restore path in ShapeCache uses :infinity timeouts everywhere. If restoration takes too long (many shapes, slow Postgres, or Postgres connection issues), the shape_log_collector_ready condition is never set, and the stack is permanently "not ready" for all API requests.

Suggested fix

There are several approaches, likely best combined:

1. Add a timeout to the restore chain (primary fix)

In shape_cache.ex:194, the PublicationManager.wait_for_restore call should have a finite timeout, with fallback behavior (e.g., clean all shapes and start fresh):

# shape_cache.ex:194 - Add timeout and recovery
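# Note: a :timeout option on wait_for_restore/2 and ShapeStatus.clean_all_shapes/1
# are proposed additions sketched here for illustration, not existing APIs.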
case PublicationManager.wait_for_restore(state.stack_id, timeout: 30_000) do
  :ok -> :ok
  {:error, :timeout} -> 
    Logger.warning("Shape restore timed out, cleaning stale shapes")
    ShapeStatus.clean_all_shapes(state.stack_id)
end

2. Add graceful degradation on consumer start failure

In shape_cache.ex:309-345 (restore_shape_and_dependencies), when a consumer fails to start, the shape is cleaned up but other shapes continue. However, if the failure is systemic (e.g., Postgres not ready), all shapes fail one by one. Consider adding an early exit or retry-with-backoff.

3. Increase or make configurable the stack_ready_timeout

In config.ex:66 and api.ex:58: The 5000ms default is aggressive. With many persisted shapes, restore can legitimately take longer. This should be configurable via environment variable (it may already be — check ELECTRIC_STACK_READY_TIMEOUT or similar).

4. Make the init chain non-blocking for the readiness signal

Instead of requiring the full restore to complete before marking shape_log_collector_ready, consider marking it ready optimistically and restoring shapes in the background. This would require ShapeLogCollector to queue incoming events while shapes are still restoring.

Key files to modify:

| File | Lines | Change |
| --- | --- | --- |
| lib/electric/shape_cache.ex | 189-213 | Add timeout to restore, add recovery path |
| lib/electric/replication/publication_manager/relation_tracker.ex | 80-81 | Make wait_for_restore accept a finite timeout |
| lib/electric/replication/shape_log_collector.ex | 76-81 | Consider adding a timeout to mark_as_ready |
| lib/electric/config.ex | 66 | Make stack_ready_timeout configurable / increase default |

Effort estimate

Medium — The core fix (adding a timeout to the restore chain with a fallback to clean shapes) is straightforward, but testing the various failure modes (slow Postgres, many shapes, corrupted storage) requires integration test scenarios. The more ambitious "background restore" approach would be Large.

Suggested comment

Thanks for the detailed report and log file! This looks like the shape restore process during startup is taking longer than the 5s readiness timeout allows — and once stuck, there's no recovery path since the restore chain uses infinite timeouts internally. We'll look into adding a bounded timeout to the restore chain with graceful fallback (e.g., clearing stale shapes) so the container can self-heal on restart.

I have everything I need. Here's the investigation:


Summary

The issue requests that the table query parameter appear first in client-generated /v1/shape URLs. Currently, URLSearchParams.sort() sorts all params alphabetically, so table ends up buried near the end—after columns, handle, offset, etc. This makes it hard to quickly identify which table a request targets when scanning logs or DevTools.

Root cause analysis

There are three locations where searchParams.sort() is called, all producing alphabetically-sorted query strings:

  1. packages/typescript-client/src/client.ts:1036 — in the #constructUrl method of ShapeStream. This is the primary URL builder for all shape HTTP requests. After assembling all params (table, columns, offset, handle, etc.), it calls fetchUrl.searchParams.sort().

  2. packages/typescript-client/src/fetch.ts:457 — in getNextChunkUrl(), which constructs the prefetch URL for the next chunk. Also calls nextUrl.searchParams.sort().

  3. packages/typescript-client/src/client.ts:502 — in the standalone canonicalShapeKey() function, used for cache key generation. This one is less visible to users but would benefit from consistency.

The sorting is intentional for cache-key stability and CDN cache-hit rates—alphabetical ordering ensures the same logical request always produces the same URL string. The issue is purely cosmetic: table (the most human-useful param) sorts late alphabetically.

Suggested fix

Create a small utility function (e.g., sortParamsTableFirst) that:

  1. Calls params.sort() as today for canonical ordering
  2. Extracts the table value, deletes it, then rebuilds with table first

Replace the three .sort() call sites with this helper:

  • packages/typescript-client/src/client.ts:1036 — replace fetchUrl.searchParams.sort() with sortParamsTableFirst(fetchUrl)
  • packages/typescript-client/src/fetch.ts:457 — replace nextUrl.searchParams.sort() with sortParamsTableFirst(nextUrl)
  • packages/typescript-client/src/client.ts:502 — replace cleanUrl.searchParams.sort() with sortParamsTableFirst(cleanUrl)

The helper would look roughly like what the issue author proposed—sort, extract table, rebuild with table first then the rest in sorted order. The implementation in the issue is essentially correct. A good home for this utility would be in packages/typescript-client/src/client.ts near the existing canonicalShapeKey function, or in a shared helpers file.
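A minimal sketch of what that helper could look like (the name sortParamsTableFirst comes from the suggestion above; the implementation below is illustrative, not existing client code):

// Illustrative sketch: canonical sort first, then move `table` to the front.
function sortParamsTableFirst(url: URL): void {
  const params = url.searchParams
  params.sort() // keep today's canonical alphabetical ordering

  const table = params.get(`table`)
  if (table === null) return // nothing to reorder

  // Remember the remaining entries in their sorted order, then rebuild
  const rest = [...params.entries()].filter(([key]) => key !== `table`)
  for (const key of [...params.keys()]) params.delete(key)
  params.set(`table`, table)
  for (const [key, value] of rest) params.append(key, value)
}

Because URLSearchParams preserves insertion order, the serialized URL then reads ?table=...&columns=...&handle=...&offset=... while remaining fully deterministic for cache keys.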

★ Insight ─────────────────────────────────────

  • URLSearchParams.sort() sorts entries in-place by key name using UTF-16 code unit comparison — this is why columns < handle < table alphabetically, pushing table to the end.
  • The sorted params serve a dual purpose: stable cache keys and CDN-friendly URLs. The fix preserves both properties since table is always present and always has the same value for a given shape — moving it to a fixed position doesn't break canonicalization. ─────────────────────────────────────────────────

Effort estimate

Small — This is a ~15-line utility function plus swapping 3 call sites. There's no semantic change to requests (only the serialized param order changes), so the main testing need is verifying the new order; existing tests may need minor URL-string assertions updated.

Suggested comment

Thanks for the well-written issue, @evanheckert! This is a straightforward change — there are three searchParams.sort() call sites in the TS client (client.ts and fetch.ts) that we'd wrap with a helper to put table first. Happy to accept a PR if you're still interested, or we can pick it up ourselves. The approach you outlined in the issue is essentially what's needed.

★ Insight ───────────────────────────────────── This is a classic lexicographic vs. numeric sorting bug. In Erlang/Elixir, small maps (≤32 keys) iterate in Erlang term order of their keys. Since the param keys are strings ("1", "2", ..., "10"), they sort lexicographically: "1" < "10" < "2" < "3" < .... After parsing to integers, you get [1, 10, 2, 3, ...] — which fails the sequential check at index 1 (expects 2, gets 10). This is reminiscent of the "natural sort order" problem that plagues many systems — like how file explorers used to list file10.txt before file2.txt. ─────────────────────────────────────────────────


Summary

When a shape request includes more than 9 subset__params (or top-level params), the server rejects it with "Parameters must be numbered sequentially, starting from 1". This happens because parameter keys are strings that sort lexicographically ("1", "10", "2", ...) rather than numerically, causing the sequential-numbering validation to fail.

Root Cause Analysis

The bug is in packages/sync-service/lib/electric/shapes/shape/validators.ex:20-27 in the all_keys_are_sequential/1 function.

The code path:

  1. Params arrive as a map with string keys (%{"1" => "val", "10" => "val", "2" => "val", ...})
  2. all_keys_are_numbers/1 (line 11) iterates the map via map_while_ok, parsing keys to integers — but preserving map iteration order, which for string keys is lexicographic
  3. This produces a list like [1, 10, 2, 3, 4, 5, 6, 7, 8, 9]
  4. all_keys_are_sequential/1 (line 20) checks key == index + 1 without sorting first — index 1 expects 2 but gets 10, so it fails

The same validation is called from two places:

  • shapes/shape/subset.ex:79 — for subset__params
  • shapes/shape.ex:286 — for top-level params

Suggested Fix

One-line fix in validators.ex:20 — sort the keys numerically before checking sequentiality:

defp all_keys_are_sequential(keys) do
  keys
  |> Enum.sort()
  |> Enum.with_index(fn key, index -> key == index + 1 end)
  |> Enum.all?()
  |> if(
    do: :ok,
    else: {:error, {:params, "Parameters must be numbered sequentially, starting from 1"}}
  )
end

Since the keys are already integers (parsed by all_keys_are_numbers/1), Enum.sort/1 will sort them numerically by default, giving [1, 2, 3, ..., 10].

A test case should be added to packages/sync-service/test/electric/shapes/shape/subset_test.exs with 10+ params to prevent regression.

Effort Estimate

Small — Single-line fix (|> Enum.sort()) plus a test case.

Suggested Comment

Thanks for the clear report! This is a lexicographic vs. numeric sorting bug — when there are 10+ params, the string keys sort as "1", "10", "2", ... instead of "1", "2", ..., "10", which breaks the sequential validation. The fix is straightforward: sort the parsed integer keys numerically before checking sequentiality. It's a one-liner in validators.ex.

I see — if filter_overloads_on_implicit_conversion already narrows to 1 choice, the reduce_while halts immediately and doesn't even reach these later steps. So the later steps only matter when step 1 leaves multiple candidates.

For [:text, :unknown] with =, step 1 leaves exactly 1 candidate (text = text → bool). So it halts and returns success. Done.

Final conclusion: My analysis strongly suggests that email = '[email protected]' SHOULD work when email is a TEXT column. The reported issue may have other dimensions I can't see from just the code (environment-specific, timing, or a misreporting of the actual column types).

However, there IS a clear latent bug: the not no_assume? on line 93 of lookups.ex is semantically inverted. It should pass no_assume? (or simply false) rather than not no_assume?. As written, the "assumed types exact match" step is always skipped, forcing everything through the slower heuristic pipeline.

Additionally, the heuristic pipeline's implicit conversion filter would fail for types NOT in Electric's known type system (e.g., citext, jsonb, custom domains). And there are genuine type mismatches in the reported examples (UUID = int).

Let me now write up the final report.

Summary

When using subset__where with comparison operators (=, <>, LIKE, IN) against text or numeric column values, Electric returns "Could not select an operator overload". Boolean comparisons (activated=true) and NULL checks (IS NOT NULL) work fine. The reporter uses TimescaleDB (pg17) and tested with Electric v1.2.9 and latest.

Root cause analysis

After deeply tracing the operator overload resolution in lookups.ex, I found two issues:

1. Inverted logic in assumed-type exact match (latent bug, line 93)

In packages/sync-service/lib/electric/replication/eval/lookups.ex:93:

with :error <- overload_find_exact_match(choices, arg_types, any_unknowns?),
     :error <- overload_find_exact_match(choices, maybe_assumed_arg_types, not no_assume?) do

The third argument not no_assume? evaluates to true when assumed types ARE available, but overload_find_exact_match(_, _, true) (line 162) always returns :error as a guard — it's meant to skip when "there are unknowns." This means the assumed-type exact match is always skipped, forcing all unknown-arg cases into the heuristic pipeline. The fix should pass false (or no_assume?) instead of not no_assume?.

However, the heuristic fallback at line 94 (filter_overloads_on_heuristics) should still resolve the correct overload for [:text, :unknown] through the implicit conversion filter. My trace shows that for email = '[email protected]' where email is :text, only the text = text → bool overload survives step 1, yielding a successful resolution.

2. Missing operator overloads for non-core types (primary suspect)

Electric only registers comparison operators for a limited set of types (text, uuid, int2/4/8, float4/8, numeric, date, time, timestamp, timestamptz, bool, interval). Missing from known_functions.ex:

  • varchar — no comparison operators (relies on implicit cast to text, which SHOULD work via heuristics)
  • bpchar — no comparison operators
  • name — no comparison operators
  • jsonb — no comparison operators at all, no implicit cast path to text
  • citext (common extension) — completely unknown to Electric

If the reporter's actual Postgres column types (from pg_type.typname) are anything other than the registered types, AND implicit casting can't bridge the gap, overload resolution fails. The error message exactly matches: "Could not select an operator overload."

My best hypothesis: Either (a) the reporter's TimescaleDB environment returns an unexpected typname for their columns, (b) there's an extension type like citext in play, or (c) there's a subtle edge case in the heuristic pipeline I haven't caught. The id=1 failure with UUID primary key is a genuine type mismatch (UUID ≠ int4).

Key code paths

| File | Line(s) | Role |
| --- | --- | --- |
| packages/sync-service/lib/electric/replication/eval/lookups.ex | 77-96 | pick_concrete_operator_overload/3 — core overload selection with inverted not no_assume? |
| packages/sync-service/lib/electric/replication/eval/lookups.ex | 161-162 | overload_find_exact_match/3 guard that blocks assumed-type matching |
| packages/sync-service/lib/electric/replication/eval/lookups.ex | 135-158 | filter_overloads_on_heuristics/4 — 5-step heuristic fallback pipeline |
| packages/sync-service/lib/electric/replication/eval/parser.ex | 1036-1048 | find_operator_func/4 — where the error is generated |
| packages/sync-service/lib/electric/replication/eval/parser.ex | 449-450 | String literals ('...') become UnknownConst (no type) |
| packages/sync-service/lib/electric/replication/eval/env/known_functions.ex | 47-58 | Registered comparison operators (limited type coverage) |
| packages/sync-service/lib/electric/replication/eval/env/implicit_casts.ex | 1-64 | Implicit cast paths between types |
| packages/sync-service/lib/electric/replication/eval/env/basic_types.ex | 1-60 | Known type categories (missing bpchar, etc.) |
| packages/sync-service/lib/electric/shapes/shape/subset.ex | 69-87 | Subset WHERE clause validation entry point |
| packages/sync-service/lib/electric/postgres/inspector.ex | 144-155 | columns_to_expr/1 — converts pg_type.typname to ref types |

Suggested fix

Fix 1: Correct the inverted logic (quick fix)

In packages/sync-service/lib/electric/replication/eval/lookups.ex:93, change:

:error <- overload_find_exact_match(choices, maybe_assumed_arg_types, not no_assume?)

to:

:error <- overload_find_exact_match(choices, maybe_assumed_arg_types, no_assume?)

This allows the assumed-type exact match to actually run, which would short-circuit to the correct overload for [:text, :unknown] → [:text, :text] without needing heuristics. This fix is minimal and should be safe, but I'm not 100% sure it resolves the reporter's specific issue since the heuristic fallback should already work.

Fix 2: Add missing comparison operators or broaden type support

In packages/sync-service/lib/electric/replication/eval/env/known_functions.ex, add comparison operators for commonly-used string types:

defcompare("varchar", using: Kernel)
defcompare("bpchar", using: Kernel)  
defcompare("name", using: Kernel)

And add bpchar to basic_types.ex known types table. This handles cases where pg_type.typname returns a string subtype rather than text.

Fix 3: Add better error messaging

The error "Could not select an operator overload" is cryptic. Including the attempted types would help diagnosis:

:error -> {:error, {location, "Could not select an operator overload for #{arg_types_desc}"}}

Fix 4: Add test coverage

There are no tests for text_column = 'string_literal' in the subset path. Adding tests for common WHERE patterns would catch regressions:

  • text_ref = 'string'
  • varchar_ref = 'string'
  • int_ref = 1
  • text_ref LIKE 'pattern%'

Effort estimate

Small to Medium — Fix 1 is a one-line change. Fix 2 is a few lines in known_functions.ex + basic_types.ex. The main effort is in reproducing the issue reliably to verify the fix, since my code trace suggests the heuristic path should already work for [:text, :unknown]. Getting the reporter's exact environment details (actual pg_type.typname values for their columns) would help narrow this down.

Suggested comment

Thanks for the detailed report! I've investigated the operator overload resolution code and found a bug in the assumed-type exact match logic in lookups.ex:93 where the boolean is inverted, forcing all unknown-argument cases through the slower heuristic fallback. While the heuristic path should still resolve text = 'string' correctly, there may be edge cases depending on your actual column types from pg_type.typname. Could you share the output of SELECT attname, pg_type.typname FROM pg_attribute JOIN pg_type ON atttypid = pg_type.oid WHERE attrelid = 'users'::regclass to confirm the exact types? This would help us reproduce and verify the fix.

This confirms everything. The fix in commit 4a28afaf2 (PR #3640) addresses this exact issue. The key changes were:

  1. materializer.ex: Added catch :exit, _reason -> raise to convert the exit signal to a RuntimeError
  2. filter.ex: Changed rescue error -> to catch kind, error -> — this is the critical fix. The original rescue only catches Elixir exceptions (%RuntimeError{}), not Erlang exits. Before the materializer fix, the raw :exit signal from GenServer.call was NOT caught by rescue, only by catch. After adding the materializer catch+raise, the RuntimeError IS caught by rescue, but the filter catch was still changed for defense-in-depth.

Now I have the full picture. Let me write the investigation report.

Summary

When using subqueries (allow_subqueries feature flag), the ShapeLogCollector (SLC) GenServer crashes while processing incoming transactions. The crash occurs because the SLC's filter code tries to call Materializer.get_link_values on a Materializer process that no longer exists (shape was cleaned up), and the resulting :exit signal propagates uncaught, killing the SLC. With the SLC down, all shape requests fail with the "Stack not ready" timeout.

Root Cause Analysis

The bug is a race condition combined with insufficient error handling in the SLC's filter path.

The race condition:

  1. Shape B's consumer/materializer is stopped (shape cleanup via ShapeCleaner.remove_shape_immediate at shape_cleaner.ex:165)
  2. The removal of shape B from the SLC's EventRouter/Filter happens asynchronously via RequestBatcher (shape_cleaner.ex:170 → shape_log_collector.ex:112)
  3. Between steps 1 and 2, a new transaction arrives at the SLC
  4. The SLC's filter still references shape B. WhereCondition.other_shapes_affected (where_condition.ex:210) calls refs_fun.(shape) → Materializer.get_all_as_refs → get_link_values → GenServer.call to the now-dead Materializer process
  5. The GenServer.call raises an :exit signal

The error handling gap (in v1.2.8):

  • filter.ex:122-124 had rescue error -> which only catches Elixir exceptions, not raw Erlang :exit signals
  • where_condition.ex:172 similarly had rescue error ->
  • materializer.ex:58 did NOT have a catch :exit to convert the exit to a RuntimeError
  • The :exit signal from GenServer.call propagated all the way up, crashing the SLC GenServer

Suggested Fix

This has already been fixed in PR #3640 (commit 4a28afaf2, merged Dec 24 2025):

  1. packages/sync-service/lib/electric/shapes/consumer/materializer.ex:58-63 — Added catch :exit, _reason -> raise RuntimeError to get_link_values/1, converting the exit into a catchable exception
  2. packages/sync-service/lib/electric/shapes/filter.ex:122-124 — Changed rescue error -> to catch kind, error -> for defense-in-depth, ensuring any kind of error (exit, throw, exception) is caught and falls back to routing to all shapes

The fix ensures that a missing materializer doesn't crash the SLC — instead, the filter falls back to treating all shapes as affected (a safe over-approximation), and individual consumers handle the filtering themselves.

Effort Estimate

Small — Already fixed on main. Just needs to be included in the next release.

Suggested Comment

This was a race condition where the ShapeLogCollector could crash when a shape's Materializer process was cleaned up while the SLC was still processing a transaction referencing it through subquery filters. This has been fixed in PR #3640 (merged Dec 24) — the fix ensures missing materializers are handled gracefully instead of crashing the SLC. The fix will be available in the next release after v1.2.8. In the meantime, restarting the container should recover things (the persisted shape state will be restored on startup, though you may hit it again if the race recurs).

No existing documentation for SSE timeout or keepalive in the config docs. Now I have everything I need for a complete analysis.


Summary

The user reports that SSE connections in live mode disconnect after 60 seconds, which appears to conflict with the documentation's implication that SSE connections persist indefinitely. The 21-second keep-alive heartbeat also seems at odds with a 60s hard timeout. They request: (1) an explanation of why the timeout exists, and (2) making it configurable via an environment variable like ELECTRIC_SSE_TIMEOUT.

Root Cause Analysis

This isn't a bug — the 60s SSE timeout is intentional and serves request collapsing (as evidenced by the cache header comment in response.ex:294). The SSE cache max-age is set to sse_timeout - 1 second (i.e., 59s), so CDNs/proxies can collapse concurrent SSE requests during that window.

However, the value is completely hardcoded with no way to configure it:

  • Default: sse_timeout: 60_000 in the Api struct at api.ex:56
  • Usage: A Process.send_after timer fires at api.ex:819, and when the :sse_timeout message arrives at api.ex:931, the SSE stream ends with mode: :done
  • Cache coupling: response.ex:302 derives max-age from this timeout: "public, max-age=#{max(1, div(api.sse_timeout, 1000) - 1)}"
  • Config gap: Unlike long_poll_timeout which flows through config.ex:57 → application.ex:159 → the Api struct, sse_timeout has no env var binding and no entry in runtime.exs

The keepalive_interval: 21_000 (api.ex:54) is also hardcoded and similarly not configurable.

Suggested Fix

Follow the existing pattern used by long_poll_timeout and other configurable timeouts:

  1. config/runtime.exs — Add env var parsing:

    sse_timeout: env!("ELECTRIC_SSE_TIMEOUT", &Electric.Config.parse_human_readable_time!/1, nil),
  2. lib/electric/config.ex (~line 57 in @defaults) — Add the default:

    sse_timeout: 60_000,
  3. lib/electric/application.ex (~line 159 in api_configuration/1) — Pass it through:

    sse_timeout: get_env(opts, :sse_timeout),
  4. website/docs/api/config.md — Document the new ELECTRIC_SSE_TIMEOUT env var using the existing <EnvVarConfig> component pattern, explaining the request collapsing rationale.

  5. Optionally do the same for keepalive_interval via ELECTRIC_SSE_KEEPALIVE_INTERVAL.

  6. Tests: The existing test helpers in serve_shape_plug_test.exs (lines 74, 877, 975, 1053) already pass custom sse_timeout values to the Api struct, so the plumbing is test-friendly. Just add a test verifying the env var flows through correctly.

★ Insight ───────────────────────────────────── The cache header coupling at response.ex:302 is the key architectural detail — max-age is derived from sse_timeout, meaning this timeout isn't just about connection lifecycle, it's integral to the CDN request collapsing strategy. Any documentation should make this clear so users understand that very long SSE timeouts may affect cacheability. ─────────────────────────────────────────────────

Effort Estimate

Small — This is a straightforward configuration plumbing task following a well-established pattern already used for long_poll_timeout, tcp_send_timeout, and others. The main work is adding 3-4 lines across config files plus documentation.

Suggested Comment

Thanks for the detailed report! The 60s SSE timeout is intentional — it's coupled to the Cache-Control: max-age header to enable request collapsing at CDN/proxy layers. That said, you're absolutely right that it should be configurable. The fix is straightforward: we just need to wire up an ELECTRIC_SSE_TIMEOUT env var following the same pattern we use for other timeouts. We'll also improve the docs to explain the request collapsing rationale so this doesn't look like a bug.

That's an extremely thorough exploration. Now I have a comprehensive understanding of the codebase. Let me write the investigation report.

Summary

The user wants to sync a sliding window of recent messages (e.g., last 30 days) while also pulling in the parent Conversation and Customer rows those messages reference — even if those parent rows are years old and outside the time window. Without this, the client has orphaned foreign keys and a broken UI. The user asks for an include-style API (removed in v1.0) or equivalent "graph sync" mechanism.

Root Cause Analysis

This isn't a bug — it's a feature request that is already largely addressed by existing (but experimental) functionality. Electric's tagged_subqueries feature flag enables exactly the pattern the user needs via subquery-based WHERE clauses. The mechanism:

  1. Subquery parsing in packages/sync-service/lib/electric/replication/eval/parser.ex (extract_subqueries/1, line 152) extracts IN (SELECT ...) sublinks from WHERE clauses.
  2. Dependency shapes are built in packages/sync-service/lib/electric/shapes/shape.ex (build_shape_dependencies/2, line 317) — each subquery becomes an inner shape whose values determine what rows are included in the outer shape.
  3. Incremental move-in/move-out is handled by packages/sync-service/lib/electric/shapes/consumer.ex (line 283-309). With tagged_subqueries enabled, dependency changes are streamed incrementally rather than triggering a 409 must-refetch.
  4. Materializer (packages/sync-service/lib/electric/shapes/consumer/materializer.ex) tracks the inner shape's current values and emits move events when they change.

The user's desired pattern translates to:

-- Conversations referenced by recent messages
table: conversations
where: id IN (SELECT conversation_id FROM messages WHERE created_at >= now() - interval '30 days')

This is documented in website/docs/guides/shapes.md (lines 119-167) and requires setting ELECTRIC_FEATURE_FLAGS=allow_subqueries,tagged_subqueries.

Current limitations (in consumer.ex lines 287-291):

  • Shapes with OR + subquery, NOT + subquery, or >1 subquery dependency still trigger 409 invalidation even with tagged_subqueries enabled.
  • Nested subqueries (conversations → customers in a single shape definition) would require two separate shapes or multiple subquery levels, which may hit the >1 dependency limitation.
  • Subsets (progressive loading) explicitly disallow subqueries (packages/sync-service/lib/electric/shapes/shape/subset.ex, line 89-90).

Suggested Fix

No code changes are strictly needed — this is a documentation and discoverability gap. The feature exists but the user couldn't find examples. Concrete improvements:

  1. Documentation (website/docs/guides/shapes.md): Add a "Dependent Data Hydration" or "Graph Integrity" recipe showing the exact sliding-window + parent-sync pattern with subqueries. The current docs show basic subquery examples but don't address this common use case.

  2. Issue response: Provide the user with concrete shape definitions for their three-table hierarchy (messages → conversations → customers), showing how to compose multiple shapes with subqueries.

  3. Longer term (if the multi-dependency limitation is a blocker): The restriction at consumer.ex:289 (length(state.shape.shape_dependencies) > 1) could be relaxed to support chained subqueries (conversations referencing messages, customers referencing conversations). This would require extending MoveHandling to track multiple independent tag dimensions.

Effort Estimate

Small — The core functionality already exists behind feature flags. The immediate need is documentation and a worked example. Relaxing the multi-dependency limitation would be Medium.

Suggested Comment

This is already supported using subqueries in WHERE clauses! You'd define separate shapes for each level of the dependency graph. For example, to sync conversations referenced by recent messages:

// Messages shape (the "hot" window)
table: 'messages', where: "created_at >= now() - interval '30 days'"

// Conversations referenced by those messages
table: 'conversations', where: "id IN (SELECT conversation_id FROM messages WHERE created_at >= now() - interval '30 days')"

// Customers referenced by those conversations
table: 'customers', where: "id IN (SELECT customer_id FROM conversations WHERE id IN (SELECT conversation_id FROM messages WHERE created_at >= now() - interval '30 days'))"

Enable this by setting ELECTRIC_FEATURE_FLAGS=allow_subqueries,tagged_subqueries. The tagged_subqueries flag ensures that when messages move in/out of the 30-day window, the parent rows are incrementally added/removed without triggering a 409 refetch. See the subqueries section of the shapes guide for more details. We should add a recipe for this exact pattern to the docs — thanks for the clear description of the use case!
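For reference (as a supplement to the comment above, not necessarily part of it), those three shapes could be declared with the TypeScript client roughly as follows. This is a sketch: the params-based options assume a recent @electric-sql/client release, and the URL is a placeholder.

import { ShapeStream } from '@electric-sql/client'

const url = `http://localhost:3000/v1/shape` // placeholder Electric endpoint

// The "hot" 30-day window of messages
const messages = new ShapeStream({
  url,
  params: {
    table: `messages`,
    where: `created_at >= now() - interval '30 days'`,
  },
})

// Parent conversations referenced by those messages
const conversations = new ShapeStream({
  url,
  params: {
    table: `conversations`,
    where: `id IN (SELECT conversation_id FROM messages WHERE created_at >= now() - interval '30 days')`,
  },
})

// Customers referenced by those conversations
const customers = new ShapeStream({
  url,
  params: {
    table: `customers`,
    where: `id IN (SELECT customer_id FROM conversations WHERE id IN (SELECT conversation_id FROM messages WHERE created_at >= now() - interval '30 days'))`,
  },
})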

That's a comprehensive picture. This is clearly a documentation/education issue that was already resolved, not a code bug. Let me write the investigation.


Summary

A user building a Python/Django durable stream consumer was confused about how Electric's HTTP API handles reconnection — specifically the role of handle, the relationship between LSN and offset, and how to resume without data loss. The issue was already answered and closed: Electric doesn't yet speak the Durable Streams protocol, and the user was pointed to the client development guide.

Root Cause Analysis

This is a documentation/protocol confusion issue, not a code bug. The user conflated two things:

  1. Electric's current HTTP sync protocol — uses handle (ephemeral shape log identifier), offset (format: {tx_offset}_{op_offset}), and cursor (cache-busting for live mode). Documented in website/docs/guides/client-development.md and website/docs/api/http.md.

  2. The Durable Streams protocol — a generalized, standalone streaming protocol announced in December 2025 (website/blog/posts/2025-12-09-announcing-durable-streams.md). Electric 2.0 will adopt this as its transport layer, but the current Electric service does not implement it yet.

The confusion is understandable: the Durable Streams announcement and protocol docs exist on the same website as Electric's docs, but the two protocols are distinct. The current Electric protocol's offset is not an LSN — it's a compound {tx_offset}_{op_offset} value returned in the electric-offset response header (see packages/sync-service/lib/electric/replication/log_offset.ex). The handle is ephemeral and server-scoped, returned in the electric-handle header.
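To make the resume mechanics concrete, here is a rough sketch of a catch-up loop against the current protocol, written in TypeScript for brevity (the reporter was using Python). The catchUp function, its persistence shape, and the loop structure are illustrative; it covers only the catch-up phase, and live mode (live=true plus the electric-cursor header) and 409 handling are described in the client development guide.

// Sketch only: persist { handle, offset } after each batch so a restart can resume.
async function catchUp(
  baseUrl: string,
  table: string,
  saved?: { handle: string; offset: string }
) {
  let offset = saved?.offset ?? `-1` // -1 means "from the start of the shape log"
  let handle = saved?.handle

  while (true) {
    const url = new URL(`${baseUrl}/v1/shape`)
    url.searchParams.set(`table`, table)
    url.searchParams.set(`offset`, offset)
    if (handle) url.searchParams.set(`handle`, handle)

    const res = await fetch(url)

    // The server hands back the coordinates for the next request
    handle = res.headers.get(`electric-handle`) ?? handle
    offset = res.headers.get(`electric-offset`) ?? offset

    const messages = await res.json() // change + control messages
    // ...apply messages, then persist { handle, offset }...

    if (res.headers.get(`electric-up-to-date`) !== null) break
  }

  return { handle, offset }
}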

Key files for understanding the current protocol:

  • website/docs/guides/client-development.md — the primary guide for building non-JS clients
  • website/docs/api/http.md — full HTTP API reference
  • packages/sync-service/lib/electric/shapes/api/response.ex — response header construction (electric-handle, electric-offset, electric-cursor, electric-up-to-date)
  • packages/sync-service/lib/electric/replication/log_offset.ex — offset struct definition (tx_offset + op_offset)

Suggested Fix

No code fix needed. This is resolved. However, two small documentation improvements could prevent similar confusion:

  1. In website/docs/guides/client-development.md — add a brief note clarifying that Electric's current protocol is distinct from the Durable Streams protocol, and that Electric 2.0 will adopt Durable Streams.

  2. In website/docs/api/http.md — the offset parameter documentation could more explicitly state that the offset is not a raw Postgres LSN, but a compound {tx_offset}_{op_offset} value that must be taken from response headers, not constructed by the client.

Effort Estimate

Small — Issue is already resolved. Optional doc improvements would be ~30 minutes of work.

Suggested Comment

This issue has been addressed — thanks for the question, @lekhnath! For anyone else landing here: Electric's current HTTP API uses handle + offset + cursor for resumable streaming (see the client development guide). The Durable Streams protocol is a separate, next-generation protocol that Electric 2.0 will adopt. The two are distinct for now.

Now I have a clear picture. Here's my investigation:


Summary

The user is trying to run the tanstack-db-web-starter example using Bun instead of Node.js. When accessing the app through the Caddy reverse proxy (port 5173), they get an HTTP 502 error with too many transfer encodings: ["chunked" "chunked"]. The app works fine when accessed directly on port 3000 (bypassing Caddy).

Root cause analysis

The error too many transfer encodings: ["chunked" "chunked"] is a duplicate Transfer-Encoding: chunked header problem. The chain is:

  1. Bun's HTTP server sends responses with Transfer-Encoding: chunked
  2. Caddy reverse-proxies this to the browser, and somewhere in the Bun-to-Caddy handoff the chunked header ends up doubled

The specific issue is how Bun handles HTTP/1.1 responses differently from Node.js. When Bun's Vite dev server streams responses, it sets Transfer-Encoding: chunked. When Caddy's reverse_proxy reads that upstream response, something in the Bun→Caddy handoff results in the header appearing twice (["chunked" "chunked"]). Go's net/http (which Caddy uses) is strict about this and rejects the response with a 502.

This is not an issue in the Electric codebase itself — it's a Bun runtime compatibility issue with how Bun's HTTP server emits transfer-encoding headers when proxied through Caddy. Node.js doesn't double up this header.

Key files involved:

  • src/vite-plugin-caddy.ts (lines 26-38) — generates the Caddyfile with reverse_proxy pointing at Vite's dev server
  • src/lib/electric-proxy.ts (lines 47-58) — the shape proxy deletes content-encoding and content-length but does not manage transfer-encoding, which may compound the issue for streamed Electric responses under Bun

Suggested fix

This is primarily a Bun runtime quirk rather than a bug in Electric's code. However, there are a few practical approaches:

  1. Quick workaround — strip duplicate headers in Caddy (src/vite-plugin-caddy.ts:26-38): Modify the generated Caddyfile to add a header_down directive that removes duplicate transfer-encoding:

    reverse_proxy localhost:{vitePort} {
        header_down -Transfer-Encoding
    }
    

    This lets Caddy re-add its own single Transfer-Encoding: chunked rather than forwarding Bun's.

  2. Alternative — flush encoding in electric-proxy.ts (line 50): Also delete transfer-encoding alongside content-encoding in proxyElectricRequest(), so the runtime re-negotiates it cleanly:

    headers.delete(`transfer-encoding`)
  3. Documentation approach: Add a note to the README that Bun is not yet a supported runtime, since TanStack Start / Nitro / Vite's ecosystem primarily targets Node.js. The package.json already specifies "engines": {"node": ">=20.19.0 || >=22.12.0"}.

Option 1 is the most robust because it fixes the issue at the proxy layer regardless of which responses trigger the double header.

Effort estimate

Small — The Caddyfile generation change is a one-liner, and the electric-proxy header fix is similarly trivial. Testing with Bun would be the main effort.

Suggested comment

Thanks for reporting this! The too many transfer encodings: ["chunked" "chunked"] error comes from Bun's HTTP server emitting a Transfer-Encoding: chunked header that gets duplicated when Caddy proxies the response — Go's HTTP stack is strict about this. The starter currently targets Node.js (see the engines field in package.json), but we could add a header_down -Transfer-Encoding directive to the generated Caddyfile in vite-plugin-caddy.ts to make it work with Bun. We'll look into adding this as a small fix. In the meantime, you can work around it by bypassing Caddy (accessing port 3000 directly, which you've already found works).

Now I have all the information I need. Here's my investigation:


Summary

When creating a ShapeStream in a browser, users must provide a full absolute URL (e.g., http://localhost:3000/v1/shape). Passing a relative path like /api/v1/shape causes a TypeError because the code uses new URL(url) which requires an absolute URL. This is particularly painful when the same app runs across multiple environments (local dev, staging, prod) where the origin differs but shouldn't matter.

Root cause analysis

The issue originates in packages/typescript-client/src/client.ts. The user-provided url string flows through two code paths where it's parsed as an absolute URL:

  1. Line 883 — #constructUrl() method:

    const fetchUrl = new URL(url)

    This is the primary failure point. this.options.url (the raw string) is passed directly to new URL(), which throws TypeError: Invalid URL for relative paths since no base is provided.

  2. Line 493 — canonicalShapeKey() function:

    const cleanUrl = new URL(url.origin + url.pathname)

    This receives a URL object so it would fail downstream if the first one were somehow bypassed, since .origin on a relative URL is "null".

The ShapeStreamOptions interface (line 231-236) documents the url field as requiring "the full URL" and all examples/tests use ${BASE_URL}/v1/shape patterns.

Notably, the browser's native fetch() API already supports relative URLs — so if the URL were resolved before being parsed, the fix would work seamlessly in browser contexts.

Suggested fix

The minimal fix is to resolve relative URLs against window.location.origin (in browser environments) early in the pipeline. Two approaches:

Approach A (simplest): Resolve at the top of #constructUrl() at line 883:

// Replace: const fetchUrl = new URL(url)
// With:
const fetchUrl = new URL(url, globalThis.location?.origin)

new URL(path, base) handles both relative and absolute URLs correctly — if url is already absolute, the base is ignored. And globalThis.location?.origin is undefined in Node.js (where relative URLs wouldn't make sense), so absolute URLs still work there and relative URLs would properly throw.
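For illustration, this is plain WHATWG URL behavior (values are made up):

// Relative path + base: resolves against the base (e.g. the browser's origin)
new URL(`/api/v1/shape`, `http://localhost:3000`).href
// => "http://localhost:3000/api/v1/shape"

// Absolute URL + base: the base is simply ignored, so existing usage is unchanged
new URL(`http://example.com/v1/shape`, `http://localhost:3000`).href
// => "http://example.com/v1/shape"

// Relative path with an undefined base (e.g. Node.js, where globalThis.location is undefined)
new URL(`/api/v1/shape`, undefined)
// => throws TypeError: Invalid URL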

Approach B (more explicit): Add a URL normalization step in the constructor or validateOptions() (line 1822) that resolves relative URLs once and stores the absolute version in this.options.url, so all downstream code just works.

Files to modify:

  • packages/typescript-client/src/client.ts — line 883 (#constructUrl) is the primary fix location
  • Optionally update the url JSDoc at line 232-235 to note that relative paths work in browser environments
  • Add a unit test for relative URL resolution

★ Insight ───────────────────────────────────── The new URL(relative, base) two-argument form is the standard way to resolve relative URLs in JavaScript. When the first argument is already absolute, the second argument is simply ignored — making this a safe, backwards-compatible change. This is the same pattern the browser uses internally when you call fetch("/api/data"). ─────────────────────────────────────────────────

Effort estimate

Small — This is a 1-5 line code change in a single file with well-understood behavior. The hardest part is deciding on the right abstraction (resolve once vs. resolve at use-site) and ensuring Node.js environments don't regress.

Suggested comment

Thanks for the feature request, @fabianlindfors! This is a clean improvement — the fix would be to use the two-argument new URL(url, globalThis.location?.origin) form in the #constructUrl method, which gracefully handles both relative and absolute URLs. Relative paths would work in browser environments (where location.origin is available) while absolute URLs continue to work everywhere. Should be a small change.

Now I have a clear picture. Let me verify the exact line numbers for the fix.

Summary

When subqueries contain column names that require SQL quoting (e.g., camelCase names like threadId), the move-in query mechanism fails with "parameters must be of length 0". The root cause is that rebuild_subquery_section/1 doesn't quote column names, causing a string mismatch when trying to replace the subquery section with parameterized SQL.

Root Cause Analysis

The bug is in packages/sync-service/lib/electric/shapes/shape/subquery_moves.ex, line 67, in the rebuild_subquery_section/1 function.

The function reconstructs the subquery string to use as a target for String.replace/3. It joins column names with a bare Enum.join/2:

~s|IN (SELECT #{Enum.join(shape.explicitly_selected_columns, ", ")} FROM #{Electric.Utils.relation_to_sql(shape.root_table)}|

This produces unquoted column names like threadId. However, the original query string (stored in shape.where.query) was built by extract_sublink_queries/1 in packages/sync-service/lib/electric/shapes/shape.ex, line 356, which does quote column names:

columns = Enum.map_join(shape.explicitly_selected_columns, ", ", &Utils.quote_name/1)

So the reconstructed string IN (SELECT threadId FROM ... never matches the original IN (SELECT "threadId" FROM ..., and String.replace/3 returns the query unchanged — still containing the literal subquery with no $1 placeholders. When Postgrex then receives parameters for non-existent placeholders, it raises the "parameters must be of length 0" error.

★ Insight ───────────────────────────────────── This is a classic "reconstruct to match" bug. Two separate code paths independently build a SQL string representation of the same subquery — one quotes identifiers, the other doesn't. The String.replace silently does nothing when there's no match, which makes the failure mode particularly insidious: instead of an obvious error at replacement time, you get a confusing Postgrex error downstream when parameters don't align with placeholders. ─────────────────────────────────────────────────

Suggested Fix

File: packages/sync-service/lib/electric/shapes/shape/subquery_moves.ex, line 65-71

Change rebuild_subquery_section/1 to quote column names using Electric.Utils.quote_name/1, matching what extract_sublink_queries/1 does:

defp rebuild_subquery_section(shape) do
  columns =
    Enum.map_join(shape.explicitly_selected_columns, ", ", &Electric.Utils.quote_name/1)

  base =
    ~s|IN (SELECT #{columns} FROM #{Electric.Utils.relation_to_sql(shape.root_table)}|

  where = if shape.where, do: " WHERE #{shape.where.query}", else: ""
  base <> where <> ")"
end

This is exactly the fix the issue author proposed, and it's correct. The change is a single-line edit: replacing Enum.join(shape.explicitly_selected_columns, ", ") with Enum.map_join(shape.explicitly_selected_columns, ", ", &Electric.Utils.quote_name/1).

Effort Estimate

Small — One-line change in a single file. The fix is straightforward and the issue author has already identified the exact root cause and solution. A test with a camelCase column name in a subquery would be valuable to add.

Suggested Comment

Thanks for the excellent bug report and analysis, @timkendall! You've nailed the root cause — rebuild_subquery_section/1 needs to quote column names with Electric.Utils.quote_name/1 to match what extract_sublink_queries/1 produces. This is a small, clean fix and we'll get it addressed.

Now I have all the information needed. Here's my investigation:


Summary

After PR #3701 added a cache-buster mechanism to bypass stale CDN responses for expired shape handles, the cache-buster value is not guaranteed to differ between retry attempts. The same cache-buster query parameter can repeat across retries, so if the CDN cached the first retry (with the buster), subsequent retries hit the same stale cached response. The reporter also notes that offset=-1 appears in the failing request, which shouldn't normally fail.

Root Cause Analysis

The bug is in packages/typescript-client/src/client.ts.

The cache buster is generated once at line 1114 inside #onInitialResponse():

// Line 1114
this.#staleCacheBuster = `${Date.now()}-${Math.random().toString(36).substring(2, 9)}`

This value is stored in #staleCacheBuster (line 601) and then read at lines 1028–1032 when constructing the fetch URL for the retry. When the retry also receives a stale response, #onInitialResponse() is called again, which does re-execute line 1114 and regenerate the value — so at first glance it looks correct.

However, there's a subtle timing/flow issue: the cache buster is generated right before throwing StaleCacheError (line 1115). The error is caught at line 816, which immediately calls this.#requestShape(). The #requestShape() method reads this.#staleCacheBuster at line 1028 to build the URL. If the timestamp hasn't changed (sub-millisecond retry loop) and the random portion collides, the CDN serves the same cached response.

More critically, looking at the actual execution flow: Date.now() has millisecond resolution, and these retries happen nearly instantaneously (no backoff/delay between retries — line 821 is just return this.#requestShape()). So the Date.now() component can easily be identical across retries, and the random suffix has only ~6 characters of entropy from Math.random().

The real fix should ensure the cache buster is always unique per retry attempt by incorporating the retry count, or by generating it at the point of use (line 1028) rather than at the point of detection (line 1114).

Suggested Fix

File: packages/typescript-client/src/client.ts

Option A (simplest): Include the retry count in the cache buster to guarantee uniqueness:

At line 1114, change:

this.#staleCacheBuster = `${Date.now()}-${Math.random().toString(36).substring(2, 9)}`

to:

this.#staleCacheBuster = `${Date.now()}-${this.#staleCacheRetryCount}-${Math.random().toString(36).substring(2, 9)}`

This guarantees each retry produces a different cache buster since #staleCacheRetryCount is incremented at line 1089 before the cache buster is generated.

Option B (more robust): Move cache buster generation to the point of use. At lines 1028–1032, generate a fresh value each time instead of reading the stored one:

if (this.#staleCacheBuster) {
  fetchUrl.searchParams.set(
    CACHE_BUSTER_QUERY_PARAM,
    `${Date.now()}-${Math.random().toString(36).substring(2, 9)}`
  )
}

And change line 1114 to just set a boolean flag: this.#staleCacheBuster = 'active'.

Test fix: The existing test at packages/typescript-client/test/expired-shapes-cache.test.ts lines 570–574 only checks that the cache-buster parameter exists on each retry. It should additionally verify the values are unique across retries:

const cacheBusters = capturedUrls.slice(1).map(
  (u) => new URL(u).searchParams.get(CACHE_BUSTER_QUERY_PARAM)
)
expect(new Set(cacheBusters).size).toBe(cacheBusters.length) // all unique

Effort Estimate

Small — A one-line change to the cache buster generation (plus a small test addition) should fix this.

Suggested Comment

Thanks for the follow-up report, Joey! You're right — the cache buster value can end up being the same across retries because Date.now() has millisecond resolution and the retries happen nearly instantaneously with no backoff. The fix is to incorporate the retry count into the cache buster string (or regenerate it at the point of use) so each attempt is guaranteed to produce a unique value. We'll also update the test to assert uniqueness across retries.

Now I have the complete picture. Let me write the investigation report.


Summary

The user requests that Electric encode column values in their native JSON types (numbers, booleans, JSON objects) rather than serializing every column value as a JSON string. Currently, a response like {"id":"123","active":"true","score":"99.5"} forces every client to re-parse string values into their proper types, which is especially painful for strongly-typed clients like Swift.

Root cause analysis

Values are stringified at two independent points in the pipeline — both deliberately:

1. Initial snapshot query (the deeper cause)

In packages/sync-service/lib/electric/shapes/querying.ex:270-278, the SQL query that builds the initial snapshot JSON constructs each column value by:

defp pg_cast_column_to_text(column), do: ~s["#{Utils.escape_quotes(column)}"::text]
defp pg_escape_string_for_json(str), do: ~s[to_json(#{str})::text]

Every column is explicitly cast to ::text before being interpolated into the JSON string. The escape_column_value/1 pipeline (line 270) is: cast to text → wrap in to_json()::text → coalesce nulls. This means Postgres constructs the JSON with every value already quoted as a string.

2. Logical replication stream (inherent limitation)

In packages/sync-service/lib/electric/postgres/logical_replication/decoder.ex:235-249, the pgoutput protocol delivers all column values as raw text bytes. These flow unchanged through message_converter.ex:268 (defp column_value(_column_name, value), do: value) into the log as %{String.t() => String.t()} maps, which get JSON-encoded as string values.

Why it's this way

The current design is a deliberate simplification: by treating all values as strings, Electric avoids needing to maintain type mapping logic on the server. The TypeScript client compensates with a parser module (packages/typescript-client/src/parser.ts:32-41) that maps Postgres types like int4, bool, float8, jsonb to JS parse functions — but this is opt-in and only helps JS/TS clients.

Suggested fix

There are two paths, matching the two data sources:

Path A: Typed snapshot queries (Medium effort)

In querying.ex, instead of casting every column to ::text, use Postgres' to_json() or to_jsonb() without the text cast for types that have native JSON equivalents:

  • build_column_part/1 (line 252): Branch on column type info. For numeric/boolean columns, use to_json("column")::text which produces 123 not "123". For text/varchar, keep the current quoted-string approach.
  • This requires the Shape struct to carry column type information (via the selected_columns field or an additional type map from relation info).
  • Key files: querying.ex, shape.ex (to carry type metadata), and the relation/column info structures.

Path B: Typed replication values (Larger effort)

In the replication path, the message_converter.ex column_value/2 function would need access to column type OIDs (available from the Relation message's columns) to cast text values to their proper Elixir types before JSON encoding. This affects:

  • message_converter.ex:268 — add type-aware conversion
  • log_items.ex — the value map would contain mixed types
  • Storage layer — currently stores pre-encoded JSON strings; would need to handle typed values
  • All tests that assert string values in records

Recommended approach

Path A is significantly easier and covers the initial snapshot (the bulk of data). A query-parameter or header (e.g., ?typed_values=true) could opt into typed output to maintain backward compatibility. For replication values, a parallel effort would modify the message converter to use the column type OIDs already present in the Relation.Column structs.

Effort estimate

Medium to Large — Path A alone is Medium (~2-3 days for someone familiar with the codebase). Both paths together are Large. The main complexity is maintaining backward compatibility and ensuring the TypeScript client's parser still works when values arrive pre-typed.

Suggested comment

Thanks for filing this! You're right that encoding all values as JSON strings pushes unnecessary parsing work to clients. The server has the type information available (from both the Postgres catalog and replication relation messages), so it's technically feasible to emit native JSON types for numerics, booleans, and JSON columns. This would likely be introduced as an opt-in mode (e.g., via a query parameter) to maintain backward compatibility with existing clients. We'll triage this for consideration.

I now have a complete picture. Here's the investigation:


Summary

When a transaction contains exactly N changes where N is divisible by 100 (the max_batch_size), the changes are inserted into Postgres but never replicated to clients. The database write succeeds, but Electric's sync service silently drops the commit fragment, so subscribers receive zero messages. This affects any transaction type (INSERT, UPDATE, DELETE, or mixed) where the total change count is a multiple of 100.

Root Cause Analysis

The bug is a state-management edge case in MessageConverter.maybe_flush/1 interacting with a deduplication guard in ShapeLogCollector.handle_txn_fragment/2. Here's the chain:

1. Flush at exact boundary (message_converter.ex:308-321)

When the 100th change arrives, maybe_flush/1 fires because change_count >= max_batch_size. It emits a fragment with last_log_offset: state.last_log_offset (the offset of change #100), then creates a fresh, empty TransactionFragment for the continuation — but crucially does not preserve last_log_offset on the converter state:

# line 320 — new state has bare TransactionFragment, state.last_log_offset not carried over
%{state | txn_fragment: %TransactionFragment{xid: state.current_xid, lsn: fragment.lsn}}

2. Commit arrives with nil offset (message_converter.ex:223-248)

Since no more changes come before the commit, state.last_log_offset is still nil. Line 230 falls back:

last_log_offset = state.last_log_offset || LogOffset.new(Lsn.to_integer(fragment.lsn), 0)

This produces an offset with op_offset: 0, which is less than the flushed data fragment's offset.

3. Guard drops the commit (shape_log_collector.ex:470-484)

The deduplication guard checks is_log_offset_lte(txn_fragment.last_log_offset, last_processed_offset). Since the commit fragment's offset (op 0) <= the data fragment's offset (op 100+), the commit is silently dropped. The transaction never completes for shape consumers.
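
To make the comparison concrete, here is a toy illustration with offsets modeled as plain {lsn, op} tuples, assuming both fragments carry the same LSN (Electric's real LogOffset struct and guard macro differ):

# Toy model: offsets as {lsn, op_offset} tuples compared by Erlang term order.
flushed_data_offset = {100, 100}  # offset of the 100th change in the flushed fragment
commit_offset       = {100, 0}    # fallback offset computed for the empty commit fragment

commit_offset <= flushed_data_offset
#=> true, so the guard treats the commit as already processed and drops it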

★ Insight ───────────────────────────────────── This is a classic boundary-condition bug: the batching logic works for any change count except exact multiples of the batch size. The || fallback on line 230 was meant to handle empty transactions, but it inadvertently masked the lost state after a flush. ─────────────────────────────────────────────────

Suggested Fix

A fix already exists on branch origin/alco/fix-max-change-batch-size-edge-case (commit f7c236f94). The approach:

File: packages/sync-service/lib/electric/postgres/replication_client/message_converter.ex

  1. Hold back the last change during flush — Instead of flushing all 100 changes, retain the last one in the new fragment. This guarantees the commit fragment always has at least one change with a valid log_offset greater than the data fragment's.

  2. Add finalize_txn_fragment/1 helper — Derives last_log_offset from the actual changes list rather than relying on converter state, eliminating the nil fallback path.

  3. Remove last_log_offset and current_xid from converter state — These are now derived from the fragment's changes directly, removing the class of bug where state gets out of sync after a flush.
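
A minimal sketch of point 1, assuming the fragment's changes list is ordered oldest to newest (the real TransactionFragment fields and flush plumbing differ):

# Hypothetical flush that retains the newest change in the continuation fragment,
# so the eventual commit fragment always carries a change with a real log offset.
defp flush_keeping_last(%{changes: changes} = fragment) when length(changes) > 1 do
  {to_flush, [held_back]} = Enum.split(changes, -1)

  flushed_fragment = %{fragment | changes: to_flush}
  continuation = %{fragment | changes: [held_back]}

  {flushed_fragment, continuation}
end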

No changes needed to ShapeLogCollector — the guard clause is correct; it's the offset calculation that was wrong.

Effort Estimate

Small — The fix is already implemented and tested on origin/alco/fix-max-change-batch-size-edge-case. It touches one file substantively (message_converter.ex) with a well-scoped change. The branch also includes unit tests, integration tests, and a Lux integration test.

Suggested Comment

Thanks for the excellent bug report and analysis @blj — you nailed the root cause. We have a fix on the alco/fix-max-change-batch-size-edge-case branch that retains the last change during a batch flush so the commit fragment always has a strictly greater offset than the data fragment. We'll get this merged soon.

That confirms it. The problematic PR #3682 shipped in 1.3.0 and the fix (PR #3716) shipped in 1.3.4. Now I have everything I need.


Summary

On Electric 1.3.0, the user experiences NimblePool checkout timeouts on the SQLite read connection pool (ShapeDb.Connection, :read) during startup when ~10-20 tables are loaded concurrently on ECS Fargate with 4 vCPUs. The error cascades — most shapes get 503s, and only a few load successfully. Downgrading to 1.2.11 resolves the issue.

Root Cause Analysis

The user's suspicion is correct: PR #3682 (cb9f5711f, shipped in v1.3.0) is the root cause.

PR #3682 moved shape metadata storage from ETS to SQLite, introducing NimblePool-managed connection pools for reads and writes. The hot-path function validate_shape_handle/3 — called on every incoming request that includes a shape_handle — now required checking out a connection from the read pool to query SQLite.

The bottleneck:

  • Read pool size: System.schedulers_online() connections — only 4 on the user's 4 vCPU Fargate instance
  • Checkout timeout: hardcoded 5 seconds (connection.ex:160)
  • Max concurrent requests allowed: up to 1000 (config.ex:70)

On startup with many shapes loading simultaneously, the 4 pool connections are saturated. Requests queue up, the 5-second timeout fires, and the cascade begins — each timeout frees a slot but the backlog never clears fast enough.

Key files involved:

  • packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/supervisor.ex:33 — pool size = System.schedulers_online()
  • packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/connection.ex:160 — 5s checkout timeout
  • packages/sync-service/lib/electric/shape_cache/shape_status.ex — validate_shape_handle/3 (was the hot-path bottleneck)

Suggested Fix

This is already fixed by PR #3716 (ba6dd2c7d, "Move SQLite writes to a background thread and buffer read state"), which shipped in v1.3.4.

The fix moved the hot-path validate_shape_handle/3 from a NimblePool/SQLite checkout to a direct ETS table lookup (shape_status.ex:188), making it effectively instantaneous. Additionally, a WriteBuffer was introduced to cache recently written shape data in ETS, reducing read pool pressure for other operations like handle_for_shape and shape_for_handle.
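
Conceptually, the hot path moved from a pooled SQLite query to something like the following (simplified illustration; the actual shape_status.ex code and table layout differ):

# Validate a shape handle with a direct ETS lookup; no pool checkout involved.
def validate_shape_handle(ets_table, shape_handle) do
  case :ets.lookup(ets_table, shape_handle) do
    [{^shape_handle, _shape_meta}] -> :ok
    [] -> {:error, :unknown_shape_handle}
  end
end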

The user should upgrade to v1.3.4+ (or v1.4.x) to resolve this.

Effort Estimate

N/A — Already fixed. The user just needs to upgrade from 1.3.0 to 1.3.4+.

Suggested Comment

Thanks for the detailed report and stack trace! Your suspicion about PR #3682 was spot-on — the SQLite-backed metadata introduced in v1.3.0 created a bottleneck on the read connection pool under concurrent load. This was addressed in PR #3716 (shipped in v1.3.4), which moved the hot-path validation back to ETS lookups and added a write buffer to reduce pool pressure. Upgrading to v1.3.4 or later should resolve the issue.

The fix is included in @core/[email protected] and later. The user reported they're on Electric 1.3.0.

Summary

When using shapes with subquery WHERE clauses (e.g., task_id IN (SELECT id FROM tasks WHERE ...)), if any row in the primary table has a NULL value in the column referenced by the subquery, Electric crashes with ArgumentError: not an iodata term during snapshot creation. The user is on Electric 1.3.0 and using nested shape dependencies with a non-public schema.

Root cause analysis

Already fixed in PR #3750 (commit c162905cd, merged Jan 21, 2026).

In make_tags at packages/sync-service/lib/electric/shapes/querying.ex, the tag generation SQL was:

md5('stack_id_handle' || "column"::text)

When the column is NULL:

  1. "column"::text evaluates to NULL
  2. '...' || NULL = NULL in PostgreSQL (concatenation with NULL yields NULL)
  3. This NULL propagates through all || string concatenation in the row's JSON expression ('{' || key_part || ',' || value_part || ',' || headers_part || '}')
  4. The entire SELECT expression evaluates to NULL, so Postgrex returns nil
  5. IO.iodata_length(nil) at querying.ex:133 raises ArgumentError: not an iodata term

Suggested fix

Already fixed. The fix wraps column references with NULL-safe CASE expressions:

md5('...' || CASE WHEN "col"::text IS NULL THEN 'NULL' ELSE 'v:' || "col"::text END)

This ensures no NULL propagation, and distinguishes actual NULL from the string literal 'NULL' via the v: prefix. The fix is in @core/[email protected]+.
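
As an illustration, the NULL-safe wrapping could be produced by a helper along these lines (hypothetical helper name; the functions actually added are listed below):

# `quoted_col` is assumed to be an already-quoted column reference such as ~s("team_id");
# returns a NULL-safe SQL fragment suitable for the tag hash expression.
defp null_safe_value_sql(quoted_col) do
  "CASE WHEN #{quoted_col}::text IS NULL THEN 'NULL' ELSE 'v:' || #{quoted_col}::text END"
end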

Key files changed:

  • packages/sync-service/lib/electric/shapes/querying.ex — added pg_namespace_value_sql/1 and updated make_tags/3
  • packages/sync-service/lib/electric/shapes/shape/subquery_moves.ex — added namespace_value/1, value_prefix/0, null_sentinel/0

Effort estimate

Already resolved — The fix shipped in v1.3.2. The issue can be closed.

Suggested comment

Thanks for reporting this, @mwildehahn! This was a bug in how we generated tag hashes for shapes with subquery WHERE clauses — NULL column values would propagate through PostgreSQL's string concatenation and crash the snapshot. This was fixed in PR #3750 and shipped in Electric v1.3.2. Upgrading from 1.3.0 to 1.3.2+ should resolve it.

Now I have all the details needed. Here's the investigation:


Summary

When a shape with subquery dependencies (e.g. WHERE team_id IN (SELECT ...)) fails during initialization, the error messages exposed in server logs and HTTP responses are opaque GenServer crash traces. The user sees cascading GenServer terminating / exited in: GenServer.call(nil, :await_snapshot_start, :infinity) logs rather than a clear explanation of why the shape failed. The HTTP client likely receives a generic "Unable to retrieve shape log: ..." 500 error with Elixir internal terms.

Root cause analysis

The error originates from a dependency materializer that can't start because the process it depends on (the Consumer for the subquery shape) is already dead or never started. The information loss happens at three layers:

  1. Materializer exits silently (materializer.ex:127-134): When Consumer.await_snapshot_start returns {:error, _reason} or throws an :exit, the Materializer logs a generic warning and stops with :shutdown — discarding the original reason entirely.

  2. Consumer detects dead materializer but doesn't explain why (consumer.ex:732-743): The parent shape's Consumer calls all_materializers_alive?/1 and, finding the materializer dead, stops itself with {:shutdown, :cleanup}. The only log is a Logger.warning("Materializer for #{shape_handle} is not alive, invalidating shape") — no context about why it died.

  3. ShapeCache doesn't catch {:shutdown, :cleanup} exits (shape_cache.ex:130-146): The await_snapshot_start/2 function catches :timeout and :noproc exits, but when the Consumer stops with {:shutdown, :cleanup}, the exit propagates as an uncaught exception through GenServer.call. This bubbles up to ServeShapePlug.handle_errors/2 (serve_shape_plug.ex:346-361), which formats the full exception + stack trace with Exception.format/2 and dumps it into the JSON response.

The net effect: the real error (e.g., a dependent subquery shape couldn't snapshot) is lost at layer 1, replaced by a generic shutdown at layer 2, and then the raw GenServer crash trace leaks to both logs and client at layer 3.

Suggested fix

Layer 1 — Preserve error context in Materializer (materializer.ex:127-128):

  • When {:error, reason} is received, stop with {:shutdown, {:materializer_failed, shape_handle, reason}} instead of bare :shutdown, so downstream monitors can see why.

Layer 2 — Propagate a structured error from Consumer (consumer.ex:139-163, consumer.ex:653-654):

  • In all_materializers_alive?/1, when a materializer is dead, store the dependency handle that failed in state so reply_to_snapshot_waiters can include it in the error.
  • Change stop_and_clean/1 to reply to snapshot waiters with a structured error like {:error, {:dependency_failed, dep_shape_handle}} before stopping.

Layer 3 — Catch {:shutdown, _} exits in ShapeCache (shape_cache.ex:130-146):

  • Add a catch clause for {:shutdown, :cleanup} (and other shutdown variants) that returns a descriptive {:error, ...} tuple instead of letting it propagate as an uncaught exception:
    :exit, {{:shutdown, :cleanup}, _} ->
      {:error, %RuntimeError{message: "Shape was invalidated during initialization"}}
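
Expanded into a fuller sketch (call shape and message names simplified; the real ShapeCache function catches additional exit reasons):

def await_snapshot_start(consumer) do
  GenServer.call(consumer, :await_snapshot_start, :infinity)
catch
  :exit, {:noproc, _} ->
    {:error, %RuntimeError{message: "Shape consumer is not running"}}

  :exit, {{:shutdown, :cleanup}, _} ->
    # The consumer tore itself down (e.g. a dependency materializer died);
    # return a descriptive error instead of letting the exit propagate.
    {:error, %RuntimeError{message: "Shape was invalidated during initialization"}}
end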

Layer 4 — Sanitize client-facing errors (serve_shape_plug.ex:346-361):

  • The catch-all handle_errors/2 should never return raw Exception.format output to clients. Replace with a safe generic message and log the full trace server-side only.

Layer 5 — Better error in API (api.ex:677-695):

  • The catch-all {:error, error} clause currently calls inspect(term) on arbitrary Elixir terms. For dependency failures, map to a user-friendly 503 with retry guidance rather than a 500 with internal terms.

Effort estimate

Medium — The fix touches ~5 files across the shape lifecycle, but each change is small and self-contained. The main challenge is designing a consistent error struct that threads context through the Materializer → Consumer → ShapeCache → API pipeline without breaking existing error handling for other scenarios. Testing the subquery failure path end-to-end will also require some care.

Suggested comment

Thanks for the detailed report! The core issue is that when a shape with subquery dependencies fails to initialize, the original error reason gets lost as it propagates through several GenServer layers, and the raw Elixir crash trace leaks to the client. We'd need to thread structured error context from the materializer through to the HTTP response and sanitize the client-facing output. This is on our radar to improve.

Let me proceed with the information I have to write the investigation.


Summary

User reported that when using tagged_subqueries with a WHERE id IN (SELECT ...) clause, adding a new row to the inner query's table (workspace_members) did not cause the outer shape (profiles) to sync the newly-matching row. After debugging, the reporter determined the server-side sync service is working correctly — move-in messages are generated and sent. The bug is in pglite-sync (the client-side library in the electric-sql/pglite repo), which silently drops move-in messages.

Root Cause Analysis

The reporter identified and fixed two client-side bugs in pglite-sync (not in this repo), submitted as pglite PR #883:

Bug 1 — LSN filtering drops move-in messages: The Electric server correctly includes "is_move_in": true in move-in message headers (see packages/sync-service/lib/electric/shapes/querying.ex:15), but move-in messages don't carry an lsn header because they originate from direct DB queries, not the replication stream. pglite-sync was defaulting the missing LSN to 0 and filtering these messages as "already seen" (since 0 <= lastCommittedLsn is always true).

Bug 2 — Duplicate key errors on insert: Move-in query results can overlap with rows already synced during the initial snapshot. pglite-sync was doing plain INSERT (no ON CONFLICT) for move-in data, causing duplicate key constraint violations that silently failed.

Server-side code is correct. The move-in pipeline in this repo works as designed:

  • move_handling.ex:16-68 — triggers async move-in queries when materializer detects dependency changes
  • querying.ex:11-29 — runs the move-in query with is_move_in: true header
  • move_handling.ex:98-123 — splices results into the shape log with proper dedup via touch tracker

★ Insight ───────────────────────────────────── The Electric architecture splits move-in deduplication across server and client. The server uses a touch_tracker to avoid sending rows that have fresher versions in-flight, but relies on the client to handle the snapshot-end control message (with xmin/xmax/xip_list) for visibility filtering. pglite-sync wasn't doing its part of the contract. ─────────────────────────────────────────────────

Suggested Fix

No changes needed in this repo (electric-sql/electric). The fix lives entirely in the electric-sql/pglite repo:

  1. LSN filtering bypass: Check for is_move_in header and skip LSN-based "already seen" filtering for those messages
  2. Upsert for move-ins: Use INSERT ... ON CONFLICT DO UPDATE for move-in inserts instead of plain INSERT

These are addressed in pglite PR #883.

Effort Estimate

Small — No work needed in this repo. The fix is a client-side patch in pglite-sync (already submitted as PR #883).

Suggested Comment

Thanks for the thorough investigation and for tracking this all the way to a fix, @jbingen! You're right that the server-side tagged_subqueries pipeline is working correctly — the issue was entirely in pglite-sync's handling of move-in messages. We'll review and get pglite PR #883 merged. Closing this issue since the fix lives in the pglite repo.

I now have a clear picture. Let me write the investigation report.


Summary

When using subquery-based shapes (e.g., WHERE id IN (SELECT user_id FROM workspace_members WHERE workspace_id = $1)), a new row inserted into workspace_members replicates instantly, but the corresponding "move-in" of the matching profiles row takes ~15 seconds. The user expects both events to arrive with similar latency.

Root Cause Analysis

There are two likely causes, depending on the user's configuration:

Most likely: tagged_subqueries feature flag is not enabled (default in production)

The tagged_subqueries flag defaults to disabled in production (ELECTRIC_FEATURE_FLAGS defaults to []). Without this flag, every materializer change notification triggers full shape invalidation via stop_and_clean:

packages/sync-service/lib/electric/shapes/consumer.ex:283-299

tagged_subqueries_enabled? = "tagged_subqueries" in feature_flags

should_invalidate? =
  not tagged_subqueries_enabled? or state.or_with_subquery? or ...

if should_invalidate? do
  stop_and_clean(state)  # Tears down the shape entirely

When the shape is invalidated, it must be completely recreated — new initial snapshot query, publication setup, resubscription to the replication stream. This easily accounts for ~15 seconds, especially on a hosted Railway instance.

Secondary: Snapshot pool contention (even with the flag enabled)

When tagged_subqueries IS enabled, move-in events trigger an async database query via PartialModes.query_move_in_async (partial_modes.ex:59-93), which calls SnapshotQuery.execute_for_shape (snapshot_query.ex:41-66). This acquires a connection from the snapshot pool, which has aggressive queuing parameters:

packages/sync-service/lib/electric/connection/manager.ex:399-401

pool_queue_config =
  if pool_role == :snapshot,
    do: [queue_target: 5_000, queue_interval: 10_000],

Under load (many shapes being snapshotted concurrently), a move-in query can wait 5-10 seconds just to get a database connection. The snapshot pool gets ~75% of db_pool_size (default 20), so roughly 15 connections. If those are busy with initial snapshots for other shapes, the move-in query queues.

★ Insight ───────────────────────────────────── The architecture separates regular replication from move-in handling: regular changes flow synchronously through the ShapeLogCollector → Consumer pipeline in a single GenServer.call, while move-ins require a second async round-trip to the database. This two-phase design is necessary because the outer shape needs to query Postgres for the actual rows matching the newly moved-in keys — it can't just derive them from the WAL event. The DependencyLayers mechanism ensures inner shapes (the subquery) process first, so the materializer detects the new value before the outer shape even sees the transaction. ─────────────────────────────────────────────────

Suggested Fix

Short-term (documentation/config):

  1. Ensure the user has ELECTRIC_FEATURE_FLAGS=allow_subqueries,tagged_subqueries set. This is almost certainly the issue since it defaults to disabled.

Medium-term (reduce pool contention for move-in queries):

  2. In manager.ex:399-401, consider lower queue_target/queue_interval values for move-in queries, or use a separate smaller pool for move-in queries to avoid contention with initial snapshots. The current 5s/10s queuing parameters are tuned for long-running initial snapshots, not latency-sensitive move-in queries.

Longer-term:

  3. Consider prioritizing move-in queries over initial snapshot queries in the pool, since move-in latency directly affects user-visible real-time sync quality.
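
For the medium-term option, a dedicated latency-tuned pool might look roughly like this (illustrative DBConnection options only; the pool name and sizes are not Electric's actual configuration):

# Hypothetical: a small, latency-sensitive pool reserved for move-in queries.
move_in_pool_opts = [
  name: :move_in_query_pool,
  pool_size: 4,
  queue_target: 500,      # ms a caller may queue before the pool is considered slow
  queue_interval: 1_000   # ms window over which queue_target is evaluated
]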

Effort Estimate

Small — if this is purely a configuration issue (missing tagged_subqueries feature flag), it's a one-line env var change for the user. If the pool contention path needs addressing, that's Medium effort (separate pool or priority queuing for move-in queries).

Suggested Comment

Thanks for the detailed report! The most likely cause is that the tagged_subqueries feature flag isn't enabled — it's off by default in production. Could you try setting ELECTRIC_FEATURE_FLAGS=allow_subqueries,tagged_subqueries and see if that brings the move-in latency down? Without this flag, every subquery dependency change causes a full shape rebuild rather than an incremental move-in, which would explain the ~15 second gap.

Summary

Adds Zeabur as a new integration provider to the Electric docs. Includes a new markdown doc page, sidebar nav entry, an SVG logo, and a JPG cover image. The doc describes one-click deployment on Zeabur with a deploy button, mentions the caching proxy, API token protection, and persistent disk defaults.

Strengths

  • Follows existing conventions well: Frontmatter structure, product icon, ## Electric and Zeabur section heading, and the > [!Tip] Need context? callout all match the established pattern (cf. render.md).
  • Alphabetical sidebar placement is correct — Zeabur goes after Supabase.
  • Concise and focused — doesn't over-promise or add unnecessary content.
  • The health check tip (/v1/health) mirrors the Render doc.

Concerns

  1. SVG has trailing whitespace/no newline at EOF — The zeabur.svg file ends with extra whitespace and \ No newline at end of file. Should add a final newline and trim trailing spaces on the last line.

  2. Doc is quite thin compared to peers — At 30 lines, this is notably shorter than Render (92 lines), Supabase (~129 lines), or PlanetScale (~148 lines). The Render doc, which is the closest analog, includes step-by-step instructions for deploying Electric manually, deploying a client app, and an IaC Blueprint example. The Zeabur doc essentially says "click this button" with minimal guidance on what happens next (e.g., how to set DATABASE_URL, how to configure the persistent disk path, how to connect a client app). Users who hit issues will have nowhere to go except the Zeabur template instructions.

  3. No mention of ELECTRIC_STORAGE_DIR or persistent disk setup — The doc says "a persistent disk" is included in defaults, but doesn't explain what mount path is used or how ELECTRIC_STORAGE_DIR should be configured. The Render doc covers this explicitly.

  4. Cover image noted as potentially needing replacement — The PR author themselves flagged this: "the cover image was created using Photoshop and may need to be changed." The maintainers should verify the image meets their brand standards before merging.

  5. No DATABASE_URL guidance — The PostgreSQL template link is provided but there's no explanation of how to wire the database URL to the Electric service. This is a critical step for users.

Verdict

Merge with minor fixes — The structure is sound and follows project conventions. The SVG EOF issue should be fixed, and the doc would benefit from a few more sentences explaining DATABASE_URL configuration and ELECTRIC_STORAGE_DIR, bringing it closer to the depth of the Render doc. These could be addressed as follow-up suggestions rather than blocking the merge.

Suggested comment

Thanks for adding Zeabur integration docs! The structure follows our conventions well. Two small requests: (1) fix the trailing whitespace / missing newline at EOF in zeabur.svg, and (2) consider adding a brief note on how DATABASE_URL gets wired to the Electric service and how ELECTRIC_STORAGE_DIR / persistent disk is configured — this would help users who run into issues after the one-click deploy. Happy to merge once those are addressed.

Summary

Fixes a broken link in the AWS integration docs. The link text says "AWS Elastic Container Service" but pointed to aws.amazon.com/efs (Elastic File System) instead of aws.amazon.com/ecs (Elastic Container Service). One character change: efs → ecs.

Strengths

  • Correct fix. The link now matches the service it describes.
  • Minimal, surgical change with zero risk of side effects.

Concerns

  • None. Worth noting there's a typo on the next line ("ephemoral" should be "ephemeral"), but that's out of scope for this PR.

Verdict

Merge.

Suggested comment

Thanks for catching this! Good eye — efs vs ecs is an easy one to miss. Merging as-is. (Minor aside: "ephemoral" on the next line is also a typo if anyone wants to fix that in a follow-up.)

Here's my review:


Summary

Adds support for connecting to Postgres via Unix sockets by parsing the ?host=/path/to/socket query parameter from the DATABASE_URL. This enables deployments like Google Cloud Run + Cloud SQL Proxy where Unix sockets are required. The change touches URI parsing in config.ex and connection option population in connection_resolver.ex.

Strengths

  • Solves a real deployment need (GCP Cloud SQL Proxy, local socket connections) with a clean approach.
  • Correctly disables SSL for Unix socket connections — sockets don't go over the network.
  • Good doctest coverage for the happy paths, including combinations with sslmode.
  • The populate_socket_dir/1 function follows the existing pattern of the other populate_* functions in ConnectionResolver.

Concerns

  1. Pattern matching in parse_url_query is fragile and incomplete. The new clauses only match %{"host" => _} and %{"host" => _, "sslmode" => _}. A URI like ?host=/tmp&replication=off or ?host=/tmp&foo=bar will fall through to the catch-all map -> clause and return an unhelpful "unsupported query options" error that includes host itself. Every existing clause has this same structural problem, but adding host makes it worse since all the combinations need handling. The host + invalid sslmode case (e.g., ?host=/tmp&sslmode=verify-full) also isn't handled.

  2. String.to_existing_atom(sslmode) in the new host + sslmode clause duplicates logic from the existing sslmode-only clause. If someone adds a new valid sslmode later, they'd need to update both places.

  3. No unit tests beyond doctests. The ConnectionResolverTest has no test for socket_dir flowing through populate_connection_opts. A test verifying that hostname and port are stripped and ssl is false when socket_dir is present would be valuable.

  4. populate_socket_dir is a no-op for the socket_dir key itself — the non-nil branch does Keyword.put(:socket_dir, socket_dir) which just re-sets what's already there. The function's real purpose is deleting hostname/port, so the put is redundant (harmless but misleading with the comment saying "Replace hostname with socket_dir").

  5. No validation of the socket path. A ?host=not-a-path would be silently accepted. At minimum, checking it starts with / (i.e., is an absolute path) would catch misconfiguration early.

  6. host in the query string vs. the URI host. When ?host=/cloudsql/... is present, the URI host component (e.g., localhost) is still validated and required but ultimately discarded. This is consistent with libpq behavior but worth a docstring note.

Verdict

Needs work. The core idea is sound, but the parse_url_query pattern matching needs to be refactored to handle host as a composable option alongside sslmode rather than adding more rigid pattern-match clauses. The lack of tests beyond doctests is also a gap for a feature that touches connection setup.
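
For illustration, parsing recognized keys independently might look roughly like this (a sketch only, with simplified option handling; the actual parse_url_query return shape and sslmode validation differ):

# Hypothetical refactor: fold over the query map and handle each recognized key
# on its own, so host, sslmode, etc. compose without enumerating map shapes.
defp parse_url_query(query_map) do
  Enum.reduce_while(query_map, {:ok, []}, fn
    {"sslmode", mode}, {:ok, acc} when mode in ["disable", "allow", "prefer", "require"] ->
      {:cont, {:ok, [{:sslmode, String.to_atom(mode)} | acc]}}

    {"host", "/" <> _ = socket_dir}, {:ok, acc} ->
      # Only absolute paths are accepted as Unix socket directories.
      {:cont, {:ok, [{:socket_dir, socket_dir} | acc]}}

    {key, _value}, _acc ->
      {:halt, {:error, "unsupported query option: #{key}"}}
  end)
end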

Suggested comment

Thanks for tackling this — Unix socket support is a real need for Cloud SQL and similar setups! A couple of things before this is ready: the parse_url_query pattern matching gets combinatorially tricky with another option added — consider refactoring it to parse recognized keys independently rather than matching exact map shapes. Also, adding a test in ConnectionResolverTest for the socket_dir path through populate_connection_opts would help ensure the hostname/port stripping and SSL disabling work correctly end-to-end.

Good — removeSnapshot only deletes from activeSnapshots, not from snapshotsByDatabaseLsn. So the existing code has a minor leak: entries accumulate in snapshotsByDatabaseLsn and are never cleaned up. The PR removes this leaky map entirely.


Summary

Removes the lastSeenUpdate method and its backing snapshotsByDatabaseLsn map from SnapshotTracker. Neither is called from anywhere in the codebase — the method was likely intended to integrate with global_last_seen_lsn (added in PR #3159) but that wiring never happened. The corresponding test is also removed.

Strengths

  • Correct identification of dead code. Grep confirms lastSeenUpdate is never invoked outside its own definition and test. The snapshotsByDatabaseLsn map is only populated and never consumed.
  • Fixes a minor memory leak. snapshotsByDatabaseLsn accumulated entries that were never cleaned up by removeSnapshot, so removing it is a net improvement even beyond dead-code hygiene.
  • Clean, minimal diff. Pure deletion — 0 additions, 66 deletions. No behavioral changes to existing code paths.

Concerns

  • None substantive. The xmaxSnapshots-based cleanup in shouldRejectMessage remains intact and is the active mechanism for snapshot eviction. The removed code was purely vestigial.
  • One nitpick: The class doc comment (line 14) still references removeSnapshot() as a public API, but it's actually private-ish (only called internally from shouldRejectMessage). That's pre-existing and out of scope for this PR, but worth noting.

Verdict

Merge. Straightforward dead code removal with no risk.

Suggested comment

Clean catch — this was indeed dead code with a minor leak (entries in snapshotsByDatabaseLsn were never cleaned up by removeSnapshot). LGTM, thanks for the contribution!

I now have a complete picture. Here's my review:


Summary

Adds a resolveShapeUrl function that resolves relative URLs (e.g., /v1/shape) against globalThis.location.origin in browser contexts, and throws InvalidShapeOptionsError for relative URLs in non-browser contexts. This simplifies configuration when the Electric proxy is same-origin (common with Next.js). Tests cover both the browser-mock and non-browser paths.

Strengths

  • Clean implementation — the try/catch on new URL(url) to detect relative URLs is the idiomatic approach, and resolution happens before validateOptions so the rest of the pipeline sees a clean absolute URL.
  • Good error boundary — uses a dedicated InvalidShapeOptionsError with a clear message rather than letting new URL() throw a cryptic TypeError.
  • Tests are thorough — both the happy path (mocked browser location) and the error path (non-browser) are covered, with proper cleanup in finally blocks.
  • Minimal diff — only touches what's needed, no unnecessary refactoring.

Concerns

  1. InvalidShapeOptionsError is not exported from the public API. src/index.ts only re-exports FetchError from ./error. The test imports it from ../src/error directly, which works, but external consumers who want to catch this specific error programmatically (e.g., to show a helpful message in SSR) can't import { InvalidShapeOptionsError } from '@electric-sql/client'. It should be added to the barrel export.

  2. Protocol-relative URLs (//cdn.example.com/v1/shape) and other edge cases. new URL("//cdn.example.com/v1/shape") will throw, so it falls through to the relative-URL branch and would get resolved against location.origin, producing https://cdn.example.com/v1/shape in a browser — which happens to work but only by accident. Worth a brief comment or test acknowledging this, or handling it explicitly.

  3. globalThis.location check could be slightly tighter. globalThis.location?.origin being a string is a reasonable heuristic, but some non-browser runtimes (e.g., Cloudflare Workers with location polyfills, or JSDOM in tests) might set location.origin to something unexpected. This is a minor nit — the current check is pragmatic and probably fine, but worth flagging.

  4. MissingShapeUrlError vs InvalidShapeOptionsError naming. MissingShapeUrlError already exists for when url is falsy. The new error covers a conceptually similar case (bad URL). Consider whether these should share a parent class or whether MissingShapeUrlError should be a subclass of InvalidShapeOptionsError for consistency — but this is a minor design preference, not a blocker.

Verdict

Merge with minor fixes — the core implementation is solid. The missing public export of InvalidShapeOptionsError is the only thing I'd want addressed before merge.

Suggested comment

Nice clean implementation! One thing to address: InvalidShapeOptionsError isn't re-exported from src/index.ts, so package consumers can't catch it. Could you add it to the barrel export alongside FetchError? The rest looks good to merge.

Now I have everything I need to write the review.

Summary

When a transaction has exactly max_change_batch_size (100) changes, maybe_flush emits a fragment and resets the state. At commit time, the resulting empty commit fragment gets last_log_offset from state.last_log_offset — which equals the already-flushed fragment's offset. ShapeLogCollector.handle_txn_fragment/2 then drops it via its is_log_offset_lte guard, silently swallowing the commit. The fix uses msg.end_lsn for such empty post-flush commit fragments so they always have a unique, greater offset.

Strengths

  • Correct root cause identification. The bug is subtle — it only manifests at exact batch boundaries (100, 200, 300…). The analysis of the interaction between maybe_flush, state.last_log_offset, and the ShapeLogCollector guard is accurate.
  • Minimal fix. The production change is 10 lines: a new predicate empty_commit_fragment_after_flush?/3 and a conditional in convert(%LR.Commit{}). It doesn't alter any other code path.
  • Predicate is well-named and clearly scoped. The three conditions (change_count == 0, last_log_offset not nil, end_lsn not nil) precisely identify the problematic case.

Concerns

  1. Unit tests for MessageConverter are missing. The fix changes MessageConverter.convert/2 but there are no unit tests for the module itself — only integration tests. A focused unit test that feeds a sequence of LR.Begin, 100× LR.Insert, LR.Commit messages into the converter and asserts the commit fragment's last_log_offset > flushed fragment's last_log_offset would be far cheaper to run and more precise at catching regressions than a full integration test.

  2. The test file is 1049 lines and heavily repetitive. The describe blocks for INSERT, UPDATE, and DELETE all duplicate the same batch_sizes list and loop structure. This could be collapsed substantially using a parameterized helper or a shared for loop across operation types. As-is, it's a maintenance burden.

  3. SQL injection in test helpers. Functions like execute_batch_insert build SQL via string interpolation with only single-quote escaping (String.replace(id, "'", "''")). These are test-only and the inputs are generated UUIDs, so it's not a real vulnerability, but using parameterized queries ($1, $2) would be more idiomatic for Postgrex and avoid the pattern entirely (see the sketch after this list).

  4. generate_uuid/1 produces invalid UUIDs. "00000000-0000-0000-0000-000000000001" is technically a valid UUID format but version/variant bits are wrong. Fine for testing, but worth noting if any downstream code validates UUID versions.

  5. Silent timeout fallback in collect_all_* helpers. When await_count times out, the helpers silently fall through to collect_messages with a 1-second timeout and return whatever was collected. The test then asserts on length(messages) == batch_size which will fail — but the failure message won't indicate it was a timeout. A clearer pattern would be to flunk on timeout directly.

  6. The not is_nil(msg.end_lsn) guard feels defensive. Is there ever a real case where end_lsn is nil on a Commit message? If not, this guard is dead logic. If yes, the nil case deserves a comment explaining what happens.
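
On point 3, the test helpers could use Postgrex placeholders instead of interpolation; table and column names here are illustrative:

# Parameterized insert; no manual quote escaping needed.
Postgrex.query!(
  conn,
  "INSERT INTO widgets (id, value) VALUES ($1, $2)",
  [id, value]
)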

Verdict

Merge with minor fixes. The core fix is correct and important. The test file should be refactored to reduce duplication, and a unit test for the MessageConverter boundary case should be added. Neither blocks merging.

Suggested comment

The fix looks correct and well-scoped — nice catch on this boundary condition. Two requests before merge: (1) could you add a unit test directly on MessageConverter.convert/2 that exercises the exact-batch-boundary scenario (100 changes → flush → commit)? It'll be much faster to run and more targeted than the integration tests. (2) The test file has a lot of structural repetition across INSERT/UPDATE/DELETE — consider extracting the batch size loop into a shared helper to cut the file roughly in half.

Now I have a clear picture. One detail to confirm before writing the review: the PR places noExternal inside commonOptions, so I need to check where it sits relative to the ...options spread in the current file.

The diff shows the noExternal line is added at line 11 (after sourcemap: true, on line 10), which is inside commonOptions and before ...options. This means it will apply to all four build targets (ESM, CJS, legacy-esm, browser). That's correct behavior.

The key issue: pnpm patches are applied at the node_modules level, so locally during development the patched code is available. But when tsup bundles the package for npm publishing, by default it marks dependencies as external (not bundled), so consumers get the unpatched upstream version from npm. Adding noExternal forces tsup to inline the (locally patched) source into the bundle.

Summary

The @microsoft/fetch-event-source library has a pnpm patch that fixes browser-only assumptions (window.*, document.*) and abort signal handling bugs. However, tsup treats dependencies as external by default, so the published npm package ships with an import pointing at the unpatched upstream version. This one-line fix adds noExternal: ['@microsoft/fetch-event-source'] to tsup.config.ts, forcing tsup to inline the patched source into the bundle output.

Strengths

  • Correct root cause identification. The author traced a subtle packaging gap: pnpm patches apply locally but vanish when the dependency is marked external at bundle time. The PR description is thorough with screenshots and reproduction.
  • Minimal, surgical fix. One line in the build config, plus a well-written changeset. No unnecessary changes.
  • Applies to all build targets. The noExternal is in commonOptions, so ESM, CJS, legacy-ESM, and browser builds all get the inlined patched code.

Concerns

  1. Bundle size increase. Inlining the library means consumers can no longer deduplicate it if they also depend on @microsoft/fetch-event-source directly. This is a very minor concern given the library is small (~5KB) and few consumers would import it independently.

  2. No test coverage for the fix. There's no integration or smoke test verifying the bundled output contains the patched code. A future change to tsup config or dependency resolution could silently regress this. That said, adding such a test may be out of scope for a community patch fix.

  3. Missing newline at end of changeset file. The diff shows \ No newline at end of file for .changeset/slow-apricots-clean.md. Trivial but worth noting.

  4. Long-term maintenance. The deeper question is whether the project should continue patching a library that hasn't been updated since 2022, or fork/vendor it properly. But that's a strategic decision, not a blocker for this PR.

Verdict

Merge — this is a correct, minimal fix for a real bug that breaks liveSse abort behavior for all consumers of the published package.

Suggested comment

Great catch and excellent write-up! The fix is clean and correct — noExternal is exactly the right way to ensure pnpm-patched dependencies survive the tsup bundling step. One tiny nit: the changeset file is missing a trailing newline. Otherwise this is good to merge as-is.

Now I have the full picture. Here's the review:


Summary

This PR adds the missing lsn and op_position fields to the ChangeMessage type in the TypeScript client, and fixes the API documentation examples where lsn was shown as a number (1234) instead of a string ('1234'), matching the OpenAPI schema which declares lsn as type: string.

Strengths

  • Correct types: lsn typed as `${bigint}` aligns with the pattern used elsewhere in this codebase (e.g. PostgresSnapshot.xmin, Offset). op_position as number matches the API schema's type: integer. Both are correctly optional since they're only present on event-stream operations.
  • API doc fix is valid: The schema at line 535 defines lsn as type: string, but the examples had bare 1234 (a number in YAML). Quoting them as '1234' is the right fix.
  • Expanding the Header index signature exclusion is correct — adding lsn, op_position, txids, tags, and removed_tags to the Exclude prevents the index signature from widening these known fields to Value. This is a nice catch that improves the existing typing beyond just the two new fields.

Concerns

  1. txids type mismatch (pre-existing, not introduced by this PR): txids is typed as number[] but the API schema shows them as strings (['1020']). The API description says "uint64 as strings" for xip_list in similar contexts. This is outside the scope of this PR but worth noting as a follow-up — txids probably should be `${bigint}`[] for consistency. The PR author shouldn't be asked to fix this, but the maintainers should track it.

  2. No tests added: There are no test changes, but since these are purely additive type declarations and a doc fix, this is acceptable. TypeScript compilation itself validates the type additions.

  3. The example lsn values in the YAML are still a bit odd: The JSON SSE example on line 750 already had "lsn":"1234" (correctly a string), while the non-SSE examples had bare 1234. This PR fixes the latter, which is correct and consistent.

Verdict

Merge — Clean, correct, narrowly scoped. The types match the API schema and existing patterns.

Suggested comment

Clean fix — the types match the OpenAPI schema and the existing `${bigint}` pattern in the codebase. One thing worth tracking separately: txids is currently typed as number[] but the API describes transaction IDs as uint64 strings elsewhere. Might be worth a follow-up to align that. Thanks for the contribution!

Now I have a thorough understanding of the existing codebase and the PR. Let me write the review.

Summary

Adds a new SKILL.md file at the repo root (an "agent skills" format file) and updates website/docs/agents.md to document it alongside the existing AGENTS.md. The SKILL.md content is largely a reformatted version of the existing AGENTS.md, adapted to the agentskills.io format with YAML frontmatter and slightly different structure.

Strengths

  • Reasonable idea — SKILL.md is an emerging format and providing one could help agents that support it.
  • The docs page changes are structurally sound (heading demotion from ## to ### for the AGENTS.md sections, parallel structure for SKILL.md section).
  • The author is upfront that this is LLM-generated and low priority.

Concerns

  1. Mostly duplicated content. SKILL.md is ~388 lines that substantially overlap with the existing 350-line AGENTS.md. This creates a maintenance burden — every update to Electric's API guidance must now be mirrored in two files. The two files already have subtle differences (e.g., SKILL.md adds a Drizzle schema section and a parser block in shapeOptions that AGENTS.md doesn't have; AGENTS.md covers framework integrations, migration from TanStack Query, and cross-collection joins that SKILL.md omits). These will diverge further over time.

  2. Build would break. The docs page links to /SKILL.md and uses <<< @/../SKILL.md to embed it, but the VitePress config's ignoreDeadLinks only covers /AGENTS(\.md)?/ — it doesn't include /SKILL.md. This will cause dead-link warnings or build failures.

  3. Sidebar not updated. The sidebar in config.mts (line 250) still says { text: 'AGENTS.md', link: '/docs/agents' } — if the page title changes to "AGENTS.md & SKILL.md", the sidebar label should match, or else it looks inconsistent.

  4. SKILL.md format questionable. The frontmatter has last_updated: 2025-09-18 matching AGENTS.md, but the content has drifted from the source. The reference links use [N] notation inline (e.g., [10][4]) but these aren't standard markdown footnotes — they're bare bracket pairs that won't render as links in most markdown renderers. AGENTS.md uses proper reference-style links ([Electric][10]) which do render.

  5. LLM-generated content quality. The "What are agent skills?" section in the docs page reads like generic filler ("progressive disclosure", "procedural knowledge"). It doesn't add information specific to Electric and inflates the page without much value.

  6. agentskills.io / skills.sh maturity. These are very new, niche formats. Adding official support for them may set expectations about maintenance that the team doesn't want to commit to.

Verdict

Close. The maintenance cost of keeping two near-identical instruction files in sync outweighs the marginal benefit. If SKILL.md adoption becomes significant, it would be better to auto-generate it from AGENTS.md or have a single source of truth rather than hand-maintaining two copies. The PR also has build issues that would need fixing.

Suggested comment

Thanks for the contribution and for being upfront about how it was produced! The core idea of providing a SKILL.md is interesting, but we're concerned about the maintenance burden of keeping two near-identical instruction files in sync — they've already diverged in subtle ways. If the skills format gains more traction, we'd likely want to auto-generate SKILL.md from AGENTS.md rather than maintain both by hand. We'll close this for now but appreciate the idea.

Now I have all the context needed.

Summary

This PR extends SubsetParams.params to accept string[] in addition to Record<string, string>, matching the existing array support in ShapeStreamOptions.params. The convertWhereParamsToObj function is refactored to operate on just the params field (rather than the whole params object), and the conversion is moved into toInternalParams so it applies uniformly.

Strengths

  • Clean refactor of convertWhereParamsToObj: The function now has a single responsibility — converting string[] to Record<string, string> — instead of wrapping/unwrapping the parent object. This is a meaningful simplification.
  • Good type narrowing: The SubsetParams.params type changes from Record<string, string> to Record<`${number}`, string> | string[], which better expresses what keys are actually valid (numeric positional params).
  • Tests cover both requestSnapshot and fetchSnapshot: Both methods are tested with the array form.

Concerns

  1. POST path doesn't convert array params. #buildSubsetBody (line 1755) passes opts.params directly into the request body without calling convertWhereParamsToObj. If a user passes params: ['A', 'C'] with method: 'POST', the server receives a raw JSON array instead of the {"1": "A", "2": "C"} object. The GET path handles this correctly (line 938 in the diff), but the POST path does not. This is a bug.

  2. Conversion happens in two places. The PR adds convertWhereParamsToObj inside toInternalParams (for ShapeStream constructor params) and inside #constructUrl (for snapshot GET params). With the POST path also needing it, there are now three call sites. It might be cleaner to normalize once at the entry point of requestSnapshot/fetchSnapshot before params flow into either path.

  3. Tests only cover GET. The test fixtures don't set method: 'POST', so the POST bug from point 1 isn't caught.

Verdict

Needs work — the POST path bug means array params will silently produce incorrect requests when using method: 'POST' or subsetMethod: 'POST'.

Suggested comment

Nice cleanup of convertWhereParamsToObj — the narrowed scope is much better. One issue: #buildSubsetBody at line 1755 passes opts.params through without calling convertWhereParamsToObj, so array-form params will be sent as a raw JSON array in POST requests instead of the expected {"1": "val", ...} object. Could you add the conversion there and a POST-mode test to cover it?
