Now I have the complete picture. Let me write the investigation report.
# Issue #3433: Container error: "Stack not ready after 5000ms. Reason: Timeout waiting for shape data to be loaded"

The Electric container enters an unrecoverable error state after restart, emitting `Stack not ready after 5000ms. Reason: Timeout waiting for shape data to be loaded`. A simple `docker compose restart` doesn't fix it because the persisted shape data survives the restart; only `docker compose down && docker compose up` (which discards the persisted shape data) resolves the issue. In the meantime, every API request gets a 503 response indefinitely.
★ Insight ─────────────────────────────────────
The initialization chain is essentially a relay race — each process must hand off readiness to the next. If any runner drops the baton, the whole stack stalls permanently with no recovery path.
─────────────────────────────────────────────────
The error originates from a specific readiness condition, `shape_log_collector_ready`, that is never satisfied. Here's the chain (a simplified sketch of the pattern follows the list):
- On startup, ShapeCache (`shape_cache.ex:189`) runs `handle_continue(:wait_for_restore)`, which:
  - Calls `PublicationManager.wait_for_restore()` (blocks via `GenServer.call` with an `:infinity` timeout until the RelationTracker finishes restoring publication filters from persisted shapes)
  - Then calls `ShapeLogCollector.mark_as_ready()` (`shape_log_collector.ex:76-81`, also an `:infinity` timeout)
  - `mark_as_ready` is the only call site that triggers `StatusMonitor.mark_shape_log_collector_ready` (`status_monitor.ex:101`)
- The StatusMonitor (`status_monitor.ex:262-273`) holds incoming API requests for up to `stack_ready_timeout` (5000ms, configured in `config.ex:66`). When the timeout fires (`status_monitor.ex:297-303`), it calls `timeout_message()` (`status_monitor.ex:349-382`), which checks which condition is still false (here, `shape_log_collector_ready`) and produces the error message.
- **The blocking scenario:** During `wait_for_restore`, the ShapeCache calls `restore_shape_and_dependencies` (`shape_cache.ex:309-345`) for each persisted shape. If any consumer fails to start (`start_shape` returns `:error` at lines 298-302), it logs the error and calls `ShapeCleaner.remove_shape`, but `wait_for_restore` itself doesn't fail; it just takes longer. The more likely scenario is that `PublicationManager.wait_for_restore` blocks because the RelationTracker's `handle_continue(:restore_relations)` (`relation_tracker.ex:152-191`) takes a long time: it iterates over all persisted shapes, adds them to the publication filters, then calls `update_publication_if_necessary`, which issues `ALTER PUBLICATION` statements against Postgres. If Postgres is slow to respond or the connection isn't fully ready, this blocks indefinitely.
- **Why restart doesn't help:** On restart, the persistent SQLite database (`ShapeDb`) and file storage still contain all the shapes. The same expensive restore cycle begins again and hits the same timeout. The 5000ms `stack_ready_timeout` is a per-request timeout, not a startup timeout; the init chain itself has no overall deadline and runs with `:infinity` timeouts. So the system is stuck: the init takes more than 5s, every API request times out at 5s, and the init never gives up.
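To make the failure mode concrete, here is a minimal, self-contained sketch of that pattern. All names (`ReadinessSketch.StatusMonitor`, `wait_until_ready/1`) are hypothetical; the sketch only illustrates the relationship between the bounded per-request wait and the unbounded init handoff, not Electric's actual implementation.

```elixir
# Minimal, self-contained sketch (hypothetical names) of the readiness pattern
# described above: API callers wait on a flag with a 5s timeout, while the
# init chain that is supposed to set the flag has no deadline of its own.
defmodule ReadinessSketch.StatusMonitor do
  use GenServer

  def start_link(_opts) do
    GenServer.start_link(__MODULE__, %{ready?: false, waiters: []}, name: __MODULE__)
  end

  # Called per API request: wait up to 5_000ms for the stack to become ready.
  # This mirrors the per-request `stack_ready_timeout`; it bounds each caller,
  # not the initialization itself.
  def wait_until_ready(timeout \\ 5_000) do
    GenServer.call(__MODULE__, :wait_until_ready, timeout)
  catch
    :exit, {:timeout, _} -> {:error, :stack_not_ready}
  end

  # Called once by the init chain after the shape restore completes. If the
  # restore never finishes, this is never invoked and every waiter times out.
  def mark_shape_log_collector_ready do
    GenServer.cast(__MODULE__, :ready)
  end

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call(:wait_until_ready, from, %{ready?: false} = state) do
    # Park the caller until the readiness flag flips (or the caller's own
    # GenServer.call timeout fires first).
    {:noreply, %{state | waiters: [from | state.waiters]}}
  end

  def handle_call(:wait_until_ready, _from, %{ready?: true} = state) do
    {:reply, :ok, state}
  end

  @impl true
  def handle_cast(:ready, state) do
    Enum.each(state.waiters, &GenServer.reply(&1, :ok))
    {:noreply, %{state | ready?: true, waiters: []}}
  end
end
```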
**Key root cause:** There is no timeout or error handling on the initialization chain itself. The `wait_for_restore` path in ShapeCache uses `:infinity` timeouts everywhere. If restoration takes too long (many shapes, slow Postgres, or Postgres connection issues), the `shape_log_collector_ready` condition is never set, and the stack is permanently "not ready" for all API requests.
There are several approaches, likely best combined:
**1. Add a finite timeout with a recovery path to the restore wait.** In `shape_cache.ex:194`, the `PublicationManager.wait_for_restore` call should have a finite timeout, with fallback behavior (e.g., clean all shapes and start fresh):

```elixir
# shape_cache.ex:194 - Add timeout and recovery
case PublicationManager.wait_for_restore(state.stack_id, timeout: 30_000) do
  :ok ->
    :ok

  {:error, :timeout} ->
    Logger.warning("Shape restore timed out, cleaning stale shapes")
    ShapeStatus.clean_all_shapes(state.stack_id)
end
```

**2. Handle systemic failures in `restore_shape_and_dependencies`.** In `shape_cache.ex:309-345`, when a consumer fails to start, the shape is cleaned up and the remaining shapes continue to restore. However, if the failure is systemic (e.g., Postgres not ready), all shapes fail one by one. Consider adding an early exit or retry-with-backoff, as sketched below.
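One possible shape for the retry-with-backoff variant: this is a fragment meant to sit inside the ShapeCache module, and `restore_one_shape/1`, the attempt count, and the delays are hypothetical placeholders rather than existing functions.

```elixir
# Hypothetical fragment: retry a single shape restore with exponential backoff
# before giving up. `restore_one_shape/1` stands in for the real per-shape
# restore step; it is not an existing function.
defp restore_with_backoff(shape_handle, attempts \\ 3, delay_ms \\ 500) do
  case restore_one_shape(shape_handle) do
    :ok ->
      :ok

    {:error, reason} when attempts > 1 ->
      Logger.warning("Restoring #{inspect(shape_handle)} failed (#{inspect(reason)}), retrying")
      Process.sleep(delay_ms)
      restore_with_backoff(shape_handle, attempts - 1, delay_ms * 2)

    {:error, reason} ->
      # Give up on this shape only. A systemic failure (e.g. Postgres down)
      # could instead abort the whole restore loop and surface the error.
      {:error, reason}
  end
end
```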
**3. Make `stack_ready_timeout` configurable and/or raise the default.** In `config.ex:66` and `api.ex:58`, the 5000ms default is aggressive. With many persisted shapes, restore can legitimately take longer. This should be configurable via environment variable (it may already be; check `ELECTRIC_STACK_READY_TIMEOUT` or similar).
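If the option is not already exposed, a runtime-config sketch could look like the following. Both the `ELECTRIC_STACK_READY_TIMEOUT` variable name and the `:electric` / `stack_ready_timeout` keys are assumptions based on the report above, not confirmed settings.

```elixir
# config/runtime.exs (sketch) - env var and config keys are assumed, not confirmed
import Config

config :electric,
  stack_ready_timeout:
    "ELECTRIC_STACK_READY_TIMEOUT"
    |> System.get_env("5000")
    |> String.to_integer()
```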
**4. Restore shapes in the background.** Instead of requiring the full restore to complete before marking `shape_log_collector_ready`, consider marking it ready optimistically and restoring shapes in the background. This would require `ShapeLogCollector` to queue incoming events while shapes are still restoring; a sketch of that buffering pattern follows.
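A self-contained sketch of the buffering pattern is below. Module, function, and message names are hypothetical; it only shows the idea of accepting events immediately, queueing them during restore, and flushing them in order once the restore finishes.

```elixir
# Hypothetical sketch: accept events immediately, buffer them while the shape
# restore runs in the background, then flush the buffer in arrival order.
defmodule BackgroundRestoreSketch do
  use GenServer

  def start_link(_opts) do
    GenServer.start_link(__MODULE__, %{restoring?: true, buffer: []}, name: __MODULE__)
  end

  # Events can be handed in from the moment the process starts.
  def handle_event(event), do: GenServer.cast(__MODULE__, {:event, event})

  @impl true
  def init(state), do: {:ok, state, {:continue, :restore_in_background}}

  @impl true
  def handle_continue(:restore_in_background, state) do
    parent = self()

    # Run the (placeholder) restore without blocking readiness on it.
    Task.start(fn ->
      Process.sleep(100)  # stands in for restoring persisted shapes
      GenServer.cast(parent, :restore_finished)
    end)

    {:noreply, state}
  end

  @impl true
  def handle_cast({:event, event}, %{restoring?: true} = state) do
    # Still restoring: queue the event so nothing is lost.
    {:noreply, %{state | buffer: [event | state.buffer]}}
  end

  def handle_cast({:event, event}, state) do
    process(event)
    {:noreply, state}
  end

  def handle_cast(:restore_finished, state) do
    # Flush buffered events in the order they arrived, then switch to live mode.
    state.buffer |> Enum.reverse() |> Enum.each(&process/1)
    {:noreply, %{state | restoring?: false, buffer: []}}
  end

  defp process(event), do: IO.inspect(event, label: "processing event")
end
```

The open design question is how large the buffer may grow if the restore stalls, which is exactly the failure mode described above.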
Affected files and suggested changes:

| File | Lines | Change |
|---|---|---|
| `lib/electric/shape_cache.ex` | 189-213 | Add timeout to restore, add recovery path |
| `lib/electric/replication/publication_manager/relation_tracker.ex` | 80-81 | Make `wait_for_restore` accept a finite timeout |
| `lib/electric/replication/shape_log_collector.ex` | 76-81 | Consider adding a timeout to `mark_as_ready` |
| `lib/electric/config.ex` | 66 | Make `stack_ready_timeout` configurable / increase the default |
**Estimated effort: Medium.** The core fix (adding a timeout to the restore chain with a fallback to clean shapes) is straightforward, but testing the various failure modes (slow Postgres, many shapes, corrupted storage) requires integration test scenarios. The more ambitious background-restore approach would be Large.
**Suggested reply to the reporter:** Thanks for the detailed report and log file! This looks like the shape restore process during startup is taking longer than the 5s readiness timeout allows, and once it's stuck there's no recovery path since the restore chain uses infinite timeouts internally. We'll look into adding a bounded timeout to the restore chain with a graceful fallback (e.g., clearing stale shapes) so the container can self-heal on restart.