@cpsievert
Created June 2, 2026 16:03
Gray Screen of Death

Session Persistence: Eliminating the Gray Screen of Death

Problem

When a Shiny app's WebSocket connection drops — whether from a network blip, laptop sleep, server error, or any other cause — the user sees a semi-transparent gray overlay with no recovery path (the "gray screen of death"). The session is permanently destroyed server-side, and all state is lost.

A session$allowReconnect() mechanism already exists, but it is off by default, only works on certain hosting platforms, and performs a cold restart (new session, server function re-runs, all server-side state lost).

Goal

Keep the R session alive across WebSocket disconnects so clients can transparently reconnect to the same session with no state loss. Short disconnects (network blips) should be invisible to the user. Longer disconnects show a subtle reconnect UI. Fatal R errors show an informative overlay.

Audience

  • End users: Better default experience without developer effort
  • App developers: New callbacks and configuration for advanced control

Design

Session State Model

Today a session is binary: closed = FALSE (alive) or closed = TRUE (dead). We introduce a three-state lifecycle:

CONNECTED  <-->  SUSPENDED  -->  CLOSED

CONNECTED: Normal operation. WebSocket active, reactivity flowing, flushes write to client.

SUSPENDED: WebSocket gone, but session alive.

  • self$closed stays FALSE (reactive graph keeps running)
  • New flag: self$suspended <- TRUE
  • Output observers are suspended (reversible via resume())
  • Timers (invalidateLater, reactiveTimer) continue running — they invalidate contexts, but output observers won't recompute until resumed
  • closedCallbacks (onSessionEnded) are NOT fired
  • Messages are buffered up to a configurable cap (see "Message Buffering" below)
  • A grace period timer starts; if it expires, transition to CLOSED

CLOSED: Same as today. closedCallbacks fire, observers destroyed, timers cancelled, session removed.

On reconnect (SUSPENDED -> CONNECTED):

  • Swap private$websocket to the new WebSocket
  • Set self$suspended <- FALSE
  • Resume all output observers (they recompute and flush to the new socket)
  • Client sends a resume message (not init) with current input values
  • Server calls manageInputs() without re-running serverFunc()
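
The lifecycle above can be sketched as a small transition table. This is only an illustration in base R: the names transitions, new_session, and transition are hypothetical and not part of Shiny, and the table encodes only the arrows drawn in the diagram.

```r
# Allowed transitions, taken from the diagram:
# CONNECTED <--> SUSPENDED --> CLOSED
transitions <- list(
  CONNECTED = c("SUSPENDED"),
  SUSPENDED = c("CONNECTED", "CLOSED"),
  CLOSED    = character(0)  # terminal state
)

# Hypothetical stand-in for a session object; holds only the state.
new_session <- function() {
  s <- new.env()
  s$state <- "CONNECTED"
  s
}

# Move a session to a new state, refusing transitions the diagram does not allow.
transition <- function(session, to) {
  allowed <- transitions[[session$state]]
  if (!(to %in% allowed)) {
    stop(sprintf("illegal transition: %s -> %s", session$state, to))
  }
  session$state <- to
  invisible(session)
}

s <- new_session()
transition(s, "SUSPENDED")  # WebSocket dropped
transition(s, "CONNECTED")  # client reconnected within the grace period
transition(s, "SUSPENDED")  # dropped again
transition(s, "CLOSED")     # grace period expired
```

Once a session reaches CLOSED, any further call to transition() raises an error, which mirrors the one-way arrow into CLOSED.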

Server-Side Changes

wsClosed() replacement

Instead of wsClosed() immediately tearing everything down, we split into two methods:

suspendSession() (called from ws$onClose):

  • Set self$suspended <- TRUE
  • Call output$suspend() for all outputs (reversible)
  • Start grace period timer
  • Move session from appsByToken to a new suspendedSessions map

closeSession() (called when grace period expires, or on fatal error):

  • Set self$closed <- TRUE
  • Fire closedCallbacks (onSessionEnded, observer destroy, timer cancel, etc.)
  • Remove from suspendedSessions

suspendedSessions map

Parallel to appsByToken, keyed by session token. When a new WebSocket arrives with a reconnect token, look it up here instead of creating a new ShinySession.
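
A rough sketch of how the two methods might move a session between the two maps, using plain environments as stand-ins. Everything here (apps_by_token, suspended_sessions, make_session, suspend_session, close_session) is hypothetical and simplified; output suspension, the grace period timer, and the real closedCallbacks are only marked by comments or a flag.

```r
apps_by_token      <- new.env()  # stand-in for appsByToken
suspended_sessions <- new.env()  # stand-in for the proposed suspendedSessions

# Hypothetical minimal session record, registered as connected.
make_session <- function(token) {
  s <- new.env()
  s$token <- token
  s$closed <- FALSE
  s$suspended <- FALSE
  s$ended_callbacks_fired <- FALSE
  assign(token, s, envir = apps_by_token)
  s
}

# Called when the WebSocket closes: keep the session, but park it.
suspend_session <- function(s) {
  s$suspended <- TRUE
  # A real implementation would also suspend outputs and start the grace timer.
  rm(list = s$token, envir = apps_by_token)
  assign(s$token, s, envir = suspended_sessions)
  invisible(s)
}

# Called when the grace period expires or on a fatal error.
close_session <- function(s) {
  s$closed <- TRUE
  s$ended_callbacks_fired <- TRUE  # stands in for firing closedCallbacks
  if (exists(s$token, envir = suspended_sessions, inherits = FALSE)) {
    rm(list = s$token, envir = suspended_sessions)
  }
  invisible(s)
}

s <- make_session("abc123")
suspend_session(s)
in_suspended_map <- exists("abc123", envir = suspended_sessions, inherits = FALSE)
still_open <- !s$closed
close_session(s)
```

The point of the split is visible in the flags: after suspend_session() the session is parked but closed is still FALSE, and the end-of-session callbacks only fire in close_session().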

Resume handshake

In the WebSocket handler (server.R), before creating a new session:

  1. Check for reconnect token (query param ?reconnect_token=<token>)
  2. If token found in suspendedSessions:
    • Cancel grace period timer
    • Call resumeSession(newWebSocket):
      • private$websocket <- newWebSocket
      • self$suspended <- FALSE
      • Attach message handlers to new WebSocket
      • Resume all output observers
      • Send config message (so client knows resume succeeded)
      • Move session back to appsByToken
      • Call requestFlush() to push accumulated state
  3. If token not found (expired or new client):
    • Create new ShinySession as today
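
The lookup in steps 1 to 3 could look roughly like the following. This is a base R sketch under stated assumptions: parse_reconnect_token and handle_connection are hypothetical helpers, the suspended map is a plain environment, and the timer, message handlers, and config message are omitted.

```r
# Extract reconnect_token from a query string such as "?reconnect_token=abc".
# Returns NULL when the parameter is absent or empty.
parse_reconnect_token <- function(query_string) {
  qs <- sub("^\\?", "", query_string)
  if (!nzchar(qs)) return(NULL)
  pairs <- strsplit(strsplit(qs, "&", fixed = TRUE)[[1]], "=", fixed = TRUE)
  for (p in pairs) {
    if (length(p) == 2 && identical(p[[1]], "reconnect_token")) {
      return(utils::URLdecode(p[[2]]))
    }
  }
  NULL
}

# Decide whether an incoming connection resumes a suspended session
# or should get a brand-new one.
handle_connection <- function(query_string, suspended) {
  token <- parse_reconnect_token(query_string)
  if (!is.null(token) && nzchar(token) &&
      exists(token, envir = suspended, inherits = FALSE)) {
    s <- get(token, envir = suspended, inherits = FALSE)
    rm(list = token, envir = suspended)  # no longer suspended
    s$suspended <- FALSE
    return(list(action = "resume", session = s))
  }
  list(action = "new", session = NULL)   # expired token or new client
}

suspended <- new.env()
old <- new.env()
old$suspended <- TRUE
assign("tok1", old, envir = suspended)

r1 <- handle_connection("?reconnect_token=tok1", suspended)  # resumes
r2 <- handle_connection("?reconnect_token=tok1", suspended)  # token already consumed
r3 <- handle_connection("", suspended)                       # ordinary new client
```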

private$write() during SUSPENDED state

Messages are buffered up to a global cap (total bytes across all message types). This covers:

  • Output values: Already self-deduplicate in invalidatedOutputValues/invalidatedOutputErrors Maps (keyed by output name, last value wins)
  • Input update messages: Accumulated in inputMessageQueue, bounded by input count
  • Custom messages (session$sendCustomMessage): The primary unbounded concern — buffered under the cap
  • Progress/notifications: Ephemeral, buffered under the same cap

If the cap is reached, stop buffering and flag that a full recompute will happen on resume. Notify the client of this on reconnect so it can inform the user if appropriate.
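
A minimal sketch of a byte-capped buffer with an overflow flag, assuming messages are already serialized to strings. The helpers new_buffer and buffer_write are hypothetical. Whether messages buffered before the overflow are kept or dropped is not specified above; this sketch keeps them and simply stops accepting more.

```r
# Hypothetical buffer: tracks total bytes and whether the cap was hit.
new_buffer <- function(cap_bytes) {
  b <- new.env()
  b$cap <- cap_bytes
  b$used <- 0
  b$messages <- list()
  b$overflowed <- FALSE
  b
}

# Append a serialized message unless that would exceed the cap.
# Returns TRUE if the message was buffered, FALSE otherwise.
buffer_write <- function(b, msg) {
  if (b$overflowed) return(invisible(FALSE))
  size <- nchar(msg, type = "bytes")
  if (b$used + size > b$cap) {
    b$overflowed <- TRUE  # signals that a full recompute is needed on resume
    return(invisible(FALSE))
  }
  b$messages[[length(b$messages) + 1L]] <- msg
  b$used <- b$used + size
  invisible(TRUE)
}

b <- new_buffer(cap_bytes = 10)
ok1 <- buffer_write(b, "12345")   # fits: 5 of 10 bytes used
ok2 <- buffer_write(b, "123456")  # 5 + 6 > 10, so the buffer overflows
ok3 <- buffer_write(b, "1")       # rejected: buffering has stopped
```

On resume, the server would check the overflowed flag to decide between replaying the buffer and forcing a full recompute.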

New developer callbacks

  • session$onDisconnected(callback) — fires when entering SUSPENDED state
  • session$onReconnected(callback) — fires when resuming from SUSPENDED
  • session$onSessionEnded() — unchanged semantics, just delayed until CLOSED
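
To illustrate the intended firing order, here is a toy callback registry in base R. new_callbacks is a hypothetical helper, not Shiny's internal implementation. The simulated sequence is one short disconnect that recovers, followed by one that outlasts the grace period.

```r
# Hypothetical registry: register() stores functions, invoke() calls them in order.
new_callbacks <- function() {
  fns <- list()
  list(
    register = function(fn) {
      fns[[length(fns) + 1L]] <<- fn
      invisible(NULL)
    },
    invoke = function() {
      for (fn in fns) fn()
      invisible(NULL)
    }
  )
}

events <- character(0)
on_disconnected  <- new_callbacks()
on_reconnected   <- new_callbacks()
on_session_ended <- new_callbacks()

on_disconnected$register(function() events <<- c(events, "disconnected"))
on_reconnected$register(function() events <<- c(events, "reconnected"))
on_session_ended$register(function() events <<- c(events, "ended"))

on_disconnected$invoke()   # WebSocket drops: enter SUSPENDED
on_reconnected$invoke()    # client returns within the grace period
on_disconnected$invoke()   # WebSocket drops again
on_session_ended$invoke()  # grace period expires: CLOSED
```

Note that "ended" appears only once, at the very end, even though the connection dropped twice.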

Client-Side Changes

Disconnect detection and reconnect flow

When the WebSocket closes without a preceding fatal error message:

  1. 0-5 seconds: No visual change. App looks normal. Client silently attempts reconnect every 1.5s, sending the session token.
  2. 5s-timeout: Subtle, non-blocking banner at top of page: "Connection lost. Reconnecting..." with a manual "Reconnect now" link. App remains visible and readable — no overlay.
  3. Timeout reached / server rejects token: Overlay: "Session expired. Reload to start fresh."
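
The three phases can be summarized as a function of elapsed time. The real logic would live in the browser client; it is written in R here only to match the other snippets in this document. reconnect_ui_phase and its return values are hypothetical names, and the 60-second default mirrors the shiny.reconnect.timeout default proposed under Configuration.

```r
# Map time since disconnect (in seconds) to the UI phase described above.
reconnect_ui_phase <- function(elapsed_secs, timeout_secs = 60,
                               token_rejected = FALSE) {
  if (token_rejected || elapsed_secs >= timeout_secs) {
    return("expired_overlay")  # "Session expired. Reload to start fresh."
  }
  if (elapsed_secs < 5) {
    return("silent")           # no visual change, silent retries
  }
  "banner"                     # non-blocking "Connection lost" banner
}

phases <- vapply(c(0, 4.9, 5, 59, 60), reconnect_ui_phase, character(1))
rejected_early <- reconnect_ui_phase(1, token_rejected = TRUE)
```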

Fatal error flow

The server sends a {"type": "error", "message": "...", "fatal": true} message before closing the WebSocket. The client shows an overlay immediately with:

  • Error details (respecting shiny.sanitize.errors)
  • "Reload" button — clean slate, fresh session
  • "Reload and restore inputs" button — with a note: "This will attempt to restore your previous inputs, but the error may recur if it was caused by a specific input combination."

Resume handshake (client side)

  • Include session token in WebSocket URL: ws://host/websocket?reconnect_token=<token>
  • On connection, send a resume message (with current input values) instead of init
  • If server responds with the same session ID in config, resume succeeded — hide reconnect UI
  • If server responds with a new session ID (token changed), the old session expired — treat as fresh start

Input queueing during disconnect

While disconnected, user interactions (typing, clicking) are captured and sent as an update message after reconnect, so nothing is lost.
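
One possible shape for the queue, sketched in R for consistency with the rest of the document even though the real code would run in the browser. It assumes that only the latest value per input needs to be sent, which the text above does not state, and that the message reuses the existing update method name. queue_input and build_update_message are hypothetical helpers.

```r
# Record an input change while disconnected; a later value for the same
# input id replaces the earlier one (assumption: last value wins).
queue_input <- function(pending, id, value) {
  pending[id] <- list(value)  # list(value) so NULL values are kept, not dropped
  pending
}

# Build the single message to send once the connection is back.
build_update_message <- function(pending) {
  list(method = "update", data = pending)
}

pending <- list()
pending <- queue_input(pending, "name", "A")
pending <- queue_input(pending, "name", "Al")  # user kept typing
pending <- queue_input(pending, "n", 10)

msg <- build_update_message(pending)
```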

Configuration

Global options

options(
  shiny.reconnect = TRUE,           # Enable/disable (default TRUE)
  shiny.reconnect.timeout = 60,     # Grace period in seconds (default 60)
  shiny.reconnect.bufferSize = 1e6  # Max buffered bytes during SUSPENDED (~1MB default)
)

Per-session API

# Override grace period
session$setReconnectTimeout(seconds)

# Disable for this session
session$setReconnectTimeout(0)

# React to lifecycle events
session$onDisconnected(function() { ... })
session$onReconnected(function() { ... })

Backwards compatibility

  • session$allowReconnect(TRUE) becomes a wrapper for setReconnectTimeout(getOption("shiny.reconnect.timeout"))
  • session$allowReconnect(FALSE) becomes setReconnectTimeout(0)
  • onSessionEnded fires later than before (delayed by grace period). For apps needing immediate cleanup, setReconnectTimeout(0) restores old behavior.
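
The wrapper behavior in the first two bullets can be sketched with a fake session object. make_fake_session is hypothetical and exists only for illustration; setReconnectTimeout is the proposed API from this document, not an existing Shiny method, and the fallback of 60 seconds is the default proposed above.

```r
# Hypothetical stand-in for a session, with just the two methods discussed.
make_fake_session <- function() {
  s <- new.env()
  s$reconnect_timeout <- NULL
  s$setReconnectTimeout <- function(seconds) {
    s$reconnect_timeout <- seconds
    invisible(NULL)
  }
  # allowReconnect() delegates to setReconnectTimeout(), as proposed.
  s$allowReconnect <- function(value) {
    if (isTRUE(value)) {
      s$setReconnectTimeout(getOption("shiny.reconnect.timeout", 60))
    } else {
      s$setReconnectTimeout(0)
    }
  }
  s
}

old_opts <- options(shiny.reconnect.timeout = 45)
s <- make_fake_session()

s$allowReconnect(TRUE)
timeout_when_allowed <- s$reconnect_timeout   # taken from the global option

s$allowReconnect(FALSE)
timeout_when_disabled <- s$reconnect_timeout  # 0 disables persistence

options(old_opts)  # restore the previous option value
```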

Edge Cases

Multiple reconnect attempts / race conditions

If a new WebSocket arrives with a reconnect token while the old WebSocket's onClose hasn't fired yet, accept the new one and discard the old. Tokens are unique per session, so no cross-session collision.

Session hijacking

The token is a 128-bit random value, so guessing it is infeasible. Over plain HTTP the token can be intercepted in the same way a session cookie can, so the exposure is no worse than for cookie-based sessions generally. Document that HTTPS is strongly recommended.
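
For a sense of scale, 128 bits is 16 random bytes, or 32 hex characters. The sketch below shows only the shape of such a token. generate_token is a hypothetical helper, and it uses base R's sample.int(), which is not a cryptographically secure source; a real implementation would need a secure random generator.

```r
# Illustration only: sample.int() is NOT cryptographically secure.
generate_token <- function(n_bytes = 16L) {
  bytes <- sample.int(256L, n_bytes, replace = TRUE) - 1L  # values 0..255
  paste(format(as.hexmode(bytes), width = 2), collapse = "")
}

token <- generate_token()
token_length <- nchar(token)  # 16 bytes * 2 hex characters per byte
```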

Server restart / process death

If the R process dies, there's no session to resume. Client retries, gets rejected (token not found), falls through to "session expired" overlay. This is where bookmarking remains the right tool.

onSessionEnded timing

Apps holding expensive resources (DB connections, large data) hold them longer during the grace period. setReconnectTimeout(0) provides an escape hatch. Document this tradeoff.

freezeReactiveValue during SUSPENDED

Frozen values thaw in onFlushed callbacks, which don't fire while suspended. On resume, the first flush thaws them. This may cause a brief flash of stale values, which is acceptable for v1.

File uploads in progress

Mid-stream uploads are lost on disconnect. On reconnect, user would need to re-upload. No attempt to resume uploads in v1.

Extended tasks

Extended tasks run in background processes (mirai/future) and are WebSocket-agnostic. During SUSPENDED:

  • Background work continues and completes normally
  • on_success/on_error sets rv_status and rv_value/rv_error via reactive values
  • Output observers are suspended, so they don't recompute yet
  • On reconnect, observers resume, read task$result(), and flush the completed value

This is one of the strongest arguments for session persistence — today, disconnecting during a long computation loses it entirely. With this design, the result is waiting on reconnect.

bind_task_button state may be briefly stale on reconnect (showing "running" when task is complete). The output flush corrects this quickly.

Invocations queued in invocation_queue continue to chain and complete normally during SUSPENDED.


What this design does NOT include

  • Automatic state serialization — That's bookmarking territory
  • Cross-process session migration — Different problem entirely
  • Changes to reactive graph or observer semantics — Beyond the suspend/resume lifecycle
  • Upload resume — Too complex for v1