@pedramamini
Last active June 12, 2026 03:41

maestro-p first-byte timeout: root-cause + patch spec (Pedsidian morning-cue failures)

UPDATE 2026-06-11: patched build ran; new diagnostic changes the diagnosis

The patched maestro-p (resubmit loop + getScreenTail dump) IS now running. Confirmed: today's first_byte_timeout errors in cue.db carry the new "last screen at timeout (ANSI-stripped tail)" block. So the fix is live AND we finally have ground truth. It changes the root cause.

What the screen tails prove

All three of today's Pedsidian agent cues still failed at 121s (chain-1 07:00, chain-9 11:00, chain-8 17:00), BUT the screen tails show claude is alive and working the whole time:

  • 11:00 + 17:00 tails are pure spinner animation: Pouncing… / Recombobulating… with a rising counter (1m 0s, 50, 1,2,3...). That is claude's working spinner. The turn STARTED.
  • 07:00 tail shows a RESUMED session re-rendering a large prior conversation (What are we working on?, Mulling… (22s · ↓ 1.6k tokens), reminder text from an earlier session).

So the resubmit fix worked: the prompt is now being submitted and claude begins the turn. The failure is no longer "prompt swallowed / never started." It is now: claude is healthily thinking past 120s, but maestro-p declares first_byte_timeout anyway.

Why it still times out

markFirstEntrySeen() is called ONLY from handleEntry ← the JsonlTailer "entry" event (lines ~4685, 4786-4787, 4833, 4877). So "first byte" = "a JSONL transcript entry was written." But claude can be demonstrably alive for >120s (spinner animating, token counter rising) BEFORE it writes its first transcript entry, especially on these heavy morning prompts (full news analysis + web search + extended thinking) and especially on the --resume path where it first reloads a big prior conversation. maestro-p kills a working session because it is measuring the wrong signal.

Recommended fixes (supersede the earlier list)

  1. Treat TUI liveness as first-byte, not just JSONL entries. The spinner frames (Pouncing…/Recombobulating…/Mulling…) and the rising token counter in rollingBuffer are proof of life. If the screen is changing / the spinner is advancing, the turn started, so clear firstByteTimer. Only time out on TRUE stall (no screen change AND no JSONL entry for N seconds). This is the real fix.
  2. The --resume path is a latency trap. It reloads a large prior conversation before the new turn produces output, eating the budget. Consider a fresh session per scheduled cue, or exclude resume-reload time from the first-byte budget.
  3. Interim, low-risk: raise the first-byte budget for these agents well past 120s (e.g. --first-byte-timeout 300). The morning workload legitimately needs minutes to first transcript entry. This alone would likely make today's runs pass.
  4. Possible harm from resubmit: RESUBMIT_INTERVAL_MS = 8000 keeps pressing Enter every 8s. Once claude is already thinking, those Enters may queue empty submissions / interrupts. Gate the resubmit loop to stop as soon as ANY TUI activity (spinner motion), not only a JSONL entry, is detected.
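
Fixes 1 and 4 can share one liveness check. The following is a minimal sketch, not code from maestro-p.js: `LivenessTracker`, its method names, and `SPINNER_REGEX` are hypothetical, and the regex is only a guess modeled on the spinner frames quoted above (a capitalized word followed by `…`). The clock is injected so the logic can be exercised without real timers.

```javascript
// Hypothetical sketch: none of these names exist in maestro-p.js.
// Assumed spinner shape, based on "Pouncing…", "Recombobulating…", "Mulling…".
const SPINNER_REGEX = /[A-Z][a-z]+…/;

class LivenessTracker {
  constructor({ stallMs, now = Date.now }) {
    this.stallMs = stallMs;
    this.now = now;
    this.lastScreen = null;
    this.lastActivityAt = now();
    this.turnStarted = false;
  }

  // Feed the current ANSI-stripped screen tail on every poll.
  observeScreen(screenText) {
    if (screenText !== this.lastScreen) {
      this.lastScreen = screenText;
      this.lastActivityAt = this.now(); // any redraw counts as activity
      if (SPINNER_REGEX.test(screenText)) {
        this.turnStarted = true; // spinner visible: claude is working
      }
    }
  }

  // Feed every JsonlTailer "entry" event.
  observeJsonlEntry() {
    this.turnStarted = true;
    this.lastActivityAt = this.now();
  }

  // Fix 4: stop pressing Enter once the turn has visibly started.
  shouldResubmit() {
    return !this.turnStarted;
  }

  // Fix 1: time out only on a true stall (no redraw AND no entry).
  isStalled() {
    return this.now() - this.lastActivityAt >= this.stallMs;
  }
}
```

With this shape, the first-byte timer would poll `isStalled()` instead of firing once at a fixed deadline, and the resubmit loop would check `shouldResubmit()` before each Enter.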

Also flag (probably side effects of the in-place build swap)

  • Every failed spawn logs SecCodeCheckValidity ... Code=-67034 (codesign). codesign --verify on the app now reports modified/added files in app.asar.unpacked, so the bundle signature is broken after the manual maestro-p/asar swap. Likely benign noise, but a clean re-sign would remove it and rule it out.

Objective:: Patch spec for the maestro-p first-byte timeout that silently kills the Pedsidian morning agent cues. Hand to the maestro-p build agent.

maestro-p first-byte patch spec

File: maestro-p.js (reviewed build: 4905 lines, 167.9K, ~2.1.x). Symbols/line numbers below are from that build; the new build may differ, so match on symbol names, not line numbers.

Symptom (validated in cue.db)

Pedsidian scheduled agent cues fail with first_byte_timeout (exit 5) on 6/5, 6/8, and 6/9. Other agents (Twitter, LinkedIn, Council) succeed through the same maestro-p. A manual cue trigger and an interactive launch both work.

What the code actually does (the smoking gun)

  1. firstByteTimer is armed right after await driver.start() (~line 4768), BEFORE the prompt is even sent. So the 120s DEFAULT_FIRST_BYTE_TIMEOUT_SECONDS budget also has to cover the ready-wait and session-id discovery, not just claude's first byte.
  2. Ready is detected by the prompt-glyph regex READY_REGEX = /[›❯]\s/ with its own READY_TIMEOUT_MS = 8000. Our failures are first_byte (exit 5), not ready_timeout (exit 4), so the glyph appeared in <8s, the prompt was sent, and then no turn ever started within 120s.
  3. TuiDriver.send() writes the prompt text, then fires Enter SUBMIT_ENTER_RETRIES + 1 = 5 times, at SEND_ENTER_DELAY_MS + tap*750ms for taps 0..4. All 5 submit attempts therefore land within about 3 seconds of the send, and then it stops trying.
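
The arithmetic behind the "about 3 seconds" figure in item 3 can be worked through directly. This is an illustrative sketch only: the 750ms spacing and the 5 taps come from the spec above, but the value of `SEND_ENTER_DELAY_MS` is not given there, so the 150 used here is a placeholder assumption, and `enterOffsetsMs` is a hypothetical helper.

```javascript
const SUBMIT_ENTER_RETRIES = 4;  // spec: SUBMIT_ENTER_RETRIES + 1 = 5 taps
const TAP_SPACING_MS = 750;      // spec: tap * 750ms
const SEND_ENTER_DELAY_MS = 150; // ASSUMED placeholder; real value not stated

// Offsets (ms after the prompt text is written) at which Enter is fired.
function enterOffsetsMs(delayMs = SEND_ENTER_DELAY_MS) {
  const offsets = [];
  for (let tap = 0; tap <= SUBMIT_ENTER_RETRIES; tap++) {
    offsets.push(delayMs + tap * TAP_SPACING_MS);
  }
  return offsets;
}

console.log(enterOffsetsMs()); // offsets: 150, 900, 1650, 2400, 3150
```

Whatever the real delay is, the span between the first and last Enter is 4 × 750 = 3000ms, so nothing is re-submitted for the remaining ~117s of a 120s budget.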

Most likely root cause

The glyph regex matches an EARLY prompt render while MCP/plugins are still connecting (this agent loads context-mode and browsermcp; "MCP servers still connecting" is observed live). The prompt text + all 5 Enters are written into the input during that pre-settle window and are lost on a redraw, OR Enter does not submit until MCP init completes. Because all submit attempts are spent in the first ~3s, once MCP settles (say 10-40s later under morning IO contention) nothing ever re-submits, so claude sits at an idle prompt with no turn → first_byte_timeout. Light agents have no MCP/plugin load, settle instantly, and their single send lands cleanly.

This is why no timeout VALUE alone fixes it: a lost/unsubmitted prompt never starts, however long the wall-clock budget is.

Requested patch (ranked; #1 is the real fix)

1. Keep re-submitting until the first transcript byte, not just for ~3s (THE FIX)

Spread submit (Enter) retries across the whole first-byte budget instead of front-loading 5 in 3s. Re-tap Enter on an interval (e.g. every 10-15s) until firstEntrySeen flips or the budget expires.

  • Safety: gate every retry on !firstEntrySeen && !finalized. Once the first JSONL entry exists the turn has started, so stop. Extra Enters into an idle, prompt-loaded claude either submit the pending prompt (the win) or create an ignorable blank submission. Do NOT re-type the prompt text (that risks a double prompt); re-send the submit key only. If you can cheaply confirm the input buffer is empty (prompt was lost, not just unsubmitted), then and only then re-type.
  • Mechanically: either add a resubmitTimer = setInterval in the dispatch path that calls a driver.resubmit() (new method that just writes "\r") until first byte, or widen send() to accept a "keep submitting until cancelled" mode and have markFirstEntrySeen() cancel it alongside the first-byte timer.
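
A minimal sketch of the first mechanical option, under stated assumptions: `startResubmitLoop` and the `state` object shape are hypothetical, `driver.resubmit()` is the new method proposed above (it does not exist in the reviewed build), and the timer functions are injectable only so the loop can be tested without waiting.

```javascript
// Hypothetical sketch of the resubmit interval; not code from maestro-p.js.
function startResubmitLoop({
  driver,
  state, // assumed shape: { firstEntrySeen: boolean, finalized: boolean }
  intervalMs,
  timers = { setInterval, clearInterval },
}) {
  const handle = timers.setInterval(() => {
    // Safety gate: stop as soon as the turn has started or the run is over.
    if (state.firstEntrySeen || state.finalized) {
      timers.clearInterval(handle);
      return;
    }
    // Re-send the submit key only; never re-type the prompt text.
    driver.resubmit();
  }, intervalMs);

  // Returned so markFirstEntrySeen() can cancel it with the first-byte timer.
  return () => timers.clearInterval(handle);
}
```

The caller would invoke this right after `driver.send(prompt)` and call the returned cancel function from `markFirstEntrySeen()` and from finalize.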

2. Start the first-byte timer at prompt-send, not at driver.start()

Move the firstByteTimer = setTimeout(...) block to immediately AFTER driver.send(prompt) in both the resume and fresh branches. Removes ready-wait + discovery latency (up to ~8s plus discovery) from the budget so the 120s measures only "turn start after submit." Correctness cleanup; pairs with #1.

3. Post-ready settle before the first send

After the ready glyph, wait a short settle (≈1.5-3s, or better: wait for an MCP-init-complete signal if the TUI emits one) before the first driver.send(prompt), so the prompt is not typed into a still-redrawing input. Reduces how often #1 has to kick in.
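
One way to express the settle wait is a quiet-period gate: send only once the screen has stopped redrawing for a short window. This is a sketch under assumptions, not the build's behavior. `createSettleGate`, `quietMs`, and `maxWaitMs` are hypothetical names, and the `maxWaitMs` escape hatch (send anyway after a cap) is an addition not found in the spec. Timestamps are passed in explicitly so the logic stays deterministic.

```javascript
// Hypothetical sketch: gate the first driver.send(prompt) on a quiet screen.
function createSettleGate({ quietMs, maxWaitMs, startedAt }) {
  let lastScreen = null;
  let lastChangeAt = startedAt;

  return {
    // Feed the current stripped screen text and the current time on each poll.
    observe(screenText, nowMs) {
      if (screenText !== lastScreen) {
        lastScreen = screenText;
        lastChangeAt = nowMs; // a redraw restarts the quiet window
      }
    },
    // True once the screen has been unchanged for quietMs,
    // or once maxWaitMs has elapsed since the ready glyph (assumed cap).
    canSend(nowMs) {
      const quiet = nowMs - lastChangeAt >= quietMs;
      const waitedTooLong = nowMs - startedAt >= maxWaitMs;
      return quiet || waitedTooLong;
    },
  };
}
```

For example, with `quietMs: 1500` a redraw at 1.2s after the ready glyph pushes the earliest send to 2.7s, which stays inside the 1.5-3s range suggested above.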

4. Make the budget configurable per cue + bump the default (blunt backstop)

Default is already overridable via --first-byte-timeout. As an immediate mitigation independent of the build, the Pedsidian agent cues can pass --first-byte-timeout 300. Consider raising DEFAULT_FIRST_BYTE_TIMEOUT_SECONDS for agents with MCP/plugins. This buys margin but does NOT fix a lost submission, so it is a backstop, not the fix.

Diagnostic to confirm (cheap, do this regardless)

On first_byte_timeout, dump the last ~1-2KB of driver.rollingBuffer (the stripped TUI screen) into the error/stderr before finalize. That single change tells us definitively what was on screen at timeout: an MCP-connecting banner, a modal, or a prompt sitting with un-submitted text. If you want one change in the build to ship first, ship this, then we read the next failure and know exactly which of #1-#4 is needed.
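
The dump itself is small. The sketch below is an assumption-laden illustration rather than the build's `getScreenTail`: `screenTail` and `firstByteTimeoutMessage` are hypothetical helpers, the ANSI regex covers only CSI and OSC sequences, the tail is cut by characters rather than bytes, and the message layout is modeled on the "last screen at timeout (ANSI-stripped tail)" block name from the update above. Stripping is applied defensively in case the buffer still holds escape codes.

```javascript
// CSI sequences (colors, cursor moves) and OSC sequences (titles, links).
const ANSI_REGEX =
  /\x1b\[[0-?]*[ -\/]*[@-~]|\x1b\][^\x07\x1b]*(?:\x07|\x1b\\)/g;

// Return the last maxChars characters of the screen with ANSI codes removed.
function screenTail(rollingBuffer, maxChars = 2048) {
  const stripped = rollingBuffer.replace(ANSI_REGEX, "");
  return stripped.length <= maxChars ? stripped : stripped.slice(-maxChars);
}

// Assumed message layout; the real error format may differ.
function firstByteTimeoutMessage(rollingBuffer) {
  return (
    "first_byte_timeout\n" +
    "last screen at timeout (ANSI-stripped tail):\n" +
    screenTail(rollingBuffer)
  );
}
```

Writing this message to stderr before finalize is enough to tell an MCP-connecting banner apart from a modal or an unsubmitted prompt on the next failure.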

#claude
