name	manuscript-broll-suggest
description	Scan a manuscript for overlay-free narration segments and suggest cinematic b-roll concepts to generate via Higgsfield AI; on selection, generate the clip into ./tmp and open it in QuickTime, then offer to refine or pick another.

Suggest and Generate B-Roll for Manuscript via Higgsfield

Analyze a manuscript, find segments that have no other visual overlay, propose movie-like b-roll concepts for them, and on user selection generate the clip with Higgsfield AI, save it under ./tmp/, and open it in QuickTime. Then loop: refine the current clip or pick another suggestion.

Input

Manuscript path: $1

Workflow Overview

Resolve manuscript path (ask if missing).
Detect overlay-free narration segments.
Suggest 3–8 b-roll concepts (each tied to a specific segment).
Wait for the user to pick one (or accept all the suggestions in sequence).
Generate the clip via Higgsfield, poll until done.
Save the result to ./tmp/<manuscript-stem>-broll-<slug>.mp4 and open in QuickTime.
Ask: pick another suggestion, or refine the just-generated one (re-prompt and regenerate).

Step 1 — Resolve Manuscript Path

If $1 is empty, ask: "Which manuscript should I scan for b-roll opportunities? Please provide the absolute path." Validate the file exists before proceeding.

Step 2 — Detect Overlay-Free Segments

A b-roll clip REPLACES the talking head full-screen for ~8 seconds. It must NEVER coincide with content that's already taking over the frame (code, diagrams, terminal transcripts). Lighter overlays — logos, lower-thirds, on-screen data, citations, bold phrases — are softer: a b-roll generally serves retention better than those, so don't disqualify a paragraph for having them. If the clip is accepted, the conflicting overlay is stripped from the paragraph (see Step 6a).

Hard exclusions — never propose b-roll for paragraphs that contain, or sit adjacent (within 2 lines above or below) to, any of these:

Fenced code blocks (```)
Inline code spans (`) when they dominate a line (commands, file paths, identifiers)
TODO: Diagram: lines and mermaid blocks (```mermaid)
[user] / [agent] markers (terminal/agent transcripts)

Hard exclusions (sections):

## Setup and ## Destroy sections in their entirety

Hard exclusion (positional):

The first paragraph of narration in the manuscript — the first paragraph that follows the first ## Header. This is where the user establishes presence on screen. If the manuscript has a TODO: intro-clip: line at the top (a cold-open from manuscript-broll-opener), the AI clip lands precisely on this paragraph's first frame, so a b-roll over it would erase the "you've arrived" moment. Even without a cold-open, the first paragraph is the viewer's first impression of the speaker — never b-roll over it.

Out of scope:

Cold-open clips that play before the first paragraph of narration are not this skill's responsibility — they're owned by manuscript-broll-opener, which generates clips ending at the first frame of recorded talking-head footage. Don't propose anything in that territory; if the manuscript needs a cold open, point the user at that skill instead.

Soft overlays — OK to propose b-roll over these (stripped on accept in Step 6a):

TODO: Logo: lines
TODO: Screen: lines
TODO: Lower-third: lines
TODO: Citation: lines
Bold phrases (**...**)

For TODO entry definitions, see .claude/skills/manuscript-analyze/SKILL.md. The canonical word-count exclusion list lives in .claude/skills/manuscript-analyze/count-narration-words.sh.

A clean segment is a contiguous run of paragraphs that:

Contains no hard-exclusion markers
Has no hard-exclusion marker in the 2 lines immediately before its first paragraph or after its last paragraph
Is at least one full sentence of pure narration

Soft overlays are NOT considered when detecting clean segments — they're handled at accept time.
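
A minimal sketch of the adjacency check described above, assuming the manuscript is already split into a list of lines. The function names are hypothetical, and the sketch covers only fenced blocks, TODO: Diagram: lines, and [user] / [agent] markers; inline-code dominance, the ## Setup / ## Destroy sections, and the first-paragraph rule are left out.

```python
import re

FENCE = re.compile(r"^\s*```")
MARKER = re.compile(r"^\s*(TODO: Diagram:|\[user\]|\[agent\])")

def hard_marker_lines(lines):
    """Return the indices of lines that count as hard-exclusion markers.

    Fence lines and everything between them are marked (this includes
    mermaid blocks), as are TODO: Diagram: and transcript marker lines.
    """
    marked = set()
    in_fence = False
    for i, line in enumerate(lines):
        if FENCE.match(line):
            marked.add(i)
            in_fence = not in_fence
        elif in_fence or MARKER.match(line):
            marked.add(i)
    return marked

def is_clean(lines, first, last, margin=2):
    """True if no hard marker sits inside lines[first..last] or within
    `margin` lines before `first` or after `last`."""
    marked = hard_marker_lines(lines)
    lo = max(0, first - margin)
    hi = min(len(lines) - 1, last + margin)
    return not any(i in marked for i in range(lo, hi + 1))
```

Soft overlays are deliberately ignored here, matching the rule that they are handled at accept time.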

Step 3 — Suggest B-Roll Concepts

Before scoring, find every paragraph already marked with a TODO: clip: line in the manuscript. Call those already-clipped paragraphs. They affect spacing.

From clean segments, select the strongest 3–8 candidates for cinematic b-roll. Prioritize segments that:

Express a vivid metaphor, transformation, or emotional beat
Make a strong claim, contrast, or stakes statement
Open or close a section (good rhythm points)

Concept patterns — apply these when generating suggestions:

Tie the visual to what the sentence is arguing, not just to a topic or metaphor word it mentions. Before locking any concept, write down — in plain words — what the surrounding sentence(s) are arguing for (consolidation, automatic rerouting, retrospective relief, etc.). The visual must depict that thesis. A concept that captures the topic but not the argument lands as "related but disconnected" to the viewer — e.g. server LEDs when the argument is about OpenRouter's automatic rerouting (LEDs ≠ routing); scattered cards when the argument is observability/consolidation (chaos ≠ "all in one place"); a smiling face when the argument is retrospective regret about not having used the tool sooner (calm ≠ before/after contrast). This is the most common failure mode.
Prefer in-domain concepts over imported metaphors. When the narration's setting is concrete (a keyboard, a dashboard, a monitor, a terminal, a code editor), build the visual inside that setting before reaching for unrelated metaphors (a surgeon, a chess player, a summit). Imported metaphors only earn their slot when they're dramatically richer than the in-domain version. Default to in-domain — the audience doesn't have to translate.
Literalize at least one vivid metaphor word in the narration — but only when the sentence's argument is also about that metaphor. Scan the excerpt for concrete metaphor nouns ("YouTube", "firehose", "summit", "battlefield", "lighthouse", "juggling"). At least one of your suggestions should take that word at face value when the surrounding argument matches. If the line says "our YouTube", propose someone watching the screen like it's TV (popcorn, deadpan, idle stare) — the literal read often lands harder than an oblique abstraction.
Closing/payoff concepts must reflect the specific arc of the video, not generic "victory." Re-read the whole manuscript's argument before proposing a closing image. If the video has been arguing "tool X is better than tool Y", the closing should depict that specific transition — not a generic mountain summit, sunrise, or trophy. Match the image to the journey.
Treat narration as editable when a small tweak unlocks a much stronger visual. If a slightly reshaped sentence enables a substantially better b-roll (a comedic deflation, an in-domain image, a clearer metaphor), propose the manuscript edit alongside the visual concept. Don't treat the existing wording as frozen — small narration changes that earn the visual are worth it.
Group narrator-as-subject clips into a recurring fictional set when possible. When 2+ suggestions feature the narrator as the subject, default to placing them in the same fictional setting (same desk, same chair, same room) showing different emotional states across the video. The recurring set becomes a recognizable motif — same character, same room, three states of confronting the same problem. Scatter only when the concepts genuinely demand different locations.

Spacing — flag adjacency, don't suppress. If a candidate paragraph is the immediate neighbor (one paragraph before or after) of an already-clipped paragraph, still suggest it but tag it with an adjacency warning. Two b-rolls back-to-back cut the viewer away from the talking head for 16+ seconds, which can hurt connection and pacing — leave the call to the user.

The subject must never appear to be speaking — non-negotiable. A b-roll plays as voiceover. If the visible subject's mouth is moving or they're delivering dialogue, the viewer's brain expects to hear those exact words, which collides with the narration audio. This applies to every clip, whether the subject is the narrator or a generic actor. Build the cinematic concept around physical action, contemplation, stillness, hands-on craft, or environmental motion — never mid-conversation, never visibly mouthing words. If a candidate concept naturally calls for someone speaking (e.g., a presenter explaining something), drop it or reshape it into silent action.

Narrator-as-subject — judge per clip. Independently of the speaking rule above, decide whether the b-roll should feature the user (the narrator) as its on-screen subject. This is a judgment call — read the narration and the cinematic concept, then decide.

Use narrator as subject when putting the narrator in frame strengthens the moment — autobiographical narration ("my journey ended at TUI"), section bookends, claim-of-role payoffs at the end of a section or video, or when the visual concept is generic enough that "the narrator" reads as natural casting.
Don't use narrator as subject when the narration explicitly names someone else on screen ("her hands trembled", "the engineer at the console"), the concept calls for a different demographic (a child, a different gender, a crowd, an animal), or the setting contradicts the narrator's plausible physical context in a way that breaks immersion.
Second-person or rhetorical narration ("you're either an orchestrator…") is NOT a reason to exclude the narrator. The narrator can still appear silently — see the no-speaking rule above. A rhetorical "you" with the narrator in frame often reads as an implicit "…like me," which is a strong move.

Tag candidates the narrator should be in with [👤 you as subject]. The user can override before generation.

For each candidate, output:

### Suggestion N — <short title>  [⚠ adjacent to existing clip: <clip-name>]  [👤 you as subject]

**Section:** <header it lives under>
**Narration excerpt:** "<the actual sentence(s) — verbatim, sized for the clip>"
**Why this segment:** <one line — what makes it visual / emotional>
**Concept (8s):** <one-paragraph cinematic concept; movie-like, real actors/props, no text on screen>

Excerpt sizing — never exceed the clip's runtime. The b-roll plays for a fixed 8 seconds. The narration excerpt is the spoken text the editor will align to the clip's start. The excerpt MUST NOT take longer than 8 seconds to read aloud — otherwise the clip cuts off before the narration finishes, which looks broken.

At conversational narration pace (140–150 wpm), 8 seconds is about 19–20 words (140 × 8 / 60 ≈ 18.7; 150 × 8 / 60 = 20). Treat 22 words as the absolute upper bound; it already assumes a slightly brisk read. Excerpts longer than that should be trimmed by anchoring mid-paragraph and picking the most cinematically rich contiguous sentences (this makes it a mid-paragraph clip; see Step 6a).

Excerpts shorter than 8 seconds of speech are fine — the clip will run past the excerpt into the next sentence(s) of the same paragraph, and the editor fades out cleanly. The only constraint is that the clip shouldn't spill into a different idea or a different paragraph, so don't pick a candidate where the excerpt is the very last sentence of its paragraph and the next paragraph is unrelated.
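
The sizing rule can be checked mechanically. The sketch below is illustrative only: the function names are hypothetical, words are counted by whitespace splitting, and the 22-word cap is the ceiling stated above.

```python
def speech_seconds(excerpt, wpm=150):
    """Estimate how long `excerpt` takes to read aloud at `wpm` words per minute."""
    words = len(excerpt.split())
    return words * 60.0 / wpm

def fits_clip(excerpt, max_words=22):
    """True if the excerpt stays within the word ceiling for an 8-second clip."""
    return len(excerpt.split()) <= max_words
```

For example, "We had to rewrite the whole pipeline from scratch in a single weekend." is 13 words, about 5.2 seconds at 150 wpm, so it fits with room for the clip to run into the next sentence.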

Tags appear only when the rule triggers. Omit [⚠ adjacent…] if the candidate has no neighboring clip. Omit [👤 you as subject] if the clip should not feature the user.

Then ask the user: "Which suggestion should I generate? (number, or 'all' to walk through them)"

Step 4 — Generate via Higgsfield

Once the user picks a suggestion:

If the suggestion is tagged [👤 you as subject], resolve a headshot UUID.

The config + cache file is ./tmp/.headshots.json with this shape:
```
{
  "canonical_uuid": "<optional uuid the user wants reused across videos>",
  "cache": {
    "screenshot-01.png": { "uuid": "<uuid>", "mtime": <epoch> }
  }
}
```
Resolution order:

a. Canonical UUID (manual reuse path). If ./tmp/.headshots.json exists and contains a non-empty canonical_uuid, use it directly. Skip the screenshot lookup and the upload entirely. This is for the user who has already uploaded a "canonical me" UUID once (in any project) and wants to reuse it across videos without re-uploading screenshots.

b. Per-project screenshots (auto-upload path). Otherwise:
- Look in ./tmp/ for files whose names match (case-insensitive): ^screenshot([-_].*)?\.(jpg|jpeg|png|webp)$. These are screenshots from the user's videos used as headshot references. Examples that match: screenshot.png, screenshot-01.jpg, screenshot_main.webp, Screenshot-side.jpeg. Examples that don't: headshot-01.jpg, screen.png, screenshot.mp4.
- If no headshot screenshots are found, stop and tell the user verbatim: "No headshot screenshots found in ./tmp/. To generate this clip with you as the subject, drop one or more images named screenshot*.jpg|jpeg|png|webp into ./tmp/ and re-run. Or set canonical_uuid in ./tmp/.headshots.json to reuse a previously uploaded headshot. Skip with --no-headshot to generate with a stand-in actor instead." Wait for the user to either add files, set a canonical UUID, or pass --no-headshot.
- If headshots are found, pick the most appropriate one for the clip (prefer a clear front-facing frame for close-ups; a wider frame if the concept calls for a wide shot). If only one exists, use it.
- Cache uploads. For the chosen file: if cache[<filename>].uuid exists and the file's current mtime equals cache[<filename>].mtime, reuse that UUID. Otherwise call mcp__higgsfield__media_upload (and media_confirm if required) to upload the file, then write the new UUID + mtime under cache[<filename>].
The resolved UUID is what you'll attach in the generate_video call later in this step.
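
The filename pattern and the resolution order above can be sketched as follows. This is an illustration under assumptions: `is_headshot` and `resolve_uuid` are hypothetical helper names, `config` is the parsed contents of ./tmp/.headshots.json, and a return value of None stands for "upload needed".

```python
import re

# Pattern quoted from the text above, matched case-insensitively.
HEADSHOT_RE = re.compile(r"^screenshot([-_].*)?\.(jpg|jpeg|png|webp)$", re.IGNORECASE)

def is_headshot(filename):
    """True if the bare filename qualifies as a headshot screenshot."""
    return HEADSHOT_RE.match(filename) is not None

def resolve_uuid(config, filename, current_mtime):
    """Return a reusable UUID, or None when the file must be uploaded.

    Order: a non-empty canonical_uuid wins; otherwise the cache entry is
    reused only if its mtime equals the file's current mtime.
    """
    canonical = config.get("canonical_uuid")
    if canonical:
        return canonical
    entry = config.get("cache", {}).get(filename)
    if entry and entry.get("uuid") and entry.get("mtime") == current_mtime:
        return entry["uuid"]
    return None
```

When `resolve_uuid` returns None, the upload path applies and the new UUID and mtime are written back under cache[<filename>].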
Pick a model. Two viable defaults; choose between them based on the clip:
- seedance_2_0 (default for most cases). Text→video with a dedicated image role for an independent identity reference, strong identity preservation, natural human body motion, audio-reference support, max 1080p, genre control. Pick this when any of these hold:
  - The clip is tagged [👤 you as subject] (only Seedance exposes a separate identity-image role on Higgsfield).
  - The concept centers on natural human body motion — gesture, weight, fatigue, breath, fabric/hair physics, micro-expression. Seedance was purpose-built for human realism.
  - You want lip-sync, audio reference, or genre-tuned color/mood.
- kling3_0 (use for kinetic action when narrator is NOT the subject). Released Feb 2026. Strengths are photorealism, native 4K (mode: "4k"), kinetic motion realism, and physical momentum (running, falling, fast camera tracking, impact). Pick this when all of these hold:
  - The clip is not [👤 you as subject]. Higgsfield exposes Kling 3.0 with only start_image/end_image roles — there is no dedicated identity-image role, so a headshot reference cannot be attached cleanly. Don't use Kling 3.0 for narrator-as-subject clips.
  - The concept is kinetic / action-heavy — rapid camera moves, shattering glass, fast tracking, physical impact, racks of servers spinning up, water/fluid dynamics, vehicles in motion, sparks, debris.
  - You want the 4K resolution bump (Seedance caps at 1080p).
  If the concept is non-narrator but contemplative (a slow lamp flicker, a single hand on a worn book, an empty desk at golden hour), stay with Seedance. Kling 3.0's kinetic strengths don't apply there; leaving Seedance for a non-narrator concept pays off only when motion is genuinely the point.
- Avoid kling_2_6 and earlier — worse identity drift, missing end_image.
- Avoid wan* for narrator-as-subject — weaker identity preservation than Seedance, no image role.
If the user explicitly asks for a specific model (Veo 3.1 for ultra-realistic, Minimax Hailuo for facial emotion + physics, etc.), call models_explore (action: recommend, query: "8 second cinematic b-roll, text-to-video, movie-like") and honor their pick.
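
The model choice above reduces to a small decision function. This is a sketch, not Higgsfield's logic: `pick_model` and its boolean inputs are hypothetical, and the judgment about whether a concept is kinetic still has to be made by reading the concept.

```python
def pick_model(narrator_as_subject, kinetic, wants_4k, requested=None):
    """Return a model id following the defaults described above."""
    if requested:
        # An explicit user pick is honored as-is.
        return requested
    if not narrator_as_subject and kinetic and wants_4k:
        # Kling 3.0 only when all three conditions hold.
        return "kling3_0"
    # Everything else, including contemplative non-narrator clips.
    return "seedance_2_0"
```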

Build the prompt following Higgsfield's prompt rules (see Higgsfield Prompt Rules below). Layered structure:

SCENE: <opening setup — shot type, framing, location>
SUBJECT: <who/what is in frame, with physical detail and micro-expressions>
ACTION: <chronological beats with timing markers — "For the first 2 seconds...", "At 4s, suddenly...">
CAMERA: <specific camera verb — dolly in, orbit, handheld, crash zoom, FPV, locked-off>
LIGHTING & LOOK: <natural / practical / shallow depth of field / film stock reference>
FILM LOOK: live-action footage, shot on 35mm or Arri Alexa, 24fps, anamorphic-style depth of field, natural film grain, motivated lighting, color-graded like a feature film. NOT animation, NOT CGI, NOT cartoon, NOT illustration, NOT 3D render, NOT video game, NOT motion graphics.
END MOOD: <how the shot resolves emotionally>

Keep it tight: 100–200 words total. No on-screen text. One visual idea. The FILM LOOK block is required and verbatim in every prompt — do not paraphrase or shorten it.

Resolve the highest-quality params for the chosen model. Call models_explore (action: get, model_id: <chosen>) and inspect its parameters array:
- If the model exposes a resolution parameter, pick the highest option (e.g. Seedance 2.0 / 1.5 Pro / Wan 2.6 / Wan 2.7 → 1080p; Minimax Hailuo → 1080).
- If the model exposes a quality parameter, pick the highest option (e.g. Veo 3.1 → ultra; Wan 2.6 → 1080p).
- If the model exposes a mode parameter, pick the highest-quality option. For kling3_0 that's 4k (above pro and std). For Cinema Studio Video v2 and similar, that's pro. Never pick fast / lite modes — they trade quality for cost.
- If the model exposes a genre parameter (Seedance 2.0, Cinema Studio Video v2), pick the option that best matches the clip's emotional tone — drama is the safe default for cinematic b-roll; use action, noir, horror, epic, suspense, intimate, etc. when the clip's mood clearly calls for it. Never use auto — it can drift away from the live-action film look. Never use comedy unless the manuscript is explicitly comedic, since it can pull the model toward stylized/cartoon territory.
- If the model exposes none of these, send no extra param and accept the model's default.
- Skip marketing_studio_video for b-roll — it caps at 720p.
Note on cost: higher resolution can roughly double credit consumption (e.g. Seedance 2.0 8s ≈ 48 credits at 720p vs. ≈ 96 credits at 1080p). If balance shows credits would drop below ~200 after the generation, surface a heads-up before calling generate_video.
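
The heads-up rule is simple arithmetic. A minimal sketch, assuming the balance and the estimated cost are already known; the function name and the balance of 250 in the example are hypothetical, and the 48 / 96 credit figures are the approximate ones from the note above.

```python
def needs_heads_up(balance, cost, floor=200):
    """True if generating would leave fewer than `floor` credits."""
    return balance - cost < floor
```

With a balance of 250, a 1080p Seedance clip (≈ 96 credits) leaves 154 and triggers the heads-up, while the 720p estimate (≈ 48 credits) leaves 202 and does not.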
Call mcp__higgsfield__generate_video with:
- model: "seedance_2_0" (or the chosen model)
- prompt: <the prompt above>
- params.duration: 8 (or nearest allowed; the server clamps)
- params.aspect_ratio: "16:9"
- The highest-quality field(s) resolved in the quality-params step above (e.g. params.resolution: "1080p", or params.quality: "ultra", or params.mode: "pro").
- If a headshot was resolved at the start of this step, also pass params.medias: [{role: "image", value: "<headshot-uuid>"}]. For models whose medias[].roles doesn't include image but does include start_image (check models_explore output), use start_image instead. The skill already restricts narrator-as-subject clips to identity-friendly models (Seedance), so image will normally be the right role.
Prompt adjustment when a headshot is attached: rewrite the SUBJECT block to lean on the reference image rather than describing a generic person. Use phrasing like "the person in the reference image" or "the man from the reference" instead of inventing height/build/hair details that might fight the reference. Keep micro-expressions, body language, and wardrobe direction — those guide performance, not identity.

Reference also carries scene context — describe only what changes. Seedance pulls the surrounding room from the reference image (furniture, walls, ambient lighting, art) when you don't override it. If you describe a competing scene ("dim home-office, leather chair, warm desk lamp"), the model sides with your prompt and ignores the reference's room. So:
- Want the narrator's real environment? Keep the SCENE block minimal — describe only the props/elements you're adding (e.g., "two monitors on a wide desk, a wheeled office chair") and let the reference carry the room. Don't redescribe walls, furniture, ambient lighting, or general atmosphere.
- Want a deliberately different setting (an arcade, an operating theatre, a mountain ridge)? Then describe the new scene fully — the model will follow your prompt and override the reference's room.
- Want similar-but-different? Don't. Either match the reference exactly (by under-describing) or pick a clearly distinct setting. "Almost the same" produces uncanny-valley discrepancies the viewer will spot.
This applies to LIGHTING too: if you don't specify lighting, the reference's lighting carries. Only specify lighting changes when the clip's new elements (e.g., a screen showing a red dashboard) require their own motivated light.
Poll mcp__higgsfield__job_status with sync: true first; if still running, wait poll_after_seconds and retry. Typical video time: 60–180s.
On terminal success, extract the result video URL from the job result.
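
The polling loop can be sketched with the tool call injected as a plain callable. Everything here is an assumption for illustration: `fetch_status` stands in for the job_status call, the shape of the status dict (other than poll_after_seconds, which the text names) is hypothetical, and the caller supplies the predicate that decides when a status is terminal.

```python
import time

def wait_for_job(fetch_status, is_done, sleep=time.sleep, max_polls=60):
    """Poll until `is_done(status)` is true, then return that status.

    fetch_status: zero-argument callable returning the latest status dict.
    is_done:      predicate deciding whether a status is terminal.
    sleep:        injectable so tests do not actually wait.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if is_done(status):
            return status
        # Wait for the interval the server suggested, defaulting to 5 seconds.
        sleep(status.get("poll_after_seconds", 5))
    raise TimeoutError("job did not finish within max_polls attempts")
```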

If balance shows insufficient credits before generating, surface that to the user and stop.

Step 5 — Save to ./tmp and Open in QuickTime

Slugify the suggestion title (lowercase, hyphens, alphanumeric only).
Target path: ./tmp/<manuscript-stem>-broll-<slug>.mp4 (e.g. ./tmp/claude-routines-broll-keyboard-storm.mp4). If the file exists, append -v2, -v3, etc.
Download the video via curl -L -o <path> <url>.
Open in QuickTime: open -a "QuickTime Player" <path>.
Print the local path and the original Higgsfield job URL/ID so the user can reopen later.
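
The naming rule from this step can be sketched as below. The helper names are hypothetical, and the `exists` hook is an assumption added so the version bump can be exercised without touching disk.

```python
import re
from pathlib import Path

def slugify(title):
    """Lowercase, keep alphanumerics, collapse everything else to single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def target_path(manuscript, title, tmp_dir="./tmp", exists=None):
    """Build <tmp>/<manuscript-stem>-broll-<slug>.mp4, appending -v2, -v3, ...
    while the candidate already exists."""
    exists = exists or (lambda p: p.exists())
    base = f"{Path(manuscript).stem}-broll-{slugify(title)}"
    candidate = Path(tmp_dir) / f"{base}.mp4"
    version = 2
    while exists(candidate):
        candidate = Path(tmp_dir) / f"{base}-v{version}.mp4"
        version += 1
    return candidate
```

Quote the resulting path when passing it to curl or open, since manuscript stems may contain characters the shell treats specially.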

Step 6 — Keep or Discard

After the clip is open in QuickTime, ask:

Keep this clip? (yes / no / refine)

yes → mark the clip in the manuscript (see Step 6a — Mark Clip in Manuscript below), then go to Step 7.
no → leave the manuscript untouched, delete the clip file from ./tmp/ (rm <path>), then go to Step 7. Rejected clips never persist on disk.
refine → go to Refinement path in Step 7. The current clip will be deleted only after the refined version successfully lands on disk (so you keep something to compare against if generation fails).

Step 6a — Mark Clip in Manuscript

When the user says yes:

Compute CLIP_NAME = the basename of the saved file without extension (e.g. claude-routines-broll-keyboard-storm or ...-v2).
Locate the paragraph in the manuscript that the clip was generated from — the one whose narration excerpt appeared in the suggestion.
Determine the anchor:
- If the suggestion's narration excerpt starts at the paragraph's first sentence, the anchor is the whole paragraph → use the bare form: TODO: clip: CLIP_NAME.
- If the excerpt starts mid-paragraph (covers only a sub-segment), include an in-paragraph anchor so the video editor knows where the clip kicks in: TODO: clip: CLIP_NAME (<first 3–5 words of the excerpt>...). Quote the words verbatim from the manuscript and end with ... to signal it's a fragment, not the full sentence.
Insert the resulting TODO line on its own line, immediately above the paragraph's first line, with a blank line above the TODO if there isn't one already.

Critical rules:
- The TODO line MUST be on its own line. Never inline it inside a sentence.
- It MUST go above the paragraph's first line. Never inside the paragraph (between sentences) and never below it. The in-paragraph anchor lives in the parenthetical on the TODO line, not in the paragraph body.
- If the paragraph already has TODO entries above it (e.g. TODO: Logo:, TODO: Screen:), append the new TODO: clip: line to that existing TODO group — keep the group contiguous, with the TODO: clip: line at the bottom of the group, directly above the paragraph.
- If a TODO: clip: line already exists for that paragraph (refinement case), replace the existing clip name and anchor rather than adding a second line.
- A single paragraph CAN host multiple TODO: clip: lines if each targets a different sub-segment — list them in the order their anchors appear in the paragraph, each on its own line, all above the paragraph.
Strip conflicting overlays within the clip's coverage. A b-roll replaces the talking head full-screen during its runtime, so any other on-screen overlay inside the coverage range must be removed.

In-paragraph overlays (always strip within coverage):
- Bold phrases (**...**) — strip the ** markers, keep the words.
Above-paragraph TODO overlays (strip ONLY when whole paragraph is covered):
- TODO: Logo: lines
- TODO: Screen: lines
- TODO: Lower-third: lines
- TODO: Citation: lines
These TODO lines sit above the paragraph and apply to it as a whole. Strip them only when the clip covers the entire paragraph (no parenthetical anchor). For mid-paragraph clips, leave them — the editor can place those overlays during the un-clipped portion.

Coverage range for bold:
- Whole-paragraph clip (no parenthetical anchor): the whole paragraph.
- Mid-paragraph clip (parenthetical anchor): the suggestion's narration excerpt — from the anchor's first word to the excerpt's last word.
What stays: Markdown links inside the paragraph body (e.g. [Anthropic](https://anthropic.com) produced by the TODO: Logo: rule) are not visual overlays. Keep them intact.

Apply all stripping in the same edit pass as the TODO: clip: insertion. In your response, list which overlays were stripped so the user can spot-check (e.g. "Stripped TODO: Logo: Anthropic and TODO: Screen: 200K context window. Unbolded flow state — all covered by the b-roll.").

Overlays outside the coverage range stay as-is.
Use the Edit tool with enough surrounding context to make the match unambiguous.

Example — fresh insertion, whole paragraph:

Before:

We had to rewrite the whole pipeline from scratch in a single weekend.

After:

TODO: clip: claude-routines-broll-keyboard-storm

We had to rewrite the whole pipeline from scratch in a single weekend.

Example — fresh insertion, mid-paragraph anchor with bold strip:

Suggestion's narration excerpt: "You are still writing code. You are still driving. The AI just makes you faster. It is a flow state amplifier." — anchor starts at "You are still writing"; coverage runs through "amplifier".

Before:

It works because nothing about your workflow actually changes. You are still writing code. You are still driving. The AI just makes you faster. It is a **flow state** amplifier. The learning curve is basically zero. Install an extension, keep coding.

After:

TODO: clip: ide-vs-tui-agents-broll-flow-state (You are still writing...)

It works because nothing about your workflow actually changes. You are still writing code. You are still driving. The AI just makes you faster. It is a flow state amplifier. The learning curve is basically zero. Install an extension, keep coding.

Note: **flow state** lost its ** markers because it falls inside the clip's coverage range. The opening sentence ("It works because nothing about your workflow…") and the trailing sentences ("The learning curve…", "Install an extension…") are outside the coverage range, so any bold there would have been preserved. Report back to the user: "Unbolded flow state — covered by the b-roll."

Example — whole-paragraph clip strips conflicting above-paragraph overlays:

Before:

TODO: Logo: Anthropic, https://anthropic.com
TODO: Screen: 200K context window

[Anthropic](https://anthropic.com)'s Claude can now hold an entire codebase in context.

After:

TODO: clip: claude-routines-broll-context-cathedral

[Anthropic](https://anthropic.com)'s Claude can now hold an entire codebase in context.

The TODO: Logo: and TODO: Screen: lines are stripped because the b-roll covers the whole paragraph and would conflict with both visual overlays. The [Anthropic](https://anthropic.com) markdown link stays — it's metadata, not a visual overlay. Report back: "Stripped TODO: Logo: Anthropic and TODO: Screen: 200K context window — paragraph fully covered by the clip."

Example — mid-paragraph clip leaves above-paragraph overlays alone:

If the clip targets only a sub-segment (parenthetical anchor), the above-paragraph TODOs stay because the editor can still place those overlays during the un-clipped sentences:

Before:

TODO: Logo: Anthropic, https://anthropic.com

[Anthropic](https://anthropic.com) ships fast. Their **context window** keeps growing every release.

After (clip only covers "Their context window..." onward):

TODO: Logo: Anthropic, https://anthropic.com
TODO: clip: my-clip (Their context window...)

[Anthropic](https://anthropic.com) ships fast. Their context window keeps growing every release.

The bold **context window** was inside the clip's coverage range, so its ** markers were stripped. The TODO: Logo: line stays — the editor can show the Anthropic logo during the un-clipped first sentence ("Anthropic ships fast.").
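
The anchor and bold-stripping rules from this step can be sketched in code. This is an illustration under assumptions: the helper names are hypothetical, bold markers are assumed balanced and not nested, and the excerpt is assumed to appear verbatim in the paragraph once bold markers are ignored. Stripping of above-paragraph TODO lines is not covered.

```python
import re

BOLD = re.compile(r"\*\*(.+?)\*\*")

def unbold(text):
    """Drop every ** marker, keeping the words."""
    return text.replace("**", "")

def clip_todo(clip_name, paragraph, excerpt, anchor_words=4):
    """Bare TODO when the excerpt opens the paragraph, otherwise a
    parenthetical anchor made of the excerpt's first words."""
    anchor = " ".join(unbold(excerpt).split()[:anchor_words])
    if unbold(paragraph).lstrip().startswith(anchor):
        return f"TODO: clip: {clip_name}"
    return f"TODO: clip: {clip_name} ({anchor}...)"

def strip_bold_in_coverage(paragraph, excerpt=None):
    """Unbold phrases inside the coverage range only.

    excerpt=None means a whole-paragraph clip, so everything is unbolded.
    """
    if excerpt is None:
        return unbold(paragraph)
    plain = unbold(paragraph)
    start = plain.find(unbold(excerpt))
    if start < 0:
        return paragraph  # excerpt not found: leave the paragraph untouched
    end = start + len(unbold(excerpt))
    out, removed, last = [], 0, 0
    for m in BOLD.finditer(paragraph):
        # Position of the bold phrase in the unbolded text.
        inner_start = m.start() - removed
        inner_end = inner_start + len(m.group(1))
        out.append(paragraph[last:m.start()])
        inside = start <= inner_start and inner_end <= end
        out.append(m.group(1) if inside else m.group(0))
        removed += 4  # two ** markers per bold phrase
        last = m.end()
    out.append(paragraph[last:])
    return "".join(out)
```

Bold phrases outside the coverage range keep their markers, which matches the rule that overlays outside the range stay as-is.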

Step 7 — Loop: Refine or Next

Never auto-launch the next clip — even in auto mode, even after a keep. Calling generate_video always requires a fresh, explicit user pick (a number from the list, or refinement feedback). Auto mode applies to housekeeping (polling jobs, downloading finished clips, marking the manuscript on keep, deleting discarded clips); it does not apply to launching new generations. Generations cost credits and the user often wants to incorporate observations from the just-finished clip into the next prompt — speculative pipelining defeats that. After every kept/discarded/refined clip, stop and re-display the list, then wait.

Always re-display the full suggestion list before asking what's next — even if the user just kept or discarded a clip. Render the full Step 3 template for every remaining suggestion (### Suggestion N — title, Section:, Narration excerpt:, Why this segment:, Concept (8s):) — never collapse the list to a one-line title summary or a parenthetical aside ("remaining: 2, 3, 4..."). Mark completed suggestions with ✅ and discarded ones with ❌, but keep the full body for every still-available suggestion. The user picks based on the cinematic concept, not the title — they need to see the concept again to choose.

Then ask:

What's next?

Refine the just-generated clip (give me feedback and I'll regenerate)

Generate another suggestion (provide the number from the list above)

Done

Refinement path: Take the user's feedback, adjust the prompt (preserve style + metaphor; tweak action, lighting, framing, pacing per feedback), regenerate, save as -v2.mp4 (incrementing on subsequent refinements), open in QuickTime. Once the new version is on disk, delete the previous version (rm <previous-path>). If generation fails before the new version lands, leave the previous version intact and surface the error. Then return to Step 6 (Keep or Discard) for the new version. If the user keeps -v2, update the existing TODO: clip: line in the manuscript to point to the new clip name.

Next-suggestion path: The list is already on screen. Once the user picks a number, repeat from Step 4.

Done path: Print a summary: which suggestions were kept (with the manuscript paragraph each was inserted above and the local path of the kept clip) and which were discarded (their files have already been deleted, so no path to report).

Then check whether the manuscript has a cold-open. Look for either of these at the top of the manuscript (above the first ## Header):

A TODO: intro-clip: line (modern marker), or
A TODO: clip: line annotated with (before the narration starts) or similar (older convention).
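
A minimal sketch of this check, assuming the manuscript is available as a list of lines. The function name is hypothetical, and only the literal annotation "(before the narration starts)" is recognized; the "or similar" variants of the older convention would still need a manual look.

```python
import re

COLD_OPEN = re.compile(r"^TODO: (intro-clip:|clip:.*\(before the narration starts\))")

def has_cold_open(lines):
    """True if a cold-open marker appears above the first '## ' header."""
    for line in lines:
        if line.startswith("## "):
            return False  # reached the first header without finding a marker
        if COLD_OPEN.match(line.strip()):
            return True
    return False
```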

If neither exists, suggest running the opener skill as a next step:

The manuscript doesn't have a cold-open yet. To add a cinematic intro effect that ends at the first frame of your recorded narration (and seamlessly cuts to live footage), run:
/manuscript-broll-opener <manuscript-path>
The opener skill brainstorms intro effect ideas, gives you recording instructions for any "special" framing the chosen effect needs, and generates the AI clip with end_image set to your first-frame screenshot.

If a cold-open already exists, do not suggest the opener skill — just print the standard summary.

Higgsfield Prompt Rules

Sourced from Higgsfield's official cinematic prompt guide. Apply these in every generated prompt and every refinement.

Do:

Short, direct sentences. Higgsfield reacts better to commands than to descriptive paragraphs that force it to guess. "Dolly in slowly" beats "cinematic movement."
Specific camera verbs: dolly in, dolly out, orbit, handheld, tracking, crash zoom, FPV, locked-off, mounted on dashboard.
Explicit timing cues: "For the first 2 seconds, …", "At 4s, suddenly …", "By 7s, …" — sequence calm → shift → payoff.
Physical detail over abstraction: body trembles, breath quickens, jaw tenses, eyes widen. Show emotion through the body, not adjectives.
Layered separation: one prompt = one task. Camera motion lives in the video prompt only. Don't try to change identity AND move the camera in the same shot — generate identity/keyframe first if needed, then animate.
Mood close: end the prompt with the resolving emotional tone (tense, hopeful, unsettled, triumphant).

Don't:

Vague style words: ❌ "cinematic", "dynamic", "epic", "beautiful". They give the model nothing to act on.
Mixed instructions: don't combine identity edits with camera moves in one prompt.
Stacking multiple visual styles in one shot.
Repeating the same instruction in different words.
Describing what you don't want at length — Higgsfield biases toward what's stated. Keep negatives minimal: "no text on screen" is fine; long don't-lists waste budget.
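A simple pre-flight check can catch the vague style words listed above before a prompt is sent. This is a sketch only: the function name `find_vague_words` is hypothetical, the word list is limited to the four examples given here, and the check is assumed to run on the concept-specific part of the prompt rather than on any fixed template text.

```python
import re

# The four vague style words called out in the Don't list.
VAGUE_WORDS = ("cinematic", "dynamic", "epic", "beautiful")


def find_vague_words(prompt: str) -> list[str]:
    """Hypothetical helper: return the vague style words present, as whole words."""
    lowered = prompt.lower()
    return [w for w in VAGUE_WORDS if re.search(rf"\b{w}\b", lowered)]
```

For example, `find_vague_words("Epic, cinematic dolly in.")` returns `["cinematic", "epic"]`, while a prompt built from camera verbs and physical detail returns an empty list.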

Common failure modes (learned from real refinements):

A constraint the model can "almost honor" will be softened. "Wrists tied with tape" → the model puts the hands in the lap with token tape, and they still drift to the keyboard. "Hands tied behind the chair-back" → reaching the keyboard is physically impossible, so the model has to honor it. When the concept depends on a constraint, specify it in a way the model can't quietly soften.
One major action beat per 8 seconds, not three. A clip with sit→stand→walk→sit→hands-on-keyboard runs out of time and ends mid-arc. Single continuous physical actions (one push-off, one walk, one rotation, one chair-roll, one reach) fit cleanly in 8s. If the concept needs 3+ posture changes, cut beats until one major beat remains. Eyes/face micro-changes can layer on top of the main beat without costing time.
Describe screen content as a visual effect, not as text. "Binary digits scrolling, no code, no syntax" still gets you code-looking text — the model renders the most "screen-like" thing it knows. "Vertical streaks of white motion blur on black, like a long-exposure photograph of fast-moving white dots" gets you the streaks you wanted. Reframe screen content as light effects (streaks, washes, glows, pulses) rather than as characters or words. Same for terminals: "a black screen with one small green pulse of cursor light" beats "a terminal showing $ prompt".
"Person + screen" geometry must be physically possible — commit to a single camera position. When a clip contains both a person and a screen they're interacting with, the prompt must commit to one of three valid camera positions. Without an explicit choice, the model resolves the impossible geometry by rendering both the face AND a lit screen visible to the camera at angles that can't exist (a recurring failure mode the viewer notices instantly). Same logic applies to any "person + visible-only-from-one-side object" pair (phones, books, clipboards, cards being read).
- Over-the-shoulder (OTS): camera behind the person. We see back of head/shoulders + screen content. Face is hidden — emotion must be carried by body language (shoulder tension, head tilt, hand posture). Use when the screen IS the visual story.
- Face-on (across-the-desk): camera faces the person. The screen is in front of them from their POV, so the camera sees the screen's back (shell, stand, cable) or only its reflected glow on the face/wall. The lit panel is never visible to the camera. Use when emotion (a smile, realization, glance) IS the visual story.
- Profile (~90° side angle): camera off to the side. Both the person and the screen are visible from the side — person's profile + screen's edge or oblique side view. The lit panel is partially visible at an angle but is not framed square-to-camera. Use when both the person's reaction AND the act of looking-at-the-screen matter, and a hint of screen content is enough.
- If a concept genuinely needs the face and a square-to-camera lit screen visible simultaneously, redesign the concept — don't try to prompt around it. The model will fail.
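The three camera positions can be read as a small lookup table: each one fixes whether the face is visible and how much of the lit panel the camera sees. The sketch below encodes that table and filters it by what a concept needs. The names `CAMERA_POSITIONS` and `valid_camera_positions` are hypothetical, and treating a profile view as "face visible" is a simplification.

```python
from typing import Literal

ScreenNeed = Literal["none", "hint", "full"]
_SCREEN_LEVEL = {"none": 0, "hint": 1, "full": 2}

# name -> (face visible to camera, how much of the lit panel the camera sees)
CAMERA_POSITIONS: dict[str, tuple[bool, ScreenNeed]] = {
    "over_the_shoulder": (False, "full"),
    "face_on": (True, "none"),
    "profile": (True, "hint"),
}


def valid_camera_positions(needs_face: bool, needs_screen: ScreenNeed) -> list[str]:
    """Hypothetical helper: positions that satisfy the concept; empty means redesign."""
    required = _SCREEN_LEVEL[needs_screen]
    return [
        name
        for name, (shows_face, shows_screen) in CAMERA_POSITIONS.items()
        if (shows_face or not needs_face) and _SCREEN_LEVEL[shows_screen] >= required
    ]
```

A concept that needs both the face and a fully visible screen yields an empty list, which corresponds to the "redesign the concept" case.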

Live-action only — non-negotiable: Every clip must look like a frame pulled from a real motion picture. Treat this as a hard constraint, not a stylistic preference.

The verbatim FILM LOOK block from the prompt template is mandatory in every prompt.
Subjects are always real humans, real animals, or real physical objects in real locations. No anthropomorphic characters, no mascots, no avatars.
Lean on real-world cinematic references: a hand-held documentary shot, a slow studio dolly, a kitchen-sink drama interior, a film-noir alley, a 70s thriller car interior, an Arri Alexa interview setup. Concrete > abstract.
Lighting, lens, and color must be motivated and physical: practical lamps, sunlight through blinds, sodium streetlights, candlelight, overcast diffuse. Never "magical glow," "neon overlays," "pulsing energy," or other CGI-coded language.
If the metaphor is abstract (e.g. "the bottleneck"), translate it into a concrete real-world image (a single technician at a workstation in a darkened server room) instead of a stylized rendering.
Keep the subject grounded — physical effort, fatigue, breath, weight. Reality has texture; CGI flattens it.

Model defaults (quick lookup; the full decision logic lives in Step 4, sub-step 1):

Default — narrator-as-subject OR human-body-motion focus: seedance_2_0. Identity via separate image role, max 1080p, genre control. Wins on natural body motion, fabric/hair physics, and lip-sync.
Kinetic action, narrator NOT the subject: kling3_0. Native 4K (mode: "4k"), strongest kinetic motion realism and physical momentum on Higgsfield. Higgsfield exposes only start_image/end_image roles — no separate identity-image role, so don't pair with a headshot.
Marketing/product/ads with a URL: marketing_studio_video (call show_marketing_studio first).
Specialty: use models_explore (action: recommend) when the user wants something specific — e.g. Veo 3.1 for ultra-cinematic / top-tier realism, Minimax Hailuo for facial emotion + physics, Wan 2.7 for audio-reference sync without identity needs.
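As a rough illustration, the quick lookup can be written as a small function. The model identifiers come from the list above. The function name, its parameters, and the precedence between the branches are assumptions made for this sketch, and the specialty case (handled through `models_explore`) is deliberately left out.

```python
from typing import Optional


def pick_model(
    *,
    narrator_is_subject: bool,
    kinetic_action: bool,
    marketing_url: Optional[str] = None,
) -> str:
    """Hypothetical helper mirroring the quick lookup; not the full Step 4 logic."""
    if marketing_url:
        # Marketing/product/ads with a URL (show_marketing_studio is called first).
        return "marketing_studio_video"
    if kinetic_action and not narrator_is_subject:
        # Kinetic action where the narrator is not the subject.
        return "kling3_0"
    # Default: narrator-as-subject or human-body-motion focus.
    return "seedance_2_0"
```

Note that `kling3_0` is never chosen when the narrator is the subject, which matches the warning about it having no separate identity-image role.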

Duration: up to 15s; this skill defaults to 8s. The server clamps the requested duration to the nearest value each model allows.

Style Reminders

Always live-action. Every clip must look like footage from a real motion picture — see the Live-action only — non-negotiable rule under Higgsfield Prompt Rules. No animation, CGI, cartoon, illustration, 3D render, motion graphics, or video-game look, ever.
Real actors, real props, real locations. Real physical lighting.
One clear visual metaphor per clip — and that metaphor is rendered as something concrete and physical, never abstract glow/pulses/particles.
Zero text on screen — no captions, no labels, no subtitles.
0–3s setup, 3–6s transformation, 6–8s payoff.
Maximum 3–4 visual elements.
Avoid screen-recording-style content (terminal, UI, code) — those segments were already excluded in Step 2.
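The 0–3s / 3–6s / 6–8s structure maps directly onto explicit timing cues in the prompt. The sketch below shows one way to assemble the three beats plus a closing mood line. It assumes the default 8s clip, the function name `timed_beats` is hypothetical, and it is not a substitute for the prompt template in Step 4.

```python
def timed_beats(setup: str, transformation: str, payoff: str, mood: str) -> str:
    """Hypothetical helper: join three beats and a mood close into timed sentences."""
    parts = {"setup": setup, "transformation": transformation,
             "payoff": payoff, "mood": mood}
    for name, text in parts.items():
        if not text.strip():
            raise ValueError(f"{name} beat is empty")
    return " ".join([
        f"For the first 3 seconds, {setup.strip().rstrip('.')}.",   # 0-3s setup
        f"At 3s, {transformation.strip().rstrip('.')}.",            # 3-6s transformation
        f"By 6s, {payoff.strip().rstrip('.')}.",                    # 6-8s payoff
        f"Mood: {mood.strip().rstrip('.')}.",                       # mood close
    ])
```

Keeping each beat to a single physical action keeps the clip within the one-major-beat budget described under the common failure modes.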

Failure Modes

No clean segments found: report which overlays are saturating the manuscript and suggest the user run manuscript-analyze first or relax bold/TODO density.
Higgsfield job fails: surface the error message verbatim, offer to retry with a simplified prompt.
Download fails: keep the Higgsfield URL visible and let the user retry the download manually.
