@vfarcic
Created June 3, 2026 17:56
name: manuscript-broll-opener
description: Generate a cinematic 'cold open' clip that ends precisely at the first frame of recorded talking-head footage. Two-phase: brainstorm effect + suggest intro-narration tweaks to tighten the seam + give recording instructions, then resume with the user's first-frame screenshot to generate via Higgsfield with end_image. Tracks accepted concepts across videos to avoid repeats.

Generate Cold-Open Clip for Manuscript via Higgsfield

Propose a cinematic intro effect (materialize, ceiling zoom, rack focus, lamp reveal, etc.) for the very start of a video. The clip plays before recorded narration and ends at the exact first frame of the recorded talking-head footage, creating a seamless cut into live narration.

The skill runs in two phases:

  1. Brainstorm phase — propose effect ideas (excluding any used in prior videos), the user picks one, the skill outputs recording instructions and saves pending state, then pauses.
  2. Continue phase — the user records talking-head per the instructions, takes a screenshot of its first frame, then signals readiness with the screenshot path. In the same conversation that's just a natural-language handoff ("continue, screenshot at ./tmp/x.jpg") — no flag needed. Across sessions, re-invoke with --continue <screenshot-path>. Either way, the skill generates a clip via Higgsfield with that screenshot as end_image.

Input

  • Manuscript path: $1
  • Continue flag (optional): --continue <screenshot-path> to resume from pending state after recording.

Workflow Overview

  1. Resolve manuscript path.
  2. Detect mode: brainstorm (no --continue) or continue (--continue + screenshot).
  3. Brainstorm: read history, propose 5–7 effect concepts (none repeating prior accepted concepts). On user pick: evaluate the existing intro narration against the chosen concept, propose a targeted narration tweak if one would tighten the seam, ask the user whether to apply it before recording. Then output recording instructions, save pending state, exit.
  4. Continue: read pending state, upload screenshot as end_image, generate via Higgsfield, save to ./tmp/, open in QuickTime.
  5. Keep / refine / discard. On accept: append to history, insert TODO: intro-clip: at top of manuscript, clear pending state.

Step 1 — Resolve Manuscript Path

If $1 is empty, ask: "Which manuscript should I generate the opener for? Please provide the absolute path." Validate the file exists.

Step 2 — Detect Mode

The pending-state file is ./tmp/.intro-effect-pending.json:

{
  "manuscript": "<absolute path>",
  "concept_slug": "<slug>",
  "concept_summary": "<one-line description>",
  "prompt_template": "<full Higgsfield prompt>",
  "narrator_as_subject": true,
  "recording_instructions": "<verbatim instructions>",
  "genre": "drama"
}

Mode resolution:

  • --continue <screenshot-path> passed AND pending state exists → continue mode (Step 4).
  • --continue passed but no pending state → error: "No pending intro effect found. Run the skill without --continue to brainstorm first." Exit.
  • No --continue, but pending state already exists → ask the user: "A pending intro effect already exists for <concept_slug>. Discard and brainstorm fresh, or continue with the existing pending effect (provide a screenshot path)?" If discard, delete the pending file and proceed with brainstorm. If continue, ask for screenshot path and jump to Step 4.
  • No --continue, no pending state → brainstorm mode (Step 3).
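The four cases above reduce to a small decision table. A minimal sketch, assuming the flag has already been parsed into an optional screenshot path; the function name and the returned mode labels are illustrative and not part of the skill:

```python
from typing import Optional


def resolve_mode(continue_screenshot: Optional[str], pending_exists: bool) -> str:
    """Map the four mode-resolution cases to a label (labels are hypothetical)."""
    if continue_screenshot is not None:
        # --continue was passed: valid only when a pending-state file exists.
        return "continue" if pending_exists else "error_no_pending"
    # No --continue: an existing pending file means the user must choose first.
    return "ask_discard_or_continue" if pending_exists else "brainstorm"
```

The same-session conversational handoff described next maps onto the first branch: treat the path the user mentions as `continue_screenshot`.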

In-session continuation does NOT need the --continue flag. The pending-state file is the single source of truth — within the same conversation, if you've already run the brainstorm phase and the user later tells you "continue, screenshot is at ./tmp/foo.jpg" (or any equivalent natural-language signal that they've recorded and are ready), read the pending-state file and jump straight to Step 4 with the screenshot path they provided. The --continue flag exists for cross-session re-entry — when the user closes the chat between brainstorming and recording, and starts a fresh conversation later. Within one session, conversational handoff is enough. Do not insist on the flag.

Also check the manuscript: if it already contains a TODO: intro-clip: line, ask: "This manuscript already has an opener: <existing-clip-name>. Replace it with a new one?" If no, exit. If yes, proceed (the existing line will be replaced on accept in Step 6a).

Step 3 — Brainstorm Mode

Step 3a — Read History

The history file is ./.claude/skills/manuscript-broll-opener/intro-effects-used.json:

{
  "effects": [
    {
      "slug": "film-slate-cold-open",
      "manuscript": "higgsfield.md",
      "date": "2026-05-04",
      "concept_summary": "DevOps & AI Toolkit film slate snapping shut, cuts to talking head"
    }
  ]
}

If the file doesn't exist, treat history as empty. Build a set of slugs and concept_summary strings — the next step must avoid proposing anything that matches.
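As a rough illustration of the history read, the sketch below loads the file and builds the two sets. The function name is hypothetical; only the file layout comes from the example above.

```python
import json
from pathlib import Path


def load_used_concepts(history_path: Path) -> tuple[set[str], set[str]]:
    """Return (slugs, concept summaries); both are empty when the file is missing."""
    if not history_path.exists():
        return set(), set()
    data = json.loads(history_path.read_text(encoding="utf-8"))
    effects = data.get("effects", [])
    slugs = {e["slug"] for e in effects if "slug" in e}
    summaries = {e["concept_summary"] for e in effects if "concept_summary" in e}
    return slugs, summaries
```

Exact slug matches can be checked mechanically; judging whether a new concept overlaps an old summary still takes a read of the text.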

Also scan the target manuscript for an existing TODO: clip: line at the very top of the file (above the first ## Header) — older manuscripts may have used TODO: clip: with a (before the narration starts) annotation as an informal cold-open marker. Treat any such existing opener as "already used" so we don't propose the same concept again.

Step 3b — Read the Manuscript, Calibrate to the Destination, Propose 5–7 Concepts

First: read the manuscript to understand the topic, tone, and any objects, metaphors, or settings the narration leans on. The brainstorm should be informed by content — when a thematic tie-in is natural, it deepens the seam between cold-open and live narration.

Second — and this is critical: calibrate to the actual destination frame before brainstorming. Don't propose effects blind. The effects must work for the user's actual recording setup — the props, framing, lighting, and palette of the final talking-head shot.

Critical: do NOT auto-trust files in ./tmp/ during brainstorm. The brainstorm phase typically runs before the user has recorded anything for this video — they're using the brainstorm output to decide what to record. Any last*.jpg / screenshot*.jpg / empty.jpg files already sitting in ./tmp/ are almost always leftovers from a prior video's session, not the destination frame for the current one. Calibrating to those produces concepts that fit an old setup and waste the user's time.

The calibration policy is therefore:

  • If the user explicitly passed a reference image (e.g. --reference ./tmp/foo.jpg, or mentioned a file path in the invocation), read it and use it.
  • Otherwise, ask the user directly before brainstorming: "What does your recording setup look like for this video? (e.g., 'wide studio shot with a microphone in front, bicolor LED lighting, red back panel' or 'tight head-shot at a wooden writing desk with a laptop'). Or, if you have a representative reference frame, point me at it." Don't proceed with assumptions — wrong-aesthetic suggestions waste the user's time.
  • Do not silently pull from ./tmp/ even if matching filenames exist. Stale references from prior sessions are the dominant case, not the exception. If you do notice candidate files there, mention them and ask the user whether they're current before using them — never read them as ground truth on your own.

Note: the ./tmp/ directory IS the trusted source during the continue phase (Step 4), because by then the user has explicitly handed you a screenshot path. That's an entirely different context from this brainstorm step.

When proposing concepts, anchor them in the actual setup. If the destination is a YouTube studio with bicolor LEDs, don't propose "warm window light through a writer's desk." If the destination is a quiet wooden desk, don't propose "vibrant LED panels flicking on." Each concept should make sense as a path into the specific frame the user will record.

Aim for a mix:

  • At least 2 concepts that thematically reference the manuscript's subject. If the video is about AI video generation, the cold-open could involve a film slate, a camera, or an empty editing bay. If it's about distributed systems, racks of servers waking up. If it's about coding, a terminal cursor blinking to life. Pick props or settings that the manuscript itself invokes.
  • The remaining concepts should be content-agnostic: universal cold-open patterns that work regardless of topic. These give the user a fallback when no thematic concept lands.

Each concept must:

  • Be cinematic, live-action, and grounded (see Higgsfield Prompt Rules below).

  • End on a static destination frame — the AI clip's last frame must be reachable from a normal recorded talking-head pose (sitting, looking at camera, hands at rest). Effects that demand the user be mid-motion at the end (walking in, mid-gesture, mid-glance) make the seam unstable and should be avoided.

  • Differ in mechanic from one another — don't propose three variants of the same mechanic. Aim for variety across these patterns, but prioritize dramatic / kinetic effects over subtle ones:

    Tier 1 — dramatic / kinetic / "special" (prefer these):

    • Drop / fall into place — subject falls or drops vertically into the frame, lands in the destination pose.
    • Materialize — empty space → particles assemble into the subject in place.
    • Camera arrival / push-through — camera moves toward / through an obstacle (door, window, glass) → lands on the subject.
    • Pull-back reveal — extreme close-up on an object → camera pulls back rapidly to reveal the wider scene with the subject.
    • Smash / break-through — a barrier (glass, paper, dust cloud) shatters or clears, revealing the subject behind it.
    • Doorway entry — POV through a closed door that opens, camera pushes through into the room with the subject inside.

    Tier 2 — subtle / optical (use sparingly, only when the user has asked for restrained openers):

    • Light reveal — dark room → light source flicks on → subject already in place.
    • Time-lapse settle — environmental shift (dawn → dusk) → ends at present-day stillness.
    • Reflection / screen wake — passive surface shows subject as reflection, then dissolves to direct view.
    • Rack focus alone — pure focus pull from blur to sharp.
    • Color flood — black-and-white frame floods with color over time.

    Why the bias: purely optical or lighting changes (Tier 2) read as "the AI is hiding the transition" — they feel boring even when technically well-executed. Dramatic physical action and strong camera moves (Tier 1) read as intentional, produced, "special." A cold open should feel crafted. Default to proposing 4–5 Tier 1 concepts and at most 1–2 Tier 2 concepts in any brainstorm.

Exclude any concept whose slug or summary matches an entry from history (Step 3a).
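The counting rules for a brainstorm (5–7 concepts, at least 2 thematic, at most 2 from Tier 2, nothing already in history) can be checked before presenting suggestions. This is only a sketch: the `Concept` type and `mix_problems` name are made up for illustration, and it covers slug matches only, not summary overlap.

```python
from dataclasses import dataclass


@dataclass
class Concept:
    slug: str
    tier: int       # 1 = dramatic / kinetic, 2 = subtle / optical
    thematic: bool  # True when the concept ties into the manuscript's subject


def mix_problems(concepts: list[Concept], used_slugs: set[str]) -> list[str]:
    """Return rule violations as text; an empty list means the mix passes."""
    problems = []
    if not 5 <= len(concepts) <= 7:
        problems.append(f"expected 5-7 concepts, got {len(concepts)}")
    if sum(c.thematic for c in concepts) < 2:
        problems.append("fewer than 2 thematic concepts")
    if sum(c.tier == 2 for c in concepts) > 2:
        problems.append("more than 2 Tier 2 concepts")
    repeats = sorted(c.slug for c in concepts if c.slug in used_slugs)
    if repeats:
        problems.append("already used: " + ", ".join(repeats))
    return problems
```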

For thematic concepts, mark the tie-in explicitly in the suggestion — e.g., "Thematic tie-in: the manuscript opens with the film slate metaphor; this effect carries that motif into the cold open." This helps the user judge whether the tie-in is too on-the-nose or just right.

For each candidate, output:

### Suggestion N — <short title>

**Concept (8s):** <one-paragraph cinematic concept; movie-like, real actors/props, no on-screen text>
**Recording requirement:** <exactly what the user needs to do when filming the talking-head shot — pose, framing, props, lighting; or "no special instructions, just record normally">
**End-frame fit:** <Excellent / Strong / Medium — how cleanly the AI clip can land on the recorded first frame>
**Identity:** <Narrator visible / Narrator not visible — does the destination frame include the user's face>

Then ask: "Which suggestion should I prepare? (number)"

If the user counters with their own multi-scene narrative — typically a montage across multiple environments ("asleep in bed → phone rings → walk to kitchen → coffee machine → sit at desk") — this cannot be done as a single 8s AI clip and you should not try. Identity preservation collapses across 3+ environments and the per-beat time would be ~1.5s, which reads as frantic. Instead, switch to the Multi-Clip Cold-Open Pipeline (separate section below) and exit the normal Step 3c/4 flow. The multi-clip path doesn't use ./tmp/.intro-effect-pending.json — it's driven directly through the conversation.

Step 3c — Evaluate Narration, Save Pending State, Exit

When the user picks a number:

  1. Build the full Higgsfield prompt for the chosen effect using the layered structure in Higgsfield Prompt Rules. The destination is determined by end_image at generation time — describe the final pose / position in the ACTION block but do not describe a literal image. If Identity: Narrator visible, use phrasing like "the man from the reference image" in SUBJECT.

  2. Pick a genre matching the effect's mood (drama for somber/grounded, action for kinetic, noir for dark/moody, etc. — never auto, never comedy unless the manuscript is comedic).

  3. Slugify the suggestion title: lowercase, hyphens, alphanumeric only.

  4. Evaluate the intro narration against the chosen concept (this is a first-class output of phase 1, not an afterthought). The cold-open and the first words of recorded narration are functionally a single creative unit — the transition lands only if both sides know about each other. Don't skip this step; the user reads the result before recording, which is the only chance to tweak the words they'll say.

    Read the manuscript's intro section (the content between the first ## Header and the next ## Header, or end-of-file if there's no second header). Identify the first 1–3 sentences of actual spoken narration — those are the words the viewer hears immediately after the AI clip cuts to live footage.

    Ask three diagnostic questions:

    • Energy match. Does the narration's opening energy match where the cold-open lands? A kinetic drop-into-frame into a subdued "In this video, I'm going to walk through…" is a mismatch. A noir-lit slow reveal into a chirpy "Hey everyone!" is a mismatch.
    • Seam tightness. Does the narration reference, riff on, or land cleanly on the cold-open's metaphor — or does it ignore something the viewer just spent 8 seconds watching? Either is a valid creative choice, but the user should make it deliberately.
    • Entry-word strength. Is the first word or phrase the strongest possible entry point given what the viewer just saw? Sometimes the existing third sentence is the real opening, and the first two are throat-clearing.

    If you spot a clear improvement, propose a specific edit:

    Suggested narration tweak:
    
    - **Current opening:** "<verbatim first sentence(s) from the manuscript>"
    - **Suggested:** "<your proposed rewrite>"
    - **Why:** <one sentence — name what the cold-open is doing and how the new opening lands on it>
    

    If the existing narration already lands well on this concept, say so explicitly: "Current narration opening lands cleanly on this concept — no tweak suggested." Do not invent a tweak when none is needed — false positives waste the user's time and erode trust in this output.

    If the manuscript has no intro narration (e.g., only a ## Header with no following prose), skip this step entirely and tell the user: "No intro narration found in the manuscript; nothing to evaluate."

  5. If a tweak was suggested, ask the user: "Apply this narration tweak to the manuscript before you record?"

    • yes → apply the edit via the Edit tool, replacing the current opening with the suggested rewrite. Confirm: "Applied. Record with the new opening."
    • no → leave the manuscript untouched. Confirm: "Recording as-is."
    • modify → the user supplies their own rewrite. Apply that version, confirm.

    If no tweak was suggested in step 4, skip this question entirely.

  6. Write ./tmp/.intro-effect-pending.json with the fields shown in Step 2.

  7. Output to the user:

    You picked: Suggestion N — <title>
    
    Recording instructions:
    <recording_instructions, verbatim>
    
    When you've recorded the talking-head shot, take a screenshot of its first frame, save it to `./tmp/<descriptive-name>.jpg`, then:
    
    - **Same conversation:** just tell me "continue, screenshot is at ./tmp/<descriptive-name>.jpg" (or similar) — I'll pick up the pending state and generate the clip. No flag needed.
    - **Fresh conversation** (you closed the chat in between): re-invoke `/manuscript-broll-opener <manuscript-path> --continue ./tmp/<descriptive-name>.jpg`.
    
    Pending state saved. Pausing here.
    
  8. Pause here. Do NOT call generate_video yet. This is the natural pause for the user to (optionally edit the narration further, then) record talking-head. Within the same session, wait for the user to come back with a screenshot path; do not require a flag-based re-invocation. Across sessions, the --continue flag is how the user re-enters this skill cleanly.
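Steps 3 and 6 of this list (slugify the title, write the pending file) might look like the following. It is a sketch under assumptions: the helper names are hypothetical, and the slug rule treats any non-ASCII character as a separator, which the text above does not specify.

```python
import json
import re
from pathlib import Path

PENDING_PATH = Path("./tmp/.intro-effect-pending.json")


def slugify(title: str) -> str:
    """Lowercase the title and collapse each non-alphanumeric run into one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def save_pending_state(manuscript: str, title: str, concept_summary: str,
                       prompt_template: str, narrator_as_subject: bool,
                       recording_instructions: str, genre: str,
                       path: Path = PENDING_PATH) -> dict:
    """Write the pending-state fields listed in Step 2 and return them."""
    state = {
        "manuscript": str(Path(manuscript).resolve()),  # stored as an absolute path
        "concept_slug": slugify(title),
        "concept_summary": concept_summary,
        "prompt_template": prompt_template,
        "narrator_as_subject": narrator_as_subject,
        "recording_instructions": recording_instructions,
        "genre": genre,
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return state
```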

Step 4 — Continue Mode (Generate)

Entered when the user signals they're ready with a screenshot — either via the --continue <screenshot-path> flag (cross-session) or via a same-session conversational handoff that points at a screenshot path. The pending-state file must exist either way.

  1. Validate screenshot. Check that the path exists and is a .jpg / .jpeg / .png / .webp. If not, error and ask the user to fix the path.

  2. Scan reference images for IP-flagged content before uploading. Higgsfield runs an intellectual-property detector on every reference image. If it flags something, the job comes back ip_detected (the credit is not charged, but you still burn a few minutes on the rejected-job + fix + retry cycle). The skill agent can avoid this by visually pre-scanning each reference image about to be uploaded.

    Apply the scan to every reference image that will be uploaded fresh in this run: end_image (the destination screenshot — always), start_image (the empty-environment reference, if the chosen effect is materialize-style and a path was provided), and the identity headshot (only if it's not already cached in ./tmp/.headshots.json).

    For each image, read it visually and look for:

    • Character art / illustrated IP: comic-book characters (Joker, Marvel, DC, anime figures), film/TV character posters, video-game character art.
    • Brand logos at recognizable size: sponsor logos, branded merchandise, product packaging filling enough pixels to be identified.
    • Framed posters or artwork that reads as a reproduction of a known copyrighted image.
    • Figurines / collectibles: statues, action figures, branded merch on shelves, even small if the character is identifiable.

    If anything is flagged, stop before uploading and report the specifics back to the user. Be precise: name the item, name the location in frame ("right edge", "behind subject's shoulder", "top-left shelf"), and recommend an action ("crop the right ~25% of the frame, save as <original-name>-cropped.jpg, and re-invoke --continue" or "restage and re-shoot").

    Example report:

    "Heads-up: scanned ./tmp/last.jpg and found two likely IP issues:

    • Right edge of frame: framed Joker (DC Comics) poster, large and clearly identifiable.
    • Top-left shelf, behind subject: small purple figurine that reads as a Joker collectible.

    Higgsfield's IP detector will probably reject the job. Crop the right ~30% to remove the poster, and either crop or hide the figurine — then save as last-cropped.jpg and point me at the cropped file. Or say proceed anyway and I'll submit the original (and we'll likely need to fix and retry)."

    Wait for the user to either fix and re-invoke, or say proceed anyway. Do not auto-proceed when something is flagged — let the user choose.

    If nothing is flagged, proceed silently to step 3 — no need to spam the user with "scan clean."

  3. Scan the end_image for mid-speech / mid-gesture poses. Higgsfield uses end_image as the target the AI animates toward — meaning the subject's face and body in the AI clip's final beat will match the pose in this image. If the end_image captures a mid-speech moment (mouth visibly open, jaw mid-articulation, lips formed around a vowel/consonant), the AI will animate the subject talking through the entire clip. That has two consequences:

    • Visual: the subject appears to be delivering dialogue in the cold open, which collides with the recorded voiceover that follows.
    • Audio: Seedance (and similar models) react to a "talking" subject by generating fabricated dialogue audio in the soundtrack. That's the actual root cause of unwanted AI voiceover in clips — the model is "voicing" the visible mouth motion.

    Read the end_image visually. Look for:

    • Mouth open with visible teeth or formed lip shape (mid-vowel).
    • Asymmetric jaw position (mid-articulation).
    • Hand caught mid-gesture in the air (not at rest).
    • Eyes mid-blink or off-axis (mid-glance away from camera).

    If any are present, warn the user before uploading:

    "Heads-up: scanned <end_image> and the captured frame shows the subject mid-speech (mouth open) / mid-gesture (hand mid-air). Higgsfield will animate dialogue / motion toward that pose, which produces a 'talking' cold-open subject and fabricated voiceover audio. Better: scrub the talking-head recording for a moment where you're silent (between sentences, mouth closed, hands at rest) and screenshot that frame instead. Save as <name>-still.jpg and point me at the new file. Or say proceed anyway to use the current frame."

    Wait for the user to either fix and re-invoke, or say proceed anyway. Do not auto-proceed when the frame is mid-action.

    If the frame is settled (mouth closed, hands at rest or out of frame, gaze on-camera), proceed silently — no need to confirm.

  4. Upload screenshot as end_image. Call mcp__higgsfield__media_upload with the screenshot file. (No caching for end_image — it's a one-off frame for this specific clip.) Save the returned UUID. Call media_confirm if required by the upload response.

  5. If narrator_as_subject is true, also resolve a headshot UUID following the same flow as manuscript-broll-suggest Step 4 part 0:

    • First check ./tmp/.headshots.json for canonical_uuid. If non-empty, use it.
    • Otherwise look for screenshot*.{jpg,jpeg,png,webp} files in ./tmp/. Pick the cleanest front-facing one. Reuse cached UUID if mtime matches; otherwise upload + cache.
  6. Pick model and resolve params. Default to seedance_2_0. Switch to kling3_0 only when the criteria in Higgsfield Prompt Rules → Model defaults for openers are all met (kinetic Tier 1 effect, narrator's face clearly in the end_image, no separate identity reference required, 4K wanted). Verify via models_explore (action: get, model_id: <chosen>) that medias[].roles includes end_image. Resolve highest-quality params per model:

    • seedance_2_0: resolution: "1080p", mode: "std", genre from pending state.
    • kling3_0: mode: "4k", sound: "off" (the recorded narration replaces the AI audio anyway, and the visible subject must never appear to speak — see the no-speaking rule). No genre parameter on Kling 3.0.
  7. Call mcp__higgsfield__generate_video with:

    • model: <chosen>
    • prompt: <prompt_template from pending state>
    • params.duration: 8
    • params.aspect_ratio: "16:9"
    • Model-specific quality params from step 6.
    • params.medias:
      • Always: {role: "end_image", value: "<destination-screenshot-uuid>"}
      • If narrator-as-subject AND model is seedance_2_0: {role: "image", value: "<headshot-uuid>"}. Skip this for kling3_0 — Higgsfield does not expose Kling 3.0's identity-image role; the headshot would have to be reused as start_image, which conflicts with the materialize use case below.
      • Optional but strongly recommended for materialize / "empty environment to populated" effects: {role: "start_image", value: "<empty-environment-uuid>"}. Without it, the AI invents what the empty version of the environment looks like, which usually mismatches the user's real space (wrong colors, wrong prop positions). With it, the transition is anchored from a real empty frame to a real populated frame, eliminating the "imagined → real" mismatch. The empty-environment screenshot should be the same camera angle and lighting as the destination, just without the narrator. Standard path: ./tmp/empty.jpg (the skill should look for this when the chosen effect is materialize-style).
  8. Poll mcp__higgsfield__job_status with sync: true until terminal. Typical: 60–180s.

  9. On success, extract the result video URL.

If balance shows insufficient credits before generating, surface that and stop.
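The screenshot check and the request assembly from the steps above can be sketched as plain data construction. Nothing here calls Higgsfield; the helper names are hypothetical, and the nested `params` shape is inferred from the `params.duration` / `params.medias` notation in the list, so treat it as an assumption about the tool's input.

```python
from pathlib import Path
from typing import Optional

ALLOWED_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def is_valid_screenshot(path: Path) -> bool:
    """True when the file exists and has one of the accepted image extensions."""
    return path.is_file() and path.suffix.lower() in ALLOWED_SUFFIXES


def quality_params(model: str, genre: str) -> dict:
    """Highest-quality params per model, as listed in the model-selection step."""
    if model == "seedance_2_0":
        return {"resolution": "1080p", "mode": "std", "genre": genre}
    if model == "kling3_0":
        return {"mode": "4k", "sound": "off"}  # Kling 3.0 takes no genre
    raise ValueError(f"unsupported model: {model}")


def build_request(model: str, prompt: str, genre: str, end_image_uuid: str,
                  narrator_as_subject: bool, headshot_uuid: Optional[str] = None,
                  empty_env_uuid: Optional[str] = None) -> dict:
    """Assemble the generate_video arguments without sending anything."""
    medias = [{"role": "end_image", "value": end_image_uuid}]  # always present
    if narrator_as_subject and model == "seedance_2_0" and headshot_uuid:
        medias.append({"role": "image", "value": headshot_uuid})
    if empty_env_uuid:  # recommended for materialize-style effects
        medias.append({"role": "start_image", "value": empty_env_uuid})
    params = {"duration": 8, "aspect_ratio": "16:9", **quality_params(model, genre)}
    params["medias"] = medias
    return {"model": model, "prompt": prompt, "params": params}
```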

Step 5 — Save and Open

  1. Target path: ./tmp/<manuscript-stem>-opener-<concept_slug>.mp4. If the file exists, append -v2, -v3, etc.
  2. Download via curl -L -o <path> <url>.
  3. Open in QuickTime: open -a "QuickTime Player" <path>.
  4. Print the local path and the Higgsfield job ID.
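The naming rule in item 1 (append -v2, -v3, and so on when the file already exists) can be sketched like this; `next_clip_path` is a hypothetical helper name:

```python
from pathlib import Path


def next_clip_path(tmp_dir: Path, manuscript_stem: str, concept_slug: str) -> Path:
    """Return the first unused path: <stem>-opener-<slug>.mp4, then -v2, -v3, ..."""
    base = f"{manuscript_stem}-opener-{concept_slug}"
    candidate = tmp_dir / f"{base}.mp4"
    version = 2
    while candidate.exists():
        candidate = tmp_dir / f"{base}-v{version}.mp4"
        version += 1
    return candidate
```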

Audio: leave the AI-generated audio track on the file. Ambient room tone, screen hum, footsteps, glass settle, etc. add atmosphere and the editor will mute or replace the track in post anyway. The thing to prevent is the AI generating dialogue — see the Step 4 mid-speech guard and the "no speaking" rule in Higgsfield Prompt Rules. The fix for unwanted voiceover is at the source (subject not talking in the clip), not at the file level.

Step 6 — Keep / Discard / Refine

After the clip is open in QuickTime, ask:

Keep this clip? (yes / no / refine)

  • yes → mark in manuscript (Step 6a), append to history (Step 6b), clear pending state (Step 6c).
  • no → delete the clip file (rm <path>), leave the manuscript untouched, leave the pending state intact so the user can retry without losing context. Tell them: "Clip deleted. Pending state preserved. Re-invoke with --continue and a new screenshot, or re-invoke without --continue and choose to discard the pending effect to brainstorm fresh."
  • refine → take feedback, rewrite the prompt (preserve the effect's metaphor and end-frame target; tweak action / lighting / camera per feedback), regenerate as -v2.mp4, delete previous version once the new one lands. Return to Step 6 for the new version.

Step 6a — Mark Opener in Manuscript

When the user says yes:

  1. Compute CLIP_NAME = basename of the saved file without extension.

  2. Insert TODO: intro-clip: <CLIP_NAME> (plays before talking-head) immediately under the first ## Intro header (or before the first ## Header if there is no ## Intro), with a blank line above and below. The parenthetical annotation is required — the editor needs to know without reading the SKILL doc that these clips play before the recorded narration.

  3. If a TODO: intro-clip: line already exists at that location, replace its clip name (refinement / replacement case) rather than adding a second — unless this is a multi-clip sequence (see point 5).

  4. If the manuscript has a TODO: clip: line at the top with a (before the narration starts) annotation (older convention), ask the user whether to replace it with the new TODO: intro-clip: form or leave it. Default: replace.

  5. Multi-clip cold-open sequences (when the user wants a montage instead of a single AI clip — e.g., bedroom → coffee → desk, generated as multiple Higgsfield clips and edited together): write one TODO: intro-clip: line per clip, in playback order, each annotated with its position and the special role of the final one:

    TODO: intro-clip: <CLIP_NAME_1> (plays before talking-head, 1/N)
    TODO: intro-clip: <CLIP_NAME_2> (plays before talking-head, 2/N)
    ...
    TODO: intro-clip: <CLIP_NAME_N> (plays before talking-head, N/N, hard cut into recorded footage at final frame)
    

    Only the last clip's final frame is locked to the recorded talking-head's first frame via end_image. The earlier clips are standalone and stand on their own cuts.

Format (single clip):

## Intro

TODO: intro-clip: <CLIP_NAME> (plays before talking-head)

<existing first paragraph...>

Note for the editor: TODO: intro-clip: differs from TODO: clip: — the clip plays before recorded narration and ends at the recorded first frame, so the cut is hard from AI clip → live footage with no overlap.
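For the single-clip case, the insert-or-replace rule can be sketched as a pure string transform. The function name is hypothetical, multi-clip sequences and the older TODO: clip: convention are not handled, and a manuscript with no ## header raises an error because the steps above do not cover that case.

```python
import re

TODO_RE = re.compile(r"^TODO: intro-clip: .*$", re.MULTILINE)


def mark_opener(manuscript: str, clip_name: str) -> str:
    """Insert or replace the single-clip TODO: intro-clip: line."""
    todo = f"TODO: intro-clip: {clip_name} (plays before talking-head)"
    if TODO_RE.search(manuscript):
        # Replacement case: swap the existing line instead of adding a second one.
        return TODO_RE.sub(lambda _m: todo, manuscript, count=1)
    lines = manuscript.split("\n")
    intro = next((i for i, ln in enumerate(lines) if ln.strip() == "## Intro"), None)
    if intro is not None:
        rest = lines[intro + 1:]
        while rest and not rest[0].strip():
            rest.pop(0)  # drop blank lines so exactly one surrounds the TODO
        return "\n".join(lines[:intro + 1] + ["", todo, ""] + rest)
    header = next((i for i, ln in enumerate(lines) if ln.startswith("## ")), None)
    if header is None:
        raise ValueError("manuscript has no ## header")
    before = lines[:header]
    while before and not before[-1].strip():
        before.pop()
    return "\n".join(before + ([""] if before else []) + [todo, ""] + lines[header:])
```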

Step 6b — Append to History

Append to ./.claude/skills/manuscript-broll-opener/intro-effects-used.json:

{
  "slug": "<concept_slug>",
  "manuscript": "<basename of manuscript file>",
  "date": "<YYYY-MM-DD, today>",
  "concept_summary": "<one-line description of what the viewer sees>"
}

If the file doesn't exist, create it with {"effects": [<this entry>]}. Pretty-print with 2-space indent for readability.
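A minimal sketch of the append, assuming the file layout shown in Step 3a; the function name and the optional `today` argument (useful for testing) are illustrative:

```python
import json
from datetime import date
from pathlib import Path
from typing import Optional


def append_history(history_path: Path, slug: str, manuscript_path: str,
                   concept_summary: str, today: Optional[date] = None) -> dict:
    """Append one accepted effect, creating the history file if needed."""
    entry = {
        "slug": slug,
        "manuscript": Path(manuscript_path).name,  # basename only
        "date": (today or date.today()).isoformat(),  # YYYY-MM-DD
        "concept_summary": concept_summary,
    }
    if history_path.exists():
        data = json.loads(history_path.read_text(encoding="utf-8"))
    else:
        data = {"effects": []}
    data.setdefault("effects", []).append(entry)
    history_path.parent.mkdir(parents=True, exist_ok=True)
    # Pretty-print with a 2-space indent for readability.
    history_path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")
    return entry
```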

The concept_summary should describe what the viewer sees, not implementation details. The summary is what future brainstorms will read to judge whether a new candidate concept overlaps with this one — so describe the visual (setting, action, framing), not the technical params.

  • Good: "Empty studio chair, man drops vertically from above into the chair, slow camera push-in across 8 seconds, lands on tighter destination framing."
  • Bad: "Seedance 2.0, drama genre, 1080p, locked-off camera, end_image set to last-cropped.jpg."

Step 6c — Clear Pending State

Delete ./tmp/.intro-effect-pending.json.

Step 7 — Done Summary

Print:

  • Accepted clip's local path.
  • Higgsfield job ID.
  • That the manuscript now has TODO: intro-clip: <name> at the top.
  • That <concept_slug> is now in the cross-video history and won't be re-suggested.

Multi-Clip Cold-Open Pipeline

The default skill flow above generates one 8-second AI clip that lands on the recorded talking-head. Sometimes the user wants a narrative montage instead — a sequence of beats across multiple environments (e.g. "asleep in bed → phone rings → coffee machine → sit at desk"). A single 8s shot can't carry that. Use this alternative pipeline.

When to offer it

During brainstorm (Step 3), if the user proposes a multi-environment narrative (waking up scene, walking-into-room sequence, day-in-the-life montage, etc.) that obviously won't fit one 8s clip, surface the multi-clip option explicitly. Don't try to compress 5 scenes into one shot — Seedance can't render the user's face accurately across 3+ different environments in a single clip; identity falls apart.

Honest tradeoffs to surface up front

Before starting, tell the user:

  • Cost: ~7–12 Higgsfield jobs (one keyframe image + one video per beat, plus refinements). At ~50 credits per video and ~10 per image, expect 200–400 credits total.
  • Time: 30–60 min of generation + your edit time afterward.
  • Identity drift: the user's face rendered in a bedroom or kitchen will drift from the studio headshot reference. Expect 1–2 beats to need regeneration. Camera angles that hide or partially obscure the face (asleep on pillow, low-angle on legs, hand-only shots, body-only shots cropped at the neck) sidestep this entirely — bias toward those when possible.
  • Continuity: wardrobe and props must be specified per-prompt or the AI arbitrarily costumes the subject between beats.
  • Operates outside the normal pending-state: the skill's ./tmp/.intro-effect-pending.json mechanism is for single-clip flow. Multi-clip is driven directly through the conversation; track progress in your head / via TaskCreate rather than the pending file.
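To make the cost bullet concrete, here is a back-of-the-envelope estimate using the approximate per-job figures quoted above (~50 credits per video, ~10 per image). The function is illustrative only and the real prices may differ.

```python
def estimate_credits(beats: int, video_refines: int = 0, image_refines: int = 0,
                     credits_per_video: int = 50, credits_per_image: int = 10) -> dict:
    """One keyframe image plus one video per beat, plus any regenerations."""
    images = beats + image_refines
    videos = beats + video_refines
    return {
        "jobs": images + videos,
        "credits": images * credits_per_image + videos * credits_per_video,
    }
```

With 4 beats, 2 keyframe refinements, and 1 video regeneration this gives 11 jobs and 310 credits, inside the ranges quoted above. A 3-beat montage with no refinements comes to 6 jobs and 180 credits, so the quoted ranges assume some regeneration.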

Beat plan

Lock the beats with the user before generating anything:

  • How many beats (typically 3–5).
  • What environment each one is in.
  • What action / motion each one carries.
  • Which beats hide the user's face (cheaper, lower identity risk) vs. which show the face (higher risk).
  • The wardrobe progression (the user may change clothes between bedroom and final desk).
  • The final beat — this is the one that lands on the recorded screenshot via end_image.

Pipeline (per beat)

For beats 1..N-1 (no end_image lock):

  1. Generate keyframe image with mcp__higgsfield__generate_image. This is the first frame of the beat's video.
  2. Show keyframe to user. Keep / refine / discard. Refinement loop: see "Image refinement" below.
  3. On accept: generate the video with mcp__higgsfield__generate_video, passing the accepted keyframe's job ID as start_image. Duration: 4s (Seedance 2.0 minimum) unless the beat genuinely needs longer.
  4. Parallelize: while the video bakes, start generating the next beat's keyframe. The video poll loop and image generation are independent.

For beat N (final, lands on screenshot):

  1. Run the existing --continue validation: IP scan, mid-speech scan on the screenshot.
  2. Generate beat-N keyframe showing the start of the final motion (user entering frame, hand placing mug, etc.) — composition should approach the recorded screenshot's framing.
  3. On accept: generate video with BOTH start_image (beat-N keyframe) AND end_image (the screenshot). Duration: 4–6s.
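
The difference between the two pipelines comes down to which frames are pinned. The sketch below builds an illustrative request for each case. The payload shape (a `medias` list with `role` and `id` keys, plus `model`, `prompt`, and `duration`) is an assumption; check the actual mcp__higgsfield__generate_video schema before relying on it.

```python
from typing import Optional


def build_video_request(
    prompt: str,
    keyframe_job_id: str,
    *,
    final_beat: bool = False,
    screenshot_id: Optional[str] = None,
    duration: int = 4,
) -> dict:
    """Build an illustrative video request for one beat.

    Beats 1..N-1 pin only the first frame. The final beat also pins the last
    frame to the recorded screenshot via end_image.
    """
    # Hypothetical payload layout; only the role names come from the text above.
    medias = [{"role": "start_image", "id": keyframe_job_id}]
    if final_beat:
        if screenshot_id is None:
            raise ValueError("the final beat needs the recorded screenshot as end_image")
        if not 4 <= duration <= 6:
            raise ValueError("final beat duration should be 4-6 seconds")
        medias.append({"role": "end_image", "id": screenshot_id})
    elif screenshot_id is not None:
        raise ValueError("only the final beat takes an end_image")
    return {"model": "seedance_2_0", "prompt": prompt, "duration": duration, "medias": medias}


if __name__ == "__main__":
    print(build_video_request("bedroom prompt", "job-1"))
    print(build_video_request("desk prompt", "job-3", final_beat=True,
                              screenshot_id="shot-1", duration=5))
```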

Model choice for keyframes

  • Use nano_banana_2 (Nano Banana Pro). It respects the prompt while using the reference image for guidance.
  • Do NOT use soul_2 for keyframes that diverge from the reference image's content. Soul 2.0 runs enhance_prompt: true by default and silently rewrites your prompt to re-describe the reference image. You ask for "man asleep in dark bedroom with a phone on the nightstand"; it gives you the man in his studio with a microphone, because that's what the reference image showed. Soul 2.0 is fine for portraits/UGC that intentionally match the reference; it's wrong for any beat where the scene differs.

Image refinement (keep / refine / discard loop)

When the user wants a tweak ("darker window", "move slippers right", "swap the espresso machine"), don't regenerate from scratch with a longer prompt. Use image-to-image editing:

  1. Upload the current keyframe (media_upload + curl PUT + media_confirm).
  2. Call generate_image with nano_banana_2, passing the uploaded image as medias[].role: "image".
  3. Prompt: "Edit this image: keep EVERYTHING identical (list the things to preserve) EXCEPT <the requested change>. All other elements of the image unchanged."

This works well for: window lighting, prop positions, swapping an object for a similar one, removing or adding a small element. Saves credits and preserves the parts the user already approved.
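
The edit prompt can be assembled from the list of approved elements and the single requested change. A minimal sketch, assuming the requested change goes directly after EXCEPT; the function name is illustrative.

```python
def build_edit_prompt(preserve: list[str], change: str) -> str:
    """Build an image-to-image edit prompt that changes exactly one thing."""
    if not preserve:
        raise ValueError("list at least one element to preserve")
    change = change.strip().rstrip(".")
    if not change:
        raise ValueError("describe the single change to make")
    kept = ", ".join(preserve)
    return (
        f"Edit this image: keep EVERYTHING identical ({kept}) "
        f"EXCEPT {change}. All other elements of the image unchanged."
    )


if __name__ == "__main__":
    print(build_edit_prompt(["the bed", "the nightstand", "the phone"],
                            "make the window darker"))
```

Keeping one change per call matches the "one prompt = one task" rule and makes it easy to see which tweak caused a regression.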

Audio control

Seedance 2.0 has generate_audio: true baked in and there's no parameter to disable it. The only lever is the prompt itself. To suppress music while keeping ambient sound effects, add an explicit AUDIO block to every video prompt:

AUDIO: ambient room tone only — <list the specific diegetic sounds appropriate to the scene>. NO MUSIC. NO SCORE. NO SOUNDTRACK. NO MELODY. NO BACKGROUND MUSIC. NO DIALOGUE. NO VOICEOVER. Only diegetic natural sound effects from the action.

This is effective; without it Seedance often layers a faint score under everything.
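
Because the AUDIO block has to appear in every video prompt, it helps to generate it rather than retype it. The sketch below fills in the diegetic sound list and appends the block to a prompt; the helper names are illustrative.

```python
def audio_block(sounds: list[str]) -> str:
    """Return the AUDIO block with the scene's diegetic sounds filled in."""
    if not sounds:
        raise ValueError("list at least one diegetic sound for the scene")
    listed = ", ".join(sounds)
    # "\u2014" is the dash used in the AUDIO block above.
    return (
        f"AUDIO: ambient room tone only \u2014 {listed}. "
        "NO MUSIC. NO SCORE. NO SOUNDTRACK. NO MELODY. NO BACKGROUND MUSIC. "
        "NO DIALOGUE. NO VOICEOVER. Only diegetic natural sound effects from the action."
    )


def with_audio(video_prompt: str, sounds: list[str]) -> str:
    """Append the AUDIO block to a video prompt on its own line."""
    return video_prompt.rstrip() + "\n" + audio_block(sounds)


if __name__ == "__main__":
    print(with_audio("SCENE: dark bedroom, locked-off wide shot.",
                     ["phone vibrating on wood", "sheets rustling"]))
```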

File naming

  • Keyframes: ./tmp/<beat-name>-keyframe-v<N>.png (e.g. bedroom-keyframe-v3.png).
  • Videos: ./tmp/<manuscript-stem>-opener-clip<N>-<beat-name>.mp4 (e.g. grafana-agent-opener-clip1-bedroom.mp4). The clip<N> prefix makes playback order obvious to the editor.
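
The naming scheme can be captured in two small helpers so every beat follows it. This is a sketch; the helper names and the slug normalization (lowercase, hyphen-separated) are assumptions added for illustration.

```python
import re


def slug(text: str) -> str:
    """Lowercase the text and join alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def keyframe_path(beat_name: str, version: int) -> str:
    """Path for version N of a beat's keyframe image."""
    return f"./tmp/{slug(beat_name)}-keyframe-v{version}.png"


def video_path(manuscript_stem: str, clip_number: int, beat_name: str) -> str:
    """Path for a beat's video; clip_number encodes playback order."""
    return f"./tmp/{manuscript_stem}-opener-clip{clip_number}-{slug(beat_name)}.mp4"


if __name__ == "__main__":
    print(keyframe_path("bedroom", 3))
    print(video_path("grafana-agent", 1, "bedroom"))
```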

Marking and history (multi-clip)

  • Manuscript: write one TODO: intro-clip: line per clip in playback order, with annotations as documented in Step 6a (point 5).
  • History (intro-effects-used.json): write one entry summarizing the whole montage concept — not one entry per clip. The concept_summary should describe the visual arc across all beats so future brainstorms know the metaphor is taken (e.g. "wake-up-to-desk montage", "alert-at-night ritual"). One slug like midnight-alert-coffee-ritual is enough.

Higgsfield Prompt Rules

Sourced from Higgsfield's official cinematic prompt guide. Apply these in every generated prompt and every refinement.

Do:

  • Short, direct sentences. Higgsfield reacts better to commands than descriptive paragraphs that force it to guess. "Dolly in slowly" beats "cinematic movement."
  • Specific camera verbs: dolly in, dolly out, orbit, handheld, tracking, crash zoom, FPV, locked-off, mounted on dashboard.
  • Explicit timing cues: "For the first 2 seconds, …", "At 4s, suddenly …", "By 7s, …". Sequence calm → shift → payoff. For openers especially, the last 1–2 seconds must settle into the destination pose so the AI's final frame can match the recorded end_image cleanly. Settling is not freezing: natural micro-movement should continue until the final frame.
  • Physical detail over abstraction: body trembles, breath quickens, jaw tenses, eyes widen. Show emotion through the body, not adjectives.
  • Layered separation: one prompt = one task. Camera motion lives in the video prompt only. Don't try to change identity AND move the camera in the same shot.
  • Mood close: end the prompt with the resolving emotional tone (tense, hopeful, unsettled, triumphant).

Don't:

  • Vague style words: ❌ "cinematic", "dynamic", "epic", "beautiful". They give the model nothing to act on.
  • Mixed instructions: don't combine identity edits with camera moves in one prompt.
  • Stacking multiple visual styles in one shot.
  • Repeating the same instruction in different words.
  • Describing what you don't want at length — Higgsfield biases toward what's stated. Keep negatives minimal: "no text on screen" is fine; long don't-lists waste budget.
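
The vague-style-word rule is easy to check mechanically before a prompt is submitted. A minimal sketch: the word list is taken from the Don't list above, and the function name is illustrative. It flags whole words only and does not judge the other rules.

```python
import re

# The vague style words called out in the Don't list above.
VAGUE_WORDS = ("cinematic", "dynamic", "epic", "beautiful")


def find_vague_words(prompt: str) -> list[str]:
    """Return the vague style words that appear as whole words in the prompt."""
    lowered = prompt.lower()
    return [w for w in VAGUE_WORDS if re.search(rf"\b{w}\b", lowered)]


if __name__ == "__main__":
    print(find_vague_words("A cinematic, epic move toward the desk."))
    print(find_vague_words("Dolly in slowly toward the desk."))
```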

Live-action only — non-negotiable: Every clip must look like a frame pulled from a real motion picture. Treat this as a hard constraint, not a stylistic preference.

  • The verbatim FILM LOOK block from the prompt template is mandatory in every prompt.
  • Subjects are always real humans, real animals, or real physical objects in real locations. No anthropomorphic characters, no mascots, no avatars.
  • Lean on real-world cinematic references: a hand-held documentary shot, a slow studio dolly, a kitchen-sink drama interior, a film-noir alley, a 70s thriller car interior, an Arri Alexa interview setup. Concrete > abstract.
  • Lighting, lens, and color must be motivated and physical: practical lamps, sunlight through blinds, sodium streetlights, candlelight, overcast diffuse. Never "magical glow," "neon overlays," "pulsing energy," or other CGI-coded language.
  • If the metaphor is abstract, translate it into a concrete real-world image instead of a stylized rendering.
  • Keep the subject grounded — physical effort, fatigue, breath, weight. Reality has texture; CGI flattens it.

The subject must never appear to be speaking — non-negotiable. A cold-open clip transitions into voiceover narration. If the visible subject's mouth is moving or they're delivering dialogue, the viewer's brain expects to hear those exact words, which collides with the narration audio that follows. Build the cinematic concept around physical action, contemplation, stillness, hands-on craft, or environmental motion — never mid-conversation, never visibly mouthing words.

Prompt template:

SCENE: <opening setup — shot type, framing, location>
SUBJECT: <who/what is in frame, with physical detail; if narrator-as-subject, lean on "the person/man from the reference image">
ACTION: <chronological beats with timing markers ("For the first 2 seconds...", "At 4s, suddenly...", "By 7s..."), culminating in the destination pose so the final frame matches end_image>
CAMERA: <specific camera verb — dolly in, orbit, handheld, crash zoom, FPV, locked-off>
LIGHTING & LOOK: <natural / practical / shallow depth of field / film stock reference; should be reachable from the recorded talking-head's lighting in the final beat>
FILM LOOK: live-action footage, shot on 35mm or Arri Alexa, 24fps, anamorphic-style depth of field, natural film grain, motivated lighting, color-graded like a feature film. NOT animation, NOT CGI, NOT cartoon, NOT illustration, NOT 3D render, NOT video game, NOT motion graphics.
END MOOD: <how the shot resolves emotionally>

Keep the prompt 100–200 words. The FILM LOOK block is required and verbatim in every prompt.
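
The template and its two hard requirements (verbatim FILM LOOK block, 100–200 words) can be enforced in code. The sketch below assembles a prompt and reports problems. Counting words by whitespace splitting is an assumption, and the function names are illustrative.

```python
# Verbatim FILM LOOK block from the template above.
FILM_LOOK = (
    "FILM LOOK: live-action footage, shot on 35mm or Arri Alexa, 24fps, "
    "anamorphic-style depth of field, natural film grain, motivated lighting, "
    "color-graded like a feature film. NOT animation, NOT CGI, NOT cartoon, "
    "NOT illustration, NOT 3D render, NOT video game, NOT motion graphics."
)


def build_prompt(scene: str, subject: str, action: str, camera: str,
                 lighting: str, end_mood: str) -> str:
    """Assemble a prompt in template order with the FILM LOOK block included."""
    return "\n".join([
        f"SCENE: {scene}",
        f"SUBJECT: {subject}",
        f"ACTION: {action}",
        f"CAMERA: {camera}",
        f"LIGHTING & LOOK: {lighting}",
        FILM_LOOK,
        f"END MOOD: {end_mood}",
    ])


def check_prompt(prompt: str) -> list[str]:
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if FILM_LOOK not in prompt:
        problems.append("FILM LOOK block missing or altered")
    words = len(prompt.split())  # whitespace-separated tokens
    if not 100 <= words <= 200:
        problems.append(f"prompt is {words} words; target is 100-200")
    return problems


if __name__ == "__main__":
    draft = build_prompt("wide shot, dark bedroom", "the man from the reference image",
                         "For the first 2 seconds, he sleeps.", "locked-off",
                         "practical lamp, shallow depth of field", "unsettled")
    print(check_prompt(draft))
```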

Model defaults for openers:

  • seedance_2_0 — default. Text→video with end_image + a dedicated image role for an independent identity reference, strong identity preservation, max 1080p, genre control. Pick this when:

    • The narrator is visible only partially or not at all in the destination frame and you need a separate headshot reference to lock identity (the destination is hands-on-keyboard, a side-angle, an over-the-shoulder shot).
    • The effect leans on natural human body motion — settling into a chair, breath, micro-gestures, weight, fabric/hair physics. Seedance 2.0 was purpose-built for human body realism.
    • The narrator-as-subject flag would otherwise apply.
  • kling3_0 — viable alternative for kinetic openers. Released Feb 2026. Use when all of the following are true:

    1. The narrator's face is clearly resolved in the end_image (Kling 3.0 on Higgsfield only exposes start_image/end_image roles — no dedicated identity-image role, so the destination frame must carry identity on its own).
    2. The Tier 1 effect is kinetic / action-heavy: drop-into-place, camera push-through, smash-through, FPV door entry, crash zoom. Kling 3.0's strength is kinetic motion realism and physical momentum.
    3. You want the 4K resolution bump — set params.mode: "4k".

    Skip Kling 3.0 for Tier 2 / subtle openers and for openers where the destination is a side-angle or an off-axis shot of the narrator.

  • Avoid kling_2_6 and earlier — they expose only start_image, no end_image, so they can't land the seam at all.

  • Avoid wan_2_6 — no end_image support.

  • wan_2_7 has end_image but no identity-image role and weaker identity preservation than Seedance 2.0. Only consider it if you specifically need its audio reference role.

  • Avoid marketing_studio_video — caps at 720p, designed for product ads, not openers.
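
The selection rules above reduce to a short decision: Kling 3.0 only when all three of its conditions hold, Seedance 2.0 otherwise. A sketch under that reading; the function name and the returned dict shape are illustrative, while the model IDs and the `mode: "4k"` parameter come from the list above.

```python
def choose_opener_model(face_resolved_in_end_image: bool,
                        kinetic_effect: bool,
                        want_4k: bool) -> dict:
    """Pick the opener model following the defaults listed above."""
    # Kling 3.0 has no identity-image role, so all three conditions must hold.
    if face_resolved_in_end_image and kinetic_effect and want_4k:
        return {"model": "kling3_0", "params": {"mode": "4k"}}
    # Default: Seedance 2.0, which accepts end_image plus a separate identity reference.
    return {"model": "seedance_2_0", "params": {}}


if __name__ == "__main__":
    print(choose_opener_model(True, True, True))
    print(choose_opener_model(False, True, True))
```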

Duration: 8s default. Server clamps to nearest allowed value per model.

Style Reminders

  • Always live-action. Every clip must look like footage from a real motion picture — no animation, CGI, cartoon, illustration, 3D render, motion graphics, or video-game look, ever.
  • Real actors, real props, real locations. Real physical lighting.
  • Motion throughout. The final pose should match end_image, but the clip does not need to be "motionless" before that. The end_image parameter drives identity preservation by giving the AI a target to land on; that's what locks the pose. It does NOT mean the subject should be frozen throughout the clip. Do NOT write prompts that demand the subject be "motionless throughout" or "still for the last 2 seconds" — that produces dead, boring clips. Write prompts where the subject has natural micro-movement (forming, settling, breathing, slight head shifts, hands moving into position) all the way through, with the final frame arriving at the destination pose.
  • Seam invisibility is rarely achievable anyway. If the user's recorded talking-head footage is at a different framing than end_image (e.g., end_image is a cropped frame to remove an IP-flagged background element, but the actual recording is uncropped), the cut from AI clip → live footage will have a visible frame-size pop regardless of how still the AI clip's last frame is. In that case, "stillness in the final beat" earns nothing — the cut will be visible no matter what. Lean into motion across the whole clip.
  • Zero text on screen — no captions, no labels, no subtitles.
  • Pacing for the default 8s clip: 0–3s setup, 3–6s transformation, 6–7s settling, 7–8s steady on the destination pose.
  • At most 3–4 visual elements per clip.
  • Lighting in the final beat should approximate the recorded talking-head's lighting; this minimizes color/exposure pop at the seam.

Failure Modes

  • No pending state, --continue passed: tell the user to run without --continue first to brainstorm and pick an effect.
  • Pending state exists, no --continue, no --reset: ask whether to resume or discard.
  • Higgsfield job fails: surface the error message verbatim, offer to retry with a simplified prompt.
  • Screenshot upload fails: surface the error, ask the user to verify the file path and re-point you at the screenshot (same-session) or re-invoke with --continue (cross-session).
  • Download fails: keep the Higgsfield URL visible and let the user retry the download manually.
  • History file unreadable / malformed JSON: tell the user, but proceed with empty history (don't block brainstorming).
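
The last failure mode (proceed with empty history rather than block) might look like the sketch below. It assumes an empty history is represented as an empty list; the real structure of intro-effects-used.json is defined elsewhere in the skill, and the function name is illustrative.

```python
import json
from pathlib import Path
from typing import Any, Optional, Tuple


def load_history(path: str) -> Tuple[Any, Optional[str]]:
    """Load the history file, falling back to empty history on any read error.

    Returns (history, warning). The warning is None when the file loaded
    cleanly; otherwise it is a message to surface to the user.
    """
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:
        # ValueError covers json.JSONDecodeError and UnicodeDecodeError.
        warning = f"History file {path} could not be read ({exc}); continuing with empty history."
        return [], warning  # assumed empty-history representation
    return data, None


if __name__ == "__main__":
    history, warning = load_history("./tmp/does-not-exist.json")
    if warning:
        print(warning)
    print(history)
```

Returning the warning instead of printing it leaves the caller free to tell the user in its own words and still continue brainstorming.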