| name | manuscript-broll-opener |
|---|---|
| description | Generate a cinematic 'cold open' clip that ends precisely at the first frame of recorded talking-head footage. Two-phase: brainstorm effect + suggest intro-narration tweaks to tighten the seam + give recording instructions, then resume with the user's first-frame screenshot to generate via Higgsfield with end_image. Tracks accepted concepts across videos to avoid repeats. |
Propose a cinematic intro effect (materialize, ceiling zoom, rack focus, lamp reveal, etc.) for the very start of a video. The clip plays before recorded narration and ends at the exact first frame of the recorded talking-head footage, creating a seamless cut into live narration.
The skill runs in two phases:
- Brainstorm phase — propose effect ideas (excluding any used in prior videos), the user picks one, the skill outputs recording instructions and saves pending state, then pauses.
- Continue phase: the user records talking-head per the instructions, takes a screenshot of its first frame, then signals readiness with the screenshot path. In the same conversation that's just a natural-language handoff ("continue, screenshot at ./tmp/x.jpg"), no flag needed. Across sessions, re-invoke with `--continue <screenshot-path>`. Either way, the skill generates a clip via Higgsfield with that screenshot as `end_image`.
- Manuscript path: $1
- Continue flag (optional): `--continue <screenshot-path>` to resume from pending state after recording.
- Resolve manuscript path.
- Detect mode: brainstorm (no `--continue`) or continue (`--continue` + screenshot).
- Brainstorm: read history, propose 5–7 effect concepts (none repeating prior accepted concepts). On user pick: evaluate the existing intro narration against the chosen concept, propose a targeted narration tweak if one would tighten the seam, and ask the user whether to apply it before recording. Then output recording instructions, save pending state, exit.
- Continue: read pending state, upload screenshot as `end_image`, generate via Higgsfield, save to `./tmp/`, open in QuickTime.
- Keep / refine / discard. On accept: append to history, insert `TODO: intro-clip:` at top of manuscript, clear pending state.
If $1 is empty, ask: "Which manuscript should I generate the opener for? Please provide the absolute path." Validate the file exists.
The pending-state file is ./tmp/.intro-effect-pending.json:
{
"manuscript": "<absolute path>",
"concept_slug": "<slug>",
"concept_summary": "<one-line description>",
"prompt_template": "<full Higgsfield prompt>",
"narrator_as_subject": true,
"recording_instructions": "<verbatim instructions>",
"genre": "drama"
}

Mode resolution:
- `--continue <screenshot-path>` passed AND pending state exists → continue mode (Step 4).
- `--continue` passed but no pending state → error: "No pending intro effect found. Run the skill without `--continue` to brainstorm first." Exit.
- No `--continue`, but pending state already exists → ask the user: "A pending intro effect already exists for `<concept_slug>`. Discard and brainstorm fresh, or continue with the existing pending effect (provide a screenshot path)?" If discard, delete the pending file and proceed with brainstorm. If continue, ask for screenshot path and jump to Step 4.
- No `--continue`, no pending state → brainstorm mode (Step 3).
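
The four-way mode resolution above can be read as a small decision table. The following is a minimal sketch of it, not the skill's actual implementation; the `Mode` enum and the `resolve_mode` function are hypothetical names introduced here for illustration only.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    """Hypothetical labels for the four outcomes of mode resolution."""
    BRAINSTORM = "brainstorm"              # Step 3
    CONTINUE = "continue"                  # Step 4
    ERROR_NO_PENDING = "error_no_pending"  # --continue passed, nothing to resume
    ASK_USER = "ask_user"                  # pending state exists, no --continue


def resolve_mode(continue_path: Optional[str], pending_exists: bool) -> Mode:
    """Map (flag present?, pending file present?) onto one of the four outcomes."""
    if continue_path is not None:
        return Mode.CONTINUE if pending_exists else Mode.ERROR_NO_PENDING
    return Mode.ASK_USER if pending_exists else Mode.BRAINSTORM
```

A same-session conversational handoff that supplies a screenshot path would be treated like the `--continue` case, since the pending-state file is what decides whether there is anything to resume.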
In-session continuation does NOT need the --continue flag. The pending-state file is the single source of truth — within the same conversation, if you've already run the brainstorm phase and the user later tells you "continue, screenshot is at ./tmp/foo.jpg" (or any equivalent natural-language signal that they've recorded and are ready), read the pending-state file and jump straight to Step 4 with the screenshot path they provided. The --continue flag exists for cross-session re-entry — when the user closes the chat between brainstorming and recording, and starts a fresh conversation later. Within one session, conversational handoff is enough. Do not insist on the flag.
Also check the manuscript: if it already contains a TODO: intro-clip: line, ask: "This manuscript already has an opener: <existing-clip-name>. Replace it with a new one?" If no, exit. If yes, proceed (the existing line will be replaced on accept in Step 6a).
The history file is ./.claude/skills/manuscript-broll-opener/intro-effects-used.json:
{
"effects": [
{
"slug": "film-slate-cold-open",
"manuscript": "higgsfield.md",
"date": "2026-05-04",
"concept_summary": "DevOps & AI Toolkit film slate snapping shut, cuts to talking head"
}
]
}

If the file doesn't exist, treat history as empty. Build a set of slugs and concept_summary strings; the next step must avoid proposing anything that matches.
Also scan the target manuscript for an existing TODO: clip: line at the very top of the file (above the first ## Header) — older manuscripts may have used TODO: clip: with a (before the narration starts) annotation as an informal cold-open marker. Treat any such existing opener as "already used" so we don't propose the same concept again.
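
Reading the history file could look roughly like the sketch below. It assumes the JSON shape shown above (`effects` as a list of objects with `slug` and `concept_summary`); the function name `load_used_effects` is hypothetical, and the scan of the manuscript for an older `TODO: clip:` marker is not covered here.

```python
import json
from pathlib import Path


def load_used_effects(history_path):
    """Return (slugs, summaries) from the history file; both empty if it is missing."""
    path = Path(history_path)
    if not path.exists():
        # A missing file is treated as empty history.
        return set(), set()
    data = json.loads(path.read_text(encoding="utf-8"))
    effects = data.get("effects", [])
    slugs = {entry["slug"] for entry in effects if "slug" in entry}
    summaries = {entry["concept_summary"] for entry in effects if "concept_summary" in entry}
    return slugs, summaries
```

Exact slug matching is only the mechanical half of the check. Judging whether a new concept overlaps an old summary remains a reading task for the agent.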
First: read the manuscript to understand the topic, tone, and any objects, metaphors, or settings the narration leans on. The brainstorm should be informed by content — when a thematic tie-in is natural, it deepens the seam between cold-open and live narration.
Second — and this is critical: calibrate to the actual destination frame before brainstorming. Don't propose effects blind. The effects must work for the user's actual recording setup — the props, framing, lighting, and palette of the final talking-head shot.
Critical: do NOT auto-trust files in ./tmp/ during brainstorm. The brainstorm phase typically runs before the user has recorded anything for this video — they're using the brainstorm output to decide what to record. Any last*.jpg / screenshot*.jpg / empty.jpg files already sitting in ./tmp/ are almost always leftovers from a prior video's session, not the destination frame for the current one. Calibrating to those produces concepts that fit an old setup and waste the user's time.
The calibration policy is therefore:
- If the user explicitly passed a reference image (e.g. `--reference ./tmp/foo.jpg`, or mentioned a file path in the invocation), read it and use it.
- Otherwise, ask the user directly before brainstorming: "What does your recording setup look like for this video? (e.g., 'wide studio shot with a microphone in front, bicolor LED lighting, red back panel' or 'tight head-shot at a wooden writing desk with a laptop'). Or, if you have a representative reference frame, point me at it." Don't proceed with assumptions; wrong-aesthetic suggestions waste the user's time.
- Do not silently pull from `./tmp/` even if matching filenames exist. Stale references from prior sessions are the dominant case, not the exception. If you do notice candidate files there, mention them and ask the user whether they're current before using them; never read them as ground truth on your own.
Note: the ./tmp/ directory IS the trusted source during the continue phase (Step 4), because by then the user has explicitly handed you a screenshot path. That's an entirely different context from this brainstorm step.
When proposing concepts, anchor them in the actual setup. If the destination is a YouTube studio with bicolor LEDs, don't propose "warm window light through a writer's desk." If the destination is a quiet wooden desk, don't propose "vibrant LED panels flicking on." Each concept should make sense as a path into the specific frame the user will record.
Aim for a mix:
- At least 2 concepts that thematically reference the manuscript's subject. If the video is about AI video generation, the cold-open could involve a film slate, a camera, or an empty editing bay. If it's about distributed systems, racks of servers waking up. If it's about coding, a terminal cursor blinking to life. Pick props or settings that the manuscript itself invokes.
- The remaining concepts content-agnostic — universal cold-open patterns that work regardless of topic. These give the user a fallback when no thematic concept lands.
Each concept must:
- Be cinematic, live-action, and grounded (see Higgsfield Prompt Rules below).
- End on a static destination frame: the AI clip's last frame must be reachable from a normal recorded talking-head pose (sitting, looking at camera, hands at rest). Effects that demand the user be mid-motion at the end (walking in, mid-gesture, mid-glance) make the seam unstable and should be avoided.
- Differ in mechanic from one another: don't propose three variants of the same mechanic. Aim for variety across these patterns, but prioritize dramatic / kinetic effects over subtle ones:
Tier 1 — dramatic / kinetic / "special" (prefer these):
- Drop / fall into place — subject falls or drops vertically into the frame, lands in the destination pose.
- Materialize — empty space → particles assemble into the subject in place.
- Camera arrival / push-through — camera moves toward / through an obstacle (door, window, glass) → lands on the subject.
- Pull-back reveal — extreme close-up on an object → camera pulls back rapidly to reveal the wider scene with the subject.
- Smash / break-through — a barrier (glass, paper, dust cloud) shatters or clears, revealing the subject behind it.
- Doorway entry — POV through a closed door that opens, camera pushes through into the room with the subject inside.
Tier 2 — subtle / optical (use sparingly, only when the user has asked for restrained openers):
- Light reveal — dark room → light source flicks on → subject already in place.
- Time-lapse settle — environmental shift (dawn → dusk) → ends at present-day stillness.
- Reflection / screen wake — passive surface shows subject as reflection, then dissolves to direct view.
- Rack focus alone — pure focus pull from blur to sharp.
- Color flood — black-and-white frame floods with color over time.
Why the bias: purely optical or lighting changes (Tier 2) read as "the AI is hiding the transition" — they feel boring even when technically well-executed. Dramatic physical action and strong camera moves (Tier 1) read as intentional, produced, "special." A cold open should feel crafted. Default to proposing 4–5 Tier 1 concepts and at most 1–2 Tier 2 concepts in any brainstorm.
Exclude any concept whose slug or summary matches an entry from history (Step 3a).
For thematic concepts, mark the tie-in explicitly in the suggestion — e.g., "Thematic tie-in: the manuscript opens with the film slate metaphor; this effect carries that motif into the cold open." This helps the user judge whether the tie-in is too on-the-nose or just right.
For each candidate, output:
### Suggestion N — <short title>
**Concept (8s):** <one-paragraph cinematic concept; movie-like, real actors/props, no on-screen text>
**Recording requirement:** <exactly what the user needs to do when filming the talking-head shot — pose, framing, props, lighting; or "no special instructions, just record normally">
**End-frame fit:** <Excellent / Strong / Medium — how cleanly the AI clip can land on the recorded first frame>
**Identity:** <Narrator visible / Narrator not visible — does the destination frame include the user's face>
Then ask: "Which suggestion should I prepare? (number)"
If the user counters with their own multi-scene narrative — typically a montage across multiple environments ("asleep in bed → phone rings → walk to kitchen → coffee machine → sit at desk") — this cannot be done as a single 8s AI clip and you should not try. Identity preservation collapses across 3+ environments and the per-beat time would be ~1.5s, which reads as frantic. Instead, switch to the Multi-Clip Cold-Open Pipeline (separate section below) and exit the normal Step 3c/4 flow. The multi-clip path doesn't use ./tmp/.intro-effect-pending.json — it's driven directly through the conversation.
When the user picks a number:
1. Build the full Higgsfield prompt for the chosen effect using the layered structure in Higgsfield Prompt Rules. The destination is determined by `end_image` at generation time: describe the final pose / position in the ACTION block but do not describe a literal image. If `Identity: Narrator visible`, use phrasing like "the man from the reference image" in SUBJECT.
2. Pick a `genre` matching the effect's mood (drama for somber/grounded, action for kinetic, noir for dark/moody, etc.; never `auto`, never `comedy` unless the manuscript is comedic).
3. Slugify the suggestion title: lowercase, hyphens, alphanumeric only.
4. Evaluate the intro narration against the chosen concept (this is a first-class output of phase 1, not an afterthought). The cold-open and the first words of recorded narration are functionally a single creative unit: the transition lands only if both sides know about each other. Don't skip this step; the user reads the result before recording, which is the only chance to tweak the words they'll say.

Read the manuscript's intro section (the content between the first `## Header` and the next `## Header`, or end-of-file if there's no second header). Identify the first 1–3 sentences of actual spoken narration; those are the words the viewer hears immediately after the AI clip cuts to live footage.

Ask three diagnostic questions:
- Energy match. Does the narration's opening energy match where the cold-open lands? A kinetic drop-into-frame into a subdued "In this video, I'm going to walk through…" is a mismatch. A noir-lit slow reveal into a chirpy "Hey everyone!" is a mismatch.
- Seam tightness. Does the narration reference, riff on, or land cleanly on the cold-open's metaphor — or does it ignore something the viewer just spent 8 seconds watching? Either is a valid creative choice, but the user should make it deliberately.
- Entry-word strength. Is the first word or phrase the strongest possible entry point given what the viewer just saw? Sometimes the existing third sentence is the real opening, and the first two are throat-clearing.
If you spot a clear improvement, propose a specific edit:
Suggested narration tweak:
- **Current opening:** "<verbatim first sentence(s) from the manuscript>"
- **Suggested:** "<your proposed rewrite>"
- **Why:** <one sentence: name what the cold-open is doing and how the new opening lands on it>

If the existing narration already lands well on this concept, say so explicitly: "Current narration opening lands cleanly on this concept, no tweak suggested." Do not invent a tweak when none is needed; false positives waste the user's time and erode trust in this output.

If the manuscript has no intro narration (e.g., only a `## Header` with no following prose), skip this step entirely and tell the user: "No intro narration found in the manuscript; nothing to evaluate."

5. If a tweak was suggested, ask the user: "Apply this narration tweak to the manuscript before you record?"
- yes → apply the edit via the Edit tool, replacing the current opening with the suggested rewrite. Confirm: "Applied. Record with the new opening."
- no → leave the manuscript untouched. Confirm: "Recording as-is."
- modify → the user supplies their own rewrite. Apply that version, confirm.
If no tweak was suggested in step 4, skip this question entirely.
6. Write `./tmp/.intro-effect-pending.json` with the fields shown in Step 2.
7. Output to the user:

You picked: Suggestion N: <title>

Recording instructions:
<recording_instructions, verbatim>

When you've recorded the talking-head shot, take a screenshot of its first frame, save it to `./tmp/<descriptive-name>.jpg`, then:
- **Same conversation:** just tell me "continue, screenshot is at ./tmp/<descriptive-name>.jpg" (or similar). I'll pick up the pending state and generate the clip. No flag needed.
- **Fresh conversation** (you closed the chat in between): re-invoke `/manuscript-broll-opener <manuscript-path> --continue ./tmp/<descriptive-name>.jpg`.

Pending state saved. Pausing here.

8. Pause here. Do NOT call `generate_video` yet. This is the natural pause for the user to (optionally edit the narration further, then) record talking-head. Within the same session, wait for the user to come back with a screenshot path; do not require a flag-based re-invocation. Across sessions, the `--continue` flag is how the user re-enters this skill cleanly.
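
The slug and pending-state steps above are mechanical enough to sketch in code. This is an illustration under assumptions rather than the skill's real implementation: `slugify` and `save_pending_state` are hypothetical helper names, and the dictionary keys simply mirror the pending-state fields listed in Step 2.

```python
import json
import re
from pathlib import Path


def slugify(title):
    """Lowercase the title and collapse every non-alphanumeric run into one hyphen."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")


def save_pending_state(path, state):
    """Write the pending-state dict as pretty-printed JSON, creating ./tmp/ if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(state, indent=2) + "\n", encoding="utf-8")


# Example state; every value here is a placeholder, not real project data.
example_state = {
    "manuscript": "/abs/path/to/manuscript.md",
    "concept_slug": slugify("Film Slate: Cold Open!"),
    "concept_summary": "<one-line description>",
    "prompt_template": "<full Higgsfield prompt>",
    "narrator_as_subject": True,
    "recording_instructions": "<verbatim instructions>",
    "genre": "drama",
}
```

Calling `save_pending_state("./tmp/.intro-effect-pending.json", example_state)` would then produce the file that the continue phase reads back.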
Entered when the user signals they're ready with a screenshot — either via the --continue <screenshot-path> flag (cross-session) or via a same-session conversational handoff that points at a screenshot path. The pending-state file must exist either way.
1. Validate screenshot. Check that the path exists and is a `.jpg` / `.jpeg` / `.png` / `.webp`. If not, error and ask the user to fix the path.
2. Scan reference images for IP-flagged content before uploading. Higgsfield runs an intellectual-property detector on every reference image. If it flags something, the job comes back `ip_detected` (the credit is not charged, but you still burn a few minutes on the rejected-job + fix + retry cycle). The skill agent can avoid this by visually pre-scanning each reference image about to be uploaded.

Apply the scan to every reference image that will be uploaded fresh in this run: `end_image` (the destination screenshot, always), `start_image` (the empty-environment reference, if the chosen effect is materialize-style and a path was provided), and the identity headshot (only if it's not already cached in `./tmp/.headshots.json`).

For each image, read it visually and look for:
- Character art / illustrated IP: comic-book characters (Joker, Marvel, DC, anime figures), film/TV character posters, video-game character art.
- Brand logos at recognizable size: sponsor logos, branded merchandise, product packaging filling enough pixels to be identified.
- Framed posters or artwork that reads as a reproduction of a known copyrighted image.
- Figurines / collectibles: statues, action figures, branded merch on shelves, even small if the character is identifiable.
If anything is flagged, stop before uploading and report the specifics back to the user. Be precise: name the item, name the location in frame ("right edge", "behind subject's shoulder", "top-left shelf"), and recommend an action ("crop the right ~25% of the frame, save as `<original-name>-cropped.jpg`, and re-invoke `--continue`" or "restage and re-shoot").

Example report:

"Heads-up: scanned `./tmp/last.jpg` and found two likely IP issues:
- Right edge of frame: framed Joker (DC Comics) poster, large and clearly identifiable.
- Top-left shelf, behind subject: small purple figurine that reads as a Joker collectible.

Higgsfield's IP detector will probably reject the job. Crop the right ~30% to remove the poster, and either crop or hide the figurine, then save as `last-cropped.jpg` and point me at the cropped file. Or say `proceed anyway` and I'll submit the original (and we'll likely need to fix and retry)."

Wait for the user to either fix and re-invoke, or say `proceed anyway`. Do not auto-proceed when something is flagged; let the user choose.

If nothing is flagged, proceed silently to step 3. There is no need to spam the user with "scan clean."
3. Scan the end_image for mid-speech / mid-gesture poses. Higgsfield uses `end_image` as the target the AI animates toward, meaning the subject's face and body in the AI clip's final beat will match the pose in this image. If the end_image captures a mid-speech moment (mouth visibly open, jaw mid-articulation, lips formed around a vowel/consonant), the AI will animate the subject talking through the entire clip. That has two consequences:
- Visual: the subject appears to be delivering dialogue in the cold open, which collides with the recorded voiceover that follows.
- Audio: Seedance (and similar models) react to a "talking" subject by generating fabricated dialogue audio in the soundtrack. That's the actual root cause of unwanted AI voiceover in clips — the model is "voicing" the visible mouth motion.
Read the end_image visually. Look for:
- Mouth open with visible teeth or formed lip shape (mid-vowel).
- Asymmetric jaw position (mid-articulation).
- Hand caught mid-gesture in the air (not at rest).
- Eyes mid-blink or off-axis (mid-glance away from camera).
If any are present, warn the user before uploading:

"Heads-up: scanned `<end_image>` and the captured frame shows the subject mid-speech (mouth open) / mid-gesture (hand mid-air). Higgsfield will animate dialogue / motion toward that pose, which produces a 'talking' cold-open subject and fabricated voiceover audio. Better: scrub the talking-head recording for a moment where you're silent (between sentences, mouth closed, hands at rest) and screenshot that frame instead. Save as `<name>-still.jpg` and point me at the new file. Or say `proceed anyway` to use the current frame."

Wait for the user to either fix and re-invoke, or say `proceed anyway`. Do not auto-proceed when the frame is mid-action.

If the frame is settled (mouth closed, hands at rest or out of frame, gaze on-camera), proceed silently; no need to confirm.
4. Upload screenshot as end_image. Call `mcp__higgsfield__media_upload` with the screenshot file. (No caching for end_image; it's a one-off frame for this specific clip.) Save the returned UUID. Call `media_confirm` if required by the upload response.
5. If `narrator_as_subject` is true, also resolve a headshot UUID following the same flow as `manuscript-broll-suggest` Step 4 part 0:
- First check `./tmp/.headshots.json` for `canonical_uuid`. If non-empty, use it.
- Otherwise look for `screenshot*.{jpg,jpeg,png,webp}` files in `./tmp/`. Pick the cleanest front-facing one. Reuse cached UUID if mtime matches; otherwise upload + cache.
6. Pick model and resolve params. Default to `seedance_2_0`. Switch to `kling3_0` only when the criteria in Higgsfield Prompt Rules → Model defaults for openers are all met (kinetic Tier 1 effect, narrator's face clearly in the `end_image`, no separate identity reference required, 4K wanted). Verify via `models_explore` (`action: get`, `model_id: <chosen>`) that `medias[].roles` includes `end_image`. Resolve highest-quality params per model:
- seedance_2_0: `resolution: "1080p"`, `mode: "std"`, `genre` from pending state.
- kling3_0: `mode: "4k"`, `sound: "off"` (the recorded narration replaces the AI audio anyway, and the visible subject must never appear to speak; see the no-speaking rule). No `genre` parameter on Kling 3.0.
7. Call `mcp__higgsfield__generate_video` with:
- `model: <chosen>`
- `prompt: <prompt_template from pending state>`
- `params.duration: 8`
- `params.aspect_ratio: "16:9"`
- Model-specific quality params from step 6.
- `params.medias`:
  - Always: `{role: "end_image", value: "<destination-screenshot-uuid>"}`
  - If narrator-as-subject AND model is seedance_2_0: `{role: "image", value: "<headshot-uuid>"}`. Skip this for `kling3_0`: Higgsfield does not expose Kling 3.0's identity-image role; the headshot would have to be reused as `start_image`, which conflicts with the materialize use case below.
  - Optional but strongly recommended for materialize / "empty environment to populated" effects: `{role: "start_image", value: "<empty-environment-uuid>"}`. Without it, the AI invents what the empty version of the environment looks like, which usually mismatches the user's real space (wrong colors, wrong prop positions). With it, the transition is anchored from a real empty frame to a real populated frame, eliminating the "imagined → real" mismatch. The empty-environment screenshot should be the same camera angle and lighting as the destination, just without the narrator. Standard path: `./tmp/empty.jpg` (the skill should look for this when the chosen effect is materialize-style).
8. Poll `mcp__higgsfield__job_status` with `sync: true` until terminal. Typical: 60–180s.
9. On success, extract the result video URL.
If balance shows insufficient credits before generating, surface that and stop.
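
The per-model parameter and `medias` rules above can be summarized as a pair of small functions. This is a sketch only: the role names, model IDs, and parameter values come from the steps above, while the helper names (`quality_params`, `build_medias`) and the exact dictionary layout passed to the tool are assumptions made for illustration.

```python
def quality_params(model, genre=None):
    """Return the highest-quality params described for each supported model."""
    if model == "seedance_2_0":
        return {"resolution": "1080p", "mode": "std", "genre": genre}
    if model == "kling3_0":
        # Kling 3.0 takes no genre; sound is off so the subject never appears to speak.
        return {"mode": "4k", "sound": "off"}
    raise ValueError(f"unsupported model: {model}")


def build_medias(model, end_image_uuid, narrator_as_subject=False,
                 headshot_uuid=None, start_image_uuid=None):
    """Assemble the medias list: end_image always, identity and start_image conditionally."""
    medias = [{"role": "end_image", "value": end_image_uuid}]
    # The identity headshot is only attached for Seedance; it is skipped for Kling 3.0.
    if narrator_as_subject and model == "seedance_2_0" and headshot_uuid:
        medias.append({"role": "image", "value": headshot_uuid})
    # The empty-environment frame anchors materialize-style effects when provided.
    if start_image_uuid:
        medias.append({"role": "start_image", "value": start_image_uuid})
    return medias
```

Keeping the model check inside `build_medias` means the Kling 3.0 exception lives in one place instead of being remembered at each call site.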
- Target path: `./tmp/<manuscript-stem>-opener-<concept_slug>.mp4`. If the file exists, append `-v2`, `-v3`, etc.
- Download via `curl -L -o <path> <url>`.
- Open in QuickTime: `open -a "QuickTime Player" <path>`.
- Print the local path and the Higgsfield job ID.
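
The versioned target path could be computed as in the sketch below. The naming pattern follows the rule above; the function name `next_clip_path` is hypothetical.

```python
from pathlib import Path


def next_clip_path(tmp_dir, manuscript_stem, concept_slug):
    """Return the first free path: base.mp4, then base-v2.mp4, base-v3.mp4, and so on."""
    base = f"{manuscript_stem}-opener-{concept_slug}"
    candidate = Path(tmp_dir) / f"{base}.mp4"
    version = 2
    while candidate.exists():
        candidate = Path(tmp_dir) / f"{base}-v{version}.mp4"
        version += 1
    return candidate
```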
Audio: leave the AI-generated audio track on the file. Ambient room tone, screen hum, footsteps, glass settle, etc. add atmosphere and the editor will mute or replace the track in post anyway. The thing to prevent is the AI generating dialogue — see the Step 4 mid-speech guard and the "no speaking" rule in Higgsfield Prompt Rules. The fix for unwanted voiceover is at the source (subject not talking in the clip), not at the file level.
After the clip is open in QuickTime, ask:
Keep this clip? (yes / no / refine)
- yes → mark in manuscript (Step 6a), append to history (Step 6b), clear pending state (Step 6c).
- no → delete the clip file (`rm <path>`), leave the manuscript untouched, leave the pending state intact so the user can retry without losing context. Tell them: "Clip deleted. Pending state preserved. Re-invoke with `--continue` and a new screenshot, or pass `--reset` to brainstorm fresh."
- refine → take feedback, rewrite the prompt (preserve the effect's metaphor and end-frame target; tweak action / lighting / camera per feedback), regenerate as `-v2.mp4`, delete previous version once the new one lands. Return to Step 6 for the new version.
When the user says yes:
1. Compute `CLIP_NAME` = basename of the saved file without extension.
2. Insert `TODO: intro-clip: <CLIP_NAME> (plays before talking-head)` immediately under the first `## Intro` header (or before the first `## Header` if there is no `## Intro`), with a blank line above and below. The parenthetical annotation is required: the editor needs to know without reading the SKILL doc that these clips play before the recorded narration.
3. If a `TODO: intro-clip:` line already exists at that location, replace its clip name (refinement / replacement case) rather than adding a second, unless this is a multi-clip sequence (see point 5).
4. If the manuscript has a `TODO: clip:` line at the top with a `(before the narration starts)` annotation (older convention), ask the user whether to replace it with the new `TODO: intro-clip:` form or leave it. Default: replace.
5. Multi-clip cold-open sequences (when the user wants a montage instead of a single AI clip, e.g., bedroom → coffee → desk, generated as multiple Higgsfield clips and edited together): write one `TODO: intro-clip:` line per clip, in playback order, each annotated with its position and the special role of the final one:

TODO: intro-clip: <CLIP_NAME_1> (plays before talking-head, 1/N)
TODO: intro-clip: <CLIP_NAME_2> (plays before talking-head, 2/N)
...
TODO: intro-clip: <CLIP_NAME_N> (plays before talking-head, N/N, hard cut into recorded footage at final frame)

Only the last clip's final frame is locked to the recorded talking-head's first frame via `end_image`. The earlier clips are standalone and stand on their own cuts.
Format (single clip):
## Intro
TODO: intro-clip: <CLIP_NAME> (plays before talking-head)
<existing first paragraph...>
Note for the editor: TODO: intro-clip: differs from TODO: clip: — the clip plays before recorded narration and ends at the recorded first frame, so the cut is hard from AI clip → live footage with no overlap.
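
For the single-clip case, the insert-or-replace behaviour described above might be sketched as follows. The helper name `mark_opener` is hypothetical. It handles neither the multi-clip sequence nor the older `TODO: clip:` convention, and it treats any existing `TODO: intro-clip:` line as the one to replace, which is a simplifying assumption.

```python
TODO_PREFIX = "TODO: intro-clip:"


def mark_opener(manuscript_text, clip_name):
    """Insert or replace the single-clip TODO line and return the updated text."""
    new_line = f"{TODO_PREFIX} {clip_name} (plays before talking-head)"
    lines = manuscript_text.split("\n")

    # Replacement case: an opener line already exists, so swap it in place.
    for i, line in enumerate(lines):
        if line.strip().startswith(TODO_PREFIX):
            lines[i] = new_line
            return "\n".join(lines)

    # Preferred location: directly under "## Intro", with one blank line on each side.
    intro_idx = next((i for i, l in enumerate(lines) if l.strip().lower() == "## intro"), None)
    if intro_idx is not None:
        rest = intro_idx + 1
        while rest < len(lines) and lines[rest].strip() == "":
            rest += 1
        lines[intro_idx + 1:rest] = ["", new_line, ""]
        return "\n".join(lines)

    # Fallback: before the first "## " header (or at the top if there is none).
    header_idx = next((i for i, l in enumerate(lines) if l.startswith("## ")), 0)
    lines[header_idx:header_idx] = [new_line, ""]
    return "\n".join(lines)
```

Working on the text as a list of lines keeps the blank-line handling explicit, which matters because the editor reads the marker visually.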
Append to ./.claude/skills/manuscript-broll-opener/intro-effects-used.json:
{
"slug": "<concept_slug>",
"manuscript": "<basename of manuscript file>",
"date": "<YYYY-MM-DD, today>",
"concept_summary": "<one-line description of what the viewer sees>"
}

If the file doesn't exist, create it with `{"effects": [<this entry>]}`. Pretty-print with 2-space indent for readability.
The concept_summary should describe what the viewer sees, not implementation details. The summary is what future brainstorms will read to judge whether a new candidate concept overlaps with this one — so describe the visual (setting, action, framing), not the technical params.
- Good: "Empty studio chair, man drops vertically from above into the chair, slow camera push-in across 8 seconds, lands on tighter destination framing."
- Bad: "Seedance 2.0, drama genre, 1080p, locked-off camera, end_image set to last-cropped.jpg."
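
Appending an accepted concept to the history file could be done along these lines. This is a sketch under the JSON shape shown above; `append_history` is a hypothetical name, and the optional `today` argument exists only so the date can be fixed in tests.

```python
import json
from datetime import date
from pathlib import Path


def append_history(history_path, slug, manuscript_path, concept_summary, today=None):
    """Append one accepted effect to the history file, creating the file if missing."""
    path = Path(history_path)
    if path.exists():
        data = json.loads(path.read_text(encoding="utf-8"))
    else:
        data = {"effects": []}
    data.setdefault("effects", []).append({
        "slug": slug,
        "manuscript": Path(manuscript_path).name,  # basename only
        "date": today or date.today().isoformat(),
        "concept_summary": concept_summary,
    })
    path.parent.mkdir(parents=True, exist_ok=True)
    # Pretty-print with a 2-space indent for readability.
    path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")
    return data
```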
Delete ./tmp/.intro-effect-pending.json.
Print:
- Accepted clip's local path.
- Higgsfield job ID.
- That the manuscript now has `TODO: intro-clip: <name>` at the top.
- That `<concept_slug>` is now in the cross-video history and won't be re-suggested.
The default skill flow above generates one 8-second AI clip that lands on the recorded talking-head. Sometimes the user wants a narrative montage instead — a sequence of beats across multiple environments (e.g. "asleep in bed → phone rings → coffee machine → sit at desk"). A single 8s shot can't carry that. Use this alternative pipeline.
During brainstorm (Step 3), if the user proposes a multi-environment narrative (waking up scene, walking-into-room sequence, day-in-the-life montage, etc.) that obviously won't fit one 8s clip, surface the multi-clip option explicitly. Don't try to compress 5 scenes into one shot — Seedance can't render the user's face accurately across 3+ different environments in a single clip; identity falls apart.
Before starting, tell the user:
- Cost: ~7–12 Higgsfield jobs (one keyframe image + one video per beat, plus refinements). At ~50 credits per video and ~10 per image, expect 200–400 credits total.
- Time: 30–60 min of generation + your edit time afterward.
- Identity drift: the user's face rendered in a bedroom or kitchen will drift from the studio headshot reference. Expect 1–2 beats to need regeneration. Camera angles that hide or partially obscure the face (asleep on pillow, low-angle on legs, hand-only shots, body-only shots cropped at the neck) sidestep this entirely — bias toward those when possible.
- Continuity: wardrobe and props must be specified per-prompt or the AI arbitrarily costumes the subject between beats.
- Operates outside the normal pending-state: the skill's `./tmp/.intro-effect-pending.json` mechanism is for single-clip flow. Multi-clip is driven directly through the conversation; track progress in your head / via TaskCreate rather than the pending file.
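
To make the cost bullet concrete, here is a rough estimator. It uses the approximate rates quoted above (about 50 credits per video and about 10 per image) as defaults; the function name and the split into refinements versus regenerations are assumptions for illustration, and actual Higgsfield pricing may differ.

```python
def estimate_credits(beats, image_refinements=0, video_regenerations=0,
                     credits_per_image=10, credits_per_video=50):
    """Estimate job count and credits: one keyframe and one video per beat, plus retries."""
    images = beats + image_refinements
    videos = beats + video_regenerations
    return {
        "jobs": images + videos,
        "credits": images * credits_per_image + videos * credits_per_video,
    }
```

For a 4-beat montage with two keyframe refinements and one regenerated video, `estimate_credits(4, 2, 1)` gives 11 jobs and 310 credits (6 images at 10 plus 5 videos at 50), which sits inside the ranges quoted in the cost bullet.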
Lock the beats with the user before generating anything:
- How many beats (typically 3–5).
- What environment each one is in.
- What action / motion each one carries.
- Which beats hide the user's face (cheaper, lower identity risk) vs. which show the face (higher risk).
- The wardrobe progression (the user may change clothes between bedroom and final desk).
- The final beat: this is the one that lands on the recorded screenshot via `end_image`.
For beats 1..N-1 (no end_image lock):
- Generate keyframe image with `mcp__higgsfield__generate_image`. This is the first frame of the beat's video.
- Show keyframe to user. Keep / refine / discard. Refinement loop: see "Image refinement" below.
- On accept: generate the video with `mcp__higgsfield__generate_video`, passing the accepted keyframe's job ID as `start_image`. Duration: 4s (Seedance 2.0 minimum) unless the beat genuinely needs longer.
- Parallelize: while the video bakes, start generating the next beat's keyframe. The video poll loop and image generation are independent.
For beat N (final, lands on screenshot):
- Run the existing `--continue` validation: IP scan, mid-speech scan on the screenshot.
- Generate beat-N keyframe showing the start of the final motion (user entering frame, hand placing mug, etc.); composition should approach the recorded screenshot's framing.
- On accept: generate video with BOTH `start_image` (beat-N keyframe) AND `end_image` (the screenshot). Duration: 4–6s.
- Use `nano_banana_2` (Nano Banana Pro). It respects the prompt while using the reference image for guidance.
- Do NOT use `soul_2` for keyframes that diverge from the reference image's content. Soul 2.0 runs `enhance_prompt: true` by default and silently rewrites your prompt to re-describe the reference image. You ask for "man asleep in dark bedroom with a phone on the nightstand"; it gives you the man in his studio with a microphone, because that's what the reference image showed. Soul 2.0 is fine for portraits/UGC that intentionally match the reference; it's wrong for any beat where the scene differs.
When the user wants a tweak ("darker window", "move slippers right", "swap the espresso machine"), don't regenerate from scratch with a longer prompt. Use image-to-image editing:
- Upload the current keyframe (`media_upload` + curl PUT + `media_confirm`).
- Call `generate_image` with `nano_banana_2`, passing the uploaded image as `medias[].role: "image"`.
- Prompt: "Edit this image: keep EVERYTHING identical (list the things to preserve) EXCEPT <the requested change>. All other elements of the image unchanged."
This works well for: window lighting, prop positions, swapping an object for a similar one, removing or adding a small element. Saves credits and preserves the parts the user already approved.
Seedance 2.0 has generate_audio: true baked in and there's no parameter to disable it. The only lever is the prompt itself. To suppress music while keeping ambient sound effects, add an explicit AUDIO block to every video prompt:
AUDIO: ambient room tone only — <list the specific diegetic sounds appropriate to the scene>. NO MUSIC. NO SCORE. NO SOUNDTRACK. NO MELODY. NO BACKGROUND MUSIC. NO DIALOGUE. NO VOICEOVER. Only diegetic natural sound effects from the action.
This is effective; without it Seedance often layers a faint score under everything.
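As a rough sketch, the AUDIO block can be built from a list of scene sounds so that it is appended consistently to every video prompt. The function name `audio_block` is hypothetical; the fixed wording is copied from the block above.

```python
def audio_block(diegetic_sounds):
    """Build the AUDIO block for a video prompt from scene-specific sounds.

    `diegetic_sounds` is a list of short descriptions, e.g. ["kettle hiss"].
    The fixed wording mirrors the AUDIO block documented above.
    """
    if not diegetic_sounds:
        raise ValueError("list at least one diegetic sound for the scene")
    sounds = ", ".join(diegetic_sounds)
    # \u2014 is the dash character used in the documented block.
    return (
        f"AUDIO: ambient room tone only \u2014 {sounds}. "
        "NO MUSIC. NO SCORE. NO SOUNDTRACK. NO MELODY. NO BACKGROUND MUSIC. "
        "NO DIALOGUE. NO VOICEOVER. "
        "Only diegetic natural sound effects from the action."
    )
```

Requiring at least one sound keeps the block specific to the scene rather than a bare list of prohibitions.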
- Keyframes: `./tmp/<beat-name>-keyframe-v<N>.png` (e.g. `bedroom-keyframe-v3.png`).
- Videos: `./tmp/<manuscript-stem>-opener-clip<N>-<beat-name>.mp4` (e.g. `grafana-agent-opener-clip1-bedroom.mp4`). The `clip<N>` prefix makes playback order obvious to the editor.
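The naming scheme above can be expressed as two small helpers. The function names and the `tmp_dir` parameter are illustrative assumptions; only the path patterns come from the conventions listed above.

```python
def keyframe_path(beat_name, version, tmp_dir="./tmp"):
    """Path for the Nth revision of a beat's keyframe image."""
    return f"{tmp_dir}/{beat_name}-keyframe-v{version}.png"


def clip_path(manuscript_stem, clip_number, beat_name, tmp_dir="./tmp"):
    """Path for a finished clip; clip_number encodes playback order."""
    return f"{tmp_dir}/{manuscript_stem}-opener-clip{clip_number}-{beat_name}.mp4"
```

For example, `keyframe_path("bedroom", 3)` gives `./tmp/bedroom-keyframe-v3.png`, and `clip_path("grafana-agent", 1, "bedroom")` gives `./tmp/grafana-agent-opener-clip1-bedroom.mp4`.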
- Manuscript: write one `TODO: intro-clip:` line per clip in playback order, with annotations as documented in Step 6a (point 5).
- History (`intro-effects-used.json`): write one entry summarizing the whole montage concept, not one entry per clip. The `concept_summary` should describe the visual arc across all beats so future brainstorms know the metaphor is taken (e.g. "wake-up-to-desk montage", "alert-at-night ritual"). One slug like `midnight-alert-coffee-ritual` is enough.
Sourced from Higgsfield's official cinematic prompt guide. Apply these in every generated prompt and every refinement.
Do:
- Short, direct sentences. Higgsfield reacts better to commands than descriptive paragraphs that force it to guess. "Dolly in slowly" beats "cinematic movement."
- Specific camera verbs: `dolly in`, `dolly out`, `orbit`, `handheld`, `tracking`, `crash zoom`, `FPV`, `locked-off`, `mounted on dashboard`.
- Explicit timing cues: "For the first 2 seconds, …", "At 4s, suddenly …", "By 7s, …". Sequence calm → shift → payoff. For openers especially, the last 1–2 seconds must resolve into stillness so the AI's final frame can match the recorded `end_image` cleanly.
- Physical detail over abstraction: body trembles, breath quickens, jaw tenses, eyes widen. Show emotion through the body, not adjectives.
- Layered separation: one prompt = one task. Camera motion lives in the video prompt only. Don't try to change identity AND move the camera in the same shot.
- Mood close: end the prompt with the resolving emotional tone (tense, hopeful, unsettled, triumphant).
Don't:
- Vague style words: ❌ "cinematic", "dynamic", "epic", "beautiful". They give the model nothing to act on.
- Mixed instructions: don't combine identity edits with camera moves in one prompt.
- Stacking multiple visual styles in one shot.
- Repeating the same instruction in different words.
- Describing what you don't want at length — Higgsfield biases toward what's stated. Keep negatives minimal: "no text on screen" is fine; long don't-lists waste budget.
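A draft prompt can be checked against the vague style words listed above before it is sent. This is a minimal sketch: the word list is taken from the "Don't" list, while the function name `find_vague_words` is a hypothetical helper, not part of any Higgsfield tooling.

```python
import re

# Vague style words called out in the "Don't" list above.
VAGUE_STYLE_WORDS = ("cinematic", "dynamic", "epic", "beautiful")


def find_vague_words(prompt):
    """Return the vague style words that appear in the prompt, case-insensitively."""
    found = []
    for word in VAGUE_STYLE_WORDS:
        # Whole-word match so that e.g. "epicenter" is not flagged.
        if re.search(rf"\b{word}\b", prompt, flags=re.IGNORECASE):
            found.append(word)
    return found
```

An empty result means none of the listed words were found. It does not check the other rules, such as mixed instructions or stacked styles.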
Live-action only — non-negotiable: Every clip must look like a frame pulled from a real motion picture. Treat this as a hard constraint, not a stylistic preference.
- The verbatim `FILM LOOK` block from the prompt template is mandatory in every prompt.
- Subjects are always real humans, real animals, or real physical objects in real locations. No anthropomorphic characters, no mascots, no avatars.
- Lean on real-world cinematic references: a hand-held documentary shot, a slow studio dolly, a kitchen-sink drama interior, a film-noir alley, a 70s thriller car interior, an Arri Alexa interview setup. Concrete > abstract.
- Lighting, lens, and color must be motivated and physical: practical lamps, sunlight through blinds, sodium streetlights, candlelight, overcast diffuse. Never "magical glow," "neon overlays," "pulsing energy," or other CGI-coded language.
- If the metaphor is abstract, translate it into a concrete real-world image instead of a stylized rendering.
- Keep the subject grounded — physical effort, fatigue, breath, weight. Reality has texture; CGI flattens it.
The subject must never appear to be speaking — non-negotiable. A cold-open clip transitions into voiceover narration. If the visible subject's mouth is moving or they're delivering dialogue, the viewer's brain expects to hear those exact words, which collides with the narration audio that follows. Build the cinematic concept around physical action, contemplation, stillness, hands-on craft, or environmental motion — never mid-conversation, never visibly mouthing words.
Prompt template:
SCENE: <opening setup — shot type, framing, location>
SUBJECT: <who/what is in frame, with physical detail; if narrator-as-subject, lean on "the person/man from the reference image">
ACTION: <chronological beats with timing markers — "For the first 2 seconds...", "At 4s, suddenly...", "By 7s..." — culminating in stillness for the final beat to match end_image>
CAMERA: <specific camera verb — dolly in, orbit, handheld, crash zoom, FPV, locked-off>
LIGHTING & LOOK: <natural / practical / shallow depth of field / film stock reference; should be reachable from the recorded talking-head's lighting in the final beat>
FILM LOOK: live-action footage, shot on 35mm or Arri Alexa, 24fps, anamorphic-style depth of field, natural film grain, motivated lighting, color-graded like a feature film. NOT animation, NOT CGI, NOT cartoon, NOT illustration, NOT 3D render, NOT video game, NOT motion graphics.
END MOOD: <how the shot resolves emotionally>
Keep the prompt 100–200 words. The FILM LOOK block is required and verbatim in every prompt.
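One way to keep the FILM LOOK block verbatim and the length in range is to assemble prompts programmatically. The sketch below assumes a simple one-line-per-section layout; `build_prompt` and `word_count_ok` are hypothetical helper names, and counting words by whitespace is an approximation.

```python
# Copied verbatim from the prompt template above.
FILM_LOOK = (
    "live-action footage, shot on 35mm or Arri Alexa, 24fps, anamorphic-style "
    "depth of field, natural film grain, motivated lighting, color-graded like "
    "a feature film. NOT animation, NOT CGI, NOT cartoon, NOT illustration, "
    "NOT 3D render, NOT video game, NOT motion graphics."
)


def build_prompt(scene, subject, action, camera, lighting, end_mood):
    """Assemble a prompt in template order, always inserting FILM_LOOK verbatim."""
    sections = [
        ("SCENE", scene),
        ("SUBJECT", subject),
        ("ACTION", action),
        ("CAMERA", camera),
        ("LIGHTING & LOOK", lighting),
        ("FILM LOOK", FILM_LOOK),
        ("END MOOD", end_mood),
    ]
    return "\n".join(f"{label}: {text}" for label, text in sections)


def word_count_ok(prompt, low=100, high=200):
    """True if the prompt's whitespace-separated word count is within [low, high]."""
    return low <= len(prompt.split()) <= high
```

Because the caller never passes the FILM LOOK text, it cannot be paraphrased or dropped by accident.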
Model defaults for openers:
- `seedance_2_0`: default. Text→video with `end_image` + a dedicated `image` role for an independent identity reference, strong identity preservation, max 1080p, genre control. Pick this when:
  - The narrator is visible only partially or not at all in the destination frame and you need a separate headshot reference to lock identity (the destination is hands-on-keyboard, a side-angle, an over-the-shoulder shot).
  - The effect leans on natural human body motion: settling into a chair, breath, micro-gestures, weight, fabric/hair physics. Seedance 2.0 was purpose-built for human body realism.
  - The narrator-as-subject flag would otherwise apply.
- `kling3_0`: viable alternative for kinetic openers. Released Feb 2026. Use when all of the following are true:
  - The narrator's face is clearly resolved in the `end_image` (Kling 3.0 on Higgsfield only exposes `start_image`/`end_image` roles, with no dedicated identity-image role, so the destination frame must carry identity on its own).
  - The Tier 1 effect is kinetic / action-heavy: drop-into-place, camera push-through, smash-through, FPV door entry, crash zoom. Kling 3.0's strength is kinetic motion realism and physical momentum.
  - You want the 4K resolution bump: set `params.mode: "4k"`.

  Skip Kling 3.0 for Tier 2 / subtle openers and for openers where the destination is a side-angle or an off-axis shot of the narrator.
- Avoid `kling_2_6` and earlier: they expose only `start_image`, no `end_image`, so they can't land the seam at all.
- Avoid `wan_2_6`: no `end_image` support.
- `wan_2_7` has `end_image` but no identity-image role and weaker identity preservation than Seedance 2.0. Only consider it if you specifically need its `audio` reference role.
- Avoid `marketing_studio_video`: caps at 720p, designed for product ads, not openers.
Duration: 8s default. Server clamps to nearest allowed value per model.
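The choice between the two recommended models can be sketched as a small decision function. The three boolean inputs and the returned dictionary shape are illustrative assumptions; the model IDs and the `mode: "4k"` parameter come from the defaults above.

```python
def pick_model(face_resolved_in_end_image, kinetic_effect, wants_4k):
    """Choose a video model for an opener, following the defaults above.

    Kling 3.0 is chosen only when all three conditions hold; otherwise the
    Seedance 2.0 default applies. The return shape is a hypothetical
    illustration, not a Higgsfield request body.
    """
    if face_resolved_in_end_image and kinetic_effect and wants_4k:
        return {"model": "kling3_0", "params": {"mode": "4k"}}
    return {"model": "seedance_2_0", "params": {}}
```

A side-angle destination frame means the face is not clearly resolved, so the function falls back to `seedance_2_0` even for a kinetic effect.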
- Always live-action. Every clip must look like footage from a real motion picture — no animation, CGI, cartoon, illustration, 3D render, motion graphics, or video-game look, ever.
- Real actors, real props, real locations. Real physical lighting.
- Motion throughout. The final pose should match `end_image`, but the clip does not need to be "motionless" before that. The `end_image` parameter drives identity preservation by giving the AI a target to land on; that's what locks the pose. It does NOT mean the subject should be frozen throughout the clip. Do NOT write prompts that demand the subject be "motionless throughout" or "still for the last 2 seconds"; that produces dead, boring clips. Write prompts where the subject has natural micro-movement (forming, settling, breathing, slight head shifts, hands moving into position) all the way through, with the final frame arriving at the destination pose.
- Seam invisibility is rarely achievable anyway. If the user's recorded talking-head footage is at a different framing than `end_image` (e.g., `end_image` is a cropped frame to remove an IP-flagged background element, but the actual recording is uncropped), the cut from AI clip → live footage will have a visible frame-size pop regardless of how still the AI clip's last frame is. In that case, "stillness in the final beat" earns nothing, because the cut will be visible no matter what. Lean into motion across the whole clip.
- Zero text on screen: no captions, no labels, no subtitles.
- 0–3s setup, 3–6s transformation, 6–7s settling, 7–8s steady on the destination pose.
- Maximum 3–4 visual elements.
- Lighting in the final beat should approximate the recorded talking-head's lighting; this minimizes color/exposure pop at the seam.
- No pending state, `--continue` passed: tell the user to run without `--continue` first to brainstorm and pick an effect.
- Pending state exists, no `--continue`, no `--reset`: ask whether to resume or discard.
- Higgsfield job fails: surface the error message verbatim, offer to retry with a simplified prompt.
- Screenshot upload fails: surface the error, ask the user to verify the file path and re-point you at the screenshot (same-session) or re-invoke with `--continue` (cross-session).
- Download fails: keep the Higgsfield URL visible and let the user retry the download manually.
- History file unreadable / malformed JSON: tell the user, but proceed with empty history (don't block brainstorming).
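The last rule (report the problem, then continue with empty history) might look like the following sketch. It assumes the history file holds a top-level JSON list, which this section does not confirm; `load_history` is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_history(path="intro-effects-used.json"):
    """Return (entries, warning). On any read or parse failure, entries is []
    and warning is a message to show the user; brainstorming is not blocked.

    Assumes the file holds a top-level JSON list (not confirmed by the source).
    """
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:
        # OSError: missing or unreadable file. ValueError: malformed JSON or bad encoding.
        return [], f"Could not read history at {path} ({exc}); proceeding with empty history."
    if not isinstance(data, list):
        return [], f"History at {path} is not a list; proceeding with empty history."
    return data, None
```

Returning the warning instead of printing it leaves the wording and timing of the user message to the caller.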