This is the recipe used for Crazy_Little_Thing_Called_Love/concert_edit.mp4.
Reuse it for the next song. The whole pipeline is just two Python scripts +
mlt-melt; no NLE GUI needed.
A directory per song containing N (3-6) *.mp4 files, each a recording of
the same song from a different angle. Naming convention used so far:
<angle>-songN.mp4 (e.g. center-song5.mp4, left-song5.mp4,
misha-song5.mp4 — first names are camera operators for handheld). Files
are roughly trimmed to start/end at the song boundaries.
Different sources will typically have different resolutions (some 4K, some 1080p) and slightly different frame rates (29.97 vs 30.000). Output is 1920×1080 at the frame rate of whichever source provides the audio.
Even if files appear pre-trimmed to matching durations, they're almost never in tight sync. Cross-correlate audio envelopes to find per-source offsets relative to a chosen reference (usually the center/wide camera, which also provides the final audio).
# 1. Extract mono 8 kHz WAVs for fast correlation
mkdir -p /tmp/sync_<song> && cd <song-dir>
for f in *.mp4; do
ffmpeg -y -i "$f" -ac 1 -ar 8000 -vn -f wav "/tmp/sync_<song>/${f%.mp4}.wav"
done
# 2. Run envelope-based xcorr (script lives one directory up, next to HOWTO).
# The script auto-detects WAVs in the given directory and uses any file
# whose name starts with 'center' as the reference (override by passing
# its basename as a second arg).
python3 ../xcorr2.py /tmp/sync_<song>

Use the envelope (20 ms RMS), not raw samples: raw cross-correlation gives weak peaks (~0.05) because different mics in different positions hear very different acoustics. Envelope correlation lifts the peak to 0.4-0.6 and clearly beats the second-strongest peak.
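For orientation, here is a minimal sketch of envelope-based lag estimation. It is not the contents of xcorr2.py: the names rms_envelope and envelope_lag are hypothetical, numpy is assumed to be installed, and the correlation is done via FFT so that a 3-minute 8 kHz track stays fast. The returned lag follows the same sign as scipy's correlate(ref, src).

```python
import numpy as np

def rms_envelope(x, sr=8000, win_s=0.020):
    """Sliding RMS over a win_s window, one value per input sample."""
    win = max(1, int(round(sr * win_s)))
    power = np.convolve(np.asarray(x, dtype=np.float64) ** 2,
                        np.ones(win) / win, mode="same")
    return np.sqrt(power)

def envelope_lag(ref, src, sr=8000):
    """Return (lag_seconds, normalized_peak) for correlate(ref, src)."""
    a = rms_envelope(ref, sr)
    v = rms_envelope(src, sr)
    a = a - a.mean()
    v = v - v.mean()
    n = len(a) + len(v) - 1
    nfft = 1 << (n - 1).bit_length()          # next power of two, avoids wrap-around
    c = np.fft.irfft(np.fft.rfft(a, nfft) * np.conj(np.fft.rfft(v, nfft)), nfft)
    # Index k >= 0 holds lag k; index nfft - k holds lag -k. Reorder to ascending lags.
    c = np.concatenate((c[-(len(v) - 1):], c[:len(a)]))
    lags = np.arange(-(len(v) - 1), len(a))
    c = c / np.sqrt(np.dot(a, a) * np.dot(v, v))
    i = int(np.argmax(c))
    return lags[i] / sr, float(c[i])
```

A second-strongest-peak check (as in the table's last column) is left out of the sketch; it would mask a small window around the winning lag and take the maximum of what remains.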
Output looks like:
file     offset (s)   norm corr   2nd peak
center     0.0000      1.0000      0.31     (reference)
left      -0.0826      0.55        0.34
misha     -0.0040      0.53        0.20
right     +0.0471      0.55        0.34
vova      +0.0304      0.46        0.30
Sign convention (scipy): with correlate(ref, src), lag<0 means the
matching content lies LATER in src's file than in ref's (src's recording
started earlier than ref's). The seek-correction formula is therefore
in_pos_for_src = T_timeline - scipy_lag
For example left -83 ms → at timeline T, pull left's frame from file
position T + 83 ms. right +47 ms → at timeline T, pull right from T − 47
ms. (Earlier docs in this file had the sign reversed — that was a bug.)
If the spread between earliest and latest is < ~10 ms, sync is fine. Anything > ~20 ms is audible flam if you mix the audio, and shows as multi-frame misalignment when cutting between angles. Bake the offsets into the MLT (see step 3).
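The seek formula and the spread check can be tried on the example lags from the table. This is only an illustration; in_pos_for_src is a hypothetical helper name, and the lag values are the ones printed above.

```python
def in_pos_for_src(t_timeline, scipy_lag):
    """File position (s) in a source that matches timeline time t_timeline."""
    return t_timeline - scipy_lag

# Lags in seconds, copied from the example output above.
lags = {"center": 0.0, "left": -0.0826, "misha": -0.0040,
        "right": 0.0471, "vova": 0.0304}

for name, lag in lags.items():
    print(f"{name:7s} T=10.000 s -> file position {in_pos_for_src(10.0, lag):.4f} s")

# Spread between the earliest and latest source, in milliseconds.
spread_ms = (max(lags.values()) - min(lags.values())) * 1000
print(f"spread = {spread_ms:.1f} ms")
```

With these numbers the spread is about 130 ms, far above the ~20 ms threshold, so the offsets have to be baked in.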
Watch for off-camera people that the closeup will reveal. This venue had a sound engineer standing behind the conductor for most of the song (visible after ~0:20). A tight push-in or pan-reveal of the conductor on the left cam catches him in frame. Sample the planned tight crops at multiple timestamps before locking them in:
# sample one tight-crop frame per timestamp, stitch into a grid
mkdir -p /tmp/g
for t in 10 30 60 90 120 150; do
  ffmpeg -y -loglevel error -ss "$t" -i left-songN.mp4 -frames:v 1 \
    -vf "crop=1440:810:2400:600,scale=480:270,\
drawtext=text='t=${t}s':fontcolor=yellow:fontsize=22:x=10:y=10:\
box=1:boxcolor=black@0.6" "/tmp/g/t${t}.png"
done
ffmpeg -y -loglevel error -i /tmp/g/t10.png -i /tmp/g/t30.png \
  -i /tmp/g/t60.png -i /tmp/g/t90.png -i /tmp/g/t120.png -i /tmp/g/t150.png \
  -filter_complex "[0][1][2]hstack=3[r1];[3][4][5]hstack=3[r2];\
[r1][r2]vstack=2" /tmp/g/grid.jpg

When a closeup reveals an unwanted person, swap the shot for: (a) a different timestamp where they're absent, (b) a wider crop that reads as context rather than closeup, or (c) the same beat on a different camera angle.
Decisions to lock in before generating XML:
- Output: 1920×1080, frame rate = reference camera's fps (use 30000/1001 if the reference is 29.97; MLT will conform any 30.000-fps sources).
- Audio: from the reference camera only. All other audio is muted.
- Cut cadence: 30-35 cuts over a ~3-minute song works well (avg 5 s, varied 3-9 s). Open and close on the wide reference; never repeat the same angle twice in a row.
- Ken Burns moves: only on 4K sources (you have 2× linear headroom over the 1080p output, so any 1920×1080 crop is 1:1 with no upscale). 2-3 moves per song is plenty; more feels gimmicky.
- Punch-ins on non-center 4K: occasional 1.2-1.5× crops are fine.
- 1080p sources: full frame only — any crop is an upscale.
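A quick way to check a planned cut list against the cadence rules above. This is a sketch: cadence_report is a hypothetical helper, the rows use the (t_start_s, t_end_s, source_key, rect_key) layout of the SHOTS list in make_mlt.py, and the rect keys in the example are made up.

```python
def cadence_report(shots, reference="center"):
    """Summarize a cut list: shot lengths, same-angle repeats, open/close angle."""
    lengths = [end - start for start, end, _src, _rect in shots]
    sources = [src for _start, _end, src, _rect in shots]
    repeats = [i for i in range(1, len(sources)) if sources[i] == sources[i - 1]]
    return {
        "shots": len(shots),
        "avg_len_s": sum(lengths) / len(lengths),
        "min_len_s": min(lengths),
        "max_len_s": max(lengths),
        "repeat_indices": repeats,            # shots that reuse the previous angle
        "opens_on_ref": sources[0] == reference,
        "closes_on_ref": sources[-1] == reference,
    }

# Toy example with made-up rect keys.
shots = [
    (0.0, 6.0, "center", "full"),
    (6.0, 10.0, "left", "full"),
    (10.0, 15.0, "left", "tight"),   # same angle twice in a row
    (15.0, 24.0, "center", "full"),
]
print(cadence_report(shots))
```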
Sample 12-frame grids per source to inform which closeups go where:
mkdir -p /tmp/sync/previews
for f in *.mp4; do
  ffmpeg -y -i "$f" -vf "fps=1/13.5,scale=480:-1,tile=4x3" \
    -frames:v 1 "/tmp/sync/previews/${f%.mp4}_grid.jpg"
done

Note who's prominent in each angle (conductor, soloists, sections): that drives which angle to cut to at which point in the song. If a particular camera is a "close on the conductor", reserve it for moments when the conductor's expression matters (verse endings, climactic phrases).
Eyeballing the choir position from a thumbnail is unreliable: the choir often occupies only a small slice of a wide-angle 4K frame, and the default mental "middle" estimate is wrong by several hundred source pixels in practice. Crops derived from wrong bounds either show empty walls or cut singers off. Confirm bounds with the user before generating any RECTS. This is the single highest-leverage step.
Procedure for each 4K source:
- Extract one mid-song frame (e.g. t=43) and scale to 1920w for review:

      mkdir -p /tmp/marks_<song>
      for f in *.mp4; do
        ffmpeg -y -loglevel error -ss 43 -i "$f" -frames:v 1 \
          "/tmp/marks_<song>/${f%.mp4}.png"
        ffmpeg -y -loglevel error -i "/tmp/marks_<song>/${f%.mp4}.png" \
          -vf "scale=1920:-1,drawtext=text='${f%.mp4}':fontsize=50:fontcolor=yellow:box=1:boxcolor=black:x=10:y=10" \
          "/tmp/marks_<song>/${f%.mp4}_review.jpg"
      done

- Make a best guess for the choir (green) and conductor (red) bounding boxes in source coords and draw them with ffmpeg drawbox:

      ffmpeg -y -i source.png -vf \
        "drawbox=x=CX1:y=CY1:w=CW:h=CH:color=lime@0.9:t=6,drawbox=x=KX1:y=KY1:w=KW:h=KH:color=red@0.9:t=4,scale=1920:-1" \
        source_marked.jpg

- Show the marked images to the user and ask them to confirm or supply corrected coordinates. The user supplies pixel coords in the 1920w review image; multiply by 2 to recover 4K source coords. Common refinements:
  - Widen the choir box: members at the edges are easy to miss.
  - Specify a separate y_center for tight zooms. The choir's vertical midpoint includes legs; faces sit higher (e.g. y_center ≈ choir_top + 30% of choir height, not 50%). The user often picks this directly.
- Record the confirmed bounds in make_mlt.py as a comment near the RECTS block (see the existing template). All subsequent crops derive from these bounds:
  - Wide (sw=1920, sh=1080): vertically centered so the choir top sits in the upper third of the output.
  - Tight (sw=960, sh=540 or sw=1280, sh=720): horizontally centered on (choir_x_min + choir_x_max) / 2, vertically centered on the user-supplied face-level y.
  - Pan endpoints: horizontally shift the tight rect within the choir's x-range.
Only after the user signs off on the marked images should you write any RECTS or SHOTS.
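Once bounds are confirmed, the crop arithmetic is mechanical. The sketch below shows one way to derive a tight rect and pan endpoints from them. The helper names, the clamping to the frame, and the 3840×2160 default are assumptions for illustration, not what make_mlt.py does; the example coordinates are made up.

```python
def review_to_source(value, factor=2):
    """Convert a coordinate from the 1920w review image to 4K source pixels."""
    return value * factor

def tight_rect(choir_x_min, choir_x_max, face_y, sw, sh, src_w=3840, src_h=2160):
    """Rect (sx, sy, sw, sh) centered on the choir's x-midpoint and on face_y.

    Clamped so the rect never leaves the source frame (an added safety
    assumption, not something the text above prescribes).
    """
    cx = (choir_x_min + choir_x_max) / 2
    sx = min(max(cx - sw / 2, 0), src_w - sw)
    sy = min(max(face_y - sh / 2, 0), src_h - sh)
    return (round(sx), round(sy), sw, sh)

def pan_endpoints(choir_x_min, choir_x_max, sw, src_w=3840):
    """Leftmost and rightmost sx that keep a rect of width sw inside the choir's x-range."""
    left = min(max(choir_x_min, 0), src_w - sw)
    right = min(max(choir_x_max - sw, left), src_w - sw)
    return left, right

# User-confirmed bounds in review-image pixels (made-up numbers).
x_min, x_max = review_to_source(500), review_to_source(1400)
face_y = review_to_source(450)
print(tight_rect(x_min, x_max, face_y, 1280, 720))
print(pan_endpoints(x_min, x_max, 1280))
```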
Different cameras have different white-balance + exposure. The cheapest correction that works well: per-channel multiplicative gain in sRGB space (no linearization, no LUT, ~20 lines of numpy).
# 1. Sample one frame per camera at the same moment (e.g. t=30s).
mkdir -p /tmp/cc_<song> && cd <song-dir>
for f in *.mp4; do
ffmpeg -y -ss 30 -i "$f" -frames:v 1 "/tmp/cc_<song>/${f%.mp4}.png"
done
# 2. Pick a reference camera (NOT necessarily the audio reference) — pick the
# one whose look you want the final to have. Default in the script is
# 'misha' (warm-ish front cam) — override by passing a basename as 2nd arg.
# 3. Run color_match.py — it computes mean R/G/B on mid-tones per camera,
# derives gain_C = ref_mean_C / cam_mean_C, normalizes by the geometric
# mean so each cam's brightness is preserved (WB-only correction), and
# writes comparison.jpg (visual preview) and gains.py into the frames dir.
python3 ../color_match.py /tmp/cc_<song>

Two correction modes; pick based on the visual preview:
- Full match: gain = ref_mean / cam_mean. Matches WB and brightness. Tends to dim cams that were brighter than ref.
- WB-only (default): same gain triple, then divided by its geometric mean. Preserves each cam's brightness, only rebalances R/G/B ratios.
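The difference between the two modes is easiest to see on numbers. This sketch covers only the gain arithmetic described above (not the mid-tone sampling); channel_gains is a hypothetical name and the mean values are made up.

```python
import math

def channel_gains(ref_mean, cam_mean, wb_only=True):
    """Per-channel (R, G, B) gains that move cam_mean toward ref_mean.

    wb_only=True divides the triple by its geometric mean, so overall
    brightness is left alone and only the R/G/B ratios change.
    """
    gains = [r / c for r, c in zip(ref_mean, cam_mean)]
    if wb_only:
        geo = math.prod(gains) ** (1 / 3)
        gains = [g / geo for g in gains]
    return tuple(gains)

ref = (120.0, 110.0, 100.0)   # made-up mid-tone means for the reference cam
cam = (100.0, 110.0, 125.0)   # made-up means for a cooler-looking cam

print("full match:", channel_gains(ref, cam, wb_only=False))
print("WB-only:   ", channel_gains(ref, cam, wb_only=True))
```

In WB-only mode the product of the three gains is 1 by construction, which is what "brightness preserved" means here.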
Bake the gains into SOURCES[cam]["gain"] = (gR, gG, gB) in make_mlt.py.
The generator attaches an avfilter.colorchannelmixer filter to each video
producer (av.rr=gR av.gg=gG av.bb=gB). Producer-level filters apply to
every entry from that producer, so a single filter handles all shots from
that camera.
Use make_mlt.py (next to this file) as a template. Per-song edits:
- SRC_DIR: path to the song directory.
- SOURCES dict: file names, source dimensions, and the measured sync lag for each angle (from step 1, with scipy's sign convention).
- RECTS dict: crop rectangles on the source frame. Always keep 16:9 aspect (sw / sh == 16/9) or aspect distortion will be visible. Pre-set entries like C_pushin, C_panL, C_choir parameterize where on the wide source the choir sits; re-tune these per venue.
- SHOTS list: (t_start_s, t_end_s, source_key, rect_key) rows covering the full song without gaps. The last shot's t_end is the total song duration.
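Two of these constraints (16:9 rects, gap-free shots) are cheap to check before generating XML. A sketch under assumptions: check_rects and check_shots are hypothetical helpers, and RECTS values are assumed to be (sx, sy, sw, sh) tuples, which may not match the actual layout in make_mlt.py.

```python
def check_rects(rects, tol=1e-3):
    """Return the names of rects whose sw / sh is not 16:9."""
    return [name for name, (_sx, _sy, sw, sh) in rects.items()
            if abs(sw / sh - 16 / 9) > tol]

def check_shots(shots, eps=1e-6):
    """Return a list of problems: non-zero start, or gaps/overlaps between rows."""
    problems = []
    if shots and abs(shots[0][0]) > eps:
        problems.append("first shot does not start at 0")
    for prev, cur in zip(shots, shots[1:]):
        if abs(cur[0] - prev[1]) > eps:
            problems.append(f"gap or overlap between {prev[1]} and {cur[0]}")
    return problems

# Made-up entries for illustration.
rects = {"full": (0, 0, 3840, 2160), "bad": (0, 0, 1440, 800)}
shots = [(0.0, 5.0, "center", "full"), (5.5, 9.0, "left", "full")]
print(check_rects(rects))
print(check_shots(shots))
```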
Key correctness rules baked into the generator:
- Sync trim: for each shot at timeline T on source X, the producer's in/out is set to T - lag_X (scipy-convention lag; see Step 1). This is the only place sync is corrected.
- Audio-only producer: a separate <producer> with video_index="-1" carries the reference camera's audio for the full song length.
- Video-only producers: each angle has audio_index="-1" so their audio isn't mixed into the output.
- Filter for crops/Ken Burns: mlt_service="qtblend" with a rect property. The rect format is "X Y W H A", where (X, Y, W, H) places the scaled producer in the 1920×1080 output canvas. To show source rect (sx, sy, sw, sh) as the full output:

      scale = 1920 / sw   # aspect must match output (16:9)
      X = -sx * scale
      Y = -sy * scale
      W = src_w * scale
      H = src_h * scale

  For a Ken Burns move, use two keyframes: the rect property becomes "00:00:00.000=<start_rect>;<duration>=<end_rect>". The keyframe time is filter-local (starts at 0), not producer time; this was the main pitfall during development.
- Use mlt_service="avformat", NOT avformat-novalidate. The novalidate variant doesn't probe the file, so MLT can't compute clip lengths and the project ends up reporting itself as ~1 second long. With five-ish 3-minute sources the validation cost is negligible.
- MLT <entry in="..." out="..."> is an INCLUSIVE frame range, so an entry plays out - in + 1 frames, not out - in. If you write entries with time strings (e.g. out="00:00:08.000"), MLT rounds to the nearest frame and then adds one frame on top, so you silently accumulate one extra frame per playlist entry. Over 35 cuts at 29.97 fps that's ~1.2 s of video without matching audio, manifesting as a growing video-behind-audio delay through the song. Fix: use integer frame indices and set out = in + length - 1. See make_mlt.py for the pattern. The same fix applies to the audio playlist's single entry.
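The rect arithmetic, the two-keyframe string, and the inclusive frame range from the rules above fit in a few lines. This is a sketch, not the code in make_mlt.py: the helper names are hypothetical, the fifth rect field A is passed through with a default of 1 as an assumption, and values are rounded to whole pixels for readability.

```python
def qtblend_rect(sx, sy, sw, sh, src_w, src_h, out_w=1920, a=1):
    """Rect string "X Y W H A" that shows source rect (sx, sy, sw, sh) full-frame."""
    scale = out_w / sw                      # crop must already be 16:9
    x, y = -sx * scale, -sy * scale
    w, h = src_w * scale, src_h * scale
    return f"{round(x)} {round(y)} {round(w)} {round(h)} {a}"

def timecode(seconds):
    """Format seconds as HH:MM:SS.mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def ken_burns(start_rect, end_rect, duration_s):
    """Two-keyframe rect property; times are filter-local and start at 0."""
    return f"00:00:00.000={start_rect};{timecode(duration_s)}={end_rect}"

def entry_range(in_frame, length_frames):
    """Inclusive (in, out) pair that plays exactly length_frames frames."""
    return in_frame, in_frame + length_frames - 1

wide = qtblend_rect(960, 540, 1920, 1080, 3840, 2160)
tight = qtblend_rect(1260, 540, 1280, 720, 3840, 2160)
print(ken_burns(wide, tight, 8))
print(entry_range(659, 240))
```

A 240-frame shot starting at frame 659 ends at out=898, not 899.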
Last 3 s of the song look much more polished with a fade. The generator applies two filters tied to the final shot:
- Video: mlt_service="brightness" on the last entry, with level keyframed 1 → 1 → 0 (hold full brightness for length − FADE_S seconds, then ramp to 0). Uses the same time-string keyframes as qtblend; pairs cleanly with the qtblend pan running concurrently.
- Audio: mlt_service="avfilter.afade" on the audio entry, with av.type=out, av.start_time=<total − FADE_S>, av.duration=FADE_S. Do NOT use MLT's native volume filter with time-string keyframes: in practice that silences the entire track. avfilter.afade (ffmpeg's afade under MLT's avfilter wrapper) is the reliable path.
Both filter in/out must match their host entry's frame range (same rule
as qtblend pans).
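The fade parameters follow directly from the song length and the last shot's length. A sketch with hypothetical helper names; the exact keyframe string written by make_mlt.py may differ, and the durations in the example are made up.

```python
def timecode(seconds):
    """Format seconds as HH:MM:SS.mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def brightness_level(last_shot_len_s, fade_s=3.0):
    """Level keyframes 1 -> 1 -> 0, in filter-local time on the last shot."""
    hold = last_shot_len_s - fade_s
    return f"00:00:00.000=1;{timecode(hold)}=1;{timecode(last_shot_len_s)}=0"

def afade_properties(total_s, fade_s=3.0):
    """Property values for avfilter.afade on the audio entry."""
    return {"av.type": "out", "av.start_time": total_s - fade_s, "av.duration": fade_s}

print(brightness_level(8.0))
print(afade_properties(163.0))
```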
mlt-melt project.mlt -consumer avformat:concert_edit.mp4 \
  vcodec=libx264 crf=18 preset=medium acodec=aac ab=192k threads=$(nproc)

CRF 18 / preset medium on a 20-core box renders 163 s of 1080p at ~75 fps, i.e. ~4 minutes for a ~3-minute song. Output is ~12 Mbps. Bump to preset=slow if you want smaller files at the same quality.
Don't bother with a low-res "preview" pass. Tried 854×480 — same wallclock time as 1920×1080. The bottleneck is decoding the 4K source files, not encoding the output. The only way to make iteration faster is to pre-downscale the sources themselves.
- qtblend keyframe time scope (the gotcha that bit us twice): the <property name="rect"> keyframe times are filter-local time strings, BUT the filter only animates if the <filter> element also declares in="..." and out="..." matching the entry's producer-frame range. Without those attributes the filter holds its end value (silently static). Working pattern:

      <entry producer="v_center" in="659" out="898">
        <filter mlt_service="qtblend" in="659" out="898">
          <property name="rect">00:00:00.000=...;00:00:08.000=...</property>
          ...

  Combinations that DON'T animate: time-strings without filter in/out; percentage keyframes (0%=...;100%=...); frame-number keyframes with or without filter in/out. Verified empirically; see /tmp/pan_test.
- Aspect distortion on crops: keep every crop rect at 16:9. If you really need a non-16:9 crop, set distort="1" on qtblend, but the result will stretch.
- MLT's volume filter silences everything when you try to use time-string keyframes on its gain property. The property is dB-scaled and parser-fragile. For end-of-song fade-out use avfilter.afade (see Step 3.5): that wraps ffmpeg's afade and Just Works.
- End-of-song edge case: with the T - lag correction, sources with negative scipy lag (file content lies later than ref) need T_end - lag_X to be ≤ the source's file duration. Symmetrically, sources with positive lag need T_start - lag_X ≥ 0 at the song open. The generator clamps in_pos to 0 at the start; for the end, either close on the reference camera or shave a few hundred ms off the final cut.
- 30.000 vs 29.97 fps mix: harmless, since MLT conforms to the profile's frame rate. The temporal alias is far below the sync threshold.
- Validate the parsed project before rendering: mlt-melt project.mlt -consumer xml:check.xml round-trips through MLT's loader and shows what it actually saw. If <tractor> reports a short duration, the producers weren't validated (see avformat-novalidate gotcha above).
- Don't trust grid/strip eyeballing for choir bounds: reading pixel coordinates off a scaled-down grid overlay is consistently off by several hundred source pixels (we got the choir x-extent wrong by ~300px multiple times on the same frame, and the y-extent by ~240px). Always confirm bounds with the user against marker overlays at full resolution (Step 2.6). Marker-overlay tests beat any amount of mental coordinate conversion.
- Tight-zoom y must be face level, NOT the choir bbox midpoint: the vertical midpoint of a standing-choir bbox falls on torsos and legs. Always ask the user for an explicit y-center for any crop with sh ≤ 720. For wider crops (sh=1080+) the bbox midpoint is fine, but align the top of the crop with the top of the choir's heads to avoid empty ceiling/wall above the singers.
- Pans on side cameras can pan into empty wall: when the conductor sits at the very edge of the useful image (e.g. right cam where the conductor is at source x=880-1060 with backstage at x<880), a "pan to conductor" or "push to conductor" inevitably reveals the backstage wall on the side of the conductor. Two safe alternatives:
  - Replace the conductor-targeted move with a push to the choir center instead (the choir is the safer target, since it's far from any edge).
  - Use a different camera for the "reveal conductor" beat (the front camera's wide shot, or the opposite side cam).

  Rule of thumb: any tight crop whose target is within ~5% of the source frame edge is at risk; verify by rendering the end-frame alone before committing.
- Crop tops align with content tops, not the bbox center: when a wide crop (sh=1080 over a choir of only ~600px source height) has 400+px of vertical slack, the default "center vertically" leaves half the slack above the heads as ceiling/wall. The user reads this as "wall above singers." Default to sy = choir_top_y (or even slightly above) and let the floor take the remaining slack at the bottom: empty floor reads as context, empty ceiling reads as a framing error.
- Trim the song length to your last clean cut: if the recording extends past the music, don't render to the literal file duration. Pick a total_dur ≈ 2-3 s before the last sound and let the fade-out cover it. This also avoids the negative-lag tail-end clipping issue.
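The end-of-song edge case from the list above can be caught before rendering by checking every shot's seek range against its source file. A sketch: seek_problems is a hypothetical helper, and the "lag"/"duration" keys are an assumed layout, not necessarily the one SOURCES uses in make_mlt.py. The lags are the example values from the sync table; the durations are made up.

```python
def seek_problems(shots, sources):
    """List shots whose lag-corrected in/out falls outside the source file."""
    problems = []
    for t_start, t_end, src, _rect in shots:
        lag = sources[src]["lag"]
        dur = sources[src]["duration"]
        if t_start - lag < 0:
            problems.append(f"{src}: in_pos {t_start - lag:.4f} s is before file start")
        if t_end - lag > dur:
            problems.append(f"{src}: out_pos {t_end - lag:.4f} s is past file end ({dur} s)")
    return problems

sources = {
    "center": {"lag": 0.0, "duration": 165.0},
    "left": {"lag": -0.0826, "duration": 163.0},
    "right": {"lag": 0.0471, "duration": 170.0},
}
shots = [
    (0.0, 5.0, "right", "full"),     # positive lag at the song open
    (5.0, 163.0, "left", "full"),    # negative lag at the song end
]
for p in seek_problems(shots, sources):
    print(p)
```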
In the song directory:
- make_mlt.py: generator template (the editable bit; copy + tweak per song)
- project.mlt: last generated project
- concert_edit.mp4: last rendered output
One directory up (2025-05-03-choir/), shared across songs:
- HOWTO-multicam-edit.md: this file
- xcorr2.py: envelope-based sync analyzer (takes a WAV dir as arg)
- sync_view.sh: live side-by-side sync viewer (mpv lavfi-complex; center audio in left ear, the other angle's audio in right ear, so drift becomes audible as flam)
- color_match.py: color-gain computer (takes a frames dir as arg; writes <dir>/comparison.jpg and <dir>/gains.py to paste into make_mlt.py). Reference defaults to misha.