How to make a multi-cam concert edit (MLT-based pipeline)

This is the recipe used for Crazy_Little_Thing_Called_Love/concert_edit.mp4. Reuse it for the next song. The whole pipeline is just two Python scripts + mlt-melt; no NLE GUI needed.

Input assumption

A directory per song containing N (3-6) *.mp4 files, each a recording of the same song from a different angle. Naming convention used so far: <angle>-songN.mp4 (e.g. center-song5.mp4, left-song5.mp4, misha-song5.mp4 — first names are camera operators for handheld). Files are roughly trimmed to start/end at the song boundaries.

Different sources will typically have different resolutions (some 4K, some 1080p) and slightly different frame rates (29.97 vs 30.000). Output is 1920×1080 at the frame rate of whichever source provides the audio.

Step 1 — Check sync (always do this first)

Even if files appear pre-trimmed to matching durations, they're almost never in tight sync. Cross-correlate audio envelopes to find per-source offsets relative to a chosen reference (usually the center/wide camera, which also provides the final audio).

# 1. Extract mono 8 kHz WAVs for fast correlation
mkdir -p /tmp/sync_<song> && cd <song-dir>
for f in *.mp4; do
  ffmpeg -y -i "$f" -ac 1 -ar 8000 -vn -f wav "/tmp/sync_<song>/${f%.mp4}.wav"
done

# 2. Run envelope-based xcorr (script lives one directory up, next to HOWTO).
#    The script auto-detects WAVs in the given directory and uses any file
#    whose name starts with 'center' as the reference (override by passing
#    its basename as a second arg).
python3 ../xcorr2.py /tmp/sync_<song>

Use the envelope (20 ms RMS), not raw samples — raw cross-correlation gives weak peaks (~0.05) because different mics in different positions hear very different acoustics. Envelope correlation lifts the peak to 0.4-0.6 and clearly beats the second-strongest peak.
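
The envelope approach can be sketched roughly as follows. This is not the contents of xcorr2.py, only a minimal numpy illustration of the idea; the function names and the 1 ms hop are assumptions, while the 20 ms RMS window is taken from the text above. np.correlate is a direct (non-FFT) correlation, so for full-length songs an FFT-based correlate would likely be the faster choice.

```python
import numpy as np

def rms_envelope(x, sr, win_s=0.020, hop_s=0.001):
    """Sliding RMS over win_s-second windows, one value every hop_s seconds."""
    win, hop = int(sr * win_s), int(sr * hop_s)
    # A cumulative sum of squares makes every window mean a cheap lookup.
    csum = np.concatenate(([0.0], np.cumsum(np.asarray(x, dtype=np.float64) ** 2)))
    starts = np.arange(0, len(x) - win + 1, hop)
    return np.sqrt((csum[starts + win] - csum[starts]) / win)

def envelope_lag(ref, src, sr, hop_s=0.001):
    """Return (lag_seconds, normalized_peak) using the correlate(ref, src) convention."""
    e_ref = rms_envelope(ref, sr, hop_s=hop_s)
    e_src = rms_envelope(src, sr, hop_s=hop_s)
    e_ref = e_ref - e_ref.mean()
    e_src = e_src - e_src.mean()
    cc = np.correlate(e_ref, e_src, mode="full")
    # Index 0 of the 'full' output corresponds to lag -(len(e_src) - 1).
    lags = np.arange(-(len(e_src) - 1), len(e_ref))
    k = int(np.argmax(cc))
    norm = np.linalg.norm(e_ref) * np.linalg.norm(e_src)
    return lags[k] * hop_s, cc[k] / norm

# Synthetic check: noise with a changing loudness, and a copy delayed by 10 ms.
rng = np.random.default_rng(0)
sr = 8000
amp = np.repeat(rng.uniform(0.1, 1.0, 40), 800)      # loudness changes every 100 ms
ref = amp * rng.standard_normal(len(amp))
src = np.concatenate((np.zeros(80), ref[:-80]))      # content sits 10 ms later in src
lag_s, peak = envelope_lag(ref, src, sr)
print(f"lag = {lag_s:+.4f} s, normalized peak = {peak:.2f}")
```

The delayed copy comes out with a negative lag, which matches the sign convention described below.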

Output looks like:

file                offset (s)   norm corr   2nd peak
center                 0.0000      1.0000     0.31     (reference)
left                  -0.0826      0.55       0.34
misha                 -0.0040      0.53       0.20
right                 +0.0471      0.55       0.34
vova                  +0.0304      0.46       0.30

Sign convention (scipy): with correlate(ref, src), lag<0 means the matching content lies LATER in src's file than in ref's (src's recording started earlier than ref's). The seek-correction formula is therefore

in_pos_for_src = T_timeline - scipy_lag

For example left -83 ms → at timeline T, pull left's frame from file position T + 83 ms. right +47 ms → at timeline T, pull right from T − 47 ms. (Earlier docs in this file had the sign reversed — that was a bug.)
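
As a small worked example of the seek-correction formula, the sketch below applies it to a few of the lags from the sample output. The function name mirrors the formula above; the choice of T = 60 s is arbitrary.

```python
def in_pos_for_src(t_timeline, scipy_lag):
    """File position (seconds) to read from a source at timeline time T."""
    return t_timeline - scipy_lag

# Lags in seconds, scipy sign convention, from the sample output above.
lags = {"center": 0.0, "left": -0.0826, "right": 0.0471}

for name, lag in lags.items():
    print(f"{name:7s} T=60.000 s -> file position {in_pos_for_src(60.0, lag):.4f} s")
```

A negative lag pushes the read position later in the file and a positive lag pulls it earlier, as in the left/right example above.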

If the spread between earliest and latest is < ~10 ms, sync is fine. Anything > ~20 ms is audible flam if you mix the audio, and shows as multi-frame misalignment when cutting between angles. Bake the offsets into the MLT (see step 3).
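
The spread check is simple arithmetic; a sketch using the sample offsets from the table above follows. The helper names and the "borderline" label for the 10-20 ms range are assumptions, since the text only states the two outer thresholds.

```python
def sync_spread_ms(lags_s):
    """Spread between the earliest and latest source, in milliseconds."""
    values = list(lags_s.values())
    return (max(values) - min(values)) * 1000.0

def sync_verdict(spread_ms):
    if spread_ms < 10.0:
        return "fine"
    if spread_ms > 20.0:
        return "correct the offsets in the MLT"
    return "borderline"  # assumed label for the range the text leaves open

lags = {"center": 0.0, "left": -0.0826, "misha": -0.0040,
        "right": 0.0471, "vova": 0.0304}
spread = sync_spread_ms(lags)
print(f"spread = {spread:.1f} ms: {sync_verdict(spread)}")
```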

Step 2 — Decide structure, sample preview frames

Watch for off-camera people that the closeup will reveal. This venue had a sound engineer standing behind the conductor for most of the song (visible after ~0:20). A tight push-in or pan-reveal of the conductor on the left cam catches him in frame. Sample the planned tight crops at multiple timestamps before locking them in:

# sample one tight-crop frame per timestamp, stitch into a grid
mkdir -p /tmp/g
for t in 10 30 60 90 120 150; do
  ffmpeg -y -loglevel error -ss $t -i left-songN.mp4 -frames:v 1 \
    -vf "crop=1440:810:2400:600,scale=480:270,\
drawtext=text='t=${t}s':fontcolor=yellow:fontsize=22:x=10:y=10:\
box=1:boxcolor=black@0.6" /tmp/g/t${t}.png
done
ffmpeg -y -loglevel error -i /tmp/g/t10.png -i /tmp/g/t30.png \
  -i /tmp/g/t60.png -i /tmp/g/t90.png -i /tmp/g/t120.png -i /tmp/g/t150.png \
  -filter_complex "[0][1][2]hstack=3[r1];[3][4][5]hstack=3[r2];\
[r1][r2]vstack=2" /tmp/g/grid.jpg

When a closeup reveals an unwanted person, swap the shot for: (a) a different timestamp where they're absent, (b) a wider crop that just looks like context not closeup, or (c) the same beat on a different camera angle.

Decisions to lock in before generating XML:

Output: 1920×1080, frame rate = reference camera's fps (use 30000/1001 if the reference is 29.97; MLT will conform any 30.000-fps sources).
Audio: from the reference camera only. All other audio is muted.
Cut cadence: 30-35 cuts over a ~3-minute song works well (avg 5 s, varied 3-9 s). Open and close on the wide reference; never repeat the same angle twice in a row.
Ken Burns moves: only on 4K sources (you have 2× linear headroom over the 1080p output, so any 1920×1080 crop is 1:1 with no upscale). 2-3 moves per song is plenty; more feels gimmicky.
Punch-ins on non-center 4K: occasional 1.2-1.5× crops are fine.
1080p sources: full frame only — any crop is an upscale.
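
The cadence rules above are easy to check mechanically once a shot list exists. The sketch below assumes the (t_start_s, t_end_s, source_key, rect_key) row format described in Step 3; the function name, the 'center' reference key, and the rect keys in the example are hypothetical.

```python
def cadence_problems(shots, reference="center", min_s=3.0, max_s=9.0):
    """Return human-readable violations of the cut-cadence rules."""
    problems = []
    if shots[0][2] != reference:
        problems.append("first shot is not on the reference camera")
    if shots[-1][2] != reference:
        problems.append("last shot is not on the reference camera")
    for i, (t0, t1, src, _rect) in enumerate(shots):
        dur = t1 - t0
        if not (min_s <= dur <= max_s):
            problems.append(f"shot {i} ({src}) lasts {dur:.1f} s")
        if i > 0 and shots[i - 1][2] == src:
            problems.append(f"shot {i} repeats angle {src}")
    return problems

# Hypothetical four-shot excerpt with one deliberately bad cut.
shots = [
    (0.0, 6.0, "center", "full"),
    (6.0, 10.0, "left", "full"),
    (10.0, 12.0, "left", "tight"),
    (12.0, 18.0, "center", "full"),
]
for problem in cadence_problems(shots):
    print(problem)
```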

Sample 12-frame grids per source to inform which closeups go where:

mkdir -p /tmp/sync/previews
for f in *.mp4; do
  ffmpeg -y -i "$f" -vf "fps=1/13.5,scale=480:-1,tile=4x3" \
         -frames:v 1 "/tmp/sync/previews/${f%.mp4}_grid.jpg"
done

Note who's prominent in each angle (conductor, soloists, sections) — that drives which angle to cut to at which point in the song. If a particular camera is a "close on the conductor", reserve it for moments when the conductor's expression matters (verse endings, climactic phrases).

Step 2.6 — Locate choir + conductor bounds on each source (REQUIRED)

Eyeballing the choir position from a thumbnail is unreliable: the choir often occupies only a small slice of a wide-angle 4K frame, and the default mental "middle" estimate is wrong by several hundred source pixels in practice. Crops derived from wrong bounds either show empty walls or cut singers off. Confirm bounds with the user before generating any RECTS. This is the single highest-leverage step.

Procedure for each 4K source:

Extract one mid-song frame (e.g. t=43) and scale to 1920w for review:

mkdir -p /tmp/marks_<song>
for f in *.mp4; do
  ffmpeg -y -loglevel error -ss 43 -i "$f" -frames:v 1 \
    "/tmp/marks_<song>/${f%.mp4}.png"
  ffmpeg -y -loglevel error -i "/tmp/marks_<song>/${f%.mp4}.png" \
    -vf "scale=1920:-1,drawtext=text='${f%.mp4}':fontsize=50:\
fontcolor=yellow:box=1:boxcolor=black:x=10:y=10" \
    "/tmp/marks_<song>/${f%.mp4}_review.jpg"
done

Make a best guess for the choir (green) and conductor (red) bounding boxes in source coords and draw them with ffmpeg drawbox:

ffmpeg -y -i source.png -vf \
  "drawbox=x=CX1:y=CY1:w=CW:h=CH:color=lime@0.9:t=6,\
   drawbox=x=KX1:y=KY1:w=KW:h=KH:color=red@0.9:t=4,\
   scale=1920:-1" source_marked.jpg

Show the marked images to the user and ask them to confirm or supply corrected coordinates. The user supplies pixel coords in the 1920w review image; multiply by 2 to recover 4K source coords. Common refinements:
- Widen the choir box: members at the edges are easy to miss.
- Specify a separate y_center for tight zooms. The choir's vertical midpoint includes legs; faces sit higher (e.g. y_center ≈ choir_top + 30% of choir height, not 50%). The user often picks this directly.
Record the confirmed bounds in make_mlt.py as a comment near the RECTS block (see the existing template). All subsequent crops derive from these bounds:
- Wide (sw=1920, sh=1080): vertically centered so the choir top sits in the upper third of the output.
- Tight (sw=960, sh=540 or sw=1280, sh=720): horizontally centered on (choir_x_min + choir_x_max) / 2, vertically centered on the user-supplied face-level y.
- Pan endpoints: horizontally shift the tight rect within the choir's x-range.
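
A sketch of how confirmed bounds might turn into a tight crop: scale the user's review-image coordinates back to source pixels, then center a 16:9 rect on the choir's x-midpoint and a face-level y. The function names and the (x_min, y_min, x_max, y_max) box layout are assumptions, not the structures used in make_mlt.py; the 30% face-level default comes from the example above and should be replaced by the user's value when one is given.

```python
def review_to_source(box, src_w, review_w=1920):
    """Scale a box from review-image pixels to source pixels (x2 for 3840-wide sources)."""
    k = src_w / review_w
    return tuple(round(v * k) for v in box)

def tight_rect(choir, src_w, src_h, sw=1280, sh=720, face_y=None):
    """16:9 crop centered on the choir's x-midpoint and a face-level y."""
    x_min, y_min, x_max, y_max = choir
    if face_y is None:
        # Fallback only: faces sit roughly 30% down from the top of the choir box.
        face_y = y_min + 0.30 * (y_max - y_min)
    cx = (x_min + x_max) / 2
    # Clamp so the crop never leaves the source frame.
    sx = min(max(round(cx - sw / 2), 0), src_w - sw)
    sy = min(max(round(face_y - sh / 2), 0), src_h - sh)
    return sx, sy, sw, sh

# Hypothetical choir box as confirmed on the 1920-wide review image.
choir_src = review_to_source((600, 300, 1300, 600), src_w=3840)
print("choir in source coords:", choir_src)
print("tight rect:", tight_rect(choir_src, 3840, 2160))
```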

Only after the user signs off on the marked images should you write any RECTS or SHOTS.

Step 2.5 — Color-match the cameras (optional but recommended)

Different cameras have different white-balance + exposure. The cheapest correction that works well: per-channel multiplicative gain in sRGB space (no linearization, no LUT, ~20 lines of numpy).

# 1. Sample one frame per camera at the same moment (e.g. t=30s).
mkdir -p /tmp/cc_<song> && cd <song-dir>
for f in *.mp4; do
  ffmpeg -y -ss 30 -i "$f" -frames:v 1 "/tmp/cc_<song>/${f%.mp4}.png"
done

# 2. Pick a reference camera (NOT necessarily the audio reference) — pick the
#    one whose look you want the final to have. Default in the script is
#    'misha' (warm-ish front cam) — override by passing a basename as 2nd arg.

# 3. Run color_match.py — it computes mean R/G/B on mid-tones per camera,
#    derives gain_C = ref_mean_C / cam_mean_C, normalizes by the geometric
#    mean so each cam's brightness is preserved (WB-only correction), and
#    writes comparison.jpg (visual preview) and gains.py into the frames dir.
python3 ../color_match.py /tmp/cc_<song>

Two correction modes — pick based on visual preview:

Full match: gain = ref_mean / cam_mean. Matches WB and brightness. Tends to dim cams that were brighter than ref.
WB-only (default): same gain triple, then divided by its geometric mean. Preserves each cam's brightness, only rebalances R/G/B ratios.
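
The two modes differ only in one normalization step, which the sketch below illustrates. It is not color_match.py itself: the function name is hypothetical, the mid-tone selection is omitted, and the per-channel means are made-up numbers.

```python
import math

def channel_gains(ref_mean, cam_mean, wb_only=True):
    """Per-channel multiplicative gains (gR, gG, gB) from mean R/G/B values."""
    gains = [r / c for r, c in zip(ref_mean, cam_mean)]
    if wb_only:
        # Dividing by the geometric mean keeps the product of the gains at 1,
        # so overall brightness stays put and only the channel ratios change.
        gm = math.prod(gains) ** (1.0 / 3.0)
        gains = [g / gm for g in gains]
    return tuple(gains)

# Made-up mid-tone means for a reference camera and a cooler-looking camera.
ref_mean = (120.0, 110.0, 100.0)
cam_mean = (100.0, 110.0, 125.0)

full = channel_gains(ref_mean, cam_mean, wb_only=False)
wb = channel_gains(ref_mean, cam_mean, wb_only=True)
print("full match:", tuple(round(g, 3) for g in full))
print("WB-only:   ", tuple(round(g, 3) for g in wb))
```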

Bake the gains into SOURCES[cam]["gain"] = (gR, gG, gB) in make_mlt.py. The generator attaches an avfilter.colorchannelmixer filter to each video producer (av.rr=gR av.gg=gG av.bb=gB). Producer-level filters apply to every entry from that producer, so a single filter handles all shots from that camera.

Step 3 — Generate the MLT XML

Use make_mlt.py (next to this file) as a template. Per-song edits:

SRC_DIR — path to the song directory.
SOURCES dict — file names, source dimensions, and the measured sync lag for each angle (from step 1, with scipy's sign convention).
RECTS dict — crop rectangles on the source frame. Always keep 16:9 aspect (sw / sh == 16/9) or aspect distortion will be visible. Pre-set entries like C_pushin, C_panL, C_choir parameterize where on the wide source the choir sits — re-tune these per venue.
SHOTS list — (t_start_s, t_end_s, source_key, rect_key) rows covering the full song without gaps. The last shot's t_end is the total song duration.
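
Before generating XML it can help to sanity-check the two constraints stated above: every rect is 16:9 and the shots tile the song without gaps. The sketch below assumes RECTS values are (sx, sy, sw, sh) tuples, which may not match the layout in make_mlt.py; the function name and the example data are hypothetical.

```python
def validate_edit(shots, rects, total_dur, tol=1e-6):
    """Return a list of problems with the RECTS / SHOTS tables (empty if none)."""
    errors = []
    for key, (_sx, _sy, sw, sh) in rects.items():
        if sw * 9 != sh * 16:
            errors.append(f"rect {key}: {sw}x{sh} is not 16:9")
    expected_start = 0.0
    for i, (t0, t1, _src, rect_key) in enumerate(shots):
        if abs(t0 - expected_start) > tol:
            errors.append(f"shot {i}: starts at {t0}, expected {expected_start}")
        if t1 <= t0:
            errors.append(f"shot {i}: non-positive duration")
        if rect_key not in rects:
            errors.append(f"shot {i}: unknown rect {rect_key}")
        expected_start = t1
    if abs(expected_start - total_dur) > tol:
        errors.append(f"last shot ends at {expected_start}, song is {total_dur}")
    return errors

# Hypothetical tables with one bad rect and one gap.
rects = {"full": (0, 0, 3840, 2160), "bad": (0, 0, 1000, 600)}
shots = [(0.0, 5.0, "center", "full"), (5.5, 9.0, "left", "bad")]
for err in validate_edit(shots, rects, total_dur=9.0):
    print(err)
```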

Key correctness rules baked into the generator:

Sync trim: for each shot at timeline T on source X, the producer's in/out is set to T - lag_X (scipy-convention lag — see Step 1). This is the only place sync is corrected.
Audio-only producer: a separate <producer> with video_index="-1" carries the reference camera's audio for the full song length.
Video-only producers: each angle has audio_index="-1" so their audio isn't mixed into the output.
Filter for crops/Ken Burns: mlt_service="qtblend" with a rect property. rect format is "X Y W H A" where (X, Y, W, H) places the scaled producer in the 1920×1080 output canvas. To show source rect (sx, sy, sw, sh) as the full output:
```
scale = 1920 / sw            # aspect must match output (16:9)
X = -sx * scale
Y = -sy * scale
W = src_w * scale
H = src_h * scale
```
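For a Ken Burns move, use two keyframes: the rect property becomes "00:00:00.000=<start_rect>;<duration>=<end_rect>". The keyframe time is filter-local (starts at 0), not producer time; this was the main pitfall during development. The <filter> element must also carry in/out attributes matching the entry's frame range, or the rect will not animate (see the qtblend item under Pitfalls).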
Use mlt_service="avformat", NOT avformat-novalidate. The novalidate variant doesn't probe the file, so MLT can't compute clip lengths and the project ends up reporting itself as ~1 second long. With five-ish 3-minute sources the validation cost is negligible.
MLT <entry in="..." out="..."> is an INCLUSIVE frame range, so an entry plays out - in + 1 frames, not out - in. If you write entries with time strings (e.g. out="00:00:08.000"), MLT rounds to the nearest frame and then adds one frame on top — you silently accumulate one extra frame per playlist entry. Over 35 cuts at 29.97 fps that's ~1.2 s of video without matching audio, manifesting as a growing video-behind-audio delay through the song. Fix: use integer frame indices and set out = in + length - 1. See make_mlt.py for the pattern. The same fix applies to the audio playlist's single entry.
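
The inclusive-range rule and the sync trim can be combined in a few lines. This is a sketch of the arithmetic only, not the code in make_mlt.py; the helper names are assumptions and the profile is taken to be 30000/1001 fps. Shot lengths are derived from rounded cumulative boundaries so that the per-entry frame counts add up to the rounded total with no drift.

```python
FPS_NUM, FPS_DEN = 30000, 1001   # 29.97 fps profile (assumed)

def to_frame(t_s):
    """Nearest frame index for a time in seconds."""
    return round(t_s * FPS_NUM / FPS_DEN)

def entry_frames(t_start, t_end, lag_s):
    """Integer (in, out) for an <entry>, sync-corrected, with out inclusive."""
    length = to_frame(t_end) - to_frame(t_start)
    in_f = max(0, to_frame(t_start - lag_s))   # clamp at the start of the file
    return in_f, in_f + length - 1             # plays exactly `length` frames

# Three hypothetical shots: (t_start_s, t_end_s, scipy lag of the source).
shots = [(0.0, 5.0, 0.0), (5.0, 13.0, -0.0826), (13.0, 20.0, 0.0471)]
entries = [entry_frames(*s) for s in shots]
for (t0, t1, _lag), (in_f, out_f) in zip(shots, entries):
    print(f"{t0:5.1f}-{t1:5.1f} s  in={in_f} out={out_f} frames={out_f - in_f + 1}")

total_frames = sum(out_f - in_f + 1 for in_f, out_f in entries)
print("total frames:", total_frames, "expected:", to_frame(20.0))
```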

Step 3.5 — Fade-to-black + audio fade-out at end

Last 3 s of the song look much more polished with a fade. The generator applies two filters tied to the final shot:

Video: mlt_service="brightness" on the last entry, with level keyframed 1 → 1 → 0 (hold full brightness for length − FADE_S seconds, then ramp to 0). Uses the same time-string keyframes as qtblend; pairs cleanly with the qtblend pan running concurrently.
Audio: mlt_service="avfilter.afade" on the audio entry, with av.type=out, av.start_time=<total − FADE_S>, av.duration=FADE_S. Do NOT use MLT's native volume filter with time-string keyframes — in practice that silences the entire track. avfilter.afade (ffmpeg's afade under MLT's avfilter wrapper) is the reliable path.

Both filter in/out must match their host entry's frame range (same rule as qtblend pans).
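
The fade parameters follow directly from the last shot's length and the total duration. The sketch below only builds the property values described above (the level keyframe string and the afade settings); the function names are hypothetical, and how make_mlt.py actually writes them into the XML is not shown.

```python
def timecode(seconds):
    """Format seconds as an HH:MM:SS.mmm time string."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def fade_properties(last_shot_len_s, total_s, fade_s=3.0):
    """Return (brightness level keyframes, afade properties) for the ending."""
    hold = max(0.0, last_shot_len_s - fade_s)
    # Filter-local times: hold full brightness, then ramp to black.
    level = f"{timecode(0)}=1;{timecode(hold)}=1;{timecode(last_shot_len_s)}=0"
    afade = {
        "av.type": "out",
        "av.start_time": total_s - fade_s,
        "av.duration": fade_s,
    }
    return level, afade

level, afade = fade_properties(last_shot_len_s=8.0, total_s=163.0)
print(level)
print(afade)
```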

Step 4 — Render

mlt-melt project.mlt -consumer avformat:concert_edit.mp4 \
    vcodec=libx264 crf=18 preset=medium acodec=aac ab=192k threads=$(nproc)

CRF 18 / preset medium on a 20-core box renders 163 s of 1080p at ~75 fps, i.e. ~4 minutes for a ~3-minute song. Output is ~12 Mbps. Bump to preset=slow if you want smaller files at the same quality.

Don't bother with a low-res "preview" pass. Tried 854×480 — same wallclock time as 1920×1080. The bottleneck is decoding the 4K source files, not encoding the output. The only way to make iteration faster is to pre-downscale the sources themselves.

Pitfalls / gotchas

qtblend keyframe time scope (the gotcha that bit us twice): the <property name="rect"> keyframe times are filter-local time strings, BUT the filter only animates if the <filter> element also declares in="..." and out="..." matching the entry's producer-frame range. Without those attributes the filter holds its end value (silently static). Working pattern:
```
<entry producer="v_center" in="659" out="898">
  <filter mlt_service="qtblend" in="659" out="898">
    <property name="rect">00:00:00.000=...;00:00:08.000=...</property>
    ...
```
Combinations that DON'T animate: time-strings without filter in/out; percentage keyframes (0%=...;100%=...); frame-number keyframes with or without filter in/out. Verified empirically — see /tmp/pan_test.
Aspect distortion on crops: keep every crop rect at 16:9. If you really need a non-16:9 crop, set distort="1" on qtblend — but the result will stretch.
MLT's volume filter silences everything when you try to use time-string keyframes on its gain property. The property is dB-scaled and parser-fragile. For end-of-song fade-out use avfilter.afade (see Step 3.5) — that wraps ffmpeg's afade and Just Works.
End-of-song edge case: with the T - lag correction, sources with negative scipy lag (file content lies later than ref) need T_end - lag_X to be ≤ the source's file duration. Symmetrically, sources with positive lag need T_start - lag_X ≥ 0 at the song open. The generator clamps in_pos to 0 at the start; for the end, either close on the reference camera or shave a few hundred ms off the final cut.
30.000 vs 29.97 fps mix: harmless — MLT conforms to the profile's frame rate. The temporal alias is far below the sync threshold.
Validate the parsed project before rendering: mlt-melt project.mlt -consumer xml:check.xml round-trips through MLT's loader and shows what it actually saw. If <tractor> reports a short duration, the producers weren't validated (see avformat-novalidate gotcha above).
Don't trust grid/strip eyeballing for choir bounds: reading pixel coordinates off a scaled-down grid overlay is consistently off by several hundred source pixels (we got the choir x-extent wrong by ~300px multiple times on the same frame, and the y-extent by ~240px). Always confirm bounds with the user against marker overlays at full resolution (Step 2.6). Marker-overlay tests beat any amount of mental coordinate conversion.
Tight-zoom y must be face level, NOT the choir bbox midpoint: the vertical midpoint of a standing-choir bbox falls on torsos and legs. Always ask the user for an explicit y-center for any crop with sh ≤ 720. For wider crops (sh=1080+) the bbox midpoint is fine, but align the top of the crop with the top of the choir's heads to avoid empty ceiling/wall above the singers.
Pans on side cameras can pan into empty wall: when the conductor sits at the very edge of the useful image (e.g. right cam where the conductor is at source x=880-1060 with backstage at x<880), a "pan to conductor" or "push to conductor" inevitably reveals the backstage wall on the side of the conductor. Two safe alternatives:
- Replace the conductor-targeted move with a push to the choir center instead (the choir is the safer target — it's far from any edge).
- Use a different camera for the "reveal conductor" beat (the front camera's wide shot, or the opposite side cam). Rule of thumb: any tight crop whose target is within ~5% of the source frame edge is at risk; verify by rendering the end-frame alone before committing.
Crop tops align with content tops, not the bbox center: when a wide crop (sh=1080 over a choir of only ~600px source height) has 400+px of vertical slack, the default "center vertically" leaves half the slack above the heads as ceiling/wall. The user reads this as "wall above singers." Default to sy = choir_top_y (or even slightly above) and let the floor take the remaining slack at the bottom — empty floor reads as context, empty ceiling reads as a framing error.
Trim the song length to your last clean cut: if the recording extends past the music, don't render to the literal file duration. Pick a total_dur ≈2-3s before the last sound and let the fade-out cover it. This also avoids the negative-lag tail-end clipping issue.
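
To tie the qtblend items together, here is a sketch that applies the rect formula from Step 3 and emits a filter element in the working pattern shown above, with in/out on the filter and filter-local time strings for the keyframes. The function names are hypothetical, the opacity value of 1 and the rounding to whole pixels are assumptions, and any further properties that make_mlt.py sets on the filter are left out.

```python
def qtblend_rect(crop, src_w, src_h, out_w=1920, opacity=1):
    """'X Y W H A' string that shows source crop (sx, sy, sw, sh) as the full output."""
    sx, sy, sw, sh = crop
    scale = out_w / sw                 # crop must be 16:9 to match the output
    vals = (-sx * scale, -sy * scale, src_w * scale, src_h * scale)
    return " ".join(str(round(v)) for v in vals) + f" {opacity}"

def qtblend_filter(in_f, out_f, start_crop, end_crop, duration_tc, src_w, src_h):
    """Two-keyframe Ken Burns filter; in/out must equal the host entry's range."""
    start = qtblend_rect(start_crop, src_w, src_h)
    end = qtblend_rect(end_crop, src_w, src_h)
    return (
        f'<filter mlt_service="qtblend" in="{in_f}" out="{out_f}">\n'
        f'  <property name="rect">00:00:00.000={start};{duration_tc}={end}</property>\n'
        f'</filter>'
    )

# Push-in on a 3840x2160 source: full frame to a centered 1920x1080 crop.
xml = qtblend_filter(659, 898, (0, 0, 3840, 2160), (960, 540, 1920, 1080),
                     "00:00:08.000", 3840, 2160)
print(xml)
```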

Reusable files in this checkout

In the song directory:

make_mlt.py — generator template (the editable bit; copy + tweak per song)
project.mlt — last generated project
concert_edit.mp4 — last rendered output

One directory up (2025-05-03-choir/), shared across songs:

HOWTO-multicam-edit.md — this file
xcorr2.py — envelope-based sync analyzer (takes a WAV dir as arg)
sync_view.sh — live side-by-side sync viewer (mpv lavfi-complex; center audio in left ear, the other angle's audio in right ear — drift becomes audible as flam)
color_match.py — color-gain computer (takes a frames dir as arg; writes <dir>/comparison.jpg and <dir>/gains.py to paste into make_mlt.py). Reference defaults to misha.
