@rleroi
Created April 14, 2026 14:49
Vector Audio Format

Vector Audio Format — Concept Notes

Core Idea

An audio file format that is to PCM what SVG is to bitmap: resolution-independent, mathematically defined, and inherently manipulable. Instead of storing discrete amplitude samples, store mathematical curve definitions (Bézier/B-spline control points) that describe the sound.

Key Insight

Raw PCM waveforms are too complex for efficient curve fitting. But if you first decompose audio via sinusoidal modeling (SMS — Serra & Smith, 1990), the resulting parameter trajectories (frequency, amplitude over time) are smooth, slowly-varying curves — exactly what Bézier curves represent efficiently.

Architecture: Three Track Types

1. Sinusoidal Tracks (tonal content)

Each track = one partial (harmonic or inharmonic).

  • freq: Bézier curve (Hz over time)
  • amp: Bézier curve (amplitude over time)
  • birth / death: start and end time

Handles: sustained notes, vocals, bass, pads, pitched instruments. A piano note might need 30 tracks × 2 curves (freq, amp) × ~5 control points each ≈ 300 control points for a full second, versus 44,100 PCM samples at 44.1 kHz.
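A minimal sketch of how a decoder might read one of these curves at a given time. It assumes control points are stored as (time, value) pairs whose time coordinate increases monotonically along the curve, as in the .vec example further down. The names `bezier_point` and `value_at_time` are hypothetical. Because time is itself a function of the curve parameter u, the sketch finds u by bisection before reading off the value.

```python
def bezier_point(control_points, u):
    """Evaluate a Bezier curve of any degree at parameter u in [0, 1] (de Casteljau)."""
    pts = [tuple(p) for p in control_points]
    while len(pts) > 1:
        # Repeatedly interpolate between neighbouring points until one remains.
        pts = [
            tuple((1 - u) * a + u * b for a, b in zip(p, q))
            for p, q in zip(pts, pts[1:])
        ]
    return pts[0]


def value_at_time(control_points, t, iterations=40):
    """Return the curve value at time t (assumes time increases monotonically with u)."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if bezier_point(control_points, mid)[0] < t:
            lo = mid
        else:
            hi = mid
    return bezier_point(control_points, 0.5 * (lo + hi))[1]


# Frequency curve taken from the .vec example: (seconds, Hz) pairs.
freq_curve = [(0, 440), (1, 442), (2, 441), (3, 440)]
print(value_at_time(freq_curve, 1.5))  # close to 441.125
```

When the control-point times are evenly spaced, time is linear in u and the bisection could be replaced by a direct division; the search is only needed for the general case.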

2. Noise Bands (textural/stochastic content)

Bandpass-filtered noise with amplitude envelopes.

  • freq_low / freq_high: frequency range (Hz)
  • amp: Bézier curve (amplitude envelope over time)

Alternative: a 2D Bézier surface (frequency × time → amplitude) for continuous spectral envelope modeling.

Handles: breath, bow scrape, snare wires, cymbal wash, consonants in speech.
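A rough sketch of rendering one noise band: white noise, a band-pass filter, then the amplitude envelope. The filter here is a single second-order band-pass in the widely used Audio EQ Cookbook form, chosen only because it is short; a real decoder would likely want steeper filters or FFT-domain shaping. `bandpass` and `render_noise_band` are hypothetical names, and `amp` is assumed to be any callable of time in seconds (for example, a Bézier evaluator).

```python
import math
import random


def bandpass(samples, f_lo, f_hi, sample_rate):
    """Second-order band-pass with 0 dB peak gain, applied sample by sample."""
    f0 = math.sqrt(f_lo * f_hi)            # geometric centre of the band
    q = f0 / (f_hi - f_lo)
    w0 = 2 * math.pi * f0 / sample_rate
    alpha = math.sin(w0) / (2 * q)
    a0 = 1 + alpha
    b0, b2 = alpha / a0, -alpha / a0       # b1 is zero for this filter
    a1, a2 = -2 * math.cos(w0) / a0, (1 - alpha) / a0
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1, y2, y1 = x1, x, y1, y
        out.append(y)
    return out


def render_noise_band(f_lo, f_hi, amp, duration, sample_rate=44100, seed=0):
    """White noise -> band-pass -> amplitude envelope amp(t)."""
    rng = random.Random(seed)
    n = int(duration * sample_rate)
    noise = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    band = bandpass(noise, f_lo, f_hi, sample_rate)
    return [amp(i / sample_rate) * s for i, s in enumerate(band)]


# Illustrative envelope only: linear decay over half a second.
burst = render_noise_band(2000, 5000, lambda t: max(0.0, 1.0 - 2.0 * t), 0.5)
print(len(burst))
```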

3. Transient Events (attacks, clicks, impacts)

Short broadband bursts (~1-5ms).

  • time: when it occurs
  • shape: Bézier curve (amplitude envelope)
  • spectrum: Bézier curve (spectral energy distribution)
  • Or: a tiny PCM snippet (few hundred samples)

Handles: drum stick impact, pick attack, plosives.

Percussion Decomposition Examples

  • Kick: freq sweep 150→60Hz (sinusoidal) + beater click noise bands 1-5kHz + transient at t=0
  • Snare: short tone 150-250Hz (sinusoidal) + snare wire noise bands 1-15kHz + transient
  • Hi-hat: no sinusoidal tracks, noise bands 3-18kHz with fast decay + transient
  • Cymbal: inharmonic sinusoidal tracks + broadband noise bands with slow decay + transient

Killer Features (What This Enables)

  • Time stretching: re-parameterize the Bézier curves over a longer time range (scale each control point's time coordinate and leave its value alone). Frequencies are unchanged, so there is no pitch change and no phase vocoder artifacts. Transients stay sharp because they are point events that get moved, not stretched.
  • Pitch shifting: multiply all frequency curves by a constant. Done.
  • Harmonic editing: boost/suppress/remove individual partials.
  • Sound morphing: interpolate control points between two sounds.
  • Resolution independence: render at any sample rate.
  • Extreme compression: smooth parameter curves compress to very few control points (potentially 200:1-400:1 for simple sounds).
  • Procedural variation: perturb control points slightly for natural-sounding variation.
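Several of these edits reduce to simple arithmetic on control points. A small sketch, assuming each curve is a list of (time, value) pairs; the function names are hypothetical, and a full implementation would also scale each track's birth/death times and each transient's trigger time when stretching.

```python
def time_stretch(curve, factor):
    """Scale the time coordinate of every control point; values are untouched."""
    return [(t * factor, v) for t, v in curve]


def pitch_shift(freq_curve, semitones):
    """Multiply every frequency value by a constant ratio."""
    ratio = 2 ** (semitones / 12)
    return [(t, f * ratio) for t, f in freq_curve]


def morph(curve_a, curve_b, mix):
    """Linearly interpolate two curves that have the same number of control points."""
    if len(curve_a) != len(curve_b):
        raise ValueError("curves must have the same number of control points")
    return [
        ((1 - mix) * ta + mix * tb, (1 - mix) * va + mix * vb)
        for (ta, va), (tb, vb) in zip(curve_a, curve_b)
    ]


freq = [(0, 440), (1, 442), (2, 441), (3, 440)]
print(time_stretch(freq, 2))    # same shape over 6 seconds instead of 3
print(pitch_shift(freq, 12))    # one octave up
```

Morphing between curves with different control-point counts would first need degree elevation or resampling, which this sketch leaves out.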

Rendering Pipeline

For each output sample at time t:

  1. Evaluate each sinusoidal track's freq(t) and amp(t) Bézier curves
  2. Accumulate phase: phase(t) = 2π ∫₀ᵗ freq(τ) dτ (closed-form, since the integral of a Bézier polynomial is itself a polynomial in the curve parameter)
  3. output += amp(t) * sin(phase(t))
  4. Add noise bands: generate white noise, bandpass filter, shape with amp curve
  5. Add transients at their trigger times
  6. Sum all components
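The sinusoidal part of this loop can be sketched as follows. Note one substitution: instead of the closed-form Bézier integral, the sketch accumulates phase numerically one sample at a time, which is simpler and is a common way to drive an oscillator. `render_track` is a hypothetical name, and `freq` and `amp` are assumed to be callables of time in seconds.

```python
import math


def render_track(freq, amp, birth, death, sample_rate=44100):
    """Render one partial between birth and death (seconds)."""
    n = int(round((death - birth) * sample_rate))
    phase = 0.0
    out = []
    for i in range(n):
        t = birth + i / sample_rate
        out.append(amp(t) * math.sin(phase))
        # Running-sum approximation of phase(t) = 2*pi * integral of freq.
        phase += 2 * math.pi * freq(t) / sample_rate
    return out


# Illustrative partial: 440 Hz with a linear fade-out over one second.
samples = render_track(lambda t: 440.0, lambda t: 1.0 - t, 0.0, 1.0)
print(len(samples))
```

A full renderer would mix every track into one output buffer at the right offsets, then add the noise bands and transients.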

The Format Is Essentially...

An additive synthesizer preset extracted from real audio. The file IS a synth patch. The encoder IS the analysis. The decoder IS the synth. It sits between MIDI (pure instructions, no timbre) and PCM (pure samples, no structure).

Where It Works Best vs. Where PCM Wins

  • This format wins: instruments, voice, synths, sound effects, game audio — structured sounds
  • PCM wins: rain, crowd noise, field recordings — unstructured/stochastic sounds
  • Analogous to SVG (illustrations) vs. PNG (photographs)

Analysis Pipeline (Encoder)

  1. Window the signal into overlapping frames (20-50ms, hop 5-10ms)
  2. FFT each frame → magnitude spectrum
  3. Peak picking → find sinusoidal components (use parabolic interpolation for sub-bin accuracy)
  4. Peak tracking across frames → form continuous sinusoidal tracks (birth/continuation/death)
  5. Resynthesize sinusoidal part, subtract from original → residual
  6. Model residual's spectral envelope per frame
  7. Detect transients (onset detection)
  8. Fit Bézier curves to all parameter trajectories (adaptive: more control points during vibrato/change, fewer during sustain)
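Step 3 is worth a concrete look. The sketch below windows one frame, takes a magnitude spectrum, and refines the strongest bin with the usual three-point parabolic interpolation on dB magnitudes. A naive O(N²) DFT stands in for a real FFT so the example stays dependency-free; the function names and the test signal (a 1010 Hz sine at 8 kHz) are illustrative assumptions.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (fine for a small demo frame)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def refine_peak(mags_db, k):
    """Parabolic interpolation around local-maximum bin k.

    Returns (fractional bin position, interpolated peak height in dB).
    """
    a, b, c = mags_db[k - 1], mags_db[k], mags_db[k + 1]
    offset = 0.5 * (a - c) / (a - 2 * b + c)
    return k + offset, b - 0.25 * (a - c) * offset


sr, n, f_true = 8000, 256, 1010.0
# Hann-windowed sine that falls between two bins (bin width is 31.25 Hz).
frame = [
    0.5 * (1 - math.cos(2 * math.pi * i / n)) * math.sin(2 * math.pi * f_true * i / sr)
    for i in range(n)
]
mags_db = [20 * math.log10(m + 1e-12) for m in magnitude_spectrum(frame)]
k = max(range(1, len(mags_db) - 1), key=lambda j: mags_db[j])
bin_pos, peak_db = refine_peak(mags_db, k)
print(k * sr / n, bin_pos * sr / n)  # raw bin frequency vs. refined estimate
```

A real peak picker would apply this to every local maximum above the threshold, not just the global one.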

Practical Build Path

  1. Single note proof of concept: analyze a piano note with SMS (use Python sms-tools), fit Bézier curves to tracks, resynthesize, A/B test
  2. Time stretch test: re-parameterize curves to 2x, compare to phase vocoder
  3. Residual modeling: spectral envelope as Bézier curves, resynthesize as filtered noise
  4. File format spec: define binary/JSON format for tracks + noise + transients
  5. Percussion test: decompose a drum loop into the three track types
  6. Polyphonic audio: use ML source separation (Demucs) as preprocessing, encode each stem independently

Key Tools & Libraries

  • sms-tools (Python, Xavier Serra) — full SMS analysis/synthesis implementation
  • librosa (Python) — STFT, peak picking, onset detection
  • Loris (C++ with Python bindings) — sinusoidal modeling library
  • scipy.interpolate — B-spline fitting
  • scipy.optimize — least-squares curve fitting
  • Demucs (Meta) — ML source separation for polyphonic preprocessing

Essential Reading

  1. McAulay & Quatieri (1986) — "Speech Analysis/Synthesis Based on a Sinusoidal Representation" (foundational peak tracking)
  2. Serra & Smith (1990) — "Spectral Modeling Synthesis" (deterministic + stochastic decomposition)
  3. Serra PhD thesis (1989) — full treatment, freely available
  4. Driedger & Müller (2016) — "A Review of Time-Scale Modification of Music Signals" (survey of time stretching, good context)
  5. Farin — "Curves and Surfaces for CAGD" (Bézier/B-spline math)
  6. Zölzer — "DAFX: Digital Audio Effects" (spectral modeling, time stretching)

Open Questions

  • Optimal Bézier degree for parameter trajectories (cubic? quartic?)
  • Adaptive knot placement strategy — how to decide where to add control points
  • Perceptual error metric for curve fitting (frequency-weighted? psychoacoustic model?)
  • Cymbal/complex inharmonic sound quality ceiling
  • Real-time rendering performance for dense track counts
  • Could the residual use a different curve-based representation than spectral envelope + noise?

rleroi commented Apr 14, 2026

PoC
Phase 1 — Decoder only
Input: Hand-authored .vec JSON
Output: Audio playback + visualization

  • Implement bezier.js: evaluateCubic(controlPoints, t) → value
  • Implement AudioWorklet that:
      • Reads .vec JSON
      • For each sample: evaluate all tracks' freq/amp curves, sum sinusoids
      • For noise bands: generate white noise, bandpass, shape with amp curve
  • Draw tracks on canvas: freq curves as colored lines, amp as opacity/thickness
  • Hard-code a simple test sound: single note with 5 harmonics decaying at different rates
  • Verify it sounds like a real note

Milestone: Paste JSON in a textarea → hear sound → see curves on canvas.

Phase 2 — Basic analyzer (FFT + peaks)
Input: Dropped WAV file (decoded via AudioContext.decodeAudioData)
Output: Peak data per frame

  • Implement FFT (or grab a tiny lib; fft.js is ~80 lines, MIT licensed)
  • Window function (Blackman-Harris for good sidelobe suppression)
  • Compute magnitude spectrum per frame (frameSize=2048, hop=512, ~86 frames/sec)
  • Peak picking: find local maxima above a threshold, parabolic interpolation for sub-bin freq accuracy
  • Visualize: spectrogram on canvas with peak dots overlaid

Milestone: Drop a WAV → see spectrogram with detected peaks highlighted.

Phase 3 — Peak tracking
Input: Peaks per frame from Phase 2
Output: Continuous sinusoidal tracks

  • Frame-to-frame nearest-frequency matching (within 50Hz threshold)
  • Track birth: unmatched peak starts a new track
  • Track death: unmatched track for N consecutive frames → end it
  • Minimum track length filter (discard tracks shorter than ~30ms; they're noise, not partials)
  • Visualize: draw tracks as continuous colored lines over the spectrogram

Milestone: Drop a piano note WAV → see clean harmonic lines on the spectrogram. Should clearly show fundamental + harmonics.
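A compact sketch of the tracking rules above, using greedy nearest-frequency matching. It assumes each frame is just a list of peak frequencies in Hz; `track_peaks`, `max_gap` (the "N consecutive frames" from the list) and `min_length` (expressed in frames rather than milliseconds) are hypothetical names. Greedy matching is the simplest option and can mis-assign peaks when partials cross.

```python
def track_peaks(frames, max_jump_hz=50.0, max_gap=2, min_length=3):
    """Link per-frame peaks into tracks.

    frames: list of lists of peak frequencies (Hz), one list per analysis frame.
    Returns a list of tracks, each a list of (frame_index, freq) tuples.
    """
    active, finished = [], []
    for idx, peaks in enumerate(frames):
        unmatched = list(peaks)
        for track in active:
            last_freq = track["points"][-1][1]
            best = min(unmatched, key=lambda f: abs(f - last_freq), default=None)
            if best is not None and abs(best - last_freq) <= max_jump_hz:
                track["points"].append((idx, best))   # continuation
                track["gap"] = 0
                unmatched.remove(best)
            else:
                track["gap"] += 1                     # missed this frame
        still_active = []
        for track in active:
            # Death: too many consecutive frames without a match.
            (finished if track["gap"] > max_gap else still_active).append(track)
        # Birth: every peak nobody claimed starts a new track.
        active = still_active + [{"points": [(idx, f)], "gap": 0} for f in unmatched]
    finished.extend(active)
    return [t["points"] for t in finished if len(t["points"]) >= min_length]


frames = [[440, 880], [441, 882], [442, 879, 3000], [443, 881]]
for track in track_peaks(frames):
    print(track)
```

In this toy input the stray 3000 Hz peak appears in only one frame, so the minimum-length filter drops it and two tracks remain.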

Phase 4 — Bézier curve fitting
Input: Raw tracks (arrays of freq/amp values per frame)
Output: Bézier control points per track

  • Start simple: fit one cubic Bézier per track segment
  • Least-squares fitting: minimize sum of squared errors between curve and data points
  • Adaptive: if error exceeds threshold, split segment and fit two curves (recursive subdivision)
  • Visualize: overlay smooth Bézier curves on top of raw tracked data points
  • Error display: show max/avg deviation per track

Milestone: See smooth curves closely following the raw tracked data. Quantify the fit error.
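One way the "start simple" fit could look, under two simplifying assumptions that are mine rather than the plan's: the curve is pinned to the first and last data points, and control-point times are evenly spaced so that time is linear in the curve parameter. That leaves only the two inner control values to solve for, which is a 2×2 least-squares problem. `fit_cubic` is a hypothetical name; the recursive subdivision step is left out.

```python
def fit_cubic(times, values):
    """Least-squares cubic Bezier through the first and last samples.

    Returns four (time, value) control points. Needs at least four samples.
    """
    t0, t1 = times[0], times[-1]
    y0, y3 = values[0], values[-1]
    s11 = s12 = s22 = r1 = r2 = 0.0
    for t, y in zip(times, values):
        u = (t - t0) / (t1 - t0)
        b0, b1 = (1 - u) ** 3, 3 * u * (1 - u) ** 2
        b2, b3 = 3 * u * u * (1 - u), u ** 3
        resid = y - b0 * y0 - b3 * y3      # what the inner points must explain
        s11 += b1 * b1
        s12 += b1 * b2
        s22 += b2 * b2
        r1 += b1 * resid
        r2 += b2 * resid
    det = s11 * s22 - s12 * s12
    c1 = (r1 * s22 - r2 * s12) / det       # Cramer's rule on the normal equations
    c2 = (s11 * r2 - s12 * r1) / det
    span = t1 - t0
    return [(t0, y0), (t0 + span / 3, c1), (t0 + 2 * span / 3, c2), (t1, y3)]


# Synthetic track: 31 frames sampled from a known cubic (control values 440, 442, 441, 440).
times = [3 * i / 30 for i in range(31)]
values = [
    440 * (1 - u) ** 3 + 442 * 3 * u * (1 - u) ** 2 + 441 * 3 * u * u * (1 - u) + 440 * u ** 3
    for u in (t / 3 for t in times)
]
print(fit_cubic(times, values))
```

On real tracked data the fit will not be exact, and the max deviation between curve and samples is what would drive the split-and-refit decision.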

Phase 5 — Residual extraction + noise modeling
Input: Original PCM + sinusoidal reconstruction from fitted curves
Output: Noise band parameters

  • Resynthesize sinusoidal part from the Bézier tracks
  • Subtract from original → residual signal
  • Compute spectral envelope of residual per frame (LPC or cepstral smoothing)
  • Segment into frequency bands, fit amplitude Bézier curves to each band's energy over time
  • Visualize: show residual waveform + its spectral envelope

Milestone: Play sinusoidal-only, residual-only, and combined. Combined should sound close to original.

Phase 6 — Full round-trip
Input: WAV file
Output: .vec JSON + resynthesized audio + A/B comparison

  • Wire it all together: drop → analyze → fit → export .vec JSON
  • Import .vec → synthesize via AudioWorklet
  • A/B toggle button: instant switch between original PCM and resynthesized
  • Display file size comparison (PCM bytes vs .vec JSON bytes)
  • Time stretch slider: re-parameterize curves → hear stretched audio
  • Pitch shift slider: multiply freq curves → hear shifted audio

Milestone: The full demo. Drop a file, see it decompose, play it back, stretch it, shift it.

Test Sounds (progressive difficulty)

  1. Pure sine wave (trivial: one track, constant freq/amp)
  2. Sawtooth wave (multiple harmonics, constant)
  3. Single piano note (harmonics with amplitude decay; first real test)
  4. Vocal "aah" (vibrato, formants)
  5. Snare drum (mostly noise + transient; tests the non-sinusoidal path)
  6. Short melody (multiple notes, track births and deaths)
  7. Two instruments together (polyphonic; the hard case)

Key Parameters to Tune

  • Frame size: 2048 samples (good freq resolution at 44.1kHz → ~21.5Hz per bin)
  • Hop size: 512 samples (~11.6ms between frames)
  • Window: Blackman-Harris (good sidelobe suppression for peak picking)
  • Peak threshold: 60dB below the maximum peak (ignore noise floor)
  • Tracking threshold: 50Hz max frequency jump between frames
  • Min track length: 30ms (~3 frames)
  • Bézier fit error threshold: TBD, needs experimentation. Start with max 1Hz freq error, 1% amp error
  • Max Bézier degree: cubic (degree 3) for PoC, higher if needed
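The derived figures in this list follow directly from the sample rate, frame size, and hop size. A quick check, with variable names chosen only for illustration:

```python
import math

sample_rate = 44100
frame_size = 2048
hop_size = 512

bin_width_hz = sample_rate / frame_size          # frequency resolution per FFT bin
hop_ms = 1000 * hop_size / sample_rate           # time between frame starts
frames_per_second = sample_rate / hop_size
min_track_frames = math.ceil(30 / hop_ms)        # frames needed to span 30 ms

print(round(bin_width_hz, 2), round(hop_ms, 2), round(frames_per_second, 2), min_track_frames)
```

Changing any of the three inputs shifts the tracking threshold and minimum track length in frame terms, so these are worth recomputing rather than hard-coding.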

Stretch Goals (after PoC works)

  • Onset detection for automatic transient extraction
  • Interactive control point dragging with live audio update
  • Binary .vec format (MessagePack or custom) for realistic file size comparison
  • Microphone input → live analysis
  • WebGL spectrogram for smoother rendering
  • WASM FFT for analyzing longer files
  • Export resynthesized audio as WAV


rleroi commented Apr 14, 2026

.vec Format (JSON for PoC, binary later)

{
  "version": 1,
  "sampleRate": 44100,
  "duration": 3.0,
  "sinusoidalTracks": [
    {
      "birth": 0.0,
      "death": 3.0,
      "freq": { "degree": 3, "controlPoints": [[0, 440], [1, 442], [2, 441], [3, 440]] },
      "amp":  { "degree": 3, "controlPoints": [[0, 0], [0.3, 0.8], [2.5, 0.7], [3, 0]] }
    }
  ],
  "noiseBands": [
    {
      "freqLow": 2000,
      "freqHigh": 5000,
      "birth": 0.0,
      "death": 0.5,
      "amp": { "degree": 3, "controlPoints": [[0, 0], [0.01, 0.3], [0.1, 0.1], [0.5, 0]] }
    }
  ],
  "transients": [
    {
      "time": 0.0,
      "duration": 0.003,
      "amp": { "degree": 2, "controlPoints": [[0, 0], [0.001, 1.0], [0.003, 0]] },
      "spectrum": { "degree": 2, "controlPoints": [[200, 0.2], [2000, 1.0], [10000, 0.3]] }
    }
  ]
}
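A loader for this JSON could start with a few sanity checks. The sketch below is one possible set, not part of the format: it assumes each curve is a single Bézier segment (so it must have degree + 1 control points, as in the example), that the first coordinate of each control point never decreases, and that tracks fit inside the file duration. `check_curve` and `load_vec` are hypothetical names.

```python
import json


def check_curve(curve, name):
    """Assumes one Bezier segment per curve: degree + 1 control points."""
    points = curve["controlPoints"]
    expected = curve["degree"] + 1
    if len(points) != expected:
        raise ValueError(f"{name}: expected {expected} control points, got {len(points)}")
    xs = [p[0] for p in points]
    if xs != sorted(xs):
        raise ValueError(f"{name}: first coordinates must be non-decreasing")


def load_vec(text):
    """Parse a .vec JSON string and validate its curves; returns the parsed dict."""
    doc = json.loads(text)
    for i, track in enumerate(doc.get("sinusoidalTracks", [])):
        check_curve(track["freq"], f"sinusoidalTracks[{i}].freq")
        check_curve(track["amp"], f"sinusoidalTracks[{i}].amp")
        if not 0 <= track["birth"] < track["death"] <= doc["duration"]:
            raise ValueError(f"sinusoidalTracks[{i}]: bad birth/death times")
    for i, band in enumerate(doc.get("noiseBands", [])):
        check_curve(band["amp"], f"noiseBands[{i}].amp")
        if not 0 < band["freqLow"] < band["freqHigh"] <= doc["sampleRate"] / 2:
            raise ValueError(f"noiseBands[{i}]: bad frequency range")
    for i, transient in enumerate(doc.get("transients", [])):
        check_curve(transient["amp"], f"transients[{i}].amp")
        check_curve(transient["spectrum"], f"transients[{i}].spectrum")
    return doc


example = """
{"version": 1, "sampleRate": 44100, "duration": 3.0,
 "sinusoidalTracks": [{"birth": 0.0, "death": 3.0,
   "freq": {"degree": 3, "controlPoints": [[0, 440], [1, 442], [2, 441], [3, 440]]},
   "amp":  {"degree": 3, "controlPoints": [[0, 0], [0.3, 0.8], [2.5, 0.7], [3, 0]]}}]}
"""
print(len(load_vec(example)["sinusoidalTracks"]))
```

If curves later become piecewise (several segments per track, as the adaptive fitting suggests), the control-point count rule would need to change accordingly.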
