@rleroi
Created April 14, 2026 14:49
Vector Audio Format

Vector Audio Format — Concept Notes

Core Idea

An audio file format that is to PCM what SVG is to bitmap: resolution-independent, mathematically defined, and inherently manipulable. Instead of storing discrete amplitude samples, store mathematical curve definitions (Bézier/B-spline control points) that describe the sound.

Key Insight

Raw PCM waveforms are too complex for efficient curve fitting. But if you first decompose audio via sinusoidal modeling (SMS — Serra & Smith, 1990), the resulting parameter trajectories (frequency, amplitude over time) are smooth, slowly-varying curves — exactly what Bézier curves represent efficiently.

Architecture: Three Track Types

1. Sinusoidal Tracks (tonal content)

Each track = one partial (harmonic or inharmonic).

  • freq: Bézier curve (Hz over time)
  • amp: Bézier curve (amplitude over time)
  • birth / death: start and end time

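Handles: sustained notes, vocals, bass, pads, pitched instruments. A piano note might have 30 tracks × 2 curves (freq and amp) × ~5 control points each = ~300 control points for a full second, or ~600 floats if each point stores a (time, value) pair (vs. 44,100 PCM samples at 44.1 kHz).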
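A minimal sketch of how one track parameter could be evaluated. It assumes a "functional" Bézier: only the values are stored, spaced evenly between birth and death, so the curve parameter u is just normalized time. The notes do not fix this layout, and the names `bezier_eval` and `track_value` are hypothetical.

```python
import numpy as np

def bezier_eval(control_values, u):
    """Evaluate a Bézier curve at parameter(s) u in [0, 1] via de Casteljau."""
    pts = np.asarray(control_values, dtype=float)
    u = np.atleast_1d(np.asarray(u, dtype=float))
    # One row of working points per u value.
    work = np.tile(pts, (u.size, 1))
    for level in range(len(pts) - 1, 0, -1):
        # Repeated linear interpolation between neighbouring points.
        work = (1.0 - u[:, None]) * work[:, :level] + u[:, None] * work[:, 1:level + 1]
    return work[:, 0]

def track_value(control_values, birth, death, t):
    """Map absolute time t to u (assumption: time is linear in u), then evaluate."""
    u = np.clip((np.asarray(t, dtype=float) - birth) / (death - birth), 0.0, 1.0)
    return bezier_eval(control_values, u)

# Frequency of a 3-second track at its midpoint.
print(track_value([440, 442, 441, 440], birth=0.0, death=3.0, t=1.5))  # [441.125]
```

De Casteljau is used here because it is numerically stable and works for any degree; for fixed cubic curves a direct Bernstein polynomial would be just as reasonable.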

2. Noise Bands (textural/stochastic content)

Bandpass-filtered noise with amplitude envelopes.

  • freq_low / freq_high: frequency range (Hz)
  • amp: Bézier curve (amplitude envelope over time)

Alternative: a 2D Bézier surface (frequency × time → amplitude) for continuous spectral envelope modeling.

Handles: breath, bow scrape, snare wires, cymbal wash, consonants in speech.
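One way a single noise band could be rendered, as a rough sketch. A brick-wall FFT mask stands in for a proper bandpass filter here (an assumption made to keep the example NumPy-only), and the amplitude envelope is passed in as an already-sampled array. `render_noise_band` is a hypothetical name.

```python
import numpy as np

def render_noise_band(freq_low, freq_high, amp_env, sample_rate=44100, seed=0):
    """White noise restricted to [freq_low, freq_high] Hz, shaped by amp_env."""
    amp_env = np.asarray(amp_env, dtype=float)
    n = amp_env.size
    rng = np.random.default_rng(seed)
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
    # Zero every bin outside the band (brick-wall mask, not a real filter design).
    spectrum[(freqs < freq_low) | (freqs > freq_high)] = 0.0
    band = np.fft.irfft(spectrum, n=n)
    band /= np.max(np.abs(band)) + 1e-12  # normalize so the envelope sets the level
    return band * amp_env

# Half a second of 2-5 kHz noise with a linear decay.
envelope = np.linspace(1.0, 0.0, 22050)
noise = render_noise_band(2000, 5000, envelope)
```

A streaming decoder would more likely use an IIR or FIR bandpass so it can render block by block; the FFT mask needs the whole band up front.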

3. Transient Events (attacks, clicks, impacts)

Short broadband bursts (~1-5ms).

  • time: when it occurs
  • shape: Bézier curve (amplitude envelope)
  • spectrum: Bézier curve (spectral energy distribution)
  • Or: a tiny PCM snippet (few hundred samples)

Handles: drum stick impact, pick attack, plosives.

Percussion Decomposition Examples

  • Kick: freq sweep 150→60Hz (sinusoidal) + beater click noise bands 1-5kHz + transient at t=0
  • Snare: short tone 150-250Hz (sinusoidal) + snare wire noise bands 1-15kHz + transient
  • Hi-hat: no sinusoidal tracks, noise bands 3-18kHz with fast decay + transient
  • Cymbal: inharmonic sinusoidal tracks + broadband noise bands with slow decay + transient

Killer Features (What This Enables)

  • Time stretching: scale the time coordinates of the Bézier control points to a longer range. Frequency and amplitude values stay the same, so pitch does not change and there are no phase-vocoder artifacts. Transients stay sharp because they are point events that get repositioned, not stretched.
  • Pitch shifting: multiply all frequency curves by a constant. Done.
  • Harmonic editing: boost/suppress/remove individual partials.
  • Sound morphing: interpolate control points between two sounds.
  • Resolution independence: render at any sample rate.
  • Extreme compression: smooth parameter curves reduce to very few control points (potentially 200:1-400:1 for simple sounds with few partials; this is an estimate, not a measurement).
  • Procedural variation: perturb control points slightly for natural-sounding variation.
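The first two items come down to scaling one coordinate of the control points. A small sketch, assuming a track is a dict shaped like the sinusoidal track in the JSON format sketched in the comment, with `[time, value]` control points. `time_stretch` and `pitch_shift` are hypothetical names.

```python
import copy

def time_stretch(track, factor):
    """Scale every time coordinate; frequency and amplitude values are untouched."""
    out = copy.deepcopy(track)
    out["birth"] = track["birth"] * factor
    out["death"] = track["death"] * factor
    for key in ("freq", "amp"):
        out[key]["controlPoints"] = [[t * factor, v] for t, v in track[key]["controlPoints"]]
    return out

def pitch_shift(track, ratio):
    """Multiply every frequency value by ratio; times and amplitudes are untouched."""
    out = copy.deepcopy(track)
    out["freq"]["controlPoints"] = [[t, v * ratio] for t, v in track["freq"]["controlPoints"]]
    return out

track = {
    "birth": 0.0,
    "death": 3.0,
    "freq": {"degree": 3, "controlPoints": [[0, 440], [1, 442], [2, 441], [3, 440]]},
    "amp": {"degree": 3, "controlPoints": [[0, 0], [0.3, 0.8], [2.5, 0.7], [3, 0]]},
}
slower = time_stretch(track, 2.0)               # 6 seconds long, same pitch
fifth_up = pitch_shift(track, 2 ** (7 / 12))    # 7 semitones up, same length
```

Scaling control points is enough because Bézier curves are invariant under affine maps: the curve through the scaled points is the scaled curve. Note that multiplying every partial by the same ratio also moves the formants, which may or may not be wanted for voice.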

Rendering Pipeline

For each output sample at time t:

  1. Evaluate each sinusoidal track's freq(t) and amp(t) Bézier curves
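  2. Accumulate phase: phase(t) = 2π ∫₀ᵗ freq(τ) dτ (closed-form when the curve is a polynomial in t, which holds if its control points are evenly spaced in time; a general parametric Bézier first needs t mapped back to the curve parameter)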
  3. output += amp(t) * sin(phase(t))
  4. Add noise bands: generate white noise, bandpass filter, shape with amp curve
  5. Add transients at their trigger times
  6. Sum all components
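Steps 1-3 can be sketched for a single sinusoidal track as follows. This assumes functional Bézier curves (values only, evenly spaced in time over the track), which is what makes the phase integral closed-form: the integral of a degree-n Bézier is a degree-(n+1) Bézier whose coefficients are scaled running sums of the originals. The function names are hypothetical.

```python
import numpy as np
from math import comb

def bezier_eval(coeffs, u):
    """Evaluate a Bézier curve with the given control values at u in [0, 1]."""
    c = np.asarray(coeffs, dtype=float)
    n = len(c) - 1
    u = np.asarray(u, dtype=float)
    return sum(c[i] * comb(n, i) * u**i * (1.0 - u)**(n - i) for i in range(n + 1))

def bezier_integral_coeffs(coeffs):
    """Control values of the antiderivative (zero at u = 0), one degree higher."""
    c = np.asarray(coeffs, dtype=float)
    return np.concatenate(([0.0], np.cumsum(c))) / len(c)

def render_track(freq_values, amp_values, duration, sample_rate=44100):
    """Render one partial: amp(t) * sin(2*pi * integral of freq)."""
    n = int(round(duration * sample_rate))
    u = np.arange(n) / n  # normalized time, t = u * duration
    # d(t) = duration * d(u), hence the extra factor of duration.
    phase = 2.0 * np.pi * duration * bezier_eval(bezier_integral_coeffs(freq_values), u)
    return bezier_eval(amp_values, u) * np.sin(phase)

# A 3-second partial hovering around 440 Hz with a swell-and-fade envelope.
partial = render_track([440, 442, 441, 440], [0.0, 0.8, 0.7, 0.0], duration=3.0)
```

A full decoder would sum `render_track` over all tracks, offset each by its birth time, then add the noise bands and transients. With explicit (time, value) control points that are not evenly spaced, the same integral still exists in the curve parameter, but each sample time has to be mapped back to that parameter first.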

The Format Is Essentially...

An additive synthesizer preset extracted from real audio. The file IS a synth patch. The encoder IS the analysis. The decoder IS the synth. It sits between MIDI (pure instructions, no timbre) and PCM (pure samples, no structure).

Where It Works Best vs. Where PCM Wins

  • This format wins: instruments, voice, synths, sound effects, game audio — structured sounds
  • PCM wins: rain, crowd noise, field recordings — unstructured/stochastic sounds
  • Analogous to SVG (illustrations) vs. PNG (photographs)

Analysis Pipeline (Encoder)

  1. Window the signal into overlapping frames (20-50ms, hop 5-10ms)
  2. FFT each frame → magnitude spectrum
  3. Peak picking → find sinusoidal components (use parabolic interpolation for sub-bin accuracy)
  4. Peak tracking across frames → form continuous sinusoidal tracks (birth/continuation/death)
  5. Resynthesize sinusoidal part, subtract from original → residual
  6. Model residual's spectral envelope per frame
  7. Detect transients (onset detection)
  8. Fit Bézier curves to all parameter trajectories (adaptive: more control points during vibrato/change, fewer during sustain)
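Step 3 is the piece that is easiest to get subtly wrong, so here is a small sketch of parabolic interpolation around a spectral peak. It assumes a Hann window and interpolation on the dB magnitude, which is a common choice but not something these notes specify. `parabolic_peak` and `find_peak_freq` are hypothetical names, and only the single strongest peak is handled.

```python
import numpy as np

def parabolic_peak(mag_db, k):
    """Fit a parabola through bins k-1, k, k+1 and return (refined bin, peak dB)."""
    a, b, c = mag_db[k - 1], mag_db[k], mag_db[k + 1]
    # Vertex offset in bins, between -0.5 and +0.5 when k is a local maximum.
    offset = 0.5 * (a - c) / (a - 2.0 * b + c)
    return k + offset, b - 0.25 * (a - c) * offset

def find_peak_freq(frame, sample_rate):
    """Estimate the frequency (Hz) and level (dB) of the strongest partial."""
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hanning(len(frame))
    mag_db = 20.0 * np.log10(np.abs(np.fft.rfft(windowed)) + 1e-12)
    k = int(np.argmax(mag_db[1:-1])) + 1  # skip the edge bins so k-1 and k+1 exist
    refined_bin, peak_db = parabolic_peak(mag_db, k)
    return refined_bin * sample_rate / len(frame), peak_db

sample_rate, n = 44100, 2048
t = np.arange(n) / sample_rate
freq, level = find_peak_freq(np.sin(2.0 * np.pi * 1000.3 * t), sample_rate)
```

With a 2048-point frame the bin spacing is about 21.5 Hz, so the refinement is what brings the estimate to within a few Hz. A real encoder would run this on every local maximum above a threshold and pass the results to the peak tracker in step 4.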

Practical Build Path

  1. Single note proof of concept: analyze a piano note with SMS (use Python sms-tools), fit Bézier curves to tracks, resynthesize, A/B test
  2. Time stretch test: re-parameterize curves to 2x, compare to phase vocoder
  3. Residual modeling: spectral envelope as Bézier curves, resynthesize as filtered noise
  4. File format spec: define binary/JSON format for tracks + noise + transients
  5. Percussion test: decompose a drum loop into the three track types
  6. Polyphonic audio: use ML source separation (Demucs) as preprocessing, encode each stem independently
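For step 1, the curve fit itself can start very simply. The sketch below fits one Bézier segment to a sampled trajectory by ordinary least squares on the Bernstein basis. It assumes a functional Bézier over the track's time span, with no endpoint constraints and no adaptive splitting, and `fit_bezier` is a hypothetical name.

```python
import numpy as np
from math import comb

def bernstein_matrix(u, degree):
    """Design matrix: column i holds the i-th Bernstein basis function at u."""
    u = np.asarray(u, dtype=float)
    cols = [comb(degree, i) * u**i * (1.0 - u)**(degree - i) for i in range(degree + 1)]
    return np.stack(cols, axis=1)

def fit_bezier(times, values, degree=3):
    """Least-squares control values for one Bézier segment, plus the RMS error."""
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    u = (times - times[0]) / (times[-1] - times[0])  # normalize time to [0, 1]
    basis = bernstein_matrix(u, degree)
    coeffs, *_ = np.linalg.lstsq(basis, values, rcond=None)
    rms = float(np.sqrt(np.mean((basis @ coeffs - values) ** 2)))
    return coeffs, rms

# Frame-rate frequency estimates for one partial (synthetic here: a slow drift).
frame_times = np.arange(0.0, 3.0, 0.01)
frame_freqs = 440.0 + 1.5 * np.sin(2.0 * np.pi * frame_times / 6.0)
control_values, error_hz = fit_bezier(frame_times, frame_freqs, degree=3)
```

The returned RMS error gives a natural hook for the adaptive step: if it exceeds a tolerance, split the trajectory and fit each half separately. Which tolerance is perceptually right is one of the open questions.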

Key Tools & Libraries

  • sms-tools (Python, Xavier Serra) — full SMS analysis/synthesis implementation
  • librosa (Python) — STFT, peak picking, onset detection
  • Loris (C++ with Python bindings) — sinusoidal modeling library
  • scipy.interpolate — B-spline fitting
  • scipy.optimize — least-squares curve fitting
  • Demucs (Meta) — ML source separation for polyphonic preprocessing

Essential Reading

  1. McAulay & Quatieri (1986) — "Speech Analysis/Synthesis Based on a Sinusoidal Representation" (foundational peak tracking)
  2. Serra & Smith (1990) — "Spectral Modeling Synthesis" (deterministic + stochastic decomposition)
  3. Serra PhD thesis (1989) — full treatment, freely available
  4. Driedger & Müller (2016) — "A Review of Time-Scale Modification of Music Signals" (survey of time stretching, good context)
  5. Farin — "Curves and Surfaces for CAGD" (Bézier/B-spline math)
  6. Zölzer — "DAFX: Digital Audio Effects" (spectral modeling, time stretching)

Open Questions

  • Optimal Bézier degree for parameter trajectories (cubic? quartic?)
  • Adaptive knot placement strategy (or segment breakpoints, if piecewise Bézier is used instead of B-splines): how to decide where to add control points
  • Perceptual error metric for curve fitting (frequency-weighted? psychoacoustic model?)
  • Cymbal/complex inharmonic sound quality ceiling
  • Real-time rendering performance for dense track counts
  • Could the residual use a different curve-based representation than spectral envelope + noise?

rleroi commented Apr 14, 2026

.vec Format (JSON for PoC, binary later)

{
  "version": 1,
  "sampleRate": 44100,
  "duration": 3.0,
  "sinusoidalTracks": [
    {
      "birth": 0.0,
      "death": 3.0,
      "freq": { "degree": 3, "controlPoints": [[0, 440], [1, 442], [2, 441], [3, 440]] },
      "amp":  { "degree": 3, "controlPoints": [[0, 0], [0.3, 0.8], [2.5, 0.7], [3, 0]] }
    }
  ],
  "noiseBands": [
    {
      "freqLow": 2000,
      "freqHigh": 5000,
      "birth": 0.0,
      "death": 0.5,
      "amp": { "degree": 3, "controlPoints": [[0, 0], [0.01, 0.3], [0.1, 0.1], [0.5, 0]] }
    }
  ],
  "transients": [
    {
      "time": 0.0,
      "duration": 0.003,
      "amp": { "degree": 2, "controlPoints": [[0, 0], [0.001, 1.0], [0.003, 0]] },
      "spectrum": { "degree": 2, "controlPoints": [[200, 0.2], [2000, 1.0], [10000, 0.3]] }
    }
  ]
}
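A quick sanity checker for files in this shape might look like the sketch below. The rules it enforces are assumptions read off the example, not a spec: each curve is a single segment with degree + 1 control points, first coordinates are non-decreasing, and noise bands stay below Nyquist. `validate_vec` is a hypothetical name.

```python
import json

def check_curve(curve, label, problems):
    """Append a message to problems for each rule the curve breaks."""
    points = curve.get("controlPoints", [])
    degree = curve.get("degree")
    if degree is None or len(points) != degree + 1:
        problems.append(f"{label}: expected degree + 1 control points")
    xs = [p[0] for p in points]
    if any(b < a for a, b in zip(xs, xs[1:])):
        problems.append(f"{label}: first coordinates must be non-decreasing")

def validate_vec(doc):
    """Return a list of problems found in a parsed .vec document (empty if none)."""
    problems = []
    nyquist = doc["sampleRate"] / 2
    for i, track in enumerate(doc.get("sinusoidalTracks", [])):
        if not 0 <= track["birth"] < track["death"] <= doc["duration"]:
            problems.append(f"sinusoidalTracks[{i}]: birth/death outside 0..duration")
        check_curve(track["freq"], f"sinusoidalTracks[{i}].freq", problems)
        check_curve(track["amp"], f"sinusoidalTracks[{i}].amp", problems)
    for i, band in enumerate(doc.get("noiseBands", [])):
        if not 0 < band["freqLow"] < band["freqHigh"] <= nyquist:
            problems.append(f"noiseBands[{i}]: need 0 < freqLow < freqHigh <= Nyquist")
        check_curve(band["amp"], f"noiseBands[{i}].amp", problems)
    for i, hit in enumerate(doc.get("transients", [])):
        if not 0 <= hit["time"] <= doc["duration"]:
            problems.append(f"transients[{i}]: time outside 0..duration")
        check_curve(hit["amp"], f"transients[{i}].amp", problems)
        check_curve(hit["spectrum"], f"transients[{i}].spectrum", problems)
    return problems

text = '{"version": 1, "sampleRate": 44100, "duration": 3.0, "sinusoidalTracks": []}'
print(validate_vec(json.loads(text)))  # []
```

Returning a list instead of raising on the first error makes it easier to report everything wrong with a hand-edited file in one pass.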
