rleroi/vector-audio-idea.md

Created April 14, 2026 14:49

Vector Audio Format — Concept Notes

Core Idea

An audio file format that is to PCM what SVG is to bitmap: resolution-independent, mathematically defined, and inherently manipulable. Instead of storing discrete amplitude samples, store mathematical curve definitions (Bézier/B-spline control points) that describe the sound.

Key Insight

Raw PCM waveforms are too complex for efficient curve fitting. But if you first decompose audio via sinusoidal modeling (SMS — Serra & Smith, 1990), the resulting parameter trajectories (frequency, amplitude over time) are smooth, slowly-varying curves — exactly what Bézier curves represent efficiently.

Architecture: Three Track Types

1. Sinusoidal Tracks (tonal content)

Each track = one partial (harmonic or inharmonic).

freq: Bézier curve (Hz over time)
amp: Bézier curve (amplitude over time)
birth / death: start and end time

Handles: sustained notes, vocals, bass, pads, pitched instruments. A piano note might need 30 tracks × ~5 control points each, with a frequency and an amplitude value per control point: roughly 300 floats for a full second (vs. 44,100 PCM samples at 44.1 kHz).
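A minimal sketch of how one track curve could be evaluated at a given time. It assumes the simplest possible reading of the idea: each curve is a list of scalar control values over a normalized parameter u that runs linearly from birth to death. The names `bezier_value` and `track_value` are hypothetical, not part of any existing library.

```python
from typing import Sequence


def bezier_value(control_values: Sequence[float], u: float) -> float:
    """Evaluate a Bezier curve of any degree at u in [0, 1] (de Casteljau)."""
    points = list(control_values)
    while len(points) > 1:
        # Repeated linear interpolation between neighbouring points.
        points = [(1.0 - u) * a + u * b for a, b in zip(points, points[1:])]
    return points[0]


def track_value(control_values: Sequence[float], birth: float,
                death: float, t: float) -> float:
    """Map absolute time t (seconds) onto [0, 1], clamp, and evaluate."""
    u = (t - birth) / (death - birth)
    return bezier_value(control_values, min(1.0, max(0.0, u)))


# Hypothetical partial whose frequency drifts slightly over 3 seconds.
freq_points = [440.0, 442.0, 441.0, 440.0]
print(track_value(freq_points, 0.0, 3.0, 1.5))  # 441.125
```

If control points carry their own time coordinates, the time axis is itself a Bézier curve and has to be inverted before the value can be looked up. Whether to allow that is a format design decision these notes leave open.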

2. Noise Bands (textural/stochastic content)

Bandpass-filtered noise with amplitude envelopes.

freq_low / freq_high: frequency range (Hz)
amp: Bézier curve (amplitude envelope over time)

Alternative: a 2D Bézier surface (frequency × time → amplitude) for continuous spectral envelope modeling.

Handles: breath, bow scrape, snare wires, cymbal wash, consonants in speech.
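One way a single noise band might be rendered, as a rough sketch. The notes only say "bandpass-filtered noise", so the specific filter here (a second-order biquad bandpass with coefficients from the widely used Audio EQ Cookbook formulas) is an assumption, as are the function name `render_noise_band`, the uniform white-noise source, and treating the amplitude curve as scalar control values over normalized time.

```python
import math
import random


def bezier_value(values, u):
    """De Casteljau evaluation of scalar control values at u in [0, 1]."""
    pts = list(values)
    while len(pts) > 1:
        pts = [(1.0 - u) * a + u * b for a, b in zip(pts, pts[1:])]
    return pts[0]


def render_noise_band(freq_low, freq_high, amp_points, duration,
                      sample_rate=44100, seed=0):
    rng = random.Random(seed)
    # Centre frequency and Q derived from the band edges (assumed mapping).
    center = math.sqrt(freq_low * freq_high)
    q = center / (freq_high - freq_low)
    w0 = 2.0 * math.pi * center / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0          # b1 is zero for this bandpass
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0

    n = round(duration * sample_rate)
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for i in range(n):
        x0 = rng.uniform(-1.0, 1.0)           # white noise sample
        y0 = b0 * x0 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x0
        y2, y1 = y1, y0
        amp = bezier_value(amp_points, i / max(1, n - 1))
        out.append(amp * y0)                  # shape with the envelope
    return out


# Hypothetical band: 2-5 kHz hiss that fades out over half a second.
burst = render_noise_band(2000.0, 5000.0, [0.0, 0.3, 0.1, 0.0], duration=0.5)
print(len(burst))  # 22050
```

A single biquad has gentle slopes, so a real decoder would probably want a steeper filter or frequency-domain shaping; this only illustrates the signal flow.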

3. Transient Events (attacks, clicks, impacts)

Short broadband bursts (~1-5ms).

time: when it occurs
shape: Bézier curve (amplitude envelope)
spectrum: Bézier curve (spectral energy distribution)
Or: a tiny PCM snippet (few hundred samples)

Handles: drum stick impact, pick attack, plosives.

Percussion Decomposition Examples

Kick: freq sweep 150→60Hz (sinusoidal) + beater click noise bands 1-5kHz + transient at t=0
Snare: short tone 150-250Hz (sinusoidal) + snare wire noise bands 1-15kHz + transient
Hi-hat: no sinusoidal tracks, noise bands 3-18kHz with fast decay + transient
Cymbal: inharmonic sinusoidal tracks + broadband noise bands with slow decay + transient

Killer Features (What This Enables)

Time stretching: scale the time coordinates of the Bézier control points to a longer range. The frequency values are untouched, so pitch does not change, and no STFT phases are modified, which avoids phase-vocoder artifacts. Transients stay sharp (they're point events, so only their trigger times move).
Pitch shifting: multiply all frequency curves by a constant. Done.
Harmonic editing: boost/suppress/remove individual partials.
Sound morphing: interpolate control points between two sounds.
Resolution independence: render at any sample rate.
Extreme compression: smooth parameter curves compress to very few control points (potentially 200:1-400:1 for simple sounds).
Procedural variation: perturb control points slightly for natural-sounding variation.
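The first two features reduce to arithmetic on control points. A small sketch, assuming control points are stored as [time, value] pairs as in the JSON example in these notes; the function names are hypothetical.

```python
def time_stretch(control_points, factor):
    """Scale the time coordinate of every [time, value] control point."""
    return [[t * factor, v] for t, v in control_points]


def pitch_shift(control_points, semitones):
    """Multiply every frequency value by a constant ratio."""
    ratio = 2.0 ** (semitones / 12.0)
    return [[t, f * ratio] for t, f in control_points]


freq = [[0, 440], [1, 442], [2, 441], [3, 440]]
print(time_stretch(freq, 2.0))  # [[0.0, 440], [2.0, 442], [4.0, 441], [6.0, 440]]
print(pitch_shift(freq, 12))    # [[0, 880.0], [1, 884.0], [2, 882.0], [3, 880.0]]
```

A full time stretch would also scale each track's birth and death, the amplitude curve's time coordinates, and the transient trigger times by the same factor.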

Rendering Pipeline

For each output sample at time t:

Evaluate each sinusoidal track's freq(t) and amp(t) Bézier curves
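Accumulate phase: phase(t) = 2π ∫ freq(τ) dτ, integrated from the track's birth to t (closed-form when freq is a polynomial in t, as it is for a Bézier curve over a linear time parameter)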
output += amp(t) * sin(phase(t))
Add noise bands: generate white noise, bandpass filter, shape with amp curve
Add transients at their trigger times
Sum all components
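The sinusoidal part of the steps above can be sketched as follows. This version accumulates phase numerically, one sample at a time, instead of using a closed-form integral; that is a simplification chosen for brevity. The curve representation (scalar control values over normalized time) and the name `render_track` are assumptions.

```python
import math


def bezier_value(values, u):
    """De Casteljau evaluation of scalar control values at u in [0, 1]."""
    pts = list(values)
    while len(pts) > 1:
        pts = [(1.0 - u) * a + u * b for a, b in zip(pts, pts[1:])]
    return pts[0]


def render_track(freq_points, amp_points, birth, death, sample_rate=44100):
    """Render one partial as a list of samples covering [birth, death)."""
    n = round((death - birth) * sample_rate)
    phase = 0.0
    out = []
    for i in range(n):
        u = i / max(1, n - 1)
        freq = bezier_value(freq_points, u)
        amp = bezier_value(amp_points, u)
        out.append(amp * math.sin(phase))
        # Rectangle-rule approximation of 2*pi * integral of freq.
        phase += 2.0 * math.pi * freq / sample_rate
    return out


def mix(tracks, total_samples):
    """Sum rendered tracks given as (start_sample, samples) pairs."""
    out = [0.0] * total_samples
    for start, samples in tracks:
        for i, s in enumerate(samples):
            if 0 <= start + i < total_samples:
                out[start + i] += s
    return out


# Hypothetical steady 440 Hz partial with a swell-and-fade envelope.
tone = render_track([440.0] * 4, [0.0, 0.8, 0.7, 0.0], 0.0, 0.1)
print(len(tone))  # 4410
```

Noise bands and transients would be rendered separately and passed through the same `mix` step.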

The Format Is Essentially...

An additive synthesizer preset extracted from real audio. The file IS a synth patch. The encoder IS the analysis. The decoder IS the synth. It sits between MIDI (pure instructions, no timbre) and PCM (pure samples, no structure).

Where It Works Best vs. Where PCM Wins

This format wins: instruments, voice, synths, sound effects, game audio — structured sounds
PCM wins: rain, crowd noise, field recordings — unstructured/stochastic sounds
Analogous to SVG (illustrations) vs. PNG (photographs)

Analysis Pipeline (Encoder)

Window the signal into overlapping frames (20-50ms, hop 5-10ms)
FFT each frame → magnitude spectrum
Peak picking → find sinusoidal components (use parabolic interpolation for sub-bin accuracy)
Peak tracking across frames → form continuous sinusoidal tracks (birth/continuation/death)
Resynthesize sinusoidal part, subtract from original → residual
Model residual's spectral envelope per frame
Detect transients (onset detection)
Fit Bézier curves to all parameter trajectories (adaptive: more control points during vibrato/change, fewer during sustain)
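The peak-picking step with parabolic interpolation can be illustrated on a single frame. This is a toy sketch: it uses a naive O(N²) DFT and a Hann window to stay dependency-free, where a real encoder would use an FFT library and possibly a different window. The sample rate, frame length, and test frequency are arbitrary choices for the demo.

```python
import cmath
import math


def magnitude_spectrum_db(frame):
    """Naive DFT of a Hann-windowed frame, returned in dB per bin."""
    n = len(frame)
    windowed = [s * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
                for i, s in enumerate(frame)]
    mags = []
    for k in range(n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        mags.append(20.0 * math.log10(abs(acc) + 1e-12))
    return mags


def refine_peak(mags_db, k):
    """Fit a parabola through bins k-1, k, k+1; return the fractional bin."""
    a, b, c = mags_db[k - 1], mags_db[k], mags_db[k + 1]
    return k + 0.5 * (a - c) / (a - 2.0 * b + c)


sample_rate, n = 8000, 256          # bin spacing is 31.25 Hz
true_freq = 1010.0                  # deliberately between two bins
frame = [math.sin(2.0 * math.pi * true_freq * i / sample_rate)
         for i in range(n)]
mags = magnitude_spectrum_db(frame)
k = max(range(1, len(mags) - 1), key=lambda i: mags[i])
estimate = refine_peak(mags, k) * sample_rate / n
print(k * sample_rate / n, estimate)  # coarse bin frequency vs. refined estimate
```

The coarse estimate can be off by up to half a bin; the refined one should land much closer to the true frequency, which matters when the tracks are later fitted with tight frequency tolerances.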

Practical Build Path

Single note proof of concept: analyze a piano note with SMS (use Python sms-tools), fit Bézier curves to tracks, resynthesize, A/B test
Time stretch test: re-parameterize curves to 2x, compare to phase vocoder
Residual modeling: spectral envelope as Bézier curves, resynthesize as filtered noise
File format spec: define binary/JSON format for tracks + noise + transients
Percussion test: decompose a drum loop into the three track types
Polyphonic audio: use ML source separation (Demucs) as preprocessing, encode each stem independently

Key Tools & Libraries

sms-tools (Python, Xavier Serra) — full SMS analysis/synthesis implementation
librosa (Python) — STFT, peak picking, onset detection
Loris (C++ with Python bindings) — sinusoidal modeling library
scipy.interpolate — B-spline fitting
scipy.optimize — least-squares curve fitting
Demucs (Meta) — ML source separation for polyphonic preprocessing

Essential Reading

McAulay & Quatieri (1986) — "Speech Analysis/Synthesis Based on a Sinusoidal Representation" (foundational peak tracking)
Serra & Smith (1990) — "Spectral Modeling Synthesis" (deterministic + stochastic decomposition)
Serra PhD thesis (1989) — full treatment, freely available
Driedger & Müller (2016) — "A Review of Time-Scale Modification of Music Signals" (survey of time stretching, good context)
Farin — "Curves and Surfaces for CAGD" (Bézier/B-spline math)
Zölzer — "DAFX: Digital Audio Effects" (spectral modeling, time stretching)

Open Questions

Optimal Bézier degree for parameter trajectories (cubic? quartic?)
Adaptive knot placement strategy — how to decide where to add control points
Perceptual error metric for curve fitting (frequency-weighted? psychoacoustic model?)
Cymbal/complex inharmonic sound quality ceiling
Real-time rendering performance for dense track counts
Could the residual use a different curve-based representation than spectral envelope + noise?

rleroi commented Apr 14, 2026

PoC
Phase 1 — Decoder only
Input: Hand-authored .vec JSON
Output: Audio playback + visualization

Implement bezier.js — evaluateCubic(controlPoints, t) → value
Implement AudioWorklet that:
Reads .vec JSON
For each sample: evaluate all tracks' freq/amp curves, sum sinusoids
For noise bands: generate white noise, bandpass, shape with amp curve
Draw tracks on canvas — freq curves as colored lines, amp as opacity/thickness
Hard-code a simple test sound: single note with 5 harmonics decaying at different rates
Verify it sounds like a real note
Milestone: Paste JSON in a textarea → hear sound → see curves on canvas.
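The hard-coded test sound could be generated rather than typed by hand. A sketch in Python (the PoC decoder itself is described as JavaScript; this only produces the JSON it would read). The field names follow the .vec example in these notes; the 1/h amplitude roll-off and the rule that partial h lives for duration/h are illustrative assumptions, and `make_test_note` is a hypothetical helper.

```python
import json


def make_test_note(fundamental=220.0, harmonics=5, duration=2.0):
    tracks = []
    for h in range(1, harmonics + 1):
        freq = fundamental * h
        death = duration / h      # assumption: higher partials die sooner
        peak = 1.0 / h            # assumption: 1/h amplitude roll-off
        tracks.append({
            "birth": 0.0,
            "death": death,
            "freq": {"degree": 3, "controlPoints": [
                [0.0, freq], [death / 3, freq],
                [2 * death / 3, freq], [death, freq]]},
            "amp": {"degree": 3, "controlPoints": [
                [0.0, 0.0], [0.01, peak],
                [death / 2, peak / 4], [death, 0.0]]},
        })
    return {"version": 1, "sampleRate": 44100, "duration": duration,
            "sinusoidalTracks": tracks, "noiseBands": [], "transients": []}


note = make_test_note()
print(json.dumps(note["sinusoidalTracks"][0], indent=2))
```

Pasting `json.dumps(note)` into the textarea would exercise the decoder with five partials that decay at different rates.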

Phase 2 — Basic analyzer (FFT + peaks)
Input: Dropped WAV file (decoded via AudioContext.decodeAudioData)
Output: Peak data per frame

Implement FFT (or grab a tiny lib — fft.js is ~80 lines, MIT licensed)
Window function (Blackman-Harris for good sidelobe suppression)
Compute magnitude spectrum per frame (frameSize=2048, hop=512, ~86 frames/sec)
Peak picking: find local maxima above a threshold, parabolic interpolation for sub-bin freq accuracy
Visualize: spectrogram on canvas with peak dots overlaid
Milestone: Drop a WAV → see spectrogram with detected peaks highlighted.

Phase 3 — Peak tracking
Input: Peaks per frame from Phase 2
Output: Continuous sinusoidal tracks

Frame-to-frame nearest-frequency matching (within 50Hz threshold)
Track birth: unmatched peak starts a new track
Track death: unmatched track for N consecutive frames → end it
Minimum track length filter (discard tracks shorter than ~30ms — they're noise, not partials)
Visualize: draw tracks as continuous colored lines over the spectrogram
Milestone: Drop a piano note WAV → see clean harmonic lines on the spectrogram. Should clearly show fundamental + harmonics.
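A minimal sketch of the birth/continuation/death logic described in this phase. It is a greedy matcher: each active track claims the nearest unclaimed peak in turn, which is simpler than the matching in the McAulay and Quatieri paper and can mis-assign peaks when partials cross. The data layout and the names `track_peaks` and `drop_short` are assumptions.

```python
def track_peaks(frames, max_jump_hz=50.0, max_gap_frames=2):
    """frames: one list of peak frequencies (Hz) per analysis frame."""
    active, finished = [], []
    for index, peaks in enumerate(frames):
        unmatched = list(peaks)
        for track in active:
            last = track["freqs"][-1]
            best = min(unmatched, key=lambda f: abs(f - last), default=None)
            if best is not None and abs(best - last) <= max_jump_hz:
                # Continuation: extend the track with the nearest peak.
                track["frames"].append(index)
                track["freqs"].append(best)
                track["misses"] = 0
                unmatched.remove(best)
            else:
                track["misses"] += 1
        # Death: tracks unmatched for too many consecutive frames end here.
        finished += [t for t in active if t["misses"] > max_gap_frames]
        active = [t for t in active if t["misses"] <= max_gap_frames]
        # Birth: every leftover peak starts a new track.
        active += [{"frames": [index], "freqs": [f], "misses": 0}
                   for f in unmatched]
    return finished + active


def drop_short(tracks, min_frames=3):
    """Minimum track length filter (about 30 ms at a 512-sample hop)."""
    return [t for t in tracks if len(t["frames"]) >= min_frames]


frames = [[440.0, 880.0], [441.0, 882.0], [442.0], [443.0, 2000.0]]
tracks = track_peaks(frames)
print([t["freqs"] for t in tracks])
# [[440.0, 441.0, 442.0, 443.0], [880.0, 882.0], [2000.0]]
print(len(drop_short(tracks)))  # 1
```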

Phase 4 — Bézier curve fitting
Input: Raw tracks (arrays of freq/amp values per frame)
Output: Bézier control points per track

Start simple: fit one cubic Bézier per track segment
Least-squares fitting: minimize sum of squared errors between curve and data points
Adaptive: if error exceeds threshold, split segment and fit two curves (recursive subdivision)
Visualize: overlay smooth Bézier curves on top of raw tracked data points
Error display: show max/avg deviation per track
Milestone: See smooth curves closely following the raw tracked data. Quantify the fit error.
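The fit-then-split strategy might look like the sketch below. Several simplifications are assumed: samples are uniformly spaced in time, each segment's endpoints are pinned to its first and last sample (so only the two inner control values are solved for), and splitting always happens at the midpoint. Function names are hypothetical.

```python
import math


def cubic(p, u):
    """Evaluate a cubic Bezier with scalar control values p[0..3] at u."""
    v = 1.0 - u
    return v**3 * p[0] + 3*u*v*v * p[1] + 3*u*u*v * p[2] + u**3 * p[3]


def fit_cubic(values):
    """Least-squares cubic through uniformly spaced samples (needs >= 4)."""
    n = len(values)
    p0, p3 = values[0], values[-1]
    s11 = s12 = s22 = r1 = r2 = 0.0
    for i, y in enumerate(values):
        u = i / (n - 1)
        v = 1.0 - u
        b1, b2 = 3 * u * v * v, 3 * u * u * v
        resid = y - v**3 * p0 - u**3 * p3   # part the endpoints cannot explain
        s11 += b1 * b1
        s12 += b1 * b2
        s22 += b2 * b2
        r1 += b1 * resid
        r2 += b2 * resid
    det = s11 * s22 - s12 * s12             # 2x2 normal equations
    p1 = (r1 * s22 - r2 * s12) / det
    p2 = (s11 * r2 - s12 * r1) / det
    return [p0, p1, p2, p3]


def max_error(curve, values):
    n = len(values)
    return max(abs(cubic(curve, i / (n - 1)) - y) for i, y in enumerate(values))


def fit_adaptive(values, tolerance):
    """Recursive subdivision: split at the midpoint while the fit is poor."""
    curve = fit_cubic(values)
    if max_error(curve, values) <= tolerance or len(values) < 8:
        return [curve]
    mid = len(values) // 2
    return (fit_adaptive(values[:mid + 1], tolerance)
            + fit_adaptive(values[mid:], tolerance))


# Hypothetical track: 3 cycles of +/-5 Hz vibrato around 440 Hz, 64 frames.
vibrato = [440.0 + 5.0 * math.sin(2.0 * math.pi * 3 * i / 63) for i in range(64)]
segments = fit_adaptive(vibrato, tolerance=0.5)
print(len(segments))  # number of cubic segments needed
```

Splitting at the point of maximum error, or enforcing continuity of slope between segments, are natural refinements once the basic loop works.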

Phase 5 — Residual extraction + noise modeling
Input: Original PCM + sinusoidal reconstruction from fitted curves
Output: Noise band parameters

Resynthesize sinusoidal part from the Bézier tracks
Subtract from original → residual signal
Compute spectral envelope of residual per frame (LPC or cepstral smoothing)
Segment into frequency bands, fit amplitude Bézier curves to each band's energy over time
Visualize: show residual waveform + its spectral envelope
Milestone: Play sinusoidal-only, residual-only, and combined. Combined should sound close to original.

Phase 6 — Full round-trip
Input: WAV file
Output: .vec JSON + resynthesized audio + A/B comparison

Wire it all together: drop → analyze → fit → export .vec JSON
Import .vec → synthesize via AudioWorklet
A/B toggle button: instant switch between original PCM and resynthesized
Display file size comparison (PCM bytes vs .vec JSON bytes)
Time stretch slider: re-parameterize curves → hear stretched audio
Pitch shift slider: multiply freq curves → hear shifted audio
Milestone: The full demo. Drop a file, see it decompose, play it back, stretch it, shift it.

Test Sounds (progressive difficulty)
Pure sine wave (trivial — one track, constant freq/amp)
Sawtooth wave (multiple harmonics, constant)
Single piano note (harmonics with amplitude decay — first real test)
Vocal "aah" (vibrato, formants)
Snare drum (mostly noise + transient — tests the non-sinusoidal path)
Short melody (multiple notes, track births and deaths)
Two instruments together (polyphonic — the hard case)

Key Parameters to Tune
Frame size: 2048 samples (good freq resolution at 44.1kHz → ~21.5Hz per bin)
Hop size: 512 samples (~11.6ms between frames)
Window: Blackman-Harris (good sidelobe suppression for peak picking)
Peak threshold: 60dB below the frame maximum (ignore noise floor)
Tracking threshold: 50Hz max frequency jump between frames
Min track length: 30ms (~3 frames)
Bézier fit error threshold: TBD — needs experimentation. Start with max 1Hz freq error, 1% amp error
Max Bézier degree: cubic (degree 3) for PoC, higher if needed
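The derived figures in this list follow directly from the sample rate, frame size, and hop size, and can be checked with a few lines. The Blackman-Harris helper uses the standard 4-term coefficients in symmetric form; the variable names are just for this sketch.

```python
import math

sample_rate = 44100
frame_size = 2048
hop_size = 512

bin_width_hz = sample_rate / frame_size             # 21.533203125
frame_ms = 1000.0 * frame_size / sample_rate        # about 46.4 ms
hop_ms = 1000.0 * hop_size / sample_rate            # about 11.6 ms
frames_per_second = sample_rate / hop_size          # 86.1328125
min_track_frames = math.ceil(30.0 / hop_ms)         # 3 frames cover 30 ms
peak_threshold_linear = 10.0 ** (-60.0 / 20.0)      # 60 dB down = 0.001


def blackman_harris(n):
    """4-term Blackman-Harris window, symmetric form, length n."""
    a0, a1, a2, a3 = 0.35875, 0.48829, 0.14128, 0.01168
    return [a0
            - a1 * math.cos(2.0 * math.pi * i / (n - 1))
            + a2 * math.cos(4.0 * math.pi * i / (n - 1))
            - a3 * math.cos(6.0 * math.pi * i / (n - 1))
            for i in range(n)]


window = blackman_harris(frame_size)
print(round(bin_width_hz, 2), round(hop_ms, 2), round(frames_per_second, 1))
# 21.53 11.61 86.1
```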

Stretch Goals (after PoC works)
Onset detection for automatic transient extraction
Interactive control point dragging with live audio update
Binary .vec format (MessagePack or custom) for realistic file size comparison
Microphone input → live analysis
WebGL spectrogram for smoother rendering
WASM FFT for analyzing longer files
Export resynthesized audio as WAV

rleroi commented Apr 14, 2026

.vec Format (JSON for PoC, binary later)

{
  "version": 1,
  "sampleRate": 44100,
  "duration": 3.0,
  "sinusoidalTracks": [
    {
      "birth": 0.0,
      "death": 3.0,
      "freq": { "degree": 3, "controlPoints": [[0, 440], [1, 442], [2, 441], [3, 440]] },
      "amp":  { "degree": 3, "controlPoints": [[0, 0], [0.3, 0.8], [2.5, 0.7], [3, 0]] }
    }
  ],
  "noiseBands": [
    {
      "freqLow": 2000,
      "freqHigh": 5000,
      "birth": 0.0,
      "death": 0.5,
      "amp": { "degree": 3, "controlPoints": [[0, 0], [0.01, 0.3], [0.1, 0.1], [0.5, 0]] }
    }
  ],
  "transients": [
    {
      "time": 0.0,
      "duration": 0.003,
      "amp": { "degree": 2, "controlPoints": [[0, 0], [0.001, 1.0], [0.003, 0]] },
      "spectrum": { "degree": 2, "controlPoints": [[200, 0.2], [2000, 1.0], [10000, 0.3]] }
    }
  ]
}
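A loader for this JSON would probably want a few sanity checks before synthesis. The sketch below assumes one Bézier segment per curve (so a degree-d curve has exactly d + 1 control points, as in the example above) and compares the JSON size against 16-bit mono PCM; both are assumptions, and `check_curve` and `validate_vec` are hypothetical names.

```python
import json


def check_curve(curve):
    """A single degree-d Bezier segment needs exactly d + 1 control points."""
    points = curve["controlPoints"]
    if len(points) != curve["degree"] + 1:
        raise ValueError("control point count does not match degree")
    if any(len(p) != 2 for p in points):
        raise ValueError("each control point must be an [x, value] pair")


def validate_vec(doc):
    for track in doc.get("sinusoidalTracks", []):
        if not 0.0 <= track["birth"] < track["death"] <= doc["duration"]:
            raise ValueError("track lifetime outside file duration")
        check_curve(track["freq"])
        check_curve(track["amp"])
    for band in doc.get("noiseBands", []):
        if band["freqLow"] >= band["freqHigh"]:
            raise ValueError("noise band has empty frequency range")
        check_curve(band["amp"])
    for event in doc.get("transients", []):
        check_curve(event["amp"])
        check_curve(event["spectrum"])
    return True


doc = {
    "version": 1, "sampleRate": 44100, "duration": 3.0,
    "sinusoidalTracks": [{
        "birth": 0.0, "death": 3.0,
        "freq": {"degree": 3, "controlPoints": [[0, 440], [1, 442], [2, 441], [3, 440]]},
        "amp": {"degree": 3, "controlPoints": [[0, 0], [0.3, 0.8], [2.5, 0.7], [3, 0]]},
    }],
    "noiseBands": [], "transients": [],
}
validate_vec(doc)

# Size comparison against 16-bit mono PCM of the same duration (assumed).
pcm_bytes = round(doc["duration"] * doc["sampleRate"]) * 2
vec_bytes = len(json.dumps(doc).encode("utf-8"))
print(pcm_bytes, vec_bytes)
```

JSON is a verbose container, so the byte comparison flatters PCM less than a binary .vec encoding would; it is still a quick first indicator.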
