An audio file format that is to PCM what SVG is to bitmap: resolution-independent, mathematically defined, and inherently manipulable. Instead of storing discrete amplitude samples, store mathematical curve definitions (Bézier/B-spline control points) that describe the sound.
Raw PCM waveforms are too complex for efficient curve fitting. But if you first decompose audio via sinusoidal modeling (SMS — Serra & Smith, 1990), the resulting parameter trajectories (frequency, amplitude over time) are smooth, slowly-varying curves — exactly what Bézier curves represent efficiently.
Each track = one partial (harmonic or inharmonic).
- freq: Bézier curve (Hz over time)
- amp: Bézier curve (amplitude over time)
- birth/death: start and end time
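Handles: sustained notes, vocals, bass, pads, pitched instruments. A piano note might have 30 tracks × 2 curves (freq and amp) × ~5 control points each = ~300 values for a full second, versus 44,100 PCM samples at 44.1 kHz mono. The count roughly doubles if every control point also stores an explicit time coordinate.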
Bandpass-filtered noise with amplitude envelopes.
- freq_low/freq_high: frequency range (Hz)
- amp: Bézier curve (amplitude envelope over time)
Alternative: a 2D Bézier surface (frequency × time → amplitude) for continuous spectral envelope modeling.
Handles: breath, bow scrape, snare wires, cymbal wash, consonants in speech.
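
A noise band can be prototyped in a few lines. This is a minimal sketch, not a reference decoder: the function name `render_noise_band` is hypothetical, the amplitude envelope is assumed to be already sampled from its Bézier curve (one value per output sample), and a brick-wall FFT mask stands in for a proper band-pass filter.

```python
import numpy as np

def render_noise_band(freq_low, freq_high, amp_envelope, sample_rate, seed=0):
    """White noise, band-limited with an FFT mask, then shaped by an envelope."""
    amp_envelope = np.asarray(amp_envelope, dtype=float)
    num = len(amp_envelope)
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(num)

    # Zero every bin outside [freq_low, freq_high] (assumed stand-in for a real filter).
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(num, d=1.0 / sample_rate)
    spectrum[(freqs < freq_low) | (freqs > freq_high)] = 0.0
    band = np.fft.irfft(spectrum, n=num)

    band /= np.max(np.abs(band))   # normalize the peak to 1 before shaping
    return band * amp_envelope
```

A brick-wall mask rings in the time domain, so a real decoder would probably prefer a smoother filter; the sketch only shows the noise, filter, envelope order of operations.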
Short broadband bursts (~1-5ms).
- time: when it occurs
- shape: Bézier curve (amplitude envelope)
- spectrum: Bézier curve (spectral energy distribution)
- Or: a tiny PCM snippet (a few hundred samples)
Handles: drum stick impact, pick attack, plosives.
- Kick: freq sweep 150→60 Hz (sinusoidal) + beater click noise bands 1-5 kHz + transient at t=0
- Snare: short tone 150-250 Hz (sinusoidal) + snare wire noise bands 1-15 kHz + transient
- Hi-hat: no sinusoidal tracks, noise bands 3-18 kHz with fast decay + transient
- Cymbal: inharmonic sinusoidal tracks + broadband noise bands with slow decay + transient
- Time stretching: re-parameterize the Bézier curves over a longer time range. Only the time coordinates of the control points scale; the frequency and amplitude values are unchanged, so there is no pitch change. No phase vocoder artifacts. Transients stay sharp (they're point events, so they are moved, not stretched).
- Pitch shifting: multiply all frequency curves by a constant. Done.
- Harmonic editing: boost/suppress/remove individual partials.
- Sound morphing: interpolate control points between two sounds.
- Resolution independence: render at any sample rate.
- Extreme compression: smooth parameter curves compress to very few control points (potentially 200:1-400:1 for simple sounds).
- Procedural variation: perturb control points slightly for natural-sounding variation.
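
Several of the operations above reduce to arithmetic on control points, because a Bézier curve transforms the same way its control points do under scaling and linear blending. A minimal sketch, assuming each curve is a list of `[time, value]` control points as in the JSON example in this document. The function names are hypothetical, and `morph` assumes both curves have the same number of control points.

```python
def time_stretch(points, factor):
    """Scale the time coordinate of every control point; values are unchanged."""
    return [[t * factor, v] for t, v in points]

def pitch_shift(freq_points, ratio):
    """Multiply every frequency value by a constant ratio (2.0 = one octave up)."""
    return [[t, f * ratio] for t, f in freq_points]

def morph(points_a, points_b, mix):
    """Blend two curves' control points (mix=0 gives A, mix=1 gives B)."""
    if len(points_a) != len(points_b):
        raise ValueError("curves must have the same number of control points")
    return [[(1 - mix) * ta + mix * tb, (1 - mix) * va + mix * vb]
            for (ta, va), (tb, vb) in zip(points_a, points_b)]

freq = [[0, 440], [1, 442], [2, 441], [3, 440]]
stretched = time_stretch(freq, 2.0)    # same pitch contour over 6 s instead of 3 s
octave_up = pitch_shift(freq, 2.0)     # same timing, doubled frequencies
halfway = morph(freq, octave_up, 0.5)  # midpoint between the two curves
```

A full time stretch would also scale each track's birth and death times and each transient's trigger time by the same factor.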
For each output sample at time t:
- Evaluate each sinusoidal track's freq(t) and amp(t) Bézier curves
- Accumulate phase: phase(t) = 2π ∫ freq(τ) dτ, integrated from the track's birth to t (closed-form, because a Bézier curve is a polynomial in its parameter)
- output += amp(t) * sin(phase(t))
- Add noise bands: generate white noise, bandpass filter, shape with amp curve
- Add transients at their trigger times
- Sum all components
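
The sinusoidal part of this loop, including the closed-form phase, might look like the sketch below. It makes a simplifying assumption that differs from the JSON example in this document: each curve is a list of scalar control values spread uniformly over the track's lifetime, so the curve is a polynomial in normalized time u = (t - birth) / (death - birth). It relies on the standard result that the integral of a degree-n Bézier curve is a degree-(n+1) Bézier curve whose control values are running sums of the originals divided by n+1. All function names are hypothetical.

```python
import numpy as np
from math import comb

def bezier(values, u):
    """Evaluate a scalar Bézier curve with the given control values at u in [0, 1]."""
    n = len(values) - 1
    u = np.asarray(u, dtype=float)
    return sum(c * comb(n, i) * u**i * (1 - u)**(n - i)
               for i, c in enumerate(values))

def bezier_integral(values):
    """Control values of the curve equal to the integral of `values` from 0 to u."""
    n = len(values) - 1
    return np.concatenate(([0.0], np.cumsum(values))) / (n + 1)

def render_track(freq_values, amp_values, birth, death, sample_rate):
    """Render one sinusoidal track as amp(t) * sin(phase(t))."""
    duration = death - birth
    num = int(round(duration * sample_rate))
    u = np.arange(num) / (duration * sample_rate)   # normalized time in [0, 1)
    # phase = 2*pi * integral of freq dt, and dt = duration * du
    phase = 2 * np.pi * duration * bezier(bezier_integral(freq_values), u)
    return bezier(amp_values, u) * np.sin(phase)

# One second of a steady 440 Hz partial at full amplitude.
samples = render_track([440.0] * 4, [1.0] * 4, birth=0.0, death=1.0, sample_rate=8000)
```

A complete decoder would add each rendered track into the output buffer at its birth offset, then add the noise bands and transients.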
An additive synthesizer preset extracted from real audio. The file IS a synth patch. The encoder IS the analysis. The decoder IS the synth. It sits between MIDI (pure instructions, no timbre) and PCM (pure samples, no structure).
- This format wins: instruments, voice, synths, sound effects, game audio — structured sounds
- PCM wins: rain, crowd noise, field recordings — unstructured/stochastic sounds
- Analogous to SVG (illustrations) vs. PNG (photographs)
- Window the signal into overlapping frames (20-50ms, hop 5-10ms)
- FFT each frame → magnitude spectrum
- Peak picking → find sinusoidal components (use parabolic interpolation for sub-bin accuracy)
- Peak tracking across frames → form continuous sinusoidal tracks (birth/continuation/death)
- Resynthesize sinusoidal part, subtract from original → residual
- Model residual's spectral envelope per frame
- Detect transients (onset detection)
- Fit Bézier curves to all parameter trajectories (adaptive: more control points during vibrato/change, fewer during sustain)
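
The peak-picking step with parabolic interpolation can be sketched for a single frame as follows. This is an illustration only: the function name `interpolated_peak` is hypothetical, it finds just the single strongest peak, and it assumes a Hann window with no zero-padding. The interpolation fits a parabola through the dB magnitudes of the peak bin and its two neighbors and takes the vertex as the refined peak location.

```python
import numpy as np

def interpolated_peak(frame, sample_rate):
    """Return (frequency_hz, magnitude_db) of the strongest peak in one frame."""
    frame = np.asarray(frame, dtype=float)
    window = np.hanning(len(frame))
    mag_db = 20 * np.log10(np.abs(np.fft.rfft(frame * window)) + 1e-12)

    k = int(np.argmax(mag_db[1:-1])) + 1   # strongest bin, skipping the edges
    a, b, c = mag_db[k - 1], mag_db[k], mag_db[k + 1]

    denom = a - 2 * b + c
    offset = 0.0 if denom == 0 else 0.5 * (a - c) / denom   # vertex offset in bins
    peak_db = b - 0.25 * (a - c) * offset

    return (k + offset) * sample_rate / len(frame), peak_db

# A 1000 Hz sine analyzed with a 2048-sample frame (bin spacing about 21.5 Hz).
sr = 44100
t = np.arange(2048) / sr
freq_hz, peak_db = interpolated_peak(np.sin(2 * np.pi * 1000.0 * t), sr)
```

A full analyzer would repeat this for every local maximum above a threshold in every frame, then link the peaks across frames into tracks.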
- Single note proof of concept: analyze a piano note with SMS (use Python sms-tools), fit Bézier curves to tracks, resynthesize, A/B test
- Time stretch test: re-parameterize curves to 2x, compare to phase vocoder
- Residual modeling: spectral envelope as Bézier curves, resynthesize as filtered noise
- File format spec: define binary/JSON format for tracks + noise + transients
- Percussion test: decompose a drum loop into the three track types
- Polyphonic audio: use ML source separation (Demucs) as preprocessing, encode each stem independently
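
The curve-fitting part of the single-note proof of concept can be prototyped with ordinary least squares before committing to an adaptive scheme. A minimal sketch, assuming a fixed degree, a single Bézier segment per trajectory, and scalar control values spread uniformly over the trajectory's time span. The names `bernstein_matrix` and `fit_bezier` are hypothetical.

```python
import numpy as np
from math import comb

def bernstein_matrix(u, degree):
    """Matrix whose column i holds the i-th Bernstein basis polynomial at each u."""
    u = np.asarray(u, dtype=float)[:, None]
    i = np.arange(degree + 1)[None, :]
    binom = np.array([comb(degree, k) for k in range(degree + 1)], dtype=float)
    return binom * u**i * (1 - u)**(degree - i)

def fit_bezier(times, values, degree=3):
    """Least-squares control values for one Bézier segment, plus the max error."""
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    u = (times - times[0]) / (times[-1] - times[0])   # map time onto [0, 1]
    basis = bernstein_matrix(u, degree)
    control, *_ = np.linalg.lstsq(basis, values, rcond=None)
    max_error = float(np.max(np.abs(basis @ control - values)))
    return control, max_error

# Synthetic frequency trajectory: 100 analysis frames over 3 seconds.
times = np.linspace(0.0, 3.0, 100)
values = bernstein_matrix(times / 3.0, 3) @ np.array([440.0, 442.0, 441.0, 440.0])
control, max_error = fit_bezier(times, values, degree=3)
```

The returned `max_error` gives a simple hook for an adaptive strategy: if it exceeds a tolerance, split the trajectory and fit each half separately.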
- sms-tools (Python, Xavier Serra) — full SMS analysis/synthesis implementation
- librosa (Python) — STFT, peak picking, onset detection
- Loris (C++ with Python bindings) — sinusoidal modeling library
- scipy.interpolate — B-spline fitting
- scipy.optimize — least-squares curve fitting
- Demucs (Meta) — ML source separation for polyphonic preprocessing
- McAulay & Quatieri (1986) — "Speech Analysis/Synthesis Based on a Sinusoidal Representation" (foundational peak tracking)
- Serra & Smith (1990) — "Spectral Modeling Synthesis" (deterministic + stochastic decomposition)
- Serra PhD thesis (1989) — full treatment, freely available
- Driedger & Müller (2016) — "A Review of Time-Scale Modification of Music Signals" (survey of time stretching, good context)
- Farin — "Curves and Surfaces for CAGD" (Bézier/B-spline math)
- Zölzer — "DAFX: Digital Audio Effects" (spectral modeling, time stretching)
- Optimal Bézier degree for parameter trajectories (cubic? quartic?)
- Adaptive knot placement strategy — how to decide where to add control points
- Perceptual error metric for curve fitting (frequency-weighted? psychoacoustic model?)
- Cymbal/complex inharmonic sound quality ceiling
- Real-time rendering performance for dense track counts
- Could the residual use a different curve-based representation than spectral envelope + noise?
.vec Format (JSON for PoC, binary later)
    {
      "version": 1,
      "sampleRate": 44100,
      "duration": 3.0,
      "sinusoidalTracks": [
        {
          "birth": 0.0,
          "death": 3.0,
          "freq": { "degree": 3, "controlPoints": [[0, 440], [1, 442], [2, 441], [3, 440]] },
          "amp": { "degree": 3, "controlPoints": [[0, 0], [0.3, 0.8], [2.5, 0.7], [3, 0]] }
        }
      ],
      "noiseBands": [
        {
          "freqLow": 2000,
          "freqHigh": 5000,
          "birth": 0.0,
          "death": 0.5,
          "amp": { "degree": 3, "controlPoints": [[0, 0], [0.01, 0.3], [0.1, 0.1], [0.5, 0]] }
        }
      ],
      "transients": [
        {
          "time": 0.0,
          "duration": 0.003,
          "amp": { "degree": 2, "controlPoints": [[0, 0], [0.001, 1.0], [0.003, 0]] },
          "spectrum": { "degree": 2, "controlPoints": [[200, 0.2], [2000, 1.0], [10000, 0.3]] }
        }
      ]
    }
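
Because the control points here are `[time, value]` pairs, each curve is a parametric 2D Bézier: reading a value at a given time means first finding the curve parameter u whose time coordinate matches. The sketch below does that by bisection. It assumes the time coordinates of the control points never decrease (so time is monotonic in u); the function names are hypothetical and the format itself does not prescribe this method.

```python
import json
from math import comb

def bezier_point(points, u):
    """Evaluate a 2D Bézier curve at parameter u, returning (time, value)."""
    n = len(points) - 1
    x = y = 0.0
    for i, (px, py) in enumerate(points):
        w = comb(n, i) * u**i * (1 - u)**(n - i)   # Bernstein weight
        x += w * px
        y += w * py
    return x, y

def value_at(points, t, iterations=50):
    """Find the curve value at time t by bisecting on the parameter u."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if bezier_point(points, mid)[0] < t:
            lo = mid
        else:
            hi = mid
    return bezier_point(points, 0.5 * (lo + hi))[1]

track = json.loads(
    '{"freq": {"degree": 3, "controlPoints": [[0, 440], [1, 442], [2, 441], [3, 440]]}}'
)
freq_mid = value_at(track["freq"]["controlPoints"], 1.5)   # frequency at t = 1.5 s
```

Storing only scalar control values on a uniform time grid would avoid the root-finding step, at the cost of less flexible control-point placement.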