Fixes voice messages being embedded as raw binary instead of just the transcription
When users send voice messages through Telegram, Discord, or other platforms, OpenClaw was incorrectly processing the audio files:
- Voice messages were being treated as text files, specifically `text/tab-separated-values`
- Raw audio binary was embedded into the chat context: this could be 500-8000+ tokens per message
- Sessions would hit the token limit — after just 3-5 voice messages, sessions would hit 200K+ tokens
- Bot would stop responding — once the token limit was reached, the bot couldn't process new messages
- Bot stops responding to voice-only messages
- Session token count explodes to 200K+ with just a few voice notes
- Logs show `<file name="...ogg" mime="text/tab-separated-values">`
- Voice messages work when downloaded separately but fail through the bot
| Voice Length | Wasted Tokens | Effect |
|---|---|---|
| 10-15 seconds | 500-2,000 | Annoying |
| 30-60 seconds | 2,000-8,000+ | Session becomes unusable |
| 2+ minutes | 15,000-50,000+ | Immediate crash |
OGG audio files (used by Telegram voice messages as OGG/Opus) have ASCII-heavy headers:
- OGG header starts with "OggS" — all printable ASCII characters
- Vorbis/Opus metadata contains extensive ASCII text: fields like `ENCODER=`, `TITLE=`, etc.
- Metadata uses tab characters: the `guessDelimitedMime()` function sees tabs and classifies the file as TSV
The file type detection in OpenClaw works like this:
1. looksLikeUtf8Text() checks if >85% of bytes are printable ASCII
→ OGG files PASS this check (ASCII-heavy headers)
2. guessDelimitedMime() looks for tabs and commas
→ OGG metadata has tabs, so it's classified as TSV
3. File gets embedded as <file name="...ogg" mime="text/tab-separated-values">
→ Raw binary floods the context
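
The detection flow above can be illustrated with a small sketch. This is not OpenClaw's actual code: `printableRatio` and `sketchClassify` are hypothetical stand-ins written from the description alone, and the sample buffer is synthetic (an `OggS` prefix, a few binary bytes, then ASCII metadata containing a tab). It only shows why a ratio-based text check followed by a delimiter check can label an audio header as TSV.

```javascript
// Hypothetical re-creation of the two heuristics described above.
// NOT OpenClaw's real implementation; the 85% threshold comes from the text.
function printableRatio(buffer) {
  if (buffer.length === 0) return 0;
  let printable = 0;
  for (const byte of buffer) {
    // Count tab, LF, CR, and the visible ASCII range 0x20-0x7E.
    if (byte === 0x09 || byte === 0x0a || byte === 0x0d ||
        (byte >= 0x20 && byte <= 0x7e)) {
      printable++;
    }
  }
  return printable / buffer.length;
}

function sketchClassify(buffer) {
  // Step 1: "looks like text" if more than 85% of bytes are printable.
  if (printableRatio(buffer) <= 0.85) return "binary";
  // Step 2: guess a delimited format from tabs, then commas.
  const text = buffer.toString("latin1");
  if (text.includes("\t")) return "text/tab-separated-values";
  if (text.includes(",")) return "text/csv";
  return "text/plain";
}

// Synthetic sample, for illustration only.
const sample = Buffer.concat([
  Buffer.from("OggS"),
  Buffer.from([0x00, 0x02, 0x00, 0x00]),
  Buffer.from("OpusTags ENCODER=example\tTITLE=voice note"),
]);

console.log(sketchClassify(sample));
```

Because neither step looks at the file signature, the leading `OggS` bytes have no influence on the result. That gap is what the magic-byte check in the patch closes.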
OpenClaw had a kind === "audio" check, but this only worked when:
- The file was correctly identified as audio BEFORE buffer reading
- The MIME type was properly set by the sender
Telegram doesn't always send proper MIME types for voice messages, so the files would slip through the initial detection and get reclassified as text later in the pipeline.
This patch adds three layers of defense to prevent audio files from being embedded as text:

    if (!forcedTextMime && (kind === "image" || kind === "audio" || kind === "video")) {
      continue;
    }

Files identified as audio, image, or video skip text extraction entirely, so no buffer reading is needed.

    const _patchAudioExts = new Set(['.ogg', '.opus', '.mp3', '.wav', '.aac', '.flac', '.m4a', '.oga', '.webm']);
    const _patchNameExt = nameHint ? path.extname(nameHint).toLowerCase() : '';
    if (_patchAudioExts.has(_patchNameExt)) {
      continue;
    }

Even if kind detection fails, files with audio extensions are skipped. This catches edge cases where the attachment type is misidentified.

    function hasBinaryAudioMagic(buffer) {
      // OGG container: "OggS" signature
      if (buffer.length >= 4 &&
          buffer[0] === 0x4F && // O
          buffer[1] === 0x67 && // g
          buffer[2] === 0x67 && // g
          buffer[3] === 0x53) { // S
        return true;
      }
      // MP3 with ID3v2: "ID3" signature (length guard added for consistency with the OGG check)
      if (buffer.length >= 3 &&
          buffer[0] === 0x49 && // I
          buffer[1] === 0x44 && // D
          buffer[2] === 0x33) { // 3
        return true;
      }
      return false;
    }

Checks the actual file bytes for OGG and MP3 (ID3v2-tagged) magic signatures. This prevents misclassification even if a file has no extension or the wrong MIME type.
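
To show how the three layers fit together, here is a minimal sketch that folds them into one predicate. `shouldSkipTextExtraction`, `startsWithBytes`, and the argument object are hypothetical names invented for this example. The real patch applies each check inline with `continue` inside the extraction loop, and the surrounding code is not shown here.

```javascript
const path = require("node:path");

// Same extension list as the patch; the constant name is illustrative.
const AUDIO_EXTS = new Set([".ogg", ".opus", ".mp3", ".wav", ".aac",
                            ".flac", ".m4a", ".oga", ".webm"]);

// Hypothetical helper: true when `buffer` begins with the given byte sequence.
function startsWithBytes(buffer, bytes) {
  return buffer.length >= bytes.length &&
         bytes.every((b, i) => buffer[i] === b);
}

// Hypothetical combined predicate; returns true when text extraction
// should be skipped for this attachment.
function shouldSkipTextExtraction({ kind, nameHint, buffer, forcedTextMime }) {
  // Layer 1: trust the attachment kind unless a text MIME was forced.
  if (!forcedTextMime &&
      (kind === "image" || kind === "audio" || kind === "video")) {
    return true;
  }
  // Layer 2: fall back to the file extension.
  const ext = nameHint ? path.extname(nameHint).toLowerCase() : "";
  if (AUDIO_EXTS.has(ext)) return true;
  // Layer 3: inspect the leading bytes ("OggS" or "ID3").
  if (buffer &&
      (startsWithBytes(buffer, [0x4f, 0x67, 0x67, 0x53]) ||
       startsWithBytes(buffer, [0x49, 0x44, 0x33]))) {
    return true;
  }
  return false;
}

// An extensionless upload with an unhelpful kind is still caught by layer 3.
console.log(shouldSkipTextExtraction({
  kind: "document",
  nameHint: "blob",
  buffer: Buffer.from("OggS\0\0"),
}));
```

The checks run from cheapest to most expensive, so the buffer is only inspected when both the kind and the file name fail to identify the attachment.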
Voice transcriptions still work perfectly! This patch only prevents the raw audio binary from being embedded. The transcription is handled separately by the audio capability pipeline and appears in the conversation as expected.
macOS (Homebrew):
/opt/homebrew/lib/node_modules/openclaw/
Linux (nvm):
~/.nvm/versions/node/v22.22.0/lib/node_modules/openclaw/
Linux (npm global):
/usr/lib/node_modules/openclaw/
Windows:
%APPDATA%\npm\node_modules\openclaw\
Option 1: Automated (recommended)
- Download the `fix-audio-binary.patch` and the `optional-apply.sh` files into a directory.
- Open your CLI/Terminal and `cd` to that directory.
- Run `bash optional-apply.sh`
- It should automatically apply the patch for you.
Option 2: Manual
- Download the patch file
- Run: `patch -p1 < openclaw-voice-message-fix.patch /path/to/openclaw/dist/media-understanding/apply.js`
- Restart: `openclaw gateway restart`
After applying the patch:
- Send a voice message to your bot
- Check that the bot responds with the transcription
- Verify no binary in context: you should NOT see `<file name="...ogg" mime="text/tab-separated-values">` in session files
- Check token count stays reasonable: a 30-second voice note should add ~500-1000 tokens for the transcription, not 8000+
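
The "no binary in context" check can be scripted. The sketch below assumes you have already read a session file into a string; the session file location and format are not specified here, so that part is left to you. `countEmbeddedAudio` and the regular expression are hypothetical and only match the `name` then `mime` attribute order seen in the log line above.

```javascript
// Hypothetical check: counts <file> tags where an audio file name is paired
// with a text MIME type, which is the signature of the bug.
const EMBEDDED_AUDIO_RE =
  /<file\s+name="[^"]*\.(?:ogg|oga|opus|mp3)"\s+mime="text\/(?:tab-separated-values|csv|plain)"/gi;

function countEmbeddedAudio(sessionText) {
  const matches = sessionText.match(EMBEDDED_AUDIO_RE);
  return matches ? matches.length : 0;
}

// Illustrative strings, not real session output.
const before = 'user: <file name="voice_123.ogg" mime="text/tab-separated-values">OggS...</file>';
const after = "user: [transcription] hello there";

console.log(countEmbeddedAudio(before), countEmbeddedAudio(after));
```

A non-zero count after patching suggests the patch was not applied to the running installation or was overwritten by an update.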
| Before Fix | After Fix |
|---|---|
| Raw binary embedded | Only transcription text |
| 500-8000+ tokens per voice | ~500-1000 tokens (transcription only) |
| Session crashes after 3-5 voices | Session handles many voices normally |
This fix is based on work by the OpenClaw community:
@hclsys — Original magic byte detection approach
- PR #3904: `fix(media): detect binary audio by magic bytes to prevent text misidentification`
- Identified the root cause and implemented the OGG/MP3 magic byte detection
@null-runner — Extension-based skip and defense-in-depth
- PR #4235: `fix(media): skip audio in extractFileBlocks + hasBinaryAudioMagic defense-in-depth`
- Combined both early skip (by kind) and magic byte checks for comprehensive coverage
Community Contributors
Special thanks to everyone who reported, tested, and provided logs:
- @MarcBickel — Real-world testing and confirmation
- @angelsole — Production impact analysis
- @CodyAdam — Testing and feedback
- @felixmaechtle — Code review
- #1989 — Bug: Audio-only messages on Telegram not triggering agent response
- #4197 — Feature: Strip raw audio attachment after successful transcription
- #3904 — fix(media): detect binary audio by magic bytes
- #4235 — fix(media): skip audio in extractFileBlocks + defense-in-depth
- This is a temporary patch until the official PRs are merged
- The patch will be overwritten on `npm update`: reapply after updates
- Works for Telegram, Discord, and all voice message platforms
- Compatible with OpenClaw v2026.1.30 and earlier
This patch is provided as a temporary fix for the OpenClaw project. The original code and fixes are under the OpenClaw project license.
Last updated: February 2026