@G36maid
Created June 16, 2026 12:34
AkuMa — Pitch-aware audio output design (text + tone → sound)

Design: Pitch-aware audio output (text + tone → sound)

For AkuMa (Next.js 16 + React 19 + TypeScript, the Japanese furigana + pitch-accent tool). Feature: given a word's reading and its pitch/tone pattern, fetch audio that pronounces it with the correct pitch.


0. Current state (verified)

  • Zero audio infrastructure in the repo. grep for Audio, speechSynthesis, mp3, tts, utterance, HTMLAudioElement across src/ → 0 matches.
  • No audio/TTS libraries in package.json.
  • The only API is /api/mark-accent/stream (src/app/api/mark-accent/stream/route.ts) — a same-origin server proxy to upstream api.sessatakuma.dev that streams text → accent-analysis NDJSON. It returns no audio. proxy.config.js (repo root) exposes only buildMarkAccentStreamUrl, DEFAULT_MARK_ACCENT_UPSTREAM_URL (https://api.sessatakuma.dev/v1/mark-accent), isMarkAccentProxyLoop.
  • Upstream api.sessatakuma.dev has no audio endpoint (confirmed with the team). So a pitch-aware engine must be brought up separately.

1. Engine choice — and why it must be VOICEVOX-class

The requirement "fetch sound using text and tone" is the deciding factor. It rules out two whole categories:

  • Generic/cloud TTS (Google / Azure / speechSynthesis) — no pitch input. Reads 箸 and 橋 with whatever contour the voice happens to use. Fails the requirement by definition.
  • OpenJTalk (lightweight open-source JP TTS) — auto-predicts accent from its dictionary; forcing an arbitrary user-edited tone requires rewriting full-context label strings (/A: fields). Doesn't cleanly accept "tone as input." Reject for this use case.

Recommendation: VOICEVOX ENGINE (or its fork COEIROINK), self-hosted in Docker.

Why it fits: VOICEVOX exposes an AudioQuery object whose accent_phrases[].moras[] each carry an explicit pitch float. You POST the text → get the query → overwrite each mora's pitch with the tone computed from the app's Word.accent → POST back to /synthesis → get a WAV stream. The tone is a direct input, which is exactly what "text + tone" demands. Bonus: neural voice quality, per-mora granularity, free/open-source.


2. The core mapping: app accent pattern → per-mora pitch

This is the heart of the feature, so it's worth being precise.

The app's data model (core/word/accentTypes.ts):

export const AccentValue = { None: 0, High: 1, Drop: 2 } as const;
export type AccentValueType = (typeof AccentValue)[keyof typeof AccentValue];

export interface FuriganaItem { text: string; accent: AccentValueType; }

export interface Word {
    surface: string;
    furigana: FuriganaItem[];
    accent: AccentValueType | AccentValueType[];
}

Flatten Word → { reading: string; moraTones: AccentValueType[] } before sending to the route:

  • reading = concatenation of all furigana[].text (pure kana — what the engine must speak).
  • moraTones = the per-mora accent array. For kana-only words it's word.accent (already AccentValueType[]); for kanji-mixed words, resolve from word.furigana[].accent plus WordAnnotationModel.kanaAccents (per annotationLayout.ts). One tone value per kana character/mora.
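The flatten step can be sketched as a small pure helper. This is a minimal sketch, not the repo's code: the types are copied from the data model above, while `FlatWord`, `ToneResolver`, and `flattenWord` are hypothetical names. The kanji-mixed path is delegated to a caller-supplied resolver because the `WordAnnotationModel.kanaAccents` logic is not shown here, and the fallback to `surface` when `furigana` is empty is an assumption.

```typescript
const AccentValue = { None: 0, High: 1, Drop: 2 } as const;
type AccentValueType = (typeof AccentValue)[keyof typeof AccentValue];

interface FuriganaItem { text: string; accent: AccentValueType }

interface Word {
  surface: string;
  furigana: FuriganaItem[];
  accent: AccentValueType | AccentValueType[];
}

// Hypothetical wire shape sent to the audio route.
interface FlatWord { reading: string; moraTones: AccentValueType[] }

// Hypothetical callback for kanji-mixed words; the real resolution lives in
// the renderer (annotationLayout.ts) and is not reproduced here.
type ToneResolver = (word: Word) => AccentValueType[];

function flattenWord(word: Word, resolveMixed: ToneResolver): FlatWord {
  // reading = concatenation of all furigana texts (pure kana).
  // Assumption: fall back to the surface form if furigana is empty.
  const reading = word.furigana.map((f) => f.text).join("") || word.surface;

  // Kana-only words already carry a per-mora array; copy it so callers
  // cannot mutate the editor's state through the payload.
  const moraTones = Array.isArray(word.accent)
    ? [...word.accent]
    : resolveMixed(word);

  return { reading, moraTones };
}
```

Keeping the resolver as a parameter lets the hook and the renderer pass the same function, so the tone sent to the engine cannot drift from the tone drawn on screen.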

Map AccentValueType → VOICEVOX mora pitch (in the route, server-side):

High (1)  → HIGH_PITCH  (≈ 5.7)
Drop (2)  → HIGH_PITCH then step-down (the accent nucleus; next mora goes low)
None (0)  → LOW_PITCH   (≈ 3.0)

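VOICEVOX reports 0.0 for unvoiced morae; voiced morae returned by /audio_query usually sit roughly between 5 and 6.5 (the value appears to be a logarithmic F0). Treat HIGH=5.7 / LOW=3.0 as placeholders: 3.0 sits far below that voiced range, so LOW will probably need to move much closer to HIGH during tuning. The accent nucleus (Drop) needs no value of its own, because it is expressed by the High→Low transition the app already encodes.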

⚠️ Validate before finalizing: confirm Drop's exact semantics by reading Kana.tsx's dot rendering. The mapping above is the design intent; the real values get tuned once you can A/B against the editor's visual pattern.
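The mapping above can be written as a one-line lookup. This is a sketch under the design intent stated here: `tonesToPitches` is a hypothetical name, the constants are the untuned placeholder values from this section, and Drop is treated as a high mora because the following low mora is already present in the tone array.

```typescript
// 0 = None, 1 = High, 2 = Drop (mirrors AccentValue in the app).
type Tone = 0 | 1 | 2;

// Placeholder values from this design; expect to tune them by ear.
const HIGH_PITCH = 5.7;
const LOW_PITCH = 3.0;

// Map each mora's tone to an engine pitch value.
// Drop is pronounced high; the step-down comes from the next mora being None.
function tonesToPitches(tones: Tone[]): number[] {
  return tones.map((tone) => (tone === 0 ? LOW_PITCH : HIGH_PITCH));
}
```

Because the function is pure and lives server-side, retuning the constants never changes the wire format.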


3. Backend: new same-origin audio route

Mirror the existing mark-accent/stream route exactly in security posture.

proxy.config.js (repo root) — extend, don't duplicate:

export const MARK_TTS_PROXY_PATH    = '/api/mark-accent/audio';
export const DEFAULT_TTS_ENGINE_URL = 'http://tts.internal:50021';  // example value only; 50021 is the VOICEVOX default port

export function buildMarkAccentAudioUrl(upstream) { /* Next route path */ }
export function isTtsProxyLoop(requestHost, engineUrl) { /* same pattern as isMarkAccentProxyLoop */ }
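One possible shape for the loop guard, written in TypeScript for illustration. The body of the existing `isMarkAccentProxyLoop` is not shown in this document, so the host comparison below is an assumption about what "same pattern" means, and treating an unparsable engine URL as unsafe is a choice made for this sketch.

```typescript
// Returns true when the configured engine URL points back at the host that is
// serving the request, which would make the route call itself.
function isTtsProxyLoop(requestHost: string, engineUrl: string): boolean {
  let engineHost: string;
  try {
    // URL.host includes the port when it is non-default, like a Host header.
    engineHost = new URL(engineUrl).host;
  } catch {
    // Assumption: refuse to proxy to a URL that cannot be parsed.
    return true;
  }
  return engineHost.toLowerCase() === requestHost.toLowerCase();
}
```

The route would call this with the request's Host header and the `TTS_ENGINE_URL` value before making any engine call.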

New route src/app/api/mark-accent/audio/route.ts:

  • POST body: { reading: string; moraTones: number[]; speakerId?: number } (resolved tone array from the client — thin payload, no secret leakage).
  • Server-side:
    1. Same origin check as the existing route (extractRequestOrigin / extractRequestHostOrigin + ALLOWED_DEV_ORIGINS). Lift into a shared helper or copy.
    2. TTS_ENGINE_URL / TTS_ENGINE_API_KEY from env (analogous to MARK_ACCENT_API_KEY). 500 if unset.
    3. Loop guard via isTtsProxyLoop.
    4. Call engine (VOICEVOX):
      • POST /audio_query?text={reading}&speaker={speakerId} → JSON AudioQuery (both values travel as query parameters; text must be URL-encoded).
      • Walk accent_phrases, overwrite each mora's pitch from moraTones (aligned by mora index across the flattened reading).
      • POST /synthesis?speaker={id} with the modified query → pipe the WAV response body through.
    5. Return body with Content-Type: audio/wav (optional server-side mp3 conversion later).
  • GET → 405, matching the existing route's shape.
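The engine call in step 4 can be sketched as follows. This is a minimal sketch, not the final route: only the AudioQuery fields used here are typed, `overwritePitches` and `synthesizeWord` are hypothetical names, and leaving morae with pitch 0 untouched rests on the assumption that the engine uses 0 to mark unvoiced morae. Origin checks, the loop guard, and any API-key header are omitted.

```typescript
// Subset of the VOICEVOX AudioQuery shape that this sketch touches.
interface Mora { text: string; pitch: number }
interface AccentPhrase { moras: Mora[]; pause_mora?: Mora | null }
interface AudioQuery { accent_phrases: AccentPhrase[]; [key: string]: unknown }

// Return a copy of the query with each mora's pitch replaced, aligned by
// mora index across all accent phrases.
function overwritePitches(query: AudioQuery, pitches: number[]): AudioQuery {
  const total = query.accent_phrases.reduce((n, p) => n + p.moras.length, 0);
  if (total !== pitches.length) {
    throw new Error(`mora count mismatch: engine ${total}, app ${pitches.length}`);
  }
  let i = 0;
  return {
    ...query,
    accent_phrases: query.accent_phrases.map((phrase) => ({
      ...phrase,
      moras: phrase.moras.map((mora) => {
        const pitch = pitches[i++];
        // Assumption: pitch 0 means unvoiced, so keep it as the engine set it.
        return mora.pitch === 0 ? mora : { ...mora, pitch };
      }),
    })),
  };
}

// Two engine calls per word: build the query, then synthesize the edited query.
async function synthesizeWord(
  engineUrl: string,
  reading: string,
  pitches: number[],
  speakerId: number,
): Promise<ArrayBuffer> {
  const queryRes = await fetch(
    `${engineUrl}/audio_query?text=${encodeURIComponent(reading)}&speaker=${speakerId}`,
    { method: "POST" },
  );
  if (!queryRes.ok) throw new Error(`audio_query failed: ${queryRes.status}`);
  const query = overwritePitches((await queryRes.json()) as AudioQuery, pitches);

  const wavRes = await fetch(`${engineUrl}/synthesis?speaker=${speakerId}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(query),
  });
  if (!wavRes.ok) throw new Error(`synthesis failed: ${wavRes.status}`);
  return wavRes.arrayBuffer(); // WAV bytes; the route returns these as audio/wav
}
```

Throwing on a mora-count mismatch surfaces alignment problems early instead of producing audio with shifted tones.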

Why resolve tones on the client and pitch-values on the server: the client owns the editor's resolved contour (it knows the user's edits); the server owns engine-specific pitch scaling and the key. Keep the wire format engine-agnostic (moraTones) so you can swap engines later.


4. Frontend: hook + UI + i18n + CSS

New hook src/components/AccentEditor/hooks/useAudioPlayback.ts:

  • Inputs: words: Word[].
  • Holds a single shared HTMLAudioElement in a ref (so starting a new word replaces the previous playback) and a playingWordIndex state of type number | null, initially null.
  • play(wordIndex):
    1. Build { reading, moraTones } by flattening words[wordIndex] per §2.
    2. fetch('/api/mark-accent/audio', { method: 'POST', body: JSON.stringify({...}) }).
    3. URL.createObjectURL(await res.blob()), set as audio.src, play; on ended/error clear playingWordIndex and call URL.revokeObjectURL on the finished URL so blobs don't accumulate.
  • stop() pauses + clears. Expose isPlaying(wordIndex).
  • Wire it into AccentEditor.tsx alongside useWordHistory / useAccentAnalysis. Do not add audio fields to the Word model — playback is transient state, key it by index.
  • Memoize the flatten step; reuse the same resolution logic the renderer uses in ResultContent.tsx / annotationLayout.ts so the tone sent to the engine matches the tone drawn on screen. Extract that resolution into a small pure helper in core/word/ if it isn't already a function.
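The play/stop logic described above can be kept framework-free and wrapped by the hook. The sketch below is one possible arrangement, not the planned implementation: `createPlayback`, `AudioLike`, and `fetchAudioUrl` are hypothetical names, and the request counter that discards stale responses is an addition the bullets do not mention.

```typescript
// Structural stand-in for the parts of HTMLAudioElement used here.
interface AudioLike {
  src: string;
  play(): Promise<void>;
  pause(): void;
}

interface PlaybackDeps {
  audio: AudioLike;
  // Hypothetical: POSTs { reading, moraTones } to the route and returns an object URL.
  fetchAudioUrl(wordIndex: number): Promise<string>;
  // Called whenever the playing index changes, e.g. a React state setter.
  onChange(playing: number | null): void;
}

function createPlayback(deps: PlaybackDeps) {
  let playing: number | null = null;
  let requestId = 0; // bumped on every play/stop so late responses can be ignored

  const set = (next: number | null) => {
    playing = next;
    deps.onChange(next);
  };

  return {
    isPlaying: (wordIndex: number) => playing === wordIndex,

    stop() {
      requestId++;
      deps.audio.pause();
      set(null);
    },

    async play(wordIndex: number) {
      const id = ++requestId;
      deps.audio.pause();
      set(wordIndex); // also serves as the loading state until audio starts
      try {
        const url = await deps.fetchAudioUrl(wordIndex);
        if (id !== requestId) return; // superseded by a newer play() or stop()
        deps.audio.src = url;
        await deps.audio.play();
      } catch {
        if (id === requestId) set(null);
      }
    },

    // Attach to the element's ended and error events.
    handleEnded() {
      set(null);
    },
  };
}
```

Inside the hook, `audio` would be the ref's element, `onChange` the `playingWordIndex` setter, and `handleEnded` the listener for both `ended` and `error`.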

UI:

  • Per-word play button in ResultContent.tsx, adjacent to each word-inline-cluster. Use a lucide icon (Volume2 / VolumeX while playing). Match the action-button styling pattern; reuse --color-accent-green, --shadow-soft, --radius-md, --duration-interaction-fast.
  • Optional global "play all" in ResultActions.tsx in action-group-right — sequence word indices through play(). Defer to v2 if shipping per-word first.
  • Co-locate a new AudioButton.css (or fold into Result.css); CSS import goes last per ESLint import/order.
  • New user-facing strings → add to all three locales in src/i18nConfig.ts: TranslationSet interface + en + ja + zh (e.g. playAudio, stopAudio, audioLoading, audioUnsupported). A missing key is a type error — do not skip.
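The "missing key is a type error" property can be illustrated with a typed record. The real `TranslationSet` layout in `src/i18nConfig.ts` is not shown here, so `AudioStrings`, `Locale`, and `audioStrings` are hypothetical names, and the ja/zh wording is placeholder text to be reviewed by a native speaker.

```typescript
// Hypothetical slice of TranslationSet holding only the new audio keys.
interface AudioStrings {
  playAudio: string;
  stopAudio: string;
  audioLoading: string;
  audioUnsupported: string;
}

type Locale = "en" | "ja" | "zh";

// Record<Locale, AudioStrings> makes the compiler reject a missing locale
// or a missing key inside any locale.
const audioStrings: Record<Locale, AudioStrings> = {
  en: {
    playAudio: "Play audio",
    stopAudio: "Stop audio",
    audioLoading: "Loading audio…",
    audioUnsupported: "Audio playback is not supported in this browser",
  },
  ja: {
    playAudio: "音声を再生",
    stopAudio: "停止",
    audioLoading: "音声を読み込み中…",
    audioUnsupported: "このブラウザは音声再生に対応していません",
  },
  zh: {
    playAudio: "播放音訊",
    stopAudio: "停止播放",
    audioLoading: "音訊載入中…",
    audioUnsupported: "此瀏覽器不支援音訊播放",
  },
};
```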

5. Deployment topology (don't skip this)

VOICEVOX/COEIROINK is a long-running stateful container — it cannot run inside Next.js on Vercel. The topology is:

Browser ──HTTPS──▶ Vercel (Next.js)  ── internal ──▶  TTS engine container
                  (the /api/mark-accent/audio         (VOICEVOX, your infra /
                   route just proxies, like the        Fly.io / Railway / a VM)
                   existing mark-accent stream)
  • Host the engine somewhere with enough RAM (VOICEVOX wants ~2GB+; GPU optional but faster). Wire its URL + key via env (TTS_ENGINE_URL, TTS_ENGINE_API_KEY) in Vercel project settings, same as MARK_ACCENT_API_KEY.
  • The route must not time out on long readings. Each word costs two short engine calls (/audio_query, then /synthesis), so per-word requests keep latency bounded. Don't synthesize the whole paragraph in one call.

6. File manifest

| Action | Path | Change |
| --- | --- | --- |
| Modify | proxy.config.js | add TTS proxy path + URL/loop helpers |
| Create | src/app/api/mark-accent/audio/route.ts | |
| Create | src/components/AccentEditor/hooks/useAudioPlayback.ts | |
| Create (or extract) | core/word/ helper | flatten Word → {reading, moraTones} (share with renderer) |
| Modify | src/components/AccentEditor/components/ResultContent.tsx | add per-word play button |
| Modify | src/components/AccentEditor/components/AccentEditor.tsx | wire useAudioPlayback |
| Modify (optional) | src/components/AccentEditor/components/ResultActions.tsx | global "play all" (v2) |
| Modify | src/components/AccentEditor/components/Result.css | button styles (or new co-located CSS) |
| Modify | src/i18nConfig.ts | playAudio / stopAudio / audioLoading / audioUnsupported in all 3 locales |
| Modify | AGENTS.md | document TTS_ENGINE_URL / TTS_ENGINE_API_KEY env + the new route |

7. Verification

  • bun run lint, then bun run typecheck, then bun run build (per repo workflow; no test suite).
  • Functional: pick a minimal pair — 箸 (HL) vs 橋 (LH) — and confirm the two play with audibly different contours. That single test proves the tone input is honored end-to-end.
  • Manually edit a word's accent in the editor (flip a mora High↔Low) and confirm the re-fetched audio changes — this proves the user's edited tone, not just the dictionary default, drives synthesis.

8. Risks / open questions

  • Pitch constants (HIGH/LOW values) need tuning by ear — see §2 flag.
  • Mora alignment between the app's kana segmentation and VOICEVOX's internal mora split can differ (long vowels, ん, っ, combined morae). The flatten helper must count morae the same way the engine does. Validate with edge-case words before relying on it.
  • Cost/latency of self-hosting neural TTS — acceptable per-word, but add a loading state (audioLoading) on the button.
  • Engine speaker/voice selection — pick a default speakerId; could be a setting later.
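The mora-alignment risk above is easiest to check with a small counter on the app side. This is a minimal sketch assuming the usual convention that small ゃ/ゅ/ょ and small vowels attach to the preceding kana, while っ, ん, and ー each count as their own mora. `splitMorae` is a hypothetical name, and whether its count always matches the engine's split still has to be validated against real /audio_query output.

```typescript
// Small kana that combine with the preceding character into one mora.
const SMALL_KANA = new Set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ");

// Split a kana reading into morae.
function splitMorae(kana: string): string[] {
  const morae: string[] = [];
  for (const ch of kana) {
    if (SMALL_KANA.has(ch) && morae.length > 0) {
      morae[morae.length - 1] += ch; // e.g. き + ょ → きょ
    } else {
      morae.push(ch); // includes っ, ん, and ー as separate morae
    }
  }
  return morae;
}
```

Comparing `splitMorae(reading).length` with `moraTones.length` on the client, and with the engine's mora count on the server, catches misalignment before synthesis.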