Design: Pitch-aware audio output (text + tone → sound)

For AkuMa (Next.js 16 + React 19 + TypeScript, the Japanese furigana + pitch-accent tool). Feature: given a word's reading and its pitch/tone pattern, fetch audio that pronounces it with the correct pitch.

0. Current state (verified)

Zero audio infrastructure in the repo. grep for Audio, speechSynthesis, mp3, tts, utterance, HTMLAudioElement across src/ → 0 matches.
No audio/TTS libraries in package.json.
The only API is /api/mark-accent/stream (src/app/api/mark-accent/stream/route.ts) — a same-origin server proxy to upstream api.sessatakuma.dev that streams text → accent-analysis NDJSON. It returns no audio. proxy.config.js (repo root) exposes only buildMarkAccentStreamUrl, DEFAULT_MARK_ACCENT_UPSTREAM_URL (https://api.sessatakuma.dev/v1/mark-accent), isMarkAccentProxyLoop.
Upstream api.sessatakuma.dev has no audio endpoint (confirmed with the team). So a pitch-aware engine must be brought up separately.

1. Engine choice — and why it must be VOICEVOX-class

The requirement "fetch sound using text and tone" is the deciding factor. It rules out two whole categories:

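Generic/cloud TTS (Google / Azure / speechSynthesis): no per-mora pitch input when given plain text (speechSynthesis, for instance, only exposes a single utterance-wide pitch). They read 箸 and 橋 with whatever contour the voice happens to pick, so a user-edited tone cannot be honored. Fails the requirement by definition.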
OpenJTalk (lightweight open-source JP TTS) — auto-predicts accent from its dictionary; forcing an arbitrary user-edited tone requires rewriting full-context label strings (/A: fields). Doesn't cleanly accept "tone as input." Reject for this use case.

Recommendation: VOICEVOX ENGINE (or its fork COEIROINK), self-hosted in Docker.

Why it fits: VOICEVOX exposes an AudioQuery object whose accent_phrases[].moras[] each carry an explicit pitch float. You POST the text → get the query → overwrite each mora's pitch with the tone computed from the app's Word.accent → POST back to /synthesis → get a WAV stream. The tone is a direct input, which is exactly what "text + tone" demands. Bonus: neural voice quality, per-mora granularity, free/open-source.

2. The core mapping: app accent pattern → per-mora pitch

This is the heart of the feature, so it's worth being precise.

The app's data model (core/word/accentTypes.ts):

export const AccentValue = { None: 0, High: 1, Drop: 2 } as const;
export type AccentValueType = (typeof AccentValue)[keyof typeof AccentValue];

export interface FuriganaItem { text: string; accent: AccentValueType; }

export interface Word {
    surface: string;
    furigana: FuriganaItem[];
    accent: AccentValueType | AccentValueType[];
}

Flatten Word → { reading: string; moraTones: AccentValueType[] } before sending to the route:

reading = concatenation of all furigana[].text (pure kana — what the engine must speak).
moraTones = the per-mora accent array. For kana-only words it's word.accent (already AccentValueType[]); for kanji-mixed words, resolve from word.furigana[].accent plus WordAnnotationModel.kanaAccents (per annotationLayout.ts). One tone value per kana character/mora.
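
A minimal sketch of that flatten step, assuming the types shown above. The names `flattenWord`, `FlatWord`, and `resolveMixed` are hypothetical; `resolveMixed` stands in for the kanji-mixed resolution that lives in annotationLayout.ts, which is not reproduced here.

```typescript
const AccentValue = { None: 0, High: 1, Drop: 2 } as const;
type AccentValueType = (typeof AccentValue)[keyof typeof AccentValue];

interface FuriganaItem { text: string; accent: AccentValueType; }
interface Word {
    surface: string;
    furigana: FuriganaItem[];
    accent: AccentValueType | AccentValueType[];
}

interface FlatWord { reading: string; moraTones: AccentValueType[]; }

// Hypothetical helper. `resolveMixed` is an injected stand-in for the app's
// own kanji-mixed resolution (furigana[].accent plus kanaAccents).
function flattenWord(
    word: Word,
    resolveMixed: (word: Word) => AccentValueType[],
): FlatWord {
    // The reading is the pure-kana text the engine must speak.
    const reading = word.furigana.map((item) => item.text).join('');
    // Kana-only words already carry a per-mora array; copy it so callers
    // cannot mutate editor state through the result.
    const moraTones = Array.isArray(word.accent)
        ? [...word.accent]
        : resolveMixed(word);
    return { reading, moraTones };
}
```

Injecting the resolver keeps the helper pure and lets the renderer and the audio hook share one code path, which is what guarantees the tone heard matches the tone drawn.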

Map AccentValueType → VOICEVOX mora pitch (in the route, server-side):

High (1)  → HIGH_PITCH  (≈ 5.7)
Drop (2)  → HIGH_PITCH then step-down (the kernel; next mora goes low)
None (0)  → LOW_PITCH   (≈ 3.0)

VOICEVOX mora pitch typically ranges ~0.0–6.5, so HIGH=5.7 / LOW=3.0 gives a clear contour while the kernel (Drop) is expressed by the High→Low transition the app already encodes.

⚠️ Validate before finalizing: confirm Drop's exact semantics by reading Kana.tsx's dot rendering. The mapping above is the design intent; the real values get tuned once you can A/B against the editor's visual pattern.
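
The mapping can be sketched as a small pure function. This follows the design-intent reading of Drop (voiced high, with the fall carried by the next mora) and uses the untuned constants from above; `tonesToPitches`, `isTone`, and the tone encodings chosen for 箸 and 橋 are assumptions for illustration, not confirmed app behavior.

```typescript
// Design-intent constants; expect to tune these by ear.
const HIGH_PITCH = 5.7;
const LOW_PITCH = 3.0;

type Tone = 0 | 1 | 2; // None | High | Drop

function isTone(value: number): value is Tone {
    return value === 0 || value === 1 || value === 2;
}

// Drop is voiced high; the step-down comes from the following None mora.
function toneToPitch(tone: Tone): number {
    return tone === 0 ? LOW_PITCH : HIGH_PITCH;
}

// Validates the wire payload (plain numbers) before mapping it.
function tonesToPitches(tones: readonly number[]): number[] {
    return tones.map((tone, i) => {
        if (!isTone(tone)) {
            throw new RangeError(`moraTones[${i}] is not a valid AccentValue: ${tone}`);
        }
        return toneToPitch(tone);
    });
}

// Assumed encodings for the minimal pair (confirm against Kana.tsx):
const chopsticks = tonesToPitches([2, 0]); // 箸: [5.7, 3]
const bridge = tonesToPitches([0, 1]);     // 橋: [3, 5.7]
```

Validating on the server matters because moraTones arrives as untrusted JSON; a stray value should be rejected with a 400 rather than silently mapped to a pitch.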

3. Backend: new same-origin audio route

Mirror the existing mark-accent/stream route exactly in security posture.

proxy.config.js (repo root) — extend, don't duplicate:

export const MARK_TTS_PROXY_PATH    = '/api/mark-accent/audio';
export const DEFAULT_TTS_ENGINE_URL = 'http://tts.internal:50021';  // example host; 50021 is the VOICEVOX default port

export function buildMarkAccentAudioUrl(upstream) { /* Next route path */ }
export function isTtsProxyLoop(requestHost, engineUrl) { /* same pattern as isMarkAccentProxyLoop */ }
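
One way the loop guard could look, as a sketch only: the body of the existing isMarkAccentProxyLoop is not shown in this document, so the host comparison below is a guess at the pattern rather than a copy of it.

```typescript
// Hypothetical loop guard: returns true when the configured engine URL
// points back at the host serving this request, which would make the
// route call itself.
function isTtsProxyLoop(requestHost: string | null, engineUrl: string): boolean {
    if (!requestHost) return false;
    let engineHost: string;
    try {
        // URL.host keeps a non-default port, e.g. "tts.internal:50021".
        engineHost = new URL(engineUrl).host;
    } catch {
        return false; // malformed URL; leave that to the config check
    }
    return engineHost.toLowerCase() === requestHost.trim().toLowerCase();
}
```

The route would call this with the request's Host header and TTS_ENGINE_URL, and return a 500 when it reports a loop.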

New route src/app/api/mark-accent/audio/route.ts:

POST body: { reading: string; moraTones: number[]; speakerId?: number } (resolved tone array from the client — thin payload, no secret leakage).
Server-side:
1. Same origin check as the existing route (extractRequestOrigin / extractRequestHostOrigin + ALLOWED_DEV_ORIGINS). Lift into a shared helper or copy.
2. TTS_ENGINE_URL / TTS_ENGINE_API_KEY from env (analogous to MARK_ACCENT_API_KEY). 500 if unset.
3. Loop guard via isTtsProxyLoop.
4. Call engine (VOICEVOX):
  - POST /audio_query?text={reading}&speaker={speakerId} (text URL-encoded; the endpoint takes query parameters and no body) → JSON AudioQuery.
  - Walk accent_phrases, overwrite each mora's pitch from moraTones (aligned by mora index across the flattened reading).
  - POST /synthesis?speaker={id} with the modified query → pipe the WAV response body through.
5. Return body with Content-Type: audio/wav (optional server-side mp3 conversion later).
GET → 405, matching the existing route's shape.
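
Step 4 can be sketched as follows. VOICEVOX ENGINE's /audio_query and /synthesis are both POST endpoints. The interfaces list only the AudioQuery fields this sketch touches; `applyMoraPitches` and `synthesizeWord` are hypothetical names, and auth headers for TTS_ENGINE_API_KEY are omitted.

```typescript
// Only the fields used here; everything else the engine returns is
// passed through untouched.
interface Mora { text: string; pitch: number; [key: string]: unknown; }
interface AccentPhrase { moras: Mora[]; [key: string]: unknown; }
interface AudioQuery { accent_phrases: AccentPhrase[]; [key: string]: unknown; }

// Returns a copy of the query where the i-th mora (counted across all
// accent phrases) takes pitches[i]. Throws when the counts disagree.
function applyMoraPitches(query: AudioQuery, pitches: readonly number[]): AudioQuery {
    const total = query.accent_phrases.reduce((n, p) => n + p.moras.length, 0);
    if (total !== pitches.length) {
        throw new RangeError(`engine produced ${total} morae, client sent ${pitches.length}`);
    }
    let i = 0;
    return {
        ...query,
        accent_phrases: query.accent_phrases.map((phrase) => ({
            ...phrase,
            moras: phrase.moras.map((mora) => ({ ...mora, pitch: pitches[i++] })),
        })),
    };
}

// Two engine calls per word: build the query, then synthesize it.
async function synthesizeWord(
    engineUrl: string,
    reading: string,
    pitches: readonly number[],
    speaker: number,
): Promise<Response> {
    const params = new URLSearchParams({ text: reading, speaker: String(speaker) });
    const queryRes = await fetch(`${engineUrl}/audio_query?${params}`, { method: 'POST' });
    if (!queryRes.ok) throw new Error(`audio_query failed: ${queryRes.status}`);
    const query = applyMoraPitches((await queryRes.json()) as AudioQuery, pitches);

    const wavRes = await fetch(`${engineUrl}/synthesis?speaker=${speaker}`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(query),
    });
    if (!wavRes.ok) throw new Error(`synthesis failed: ${wavRes.status}`);
    // Pipe the WAV body through without buffering it in the route.
    return new Response(wavRes.body, { headers: { 'Content-Type': 'audio/wav' } });
}
```

Failing loudly on a mora-count mismatch is deliberate: a silent misalignment would play a plausible but wrong contour. It is also worth checking real engine output for morae returned with pitch 0, which may need to be left as they are.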

Why resolve tones on the client and pitch-values on the server: the client owns the editor's resolved contour (it knows the user's edits); the server owns engine-specific pitch scaling and the key. Keep the wire format engine-agnostic (moraTones) so you can swap engines later.

4. Frontend: hook + UI + i18n + CSS

New hook src/components/AccentEditor/hooks/useAudioPlayback.ts:

Inputs: words: Word[].
Holds a single shared HTMLAudioElement (module-level or in a ref, not one per word) and a playingWordIndex state (useState<number | null>, initially null).
play(wordIndex):
1. Build { reading, moraTones } by flattening words[wordIndex] per §2.
2. fetch('/api/mark-accent/audio', { method: 'POST', body: JSON.stringify({...}) }).
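3. If res.ok, URL.createObjectURL(await res.blob()), set as audio.src, play; on ended/error clear playingWordIndex and call URL.revokeObjectURL on the blob URL so object URLs don't accumulate across plays.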
stop() pauses + clears. Expose isPlaying(wordIndex).
Wire it into AccentEditor.tsx alongside useWordHistory / useAccentAnalysis. Do not add audio fields to the Word model — playback is transient state, key it by index.
Memoize the flatten step; reuse the same resolution logic the renderer uses in ResultContent.tsx / annotationLayout.ts so the tone sent to the engine matches the tone drawn on screen. Extract that resolution into a small pure helper in core/word/ if it isn't already a function.
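
The play/stop lifecycle above can be sketched without React, so the logic is testable on its own and the hook only has to wrap it. Everything here (`createPlayback`, `PlaybackDeps`, `AudioLike`) is a hypothetical shape; in the hook, the dependencies would be a real audio element, URL.createObjectURL / URL.revokeObjectURL, the fetch from step 2, and the state setter.

```typescript
interface AudioLike {
    src: string;
    play(): Promise<void>;
    pause(): void;
    onended: (() => void) | null;
    onerror: (() => void) | null;
}

interface PlaybackDeps<B> {
    fetchAudio(wordIndex: number): Promise<B>; // POST to /api/mark-accent/audio
    createUrl(blob: B): string;
    revokeUrl(url: string): void;
    audio: AudioLike;
    onChange(playingWordIndex: number | null): void;
}

function createPlayback<B>(deps: PlaybackDeps<B>) {
    let playing: number | null = null;
    let currentUrl: string | null = null;
    let token = 0; // bumps on every play/stop so stale fetches are ignored

    const clear = (): void => {
        token++;
        deps.audio.pause();
        if (currentUrl !== null) {
            deps.revokeUrl(currentUrl); // avoid leaking one blob URL per play
            currentUrl = null;
        }
        playing = null;
        deps.onChange(null);
    };

    deps.audio.onended = clear;
    deps.audio.onerror = clear;

    return {
        isPlaying: (wordIndex: number): boolean => playing === wordIndex,
        stop: clear,
        async play(wordIndex: number): Promise<void> {
            clear();
            const mine = ++token;
            playing = wordIndex;
            deps.onChange(wordIndex);
            try {
                const blob = await deps.fetchAudio(wordIndex);
                if (mine !== token) return; // stopped or superseded while loading
                currentUrl = deps.createUrl(blob);
                deps.audio.src = currentUrl;
                await deps.audio.play();
            } catch {
                if (mine === token) clear(); // only reset our own playback
            }
        },
    };
}
```

The token guard covers the case where a user clicks a second word while the first is still loading: the late response is dropped instead of overriding the newer playback.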

UI:

Per-word play button in ResultContent.tsx, adjacent to each word-inline-cluster. Use a lucide icon (Volume2 when idle, VolumeX while playing). Match the action-button styling pattern; reuse --color-accent-green, --shadow-soft, --radius-md, --duration-interaction-fast.
Optional global "play all" in ResultActions.tsx in action-group-right — sequence word indices through play(). Defer to v2 if shipping per-word first.
Co-locate a new AudioButton.css (or fold into Result.css); CSS import goes last per ESLint import/order.
New user-facing strings → add to all three locales in src/i18nConfig.ts: TranslationSet interface + en + ja + zh (e.g. playAudio, stopAudio, audioLoading, audioUnsupported). A missing key is a type error — do not skip.
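
To illustrate why a missing key becomes a type error, here is a sketch with a hypothetical slice of the translation types. The real TranslationSet in src/i18nConfig.ts has many more keys and may be structured differently; the `AudioTranslations` name, the `Record` layout, and the strings themselves are placeholders.

```typescript
type Locale = 'en' | 'ja' | 'zh';

// Hypothetical slice of TranslationSet: only the four new audio keys.
interface AudioTranslations {
    playAudio: string;
    stopAudio: string;
    audioLoading: string;
    audioUnsupported: string;
}

// Record<Locale, ...> forces every locale to supply every key; dropping
// one entry below fails typecheck.
const audioTranslations: Record<Locale, AudioTranslations> = {
    en: {
        playAudio: 'Play audio',
        stopAudio: 'Stop audio',
        audioLoading: 'Loading audio…',
        audioUnsupported: 'Audio playback is not supported in this browser',
    },
    ja: {
        playAudio: '音声を再生',
        stopAudio: '音声を停止',
        audioLoading: '音声を読み込み中…',
        audioUnsupported: 'このブラウザは音声再生に対応していません',
    },
    zh: {
        playAudio: '播放音频',
        stopAudio: '停止播放',
        audioLoading: '正在加载音频…',
        audioUnsupported: '此浏览器不支持音频播放',
    },
};
```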

5. Deployment topology (don't skip this)

VOICEVOX/COEIROINK is a long-running stateful container — it cannot run inside Next.js on Vercel. The topology is:

Browser ──HTTPS──▶ Vercel (Next.js)  ── internal ──▶  TTS engine container
                  (the /api/mark-accent/audio         (VOICEVOX, your infra /
                   route just proxies, like the        Fly.io / Railway / a VM)
                   existing mark-accent stream)

Host the engine somewhere with enough RAM (VOICEVOX wants ~2GB+; GPU optional but faster). Wire its URL + key via env (TTS_ENGINE_URL, TTS_ENGINE_API_KEY) in Vercel project settings, same as MARK_ACCENT_API_KEY.
The route must not time out on long readings. Each word costs two engine calls (/audio_query, then /synthesis), so per-word requests keep latency bounded. Don't synthesize the whole paragraph in one call.

6. File manifest

Action	Path
Modify	`proxy.config.js` — add TTS proxy path + URL/loop helpers
Create	`src/app/api/mark-accent/audio/route.ts`
Create	`src/components/AccentEditor/hooks/useAudioPlayback.ts`
Create (or extract)	`core/word/` helper to flatten `Word` → `{reading, moraTones}` (share with renderer)
Modify	`src/components/AccentEditor/components/ResultContent.tsx` — add per-word play button
Modify	`src/components/AccentEditor/components/AccentEditor.tsx` — wire `useAudioPlayback`
Modify (optional)	`src/components/AccentEditor/components/ResultActions.tsx` — global "play all" (v2)
Modify	`src/components/AccentEditor/components/Result.css` — button styles (or new co-located CSS)
Modify	`src/i18nConfig.ts` — `playAudio` / `stopAudio` / `audioLoading` / `audioUnsupported` in all 3 locales
Modify	`AGENTS.md` — document `TTS_ENGINE_URL` / `TTS_ENGINE_API_KEY` env + the new route

7. Verification

bun run lint → bun run typecheck → bun run build (per repo workflow; no test suite).
Functional: pick a minimal pair — 箸 (HL) vs 橋 (LH) — and confirm the two play with audibly different contours. That single test proves the tone input is honored end-to-end.
Manually edit a word's accent in the editor (flip a mora High↔Low) and confirm the re-fetched audio changes — this proves the user's edited tone, not just the dictionary default, drives synthesis.

8. Risks / open questions

Pitch constants (HIGH/LOW values) need tuning by ear — see §2 flag.
Mora alignment between the app's kana segmentation and VOICEVOX's internal mora split can differ (long vowels, ん, っ, combined morae). The flatten helper must count morae the same way the engine does. Validate with edge-case words before relying on it.
Cost/latency of self-hosting neural TTS — acceptable per-word, but add a loading state (audioLoading) on the button.
Engine speaker/voice selection — pick a default speakerId; could be a setting later.
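
On the mora-alignment risk above: a sketch of the conventional kana-to-mora split, where small ゃゅょ-type kana merge into the preceding kana while ん, っ, and ー each count as their own mora. `splitMorae` is a hypothetical helper, and whether VOICEVOX splits every reading the same way is exactly what needs validating.

```typescript
// Small kana that combine with the preceding kana into one mora
// (きゃ, ふぁ, ...). Small っ/ッ is deliberately absent: it is its own mora.
const SMALL_KANA = new Set('ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ'.split(''));

function splitMorae(reading: string): string[] {
    const morae: string[] = [];
    for (const ch of reading) {
        if (SMALL_KANA.has(ch) && morae.length > 0) {
            morae[morae.length - 1] += ch; // attach to the previous mora
        } else {
            morae.push(ch); // includes ん, っ and ー
        }
    }
    return morae;
}

const kyou = splitMorae('きょう');     // ['きょ', 'う']
const gakkou = splitMorae('がっこう'); // ['が', 'っ', 'こ', 'う']
```

A cheap runtime check is to compare `splitMorae(reading).length` with the total number of entries in the engine's accent_phrases[].moras and reject the request when they differ.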
