For AkuMa (Next.js 16 + React 19 + TypeScript, the Japanese furigana + pitch-accent tool). Feature: given a word's reading and its pitch/tone pattern, fetch audio that pronounces it with the correct pitch.
- Zero audio infrastructure in the repo. grep for `Audio`, `speechSynthesis`, `mp3`, `tts`, `utterance`, `HTMLAudioElement` across `src/` → 0 matches.
- No audio/TTS libraries in `package.json`.
- The only API is `/api/mark-accent/stream` (`src/app/api/mark-accent/stream/route.ts`), a same-origin server proxy to upstream `api.sessatakuma.dev` that streams text → accent-analysis NDJSON. It returns no audio. `proxy.config.js` (repo root) exposes only `buildMarkAccentStreamUrl`, `DEFAULT_MARK_ACCENT_UPSTREAM_URL` (`https://api.sessatakuma.dev/v1/mark-accent`), and `isMarkAccentProxyLoop`.
- Upstream `api.sessatakuma.dev` has no audio endpoint (confirmed with the team). So a pitch-aware engine must be brought up separately.
The requirement "fetch sound using text and tone" is the deciding factor. It rules out two whole categories:
- Generic/cloud TTS (Google / Azure / `speechSynthesis`): no per-mora pitch input. Reads 箸 and 橋 with whatever contour the voice happens to use. Fails the requirement by definition.
- OpenJTalk (lightweight open-source JP TTS): auto-predicts accent from its dictionary; forcing an arbitrary user-edited tone requires rewriting full-context label strings (`/A:` fields). Doesn't cleanly accept "tone as input." Reject for this use case.
Recommendation: VOICEVOX ENGINE (or its fork COEIROINK), self-hosted in Docker.
Why it fits: VOICEVOX exposes an AudioQuery object whose accent_phrases[].moras[] each carry an
explicit pitch float. You POST the text → get the query → overwrite each mora's pitch with the
tone computed from the app's Word.accent → POST back to /synthesis → get a WAV stream. The tone
is a direct input, which is exactly what "text + tone" demands. Bonus: neural voice quality, per-mora
granularity, free/open-source.
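A minimal sketch of that overwrite step, assuming only the `accent_phrases[].moras[].pitch` shape described above. The function name `overwriteMoraPitches` is hypothetical, the interfaces list only the fields this feature touches, and leaving morae with pitch 0 untouched rests on the assumption that the engine marks unvoiced morae that way.

```typescript
// Only the fields this feature touches; everything else in the query
// passes through unchanged via the index signatures.
interface Mora {
  text: string;
  pitch: number;
  [key: string]: unknown;
}
interface AccentPhrase {
  moras: Mora[];
  [key: string]: unknown;
}
interface AudioQuery {
  accent_phrases: AccentPhrase[];
  [key: string]: unknown;
}

// Returns a copy of `query` with mora pitches replaced in reading order.
// ASSUMPTION: a pitch of 0 marks an unvoiced mora, so those are left alone.
// Throws when the app's mora count and the engine's mora count disagree.
function overwriteMoraPitches(query: AudioQuery, pitches: number[]): AudioQuery {
  const total = query.accent_phrases.reduce((n, p) => n + p.moras.length, 0);
  if (total !== pitches.length) {
    throw new Error(`mora count mismatch: engine ${total}, app ${pitches.length}`);
  }
  let i = 0;
  return {
    ...query,
    accent_phrases: query.accent_phrases.map((phrase) => ({
      ...phrase,
      moras: phrase.moras.map((mora) => {
        const pitch = pitches[i];
        i += 1;
        return mora.pitch === 0 || pitch === undefined ? mora : { ...mora, pitch };
      }),
    })),
  };
}
```

Returning a copy rather than mutating keeps the engine's original prediction available for comparison while the pitch constants are being tuned.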
This is the heart of the feature, so it's worth being precise.
The app's data model (core/word/accentTypes.ts):
export const AccentValue = { None: 0, High: 1, Drop: 2 } as const;
export type AccentValueType = (typeof AccentValue)[keyof typeof AccentValue];
export interface FuriganaItem { text: string; accent: AccentValueType; }
export interface Word {
  surface: string;
  furigana: FuriganaItem[];
  accent: AccentValueType | AccentValueType[];
}

Flatten Word → { reading: string; moraTones: AccentValueType[] } before sending to the route:
- `reading` = concatenation of all `furigana[].text` (pure kana, i.e. what the engine must speak).
- `moraTones` = the per-mora `accent` array. For kana-only words it's `word.accent` (already `AccentValueType[]`); for kanji-mixed words, resolve from `word.furigana[].accent` plus `WordAnnotationModel.kanaAccents` (per `annotationLayout.ts`). One tone value per kana character/mora.
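The flatten step might look like the sketch below. The types are restated from `accentTypes.ts` so the block stands alone; `flattenWord` and `FlatWord` are hypothetical names. The kanji-mixed branch is a loud assumption: it simply repeats each furigana item's accent once per character, whereas the real resolution through `WordAnnotationModel.kanaAccents` is not shown here and may differ.

```typescript
const AccentValue = { None: 0, High: 1, Drop: 2 } as const;
type AccentValueType = (typeof AccentValue)[keyof typeof AccentValue];
interface FuriganaItem { text: string; accent: AccentValueType; }
interface Word {
  surface: string;
  furigana: FuriganaItem[];
  accent: AccentValueType | AccentValueType[];
}

interface FlatWord { reading: string; moraTones: AccentValueType[]; }

function flattenWord(word: Word): FlatWord {
  // The reading is the kana the engine must speak.
  const reading = word.furigana.map((item) => item.text).join('');

  // Kana-only words: the per-mora array already lives on the word.
  if (Array.isArray(word.accent)) {
    return { reading, moraTones: [...word.accent] };
  }

  // ASSUMPTION: for kanji-mixed words, repeat each item's accent once per
  // character of its text. Replace with the renderer's real resolution.
  const moraTones = word.furigana.flatMap((item) =>
    Array.from(item.text, () => item.accent),
  );
  return { reading, moraTones };
}
```

Keeping this a pure function makes it easy to share with the renderer and to memoize per word.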
Map AccentValueType → VOICEVOX mora pitch (in the route, server-side):
High (1) → HIGH_PITCH (≈ 5.7)
Drop (2) → HIGH_PITCH then step-down (the kernel; next mora goes low)
None (0) → LOW_PITCH (≈ 3.0)
VOICEVOX mora pitch typically ranges ~0.0–6.5, so HIGH=5.7 / LOW=3.0 gives a clear contour while
the kernel (Drop) is expressed by the High→Low transition the app already encodes.
⚠️ Validate before finalizing: confirm `Drop`'s exact semantics by reading `Kana.tsx`'s dot rendering. The mapping above is the design intent; the real values get tuned once you can A/B against the editor's visual pattern.
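As a sketch, the mapping reduces to a small pure function. `tonesToPitches` is a hypothetical name, the two constants are the starting values quoted above (not tuned), and treating `Drop` as "high on this mora, with the fall carried by the following `None`" is the design intent stated above rather than confirmed behaviour.

```typescript
// Starting values from the mapping above; expected to be tuned by ear.
const HIGH_PITCH = 5.7;
const LOW_PITCH = 3.0;

// Maps the app's per-mora tone values (0 = None, 1 = High, 2 = Drop)
// to engine pitch values, one output per input mora.
function tonesToPitches(moraTones: readonly number[]): number[] {
  return moraTones.map((tone) => {
    if (tone === 0) return LOW_PITCH; // None
    // High, or Drop: the last high mora before the step-down.
    // ASSUMPTION: the low mora after a Drop is already encoded as None.
    if (tone === 1 || tone === 2) return HIGH_PITCH;
    throw new RangeError(`unknown tone value: ${tone}`);
  });
}
```

Rejecting unknown values in the route, instead of defaulting them to low, surfaces a client/server mismatch early if the tone enum ever grows.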
Mirror the existing mark-accent/stream route exactly in security posture.
proxy.config.js (repo root) — extend, don't duplicate:
export const MARK_TTS_PROXY_PATH = '/api/mark-accent/audio';
export const DEFAULT_TTS_ENGINE_URL = 'http://tts.internal:50021'; // example host; 50021 is the VOICEVOX default port
export function buildMarkAccentAudioUrl(upstream) { /* Next route path */ }
export function isTtsProxyLoop(requestHost, engineUrl) { /* same pattern as isMarkAccentProxyLoop */ }

New route src/app/api/mark-accent/audio/route.ts:
- `POST` body: `{ reading: string; moraTones: number[]; speakerId?: number }` (resolved tone array from the client; thin payload, no secret leakage).
- Server-side:
  - Same origin check as the existing route (`extractRequestOrigin` / `extractRequestHostOrigin` + `ALLOWED_DEV_ORIGINS`). Lift into a shared helper or copy.
  - `TTS_ENGINE_URL` / `TTS_ENGINE_API_KEY` from env (analogous to `MARK_ACCENT_API_KEY`). 500 if unset.
  - Loop guard via `isTtsProxyLoop`.
  - Call engine (VOICEVOX):
    - `POST /audio_query?text={reading}&speaker={speakerId}` → JSON `AudioQuery`. (VOICEVOX takes this as a POST with query parameters, not a GET.)
    - Walk `accent_phrases`, overwrite each mora's `pitch` from `moraTones` (aligned by mora index across the flattened reading).
    - `POST /synthesis?speaker={id}` with the modified query → pipe the WAV response body through.
  - Return body with `Content-Type: audio/wav` (optional server-side mp3 conversion later).
- `GET` → 405, matching the existing route's shape.
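The engine call in the steps above can be sketched as a two-request helper. Everything named here (`synthesizeWord`, `EngineFetch`, `rewriteQuery`) is hypothetical; the origin check, env lookup, and loop guard are deliberately left out, and `fetch` is injected so the flow can be exercised without a running engine.

```typescript
// Minimal response/fetch shapes so the helper can be tested with a stub.
interface EngineResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
  arrayBuffer(): Promise<ArrayBuffer>;
}
type EngineFetch = (
  url: string,
  init: { method: string; headers?: Record<string, string>; body?: string },
) => Promise<EngineResponse>;

async function synthesizeWord(
  engineUrl: string,
  req: { reading: string; speakerId: number },
  rewriteQuery: (query: unknown) => unknown, // e.g. the per-mora pitch overwrite
  fetchImpl: EngineFetch,
): Promise<ArrayBuffer> {
  // Step 1: ask the engine for its AudioQuery for this reading.
  const params = new URLSearchParams({
    text: req.reading,
    speaker: String(req.speakerId),
  });
  const queryRes = await fetchImpl(`${engineUrl}/audio_query?${params}`, {
    method: 'POST',
  });
  if (!queryRes.ok) throw new Error(`audio_query failed: ${queryRes.status}`);

  // Step 2: apply the caller's tone edits to the query.
  const query = rewriteQuery(await queryRes.json());

  // Step 3: post the modified query back and return the WAV bytes.
  const synthRes = await fetchImpl(
    `${engineUrl}/synthesis?speaker=${req.speakerId}`,
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(query),
    },
  );
  if (!synthRes.ok) throw new Error(`synthesis failed: ${synthRes.status}`);
  return synthRes.arrayBuffer();
}
```

In the route, the returned buffer would be wrapped in a response with `Content-Type: audio/wav`; streaming the body through instead of buffering is a later optimization.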
Why resolve tones on the client and pitch-values on the server: the client owns the editor's
resolved contour (it knows the user's edits); the server owns engine-specific pitch scaling and the
key. Keep the wire format engine-agnostic (moraTones) so you can swap engines later.
New hook src/components/AccentEditor/hooks/useAudioPlayback.ts:
- Inputs: `words: Word[]`.
- Holds a single module-level `HTMLAudioElement` (ref) and `useState<number | null>` (`playingWordIndex`).
- `play(wordIndex)`:
  - Build `{ reading, moraTones }` by flattening `words[wordIndex]` per §2.
  - `fetch('/api/mark-accent/audio', { method: 'POST', body: JSON.stringify({...}) })`.
  - `URL.createObjectURL(await res.blob())`, set as `audio.src`, play; on `ended`/`error` clear `playingWordIndex`.
- `stop()` pauses + clears. Expose `isPlaying(wordIndex)`.
- Wire it into `AccentEditor.tsx` alongside `useWordHistory` / `useAccentAnalysis`. Do not add audio fields to the `Word` model; playback is transient state, so key it by index.
- Memoize the flatten step; reuse the same resolution logic the renderer uses in `ResultContent.tsx` / `annotationLayout.ts` so the tone sent to the engine matches the tone drawn on screen. Extract that resolution into a small pure helper in `core/word/` if it isn't already a function.
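The state machine behind the hook can be sketched without React. `createAudioPlayback`, `AudioLike`, and `PlaybackDeps` are hypothetical names; in the real hook, `onChange` would be the `playingWordIndex` state setter, `audio` a thin adapter around the shared `HTMLAudioElement`, and `fetchAudioUrl` the POST to the route followed by `URL.createObjectURL`.

```typescript
// Minimal audio surface the controller needs (hypothetical adapter shape).
interface AudioLike {
  src: string;
  play(): Promise<void>;
  pause(): void;
  onended: (() => void) | null;
  onerror: (() => void) | null;
}

interface PlaybackDeps {
  audio: AudioLike;
  fetchAudioUrl: (wordIndex: number) => Promise<string>;
  onChange: (playingWordIndex: number | null) => void;
}

function createAudioPlayback(deps: PlaybackDeps) {
  let playing: number | null = null;
  const set = (next: number | null) => {
    playing = next;
    deps.onChange(next);
  };

  const stop = () => {
    deps.audio.pause();
    set(null);
  };

  // Clear the playing index when playback finishes or fails.
  deps.audio.onended = () => set(null);
  deps.audio.onerror = () => set(null);

  const play = async (wordIndex: number) => {
    stop();
    set(wordIndex);
    try {
      const url = await deps.fetchAudioUrl(wordIndex);
      // A later play()/stop() may have superseded this request.
      if (playing !== wordIndex) return;
      deps.audio.src = url;
      await deps.audio.play();
    } catch {
      if (playing === wordIndex) set(null);
    }
  };

  return { play, stop, isPlaying: (i: number) => playing === i };
}
```

The supersede check matters when a user clicks a second word while the first is still being synthesized; without it the slower response would start playing under the wrong button. A production version would also revoke object URLs it no longer needs.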
UI:
- Per-word play button in `ResultContent.tsx`, adjacent to each `word-inline-cluster`. Use a lucide icon (`Volume2` / `VolumeX` while playing). Match the `action-button` styling pattern; reuse `--color-accent-green`, `--shadow-soft`, `--radius-md`, `--duration-interaction-fast`.
- Optional global "play all" in `ResultActions.tsx` in `action-group-right`: sequence word indices through `play()`. Defer to v2 if shipping per-word first.
- Co-locate a new `AudioButton.css` (or fold into `Result.css`); CSS import goes last per ESLint `import/order`.
- New user-facing strings → add to all three locales in `src/i18nConfig.ts`: `TranslationSet` interface + `en` + `ja` + `zh` (e.g. `playAudio`, `stopAudio`, `audioLoading`, `audioUnsupported`). A missing key is a type error, so do not skip.
VOICEVOX/COEIROINK is a long-running stateful container — it cannot run inside Next.js on Vercel. The topology is:
    Browser ──HTTPS──▶ Vercel (Next.js) ── internal ──▶ TTS engine container
                       (the /api/mark-accent/audio      (VOICEVOX, your infra /
                        route just proxies, like the     Fly.io / Railway / a VM)
                        existing mark-accent stream)
- Host the engine somewhere with enough RAM (VOICEVOX wants ~2GB+; GPU optional but faster). Wire its URL + key via env (`TTS_ENGINE_URL`, `TTS_ENGINE_API_KEY`) in Vercel project settings, same as `MARK_ACCENT_API_KEY`.
- The route must not time out on long readings; VOICEVOX `/synthesis` is a single round-trip per word, so per-word calls keep latency bounded. Don't synthesize the whole paragraph in one call.
| Action | Path |
|---|---|
| Modify | proxy.config.js — add TTS proxy path + URL/loop helpers |
| Create | src/app/api/mark-accent/audio/route.ts |
| Create | src/components/AccentEditor/hooks/useAudioPlayback.ts |
| Create (or extract) | core/word/ helper to flatten Word → {reading, moraTones} (share with renderer) |
| Modify | src/components/AccentEditor/components/ResultContent.tsx — add per-word play button |
| Modify | src/components/AccentEditor/components/AccentEditor.tsx — wire useAudioPlayback |
| Modify (optional) | src/components/AccentEditor/components/ResultActions.tsx — global "play all" (v2) |
| Modify | src/components/AccentEditor/components/Result.css — button styles (or new co-located CSS) |
| Modify | src/i18nConfig.ts — playAudio / stopAudio / audioLoading / audioUnsupported in all 3 locales |
| Modify | AGENTS.md — document TTS_ENGINE_URL / TTS_ENGINE_API_KEY env + the new route |
- `bun run lint` → `bun run typecheck` → `bun run build` (per repo workflow; no test suite).
- Functional: pick a minimal pair, 箸 (HL) vs 橋 (LH), and confirm the two play with audibly different contours. That single test proves the tone input is honored end-to-end.
- Manually edit a word's accent in the editor (flip a mora High↔Low) and confirm the re-fetched audio changes — this proves the user's edited tone, not just the dictionary default, drives synthesis.
- Pitch constants (HIGH/LOW values) need tuning by ear — see §2 flag.
- Mora alignment between the app's kana segmentation and VOICEVOX's internal mora split can differ (long vowels, ん, っ, combined morae). The flatten helper must count morae the same way the engine does. Validate with edge-case words before relying on it.
- Cost/latency of self-hosting neural TTS: acceptable per-word, but add a loading state (`audioLoading`) on the button.
- Engine speaker/voice selection: pick a default `speakerId`; could be a setting later.
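For the mora-alignment risk above, a starting point is a kana-to-mora splitter that the flatten helper and the route can both use. This is only a sketch under the usual counting convention (small ゃ/ゅ/ょ and small vowels attach to the preceding kana; ん, っ and ー each count as one mora); `splitMorae` is a hypothetical name, and whether the engine splits identically is exactly what the edge-case validation has to establish.

```typescript
// Small kana that combine with the preceding kana into one mora (きゃ, ファ, ...).
const SMALL_KANA = new Set('ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ');

// Splits a kana reading into morae under the conventional counting rules.
function splitMorae(reading: string): string[] {
  const morae: string[] = [];
  for (const ch of reading) {
    if (SMALL_KANA.has(ch) && morae.length > 0) {
      // Attach the small kana to the mora it modifies.
      morae[morae.length - 1] += ch;
    } else {
      // Regular kana, and also ん, っ and ー, each start their own mora.
      morae.push(ch);
    }
  }
  return morae;
}
```

Comparing `splitMorae(reading).length` with the total mora count in the engine's `AudioQuery` before synthesis gives a cheap guard: on a mismatch the route can fall back to the engine's own pitch prediction rather than misalign the user's tones.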