06 - Audio design
Every sound the player will hear, why it's there, where it lives, how it's produced. Audio is half the "feels like an old game" pitch — get this wrong and the pixel art doesn't carry it alone.
Four audio layers
| Layer | Volume | Mix priority | Source |
|---|---|---|---|
| Music | 15% master, ducks to 5% during dialogue | low | Chiptune loops, generative or pre-baked |
| SFX | 30% master | medium | jsfxr-generated, ~20 distinct samples |
| Tactical tones | 60% master, ducks to 12% while TTS plays | medium-high | Web Audio oscillator, pitch ↔ delta from sonic_model |
| Voice (TTS) | 100% master | high | Pre-rendered MP3 per coach phrase + Web Speech fallback |
All four pipe through Howler.js with one
shared mixer (the tactical-tone oscillator is wired into the Howler graph
through its own GainNode on Howler's shared AudioContext). Per-layer mute
toggles live in screens/13-settings.md.
Why a separate tactical-tone layer
The sonic-model emits a continuous pitch where the pitch IS the delta — speed delta, brake-pressure delta, longitudinal-G delta — so the driver hears how far they are from the gold-lap reference without listening to words. This is the reflexive sub-50 ms feedback layer (per ADR-017).
It cannot share the SFX bus: SFX are short one-shots (most under 1 s), while tactical tones are continuous and never stop. It cannot share the voice bus either, because voice (TTS) is the thing we duck around. It needs its own gain node so we can ramp it down by ~14 dB (60% to 12%) whenever a coach phrase is speaking and back up when the phrase ends, without affecting any other layer.
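The ~14 dB figure follows from the two tactical-tone gain levels in the layer table (60% and 12%). A minimal sketch of the arithmetic; the helper name `gainRatioToDb` is illustrative and not part of the codebase:

```typescript
// Convert a change in linear gain to decibels: dB = 20 * log10(to / from).
function gainRatioToDb(from: number, to: number): number {
  return 20 * Math.log10(to / from)
}

// Tactical tones duck from 60% to 12% of master while a coach phrase plays.
const duckDb = gainRatioToDb(0.60, 0.12)
console.log(duckDb.toFixed(2)) // -13.98, i.e. roughly a 14 dB cut
```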
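SFX library
20 distinct samples. Each is a short one-shot, most under 1 s (the
longest, night_chime, runs 2.0 s; coach_thinking is the one loop). All are
generated via jsfxr, with the seeds and parameter files committed
in pitwall-web/public/sfx/. The set is reproducible from those seeds:
pitwall-web/scripts/sfx-bake.ts regenerates the .mp3s from the committed JSON.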
| ID | Use | Length | Character |
|---|---|---|---|
| `boot_chime` | Title screen entry | 1.2 s | Three-note rising arpeggio (C-E-G) |
| `cursor_move` | Menu D-pad nav | 30 ms | Tiny click |
| `cursor_select` | A button confirm | 200 ms | Two-note ding |
| `cancel` | B button cancel | 150 ms | Soft thud |
| `dialogue_blip` | Per char during teletype | 20 ms | Very soft tick (max once per 30 ms) |
| `transition_wipe` | Screen change | 150 ms | Whoosh |
| `lap_complete` | Lap finish | 800 ms | 4-note fanfare |
| `pb_unlock` | New personal best | 1.5 s | 6-note ascending fanfare with chord |
| `medal_award` | New medal awarded | 600 ms | Slot-machine "ding-ding-ding" |
| `coach_thinking` | Pre-brief generating | loop, 800 ms | 4-tone loop |
| `over_grip` | HUD: friction circle exceeded | 250 ms | Buzzer (matches 01-visual-language.md ui-bad) |
| `coast_warning` | HUD: coasting too long on a straight | 400 ms | Slow descending tone |
| `corner_apex` | HUD: hit apex marker | 100 ms | Quick chirp |
| `score_tick` | Per metric reveal on score screen | 50 ms | Click |
| `score_total` | Total score reveal | 1.0 s | Big positive chord |
| `error_quiet` | Bridge offline / network drop | 600 ms | 2-note descending soft sad tone |
| `goal_complete` | Session goal achieved | 500 ms | Heroic 3-note motif |
| `goal_miss` | Session goal missed | 350 ms | 2-note flat tone |
| `level_up` | Driver level increase | 1.8 s | Big rising chime + fanfare |
| `night_chime` | End-of-day fade-to-night | 2.0 s | Soft 5-note descending lullaby |
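The `dialogue_blip` row caps the tick at once per 30 ms, so a fast teletype does not turn into a buzz. One way to enforce that cap is a small rate gate in front of the SFX call. This is a sketch under assumptions: `makeRateGate` is a hypothetical helper, and the clock is injectable only so the gate can be exercised without real timers.

```typescript
// Returns a gate that lets at most one trigger through per `minGapMs`.
// `now` defaults to the monotonic clock; tests can pass a fake one.
function makeRateGate(
  minGapMs: number,
  now: () => number = () => performance.now(),
): () => boolean {
  let last = -Infinity
  return function shouldFire(): boolean {
    const t = now()
    if (t - last < minGapMs) return false // too soon after the last blip
    last = t
    return true
  }
}

// Hypothetical use inside the teletype loop:
//   const blipGate = makeRateGate(30)
//   if (blipGate()) audio.playSfx('dialogue_blip')
```

Characters that arrive inside the 30 ms window are simply rendered silently; nothing is queued, so the blips never lag behind the text.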
Music
8-bar chiptune loops. Each scene has an associated track. Looping is
gapless via Howler's sprite mode.
| Track | Use | Tempo | Key |
|---|---|---|---|
| `title_loop` | Title screen idle | 92 BPM | C major |
| `garage_loop` | Garage hub | 80 BPM | A minor |
| `worldmap_loop` | World map | 85 BPM | G major |
| `prebrief_loop` | Pre-brief (low-key, atmospheric) | 70 BPM | D minor |
| `drive_loop` | On-track HUD (energetic) | 130 BPM | E minor |
| `cooldown_loop` | Cool-down lap | 95 BPM | A major |
| `score_fanfare` | Stage clear (one-shot, 12 s) | 100 BPM | F major |
| `eod_loop` | End of day (slow, melancholic) | 60 BPM | C minor |
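For gapless looping through a Howler sprite, the sprite entry needs the loop's exact length in milliseconds, which can be derived from the bar count and tempo. A sketch, assuming 4/4 time (the document does not state a time signature); `loopLengthMs` and `loopSprite` are hypothetical helper names:

```typescript
// Length of a loop in milliseconds. beatsPerBar = 4 is an assumption.
function loopLengthMs(bars: number, bpm: number, beatsPerBar = 4): number {
  return (bars * beatsPerBar * 60000) / bpm
}

// Howler sprite entries have the shape [offsetMs, durationMs, loop].
function loopSprite(bars: number, bpm: number): { loop: [number, number, boolean] } {
  return { loop: [0, Math.round(loopLengthMs(bars, bpm)), true] }
}

console.log(loopLengthMs(8, 80)) // 24000: garage_loop, 8 bars at 80 BPM, is 24 s
```

The result would be passed as the `sprite` option when constructing the `Howl` and started with `play('loop')`. If the rendered file has any lead-in or tail silence, the offset and duration need trimming to match.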
Generation paths (pick one):
1. Hand-composed with Bosca Ceoil or FamiStudio: most authentic, but slow.
2. Suno / Udio prompted with "8-bit chiptune racing arcade, GBA-era, 92 BPM, C major, looping 16-bar arrangement, NES square waves + triangle bass": fast, but license clearance is required for commercial use.
3. Creative Commons tracks like Eric Skiff's Resistor Anthems or Kevin MacLeod's chiptune set: fastest, but less specific to the brand. Check each license before shipping; these are generally CC BY (attribution required) rather than CC0.
For the May 23 demo, option 3 is the realistic call. Post-Sonoma, revisit.
Voice (TTS)
Per 03-character-bible.md, each coach has
~50 canonical phrases pre-rendered to MP3 + a Web Speech API fallback
for any phrase the LLM generates ad-hoc.
Pre-rendered set
pitwall-web/public/audio/coaches/
├── trod/
│ ├── greet_morning.mp3 (~2 s)
│ ├── greet_afternoon.mp3
│ ├── greet_evening.mp3
│ ├── greet_long_absence.mp3
│ ├── concept_trail_brake.mp3
│ ├── concept_late_apex.mp3
│ ├── … 50 total per coach …
│ └── farewell_eod.mp3
├── bentley/ (50 files)
├── drill/ (50 files)
├── calm/ (50 files)
└── buddy/ (50 files)
That is 5 coaches × 50 clips = 250 clips at ~120 KB each, roughly 30 MB in total. The service worker caches the active coach's clips on coach-select and downloads the rest in the background.
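The caching order described above can be expressed as a small planning step: the active coach's clips go into the "cache now" list and every other coach's clips go into the background list. This is a sketch only; `cachePlan` and `clipUrls` are hypothetical names, and the URL pattern is taken from the directory tree above.

```typescript
const COACHES = ['trod', 'bentley', 'drill', 'calm', 'buddy'] as const
type Coach = (typeof COACHES)[number]

// Build clip URLs for one coach from its phrase ids.
function clipUrls(coach: Coach, phraseIds: string[]): string[] {
  return phraseIds.map((id) => `/audio/coaches/${coach}/${id}.mp3`)
}

// Split the full clip set into "cache on coach-select" and
// "fetch in the background".
function cachePlan(active: Coach, phrases: Record<Coach, string[]>) {
  const immediate = clipUrls(active, phrases[active])
  const background = COACHES.filter((c) => c !== active)
    .flatMap((c) => clipUrls(c, phrases[c]))
  return { immediate, background }
}

// Rough budget from the figures above: 5 coaches x 50 clips x ~120 KB.
const totalMb = (5 * 50 * 120) / 1000 // 30
```

With 50 clips per coach, the immediate list is about 6 MB and the background list about 24 MB, so the first coach is usable well before the full set lands.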
Web Speech fallback
When the LLM (LitertCoach) generates a fresh phrase that isn't pre-rendered, the PWA falls back to the Web Speech API:
function speak(text: string, voiceConfig: { coachId: string, rate: number, pitch: number }) {
  const u = new SpeechSynthesisUtterance(text)
  u.rate = voiceConfig.rate
  u.pitch = voiceConfig.pitch
  u.voice = pickVoice(voiceConfig.coachId) // best-match Web Speech voice
  speechSynthesis.speak(u)
}
Quality varies by browser and OS. On a Pixel running Chrome, the default en-US voice is acceptable; on macOS Chrome, less so. Pre-rendered audio is preferred for any phrase that fires more than once.
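Because voice availability differs per platform, the "best-match" selection behind `pickVoice` needs a fallback chain. The document does not show that function, so the following is only one plausible shape. `chooseVoice`, `VoiceLike` and the preference order are assumptions; the function takes an explicit list of preferred names rather than a coach id so it can run outside a browser.

```typescript
// Structural subset of SpeechSynthesisVoice (name, lang, default).
interface VoiceLike { name: string; lang: string; default: boolean }

// Preference order: a name the coach asks for, then any voice in the
// requested language, then the platform default, then the first voice.
function chooseVoice<V extends VoiceLike>(
  voices: V[],
  preferredNames: string[],
  lang = 'en-US',
): V | null {
  for (const name of preferredNames) {
    const hit = voices.find((v) => v.name === name)
    if (hit) return hit
  }
  return voices.find((v) => v.lang === lang)
    ?? voices.find((v) => v.default)
    ?? voices[0]
    ?? null
}
```

In the browser the voice list would come from `speechSynthesis.getVoices()`, which can return an empty array until the `voiceschanged` event has fired, so a `null` result should leave `u.voice` unset and let the platform default apply.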
Voice generation pipeline
A bake script (run once, or whenever phrases change), invoked from pitwall-web/:
# `node` must be able to run TypeScript directly; otherwise use a TS runner.
node scripts/voice-bake.ts \
  --coach trod \
  --phrases data/voices/trod-phrases.json \
  --tts gemini-2.5-flash-tts \
  --voice "experienced-instructor-male-american-50s" \
  --out public/audio/coaches/trod/
Phrases JSON shape:
// pitwall-web/data/voices/trod-phrases.json
[
  { "id": "greet_morning", "text": "Welcome back, kid. Today we drive." },
  { "id": "concept_trail_brake", "text": "Roll the brake to the apex." },
  { "id": "corner_t11", "text": "Wait for the bump, trail to the third tire stack." },
  { "id": "encourage_clean", "text": "Now THAT was distance." },
  { "id": "disappoint_overdrive", "text": "Slow down. Same line." },
  /* ... 50 entries ... */
]
Output filenames are deterministic from id so service-worker cache
keys are stable across rebuilds.
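The id-to-filename mapping can be sketched as below. The document only says filenames are deterministic from `id`; the `<id>.mp3` pattern matches the directory tree above, while `outputPaths`, the snake_case id check and the duplicate check are assumptions added for illustration.

```typescript
interface Phrase { id: string; text: string }

// Map each phrase to its output path. The path depends only on `id`,
// never on the phrase text or on build order.
function outputPaths(outDir: string, phrases: Phrase[]): Map<string, string> {
  const paths = new Map<string, string>()
  const dir = outDir.replace(/\/+$/, '') // tolerate a trailing slash
  for (const { id } of phrases) {
    // Assumed id format: lowercase letters, digits and underscores.
    if (!/^[a-z0-9_]+$/.test(id)) throw new Error(`unsafe phrase id: ${id}`)
    if (paths.has(id)) throw new Error(`duplicate phrase id: ${id}`)
    paths.set(id, `${dir}/${id}.mp3`)
  }
  return paths
}
```

Failing fast on duplicate ids matters here because two phrases with the same id would silently overwrite each other's clip.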
Audio system architecture
// pitwall-web/src/lib/audio.ts
import { Howl } from 'howler'
import { ref } from 'vue'

// SfxId, CoachId, VoiceConfig, TacticalToneOscillator, pickVoice and
// estimateMs are declared elsewhere in the app; they are not shown here.

// One reactive flag is the source of truth for ducking. The tactical-tone
// oscillator and any future ducked layer watch it and ramp their gain.
// Setting it true while it is already true extends the duck window
// (multiple back-to-back voice cues won't drop the duck mid-phrase).
export const ttsDucked = ref(false)
let _duckUntil = 0 // monotonic ms: when the active duck window ends

// Ramp the tactical gain from wherever it is right now. The ramp is anchored
// with setValueAtTime because linearRampToValueAtTime interpolates from the
// previous scheduled event, which would otherwise make the gain jump.
function rampTactical(target: number, seconds: number) {
  const t = audio.tactical
  if (!t) return
  const g = t.gain.gainNode.gain
  const now = t.ctx.currentTime
  g.cancelScheduledValues(now)
  g.setValueAtTime(g.value, now)
  g.linearRampToValueAtTime(target, now + seconds)
}

export const audio = {
  music: new Map<string, Howl>(),
  sfx: new Map<string, Howl>(),
  voice: new Map<string, Howl>(),
  tactical: null as null | TacticalToneOscillator, // see sonic-model bus
  playMusic(track: string) {
    /* fade out current track over 500 ms, fade in new */
  },
  playSfx(id: SfxId) { /* one-shot */ },
  playVoice(coachId: CoachId, phraseId: string, hintMs = 0) {
    const key = `${coachId}/${phraseId}`
    let h = this.voice.get(key)
    if (!h) {
      h = new Howl({ src: [`/audio/coaches/${key}.mp3`] })
      this.voice.set(key, h)
    }
    // Two ducks for the price of one: music drops to 5%, tactical to 12%.
    // h.duration() is 0 until the clip has loaded, hence the 1500 ms default.
    audio.duckMusic(true)
    audio.duckTactical(true, hintMs || (h.duration() * 1000) || 1500)
    h.once('end', () => {
      audio.duckMusic(false)
      // Tactical un-ducks via timer (see duckTactical), so a
      // long phrase that finishes early doesn't yank tones up
      // before the user has parsed the cue.
    })
    h.play()
  },
  speakAdHoc(text: string, voiceConfig: VoiceConfig, hintMs = 0) {
    // Web Speech path, used when the LLM emits a phrase outside the
    // pre-rendered set. Ducker hint comes from the bridge's
    // `expected_tts_ms` on the /cues/stream payload (~150 ms/word, floor 800).
    const u = new SpeechSynthesisUtterance(text)
    u.rate = voiceConfig.rate
    u.pitch = voiceConfig.pitch
    u.voice = pickVoice(voiceConfig.coachId)
    audio.duckMusic(true)
    audio.duckTactical(true, hintMs || estimateMs(text))
    u.onend = () => audio.duckMusic(false)
    speechSynthesis.speak(u)
  },
  duckMusic(ducked: boolean) { /* fade music to 5% / 100% */ },
  duckTactical(ducked: boolean, holdMs = 0) {
    // Engages the duck IMMEDIATELY (8 ms ramp: fast enough that the
    // driver doesn't hear the hand-off, slow enough that we don't
    // get a click). Releases on a timer so back-to-back cues stack
    // their hold windows instead of fighting each other.
    // Calling with ducked = false is a no-op; release is timer-driven.
    if (ducked) {
      _duckUntil = Math.max(_duckUntil, performance.now() + holdMs)
      ttsDucked.value = true
      rampTactical(0.12, 0.008)
      setTimeout(audio._maybeUnduck, holdMs + 16)
    }
  },
  _maybeUnduck() {
    // Timers that fire while a later cue is still holding the duck
    // fail this check and are ignored.
    if (performance.now() >= _duckUntil - 8) {
      ttsDucked.value = false
      rampTactical(0.60, 0.080)
    }
  },
}
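The PWA's /cues/stream subscriber feeds expected_tts_ms from each
event into playVoice/speakAdHoc as hintMs, so the duck window tracks the
expected phrase length instead of a hand-tuned constant. When the field is
missing, the clip duration (or a word-count estimate for ad-hoc speech) is used.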
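The word-count fallback can be sketched from the rule quoted in the code comment above (~150 ms per word, floor 800 ms). The document attributes that rule to the bridge's `expected_tts_ms`; assuming the client-side `estimateMs` mirrors it, a minimal version looks like this (`duckWindowMs` is a hypothetical helper):

```typescript
// Estimate speaking time: ~150 ms per word, never below 800 ms.
function estimateMs(text: string, msPerWord = 150, floorMs = 800): number {
  const words = text.trim().split(/\s+/).filter(Boolean).length
  return Math.max(floorMs, words * msPerWord)
}

// Prefer the bridge's hint; fall back to the local estimate.
function duckWindowMs(text: string, expectedTtsMs?: number): number {
  return expectedTtsMs && expectedTtsMs > 0 ? expectedTtsMs : estimateMs(text)
}

console.log(estimateMs('Slow down. Same line.')) // 800 (4 words, floor applies)
console.log(estimateMs('Wait for the bump, trail to the third tire stack.')) // 1500 (10 words)
```

Since Web Speech rate and voice vary by platform, the estimate is only a hold window for the duck, not a prediction of when the utterance actually ends.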
Audio rules
These match the visual rules in 01-visual-language.md:
- Every confirm has a chime. No silent A-button presses.
- Every cancel has a thud. No silent B-button presses.
- Music ducks during dialogue. 100% → 5% over 200 ms; restore when teletype finishes.
- Tactical tones duck during TTS. 60% → 12% over 8 ms (fast hand-off,
no click) when a coach phrase starts; restore over 80 ms (slow ramp so
the driver stays oriented after the cue lands). Window length comes from
the bridge's `expected_tts_ms` cue field; back-to-back cues extend the duck instead of fighting each other. This is the cognitive-overload fix from ADR-018. Without it, the driver hears continuous brake-delta pitch UNDER a verbal pace note at full volume, which is provably bad at 130 mph.
- TTS never overlaps TTS. A new voice cue interrupts the previous
one (`Howl.stop()` then play). The arbiter at the bridge already cools down to one cue per 3 s (ADR-002), so this is a backstop.
- No SFX during the on-track HUD's high-attention windows. Between
corner-entry and corner-exit, only safety SFX (`over_grip`, `coast_warning`) play. Cursor / dialogue SFX are suppressed.
- `prefers-reduced-motion: reduce` also reduces audio: music volume drops to 0, SFX to 50%. Coach voice unchanged (it's the point of the coach). Tactical-tone gain drops to 30% (still present; it's a safety layer).
- No SFX delay > 30 ms. Pre-loaded Howls; never lazy-load on button press.
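The high-attention rule above amounts to an allow-list check in front of every SFX trigger. A sketch, where `HudState`, `sfxAllowed` and the way the corner window is represented are all assumptions; only the two safety ids come from the rule itself:

```typescript
type SfxId = string

// Only these SFX may play between corner entry and corner exit.
const SAFETY_SFX: ReadonlySet<SfxId> = new Set(['over_grip', 'coast_warning'])

// Hypothetical HUD state: on track at all, and inside a corner window.
interface HudState { onTrack: boolean; inCorner: boolean }

function sfxAllowed(id: SfxId, hud: HudState): boolean {
  if (hud.onTrack && hud.inCorner) return SAFETY_SFX.has(id)
  return true
}
```

Read literally, the rule would also suppress `corner_apex`, which by definition fires inside a corner; if that chirp is meant to survive, it would need to join the allow-list.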
Mute / volume UX
In screens/13-settings.md, four sliders:
MASTER ████████████████░░░░ 80%
MUSIC ██████████░░░░░░░░░░ 50%
SFX ████████████████████ 100%
COACH VOICE ████████████████████ 100%
Plus quick toggles:
- 🔕 mute all (visible in status bar)
- 🎙️ mute coach voice (some drivers prefer the silence-is-coaching baseline)
Settings persist in the active save slot (per 04-state-architecture.md),
so a household sharing a phone keeps per-driver preferences.
Related
- `01-visual-language.md`: the audio ↔ visual pairs table
- `03-character-bible.md`: voice character for each coach
- `screens/08-on-track-hud.md`: HUD audio rules
- ADR-017, Three-tier coach architecture: pre-rendered phrases are exactly the in-drive coach path