private build

Subtitle Engine

Local multimodal AI subtitle pipeline

Python · MLX · Whisper · pyannote · NLLB-200 · Qwen2-VL

$ subtitle run interview.mp4

running

elapsed 00:01:56 · 247 segments · 3 langs

01 ✓ silero vad         12.3s
02 ✓ mlx-whisper        43.7s
03 ✓ whisperx align     08.2s
04 ✓ pyannote diarize   19.4s
05 ✓ clap events        06.1s
06 ✓ beats events       05.8s
07 ✓ nllb-200 translate 11.0s
08 ● qwen2-vl ground    running…

out → interview.srt · fully local · apple silicon

─── 01

Eight stages, fully local

Silero VAD → mlx-whisper (or faster-whisper) → WhisperX alignment → pyannote diarization → CLAP audio→text → BEATs + AST events → NLLB-200 translation → Qwen2.5 (via Ollama) formatting. Output: .srt + .ass.
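The final .srt output step can be sketched as follows. This is a minimal illustration, not the project's actual writer: the `Segment` class, its `speaker` field, and the `[speaker]` prefix are assumptions made for the example. Only the SubRip cue layout (index, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text, blank line) is standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    start: float                   # cue start, in seconds
    end: float                     # cue end, in seconds
    text: str
    speaker: Optional[str] = None  # hypothetical diarization label


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        # Assumed convention: prefix the cue text with the speaker label.
        label = f"[{seg.speaker}] " if seg.speaker else ""
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{label}{seg.text}\n"
        )
    return "\n".join(cues)
```

Working in integer milliseconds before splitting into fields avoids float rounding pushing the millisecond field to 1000.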

Zero API cost. Subtitling a two-hour interview incurs $0 in API fees, however many files you process.

─── 02

Why local

No bill. Cloud-billed runs make experimentation expensive; local makes iteration free.
Privacy. Personal recordings, interviews, sensitive material never leave the machine.
Apple Silicon native. mlx-whisper exploits unified memory; Whisper large (about 1.5B parameters) and a 7B summary model can coexist with the right cache tuning.

─── 03

Language contamination — the v2 fix

WhisperX occasionally inserts an English token into a Telugu transcription (or vice versa). The bug is upstream — Whisper code-switches on unfamiliar phonemes.

v2 fix: a second-pass language classifier per segment. If the classifier disagrees with the VAD-flagged language, the segment is re-transcribed with the language hint locked in. Adds ~15% to runtime; eliminates contamination.
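A rough sketch of that second pass, under stated assumptions. The `classify` and `retranscribe` callables are hypothetical stand-ins for the language classifier and the ASR call, not real library APIs. The text does not say which language wins on a disagreement, so this version assumes the classifier's answer becomes the locked hint.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Segment:
    start: float
    end: float
    text: str
    language: str  # language code assigned to the segment on the first pass


def second_pass(
    segments: list[Segment],
    classify: Callable[[Segment], str],           # hypothetical: segment -> language code
    retranscribe: Callable[[Segment, str], str],  # hypothetical: (segment, language) -> text
) -> list[Segment]:
    """Re-transcribe only the segments whose classified language disagrees."""
    fixed = []
    for seg in segments:
        detected = classify(seg)
        if detected != seg.language:
            # Assumption: trust the classifier and force its language
            # as the hint for the repeat transcription.
            seg = replace(
                seg,
                text=retranscribe(seg, detected),
                language=detected,
            )
        fixed.append(seg)
    return fixed
```

Because only disagreeing segments are re-transcribed, the extra cost scales with how often the two passes disagree, not with the length of the recording.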

─── 04

What I learned

Diarization quality is bounded by ASR quality. Fix the transcript first.
Visual grounding is a quiet superpower for ambiguous segments. One frame resolves most who-said-what questions.
DAG-style stage caching pays for itself within the first rerun. Never build a pipeline that re-runs every stage on every invocation.
A small hand-labeled eval harness beats vibes. Threshold tuning without ground truth is gambling.
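The stage-caching point can be illustrated with a small content-addressed cache. This is a sketch under assumptions, not the project's implementation: `cache_key` and `run_stage` are hypothetical names, results are assumed to be JSON-serializable, and each stage is assumed to have a single upstream input.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Callable


def cache_key(stage: str, params: dict, upstream_key: str) -> str:
    """Hash the stage name, its parameters, and the key of its input."""
    payload = json.dumps([stage, params, upstream_key], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_stage(
    cache_dir: Path,
    stage: str,
    params: dict,
    upstream_key: str,
    fn: Callable[..., Any],
    data: Any,
) -> tuple[str, Any]:
    """Return (key, result), running fn only when no cached result exists."""
    key = cache_key(stage, params, upstream_key)
    path = cache_dir / f"{stage}-{key[:16]}.json"
    if path.exists():
        # Cache hit: same stage, same params, same input as a previous run.
        return key, json.loads(path.read_text(encoding="utf-8"))
    result = fn(data, **params)
    path.write_text(json.dumps(result), encoding="utf-8")
    return key, result
```

Passing each stage's returned key as the next stage's `upstream_key` gives the DAG behaviour: changing a parameter invalidates that stage and everything downstream, while earlier stages are still served from cache. The first key in the chain would typically be a hash of the input media file.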

2025