2025
private build
Subtitle Engine
Local multimodal AI subtitle pipeline
Python · MLX · Whisper · pyannote · NLLB-200 · Qwen2-VL
$ subtitle run interview.mp4
running · elapsed 00:01:56 · 247 segments · 3 langs
01  ✓  silero vad          12.3s
02  ✓  mlx-whisper         43.7s
03  ✓  whisperx align      08.2s
04  ✓  pyannote diarize    19.4s
05  ✓  clap events         06.1s
06  ✓  beats events        05.8s
07  ✓  nllb-200 translate  11.0s
08  ●  qwen2-vl ground     running…
out → interview.srt · fully local · apple silicon
─── 01
Eight stages, fully local
Silero VAD → mlx-whisper (or faster-whisper) → WhisperX alignment → pyannote diarization → CLAP audio→text → BEATs + AST events → NLLB-200 translation → Qwen2.5 (via Ollama) formatting. Output: .srt + .ass.
Zero API cost. A two-hour interview costs $0 in API fees to subtitle, no matter how many files go through the pipeline.
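As a rough illustration of the final .srt output step, here is a minimal sketch of turning timed segments into SubRip text. The `Segment` class, the function names, and the `[speaker]` prefix are assumptions made for this example, not the pipeline's actual code; only the SRT layout itself (index line, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text, blank line) is standard.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical container for one subtitle cue."""
    start: float   # seconds from the start of the media
    end: float     # seconds from the start of the media
    speaker: str   # diarization label, e.g. "SPEAKER_00"
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render cues as SRT blocks separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"[{seg.speaker}] {seg.text}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [
        Segment(0.0, 1.5, "SPEAKER_00", "Hello"),
        Segment(1.5, 3.0, "SPEAKER_01", "Hi there"),
    ]
    print(to_srt(demo))
```

Working in integer milliseconds before splitting into fields avoids rounding a cue to something like `00:00:59,1000`.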
─── 02
Why local
- No bill. Cloud-billed runs make experimentation expensive; local makes iteration free.
- Privacy. Personal recordings, interviews, sensitive material never leave the machine.
- Apple Silicon native. mlx-whisper exploits unified memory; a large Whisper model (about 1.5B parameters) and a 7B summary model can coexist in memory with the right cache tuning.
─── 03
Language contamination — the v2 fix
WhisperX occasionally inserts an English token into a Telugu transcription (or vice versa). The bug is upstream — Whisper code-switches on unfamiliar phonemes.
v2 fix: a second-pass language classifier runs on each segment. If it disagrees with the language detected for that segment on the first pass, the segment is re-transcribed with the language hint locked in. This adds ~15% to runtime and eliminates the contamination.
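The second pass can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: the classifier and the re-transcription call are passed in as plain callables (`classify`, `retranscribe`), both hypothetical, and the sketch assumes the classifier runs on the transcribed text and that the hint is locked to the language expected for the segment.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Segment:
    """Hypothetical segment record for this sketch."""
    start: float
    end: float
    text: str
    expected_language: str  # language code expected for this segment


def recheck_languages(
    segments: list[Segment],
    classify: Callable[[str], str],
    retranscribe: Callable[[Segment, str], str],
) -> tuple[list[Segment], int]:
    """Re-transcribe segments whose text is not in the expected language.

    classify(text) returns a language code for the transcribed text.
    retranscribe(segment, language) returns new text with the language forced.
    Returns the corrected segments and the number of re-transcriptions.
    """
    result = []
    redone = 0
    for seg in segments:
        detected = classify(seg.text)
        if detected != seg.expected_language:
            # Disagreement: run ASR again with the language hint locked in.
            new_text = retranscribe(seg, seg.expected_language)
            seg = replace(seg, text=new_text)
            redone += 1
        result.append(seg)
    return result, redone
```

Only the disagreeing segments are transcribed a second time, so the extra runtime scales with how often contamination occurs.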
─── 04
What I learned
- Diarization quality is bounded by ASR quality. Fix the transcript first.
- Visual grounding is a quiet superpower for ambiguous segments. One frame resolves most who-said-what questions.
- DAG-style stage caching pays for itself within the first rerun. Never build a pipeline that re-runs every stage on every invocation.
- A small hand-labeled eval harness beats vibes. Threshold tuning without ground truth is gambling.
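One way to get the stage caching described above is to key each stage's output on its name, its parameters, and the key of whatever fed it, so a change anywhere upstream invalidates everything downstream and nothing else. The sketch below is a minimal version of that idea; the function names, the JSON-on-disk layout, and the assumption that stage results are JSON-serializable are all choices made for this example.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def file_key(path: Path) -> str:
    """Key for the pipeline input: a hash of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def cache_key(stage: str, params: dict, upstream_key: str) -> str:
    """Key for a stage: depends on its name, parameters, and upstream key."""
    payload = json.dumps(
        {"stage": stage, "params": params, "upstream": upstream_key},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_stage(
    stage: str,
    params: dict,
    upstream_key: str,
    compute: Callable[[], dict],
    cache_dir: Path,
) -> tuple[str, dict]:
    """Return (key, result), running compute() only on a cache miss."""
    key = cache_key(stage, params, upstream_key)
    path = cache_dir / f"{stage}-{key[:16]}.json"
    if path.exists():
        return key, json.loads(path.read_text(encoding="utf-8"))
    result = compute()
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
    return key, result
```

Each stage passes its returned key to the next one as `upstream_key`; a stage with several parents can join their keys into one string first. Changing a threshold in one stage then re-runs only that stage and its descendants.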
