Smol TTS models are here! OuteTTS-0.1-350M – Zero shot voice cloning, built on LLaMa architecture, CC-BY license! 🔥
- Pure language modeling approach to TTS
- Zero-shot voice cloning
- LLaMa architecture w/ Audio tokens (WavTokenizer)
- BONUS: Works on-device w/ llama.cpp ⚡
Three-step approach to TTS:
- Audio tokenization using WavTokenizer (75 tok per second)
- CTC forced alignment for word-to-audio token mapping
- Structured prompt creation w/ transcription, duration, audio tokens
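To make the three steps concrete, here is a rough Python sketch of how word timings from forced alignment could be used to slice a 75 tok/s audio token stream and assemble a prompt. Only the 75 tokens-per-second rate comes from the post. The `AlignedWord` type, the helper names, and the `<text>`, `<t...>`, `<a...>` tag layout are all made up for illustration and are not OuteTTS's actual prompt format.

```python
from dataclasses import dataclass

TOKENS_PER_SECOND = 75  # WavTokenizer rate quoted in the post


@dataclass
class AlignedWord:
    """Hypothetical container for one word from a forced aligner."""
    word: str
    start: float  # seconds
    end: float    # seconds


def split_tokens_by_word(audio_tokens, words):
    """Map each word's time span to its slice of the audio token stream."""
    pieces = []
    for w in words:
        lo = round(w.start * TOKENS_PER_SECOND)
        hi = round(w.end * TOKENS_PER_SECOND)
        pieces.append((w, audio_tokens[lo:hi]))
    return pieces


def build_prompt(audio_tokens, words):
    """Assemble transcription, per-word duration, and audio tokens.

    The tag layout is an assumption for illustration only.
    """
    lines = ["<text>" + " ".join(w.word for w in words) + "</text>"]
    for w, toks in split_tokens_by_word(audio_tokens, words):
        duration = w.end - w.start
        codes = "".join(f"<a{t}>" for t in toks)
        lines.append(f"{w.word}<t{duration:.2f}>{codes}")
    return "\n".join(lines)


# One second of dummy audio tokens and two aligned words.
audio_tokens = list(range(75))
words = [AlignedWord("hello", 0.0, 0.4), AlignedWord("world", 0.4, 1.0)]
prompt = build_prompt(audio_tokens, words)
```

In this toy input, "hello" covers 0.4 s and so gets 30 of the 75 tokens, while "world" gets the remaining 45. A language model trained on prompts of this general shape can then continue a new transcription with audio tokens in the reference speaker's voice.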
The model is extremely impressive for 350M parameters! Kudos to the @OuteAI team on such a brilliant feat – I’d love to see this approach applied to larger datasets and smarter backbones like SmolLM.
Author: @reach_vb