Smol TTS

Smol TTS models are here! OuteTTS-0.1-350M – zero-shot voice cloning, built on the LLaMa architecture, CC-BY license! 🔥

  • Pure language modeling approach to TTS
  • Zero-shot voice cloning
  • LLaMa architecture w/ Audio tokens (WavTokenizer)
  • BONUS: Works on-device w/ llama.cpp ⚡

Three-step approach to TTS:

  • Audio tokenization using WavTokenizer (75 tokens per second)
  • CTC forced alignment for word-to-audio token mapping
  • Structured prompt creation w/ transcription, duration, audio tokens
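
To make the three steps concrete, here is a minimal Python sketch of how word timestamps could be mapped onto audio tokens and folded into a prompt. Only the 75 tokens-per-second rate comes from the post. Everything else is an assumption made for illustration: the AlignedWord type, the slice_tokens and build_prompt helpers, and the <text>, <dur=...>, and <a123> tag names are hypothetical, and the real OuteTTS prompt format may well differ. In practice the token ids would come from WavTokenizer and the timestamps from a CTC forced aligner; here both are dummy values.

```python
from dataclasses import dataclass

TOKENS_PER_SECOND = 75  # WavTokenizer rate quoted in the post


@dataclass
class AlignedWord:
    """Hypothetical container for one word and its aligned time span."""
    word: str
    start: float  # seconds, as a forced aligner might report
    end: float


def slice_tokens(audio_tokens, word):
    """Return the audio tokens that fall inside one word's time span."""
    first = round(word.start * TOKENS_PER_SECOND)
    last = round(word.end * TOKENS_PER_SECOND)
    return audio_tokens[first:last]


def build_prompt(words, audio_tokens):
    """Assemble an illustrative prompt: transcript first, then one line per word.

    The tag layout here is invented for this sketch, not the model's real format.
    """
    lines = ["<text>" + " ".join(w.word for w in words) + "</text>"]
    for w in words:
        codes = "".join(f"<a{t}>" for t in slice_tokens(audio_tokens, w))
        lines.append(f"{w.word}<dur={w.end - w.start:.2f}>{codes}")
    return "\n".join(lines)


# One second of dummy audio: 75 placeholder token ids.
audio_tokens = list(range(75))
words = [AlignedWord("hello", 0.0, 0.4), AlignedWord("world", 0.4, 1.0)]
prompt = build_prompt(words, audio_tokens)
print(prompt)
```

With these dummy inputs, "hello" (0.4 s) picks up 30 audio tokens and "world" (0.6 s) picks up the remaining 45, which is simply duration times 75. Keeping text, duration, and audio tokens in one sequence is what lets an ordinary language model treat speech generation as next-token prediction.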

The model is extremely impressive for 350M parameters! Kudos to the @OuteAI team on such a brilliant feat – I’d love to see this applied to larger datasets and smarter backbones like SmolLM.

Author: @reach_vb
