Smol TTS

Smol TTS models are here! OuteTTS-0.1-350M – Zero shot voice cloning, built on LLaMa architecture, CC-BY license! 🔥

Pure language modeling approach to TTS
Zero-shot voice cloning
LLaMa architecture w/ Audio tokens (WavTokenizer)
BONUS: Works on-device w/ llama.cpp ⚡

Smol TTS demo

Three-step approach to TTS:

Audio tokenization using WavTokenizer (75 tokens per second)
CTC forced alignment for word-to-audio token mapping
Structured prompt creation w/ transcription, duration, audio tokens
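
To make the three steps concrete, here is a minimal Python sketch of the bookkeeping involved: converting a duration to an audio-token count at the stated 75 tokens per second, and joining each word's text, duration, and audio tokens into one prompt string. The `<|...|>` markers, the function names, and the sample token IDs are assumptions made for illustration; they are not OuteTTS's actual prompt format, and the alignment output is hard-coded rather than produced by a real CTC aligner.

```python
# Illustrative sketch only: the special-token layout and helper names are
# hypothetical, not the format OuteTTS actually uses.
TOKENS_PER_SECOND = 75  # WavTokenizer rate stated in the post


def tokens_for_duration(seconds: float) -> int:
    """Number of audio tokens covering `seconds` of audio at 75 tokens/s."""
    return round(seconds * TOKENS_PER_SECOND)


def build_prompt(words):
    """Build one prompt line per aligned word.

    `words` is a list of (text, duration_seconds, audio_token_ids) tuples,
    the kind of word-to-audio mapping a forced-alignment step would yield.
    """
    lines = []
    for text, duration, token_ids in words:
        codes = "".join(f"<|{t}|>" for t in token_ids)
        lines.append(f"{text}<|t_{duration:.2f}|>{codes}")
    return "\n".join(lines)


# Hard-coded stand-in for alignment output; token lists are shortened
# for readability (a 0.40 s word would really span about 30 tokens).
aligned = [
    ("hello", 0.40, [12, 873, 51]),
    ("world", 0.52, [7, 7, 1024]),
]

print(tokens_for_duration(0.40))
print(build_prompt(aligned))
```

Because everything ends up as plain text plus discrete tokens, a standard language model can be trained on such prompts with no TTS-specific architecture changes, which is the "pure language modeling" point above.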

The model is extremely impressive for 350M parameters! Kudos to the @OuteAI team on such a brilliant feat – I’d love to see this applied to larger datasets and smarter backbones like SmolLM.

Author: @reach_vb

November 5, 2024

Tags:

Hugging Face, TTS