The talk covered building a voice bot for inbound telephony in Kazakhstan: Kazakh-Russian code-switching, scarce training data, GPU constraints, and near real-time latency requirements across the full ASR-to-TTS pipeline. We shared plenty of implementation details with a room full of practitioners.
We also scraped 427 YouTube channels and processed 21k hours down to ~2,200 hours of clean Kazakh audio, because open-source data simply wasn't enough. Inference ran on a single consumer GPU. It held up.
Coming soon: an open-source release of our best bilingual RU-KZ ASR model. Stay tuned.
Photos: behindhorizon.ru/disk/data_fest_almau