I've been following jpc [0] on the LAION Discord since he started building this last year, and it's a very impressive project.

The key here is that the Whisper multilingual ASR model has been trained on a huge amount of data, so its encoder output is a very good representation of the semantic content of speech. It can serve as an open-source, drop-in replacement for the semantic encoder in architectures like SPEAR-TTS/VALL-E/etc. (whose semantic encoders are not publicly available). That semantic representation is then used to predict acoustic tokens (the output of the quantized, low-bandwidth EnCodec audio codec), which are then upsampled/denoised/enhanced with the Vocos vocoder; there's a rough sketch of the flow below.

I know someone is working on Hindi, but it would be great to see this extended to other languages for a properly open-source [1], multilingual TTS platform. I think the main bottleneck at the moment is finding people who can procure and clean compliantly licensed datasets.
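For the curious, here's a minimal sketch of that inference flow in Python, assuming the openai-whisper and vocos packages. The Whisper and Vocos calls follow those libraries' public APIs, but quantize_semantic and SemanticToAcoustic are hypothetical stand-ins for the parts jpc actually trained (a learned VQ bottleneck over the encoder output, and an autoregressive semantic-to-acoustic token model):

    # Sketch of a WhisperSpeech-style pipeline; not the project's actual code.
    import torch
    import whisper
    from vocos import Vocos

    # 1. Semantic representation: run Whisper's multilingual encoder.
    asr = whisper.load_model("base", device="cpu")
    audio = whisper.pad_or_trim(whisper.load_audio("sample.wav"))
    mel = whisper.log_mel_spectrogram(audio).unsqueeze(0)  # (1, 80, 3000)
    with torch.no_grad():
        semantic_emb = asr.encoder(mel)                    # (1, 1500, d_model)

    # 2. Discretize into semantic tokens. Stand-in: the real project uses a
    #    learned vector-quantization bottleneck, not an argmax.
    def quantize_semantic(emb: torch.Tensor) -> torch.Tensor:
        return emb.argmax(dim=-1)                          # (1, 1500) token ids

    semantic_tokens = quantize_semantic(semantic_emb)

    # 3. Predict EnCodec acoustic tokens from semantic tokens. Stand-in for
    #    the trained S2A model: emits random codes across 8 codebooks.
    class SemanticToAcoustic:
        def generate(self, sem: torch.Tensor) -> torch.Tensor:
            return torch.randint(0, 1024, (8, sem.shape[-1]))

    acoustic_tokens = SemanticToAcoustic().generate(semantic_tokens)

    # 4. Vocos decodes the low-bandwidth EnCodec codes straight to audio.
    vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")
    features = vocos.codes_to_features(acoustic_tokens)
    waveform = vocos.decode(features, bandwidth_id=torch.tensor([2]))  # 6 kbps

The nice design choice in step 4 is swapping EnCodec's own decoder for Vocos: at these low bitrates the raw codec reconstruction sounds noticeably artifact-y, while Vocos recovers much cleaner audio from the same tokens.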
[0] https://github.com/jpc

[1] jpc/Collabora went to great lengths to ensure they are only training on properly licensed data. I doubt Whisper itself was trained that compliantly, so it's a bit muddy.