I've been following jpc [0] on the LAION discord since he started building this last year, and it's a very impressive project.<p>The key here is that the Whisper multilingual ASR model has been trained on a huge amount of data, so its encoder output is a very good representation of the semantic content of speech. That makes it an open-source, drop-in replacement for the semantic encoder in model architectures like SPEAR-TTS/VALL-E/etc. (whose semantic encoders are not publicly available). That semantic representation is then used to predict acoustic tokens (the output of the quantized, low-bandwidth EnCodec audio codec), which are then upsampled/denoised/enhanced with the Vocos vocoder (rough sketch of the two ends of that pipeline below).<p>I know someone is working on Hindi, but it would be great to see this extended to other languages for a properly open-source [1], multilingual TTS platform. I think the main bottleneck at the moment is finding people who can procure/clean compliant datasets.<p>[0] <a href="https://github.com/jpc">https://github.com/jpc</a>
[1] jpc/Collabora went to great lengths to ensure that they are only using properly licensed data to train this. I doubt Whisper itself was that compliant, so it's a bit muddy.
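To make that pipeline a bit more concrete, here's a rough sketch of its two ends using the public libraries involved (openai-whisper, encodec, vocos). The middle step, mapping the Whisper-derived semantic tokens to EnCodec tokens, is the part WhisperSpeech actually trains, so it's omitted; the file name and model sizes are placeholders, and the shapes follow each library's README, so treat this as illustrative rather than exact.

    # Rough sketch of the two ends of the pipeline; the learned mapping from
    # semantic tokens to acoustic tokens (what WhisperSpeech trains) is omitted.
    import torch
    import torchaudio
    import whisper                       # openai-whisper
    from encodec import EncodecModel
    from encodec.utils import convert_audio
    from vocos import Vocos

    # 1) Semantic side: Whisper's encoder output as a representation of the
    #    semantic content of speech (WhisperSpeech quantizes this into tokens).
    wmodel = whisper.load_model("base")
    audio = whisper.pad_or_trim(whisper.load_audio("speech.wav"))
    mel = whisper.log_mel_spectrogram(audio).to(wmodel.device)
    semantic = wmodel.embed_audio(mel.unsqueeze(0))      # (1, frames, d_model)

    # 2) Acoustic side: low-bitrate EnCodec tokens for the same audio...
    enc = EncodecModel.encodec_model_24khz()
    enc.set_target_bandwidth(1.5)                        # coarse, low-bandwidth codes
    wav, sr = torchaudio.load("speech.wav")
    wav = convert_audio(wav, sr, enc.sample_rate, enc.channels).unsqueeze(0)
    with torch.no_grad():
        frames = enc.encode(wav)
    codes = torch.cat([c for c, _ in frames], dim=-1)    # (1, n_codebooks, T)

    # 3) ...decoded back to a waveform with the Vocos vocoder instead of
    #    EnCodec's own decoder, which is what cleans up the low-bitrate codes.
    vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")
    features = vocos.codes_to_features(codes.squeeze(0))
    waveform = vocos.decode(features, bandwidth_id=torch.tensor([0]))  # 0 == 1.5 kbps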
Hi, WhisperSpeech dev here.<p>Thanks for all the nice comments. I've been working really hard on this model for quite a few months now, but there are still a lot of ways we can make it better.<p>Thanks to the generosity of Collabora, this is a real open-source project (not just a one-time marketing ploy), so if you want to help improve it or integrate it into something you are building, I'd love to help.<p>You can also buy our undivided engineering attention if you have a business use case. :)
Interested to see how it performs for Mandarin Chinese speech synthesis, especially with prosody and emotion. The highest quality open source model I've seen so far is EmotiVoice[0], which I've made a CLI wrapper around to generate audio for flashcards.[1] For EmotiVoice, you can apparently also clone your own voice with a GPU, but I have not tested this.[2]<p>[0] <a href="https://github.com/netease-youdao/EmotiVoice">https://github.com/netease-youdao/EmotiVoice</a><p>[1] <a href="https://github.com/siraben/emotivoice-cli">https://github.com/siraben/emotivoice-cli</a><p>[2] <a href="https://github.com/netease-youdao/EmotiVoice/wiki/Voice-Cloning-with-your-personal-data">https://github.com/netease-youdao/EmotiVoice/wiki/Voice-Clon...</a>
I know it's old at this point and doesn't use the fancy new tech, but Mycroft's Mimic 3 is still pretty impressive and is small enough to fit comfortably and generate speech in real time on a Raspberry Pi [0]. Some of their voices are better than others, but the best of them are definitely equal to the examples of WhisperSpeech given here.<p>[0] <a href="https://mycroft.ai/mimic-3/" rel="nofollow">https://mycroft.ai/mimic-3/</a>
I was watching a video on training a custom voice with Piper, following a tutorial at <a href="https://www.youtube.com/watch?v=b_we_jma220" rel="nofollow">https://www.youtube.com/watch?v=b_we_jma220</a>, and noticed that the datasets require metadata with the transcript for each source audio file. This training method by Collabora seems to automate that step and only requires the audio files themselves.
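Presumably that transcript metadata could also be generated with Whisper itself. A minimal sketch, assuming an LJSpeech-style metadata.csv (which is what Piper's training docs use, if I remember right); the directory layout and model size are placeholders:

    # Auto-generate an LJSpeech-style metadata.csv (id|transcript) from a
    # folder of wav files, using Whisper for the transcripts.
    from pathlib import Path
    import whisper

    model = whisper.load_model("medium")        # pick whatever size your GPU handles
    wav_dir = Path("my_dataset/wavs")

    with open("my_dataset/metadata.csv", "w", encoding="utf-8") as f:
        for wav_path in sorted(wav_dir.glob("*.wav")):
            text = model.transcribe(str(wav_path))["text"].strip()
            f.write(f"{wav_path.stem}|{text}\n")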
Is there any work/progress on a NN/trained model based on International Phonetic Alphabet (IPA) transcriptions? I.e. to be able to create an IPA transcription and convert it back to sound.<p>That approach would be useful for things like shifting a voice to a different accent and for supporting voices that speak multiple languages.<p>This can be done to a limited extent for models such as MBROLA voices by mapping the phonemes of one language to the phonemes of the MBROLA voice. MBROLA is more complex in that it supports diphones, and many diphone pairs don't exist, so you need to map 3 phonemes together to get the best matching phonetic transcription.<p>The IPA approach may also make it easier to train the phonetic synthesis, given that the IPA vowels lie on a formant continuum (similar to colour wheels and cubes). The model could then better learn the variations in voice quality and timbre.
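For what it's worth, the text-to-IPA half already exists off the shelf: espeak-ng (via the phonemizer package) gives you IPA strings that could serve as such a model's input alphabet. A tiny sketch; the language code and example output are just illustrative:

    # Text -> IPA with espeak-ng through the phonemizer package
    # (pip install phonemizer, plus the espeak-ng system package).
    from phonemizer import phonemize

    texts = ["International Phonetic Alphabet", "hello world"]
    ipa = phonemize(texts, language="en-us", backend="espeak", strip=True)
    print(ipa)   # roughly: ['ɪntɚnæʃənəl fənɛɾɪk ælfəbɛt', 'həloʊ wɜːld']

The other half, IPA back to sound, is exactly the part that would need a trained model (or an MBROLA-style diphone mapping as described above).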
Aside: is it just me, or is anyone else just as dumbfounded by how quickly literally every aspect of AI and LLMs and models and blah blah blah is moving?<p>Am I weird for having my head spin, even though I've been at the leading edge of tech before, or is this just me yelling at these new algos on my lawn?
How tunable is the voice?<p>I'm interested in applying TTS to a chat system, and one important feature for that is having as many distinct voices as possible, so that each person gets their own.<p>Would this, or something else, be able to do that?
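From skimming the project README, it looks like each generation call can take a short reference clip as the speaker, so maybe one clip per chat user would be enough to give everyone a distinct voice? Roughly something like the sketch below; I haven't verified the exact argument names, and the model reference and file paths are placeholders:

    # One distinct voice per chat user by passing a per-user reference clip.
    # Argument names follow the WhisperSpeech README as far as I recall;
    # the s2a model reference and file paths are placeholders.
    from whisperspeech.pipeline import Pipeline

    pipe = Pipeline(s2a_ref="collabora/whisperspeech:s2a-q4-tiny-en+pl.model")

    voices = {                        # one short, clean reference clip per user
        "alice": "refs/alice.wav",
        "bob":   "refs/bob.wav",
    }

    def speak(user: str, text: str) -> str:
        out_path = f"out/{user}.wav"
        pipe.generate_to_file(out_path, text, speaker=voices[user])
        return out_path

    speak("alice", "Hey, did you see the new release?")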