I've been following jpc [0] on the LAION discord since he started building this last year, and it's a very impressive project.<p>The key here is that the Whisper multilingual ASR model has been trained on a huge amount of data, so its encoder output is a very good representation of the semantic content of speech. That makes it an open-source, drop-in replacement for the semantic encoder in model architectures like SPEAR-TTS/VALL-E/etc. (whose semantic encoders are not publicly available). That semantic representation is then used to predict acoustic tokens (the output of the quantized, low-bandwidth EnCodec audio codec), which are then upsampled/denoised/enhanced with the Vocos vocoder (rough sketch of the two ends of that pipeline below).<p>I know someone is working on Hindi, but it would be great to see this extended to other languages for a properly open-source [1], multilingual TTS platform. I think the main bottleneck at the moment is finding people who can procure/clean compliant datasets.<p>[0] <a href="https://github.com/jpc">https://github.com/jpc</a>
[1] jpc/Collabora went to great lengths to ensure that they are only using properly licensed data to train this. I doubt Whisper itself was that compliant, so it's a bit muddy.
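To make that pipeline a bit more concrete, here's a rough sketch of its two ends using the public libraries involved (openai-whisper, encodec, vocos). The middle step, mapping the Whisper-derived semantic tokens to EnCodec tokens, is the part WhisperSpeech actually trains, so it's omitted; the file name and model sizes are placeholders, and the shapes follow each library's README, so treat this as illustrative rather than exact.

    # Rough sketch of the two ends of the pipeline; the learned mapping from
    # semantic tokens to acoustic tokens (what WhisperSpeech trains) is omitted.
    import torch
    import torchaudio
    import whisper                       # openai-whisper
    from encodec import EncodecModel
    from encodec.utils import convert_audio
    from vocos import Vocos

    # 1) Semantic side: Whisper's encoder output as a representation of the
    #    semantic content of speech (WhisperSpeech quantizes this into tokens).
    wmodel = whisper.load_model("base")
    audio = whisper.pad_or_trim(whisper.load_audio("speech.wav"))
    mel = whisper.log_mel_spectrogram(audio).to(wmodel.device)
    semantic = wmodel.embed_audio(mel.unsqueeze(0))      # (1, frames, d_model)

    # 2) Acoustic side: low-bitrate EnCodec tokens for the same audio...
    enc = EncodecModel.encodec_model_24khz()
    enc.set_target_bandwidth(1.5)                        # coarse, low-bandwidth codes
    wav, sr = torchaudio.load("speech.wav")
    wav = convert_audio(wav, sr, enc.sample_rate, enc.channels).unsqueeze(0)
    with torch.no_grad():
        frames = enc.encode(wav)
    codes = torch.cat([c for c, _ in frames], dim=-1)    # (1, n_codebooks, T)

    # 3) ...decoded back to a waveform with the Vocos vocoder instead of
    #    EnCodec's own decoder, which is what cleans up the low-bitrate codes.
    vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")
    features = vocos.codes_to_features(codes.squeeze(0))
    waveform = vocos.decode(features, bandwidth_id=torch.tensor([0]))  # 0 == 1.5 kbps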
Hi, WhisperSpeech dev here.<p>Thanks for all the nice comments. I've been working really hard on this model for quite a few months now, but there are still a lot of ways we can make it better.<p>Thanks to the generosity of Collabora, this is a real open-source project (not just a one-time marketing ploy), so if you want to help improve it or integrate it into something you are building, I'd love to help.<p>You can also buy our undivided engineering attention if you have a business use case. :)
Interested to see how it performs for Mandarin Chinese speech synthesis, especially with prosody and emotion. The highest quality open source model I've seen so far is EmotiVoice[0], which I've made a CLI wrapper around to generate audio for flashcards.[1] For EmotiVoice, you can apparently also clone your own voice with a GPU, but I have not tested this.[2]<p>[0] <a href="https://github.com/netease-youdao/EmotiVoice">https://github.com/netease-youdao/EmotiVoice</a><p>[1] <a href="https://github.com/siraben/emotivoice-cli">https://github.com/siraben/emotivoice-cli</a><p>[2] <a href="https://github.com/netease-youdao/EmotiVoice/wiki/Voice-Cloning-with-your-personal-data">https://github.com/netease-youdao/EmotiVoice/wiki/Voice-Clon...</a>
I know it's old at this point and doesn't use the fancy new tech, but Mycroft's Mimic 3 is still pretty impressive and is small enough to fit comfortably and generate speech in real time on a Raspberry Pi [0]. Some of their voices are better than others, but the best of them are definitely equal to the examples of WhisperSpeech given here.<p>[0] <a href="https://mycroft.ai/mimic-3/" rel="nofollow">https://mycroft.ai/mimic-3/</a>
I was watching a video on training a custom voice with Piper, following a tutorial at <a href="https://www.youtube.com/watch?v=b_we_jma220" rel="nofollow">https://www.youtube.com/watch?v=b_we_jma220</a>, and noticed that the datasets require metadata with the transcript for each source audio file. This training method by Collabora seems to automate that step and only requires the audio files themselves.
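Presumably that transcript metadata could also be generated with Whisper itself. A minimal sketch, assuming an LJSpeech-style metadata.csv (which is what Piper's training docs use, if I remember right); the directory layout and model size are placeholders:

    # Auto-generate an LJSpeech-style metadata.csv (id|transcript) from a
    # folder of wav files, using Whisper for the transcripts.
    from pathlib import Path
    import whisper

    model = whisper.load_model("medium")        # pick whatever size your GPU handles
    wav_dir = Path("my_dataset/wavs")

    with open("my_dataset/metadata.csv", "w", encoding="utf-8") as f:
        for wav_path in sorted(wav_dir.glob("*.wav")):
            text = model.transcribe(str(wav_path))["text"].strip()
            f.write(f"{wav_path.stem}|{text}\n")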
Is there any work/progress on a NN/trained model based on International Phonetic Alphabet (IPA) transcriptions? I.e. to be able to create an IPA transcription and convert it back to sound.<p>That approach would be useful for things like shifting a voice to a different accent and for supporting voices that speak multiple languages.<p>This can be done to a limited extent for models such as MBROLA voices by mapping the phonemes of one language to the phonemes of the MBROLA voice. MBROLA is more complex in that it supports diphones, and many diphone pairs don't exist, so you need to map 3 phonemes together to get the best matching phonetic transcription.<p>The IPA approach may also make it easier to train the phonetic synthesis, given that the IPA vowels lie on a formant continuum (similar to colour wheels and cubes). The model could then better learn the variations in voice quality and timbre.
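For what it's worth, the text-to-IPA half already exists off the shelf: espeak-ng (via the phonemizer package) gives you IPA strings that could serve as such a model's input alphabet. A tiny sketch; the language code and example output are just illustrative:

    # Text -> IPA with espeak-ng through the phonemizer package
    # (pip install phonemizer, plus the espeak-ng system package).
    from phonemizer import phonemize

    texts = ["International Phonetic Alphabet", "hello world"]
    ipa = phonemize(texts, language="en-us", backend="espeak", strip=True)
    print(ipa)   # roughly: ['ɪntɚnæʃənəl fənɛɾɪk ælfəbɛt', 'həloʊ wɜːld']

The other half, IPA back to sound, is exactly the part that would need a trained model (or an MBROLA-style diphone mapping as described above).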
Aside: is it just me, or is anyone else just as dumbfounded by how quickly literally every aspect of AI and LLMs and models and blah blah blah is moving?<p>Am I weird for having my head spin, even though I've been at the leading edge of tech before, or is this just me yelling at these new algos on my lawn?
How tunable is the voice?<p>I'm interested in applying TTS to a chat system, and one important feature for that is having as many distinct voices as possible, so that each person gets their own.<p>Would this, or something else, be able to do that?
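From skimming the project README, it looks like each generation call can take a short reference clip as the speaker, so maybe one clip per chat user would be enough to give everyone a distinct voice? Roughly something like the sketch below; I haven't verified the exact argument names, and the model reference and file paths are placeholders:

    # One distinct voice per chat user by passing a per-user reference clip.
    # Argument names follow the WhisperSpeech README as far as I recall;
    # the s2a model reference and file paths are placeholders.
    from whisperspeech.pipeline import Pipeline

    pipe = Pipeline(s2a_ref="collabora/whisperspeech:s2a-q4-tiny-en+pl.model")

    voices = {                        # one short, clean reference clip per user
        "alice": "refs/alice.wav",
        "bob":   "refs/bob.wav",
    }

    def speak(user: str, text: str) -> str:
        out_path = f"out/{user}.wav"
        pipe.generate_to_file(out_path, text, speaker=voices[user])
        return out_path

    speak("alice", "Hey, did you see the new release?")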