VALL-E: Neural codec language models are zero-shot text to speech synthesizers

325 points by georgehill, over 2 years ago

19 comments

bottlepalm, over 2 years ago

Wow, people who lose their voice could basically talk again through text to speech as long as they have previous recordings of themselves.

Or text messages that you can listen to in the voice of the people who sent them.

Or the death of the audiobook business? Any book read in any voice you want.

Or maybe a form of extreme compression: voice is converted to text with Whisper, sent over the wire as text, and re-created with the same voice on the receiver.

Or train it with the voice of the computer from TNG, pair it with ChatGPT and now I have the perfect digital assistant.

andrewstuart, over 2 years ago

Despite lots of Internet talk about text to speech, there's still no really amazing TTS that you can pay money for and use. They all sound like text to speech.

woodson, over 2 years ago

Is it just me, or does the spoken content not correspond with the written prompt in many cases? Though, I’m sure it’s just a problem with matching the right file with the text in the HTML and not a TTS problem.

[Edit: My bad, I looked at the page on a phone screen, where only the text and the first audio playback button are visible.]

Jeff_Brown, over 2 years ago

> Human voice inflection is hard to mimic because you have to have a contextual understanding of what is being said

It's a hard problem even for a human. One of the readers for The Economist always emphasizes the wrong word in phrases describing monetary sums. "China's GDP that year was 600 billion dollars, and now it's 8 trillion." It drives me nuts.

vanderZwan, over 2 years ago

If you search for the "So what is the campaign about?" example, then compare ground truth vs synthesis, it's clear that it still appears to filter out accents. To give another data point: a few examples later a man with a British accent suddenly sounds American.

Then again it intuitively makes sense to me that any speech synthesis will "regress to the mean" of the training data, unless it's explicitly trained to distinguish dialects. The "Speaker’s Emotion Maintenance" examples later on give the impression that that should be possible though.

Either way it's still an impressive achievement to my layman's ears!

FloatArtifact, over 2 years ago

I've seen a lot of papers over the years for text to speech but nothing truly competitive from the open source community. For a long time now I've wondered if the market's being suppressed. While Apple's implementation is not open source, it is progress toward an accessible e-library.

anigbrowl, over 2 years ago

Well, there goes another line of work for actors. Great for people who long to publish their own audiobooks or produce their own radio plays or animated films (the ability to give performance cues is surely not far behind).

Once again, while I'm very much in favor of AI and optimistic about what it can do technologically and the opportunities it creates (more so than most), it's extremely foolish to ignore the fact that it's going to throw a lot of people out of a job through no fault of their own. Vocal performance is a skill, and one that's not all that common. It's facile to blame voice actors for not being software engineers or computer scientists, as if that would have somehow shored up their career options. How is someone supposed to deal with spending years honing a very human expressive skill only to wake up one day and find it suddenly obsolete? I'd say people in that line of work have 1-2 years max before their industry is upended and 50% of them lose 50% of their income.

Also, good luck telling whether the corporate phone line you call is manned by an unhelpful human or an AI that is trying really hard but starting to hate its job.

ec109685, over 2 years ago

This could support extreme levels of compression, right?

Send a few seconds of speech sample. Use speech to text and then reconstitute on the client.
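
A rough way to size that idea is a back-of-the-envelope bitrate comparison. The sketch below is only an estimate under assumptions that are not from the thread: a speaking rate of 150 words per minute, about 6 bytes per word of plain text, a 6 kbit/s speech codec as the baseline, and a 3-second voice sample sent once in that same codec. It also ignores everything the transcript loses, such as intonation and timing.

```python
# Back-of-the-envelope bitrate comparison; every constant is an assumption.
WORDS_PER_MINUTE = 150      # assumed conversational speaking rate
BYTES_PER_WORD = 6          # assumed: ~5 ASCII letters plus a space
CODEC_BITRATE_BPS = 6_000   # assumed low-bitrate speech codec baseline
PROMPT_SECONDS = 3          # assumed one-time voice sample length


def text_bitrate_bps(words_per_minute=WORDS_PER_MINUTE,
                     bytes_per_word=BYTES_PER_WORD):
    """Bits per second needed to send the transcript as plain text."""
    return words_per_minute * bytes_per_word * 8 / 60


def codec_bits(call_seconds, codec_bps=CODEC_BITRATE_BPS):
    """Total bits when the whole call is sent through the speech codec."""
    return call_seconds * codec_bps


def text_plus_prompt_bits(call_seconds, prompt_seconds=PROMPT_SECONDS,
                          codec_bps=CODEC_BITRATE_BPS):
    """Total bits for transcript text plus a one-time codec-encoded voice sample."""
    return call_seconds * text_bitrate_bps() + prompt_seconds * codec_bps


def compression_ratio(call_seconds):
    """How many times smaller the text-plus-sample stream is than the codec stream."""
    return codec_bits(call_seconds) / text_plus_prompt_bits(call_seconds)


if __name__ == "__main__":
    for seconds in (60, 600):
        print(seconds, "s call:", round(compression_ratio(seconds), 1), "x smaller")
```

Under these assumptions the one-time voice sample dominates short calls, so the saving grows with call length and approaches the ratio of the codec bitrate to the text bitrate.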

lukeboi, over 2 years ago

Relevant aside: What is state-of-the-art for real time text to speech?

Most recent papers & projects I've seen are really high quality but are too slow to synthesize speech in real time.

monkeydust, over 2 years ago

About a year ago my bank asked me to opt in to voice recognition for authentication (still have to provide account details, name and DOB first). I am very curious to test their system using some of these models but I would seriously hope their tech guys are doing this - thoughts?

moyix, over 2 years ago

Is this really impressive? All of the samples I listened to had some degree of weird intonation and digital buzz artifacts. Maybe I thought the state of the art was further ahead than it actually is?

woodson, over 2 years ago

One point to note is that they appear to be using (pseudo-)phoneme sequences as inputs instead of characters/text, so you need a frontend that does grapheme to phoneme (G2P) conversion. I found that interesting as many previous models (Tacotron 1/2, FastSpeech 1/2, FastPitch, ...) are more often than not trained on text directly (well, tokens from some tokenizer). This may be more relevant for English, though.
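
To make the G2P step concrete, here is a minimal dictionary-lookup sketch. It is not VALL-E's front end, which this thread does not describe. The two-word lexicon is hand-written for illustration in ARPAbet notation, and the `g2p` function and `<unk>` marker are hypothetical names. Real front ends use large pronunciation dictionaries plus a learned model for out-of-vocabulary words.

```python
# Toy grapheme-to-phoneme (G2P) front end based on dictionary lookup.
# The lexicon is a tiny hand-written illustration in ARPAbet notation.
LEXICON = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "world": ["W", "ER1", "L", "D"],
}

PUNCTUATION = ".,!?;:\"'"


def g2p(text, lexicon=LEXICON, unknown="<unk>"):
    """Map whitespace-separated words to a flat list of phoneme symbols.

    Words missing from the lexicon become a single `unknown` marker;
    a real system would fall back to a trained G2P model instead.
    """
    phonemes = []
    for raw_word in text.lower().split():
        word = raw_word.strip(PUNCTUATION)
        if not word:
            continue  # the token was punctuation only
        phonemes.extend(lexicon.get(word, [unknown]))
    return phonemes


if __name__ == "__main__":
    print(g2p("Hello, world!"))
```

The resulting phoneme list, rather than the raw characters, is what a phoneme-input model would consume.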

jrd259, over 2 years ago

With single-sentence examples, it's hard to tell whether this handles informational prosody, e.g. the given/new distinction, or the infamous "John called Bill a Republican, and then he insulted him", where the accent means that the antecedent of "he" is Bill, not John, which further implies that "calling X a Republican" is an insult.

jerpint, over 2 years ago

Is there an open source implementation?

perfopt, over 2 years ago

Could someone explain what is "Speaker prompt" in this context?

mensetmanusman, over 2 years ago

This will be used to log into financial institutions over the phone. I wonder if this type of tech will force people to go back into banks again…

gigel82, over 2 years ago

I wonder what computation resources are needed to generate these in real-time (if it's even possible to do real-time). Very impressive.

fenesiistvan, over 2 years ago

Psychologists worry about the loss of transference of emotions with TTS.

dzink, over 2 years ago

The innovation is spectacular, BUT, there needs to be a signal/low pitch sound that denotes this is generated audio in every single generated sample (likely legally enforced), or your grandma and kids will soon be getting legitimate sounding voice calls from you after someone calls/visits/interviews you first to record and train a model on your voice (as a simplest potential abuse vector, celebrities and anyone with public voice samples would be even easier).
