Wowsers. This is a step change in quality compared to SOTA. I suspect that without evaluating samples as a correlated group, distinguishing between the generated samples and those recorded from a human will be little better than a coin toss.<p>And even when evaluating these samples as a group, I may be imagining the distinctions I am drawing from a relatively small selection that might be cherry-picked. Nevertheless:<p>The generated samples are more consistent as a group, and more even in quality, with few instances of emphasis that seem (however slightly) out of place.<p>The recorded human samples vary more <i>between</i> samples (by which I mean the sample as a whole may be emphasized with a bit of extra stress or a small raising or lowering of tone compared to the other samples), and within the sample there is a bit more emphasis on a word or two or slight variance in the length of pauses, mostly appropriate in context (as in, it is similar to what I, a non-professional[0], would have emphasized if I were being asked to record these).<p>In general for a non-dramatic voiceover you want to maintain consistency between passages (especially if they may be heard out of order) without completely flattening the in-passage variation, but tastes vary.<p>Conclusion: For many types of voice work, these generated samples are comparable in quality or slightly superior to recordings of an average professional. For semi-dramatic contexts (eg. audiobooks) the generated samples are firmly in the "more than good enough" zone, more or less comparable to a typical narrator who doesn't "act" as part of their reading.<p>[0] Decades ago in Los Angeles I tried my hand at voiceover and voice acting work, but gave up when it quickly became clear that being even slightly prone to stuffy noses, tonsillitis and sore throats was going to pose a major obstacle to being considered reliable unless I was willing to regularly use decongestants, expectorants, and the like.
It's interesting that TTS is getting better and better while consumer access to it is more and more restricted. A decade ago there were a half dozen totally separate TTS engines I could install on my phone and my Kindle came with its own that worked on any book.
See also the recently published Tortoise TTS, which IMO sounds even better: <a href="https://github.com/neonbjb/tortoise-tts" rel="nofollow">https://github.com/neonbjb/tortoise-tts</a>
Nice pitch envelopes. But it's a bit uncanny that natural human pitch envelopes encode and express what you understand and intend to convey about the meaning of the words you're saying, and what you want to emphasize about each individual word, emotionally. Like how you'll say a word you don't really mean sarcastically. It can figure out it's a question because the sentence ends in a question mark, and it raises the pitch at the end, but it can't figure out what the meaning or point of the question is, and which words to emphasize and stress to convey that meaning. (Not a criticism of this excellent work, just pointing out how hard a problem it is!)<p>For example, compare "rebuke and abash": in the NaturalSpeech, one goes down like she's sure and the other goes up like she's questioning, where in the recording, they are both more balanced and emphasized as equally important words in the sentence. And the pause after insolent in "insolent and daring" sounds uneven compared to the recording, which emphasizes the pair of words more equally and tightly.<p>Jiminy Glick interviews (and does an impression of) Jerry Seinfeld:<p><a href="https://www.youtube.com/watch?v=AE2utktZ92Y" rel="nofollow">https://www.youtube.com/watch?v=AE2utktZ92Y</a>
There's no way I'll find it but somewhere along the way there was a collection of samples in which one of these contemporary model-based speech synthesizers (possibly wavenet or tacotron) was forced to output data with no useful text (can't remember if it was just noise or literally zero input). The synthesizer just started creating weird breathy pops and purrs and gibberish utterances. Some of them sounded like panic breathing and it was one of the more jarring things I've heard in quite some time.<p>This isn't exactly it but it's very close - <a href="https://www.deepmind.com/blog/wavenet-a-generative-model-for-raw-audio" rel="nofollow">https://www.deepmind.com/blog/wavenet-a-generative-model-for...</a> CTRL+F 'babbling'
For German users, I can recommend taking a look at<p><a href="https://www.thorsten-voice.de/" rel="nofollow">https://www.thorsten-voice.de/</a><p><a href="https://github.com/thorstenMueller/Thorsten-Voice" rel="nofollow">https://github.com/thorstenMueller/Thorsten-Voice</a><p>where someone contributed a huge set of his voice samples and a tutorial / script collection for building a pretty decent TTS model LOCALLY.<p>Quality-wise it is not as good as the samples in the article, but it's free and pretty easy to follow for a tech enthusiast.
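A minimal sketch of what the local setup can look like, assuming the Coqui TTS Python package (which ships prebuilt Thorsten models); the exact model id below is an assumption, check what `tts --list_models` actually prints:

    # Sketch: load a prebuilt German "Thorsten" model with Coqui TTS and
    # synthesize a sentence to a WAV file, entirely locally.
    from TTS.api import TTS  # pip install TTS

    # Model id is an assumption; the package downloads it once and caches it.
    tts = TTS(model_name="tts_models/de/thorsten/tacotron2-DDC")
    tts.tts_to_file(
        text="Hallo Welt, das ist ein lokaler Test.",
        file_path="hallo_welt.wav",
    )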
Industry research lab claims human parity on end-to-end text-to-speech and releases a web page with five samples as proof? Microsoft, you're a little late to the party - Google has been using this playbook for 5 years!
It's clear that their dataset contains a lot of newscasts. I wouldn't call this "natural" speech. But it certainly has an application for replacing newscasters/announcers.
Reminds me of how good the choir instruments are these days. <a href="https://youtu.be/ulK3_o7OyEk?t=392" rel="nofollow">https://youtu.be/ulK3_o7OyEk?t=392</a>
Every sample they provide is suspiciously similar to the human version (suggesting overtraining, either on these samples or on a single voice), where I would have expected a different, but still human-quality, voice from a fully functional system. Even so, this tech is coming, and soon. And when it does, voice acting will no longer prevent videogames from having complex stories, and we will find out if the industry is still capable of making them. Looking forward to it :)
I don't suppose anyone could recommend a good text-to-speech for Linux?<p>Command line is fine but it would be much better if it could trivially take clipboard content for input. The last time I looked I found stuff that wasn't that great and was pretty inconvenient.
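Not a polished recommendation, but the clipboard part is easy to script. A rough sketch, assuming espeak-ng (swap in whatever engine you prefer) and xclip on X11; bind it to a hotkey and you get "read clipboard aloud":

    # Rough sketch: read the current clipboard contents aloud on Linux.
    # Assumes xclip (X11; use wl-paste on Wayland) and espeak-ng are installed.
    import subprocess

    def read_clipboard_aloud() -> None:
        # Grab the clipboard contents as text.
        text = subprocess.run(
            ["xclip", "-selection", "clipboard", "-o"],
            capture_output=True, text=True, check=True,
        ).stdout
        if text.strip():
            # espeak-ng reads from stdin when no text argument is given;
            # -s sets the speaking rate in words per minute.
            subprocess.run(["espeak-ng", "-s", "160"], input=text, text=True, check=True)

    if __name__ == "__main__":
        read_clipboard_aloud()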
Very subtle differences can be heard, but I have my headphones on. For example, in the last sample, "borne" and "commission" seem to have some kind of artificial noise inside the "b" and "c" sounds. The "th" in "clothing" sounds artificial too. Still, it's extremely impressive, and in probably 90% of settings people won't be able to tell the difference at all. It even does breaths: "scientific certainty <breath> that".
This is pretty impressive work, except for this one:
"who had borne the Queen's commission, first as cornet, and then lieutenant, in the 10th Hussars"<p>Both the NaturalSpeech and the human said pretty much every word in that sentence completely incorrectly for the context of the words. It is the difference between "the car Seat" and "the car seat". "It's pronounced Ore-garh-no" to paraphrase the insufferable Hermione Granger.
One thing I've noticed is that I can hear the human inhale before they continue speaking. Got curious whether TTS of the future should have this feature too.
Good quality overall, though it's difficult to tell from a small, hand picked set of examples (which appear to come from the training data, too — have the corresponding recordings been included in the voice build or held out?).<p>There is a rather obvious problem with the stress on "warehouses", and a more subtle problem with "warrants on them", where it's difficult to get the stress pattern just right.
The Text-To-Speech service by <a href="https://vtts.xyz" rel="nofollow">https://vtts.xyz</a> is the perfect choice for anyone who needs an instant human-sounding voiceover for their commercial or non-commercial projects. Got a product to sell online? Why not transform your boring text into a natural-sounding voiceover and impress your customers. What about adding a voiceover to your animation or instructional video? It will make it sound more professional and engaging! Our human-sounding voices add inflections that make them sound natural, and our custom text editor makes it easy to get exactly what you want from both Male & Female voices, including over 30 different tones: Serious, Joyful & Normal.
Microsoft/Nuance has been doing great in this area. I am very impressed with TTS on Windows. It makes proofing documents that much easier. I do think there is a need for some type of markup (akin to sheet music) for supervised learning.
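For what it's worth, SSML is an existing W3C markup along roughly those lines, with tags for emphasis, pauses, pitch and rate, though nowhere near the resolution of sheet music. A rough sketch of what a marked-up passage looks like (the Python wrapper is only for illustration; the tag names are from the SSML spec, and how faithfully engines honor them varies):

    # Illustrative SSML-style markup with explicit prosody hints.
    ssml = """
    <speak>
      He had <emphasis level="strong">borne</emphasis> the Queen's commission,
      <break time="300ms"/>
      first as <prosody rate="slow" pitch="+5%">cornet</prosody>, and then lieutenant.
    </speak>
    """.strip()

    print(ssml)  # feed this to any SSML-aware synthesizer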
You could tell the difference in that the AI pronounced "Hussars" correctly whereas the human reader did not. Without our human error added in, the AI-trained version will certainly be the more educated one going forward.
It'd be nice if we could input our own text because otherwise these things are subject to a lot of training corpus and other biases.<p>Sounds really good though.
> We train our proposed system on 8 NVIDIA V100 GPUs with 32G memory<p>Sounds like openly reproducing this result is within independent researchers’ reach.
Totally tangential comment:<p>You can click play on any/all of the samples simultaneously, resulting in a neat sonic effect vaguely reminiscent of Steve Reich's famous "Come out." [1]<p>[1] <a href="https://www.youtube.com/watch?v=g0WVh1D0N50" rel="nofollow">https://www.youtube.com/watch?v=g0WVh1D0N50</a> (skip to like 7 minutes in to get the idea)
I wish that for the "NaturalSpeech versus recording" comparisons they'd used a different voice for the synthesized speech. Otherwise, the fact that we may not be able to tell them apart by ear (in a blind test) doesn't tell us much, from that evidence alone, about how good it is as a speech synthesis engine.
As a daily TTS user, sometimes I'm even fine with espeak quality for system messages. But one thing concerns me more than the beauty of the voice: the ability to process mixed-language text and abbreviations. And I don't see these problems addressed in this project. :(
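To make the abbreviation point concrete, here is a naive sketch of the kind of front-end normalization I mean, with a hand-made (and obviously incomplete) expansion table; real systems need far more than this, plus per-language pronunciation switching:

    # Naive sketch: expand a few abbreviations before handing text to a TTS engine.
    # The table is illustrative only; a real front end needs context-aware rules.
    import re

    ABBREVIATIONS = {
        "Dr.": "Doctor",
        "e.g.": "for example",
        "km/h": "kilometres per hour",
    }

    def normalize(text: str) -> str:
        for short, expanded in ABBREVIATIONS.items():
            text = text.replace(short, expanded)
        return re.sub(r"\s+", " ", text).strip()

    print(normalize("Dr. Smith drove at 60 km/h, e.g. through town."))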
Do you want to check how it works? You can test the operation of standard voices and advanced neural voices at this URL: <a href="https://vtts.xyz/home/tryme" rel="nofollow">https://vtts.xyz/home/tryme</a>
I'm not sure what to make of this. The TTS output seems identical to the recording.<p>Why don't they use this tech to recreate some dead actor's speech, for example?
This is definitely human-level quality. In fact, the synthesized versions pronounce some words better than the human. Kudos to MSFT! I think they've been in the game the longest, too...<p>edit: is the Nuance acquisition compounding yet?
Very cool, but...<p>What's the end game here? Because I cannot use it, I cannot buy it, and this seems like more than just a scientific paper.<p>So what's the objective here?
Can someone please upload the results?
On <a href="https://paperswithcode.com/sota/text-to-speech-synthesis-on-ljspeech" rel="nofollow">https://paperswithcode.com/sota/text-to-speech-synthesis-on-...</a>