Wowsers. This is a step change in quality compared to SOTA. I suspect that without evaluating samples as a correlated group, distinguishing between the generated samples and those recorded from a human will be little better than a coin toss.<p>And even when evaluating these samples as a group, I may be imagining the distinctions I am drawing from a relatively small selection that might be cherry-picked. Nevertheless:<p>The generated samples are more consistent as a group, and more even in quality, with few instances of emphasis that seem (however slightly) out of place.<p>The recorded human samples vary more <i>between</i> samples (by which I mean the sample as a whole may be emphasized with a bit of extra stress or a small raising or lowering of tone compared to the other samples), and within the sample there is a bit more emphasis on a word or two or slight variance in the length of pauses, mostly appropriate in context (as in, it is similar to what I, a non-professional[0], would have emphasized if I were being asked to record these).<p>In general for a non-dramatic voiceover you want to maintain consistency between passages (especially if they may be heard out of order) without completely flattening the in-passage variation, but tastes vary.<p>Conclusion: For many types of voice work, these generated samples are comparable in quality or slightly superior to recordings of an average professional. For semi-dramatic contexts (eg. audiobooks) the generated samples are firmly in the "more than good enough" zone, more or less comparable to a typical narrator who doesn't "act" as part of their reading.<p>[0] Decades ago in Los Angeles I tried my hand at voiceover and voice acting work, but gave up when it quickly became clear that being even slightly prone to stuffy noses, tonsillitis and sore throats was going to pose a major obstacle to being considered reliable unless I was willing to regularly use decongestants, expectorants, and the like.
It's interesting that TTS is getting better and better while consumer access to it is more and more restricted. A decade ago there were a half dozen totally separate TTS engines I could install on my phone and my Kindle came with its own that worked on any book.
See also the recently published Tortoise TTS, which IMO sounds even better: <a href="https://github.com/neonbjb/tortoise-tts" rel="nofollow">https://github.com/neonbjb/tortoise-tts</a>
Nice pitch envelopes. But it's a bit uncanny that natural human pitch envelopes encode and express what you understand and intend to convey about the meaning of the words you're saying, and what you want to emphasize about each individual word, emotionally. Like how you'll say a word you don't really mean sarcastically. It can figure out it's a question because the sentence ends in a question mark, and it raises the pitch at the end, but it can't figure out what the meaning or point of the question is, and which words to emphasize and stress to convey that meaning. (Not a criticism of this excellent work, just pointing out how hard a problem it is!)<p>For example, compare "rebuke and abash": in the NaturalSpeech, one goes down like she's sure and the other goes up like she's questioning, where in the recording, they are both more balanced and emphasized as equally important words in the sentence. And the pause after insolent in "insolent and daring" sounds uneven compared to the recording, which emphasizes the pair of words more equally and tightly.<p>Jiminy Glick interviews (and does an impression of) Jerry Seinfeld:<p><a href="https://www.youtube.com/watch?v=AE2utktZ92Y" rel="nofollow">https://www.youtube.com/watch?v=AE2utktZ92Y</a>
There's no way I'll find it but somewhere along the way there was a collection of samples in which one of these contemporary model-based speech synthesizers (possibly wavenet or tacotron) was forced to output data with no useful text (can't remember if it was just noise or literally zero input). The synthesizer just started creating weird breathy pops and purrs and gibberish utterances. Some of them sounded like panic breathing and it was one of the more jarring things I've heard in quite some time.<p>This isn't exactly it but it's very close - <a href="https://www.deepmind.com/blog/wavenet-a-generative-model-for-raw-audio" rel="nofollow">https://www.deepmind.com/blog/wavenet-a-generative-model-for...</a> CTRL+F 'babbling'
For German users, I can recommend taking a look at<p><a href="https://www.thorsten-voice.de/" rel="nofollow">https://www.thorsten-voice.de/</a><p><a href="https://github.com/thorstenMueller/Thorsten-Voice" rel="nofollow">https://github.com/thorstenMueller/Thorsten-Voice</a><p>where someone contributed a huge set of his voice samples and a tutorial / script collection for building a pretty decent TTS model LOCALLY.<p>Quality-wise it is not as good as the samples in the article, but it's free and pretty easy to follow for a tech enthusiast.
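A minimal sketch of what the local setup can look like, assuming the Coqui TTS Python package (which ships prebuilt Thorsten models); the exact model id below is an assumption, check what `tts --list_models` actually prints:

    # Sketch: load a prebuilt German "Thorsten" model with Coqui TTS and
    # synthesize a sentence to a WAV file, entirely locally.
    from TTS.api import TTS  # pip install TTS

    # Model id is an assumption; the package downloads it once and caches it.
    tts = TTS(model_name="tts_models/de/thorsten/tacotron2-DDC")
    tts.tts_to_file(
        text="Hallo Welt, das ist ein lokaler Test.",
        file_path="hallo_welt.wav",
    )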
Industry research lab claims human parity on end-to-end text-to-speech and releases a web page with five samples as proof? Microsoft, you're a little late to the party - Google has been using this playbook for 5 years!
It's clear that their dataset contains a lot of newscasts. I wouldn't call this "natural" speech. But it certainly has an application for replacing newscasters/announcers.
Reminds me of how good the choir instruments are these days. <a href="https://youtu.be/ulK3_o7OyEk?t=392" rel="nofollow">https://youtu.be/ulK3_o7OyEk?t=392</a>
Every sample they provide is suspiciously similar to the human version (suggesting overtraining, either on these samples or on a single voice), where I would have expected a different, but still human-quality, voice from a fully functional system. Even so, this tech is coming, and soon. And when it does, voice acting will no longer prevent videogames from having complex stories, and we will find out if the industry is still capable of making them. Looking forward to it :)
I don't suppose anyone could recommend a good text-to-speech for Linux?<p>Command line is fine but it would be much better if it could trivially take clipboard content for input. The last time I looked I found stuff that wasn't that great and was pretty inconvenient.
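Not a polished recommendation, but the clipboard part is easy to script. A rough sketch, assuming espeak-ng (swap in whatever engine you prefer) and xclip on X11; bind it to a hotkey and you get "read clipboard aloud":

    # Rough sketch: read the current clipboard contents aloud on Linux.
    # Assumes xclip (X11; use wl-paste on Wayland) and espeak-ng are installed.
    import subprocess

    def read_clipboard_aloud() -> None:
        # Grab the clipboard contents as text.
        text = subprocess.run(
            ["xclip", "-selection", "clipboard", "-o"],
            capture_output=True, text=True, check=True,
        ).stdout
        if text.strip():
            # espeak-ng reads from stdin when no text argument is given;
            # -s sets the speaking rate in words per minute.
            subprocess.run(["espeak-ng", "-s", "160"], input=text, text=True, check=True)

    if __name__ == "__main__":
        read_clipboard_aloud()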
Very subtle differences can be heard, but I have my headphones on. For example, in the last sample, "borne" and "commission" seem to have some kind of artificial noise inside the "b" and "c" sounds. The "th" in "clothing" sounds artificial too. Still, it's extremely impressive, and in probably 90% of settings people won't be able to tell the difference at all. It even does breaths: "scientific certainty <breath> that".
This is pretty impressive work, except for this one:
"who had borne the Queen's commission, first as cornet, and then lieutenant, in the 10th Hussars"<p>Both the NaturalSpeech and the human said pretty much every word in that sentence completely incorrectly for the context of the words. It is the difference between "the car Seat" and "the car seat". "It's pronounced Ore-garh-no" to paraphrase the insufferable Hermione Granger.
One thing I've noticed is that I can hear the human inhale before they continue speaking. Got curious whether TTS of the future should have this feature too.
Good quality overall, though it's difficult to tell from a small, hand picked set of examples (which appear to come from the training data, too — have the corresponding recordings been included in the voice build or held out?).<p>There is a rather obvious problem with the stress on "warehouses", and a more subtle problem with "warrants on them", where it's difficult to get the stress pattern just right.
The Text-To-Speech service by <a href="https://vtts.xyz" rel="nofollow">https://vtts.xyz</a> is the perfect choice for anyone who needs an instant human-sounding voiceover for their commercial or non-commercial projects. Got a product to sell online? Why not transform your boring text into a natural-sounding voiceover and impress your customers. What about adding a voiceover to your animation or instructional video? It will make it sound more professional and engaging! Our human-sounding voices add inflections that make them sound natural, and our custom text editor makes it easy to get exactly what you want from both Male & Female voices, including over 30 different tones: Serious, Joyful & Normal.
Microsoft/Nuance has been doing great in this area. I am very impressed with TTS on Windows. It makes proofing documents that much easier. I do think there is a need for some type of markup (akin to sheet music) for supervised learning.
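For what it's worth, SSML is an existing W3C markup along roughly those lines, with tags for emphasis, pauses, pitch and rate, though nowhere near the resolution of sheet music. A rough sketch of what a marked-up passage looks like (the Python wrapper is only for illustration; the tag names are from the SSML spec, and how faithfully engines honor them varies):

    # Illustrative SSML-style markup with explicit prosody hints.
    ssml = """
    <speak>
      He had <emphasis level="strong">borne</emphasis> the Queen's commission,
      <break time="300ms"/>
      first as <prosody rate="slow" pitch="+5%">cornet</prosody>, and then lieutenant.
    </speak>
    """.strip()

    print(ssml)  # feed this to any SSML-aware synthesizer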
You could tell the difference in that the AI pronounced "Hussars" correctly whereas the human reader did not. Without our human error added in, the AI-trained version will certainly be the more educated one going forward.
It'd be nice if we could input our own text because otherwise these things are subject to a lot of training corpus and other biases.<p>Sounds really good though.
> We train our proposed system on 8 NVIDIA V100 GPUs with 32G memory<p>Sounds like openly reproducing this result is within independent researchers’ reach.
Totally tangential comment:<p>You can click play on any/all of the samples simultaneously, resulting in a neat sonic effect vaguely reminiscent of Steve Reich's famous "Come out." [1]<p>[1] <a href="https://www.youtube.com/watch?v=g0WVh1D0N50" rel="nofollow">https://www.youtube.com/watch?v=g0WVh1D0N50</a> (skip to like 7 minutes in to get the idea)
I wish that for the "NaturalSpeech versus recording" comparisons they'd used a different voice for the synthesized speech. Otherwise, the fact that we may not be able to tell them apart by ear (in a blind test) doesn't tell us much, from that evidence alone, about how good it is as a speech synthesis engine.
As a daily TTS user, sometimes I'm even fine with espeak quality for system messages. But one thing concerns me more than the beauty of the voice: the ability to process mixed-language text and abbreviations. And I don't see these problems addressed in this project. :(
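To make the abbreviation point concrete, here is a naive sketch of the kind of front-end normalization I mean, with a hand-made (and obviously incomplete) expansion table; real systems need far more than this, plus per-language pronunciation switching:

    # Naive sketch: expand a few abbreviations before handing text to a TTS engine.
    # The table is illustrative only; a real front end needs context-aware rules.
    import re

    ABBREVIATIONS = {
        "Dr.": "Doctor",
        "e.g.": "for example",
        "km/h": "kilometres per hour",
    }

    def normalize(text: str) -> str:
        for short, expanded in ABBREVIATIONS.items():
            text = text.replace(short, expanded)
        return re.sub(r"\s+", " ", text).strip()

    print(normalize("Dr. Smith drove at 60 km/h, e.g. through town."))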
Do you want to check how it works? You can test the operation of standard voices and advanced neural voices at this URL: <a href="https://vtts.xyz/home/tryme" rel="nofollow">https://vtts.xyz/home/tryme</a>
I'm not sure what to make of this. The TTS output seems identical to the recording.<p>Why don't they use this tech to recreate some dead actor's speech, for example?
This is definitely human-level quality. In fact, the synthesized versions pronounce some words better than the human. Kudos to MSFT! I think they've been in the game the longest, too...<p>edit: is the Nuance acquisition compounding yet?
Very cool, but...<p>What's the end game here? Because I cannot use it, I cannot buy it, and this seems like more than just a scientific paper.<p>So what's the objective here?
Can someone please upload the results?
On <a href="https://paperswithcode.com/sota/text-to-speech-synthesis-on-ljspeech" rel="nofollow">https://paperswithcode.com/sota/text-to-speech-synthesis-on-...</a>