Google Cloud Text-To-Speech Powered by DeepMind WaveNet Technology

412 points by pseudobry about 7 years ago

25 comments

coryfklein about 7 years ago

Google also announced their Tacotron engine today, which features new prosody modeling for speech generation. It allows them to generate speech that mimics personal intonation, accents, and rhythm, effectively mimicking an individual's "expression" in their speech. HN discussion here: https://news.ycombinator.com/item?id=16691197

slaymaker1907 about 7 years ago

For any Google devs lurking out there, it doesn't seem to work at all in Firefox on Windows. It looks like it has something to do with custom web components, with the following message: "ReferenceError: customElements is not defined". There are also apparently some assertion errors with webcomponents (minified, so the line numbers are not useful).

ollin about 7 years ago

Does anyone have a GitHub project for epub -> mp3 using this service yet (for automatic audiobook generation)? I may make it myself if I have time, but I'm curious whether anyone has already set it up. EDIT: this is almost exactly their sample application (https://github.com/GoogleCloudPlatform/python-docs-samples/tree/master/texttospeech/cloud-client). I was able to get it working with epubs using pypandoc within the hour. Now I just need to make it upload to Overcast... EDIT 2: Can now convert epubs directly to mp3s on Overcast. Yay!
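
For anyone wanting to try the same epub -> mp3 idea, here is a minimal sketch of the part that is easy to get wrong: splitting a long book into request-sized pieces. It is not the commenter's actual code. The 4500-character limit, the function names `chunk_text` and `synthesize_mp3`, and the voice name are assumptions for illustration, and the synthesis call assumes a recent `google-cloud-texttospeech` release (older releases exposed these types differently), so check the current docs before relying on it.

```python
import re


def chunk_text(text: str, max_chars: int = 4500) -> list[str]:
    """Split text into pieces of at most max_chars, breaking at sentence ends."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that is longer than the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def synthesize_mp3(chunks, out_path, voice_name="en-US-Wavenet-D"):
    """Send each chunk to the service and append the MP3 bytes to one file."""
    # Imported lazily so the chunker works without the client library installed.
    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()
    voice = texttospeech.VoiceSelectionParams(language_code="en-US", name=voice_name)
    config = texttospeech.AudioConfig(audio_encoding=texttospeech.AudioEncoding.MP3)
    with open(out_path, "wb") as out:
        for chunk in chunks:
            response = client.synthesize_speech(
                input=texttospeech.SynthesisInput(text=chunk),
                voice=voice,
                audio_config=config,
            )
            out.write(response.audio_content)
```

The plain text itself could come from something like `pypandoc.convert_file("book.epub", "plain")`. Appending MP3 segments back to back usually plays fine, but a proper audio tool is the safer way to join them.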

qeternity about 7 years ago

The average English word is 4.5 characters and the average English speaker speaks 110-150 words per minute. This means that at $16 per 1M characters, we can generate speech at a cost between $28.57-39/hr. Per Google's post, WaveNet now costs 50ms of TPU time per 1s of speech generated, meaning that, at 100% utilization, a TPU can generate somewhere between $571.40-780/hr of speech. Google's TPUs can be deployed (by third parties) at $6.50/hr. That's some sweet sweet margin.
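
The arithmetic is easy to redo, so here is a small sketch that recomputes it from the inputs stated in the comment. The constant and function names are mine, and it assumes spaces and punctuation are not billed (4.5 characters per word, nothing else), which is a simplification.

```python
PRICE_PER_MILLION_CHARS = 16.00        # USD, WaveNet price quoted in the comment
CHARS_PER_WORD = 4.5                   # average English word, spaces not counted
TPU_SECONDS_PER_SPEECH_SECOND = 0.050  # 50 ms of TPU time per 1 s of speech
TPU_PRICE_PER_HOUR = 6.50              # USD, quoted third-party TPU price


def speech_price_per_hour(words_per_minute: float) -> float:
    """Billed amount for one hour of generated speech."""
    chars_per_hour = words_per_minute * CHARS_PER_WORD * 60
    return chars_per_hour * PRICE_PER_MILLION_CHARS / 1_000_000


def tpu_revenue_per_hour(words_per_minute: float) -> float:
    """Billed amount for the speech one fully utilized TPU produces per hour."""
    speech_hours_per_tpu_hour = 1 / TPU_SECONDS_PER_SPEECH_SECOND
    return speech_price_per_hour(words_per_minute) * speech_hours_per_tpu_hour


for wpm in (110, 150):
    print(f"{wpm} wpm: ${speech_price_per_hour(wpm):.2f} per speech hour, "
          f"${tpu_revenue_per_hour(wpm):.2f} per TPU hour")
# 110 wpm: $0.48 per speech hour, $9.50 per TPU hour
# 150 wpm: $0.65 per speech hour, $12.96 per TPU hour
```

Under these assumptions the results come out roughly 60 times lower than the figures quoted in the comment, which may point to a minutes-versus-hours slip, so the units are worth rechecking before comparing against the $6.50/hr TPU price.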

StavrosK about 7 years ago

Here's a simple Python script that will fetch some sample audio using the request on the demo page and save it in a file: https://www.pastery.net/nujfhw/ I have no idea what the rate limits are, so please don't abuse it. I wrote it because the demo didn't work in Firefox and I wanted to play around with it more extensively.

Jakob about 7 years ago

Having an English text but setting the language to another one like German or French is hilarious. You get e.g. ze Dscherman aczent or de frensch onehe.

tambourine_man about 7 years ago

The US English synthesized version is truly remarkable. Borderline scarily good. The fact that the preview only seems to work on Chrome (and silently breaks everywhere else) is not cool, though.

PostOnce about 7 years ago

Am I wrong in thinking that the cost of generating (realistic-sounding, learned-model) speech on commodity hardware will be near-zero soon, largely negating the value of a SaaS? I've been waiting a long time for decent-sounding open source TTS software to narrate books to me, and now with deep learning it's either here or very nearly here, and the hardware is going to keep getting more performant at the same price. I guess that will be very appealing to businesses relying on TTS (e.g. call centers, phone robots, mobile apps with TTS, etc.).

neom about 7 years ago

As someone who struggles greatly with the written word, I'm so thankful to see this. For the last year or so I've poked around every few months to see if they'd opened this up more generally. I'd be more than happy to pay $30-60/month (more if it had Spritz) for high-quality, high-speed text-to-speech for the emails, documents and news articles I'd like to consume.

benjismith about 7 years ago

Interesting! I'd love to see a thorough comparison with the Amazon Polly service: https://aws.amazon.com/polly/ Polly is priced at $4 per million characters and the Google WaveNet voices are $16 (compared with the Google non-WaveNet voices, which are also $4). After listening to a few samples from each service, the voice quality and prosody modeling seem roughly on par between Polly and WaveNet, or at least the differences I heard didn't seem to justify a 4x price multiplier. But I'd love to hear an informed opinion from someone with more expertise...
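
To make the 4x multiplier concrete, here is a small sketch that prices the same text at the three per-million-character rates quoted above. The names are illustrative, the prices are the ones stated in this comment rather than current list prices, and free tiers or volume discounts are ignored.

```python
# USD per one million characters, as quoted in the comment above.
PRICES_PER_MILLION = {
    "Amazon Polly": 4.0,
    "Google standard": 4.0,
    "Google WaveNet": 16.0,
}


def synthesis_cost(num_chars: int, price_per_million: float) -> float:
    """Cost in USD of synthesizing num_chars characters at the given rate."""
    return num_chars * price_per_million / 1_000_000


def compare(num_chars: int) -> dict[str, float]:
    """Cost of the same text on each service."""
    return {
        name: synthesis_cost(num_chars, price)
        for name, price in PRICES_PER_MILLION.items()
    }


print(compare(250_000))
# {'Amazon Polly': 1.0, 'Google standard': 1.0, 'Google WaveNet': 4.0}
```

At these rates the absolute difference stays small for short texts, so the quality comparison probably matters more than the price until volumes get large.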

joelthelion about 7 years ago

I for one welcome our new wavenet telemarketing overlords...

WheelsAtLarge about 7 years ago

It's very good. The voices remind me of speech from real-life people with accents. It's good enough for voice-overs where real-life voices would previously have been too expensive. I would say that it's better than Amazon's Polly when it's used to read long passages of text.

aviv about 7 years ago

I have not seen any mention of licensing or of whether you can cache and replay voice responses. Amazon Polly specifically allows caching.

ryeguy_24 about 7 years ago

Based on the pricing of $16 per 1 million characters (roughly equal to a 400-500 page book), doesn't this severely threaten the voiceover marketplace? I just priced a human voiceover on VoiceBunny.com for a 400-page book and got an average turnaround time of 90 days at a cost of $15K, versus WaveNet's $16 and only 30 minutes of computational time. That sounds like an interesting disruptor to me.

bufferoverflow about 7 years ago

I wish they had some beautiful voices, not some of the most generic-sounding men and women.

remir about 7 years ago

Imagine teaching these voices to sing. Something like a DeepMind WaveNet Song Generator. You upload your music to the cloud, set some parameters (genre, tempo, emotion, etc.) and a bunch of lyrics, and the thing spits out awesome vocals for you.

ImJasonH about 7 years ago

Quick, someone remake Translation Party using Speech-to-Text-to-Speech-to-Text-to-Speech-ad-infinitum: https://cloud.google.com/text-to-speech/docs/quickstart and https://cloud.google.com/speech/docs/sync-recognize

tmalsburg2 about 7 years ago

This is great, but there remain very difficult problems to be solved. The prosody generated by this is fairly generic and not informed by a true understanding of the text. Consider this sentence: "I have plans to leave." If you stress the word "plans", the sentence means that the speaker is not necessarily intending to actually leave. However, when the stress is on "leave", the speaker definitely intends to leave. A human reader can easily infer the correct meaning from context, but text-to-speech systems can't, because they don't have any systematic understanding of the things being talked about or of the social pragmatics of the discourse. As long as these issues aren't solved, text-to-speech systems will make mistakes. These mistakes will be easy to spot in some cases but can have catastrophic consequences in others: "I have plans to bomb North Korea."
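
Until a system can infer this on its own, the stress has to be marked by hand. A minimal sketch of that, assuming the service accepts SSML input with the standard `<emphasis>` element: the helper name `emphasize` is made up for illustration, and how strongly a given voice honors the tag may vary.

```python
from xml.sax.saxutils import escape


def emphasize(sentence: str, word: str, level: str = "strong") -> str:
    """Wrap the first occurrence of word in an SSML <emphasis> element."""
    before, found, after = sentence.partition(word)
    if not found:
        raise ValueError(f"{word!r} does not occur in the sentence")
    # Escape the text parts so characters like & or < stay valid in SSML.
    return (
        f'<speak>{escape(before)}<emphasis level="{level}">'
        f"{escape(found)}</emphasis>{escape(after)}</speak>"
    )


print(emphasize("I have plans to leave.", "plans"))
# <speak>I have <emphasis level="strong">plans</emphasis> to leave.</speak>
print(emphasize("I have plans to leave.", "leave"))
# <speak>I have plans to <emphasis level="strong">leave</emphasis>.</speak>
```

This only moves the problem to whoever writes the markup, which is exactly the point of the comment: the disambiguation still needs a reader who understands the context.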

kokimame about 7 years ago

I've been using Amazon Polly for a few months to make videos for language learners. I've noticed that the English voices powered by WaveNet are slightly better than Amazon's, but the default Japanese voice sounds much worse. Anyway, their pricing and platform are almost the same as Amazon's, so I definitely need to add another interface for this TTS to my app. You can listen to Amazon Polly voices in the video I made: https://www.youtube.com/watch?v=ysMp0k4oR5c

lysp about 7 years ago

I picked 3 random paragraphs from a random article on a local online news site. The voices did sound quite natural and "news-readery"; however, the one issue I did find is with pauses between words. Take the example phrase: "He bought himself a boat and then took it to his house". You would often expect a small pause after the word "boat". I was able to fix it manually by adding some commas and full stops, but the AI was not able to pick up those pauses naturally. It sounded like someone rushing through the speech instead of stopping occasionally to "take a breath".
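
An alternative to sprinkling in commas is to mark the pause explicitly. This is a rough sketch, assuming the text is sent as SSML and that the standard `<break>` element is supported. The helper name `add_pause_after` and the 300 ms default are arbitrary choices for illustration.

```python
from xml.sax.saxutils import escape


def add_pause_after(sentence: str, word: str, pause_ms: int = 300) -> str:
    """Insert an SSML <break> right after the first occurrence of word."""
    before, found, after = sentence.partition(word)
    if not found:
        raise ValueError(f"{word!r} does not occur in the sentence")
    # Escape the text parts so characters like & or < stay valid in SSML.
    return (
        f"<speak>{escape(before + found)}"
        f'<break time="{pause_ms}ms"/>{escape(after)}</speak>'
    )


print(add_pause_after("He bought himself a boat and then took it to his house.", "boat"))
# <speak>He bought himself a boat<break time="300ms"/> and then took it to his house.</speak>
```

This keeps the visible text unchanged, but it still relies on a human deciding where the breath belongs.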

coryfklein about 7 years ago

The demo is available at https://cloud.google.com/text-to-speech/ Requires Chrome.

verelo about 7 years ago

Is it just me, or would a demo really make this posting much more interesting? Edit: There is one, on the actual Google Cloud Text-To-Speech page, so a few clicks in and you'll get one.

StavrosK about 7 years ago

Is there any API for generating speech that sounds like Google Now's assistant? The quality of that is much, much better than this new service.

tristanj about 7 years ago

Are there any voice samples?

daoudc about 7 years ago

I had an idea this morning for a personalised "podcast" that could read out e.g. the weather in your area, any new and important emails, the headlines and first paragraph of top stories from your favourite sources and notifications from social media. I think this is the missing thing that was needed to make this viable.
