Google also announced their Tacotron engine today, which adds prosody modeling to speech generation. It lets them generate speech that mimics a person's intonation, accent, and rhythm, effectively mimicking an individual's "expression" in their speech.<p>HN discussion here: <a href="https://news.ycombinator.com/item?id=16691197" rel="nofollow">https://news.ycombinator.com/item?id=16691197</a>
For any Google devs lurking out there, it doesn't seem to work at all in Firefox on Windows. It looks like it has something to do with custom web components with the following message:<p>ReferenceError: customElements is not defined<p>Also apparently some assertion errors with webcomponents (minified so line numbers not useful).
Does anyone have a GitHub project for epub -> mp3 using this service yet (for automatic audiobook generation)? May make it myself if I have time but curious if anyone already has set it up.<p><i>EDIT</i>: this is almost exactly their sample application (<a href="https://github.com/GoogleCloudPlatform/python-docs-samples/tree/master/texttospeech/cloud-client" rel="nofollow">https://github.com/GoogleCloudPlatform/python-docs-samples/t...</a>). Was able to get it working with epubs using pypandoc within the hour. Now just need to make it upload to Overcast...<p><i>EDIT 2</i>: Can now convert epubs directly to mp3s on Overcast. Yay!
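For anyone attempting the same epub to mp3 pipeline, one practical step is splitting the book text into request-sized pieces before synthesis. The following is only a sketch: `split_for_tts` is a hypothetical helper, and the 5,000-character default is a placeholder assumption, not a figure from this thread, so check the service's quota documentation for the real per-request limit.

```python
import re


def split_for_tts(text: str, max_chars: int = 5000) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    max_chars is an assumed per-request limit, not a documented value.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split a single over-long sentence so no chunk exceeds the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(split_for_tts("One. Two. Three.", max_chars=9))
# ['One. Two.', 'Three.']
```

Each chunk can then be synthesized separately and the resulting audio files concatenated in order.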
The average English word is 4.5 characters and the average English speaker speaks 110-150 words per minute, which works out to roughly 29,700-40,500 characters per hour of speech. At $16/1m characters, that prices generated speech at about $0.48-0.65/hr. Per Google's post, WaveNet now costs 50ms of TPU time per 1s of speech generated, meaning, at 100% utilization, a TPU can generate about 20 hours of speech per hour, or roughly $9.50-12.96/hr of billable output. Google's TPUs can be deployed (by third parties) at $6.50/hr. That's still a comfortable margin.
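The back-of-the-envelope numbers can be rechecked with a few lines of Python. This is only a sketch: every constant is an input quoted in the comment (4.5 characters per word, 110-150 words per minute, $16 per million characters, 50 ms of TPU time per second of speech, $6.50 per TPU-hour), and the function names are made up for illustration.

```python
# Inputs quoted in the comment; none of these are measured here.
CHARS_PER_WORD = 4.5
PRICE_PER_MILLION_CHARS = 16.0          # USD, WaveNet voices
TPU_SECONDS_PER_SPEECH_SECOND = 0.050   # 50 ms of TPU time per 1 s of speech
TPU_PRICE_PER_HOUR = 6.50               # USD, third-party TPU price


def price_per_speech_hour(words_per_minute: float) -> float:
    """List price in USD of one hour of generated speech."""
    chars_per_hour = words_per_minute * 60 * CHARS_PER_WORD
    return chars_per_hour * PRICE_PER_MILLION_CHARS / 1_000_000


def revenue_per_tpu_hour(words_per_minute: float) -> float:
    """USD billed per TPU-hour, assuming 100% utilization."""
    speech_hours_per_tpu_hour = 1 / TPU_SECONDS_PER_SPEECH_SECOND
    return price_per_speech_hour(words_per_minute) * speech_hours_per_tpu_hour


for wpm in (110, 150):
    print(wpm,
          round(price_per_speech_hour(wpm), 2),
          round(revenue_per_tpu_hour(wpm), 2))
# 110 0.48 9.5
# 150 0.65 12.96
```

Comparing the last column with `TPU_PRICE_PER_HOUR` gives the gross margin per TPU-hour under these assumptions.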
Here's a simple Python script that fetches some sample audio using the same request the demo page makes and saves it to a file:<p><a href="https://www.pastery.net/nujfhw/" rel="nofollow">https://www.pastery.net/nujfhw/</a><p>I have no idea what the rate limits are, so please don't abuse it. I wrote it because the demo didn't work in Firefox and I wanted to play around with it more extensively.
Having an English text but setting the language to another one like German or French is hilarious.<p>You get e.g. ze Dscherman aczent or de frensch onehe.
The US English synthesized version is truly remarkable. Borderline scarily good.<p>The fact that the preview only seems to work on Chrome (and silently breaks everywhere else) is not cool, though.
Am I wrong in thinking that the cost of generating (realistic-sounding, learned-model) speech on commodity hardware will be near-zero soon, largely negating the value of a SaaS?<p>I've been waiting a long time for decent-sounding open source TTS software for narrating books to me, and now with deep learning it's either here or very near here, and the hardware is going to keep getting more performant at the same price. I guess that will be very appealing to businesses relying on TTS (e.g. call centers, phone robots, mobile apps with TTS, etc.).
As someone who struggles greatly with the written word, I'm so thankful to see this. For the last year or so I've poked around every few months to see if they'd opened this up more generally. I'd be more than happy to pay $30-60/month (more if it had Spritz) for high-quality, high-speed text-to-speech for the emails, documents and news articles I'd like to consume.
Interesting! I'd love to see a thorough comparison with the Amazon Polly service...<p><a href="https://aws.amazon.com/polly/" rel="nofollow">https://aws.amazon.com/polly/</a><p>Polly is priced at $4 per million characters and the Google WaveNet voices are $16 (compared with the Google non-WaveNet voices, which are also $4).<p>After listening to a few samples from each service, the voice quality and prosody modeling seem roughly on par between Polly and WaveNet, or at least the differences I heard didn't seem to justify a 4x price multiplier.<p>But I'd love to hear an informed opinion from someone with more expertise...
It's very good. The voices remind me of real people speaking with accents. It's good enough for voice-overs where real-life voices would previously have been too expensive. I would say that it's better than Amazon's Polly when it's used to read long passages of text.
Based on the pricing of $16 per 1 million characters (roughly equal to a 400-500 page book), doesn't this severely threaten the voiceover marketplace? I just priced a human voiceover on VoiceBunny.com for a 400-page book and got an average turnaround time of 90 days and a cost of $15K, vs WaveNet's $16 cost and only 30 mins of computational time. That sounds like an interesting disruptor to me.
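The per-book figure is easy to reproduce. A minimal sketch, assuming 2,500 characters per page (an assumption chosen so that 400 pages is about 1 million characters, in line with the estimate above); `book_cost_usd` is a hypothetical helper name.

```python
def book_cost_usd(pages: int, chars_per_page: int,
                  price_per_million_chars: float = 16.0) -> float:
    """Synthesis cost in USD for a book of the given size."""
    return pages * chars_per_page * price_per_million_chars / 1_000_000


wavenet = book_cost_usd(400, 2500)   # $16 per million characters
human_quote = 15_000                 # USD, the VoiceBunny figure quoted above

print(wavenet)                # 16.0
print(human_quote / wavenet)  # 937.5
```

Under these assumptions the human quote is roughly three orders of magnitude above the synthesis cost, before accounting for any editing or quality checks.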
Imagine teaching these voices to sing. Something like DeepMind WaveNet Song Generator.<p>You upload your music to the cloud, set some parameters (genre, tempo, emotion, etc) and a bunch of lyrics and the thing will spit out awesome vocals for you.
This is great, but there remain very difficult problems to be solved. The prosody generated by this is fairly generic and not informed by a true understanding of the text. Consider this sentence:<p>I have plans to leave.<p>If you stress the word "plans", the sentence means that the speaker is not necessarily intending to actually leave. However, when the stress is on "leave", the speaker definitely intends to leave. A human reader can easily infer the correct meaning from context but text-to-speech systems can't because they don't have any systematic understanding of the things being talked about and the social pragmatics of the discourse. As long as these issues aren't solved, text-to-speech systems will make mistakes. These mistakes will be easy to spot in some cases but can also have catastrophic consequences in other cases: "I have plans to bomb North Korea."
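Where the intended reading is known in advance, stress can be requested explicitly with SSML rather than left to the model. A minimal sketch using the standard SSML `<emphasis>` element; `stress` is a hypothetical helper, and whether a given voice renders the emphasis audibly is not guaranteed. Deciding which word to stress still has to come from a human or some other system, which is exactly the hard part described above.

```python
from xml.sax.saxutils import escape


def stress(sentence: str, word: str, level: str = "strong") -> str:
    """Wrap the first occurrence of `word` in an SSML <emphasis> element."""
    before, found, after = sentence.partition(word)
    if not found:
        raise ValueError(f"{word!r} not found in sentence")
    return (
        "<speak>"
        + escape(before)
        + f'<emphasis level="{level}">{escape(word)}</emphasis>'
        + escape(after)
        + "</speak>"
    )


print(stress("I have plans to leave.", "plans"))
# <speak>I have <emphasis level="strong">plans</emphasis> to leave.</speak>
print(stress("I have plans to leave.", "leave"))
# <speak>I have plans to <emphasis level="strong">leave</emphasis>.</speak>
```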
I've been using Amazon Polly for a few months to make videos for language learners. I find the English voices powered by WaveNet slightly better than Amazon's, but the default Japanese voice sounds much worse. Anyway, their pricing and platform are almost the same as Amazon's, so I definitely need to add another interface for this TTS to my app. You can listen to Amazon Polly voices in the video I made: <a href="https://www.youtube.com/watch?v=ysMp0k4oR5c" rel="nofollow">https://www.youtube.com/watch?v=ysMp0k4oR5c</a>
I picked 3 random paragraphs from a random article on a local online news site.<p>The voices did sound quite natural and "news-readery"; however, the one issue I found is the lack of pauses between words.<p>Take the example phrase: "He bought himself a boat and then took it to his house". You would usually expect a small pause after the word "boat".<p>I was able to fix it manually by adding some commas and full stops, but the AI was not able to pick up those pauses naturally.<p>It sounded like someone was rushing through the speech instead of stopping occasionally to "take a breath".
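An alternative to editing the punctuation is to request the pause explicitly with the standard SSML `<break>` element. A minimal sketch: `pause_after` is a hypothetical helper and the 300 ms duration is an arbitrary assumption to tune by ear.

```python
from xml.sax.saxutils import escape


def pause_after(sentence: str, word: str, ms: int = 300) -> str:
    """Insert an SSML <break> right after the first occurrence of `word`."""
    before, found, after = sentence.partition(word)
    if not found:
        raise ValueError(f"{word!r} not found in sentence")
    return (
        "<speak>"
        + escape(before + word)
        + f'<break time="{ms}ms"/>'
        + escape(after)
        + "</speak>"
    )


print(pause_after("He bought himself a boat and then took it to his house", "boat"))
# <speak>He bought himself a boat<break time="300ms"/> and then took it to his house</speak>
```

This keeps the source text unchanged, at the cost of having to mark up each pause by hand.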
The demo is available at <a href="https://cloud.google.com/text-to-speech/" rel="nofollow">https://cloud.google.com/text-to-speech/</a><p>Requires Chrome.
Is it just me, or would a demo really make this posting much more interesting?<p>Edit: There is one, on the actual Google Cloud Text-To-Speech page, so a few clicks in and you'll get one.
I had an idea this morning for a personalised "podcast" that could read out e.g. the weather in your area, any new and important emails, the headlines and first paragraph of top stories from your favourite sources and notifications from social media.<p>I think this is the missing thing that was needed to make this viable.