Personally, I dislike any "emotion" added to TTS. I find Alexa's emotion markup, a la:<p><a href="https://developer.amazon.com/en-US/blogs/alexa/alexa-skills-kit/2019/11/new-alexa-emotions-and-speaking-styles" rel="nofollow">https://developer.amazon.com/en-US/blogs/alexa/alexa-skills-...</a><p>to be disturbing and of little added value (as used in games like Jeopardy).<p>If used, these tags need to be applied meticulously and only in the proper context, somewhat non-deterministically, and with randomized prosody. Repeating the same overstated emotive content is annoying and unnatural (worse than a "flat" presentation) and only underscores the inflexible conversational content underneath.
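A rough Python sketch of what "somewhat non-deterministically applied, with randomized prosody" could look like. Everything here is an assumption for illustration: the helper name `render_ssml`, the `apply_prob` parameter, and the jitter ranges are made up. The `amazon:emotion` tag with `name` and `intensity` attributes follows the Alexa post linked above, and `prosody` is standard SSML. The sketch deliberately does not nest the two tags, since I am not sure whether Alexa accepts that combination.

```python
import random
from xml.sax.saxutils import escape

# Emotion names and intensities as described in the Alexa announcement.
EMOTIONS = ("excited", "disappointed")
INTENSITIES = ("low", "medium", "high")


def render_ssml(text, emotion=None, apply_prob=0.3, rng=None):
    """Wrap text in SSML, adding an emotion tag only some of the time.

    Hypothetical helper: apply_prob and the jitter ranges are guesses,
    not tuned or recommended values.
    """
    if rng is None:
        rng = random.Random()
    body = escape(text)  # escape &, <, > so the SSML stays well-formed

    if emotion in EMOTIONS and rng.random() < apply_prob:
        # Vary the intensity too, and avoid "high" so repeats are less grating.
        intensity = rng.choice(INTENSITIES[:2])
        body = (
            f'<amazon:emotion name="{emotion}" intensity="{intensity}">'
            f"{body}</amazon:emotion>"
        )
    else:
        # Otherwise just add a little jitter to rate and pitch.
        rate = rng.randint(95, 105)
        pitch = rng.randint(-3, 3)
        body = f'<prosody rate="{rate}%" pitch="{pitch:+d}%">{body}</prosody>'

    return f"<speak>{body}</speak>"


if __name__ == "__main__":
    for _ in range(3):
        print(render_ssml("That is correct!", emotion="excited"))
```

With a low `apply_prob`, most responses get only a slight prosody variation and the emotive delivery shows up occasionally, which is closer to the "not the same overstated line every time" behaviour described above. Deciding whether the context warrants emotion at all is a separate problem that this sketch does not attempt.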
Exciting to see our research making broad impact across the industry! <a href="https://arxiv.org/abs/1802.08435" rel="nofollow">https://arxiv.org/abs/1802.08435</a>
Speech synthesis has always baffled me. You could run a reasonable (albeit strangely accented) version on 16 MHz Macs without major CPU impact. The code, including sound data, was less than a megabyte.<p>Now, to achieve modest improvements in dictation, we're throwing entire GPU arrays at the problem. What happened in the middle? Was there really no room for improvement until we went full AI?
Impressive, but it still sounds "robotic", like AWS Polly. I wonder if they'll combine this with the tech that lets you sample someone's voice from a single paragraph and build a synthetic voice from it. Then you could hire a voice actor(ress) and maybe license their voice? I don't know how that would work.
The weaknesses of TTS twig different people in different ways. For example, Microsoft Zira and the older Google TTS voice rank near the top for me, while I find every single one of the modern Google voices so horrible as to provoke instant anger when I hear them.
Yeah, awesome! This proprietary transcription algorithm must make things a hell of a lot easier for NSA databases. If this is deployed and used by FB so they send the finished, full transcripts of calls and other voice traffic [1] instead of the original audio to be transcribed later, it will all be more efficient! // sarcasm<p>[1] <a href="https://theintercept.com/2015/05/05/nsa-speech-recognition-snowden-searchable-text/" rel="nofollow">https://theintercept.com/2015/05/05/nsa-speech-recognition-s...</a>