The iOS 11 Siri sounds like a real person talking; it's amazing. Does anyone know if there's an open-source TTS library available with this quality (or if anyone is working on one based on this paper)?

I would love to have my home speakers announce things in this voice.
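If the speakers hang off a Mac, the system voices are already scriptable through the built-in `say` command (run `say -v '?'` in a terminal to list the installed voices). A minimal sketch in Python, assuming macOS; the `announce` helper and the "Samantha" voice are just illustrative, not anything from the paper:

    import subprocess

    def announce(text: str, voice: str = "Samantha") -> None:
        """Speak text aloud via the macOS system TTS engine."""
        # -v selects an installed voice; list them with `say -v '?'`.
        subprocess.run(["say", "-v", voice, text], check=True)

    announce("The laundry is done.")

Whether Apple exposes the new Siri voice through that interface is a separate question; the blog post only describes the on-device engine itself.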
A research paper published by Apple? About Siri?! Unheard of! The last time I was at an NLP conference with Apple employees, they wouldn't say anything about how Siri's speech worked, despite being very inquisitive about everyone else's publications. Good to see some change.
My favorite part is that the runtime runs on-device. I moved back to Android, but one thing Apple consistently does that I like is that they don't move things to the internet as often as Google does. On Android, you get degraded TTS when the connection is shoddy.
I couldn't read the paper yet, and I know very little about this, but listening to the audio samples it seems that one of the most notable changes is the intonation at phrase transitions. Did anyone else catch something like that? I'm not sure I'm doing a good job of explaining it. If you listen to all the iOS 11 samples it'll stand out.

Anyway, it's the only way I can still identify this as a fake voice: the intonation always follows the same cadence (not sure if that's the word?). We really shouldn't have overused the word "awesome" before this kind of thing came along.

There's a kind of dread too, tbh; this kind of seamless TTS has the potential to change a lot of things. Criminals are going to love this, and YouTube pranksters too. Eventually it will shake up the voice acting industry, possibly in an unhealthy way for voice actors, while at the same time allowing projects with smaller budgets to have incredible voice work (and dubbing).

What I think is really important, though, is that as we move away from the uncanny valley, our relationships with these voices change: our brains don't have the capacity to listen to a voice this real and not imagine it as a person, even as adults.

Ironically, at this moment I'm wearing an old Threadless sweatshirt that says "this was supposed to be the future," but nowadays I can honestly say we're getting there.
The difference between the Siri voices from iOS 9 to 11 is startling. I can still hear some issues, especially at the ends of phrases, but it's extremely good.
This just made me realize that every time you see a strong AI in fiction, it still has a computer-sounding voice. If we ever develop strong AI, we will probably already have perfectly natural speech synthesis. And if not, the AI could develop it for us.

But I suppose an AI might choose to use a computer-sounding voice to remind us that it is a computer. Kind of like those inaccurate sound effects in movies: they have become so common that it seems more wrong to omit them. (TV Tropes calls this "The Coconut Effect.")
The prosody and continuity of the speech is dramatically improved. This is hard to do and very impressive (especially given that it is being done on-device).

Personally, I'm less pleased with the new voice itself, although that is more a subjective judgment. After listening to many hundreds of voice talent auditions for Alexa, it's hard to step back from that level of pickiness.
Kinda sad to see that the names of the authors are omitted from the post itself, although you can infer some of them from the quote:

> For more details on the new Siri text-to-speech system, see our published paper "Siri On-Device Deep Learning-Guided Unit Selection Text-to-Speech System"

[9] T. Capes, P. Coles, A. Conkie, L. Golipour, A. Hadjitarkhani, Q. Hu, N. Huddleston, M. Hunt, J. Li, M. Neeracher, K. Prahallad, T. Raitio, R. Rasipuram, G. Townsend, B. Williamson, D. Winarsky, Z. Wu, H. Zhang. "Siri On-Device Deep Learning-Guided Unit Selection Text-to-Speech System," Interspeech, 2017.

Why not just list the names by default?
It might seem silly, but I'm looking forward to the first AI talk therapist. Most of the benefit of therapy is the talking, so it's not as crazy as it sounds.
Good blog post and audio samples notwithstanding, it's annoying that they didn't put the paper on arXiv. As they themselves point out in the blog post, the learning architecture was introduced in 2014's "Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis," so it's not clear how much of this is good engineering vs. novel research.
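For anyone unfamiliar with that 2014 architecture: a mixture density network (MDN) replaces a plain regression output with the parameters of a Gaussian mixture, so the model predicts a distribution over acoustic targets rather than a single point estimate, and per the paper's title those predicted distributions guide unit selection instead of generating the waveform directly. A minimal sketch of such an output head in Python/PyTorch; the names and shapes are hypothetical, not Apple's implementation:

    import torch
    import torch.nn as nn

    class MDNHead(nn.Module):
        """Maps hidden features to the parameters of a K-component
        Gaussian mixture over a D-dimensional acoustic target."""

        def __init__(self, hidden_dim: int, target_dim: int, n_components: int):
            super().__init__()
            self.K, self.D = n_components, target_dim
            # One projection yields mixture logits, means, and log-variances.
            self.proj = nn.Linear(hidden_dim, n_components * (1 + 2 * target_dim))

        def forward(self, h: torch.Tensor):
            logits, mu, log_var = self.proj(h).split(
                [self.K, self.K * self.D, self.K * self.D], dim=-1
            )
            weights = torch.softmax(logits, dim=-1)          # mixture weights
            mu = mu.reshape(*h.shape[:-1], self.K, self.D)   # component means
            sigma = (0.5 * log_var).exp().reshape(*h.shape[:-1], self.K, self.D)
            return weights, mu, sigma

Training such a head minimizes the mixture's negative log-likelihood of the observed acoustic features; at synthesis time the predicted distributions score candidate units in the selection search.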
The obvious question is a head-to-head qualitative comparison vs. WaveNet. It seems that they have advanced Siri relative to prior Siri, but does this work advance the field?
There's no question the diction of the iOS 11 voice is much improved. But I liked the voice & timbre of the old speaker better; it sounds more authoritative.
Now if only it didn't feel like Siri is choosing from a very small pool of pre-set options whenever I ask it to do a task. It still feels rather restricted, but I'm excited that they're really investing in it.
I don't like the higher pitch and sharper tone in iOS 11. I prefer the warmer, deeper tone of iOS 10; it feels like having a more mature, experienced assistant.