TTS has definitely evolved over the years. If you compare the Google Maps voice to Stephen Hawking's, it's night and day.<p>However, I can understand why TTS technology looks stagnant. Part of this is that going from nothing to something reasonable happened exceedingly quickly. Early TTS research was supported by the US government, which saw that early systems were comprehensible (if not wonderful sounding) and declared victory. Funding went to other problems in computational linguistics (like speech recognition, information extraction, etc.) and so did a lot of the workforce.<p>Modern systems usually involve many hours of recorded speech from a single person and use variable-length units to form more natural speech. Many systems you hear still sound pasted together because that's how enterprise technology goes. How many banks have online banking that seems like it's from the 90s? You can't judge what systems can do by what some call center has installed. Someone has linked to Inova. Google and Nuance have good systems as well, but there's a balance between resources and perfect speech.<p>As for some of the issues: when you're searching through a finite amount of recorded speech, you have to choose something that fits. It isn't going to be perfect in many cases. You have to deal with things like F0 declination (the gradual fall in pitch over the course of an utterance). You have to deal with how long phonemes are going to be. You have to deal with breaks in utterances.<p>And the fact is that we can understand Hawking's 1980s TTS.<p>If you want to start thinking about the problem more, try inputting these three sentences into Inova: "Do you really want to see all of it? Do you want to see all of it? I want to see all of it." Notice how it tries to rise around "really" in the first sentence. It's trying to match how we would speak: rising for "really" in the first sentence and rising at the end of both questions. But it kinda misses in both cases. 
Still, in some ways it's amazing that it recognizes "really" as something that should go up. It recognizes that questions go up at the end. It recognizes that the non-question goes down as the sentence progresses. And it finds things within its data set to fit how it thinks the sentence is going to go. But it doesn't have perfect language understanding, so it doesn't know exactly how things would be said. A lot of sounding natural isn't about producing the phonemes more accurately, but about the intonation and attitude of the speech. It also has to find something that fits. Lots of smart things are done, but it's pulling from a limited amount of recorded speech: speech that has been sliced in many useful ways, but still limited.<p>TTS has definitely evolved and I think that Google, Nuance, and others are pushing it forward. You're going to interact with a lot of legacy systems that feel like they're still in the Hawking era. But most ATMs I use don't even have touch screens (opting for buttons on the side of the screen), and even fancy ATMs like Wells Fargo's don't feel like an iPad. You don't want to judge TTS by systems that are so far behind modern, commercially available ones.<p>There is definitely work being done on it and it has become much better. But to an extent, it isn't something that a lot of companies are going to work on. How big is the market for TTS? Before you say, "it's useful in loads of things," think about the market for maps. A lot of it is Google or Apple Maps. Loads of apps integrate mapping, but don't want to map the world or run their own infrastructure for serving it. Some use OpenStreetMap, but they're really just serving generated tiles rather than re-mapping the world. If you were to create a TTS startup, what would your business model be? Pay us money to TTS your text rather than getting it for free from Google's Android TTS? The issue is that TTS is more a feature than a product. 
Companies like Amazon might bring a company like Inova in-house. Nuance sells their stuff to companies like Apple. Google is large enough to build it. But as a startup you'd be pitching something that doesn't directly solve a problem for customers and trying to hire very smart (expensive) people who you hope will be able to come up with something new that isn't an obvious "with more resources, better" solution. Remember, TTS needs to run in real time, possibly on low-powered mobile devices (don't eat their battery or storage) or over the network (don't make our AWS spend go through the roof). If you're going to sell to an app maker as a no-network TTS, how much bloat are you adding to the app?<p>It's just a hard market to be in, given that the 1980s solution works even if it doesn't sound realistic, and modern, already-available systems are quite reasonable.
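<p>To make the "you have to choose something that fits" point a bit more concrete, here is a rough sketch of unit selection as a cheapest-path search. Everything in it is my own assumption for illustration, not how any particular vendor does it: the Unit and Target classes, the cost weights, the straight-line pitch ramp standing in for F0 declination, and the made-up numbers. Real systems use far richer features and vastly larger inventories.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A recorded snippet in the (hypothetical) speech inventory."""
    phoneme: str
    f0_hz: float        # average pitch of the recording
    duration_ms: float  # length of the recording


@dataclass(frozen=True)
class Target:
    """What we would ideally like to hear at one position."""
    phoneme: str
    f0_hz: float
    duration_ms: float


def declining_f0(n, start_hz=130.0, end_hz=100.0):
    """Toy F0 declination: pitch falls linearly across n positions."""
    if n == 1:
        return [start_hz]
    step = (end_hz - start_hz) / (n - 1)
    return [start_hz + i * step for i in range(n)]


def target_cost(target, unit, w_f0=1.0, w_dur=0.5):
    """How badly a recorded unit misses the desired pitch and length."""
    return (w_f0 * abs(target.f0_hz - unit.f0_hz)
            + w_dur * abs(target.duration_ms - unit.duration_ms))


def join_cost(prev_unit, unit):
    """How audible the seam is: here, just the pitch jump between units."""
    return abs(prev_unit.f0_hz - unit.f0_hz)


def select_units(targets, inventory):
    """Viterbi search for the unit sequence with the lowest total cost."""
    candidates = [[u for u in inventory if u.phoneme == t.phoneme]
                  for t in targets]
    if any(not c for c in candidates):
        raise ValueError("inventory has no recording for some phoneme")

    # best[i][j] = (cheapest cost ending in candidates[i][j], back-pointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            row.append(min(
                (best[i - 1][k][0] + join_cost(p, u) + tc, k)
                for k, p in enumerate(candidates[i - 1])
            ))
        best.append(row)

    # Walk the back-pointers from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


if __name__ == "__main__":
    inventory = [
        Unit("s", 128.0, 80.0), Unit("s", 100.0, 80.0),
        Unit("iy", 132.0, 80.0), Unit("iy", 102.0, 80.0),
    ]
    pitches = declining_f0(2)
    targets = [Target("s", pitches[0], 80.0), Target("iy", pitches[1], 80.0)]
    chosen, cost = select_units(targets, inventory)
    print([u.f0_hz for u in chosen], cost)
```

The thing to notice is the tension between the two costs: the unit that best matches the pitch you want may leave an audible seam next to its neighbor, so the search often settles for something that is merely close. With a small inventory, "close" is what ends up sounding pasted together.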