Ask HN: Are any startups working on text-to-speech?

36 points by bossx over 9 years ago

It seems like TTS technology hasn't evolved much over the years and I was wondering if any startups are working on making it sound more realistic?

24 comments

ivan_ah over 9 years ago

Have you tried the TTS in Mac OS X? You can run it on the command line using:

    say 'this is a test'

I for one think it's very good quality (at least for the Alex voice).
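For scripting the command above, here is a minimal Python sketch. It assumes macOS, where `say` is on the PATH, and uses the `-v` (voice) and `-o` (output file) options of `say`. The function names `build_say_command` and `speak` are illustrative only, not part of any library.

```python
import shutil
import subprocess


def build_say_command(text, voice=None, outfile=None):
    """Build the argument list for the macOS `say` command."""
    cmd = ["say"]
    if voice:
        cmd += ["-v", voice]      # e.g. "Alex"
    if outfile:
        cmd += ["-o", outfile]    # write audio to a file instead of playing it
    cmd.append(text)
    return cmd


def speak(text, voice=None, outfile=None):
    """Run `say` if it is installed; return False on systems without it."""
    if shutil.which("say") is None:
        return False
    subprocess.run(build_say_command(text, voice, outfile), check=True)
    return True
```

Calling `speak("this is a test", voice="Alex")` should be equivalent to running `say -v Alex 'this is a test'` in a terminal.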

compumike over 9 years ago

What do you see as the business use case for more realistic text-to-speech?

We use TTS extensively within the Pantelligent iOS and Android apps, and it's something our users requested and get a lot of value out of. It seems like the existing solutions are already good enough / dramatically above the threshold of usefulness for an interactive real-time-guidance application like ours, and just keep getting better from the OS side.

lutusp over 9 years ago

> It seems like TTS technology hasn't evolved much over the years and I was wondering if any startups are working on making it sound more realistic?

It's not true that TTS hasn't improved (see below). Many people are working on this, both in academia and in private enterprise. It's an obvious and potentially valuable part of the human-computer interface.

This is not to suggest that it's easy -- the mathematics and vocal tract modeling problems are formidable. The only reason there are reasonable TTS resources now is because of the rapid increase in computer power -- power that's needed to support this feature.

Here's a site chosen at random that offers a high-quality TTS example:

https://www.ivona.com/

It's pretty good based on prevailing standards, and it's the outcome of a lot of work.

To find the companies working on this, just Google for "high-quality tts".

insoluble over 9 years ago

I've thought about getting into this area myself, but I was too afraid there was not enough market for it. This was several years ago. As far as CPU power, today's average PC is easily 5x what's necessary for perfect speech. The real question is the algorithms being used. Perhaps the fear is that the algorithm would be pirated. I mean just look at the history of digital audio (or video) and encoding, with things like Xvid and Ogg. Basically every time a good algorithm even starts to gain traction, an open "alternative" is made available practically overnight. This is not to say I don't like open algorithms. In fact, I believe that any standard algorithm should be open. This fact, however, is enough to deter research in this field. Perhaps a Web service that converted text to speech would be one option, but it would have limited applicability.

Edit: Perhaps a Kickstarter or similar would be a good idea since this type of feature would be useful to so many people. Nearly everyone has functioning ears. (no offence to those who don't)

praccu over 9 years ago

Ivona, acquired by Amazon, is a hugely notable example. They're great folks, and do great work. Alexa's voice was done by them, but an early customer of theirs was the Polish public transit system.

A lot of really great work is happening in academia; I'm not going to name names because I'd forget someone deserving.

(Shameless plug: we [0] do speech and language consulting including custom TTS.)

[0] cobaltspeech.com

lorenzorhoades over 9 years ago

An ML-based app to read articles to me in the morning on my way to work may have some commercial success. The current ones have a hard time reading through an entire article without sounding like a robot, or without reading a headline as if it were part of the previous sentence.

tkjef over 9 years ago

I made this app called Ultimate Alerts that allows for all email and text messages to be read over TTS when your car goes over 10mph for over 1 minute.

It switches back to normal settings when you go below 10mph for over 1 minute. Helpful for automatically switching everything to TTS when you're driving.

Lots of other functionality as well. Check it out: https://play.google.com/store/apps/details?id=com.org.imsono.emailnew&hl=en

jacquesm over 9 years ago

https://news.ycombinator.com/item?id=10925826

May be of interest to you.

tomasien over 9 years ago

FWIW I think TTS is a bad interface. That said, IBM Watson is getting pretttty good. Check it out, they're willing to work with startups too.

pshapiro99 over 9 years ago

I've also noticed this stagnation. The quality of the spoken voice TTS sound depends upon two things, I've heard -- processor speed and memory (RAM). Processor speed has increased dramatically in the past few years. I wish someone would design TTS that only works on the fastest processors. There seems to be too much lowest-common-denominator going on in this field.

yrezgui over 9 years ago

Have a look at the Watson Text to Speech API: http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/text-to-speech.html

AndrewMBliss over 9 years ago

This is not really "text to speech" but a voice search engine. It is called Mobvoi, and Google is an investor. http://chumenwenwen.com/global.html

andrewbarba over 9 years ago

I've messed around with Api.ai (https://docs.api.ai/docs/tts) and it is quite impressive. Full-featured, amazing pricing (free), and goes way beyond just TTS.

lazyjones over 9 years ago

OSX's "say" is amazing (judging from their English and German voices). Do you believe there's enough room for improvement to build a business case on it, even though some companies have worked on the problem for decades?

infocollector over 9 years ago

Does anyone know of a good one that will work without an internet connection on Ubuntu 14.04?

Animats over 9 years ago

Try Vocaloid.[1] It's overkill for plain text to speech, but quite good. Better for Japanese than for English.

[1] http://www.vocaloid.com/en/

bckmn over 9 years ago

I'm building a web-to-speech(-to-podcast) app that's gaining traction. Adding new voices and integrations all the time. https://narro.co

elchudi2 over 9 years ago

http://www.mivoq.it/en/ is trying to advance TTS technology

jerelunruh over 9 years ago

I recently ran across this for the browser: http://responsivevoice.org/

hobonumber1 over 9 years ago

Try Houndify from SoundHound! (www.houndify.com).

justincormack over 9 years ago

Apple bought Cambridge UK startup VocalIQ that does this (in part, they also did recognition work) a year or so ago.

npalli over 9 years ago

Here are three vendors that provide good TTS APIs. Have you evaluated their performance? What did you find lacking?

1. Nuance
2. AT&T
3. IBM Watson

voiceclonr over 9 years ago

Shameless plug. Give www.voiceclonr.com a try (something I built a while back)

mdasen over 9 years ago

TTS has definitely evolved over the years. If you compare Google Maps voice to Stephen Hawking, it's night and day.

However, I can definitely understand how TTS technology looks stagnant. Part of this is that going from nothing to something reasonable happened exceedingly quickly. Early TTS research was supported by the US government, which saw that early systems were comprehensible (if not wonderful sounding) and declared victory. Funding went to other problems in computational linguistics (like speech recognition, information extraction, etc.) and so did a lot of the workforce.

Modern systems usually involve many hours of speech from a single person and use variable length units to form more natural speech. Many systems still sound pasted together because that's how enterprise technology goes. How many banks have online banking that seems like it's from the 90s? You can't compare what systems can do to what some call center has installed as its technology. Someone has linked to Ivona. Google and Nuance have good systems as well, but there's a balance between resources and perfect speech.

In terms of some of the issues. . . When you're going through a finite amount of recorded speech, you have to choose something that fits. It isn't going to be perfect in many cases. You have to deal with things like F0 declination. You have to deal with how long phonemes are going to be. You have to deal with breaks in utterances.

And the fact is that we can understand Hawking's 1980s TTS.

If you want to start thinking about the problem more, try inputting these sentences into Ivona: "Do you really want to see all of it? Do you want to see all of it? I want to see all of it." Notice how it tries to rise around "really" in the first sentence. It's trying to match how we would speak - rising for "really" in the first sentence and rising for the question-ending in both questions. But it kinda misses in both cases.

Still, in some ways it's amazing that it recognizes "really" as something that should go up. It recognizes that questions go up at the end. It recognizes how the non-question goes down as the sentence progresses. And it finds things within its data set to fit to how it thinks the sentence is going to go. But it doesn't have perfect language understanding so it doesn't know exactly how things would be said - a lot of sounding natural isn't making the phonemes more accurately, but the intonation and attitude of the speech. It also has to find something that fits. Lots of smart things are done, but it's pulling from a limited amount of recorded speech - speech that has been sliced in many useful ways, but still limited.

TTS has definitely evolved and I think that Google, Nuance, and others are definitely pushing it forward. You're going to interact with a lot of legacy systems that feel like they're still in the Hawking era. But most ATMs I use don't even have touch screens (opting for buttons on the side of the screen) and even fancy ATMs like Wells Fargo don't feel like an iPad. You don't want to compare to systems that are so far away from modern, commercially available systems.

There is definitely work being done on it and it's definitely become much better. But to an extent, it isn't something that a lot of companies are going to work on. How big is the market for TTS? Before you say, "it's useful in loads of things," think about the market for maps. A lot of it is Google or Apple Maps. Loads of apps integrate mapping, but don't want to map the world or run their own infrastructure for serving it. Some use OpenStreetMap, but they're really just serving generated tiles rather than re-mapping the world. If you were to create a TTS startup, what would your business model be? Pay us money to TTS your text rather than getting it for free from Google's Android TTS? The issue is that TTS is more a feature than a product.

Companies like Amazon might bring a company like Ivona in-house. Nuance sells their stuff to companies like Apple. Google is large enough to build it. But you'd be pitching something that doesn't directly solve something for customers and trying to hire very smart (expensive) people who you hope will be able to come up with something new that isn't an obvious "with more resources, better" solution. Remember, TTS needs to be done in real-time, possibly on low-powered mobile devices (don't eat their battery or storage) or over the network (don't make our AWS spend go through the roof). If you're going to sell to an app maker as a no-network TTS, how much bloat are you adding to the app?

It's just a hard market to be in given that the 1980s solution works, even if it doesn't sound realistic, and modern already-available systems are quite reasonable.
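The "choose something that fits" step described above can be illustrated with a toy unit-selection search. This is only a sketch under stated assumptions, not how Ivona, Google, or Nuance implement it: the `Unit` class, the cost functions and their weights, and the one-unit-per-phoneme simplification (real systems use variable length units) are all hypothetical. It picks one recorded snippet per target so that the combined mismatch with the desired prosody (target cost) and the pitch jumps at the seams (join cost) is as small as possible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One recorded snippet in a hypothetical speech database."""
    phoneme: str
    pitch_hz: float
    duration_ms: float


def target_cost(unit, want_pitch, want_duration):
    # How far the snippet is from the prosody we would like to hear.
    return abs(unit.pitch_hz - want_pitch) + 0.1 * abs(unit.duration_ms - want_duration)


def join_cost(left, right):
    # A pitch jump at the seam is what makes speech sound pasted together.
    return abs(left.pitch_hz - right.pitch_hz)


def select_units(targets, database):
    """Dynamic-programming search over candidates.

    targets: list of (phoneme, wanted_pitch_hz, wanted_duration_ms).
    Returns the cheapest sequence of Units, one per target.
    """
    columns = []  # candidate units for each target position
    best = []     # best[i][k] = (cheapest total cost ending in candidate k, back-pointer)
    for i, (phoneme, pitch, duration) in enumerate(targets):
        candidates = [u for u in database if u.phoneme == phoneme]
        if not candidates:
            raise ValueError(f"no recording for phoneme {phoneme!r}")
        row = []
        for unit in candidates:
            cost = target_cost(unit, pitch, duration)
            if i == 0:
                row.append((cost, None))
            else:
                prev_k, prev_cost = min(
                    ((k, best[i - 1][k][0] + join_cost(prev, unit))
                     for k, prev in enumerate(columns[i - 1])),
                    key=lambda pair: pair[1],
                )
                row.append((cost + prev_cost, prev_k))
        columns.append(candidates)
        best.append(row)

    # Walk the back-pointers from the cheapest final candidate.
    k = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(columns[i][k])
        k = best[i][k][1]
    return path[::-1]
```

With a database holding two "a" snippets (100 Hz and 200 Hz) and two "b" snippets (105 Hz and 195 Hz), asking for "a" at 100 Hz followed by "b" at 180 Hz selects the 105 Hz "b": it matches the wanted pitch less well than the 195 Hz one, but it joins far more smoothly. That trade-off is the sense in which the result "isn't going to be perfect in many cases".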
