Hey HN, I’m the founder of AssemblyAI (https://www.assemblyai.com). We're building an API for customizable speech recognition. Developers and companies use our API for things like transcribing phone calls and building voice-powered smart devices. Unlike current speech recognition APIs, ours can be customized to more accurately recognize an unlimited number of industry-specific words or phrases unique to what you're building, without any training required. For example, you can recognize thousands of product or person names, or more accurately recognize the commands and phrases common or custom to your use case.

We've developed our own deep neural network speech recognition architecture and aren't using any open source speech frameworks like Kaldi or Sphinx (just TensorFlow). Because of this, we're able to run things more affordably and pass those savings on to developers.

Before starting AssemblyAI, I worked on projects with speech recognition requirements, and saw how limiting, expensive, and hard to work with traditional speech recognition services and APIs were. We want to help developers and companies easily build products with speech recognition.

Would love feedback from the HN community on what we're building. If you have any questions about deep learning, or deep learning in production, ask away!
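To make the customization concrete, a request with custom terms looks roughly like this (simplified sketch; the parameter names below are illustrative placeholders, not the exact fields from our docs):

    import requests

    # Illustrative only: "custom_vocabulary" and the endpoint path here
    # are placeholders, not the documented API. Check the docs for the
    # real field names.
    response = requests.post(
        "https://api.assemblyai.com/transcript",
        headers={"authorization": "YOUR_API_KEY"},
        json={
            "audio_url": "https://example.com/support-call.wav",
            # Industry-specific terms to recognize; no model training step.
            "custom_vocabulary": ["AssemblyAI", "levothyroxine", "SKU-9182"],
        },
    )
    print(response.json())

The point is that the custom terms ride along with the request itself; there's no separate model training or adaptation job to run first.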
You seem to be using a slightly tweaked CTC-based architecture built in TensorFlow (possibly with Baidu's warp-ctc) but marketing it as some super-secret technology you invented in-house. I don't see any performance benchmarks or WER results we can compare with other APIs, yet the pricing is the same. Sure, a character-based approach lets you add new words without pronunciations, but that process is not as flawless as you make it seem, especially when you lack language model data for the new words. So I'm still a bit confused about why somebody would use AssemblyAI over other APIs at the same price. And FYI, you are not using Kaldi/Sphinx because the people behind them did not endorse CTC and have purposefully kept it out, though Kaldi's chain models, for example, are also sequence-based. There was also Eesen, which tried to implement CTC on top of Kaldi. Sorry if this comes off too harsh, but I am a little suspicious about the novelty of the approach here.
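For anyone who hasn't seen CTC: the setup I'm describing is roughly this (a generic TensorFlow sketch, not their actual model; I obviously can't see their code):

    import tensorflow as tf

    # Generic CTC head: an encoder emits per-frame character
    # probabilities, and the CTC loss aligns them to the label string
    # without needing a pronunciation lexicon.
    batch, frames, feat_dim, vocab = 4, 100, 80, 29  # a-z + space + ' + CTC blank (last index)

    features = tf.random.normal([batch, frames, feat_dim])  # stand-in log-mel frames
    encoder = tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(128, return_sequences=True))
    probs = tf.keras.layers.Dense(vocab, activation="softmax")(encoder(features))

    labels = tf.constant([[7, 4, 11, 11, 14]] * batch)  # "hello" as character indices
    loss = tf.keras.backend.ctc_batch_cost(
        y_true=labels,
        y_pred=probs,                           # [batch, frames, vocab]
        input_length=tf.fill([batch, 1], frames),
        label_length=tf.fill([batch, 1], 5),
    )

Character-level outputs are exactly why adding new words "just works" on paper, and also why recognition of those words suffers when the language model has never seen them.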
I remember that 10 years ago Nuance used legal threats to eliminate competition in this field, to an extent that greatly discouraged speech recognition startups.

Google was able to get around that only because they became the bigger player.

Has this changed significantly since then?
> We've developed our own deep neural network speech recognition architecture, and aren't using any open source speech frameworks like Kaldi or Sphinx (just Tensorflow). Because of this, we're able to run things more affordably and pass those savings on to developers.<p>Kaldi and Sphinx are <i>far</i> more efficient than any tensorflow transcription model I've ever seen.<p>I assume this is an oversight ?
Been wanting something like this for years. I have a bunch of old speeches and radio shows I'd like to transcribe. They all have "terms of art", and no one at Google would tell me how to train their API to adapt to my use case. Too bad I missed this beta; hope you let more people in soon.

Can you clarify: does your API allow me to run the transcriber, pause it when I see an error, tell it what the corrected text is, and then continue with that correction taken into account?
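Concretely, the loop I'm picturing is something like this (every name below is made up to illustrate the workflow, no claim that it matches your actual API):

    # A stub standing in for a streaming transcription session that
    # accepts mid-stream corrections; all names here are invented.

    class StubTranscriber:
        def __init__(self, segments):
            self.segments = segments
            self.corrections = {}            # wrong term -> corrected term

        def __iter__(self):
            for text in self.segments:
                # Apply any corrections learned so far to later segments.
                for wrong, right in self.corrections.items():
                    text = text.replace(wrong, right)
                yield text

        def correct(self, wrong, right):
            self.corrections[wrong] = right  # later segments pick this up

    session = StubTranscriber([
        "the fireside chat on nucular policy",
        "nucular deterrence was the theme",
    ])
    stream = iter(session)
    print(next(stream))                      # spot "nucular" in the output...
    session.correct("nucular", "nuclear")    # ...pause and supply the fix...
    print(next(stream))                      # ...the next segment reflects it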
Small issue I noticed with the email links on the pricing page: they're swapped, with "Basic" having an "Enterprise Plan" subject line and vice versa.