This is HUGE in my opinion. Prior to this, in order to get near state-of-the-art speech recognition in your system/application you either had to have/hire expertise to build your own or pay Nuance a significant amount of money to use theirs. Nuance has always been a "big bad" company in my mind. If I recall correctly, they've sued many of their smaller competitors out of existence and only do expensive enterprise deals. I'm glad their near monopoly is coming to an end.<p>I think Google's API will usher in a lot of new innovative applications.
> To attract developers, the app will be free at launch with pricing to be introduced at a later date.<p>Doesn't this mean you could spend time developing and building on the platform without knowing if your application is economically feasible? Seems like a huge risk to take for anything other than a hobby project.
I came across the CMU Sphinx speech recognition library (<a href="http://cmusphinx.sourceforge.net" rel="nofollow">http://cmusphinx.sourceforge.net</a>), which has a BSD-style license and just shipped a big update last month. It supports embedded and remote speech recognition. Could be a nice alternative for someone who doesn't need all of the bells and whistles and prefers more control over relying on an API that may not be free for long.<p>Side note: if anyone is interested in helping with an embedded voice recognition project, please ping me.
Tangentially related: does anyone remember the name of the startup/service that was on HN (I believe) that lets you infer actions from plaintext?<p>E.g. "Switch on the lights" becomes<p>{"action": "switch_on", "thing": "lights"}<p>etc. I'm trying really hard to remember the name but it escapes me.<p>Speech recognition and <above service> will go very well together.
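For simple commands you don't even need a service; a toy rule-based version of that text-to-action mapping can be sketched in a few lines of Python (the verb and article lists here are made up for illustration, not from any real product):

```python
import json

# Toy intent parser: maps simple imperative phrases to action objects.
# The verb vocabulary below is illustrative only.
VERBS = {
    "switch on": "switch_on",
    "switch off": "switch_off",
    "turn on": "switch_on",
    "turn off": "switch_off",
}

def parse_command(text):
    """Return an {"action", "thing"} dict, or None if no known verb matches."""
    lowered = text.lower().strip()
    for phrase, action in VERBS.items():
        if lowered.startswith(phrase):
            # Drop the verb phrase and a leading article to get the object.
            thing = lowered[len(phrase):].strip()
            for article in ("the ", "a ", "an "):
                if thing.startswith(article):
                    thing = thing[len(article):]
            return {"action": action, "thing": thing}
    return None

print(json.dumps(parse_command("Switch on the lights")))
```

Obviously a real service does far more (fuzzy matching, entity extraction, learned models), but the output shape is the same.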
In case you're not interested in having google run your speech recognition:<p>CMU Sphinx:
<a href="http://cmusphinx.sourceforge.net/" rel="nofollow">http://cmusphinx.sourceforge.net/</a><p>Julius:
<a href="http://julius.osdn.jp/en_index.php" rel="nofollow">http://julius.osdn.jp/en_index.php</a>
If you're having trouble (like me) finding your "Google Cloud Platform user account ID" to sign up for Limited Preview access: it's just the email address for your Google Cloud account. Took me only 40 minutes to figure that one out.
I wrote a client library for this in C# by reverse engineering what Chrome did at the time (totally not legit/unsupported by Google, possibly against their ToS). I have never used it for anything serious, and am glad there is now an endorsed way to do this.<p><a href="https://bitbucket.org/josephcooney/cloudspeech" rel="nofollow">https://bitbucket.org/josephcooney/cloudspeech</a>
Key sentence:<p>> The Google Cloud Speech API, which will cover over 80 languages and will work with any application in real-time streaming or batch mode, will offer full set of APIs for applications to “see, hear and translate,” Google says.
Pretty impressive from the limited look the website (<a href="https://cloud.google.com/speech/" rel="nofollow">https://cloud.google.com/speech/</a>) gives: the fact that Google will clean the audio of background noise for you and supports streamed input is particularly interesting.<p>I don't know how I should feel about Google taking even more data from me (and other users). How would integrating this service work legally? Would you need to alert users that Google will keep their recordings on file (probably indefinitely and without being able to delete them)?
Unless I have gone crazy, Google has had an STT API available to tinker with for a while. It is one of the options for Jasper [1]. Hopefully this means it will be easier to set up now.<p>Would be nice if they just open sourced it, though I imagine that is at cross purposes with their business.<p>[1] <a href="https://jasperproject.github.io/documentation/configuration/" rel="nofollow">https://jasperproject.github.io/documentation/configuration/</a>
SoundHound released Houndify[1], their voice API last year which goes deeper than just speech recognition to include Speech-to-Meaning, Context and Follow-up, and Complex and Compound Queries. It will be cool to see what people will do with speech interfaces in the near future.<p>[1] <a href="https://www.houndify.com/" rel="nofollow">https://www.houndify.com/</a>
Houndify launched last year and provides both speech recognition and natural language understanding. They have a free plan that never expires and transparent pricing. It can handle very complex queries that Google can't.
FWIW I'd just finished a large blog post researching ways to automate podcast transcription and subsequent NLP.<p>It includes lots of links to relevant research, tools, and services. Also includes discussion of the pros and cons of various services (Google/MS/Nuance/IBM/Vocapia etc.) and the value of vocabulary uploads and speaker profiles.<p><a href="http://blog.timbunce.org/2016/03/22/semi-automated-podcast-transcription-2/" rel="nofollow">http://blog.timbunce.org/2016/03/22/semi-automated-podcast-t...</a>
For anyone who wants to try these areas a bit:<p>My trial of a Python speech library on Windows:<p>Speech recognition with the Python "speech" module:<p><a href="http://jugad2.blogspot.in/2014/03/speech-recognition-with-python-speech.html" rel="nofollow">http://jugad2.blogspot.in/2014/03/speech-recognition-with-py...</a><p>and also the opposite:<p><a href="http://code.activestate.com/recipes/578839-python-text-to-speech-with-pyttsx/?in=user-4173351" rel="nofollow">http://code.activestate.com/recipes/578839-python-text-to-sp...</a>
FWIW, Google followed the same strategy with Cloud Vision (iirc): they released it in closed beta for a couple of months [0], then made it generally available with a pricing structure [1].<p>I've never used Nuance but I've played around with IBM Watson [2], which gives you 1000 free minutes a month, and then 2 cents a minute afterwards. Watson allows you to upload audio in 100MB chunks (or is it 10 minute chunks? I forget), whereas Google currently allows 2 minutes per request (edit: according to their signup page [5])...but both Watson and Google allow streaming, so that's probably a non-issue for most developers.<p>From my non-scientific observation, Watson does pretty well, such that I would consider using it for quick, first-pass transcription...it even gets a surprising number of proper nouns right, including "ProPublica" and "Ken Auletta" -- though it fudges things in other cases...its vocab does not include "Theranos", which is variously transcribed as "their in house" and "their nose" [3]<p>It transcribed the "Trump Steaks" commercial nearly perfectly...even getting the homophones in "<i>when it comes to great steaks I just raise the stakes the sharper image is one of my favorite stores with fantastic products of all kinds that's why I'm thrilled they agree with me trump steaks are the world's greatest steaks and I mean that in every sense of the word and the sharper image is the only store where you can buy them</i>"...though later on, it messed up "steak/stake" [4]<p>It didn't do as great a job on this Trump "Live Free or Die" commercial, possibly because of the booming theme music...I actually did a spot check with Google's API on this, and while Watson didn't get "New Hampshire" at the beginning, Google <i>did</i> [4].
Judging by how well YouTube manages to caption videos of all sorts, I would say that Google probably has a strong lead in overall accuracy when it comes to audio in the wild, just based on the data it processes.<p>edit: fixed the Trump steaks transcription...Watson transcribed the first sentence correctly, but not the other "steaks"<p>[0] <a href="http://www.businessinsider.com/google-offers-computer-vision-tech-2015-12" rel="nofollow">http://www.businessinsider.com/google-offers-computer-vision...</a><p>[1] <a href="http://9to5google.com/2016/02/18/cloud-vision-api-beta-pricing/" rel="nofollow">http://9to5google.com/2016/02/18/cloud-vision-api-beta-prici...</a><p>[2] <a href="https://github.com/dannguyen/watson-word-watcher" rel="nofollow">https://github.com/dannguyen/watson-word-watcher</a><p>[3] <a href="https://gist.github.com/dannguyen/71d49ff62e9f9eb51ac6" rel="nofollow">https://gist.github.com/dannguyen/71d49ff62e9f9eb51ac6</a><p>[4] <a href="https://www.youtube.com/watch?v=EYRzpWiluGw" rel="nofollow">https://www.youtube.com/watch?v=EYRzpWiluGw</a><p>[5] <a href="https://services.google.com/fb/forms/speech-api-alpha/" rel="nofollow">https://services.google.com/fb/forms/speech-api-alpha/</a>
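If the 2-minutes-per-request limit holds for batch use, splitting a long recording is trivial to plan out; a sketch (the 120-second cap is taken from the signup page, everything else is illustrative):

```python
# Sketch: split a long recording into chunks that fit a per-request limit
# (e.g. Google's stated 2 minutes per audio request). Durations in seconds.
MAX_CHUNK_SECONDS = 120

def chunk_bounds(total_seconds, max_chunk=MAX_CHUNK_SECONDS):
    """Yield (start, end) second offsets covering the whole recording."""
    start = 0
    while start < total_seconds:
        end = min(start + max_chunk, total_seconds)
        yield (start, end)
        start = end

# A 5-minute file becomes three requests: two full chunks and a 60s remainder.
print(list(chunk_bounds(300)))
```

Cutting on exact boundaries can split a word in half, though, so in practice you'd want to snap each boundary to the nearest silence, which is another argument for just using the streaming mode.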
"Google may choose to raise those prices over time, after it becomes the dominant player in the industry."<p>...Isn't that specifically what antitrust laws were written to prevent?
I would say that Google's main goal here is in expanding their training data set, as opposed to creating a new revenue stream. If it hurts competitors (e.g. Nuance) that might only be a side-effect of that main objective, and likely they will not aim to hurt the competition intentionally.<p>As others here have pointed out, the value now for GOOG is in building the best training data-set in the business, as opposed to just racing to find the best algorithm.
Has anyone tried adding OpenEars to their app, to prevent having to send things over the internet from e.g. a basement? Is it any good at recognizing basic speech?
In the sign-up form they state: "Note that each audio request is limited to 2 minutes in length." Does anyone know what an "audio request" is? Does it mean real-time recognition is limited to 2 minutes, or just that longer recordings will count as more "audio requests" and result in a higher bill?<p>Do they provide a way to send audio via WebRTC or WebSocket from a browser?
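For the batch case, I'd guess a request is one JSON body with the audio base64-encoded inside it, something like the sketch below. To be clear: the field names and structure here are my assumptions (the API is still in limited preview), so treat this as pseudocode for "base64 the audio and POST JSON", not a working client:

```python
import base64
import json

# Sketch of building a batch recognition request body. The "config"/"audio"
# field names and LINEAR16 encoding value are guesses, not confirmed API fields.
def build_request_body(audio_bytes, sample_rate=16000):
    return json.dumps({
        "config": {"encoding": "LINEAR16", "sampleRate": sample_rate},
        "audio": {"content": base64.b64encode(audio_bytes).decode("ascii")},
    })

body = build_request_body(b"\x00\x01" * 8)
print(body)
```

Note that base64 inflates the payload by about a third, which matters if requests are also size-capped.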
I thought I read open source, then I realized open access. I believe in the past there was a similar API, or maybe it was based on Google Translate. But I swear at one point people wrote hackathon projects using some voice APIs.
Nice! Curious how it compares to Amazon's AVS (Alexa Voice Service), which went public this week.<p><a href="https://github.com/amzn/alexa-avs-raspberry-pi" rel="nofollow">https://github.com/amzn/alexa-avs-raspberry-pi</a>
I would be hesitant to build an entire application that relied on this API only to have it removed in a few months or years when Google realizes it sucks up time and resources and makes them no money.
Cool, next up is a way to tweak the speech API to recognize patterns in stocks and capex... wasn't that what Renaissance Technologies did?<p>Really, GOOG should democratize quant stuff next: DIY hedge fund algos.
I'm seeing many libraries mentioned here; I wonder what's the best open, multi-platform software for speech recognition to code with vim, Atom, etc. The only thing I've seen working is a hybrid system with Dragon + Python on Windows. I would like to train/customize my own system, since I'm starting to have pain in my tendons and wrists. Do you think this Google API could do it? Not being local looks like a limiting factor for speed/lag.
What is the difference between a speech recognition API and [NLP libraries](<a href="https://opennlp.apache.org/" rel="nofollow">https://opennlp.apache.org/</a>)? This information was not easily found with a few Google searches, so I figured others might have the same question.
Don't get too excited: <a href="https://www.google.com/search?q=google+shuts+down+api" rel="nofollow">https://www.google.com/search?q=google+shuts+down+api</a>
I hope this opens up some new app possibilities for the Pebble Time. I believe right now they use Nuance, and it's limited to only responding to texts.
So this was very, very exciting until I realized you have to be using Google Cloud Platform to sign up for the preview. Unfortunately all of my stuff is in AWS, and while I <i>could</i> move it over, I'm not going to (far too much hassle to preview an API I may not end up using, ultimately).<p>Regardless, this is still very exciting. I haven't found anything that's as good as Google's voice recognition. I only hope this ends up being cheap and accessible outside of their platform.