The most interesting bit for me is at the end of another blog entry:<p><a href="http://blogs.technet.com/b/inside_microsoft_research/archive/2012/06/14/deep-neural-network-speech-recognition-debuts.aspx" rel="nofollow">http://blogs.technet.com/b/inside_microsoft_research/archive...</a><p>"An intern at Microsoft Research Redmond, George Dahl, now at the University of Toronto,<p><a href="http://www.cs.toronto.edu/~gdahl/" rel="nofollow">http://www.cs.toronto.edu/~gdahl/</a><p>contributed insights into the working of DNNs and experience in training them. His work helped Yu and teammates produce a paper called Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition.<p><a href="http://research.microsoft.com/pubs/144412/DBN4LVCSR-TransASLP.pdf" rel="nofollow">http://research.microsoft.com/pubs/144412/DBN4LVCSR-TransASL...</a><p>In October 2010, Yu presented the paper during a visit to Microsoft Research Asia. Seide was intrigued by the research results, and the two joined forces in a collaboration that has scaled up the new, DNN-based algorithms to thousands of hours of training data."
The demo site (<a href="http://www.msravs.com/audiosearch_demo/" rel="nofollow">http://www.msravs.com/audiosearch_demo/</a>) blocks browsers other than IE and Firefox based on the user agent string. Use WebKit's developer tools to change your user agent and you'll be able to get in.
Imagine the power of this for students. This would have made school so much easier. Simply record every lecture and then use this to search for keywords.<p>Awesome.
Can someone please explain senones to me? I can't find much on Google.<p>The article says that they are a fragment of a phoneme, but how small a fragment are we talking? 2-3 per phoneme, or many more?<p>Also - I'd be curious how much the phonemes in a word can vary based on accent.
For those keeping score, Google's image feature extractor shares the same core principles as Microsoft's speech recognizer.<p>EDIT: by keeping score I mean keeping track of which techniques are being used where.
On an immediately useful practical note, OneNote also contains this functionality (obviously not as powerful). I've used it to record a meeting's audio synced to my notes, and then been able to search the audio to jump exactly to where someone mentioned something and review the context. Saved my ass on at least one occasion.
Research paper on the system: <a href="http://www.se.cuhk.edu.hk/hccl/publications/pub/HLT2006.pdf" rel="nofollow">http://www.se.cuhk.edu.hk/hccl/publications/pub/HLT2006.pdf</a>
This seems very related to this <a href="http://www.youtube.com/watch?v=ZmNOAtZIgIk" rel="nofollow">http://www.youtube.com/watch?v=ZmNOAtZIgIk</a> talk by Andrew Ng. It is a 40-minute talk, but he explains very simply how all this works for images, with some examples from the audio case.
It is incredible how, using these deep learning techniques, we can teach these "neural networks" to recognize such complicated patterns. It is like reverse engineering the brain's algorithms.<p>BTW I took his Coursera course on Machine Learning and it was great! I also recommend it A LOT for gathering basic ML knowledge.
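To make the idea concrete: the kind of DNN acoustic model discussed here is, at its core, a feed-forward network mapping a window of acoustic features to a probability distribution over senone classes. Here is a minimal illustrative sketch (not Microsoft's actual system; the layer sizes, ReLU activations, and random weights are all made-up assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical dimensions: 440 inputs (e.g. 11 stacked frames of 40
# filterbank features), two hidden layers, 1000 senone classes.
sizes = [440, 256, 256, 1000]
weights = [rng.standard_normal((m, n)) * 0.01 for m, n in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def senone_posteriors(frame_window):
    """Forward pass: acoustic feature window -> senone probabilities."""
    h = frame_window
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)
    return softmax(h @ weights[-1] + biases[-1])

probs = senone_posteriors(rng.standard_normal(440))
print(probs.shape)  # (1000,) -- one posterior per senone, summing to 1
```

In a real recognizer these per-frame posteriors would feed into an HMM decoder, and the weights would of course be learned from thousands of hours of speech rather than randomly initialized.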
How does this compare to Microsoft's old HTK (HMM Toolkit)? The language used on the website seems to point to a lot of the same things. Is this breaking it down to actual IPA phonemes?<p>I'm mostly curious because I used the HTK for my thesis and would like to know how they compare (besides one just being 'newer').
Vlingo, Siri, and others have been doing speaker-independent, auto-adapting speech recognition for years, and the talk of systems requiring 'training' and of improvements there makes this article sound five years old. Great to see innovation in this space, but this article is very light on detail.
related link: <a href="http://research.microsoft.com/en-us/news/features/speechrecognition-082911.aspx" rel="nofollow">http://research.microsoft.com/en-us/news/features/speechreco...</a>