This is great. I particularly like that they also automatically generated dirty versions for their training set, because that's exactly what I ended up doing for my dissertation project (a computer vision system [1] that automatically referees Scrabble boards). I also used dictionary analysis and the classifier's own confusion matrix to boost its accuracy.<p>If you're also interested in real-time OCR like this, I did a write-up [2] of the approach that worked well for my project. It only needed to recognize Scrabble fonts, but it could be extended to more fonts by using more training examples.<p>[1] <a href="http://brm.io/kwyjibo/" rel="nofollow">http://brm.io/kwyjibo/</a><p>[2] <a href="http://brm.io/real-time-ocr/" rel="nofollow">http://brm.io/real-time-ocr/</a>
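For anyone curious what generating "dirty" versions can look like in practice, here's a rough Pillow/numpy sketch of the general idea (the font path, offsets, and noise levels are placeholders to tweak, not anyone's actual pipeline):

  import numpy as np
  from PIL import Image, ImageDraw, ImageFont, ImageFilter

  def dirty_sample(char, font_path="DejaVuSans.ttf", size=32):
      # Render a clean glyph, then degrade it with a small rotation,
      # blur, and sensor-style noise, roughly what a phone camera adds.
      img = Image.new("L", (size, size), 255)
      draw = ImageDraw.Draw(img)
      font = ImageFont.truetype(font_path, int(size * 0.8))
      draw.text((4, 2), char, fill=0, font=font)
      img = img.rotate(np.random.uniform(-10, 10), fillcolor=255)
      img = img.filter(ImageFilter.GaussianBlur(np.random.uniform(0, 1.5)))
      arr = np.asarray(img, dtype=np.float32)
      arr += np.random.normal(0, 10, arr.shape)
      return np.clip(arr, 0, 255).astype(np.uint8)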
I am 15 years into this computers thing and this blog post made me feel like "those guys are doing black magic".<p>Neural networks and deep learning are truly awesome technologies.
The most awesome and surprising thing about this is that the whole thing runs <i>locally</i> on your smartphone! You don't need a network connection. All dictionaries, grammar processing, image processing, DNN - the whole stack runs on the phone. I used this on my trip to Moscow and it was truly a godsend because it didn't need an expensive international data plan (assuming you could even get connectivity!). English is fairly rare in Russia, and it was just fun to learn Russian this way by pointing at interesting things.
I used this in Brazil this last March to read menus. It works extremely well. The mistranslations make it even more fun. Much faster than learning Portuguese!<p>I took a few screenshots. Aligning the phone and managing focus, light, and shadows on the small menu font was difficult. You must keep steady. Sadly, I ended up hitting the volume control on this, my best example. Tasty cockroaches! Ha! <a href="http://imgur.com/j9iRaY0" rel="nofollow">http://imgur.com/j9iRaY0</a>
Word Lens is impressive. It came from a small startup. Google didn't develop it; it was a product before Google bought it. I saw an early version being shown around TechShop years ago, before Google Glass, even. It was quite fast even then, translating signs and keeping the translation positioned over the sign as the phone was moved in real time. But the initial version was English/Spanish only.
I see no mention of it, but I'd be surprised if they didn't use some form of knowledge distilling [1] (which Hinton came up with, so really no excuse), to condense a large neural network into a much smaller one.<p>[1] <a href="http://arxiv.org/abs/1503.02531" rel="nofollow">http://arxiv.org/abs/1503.02531</a>
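For reference, the core of distillation is training the small net on the big net's temperature-softened outputs. A rough numpy sketch of the combined loss from the paper (T and alpha are hyperparameters you'd tune; this is just an illustration, not Google's code):

  import numpy as np

  def softmax(logits, temperature=1.0):
      z = logits / temperature
      z = z - z.max(axis=-1, keepdims=True)
      e = np.exp(z)
      return e / e.sum(axis=-1, keepdims=True)

  def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
      # Soft term: cross-entropy against the teacher's temperature-softened
      # outputs, scaled by T^2 as recommended in the paper.
      soft = -np.sum(softmax(teacher_logits, T) *
                     np.log(softmax(student_logits, T) + 1e-12), axis=-1) * T ** 2
      # Hard term: ordinary cross-entropy against the true labels.
      hard = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12)
      return np.mean(alpha * soft + (1 - alpha) * hard)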
WordLens/Google Translate is the most futuristic thing that my phone is able to do. It's especially useful in countries that don't use the Latin alphabet.
"Squeezes" is very relative. These phones are equal to or larger than most desktops 10-15 years ago, back when I was doing AI research with evolutionary computing and genetic algorithms. We did some pretty mean stuff on those machines, and now we have them in our pockets.
They did this even more impressively when squeezing their speech recognition engine onto mobile devices.<p><a href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41176.pdf" rel="nofollow">http://static.googleusercontent.com/media/research.google.co...</a>
A possibly relevant research paper that they didn't mention: "Distilling the Knowledge in a Neural Network" <a href="http://arxiv.org/abs/1503.02531" rel="nofollow">http://arxiv.org/abs/1503.02531</a>
What are the advantages of using a neural network over generating classification trees or using other machine learning methods? I'm not too familiar with how neural nets work, but it seems like they require more creator input than other methods, which could be good or bad, I suppose.
The article mentions algorithmically generating the training set. See here for some earlier research in this area: <a href="http://bheisele.com/heisele_research.html#3D_models" rel="nofollow">http://bheisele.com/heisele_research.html#3D_models</a>
Here's a short video about Google Translate that was just released.<p><a href="https://www.youtube.com/watch?v=0zKU7jDA2nc&index=1&list=PLeqAcoTy5741GXa8rccolGQaj_nVGw76g" rel="nofollow">https://www.youtube.com/watch?v=0zKU7jDA2nc&index=1&list=PLe...</a>
This technology has been around since 2010 and was developed by Word Lens, which was acquired by Google in 2014:<p><a href="https://en.wikipedia.org/wiki/Word_Lens" rel="nofollow">https://en.wikipedia.org/wiki/Word_Lens</a>
For those unfamiliar with Google's deep learning work, this talk covers their recent efforts pretty well <a href="https://youtu.be/kO-Iw9xlxy4" rel="nofollow">https://youtu.be/kO-Iw9xlxy4</a> (not technical)
Doesn't this article seem to say that the size of the training set is related to the size of the resulting network? The network's size should be proportional to the number of nodes/layers it's configured with, not to the number of training instances. Am I missing something?
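To illustrate the point, a toy parameter count for a small fully connected net (the layer widths here are made up):

  # Hypothetical layer widths: 28x28 input, two hidden layers, 26 classes.
  layer_sizes = [28 * 28, 128, 64, 26]

  # Weights plus biases for each dense layer; this number is the same
  # whether you train on a thousand examples or a billion.
  params = sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
  print(params)  # 110426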
I generated training sets for an OCR project in JavaScript [1] a while ago using a modified version of a captcha generator [2] (practically the same technique mentioned in this article).<p>[1] <a href="https://github.com/mateogianolio/mlp-character-recognition" rel="nofollow">https://github.com/mateogianolio/mlp-character-recognition</a><p>[2] <a href="https://github.com/mateogianolio/mlp-character-recognition/blob/master/captcha.js" rel="nofollow">https://github.com/mateogianolio/mlp-character-recognition/b...</a>
I wonder if they use some kind of (neural) language model for their translations. Using only a dictionary (as in the text) would be about 60 years behind the state of the art...
Why do they need a deep learning model for this? They are obviously targeting signs, product names, menus, and the like. The model will obviously fail at translating large texts.<p>Was there any advantage to using a deep learning model instead of something more computationally simple?
I don't get it. They say they use a dictionary, and they say it works without an Internet connection. How can both things be true? I'm pretty sure there's not, say, a Quechua dictionary on my phone.
Given the reliability of closed captions on YouTube and the frequency of errors in plaintext Google translate, I wouldn't be surprised if this service fails often, and often when you need it most.
WordLens was an awesome app and it's good to see that Google is continuing the development.<p>The new fad for using the 'deep' learning buzzword annoys me, though. It seems so meaningless. What makes one kind of neural net 'deep', and are all the other ones suddenly 'shallow'?
Just waiting for the paper to come out that'll detail all the transformations that were done on the training data specifically for the phone and how they decided to use them.<p>> To achieve real-time, we also heavily optimized and hand-tuned the math operations. That meant using the mobile processor’s SIMD instructions and tuning things like matrix multiplies to fit processing into all levels of cache memory.<p>Let's see how this turns out. I'm still skeptical, and I wonder whether other apps might crash because of this.
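For anyone unfamiliar, the cache bit in that quote refers to tiling/blocking the multiply. A toy numpy illustration of the idea (the real thing would be hand-tuned SIMD, and the block size here is just a placeholder you'd tune per cache level):

  import numpy as np

  def blocked_matmul(a, b, block=64):
      # Multiply in block-sized tiles so the working set of A, B and C
      # stays resident in a fast cache level instead of streaming from RAM.
      n, k = a.shape
      _, m = b.shape
      c = np.zeros((n, m), dtype=a.dtype)
      for i in range(0, n, block):
          for j in range(0, m, block):
              for p in range(0, k, block):
                  c[i:i+block, j:j+block] += (a[i:i+block, p:p+block] @
                                              b[p:p+block, j:j+block])
      return c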