Neural networks have their place, but they are probably the most complicated and opaque machine learning tool, and they are hard to set up: so many parameters! Given that, I found it really strange that they went straight for a neural network (and then implemented one themselves!). Surely the place to start would be Naive Bayes, followed by regularized logistic regression (through something like glmnet). Heck, even random forests would do quite well on this task, I imagine, although that's moving toward the neural-network end of the complexity and opaqueness spectrum.

There is also no evidence of cross-validation, and in another comment they say they used the entire data set to do variable selection - a pretty bad mistake. They justify it by saying they aren't in an academic environment, but that's a poor excuse: given the way they've done it, I'm very unsure whether they are actually getting the accuracy they think they are.

I also worry that they sank two man-months into this when they could probably have achieved similar if not better results with off-the-shelf, battle-tested tools. Sets off a lot of warning bells.
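For anyone who wants to try those baselines, here's a rough scikit-learn sketch of what I mean. The data is a synthetic stand-in (I obviously don't have their feature matrix), so treat it as an illustration, not their pipeline:

    # Hypothetical baseline comparison with 5-fold cross-validation.
    # make_classification is a stand-in for the real Instagram features.
    from sklearn.datasets import make_classification
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, n_features=50, random_state=0)
    for model in (GaussianNB(),
                  LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
                  RandomForestClassifier(n_estimators=200, random_state=0)):
        scores = cross_val_score(model, X, y, cv=5)
        print(type(model).__name__, round(scores.mean(), 3), round(scores.std(), 3))

Each of these is a handful of lines, and the cross-validated scores give you an honest accuracy estimate for free.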
This is a great example of how privacy is not optional, even in "opt-in" systems such as Instagram and FB. That Instagram does not *require* you to have a Facebook profile, and Facebook does not *require* you to list your gender, means very little in terms of your own privacy.

Merely choosing to withhold information about yourself does not insulate you from a breach of privacy. That others do disclose such information allows third parties to make really good guesses and inferences about you.

There's a strange morality here: at what point is it unethical to voluntarily disclose data about oneself, if it could be used in a way that harms someone else's privacy? Short of drawing a moral boundary (it could very well be impossible), we might do well to at least acknowledge the costs of these methods alongside their benefits.
It's unusual to see a coherent, from-first-principles explanation of a neural network, especially one that's commercially valuable (I presume) to Totems.

Mildly alarmed to learn I'm only 0.039 probability male, though - better bloke it up on Instagram.
Thanks for sharing your experience! A couple of questions:

Why implement the training in NodeJS rather than use an existing library in R or Python (scikit-learn), and just implement the scoring (the feedforward pass) in Node?

Did you just use a single test/train split? What is the variation in results if you run cross-validation?

Your article suggests that you used mutual information (MI) to select the 10k best features. Did you perform this MI feature selection before your test/train split? If so, you would already be "using" your class labels, and the results will be biased; it is likely your true generalisation error is higher than what you measured.
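For what it's worth, the leakage-free version of this is easy to set up in scikit-learn: put the MI selector inside a Pipeline, so the feature scores are recomputed on each training fold only and the held-out data never influences selection. A minimal sketch with stand-in data:

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline

    X, y = make_classification(n_samples=1000, n_features=200, random_state=0)
    pipe = Pipeline([
        ("select", SelectKBest(mutual_info_classif, k=50)),  # MI selection, refit per fold
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    print(cross_val_score(pipe, X, y, cv=5).mean())  # unbiased accuracy estimate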
Your implementation of momentum seems off: you just add a multiple of the last error, instead of adding exponentially declining contributions from the past. I think you want

    double dW = alpha_ * val_[l][j] * D_[l+1][i] + beta_ * dW_[l+1][i][j];
    W_[l+1][i][j] += dW;
    dW_[l+1][i][j] = dW;  // store the update so it decays into future steps
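To see why that recursion gives exponentially declining contributions from the past, here's a toy sketch (illustrative values, not your code):

    alpha, beta = 0.1, 0.9             # learning rate and momentum coefficient
    gradients = [1.0, 0.5, -0.2, 0.8]  # toy per-step gradients
    dW = 0.0
    for g in gradients:
        # a gradient from k steps ago contributes alpha * beta**k * g
        dW = alpha * g + beta * dW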
If you want to get an output class probability, softmax is the standard way; minimize the KL divergence (i.e. cross-entropy) instead of squared error.

You don't seem to be doing any regularization. It could give you better generalization.

I think you could get a speedup by doing your linear algebra with BLAS. I guess this would complicate the code, though, making it a trade-off.

Training on multiple threads and averaging is a nice touch. It would be interesting to hear whether (and how much) it improved your results.
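Concretely, something like this (a minimal numpy sketch, not your code):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift by max for stability
        return e / e.sum(axis=-1, keepdims=True)

    def cross_entropy(p, t):
        # t is one-hot, so minimizing this is equivalent to minimizing KL(t || p)
        return -np.mean(np.sum(t * np.log(p + 1e-12), axis=-1))

    logits = np.array([[2.0, -1.0]])  # toy network outputs for one example
    target = np.array([[1.0, 0.0]])   # one-hot label (e.g. male/female)
    print(cross_entropy(softmax(logits), target))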
Giving it a go with most of my friends, I'd say the success rate was definitely below 0.5, and it was pretty confident in its wrong calls.

What seems odd is that the "test tool" allows you to *tweet* whether it's wrong or right. Why not just have it make a call to your API or something to tell you directly, so you can look at the profiles and figure out what's gone wrong?
Having used /harthur/brain before and being deeply interested in neural networks, I have to say that this is one of the most interesting articles about the topic I've ever seen.

Thank you for sharing the C version; I'll use it for sure.
This was submitted 4 days ago [1] and then was deleted. Anyone know what was up with that?

[1] https://news.ycombinator.com/item?id=8368186
> *Our platform retrieves or refreshes around 400 user profiles per second (this is managed using 4 high-bandwidth servers co-located with instagram's API servers on AWS).*

Interesting, since Instagram's API only allows 5,000 requests per hour (http://instagram.com/developer/limits/) and does not support bulk requests of user data. How does this application bypass this limit?
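The back-of-envelope math (my numbers, not theirs):

    # How many client tokens' worth of quota would 400 profiles/sec need?
    rate_per_second = 400
    requests_per_hour = rate_per_second * 3600   # 1,440,000
    limit_per_hour = 5000                        # documented per-token limit
    print(requests_per_hour / limit_per_hour)    # 288.0

That's nearly 300 times the documented limit, so they'd need hundreds of tokens or some special arrangement.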
It's true that it seems like a lot of implementation work; NNs have a much higher complexity-to-performance ratio than other algorithms. But hey, the end justifies the means! I'm quite impressed with the result and had a lot of fun with the demo and the article. Keep it up, guys!
It doesn't predict my account correctly:

    PROBABILITY FEMALE: 0.997
    PROBABILITY MALE: 0.569

I wonder if the fact that I mostly just post pictures with no text accompanying them skews things.
My account (@matiassingers) got some very interesting numbers, and most of my photos definitely do have a caption and hashtags:

    PROBABILITY FEMALE: 0.003
    PROBABILITY MALE: 0.001
1.000 probability of being a man. Thank you for affirming my masculinity.

However, my business has a 0.885 probability of being a woman, which is odd for a men's brand.
Interesting blog, interesting ideas, but completely bogus results: it's very inaccurate. Even simple Naive Bayes would likely do much better than this.