Neural networks have their place, but they are probably the most complicated and opaque machine learning tool, and they are hard to set up: so many parameters! Given that, I found it really strange that they went straight for a neural network (and then implemented one themselves!). Surely the place to start would be Naive Bayes, followed by regularized logistic regression (through something like glmnet). Heck, even random forests would do quite well on this task, I imagine, although that's moving toward the neural-network end of the complexity and opaqueness spectrum.

There is also no evidence of cross-validation, and in another comment they say they used the entire data set to do variable selection - a pretty bad mistake. They justify it by saying they aren't in an academic environment, but that's a poor excuse: given the way they've done it, I'm very unsure whether they are actually getting the accuracy they think they are.

I also worry that they sank two man-months into this when they could probably have achieved similar if not better results with off-the-shelf, battle-tested tools. Sets off a lot of warning bells.
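For anyone who wants to try those baselines, here's a rough scikit-learn sketch of what I mean. The data is a synthetic stand-in (I obviously don't have their feature matrix), so treat it as an illustration, not their pipeline:

    # Hypothetical baseline comparison with 5-fold cross-validation.
    # make_classification is a stand-in for the real Instagram features.
    from sklearn.datasets import make_classification
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, n_features=50, random_state=0)
    for model in (GaussianNB(),
                  LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
                  RandomForestClassifier(n_estimators=200, random_state=0)):
        scores = cross_val_score(model, X, y, cv=5)
        print(type(model).__name__, round(scores.mean(), 3), round(scores.std(), 3))

Each of these is a handful of lines, and the cross-validated scores give you an honest accuracy estimate for free.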
This is a great example of how privacy is not optional, even in "opt-in" systems such as Instagram and FB. That Instagram does not *require* you to have a Facebook profile, and Facebook does not *require* you to list your gender, means very little in terms of your own privacy.

Merely choosing to withhold information about yourself does not insulate you from a breach of privacy. That others do disclose such information allows third parties to make really good guesses and inferences about you.

There's a strange morality here: at what point is it unethical to voluntarily disclose data about oneself, if it could be used in a way that harms someone else's privacy? Short of drawing a moral boundary (it could very well be impossible), we might do well to at least acknowledge the costs of these methods alongside their benefits.
It's unusual to see a coherent, from-first-principles explanation of a neural network, especially one that's commercially valuable (I presume) to Totems.

Mildly alarmed to learn I'm only 0.039 probability male, though - better bloke it up on Instagram.
Thanks for sharing your experience! A couple of questions:

Why implement the training in NodeJS rather than use an existing library in R or Python (scikit-learn), and just implement the scoring (the feedforward pass) in Node?

Did you just use a single test/train split? What is the variation in results if you run cross-validation?

Your article suggests that you used mutual information (MI) to select the 10k best features. Did you perform this MI feature selection before your test/train split? If so, you would already be "using" your class labels, and the results will be biased; it is likely your true generalisation error is higher than what you measured.
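For what it's worth, the leakage-free version of this is easy to set up in scikit-learn: put the MI selector inside a Pipeline, so the feature scores are recomputed on each training fold only and the held-out data never influences selection. A minimal sketch with stand-in data:

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline

    X, y = make_classification(n_samples=1000, n_features=200, random_state=0)
    pipe = Pipeline([
        ("select", SelectKBest(mutual_info_classif, k=50)),  # MI selection, refit per fold
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    print(cross_val_score(pipe, X, y, cv=5).mean())  # unbiased accuracy estimate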
Your implementation of momentum seems off: you just add a multiple of the last error, instead of adding exponentially declining contributions from the past. I think you want

    double dW = alpha_ * val_[l][j] * D_[l+1][i] + beta_ * dW_[l+1][i][j];
    W_[l+1][i][j] += dW;
    dW_[l+1][i][j] = dW;  // store the update so it decays into future steps
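To see why that recursion gives exponentially declining contributions from the past, here's a toy sketch (illustrative values, not your code):

    alpha, beta = 0.1, 0.9             # learning rate and momentum coefficient
    gradients = [1.0, 0.5, -0.2, 0.8]  # toy per-step gradients
    dW = 0.0
    for g in gradients:
        # a gradient from k steps ago contributes alpha * beta**k * g
        dW = alpha * g + beta * dW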
If you want to get an output class probability, softmax is the standard way; minimize the KL divergence (i.e. cross-entropy) instead of squared error.

You don't seem to be doing any regularization. It could give you better generalization.

I think you could get a speedup by doing your linear algebra with BLAS. I guess this would complicate the code, though, making it a trade-off.

Training on multiple threads and averaging is a nice touch. It would be interesting to hear whether (and how much) it improved your results.
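Concretely, something like this (a minimal numpy sketch, not your code):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift by max for stability
        return e / e.sum(axis=-1, keepdims=True)

    def cross_entropy(p, t):
        # t is one-hot, so minimizing this is equivalent to minimizing KL(t || p)
        return -np.mean(np.sum(t * np.log(p + 1e-12), axis=-1))

    logits = np.array([[2.0, -1.0]])  # toy network outputs for one example
    target = np.array([[1.0, 0.0]])   # one-hot label (e.g. male/female)
    print(cross_entropy(softmax(logits), target))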
Giving it a go with most of my friends, I'd say the success rate was definitely below 0.5, and it was pretty confident in its wrong calls.

What seems odd is that the "test tool" allows you to *tweet* whether it's wrong or right. Why not just have it make a call to your API or something to tell you directly, so you can look at the profiles and figure out what's gone wrong?
Having used /harthur/brain before and being deeply interested in neural networks, I have to say that this is one of the most interesting articles about the topic I've ever seen.

Thank you for sharing the C version; I'll use it for sure.
This was submitted 4 days ago [1] and then was deleted. Anyone know what was up with that?

[1] https://news.ycombinator.com/item?id=8368186
> *Our platform retrieves or refreshes around 400 user profiles per second (this is managed using 4 high-bandwidth servers co-located with instagram's API servers on AWS).*

Interesting, since Instagram's API only allows 5,000 requests per hour (http://instagram.com/developer/limits/) and does not support bulk requests of user data. How does this application bypass this limit?
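The back-of-envelope math (my numbers, not theirs):

    # How many client tokens' worth of quota would 400 profiles/sec need?
    rate_per_second = 400
    requests_per_hour = rate_per_second * 3600   # 1,440,000
    limit_per_hour = 5000                        # documented per-token limit
    print(requests_per_hour / limit_per_hour)    # 288.0

That's nearly 300 times the documented limit, so they'd need hundreds of tokens or some special arrangement.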
It's true that it seems like a lot of implementation work; NNs have a much higher complexity-to-performance ratio than other algorithms. But hey, the end justifies the means! I'm quite impressed with the result and had a lot of fun with the demo and the article. Keep it up, guys!
It doesn't predict my account correctly:

    PROBABILITY FEMALE: 0.997
    PROBABILITY MALE: 0.569

I wonder if the fact that I mostly just post pictures with no text accompanying them skews things.
My account (@matiassingers) got some very interesting numbers, and most of my photos definitely do have a caption and hashtags:

    PROBABILITY FEMALE: 0.003
    PROBABILITY MALE: 0.001
1.000 probability of being a man. Thank you for affirming my masculinity.

However, my business has a 0.885 probability of being a woman, which is odd for a men's brand.
Interesting blog, interesting ideas, but completely bogus results: it's very inaccurate. Even simple Naive Bayes would likely do much better than this.