It's worth keeping in mind that learning from few examples is not such a big deal. What is really hard to do (and a long-standing problem in machine learning) is learning a model that *generalises well to unseen data*.

So the question is: does the OP really show good generalisation?

It's hard to see how one would even begin to test this, in the case of the OP.
The OP describes an experiment where a few hundred instances were drawn from a set of 50k and used for both training and testing (by holding out a few instances for testing rather than cross-validating, if I got that right).
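(Cross-validation over a few hundred instances is cheap to run, for what it's worth. A minimal sketch, with invented toy data and a plain scikit-learn pipeline standing in for whatever the OP actually trained:)

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    # Toy stand-ins for the OP's few hundred labelled instances.
    texts = ["great service", "terrible food", "loved it", "awful wait"] * 25
    labels = ["pos", "neg", "pos", "neg"] * 25

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())

    # 10-fold cross-validation: every instance gets tested exactly once,
    # instead of scoring on one small held-out handful.
    scores = cross_val_score(model, texts, labels, cv=10)
    print(f"accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
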
I guess one way to go about it is to use the trained model to label your unseen data (the rest of the 50k), then go through that model-labelled data by hand and try to figure out how well the model did.
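Roughly like the sketch below. The data and the pipeline are invented stand-ins; a scikit-learn classifier plays the part of the OP's model:

    import random

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy stand-ins: train_texts plays the few hundred labelled instances,
    # unseen_texts plays the rest of the 50k the model never saw.
    train_texts = ["great service", "terrible food", "loved it", "awful wait"]
    train_labels = ["pos", "neg", "pos", "neg"]
    unseen_texts = ["the food was great", "service was awful", "not bad at all"]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(train_texts, train_labels)

    # Label the unseen data with the trained model...
    predicted = model.predict(unseen_texts)

    # ...then pull a random sample and go through it by hand.
    for text, label in random.sample(list(zip(unseen_texts, predicted)), k=3):
        print(f"[{label}] {text}")

If most labels in a decent-sized sample look wrong, the model didn't generalise, whatever its score on the handful of held-out instances.
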
We're talking here about natural language, however, where the domain is so vast that even the full 50k instances are far too few to learn it well. That has nothing to do with the model being trained, deep or shallow. It has everything to do with the fact that you can say the same thing in 100k different ways and still not exhaust all the ways to say that one thing. So 50k examples are either not enough examples of the different ways to say the same thing, or not enough examples of the different things you can say, or, most probably, both.

It's also worth remembering that deep nets can overfit much worse than other methods, exactly because they are so good at memorising training data. It's very hard to figure out what a deep net is really learning, but it would not be at all surprising to find out that your "powerful" model is just a very expensive alternative to Ctrl+C.

It's just memorised your examples, see?
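One cheap way to probe for that is to compare the model against a baseline that literally copies the label of the most similar training example (1-nearest-neighbour over TF-IDF vectors). Same invented toy data as above, with a plain classifier again standing in for the deep net:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    train_texts = ["great service", "terrible food", "loved it", "awful wait"]
    train_labels = ["pos", "neg", "pos", "neg"]
    unseen_texts = ["the food was great", "service was awful", "not bad at all"]

    # The model under suspicion (a stand-in for the deep net).
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(train_texts, train_labels)

    # The most literal Ctrl+C there is: copy the label of the single
    # most similar training example.
    copier = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=1))
    copier.fit(train_texts, train_labels)

    agreement = np.mean(model.predict(unseen_texts) == copier.predict(unseen_texts))
    print(f"agreement with the copy-paste baseline: {agreement:.0%}")

If the expensive model agrees with the copy-paste baseline nearly everywhere, it's hard to argue it has learned anything the training data didn't already spell out.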