A better way to build ML: Why you should be using Active Learning

136 点作者 razcle超过 4 年前

13 条评论

natch超过 4 年前

It's hard to find articles like this that give a glimpse into what is used by larger shops doing ML. I take this one with a grain of salt due to the source being a vendor, but still it is generous with the amount of detail and with its even mentioning some alternative solutions for cases that might fit those, so that is really appreciated.The pros working in big shops who write these tend to overlook the tiny use cases such as apps that recognize a cat coming through a cat door (as opposed to a raccoon) which can get by with minuscule training.There's a lot of discussion of "big data" but small data is amazingly powerful too. I wish there was more bridging of these two worlds — to have tools that deal with the needs of small data, without the assumption that training a model takes days or months, and on the other side, to have the big data world share more insights about how they manage their data for the big cases. There is a ton of info out there but what I find lacking is info about how labeling and tagging is managed on a large scale (I'm interested in both, big and small, as well as medium). Maybe I'm just missing something. This article gave some good clues — thanks!

评论 #26031354 未加载

评论 #26030391 未加载

评论 #26029373 未加载

realradicalwash超过 4 年前

Nice to see some active learning around here. To add a data point from a less successful story:In one of our research projects, we used AL to improve part-of-speech prediction, inspired by work by Rehbein and Ruppenhofer, e.g. <a href="https://www.aclweb.org/anthology/P17-1107/" rel="nofollow">https://www.aclweb.org/anthology/P17-1107/</a>Our data base was a corpus of Scientific English from 17th-now and for our data and situation, we found that choosing the right tool/model and having the right training data were the most important things. Once that was in place, active learning did not, unfortunately, add that much. For different tools/settings, we got about +/-0.2% in accuracy for checking 200k tokens and only correcting 400 of them.Maybe one problem was that AL was only triggered when a majority vote was inconclusive. Also, we used it on top of individualised, gs training data. I guess things can look different if you don't have a gs to start with. And if you have better computational resources: Our oracles spent quite some time waiting, which is why we even reorganised the original design to then process batches of corrections.As so often, those null results were hard to publish :|Either way, I thought I'd share our experiences. Your work sounds really cool, best of luck!

评论 #26030621 未加载

porphyra超过 4 年前

A more detailed and technical writeup on the benefits of active learning: You should try active learning - <a href="https://medium.com/aquarium-learning/you-should-try-active-learning-37a86aab1afb" rel="nofollow">https://medium.com/aquarium-learning/you-should-try-active-l...</a>Also, Aquarium Learning is just awesome. Super slick.

评论 #26032780 未加载

rocauc超过 4 年前

Nice read.Can you shed some light on what you think are the most valuable methods for identifying high entropy examples for the model to learn faster? I'm familiar with Pool-Based Sampling, Stream-Based Selective Sampling, Membership Query Synthesis[1], but less certain which techniques are most useful in NLP.[1] <a href="https://blog.roboflow.com/what-is-active-learning/" rel="nofollow">https://blog.roboflow.com/what-is-active-learning/</a>

评论 #26028913 未加载

anonymouse008超过 4 年前

Ha! This is amazing -- we did a similar process for an EEG research project, and it was stellar (working memory and learning curves)! Until now, I didn't have the right words to articulate what we did - so thank you for the incantation!

评论 #26028557 未加载

nailer超过 4 年前

Mike from Humanloop here - if you're interested in active learning we'll be around on this thread, also we're looking for fullstack SW engineers and ML engineers - <a href="https://news.ycombinator.com/item?id=25992607" rel="nofollow">https://news.ycombinator.com/item?id=25992607</a>

评论 #26027911 未加载

评论 #26028527 未加载

andy99超过 4 年前

I have a suggestion about the first plot you show in the writeup. From what I can see, it is based on a finite pool of data and so it undersells active learning: performance shoots up as AL finds the interesting points, but then the curve flattens and is less steep than the random curve as the "boring" points get added. It would be nice to see the same curve for a bigger training pool where AL was able to get to a target accuracy without running out of valuable training points. I suspect that would make the difference between the two curves much more stark. As it is, it just looks like AL does better for very low data but to get to high accuracy you need to use the whole dataset anyway so it's a wash between AL and random.

评论 #26029707 未加载

e2e4超过 4 年前

Startups especially benefit from Active Learning: <a href="https://www.slideshare.net/nrubens/1-of-40-recommender-systems-and-active-learning-for-startups" rel="nofollow">https://www.slideshare.net/nrubens/1-of-40-recommender-syste...</a>A slightly deeper intro: <a href="https://www.slideshare.net/nrubens/active-learning-in-recommender-systems" rel="nofollow">https://www.slideshare.net/nrubens/active-learning-in-recomm...</a>p.s. am the author of the above presentations; great to see Active Learning (AL) to finally get proper attention (I've been working in the AL area for 10+ years).

woeirua超过 4 年前

I'm not sure I really understand the advantage to AL in this context. Sure you get better performance earlier, but if you want the best performance you still appear to have to train with the same amount of data. Given the training -> example identification -> annotation -> training loop is going to be much slower than just continuing to annotate data and then running all the data at once (for a variety of reasons), I think if you were to do an honest total time and total monetary cost comparison you would probably come out with AL being more expensive overall... Am I missing something here?

评论 #26031868 未加载

dexter89_kp3超过 4 年前

what is your thought on synthetic data vs active learning?For some domains, with privacy concerns or rarity of objects, getting labelled data for deep learning is challenging.There is decent research on sim2real i.e transferring models trained on synthetic data to real world applications <a href="https://arxiv.org/pdf/1703.06907.pdf" rel="nofollow">https://arxiv.org/pdf/1703.06907.pdf</a>

评论 #26029685 未加载

nicoburns超过 4 年前

Active learning sounds like a step closer to how humans and other animals learn: iteratively with continuous feedback.

评论 #26028561 未加载

andrewmutz超过 4 年前

Spam filter is an interesting choice of motivating example, since usually it is your users labeling the data, rather than something that happens during the R&D process. You could try to use active learning but I'm not sure the users would like that product experience.

评论 #26028153 未加载

lwhsiao超过 4 年前

Hi Mike,Can you talk about the tradeoffs or relationship between active learning and weak supervision from your point of view?

评论 #26028132 未加载

评论 #26027915 未加载

评论 #26028342 未加载