I'm trying to learn from the experience of others about weak labelling.

Weak labelling: writing heuristic rules to approximately label a dataset, and using those rules as a form of 'weak' supervision for training a machine learning model. Snorkel.org were pioneers of this approach.

Would love to hear any real-world tales of trying it! What worked well and what didn't? How easy was it to get domain experts to write the rules? How did you mix ground-truth data with probabilistic labels? That sort of thing.

Context: we're building tools in this space.
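For anyone unfamiliar with the mechanics, here's a minimal sketch of the workflow described above, using the open-source Snorkel package (v0.9.x): a couple of keyword heuristics are written as labeling functions, applied to unlabeled text, and combined by Snorkel's LabelModel into probabilistic labels. The example texts, classes, and keywords are made up purely for illustration.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

# Snorkel's convention: -1 means the heuristic abstains on that example.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

@labeling_function()
def lf_contains_great(x):
    # Heuristic: "great" usually signals a positive review.
    return POSITIVE if "great" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_contains_refund(x):
    # Heuristic: asking for a refund usually signals a negative review.
    return NEGATIVE if "refund" in x.text.lower() else ABSTAIN

df_train = pd.DataFrame({"text": [
    "Great product, works as advertised",
    "Broken on arrival, I want a refund",
    "It was fine I guess",
]})

# Apply the labeling functions to get a label matrix L of shape
# (n_examples, n_labeling_functions).
applier = PandasLFApplier(lfs=[lf_contains_great, lf_contains_refund])
L_train = applier.apply(df_train)

# The LabelModel estimates the accuracies/overlaps of the heuristics and
# outputs probabilistic labels you can train a downstream classifier on.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=500, seed=123)
probs = label_model.predict_proba(L_train)
print(probs)
```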
I've used Snorkel quite a bit at work, usually combined with transformer models.

It has worked quite well for us. The public Snorkel package is a bit out of date now, as I think they're building a SaaS solution and focusing more on that. But aside from that it's quite easy to use. The other downside is that lots of cool ideas are present in the papers but not fully implemented (not complaining though!). Also, coming up with a diverse set of heuristics can be hard.

We use Snorkel a lot for bootstrapping text classifiers. Our classification models don't require much domain expertise, as it's pretty easy to tell whether a text sample is classified correctly, so the main advantage is just avoiding labeling costs and quicker prototyping. We find that we can usually use embedding similarity as a good heuristic. I wrote up a little bit about this approach here if you're curious: https://cultivate.com/why-cultivate-uses-embeddings-for-rapid-prototyping/

Happy to answer any additional questions you have too :)
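To make the embedding-similarity heuristic concrete, here's a rough sketch of how it could be expressed as a Snorkel labeling function using sentence-transformers: a few seed phrases per class are embedded once, and an unlabeled text gets the class of its most similar seed if the cosine similarity clears a threshold. The encoder name, seed phrases, and threshold are assumptions for the sketch, not the commenter's exact setup.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from snorkel.labeling import labeling_function

ABSTAIN, QUESTION, COMPLAINT = -1, 0, 1
THRESHOLD = 0.5  # assumed cutoff; tune against a small dev set

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

# A handful of seed phrases per class stands in for labeled data.
SEEDS = {
    QUESTION: ["How do I reset my password?", "Where can I find the docs?"],
    COMPLAINT: ["This is broken again", "I'm frustrated with the delays"],
}
# Pre-compute normalized seed embeddings once per class.
seed_vecs = {
    label: encoder.encode(texts, normalize_embeddings=True)
    for label, texts in SEEDS.items()
}

@labeling_function()
def lf_embedding_similarity(x):
    # Encoding per example keeps the sketch simple; batch in practice.
    vec = encoder.encode([x.text], normalize_embeddings=True)[0]
    best_label, best_sim = ABSTAIN, THRESHOLD
    for label, vecs in seed_vecs.items():
        # Cosine similarity reduces to a dot product on normalized vectors.
        sim = float(np.max(vecs @ vec))
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label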