Hey HN, I’m Jordan, cofounder of Humanloop (YC S20), and I’m excited to show you Programmatic, an annotation tool for building large labeled datasets for NLP <i>without manual annotation</i>.<p>Programmatic is like a REPL for data annotation. You:<p><pre><code> 1. Write simple rules/functions that can approximately label the data
2. Get near-instant feedback across your entire corpus
3. Iterate and improve your rules
</code></pre>
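To make the loop above concrete, here is a minimal sketch in plain Python. It is not Programmatic's API: the rule names, the <code>(start, end, label)</code> span format, the <code>coverage</code> helper and the toy corpus are all assumptions made up for illustration. It only shows what "write a rule, see its effect across the corpus" can look like.

```python
import re
from collections import Counter

# Hypothetical labeling rules (not Programmatic's API): each takes a text
# and returns a list of (start, end, label) character spans.
def money_rule(text):
    return [(m.start(), m.end(), "MONEY")
            for m in re.finditer(r"\$\d+(?:\.\d+)?", text)]

def org_suffix_rule(text):
    return [(m.start(), m.end(), "ORG")
            for m in re.finditer(r"\b[A-Z][a-z]+ (?:Inc|Ltd|Corp)\b", text)]

RULES = [money_rule, org_suffix_rule]

def coverage(corpus, rules):
    """Fraction of documents on which each rule produces at least one span."""
    hits = Counter()
    for text in corpus:
        for rule in rules:
            if rule(text):
                hits[rule.__name__] += 1
    return {rule.__name__: hits[rule.__name__] / len(corpus) for rule in rules}

# Toy corpus, invented for this example.
corpus = [
    "Acme Inc raised $5 last year.",
    "Nothing to label here.",
    "Globex Corp paid $12.50 to Initech Ltd.",
    "The fee is $3.",
]

if __name__ == "__main__":
    print(coverage(corpus, RULES))
```

Editing a regex and re-running <code>coverage</code> is the slow, manual version of the feedback loop; the point of a dedicated UI is to make that step near-instant and visual.<p>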
Finally, it uses a Bayesian label model [1] to convert these noisy annotations into a single, large, clean dataset, which you can then use for training machine learning models. You can programmatically label millions of datapoints in the time it takes to hand-label hundreds.<p>What we do differently from weak supervision packages like Snorkel and skweak [1] is focus on the UI to give near-instantaneous feedback. We love these packages, but when we tried to iterate on labeling functions we had to write a ton of boilerplate code and wrestle with pandas to understand what was going on. Building a dataset programmatically requires you to grok the impact of labeling rules on a whole corpus of text. We’ve been told that the exploration tools and feedback make the process feel game-like and even fun (!!).<p>We built it because we see that getting labeled data remains a blocker for businesses using NLP today. We have a platform for active learning (see our Launch HN [2]), but we wanted to give software engineers and data scientists a way to build the datasets they need themselves and to make the best use of subject-matter experts’ time.<p>The package is free and you can install it now as a pip package [3]. It supports NER / span extraction tasks at the moment, and document classification will be added soon. To help improve it, we'd love to hear your feedback or any successes or failures you’ve had with weak supervision in the past.<p>[1]: We use an HMM for NER tasks and Naive Bayes for classification, using the two approaches given in the papers below:
Pierre Lison, Jeremy Barnes, and Aliaksandr Hubin. "skweak: Weak Supervision Made Easy for NLP." <a href="https://arxiv.org/abs/2104.09683" rel="nofollow">https://arxiv.org/abs/2104.09683</a> (2021)
Alex Ratner, Christopher De Sa, Sen Wu, Daniel Selsam, Chris Ré. "Data Programming: Creating Large Training Sets, Quickly" <a href="https://arxiv.org/abs/1605.07723" rel="nofollow">https://arxiv.org/abs/1605.07723</a> (NIPS 2016)<p>[2]: Our Launch HN for our main active learning platform, Humanloop: <a href="https://news.ycombinator.com/item?id=23987353" rel="nofollow">https://news.ycombinator.com/item?id=23987353</a><p>[3]: You can install it directly here: <a href="https://docs.programmatic.humanloop.com/tutorials/quick-start" rel="nofollow">https://docs.programmatic.humanloop.com/tutorials/quick-star...</a>
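<p>For anyone curious what "Naive Bayes for classification" in [1] can mean in practice, here is a rough sketch of combining noisy rule votes into one probabilistic label. It is not our implementation. It assumes each rule has a known accuracy, that a wrong vote is spread uniformly over the other classes, and that abstentions carry no information; in real weak supervision the accuracies are estimated from unlabeled data rather than given. The function name <code>posterior</code> and all the numbers are made up for illustration.

```python
import math

ABSTAIN = None  # a rule that does not fire contributes nothing

def posterior(votes, accuracies, classes, prior=None):
    """Naive Bayes combination of rule votes.

    votes:      one entry per rule, a class name or ABSTAIN
    accuracies: assumed P(rule votes correctly | it fires), strictly in (0, 1)
    classes:    list of possible class names (at least two)
    """
    k = len(classes)
    if prior is None:
        prior = {c: 1.0 / k for c in classes}
    log_scores = {}
    for c in classes:
        score = math.log(prior[c])
        for vote, acc in zip(votes, accuracies):
            if vote is ABSTAIN:
                continue
            # Correct vote with prob. acc, otherwise uniform over the rest.
            p = acc if vote == c else (1.0 - acc) / (k - 1)
            score += math.log(p)
        log_scores[c] = score
    # Normalise in log space for numerical stability.
    top = max(log_scores.values())
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}

if __name__ == "__main__":
    # Three hypothetical rules disagree on one document.
    print(posterior(["spam", "ham", "spam"], [0.9, 0.6, 0.7], ["spam", "ham"]))
```

The rules are treated as conditionally independent given the true label, which is the "naive" part. An accurate rule outvotes a weak one instead of every vote counting equally, which is the main difference from simple majority voting.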