
Show HN: Programmatic – a REPL for creating labeled data

26 points by jordn, about 3 years ago
Hey HN, I'm Jordan, cofounder of Humanloop (YC S20), and I'm excited to show you Programmatic — an annotation tool for building large labeled datasets for NLP *without manual annotation*.

Programmatic is like a REPL for data annotation. You:

1. Write simple rules/functions that can approximately label the data
2. Get near-instant feedback across your entire corpus
3. Iterate and improve your rules

Finally, it uses a Bayesian label model [1] to convert these noisy annotations into a single, large, clean dataset, which you can then use for training machine learning models. You can programmatically label millions of datapoints in the time it takes to hand-label hundreds.

What we do differently from weak supervision packages like Snorkel/skweak [1] is focus on a UI that gives near-instantaneous feedback. We love these packages, but when we tried to iterate on labeling functions we had to write a ton of boilerplate code and wrestle with pandas to understand what was going on. Building a dataset programmatically requires you to grok the impact of labeling rules on a whole corpus of text. We've been told that the exploration tools and feedback make the process feel game-like and even fun (!!).

We built it because we see that getting labeled data remains a blocker for businesses using NLP today. We have a platform for active learning (see our Launch HN [2]), but we wanted to give software engineers and data scientists a way to build the datasets they need themselves, and to make the best use of subject-matter experts' time.

The package is free and you can install it now as a pip package [3]. It supports NER / span extraction tasks at the moment, and document classification will be added soon.

To help improve it, we'd love to hear your feedback or any successes/failures you've had with weak supervision in the past.

[1]: We use an HMM for NER tasks and Naive Bayes for classification, following the two papers below:
Pierre Lison, Jeremy Barnes, and Aliaksandr Hubin. "skweak: Weak Supervision Made Easy for NLP." https://arxiv.org/abs/2104.09683 (2021)
Alex Ratner, Christopher De Sa, Sen Wu, Daniel Selsam, Chris Ré. "Data Programming: Creating Large Training Sets, Quickly." https://arxiv.org/abs/1605.07723 (NIPS 2016)

[2]: Our Launch HN for our main active learning platform, Humanloop: https://news.ycombinator.com/item?id=23987353

[3]: Install it directly here: https://docs.programmatic.humanloop.com/tutorials/quick-start
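The three-step loop described above can be sketched in plain Python. Everything below is illustrative: the labeling functions and names are invented for this sketch (they are not Programmatic's API), and a simple majority vote stands in for the Bayesian label model that Programmatic actually uses.

```python
import re
from collections import Counter

# Toy labeling functions for a PERSON-name span task.
# Each one approximately labels the text with simple heuristics.
def lf_title(text):
    # Tokens preceded by an honorific are probably names.
    return [m.group(1) for m in re.finditer(r"\b(?:Mr\.|Dr\.|Ms\.)\s+(\w+)", text)]

def lf_said(text):
    # Capitalized tokens followed by a speech verb are probably names.
    return [m.group(1) for m in re.finditer(r"\b([A-Z]\w+)\s+(?:said|told|argued)", text)]

def lf_known(text):
    # A tiny catalog of known names (illustrative only).
    known = {"Jordan", "Raza"}
    return [t for t in re.findall(r"\b[A-Z]\w+\b", text) if t in known]

def aggregate(text, lfs, min_votes=2):
    # Majority vote across labeling functions. A real label model
    # would instead weigh each function by its estimated accuracy.
    votes = Counter(tok for lf in lfs for tok in lf(text))
    return sorted(tok for tok, n in votes.items() if n >= min_votes)

doc = "Dr. Jordan said the dataset was ready."
print(aggregate(doc, [lf_title, lf_said, lf_known]))  # all three LFs agree on "Jordan"
```

The point of the REPL-style workflow is that after editing any one of these rules you immediately see how the aggregate labels shift across the whole corpus, rather than re-running a batch pipeline.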

3 comments

razcle, about 3 years ago
Hi, Raza here, one of the other co-founders.

I know that HN likes to nerd out over technical details, so I thought I'd share a bit more on how we aggregate the noisy labels to clean them up.

At the moment we use the great skweak [1] open-source library to do this. skweak uses an HMM to infer the most likely unobserved label given the evidence of the votes from each of the labelling functions.

This whole strategy of first training a label model and then training a neural net was pioneered by Snorkel. We've used this approach for now, but we actually think there are big opportunities for improvement.

We're working on an end-to-end approach that de-noises the labelling functions and trains the model at the same time. So far we've seen improvements on the standard benchmarks [2] and are planning to submit to NeurIPS.

R

[1]: skweak package: https://github.com/NorskRegnesentral/skweak
[2]: Wrench benchmark: https://arxiv.org/abs/2109.11377
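For intuition about what a label model does, here is a toy, self-contained stand-in: estimate each labelling function's accuracy against the current consensus, then re-label with an accuracy-weighted vote, and repeat. This crude iterative scheme is invented for illustration; skweak's actual HMM inference (and Snorkel's learned generative model) are more principled.

```python
import math
from collections import Counter

def majority(votes):
    # Plain majority over non-abstaining votes (-1 means abstain).
    counted = Counter(v for v in votes if v != -1)
    return counted.most_common(1)[0][0] if counted else -1

def label_model(vote_matrix, rounds=3):
    # Rows = data points, columns = labelling functions.
    labels = [majority(row) for row in vote_matrix]
    n_lfs = len(vote_matrix[0])
    for _ in range(rounds):
        # Estimate each LF's accuracy against current labels (smoothed).
        acc = []
        for j in range(n_lfs):
            pairs = [(row[j], y) for row, y in zip(vote_matrix, labels)
                     if row[j] != -1 and y != -1]
            correct = sum(v == y for v, y in pairs)
            acc.append((correct + 1) / (len(pairs) + 2))
        # Re-label: each non-abstaining vote counts by its LF's log-odds.
        new_labels = []
        for row in vote_matrix:
            scores = Counter()
            for j, v in enumerate(row):
                if v != -1:
                    scores[v] += math.log(acc[j] / (1 - acc[j]))
            new_labels.append(scores.most_common(1)[0][0] if scores else -1)
        labels = new_labels
    return labels

# Two fairly accurate LFs and one noisier LF, classes 0/1.
votes = [
    [1, 1, 0],   # noisy third LF disagrees
    [0, 0, 0],
    [1, 1, 1],
    [1, 0, -1],  # third LF abstains
]
print(label_model(votes))
```

The interesting behaviour is on rows where the functions disagree: because the third function earns a lower estimated accuracy, its dissenting votes are discounted rather than counted equally, which is the core idea behind denoising labelling-function output.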
hmaguire, about 3 years ago
I've been using this for the past month or so in beta and it's fantastic. I'm a DS at an NLP startup and it's totally changed the way we develop new tagging and classification models (and how we explore unlabelled data more generally).
Comment #30956592 not loaded.
jordn, about 3 years ago
Just to clarify: *this goes beyond a rule-based system.* Rules can get you pretty far [1], but this improves on that by intelligently discounting the bad rules using weak supervision techniques. The end result is a pile of labeled data which you train your model on, and the model trained on this data can generalise well beyond those labels.

[1]: Aside: working at Alexa, I was surprised that something like 80% of utterances were covered by rules rather than an ML model. People have learned to use Alexa for a small handful of things, and you can cover those fairly well using a way to generate rules from phrase patterns and catalogs of nouns.
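The phrase-patterns-times-catalogs idea in the aside can be made concrete with a tiny sketch. All patterns, nouns, and intent names below are made up for illustration; they are not Alexa's actual system.

```python
import re

# Crossing a few phrase templates with a noun catalog generates
# many concrete matching rules from a small amount of hand-written input.
PATTERNS = ["play {noun}", "turn on {noun}", "what is the {noun}"]
CATALOG = {"music": "PlayMedia", "lights": "SmartHome", "weather": "GetWeather"}

def compile_rules():
    rules = []
    for pat in PATTERNS:
        for noun, intent in CATALOG.items():
            regex = re.compile(r"\b" + re.escape(pat.format(noun=noun)) + r"\b")
            rules.append((regex, intent))
    return rules

def classify(utterance, rules):
    # First matching rule wins; uncovered utterances would fall
    # through to an ML model in a real system.
    for regex, intent in rules:
        if regex.search(utterance.lower()):
            return intent
    return None

rules = compile_rules()
print(classify("Alexa, play music please", rules))
```

Three templates and three nouns already yield nine rules, which is why a modest pattern-and-catalog setup can cover the bulk of a skewed usage distribution; the weak-supervision step is what lets you keep the noisy, over-generating rules and still end up with clean training labels.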