Hi HN! I'm excited to share Autolabel, an open-source Python library to label and enrich text datasets with any Large Language Model (LLM) of your choice.

We built Autolabel because access to clean, labeled data is a huge bottleneck for most ML/data science teams. The most capable LLMs are able to label data with high accuracy, and at a fraction of the cost and time compared to manual labeling. With Autolabel, you can leverage LLMs to label any text dataset with <5 lines of code.

We’re eager for your feedback!
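For anyone curious what those few lines look like in practice, here is a rough sketch based on my reading of the project's README. The exact class names, config keys, and method signatures are assumptions on my part and may differ from the current release:

    # Sketch of labeling a CSV of support tickets with Autolabel.
    # Config keys and method signatures are assumed from the README
    # and may differ in the current release.
    from autolabel import LabelingAgent

    config = {
        "task_name": "TicketClassification",   # hypothetical task name
        "task_type": "classification",
        "model": {"provider": "openai", "name": "gpt-3.5-turbo"},
        "prompt": {
            "task_guidelines": "Classify each support ticket into one of the provided labels.",
            "labels": ["billing", "bug", "feature_request"],
            "example_template": "Ticket: {text}\nLabel: {label}",
        },
    }

    agent = LabelingAgent(config)
    agent.plan("tickets.csv")  # dry run: preview prompts and estimate cost
    agent.run("tickets.csv")   # label the dataset and write the results

The plan-before-run split is worth noting: it lets you sanity-check the generated prompts and the expected API cost before spending anything on labeling.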
This is very interesting to me. We spent a significant amount of time “labelling” data when I worked in public-sector digitalisation. Essentially, we did the LLM part manually and then ran engines like this on top of it. Having used ChatGPT to write JSDoc documentation for a while now, and having been very impressed with how well it understands code that follows good naming conventions, I’m fairly certain this will be the future of “librarian”-style labelling of case files.

But the key issue is going to be privacy. I’m not big on LLMs, so I’m sorry if this is obvious, but can I use something like this without sending my data outside my own organisation?
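Not affiliated with the project, but on the privacy point: since the model is configurable, it looks like you could point it at a model hosted on your own hardware so nothing leaves your network. A hedged sketch; the provider name and config keys below are my assumptions from the docs:

    # Hypothetical config pointing Autolabel at a locally run
    # Hugging Face model, so no document text leaves the organisation.
    # Provider name and config keys are assumptions and may differ.
    config = {
        "task_name": "CaseFileTagging",
        "task_type": "classification",
        "model": {
            "provider": "huggingface_pipeline",   # runs on your own machines
            "name": "google/flan-t5-large",       # any locally downloadable model
        },
        "prompt": {
            "task_guidelines": "Assign each case file the most relevant archive category.",
            "labels": ["personnel", "procurement", "citizen_inquiry"],
            "example_template": "Document: {text}\nCategory: {label}",
        },
    }

With a setup like that, the trade-off is label quality: smaller self-hosted models generally lag the largest hosted ones.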
Thank you for open sourcing this! This seems very useful, especially because of the confidence estimation, which lets you use LLMs for the data points they handle well and fall back to manual labelling for the rest.
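As an illustration of that fallback workflow, here is a small post-processing sketch. The output column names ("label", "confidence") are assumptions about the labeled file's format:

    import pandas as pd

    # Hypothetical post-processing: accept LLM labels above a confidence
    # threshold and queue the rest for manual review.
    df = pd.read_csv("tickets_labeled.csv")

    THRESHOLD = 0.9
    accepted = df[df["confidence"] >= THRESHOLD]
    needs_review = df[df["confidence"] < THRESHOLD]

    accepted.to_csv("accepted_labels.csv", index=False)
    needs_review.to_csv("manual_review_queue.csv", index=False)
    print(f"Auto-accepted {len(accepted)} rows; "
          f"{len(needs_review)} rows sent to manual review")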
> Refuel provides LLMs that can compute confidence scores for every label, if the LLM you've chosen doesn't provide token-level log probabilities.

How does this work exactly?
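For context on the case the quote contrasts with: when a provider does expose token-level log probabilities, a common baseline is to convert the log probs of the generated label tokens into an average probability and use that as the confidence. A sketch of that baseline only, not of Refuel's own method:

    import math

    def confidence_from_logprobs(token_logprobs):
        """Average per-token probability of a generated label.

        A common baseline when token-level log probabilities are
        available; not a description of Refuel's approach.
        """
        if not token_logprobs:
            return 0.0
        probs = [math.exp(lp) for lp in token_logprobs]
        return sum(probs) / len(probs)

    # Log probs for the tokens of a label like "feature_request"
    print(confidence_from_logprobs([-0.05, -0.20, -0.10]))  # ~0.89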
You just posted this here: https://news.ycombinator.com/item?id=36384015

It's one thing to do a Show HN / share; it's another thing to spam the site with your ads.