Automated Data Quality at Scale

1 point by anishathalye, almost 2 years ago

1 comment

anishathalye, almost 2 years ago
Sharing some context here: in grad school, I spent months writing custom data analysis code and training ML models to find errors in large-scale datasets like ImageNet, work that eventually resulted in this paper (https://arxiv.org/abs/2103.14749) and demo (https://labelerrors.com/).

Since then, I’ve been interested in building tools to automate this sort of analysis. We’ve finally gotten to the point where a web app can do automatically in a couple of hours what I spent months doing in Jupyter notebooks back in 2019-2020. It was really neat to see the software we built automatically produce the same figures and tables that are in our papers.

The blog post shared here is results-focused, talking about some of the data and dataset-level issues that a tool using data-centric AI algorithms can automatically find in ImageNet, which we used as a case study. Happy to answer any questions about the post or data-centric AI in general here!

P.S. All of our core algorithms are open-source, in case any of you are interested in checking out the code: https://github.com/cleanlab/cleanlab