TechEcho
Automated Data Quality at Scale

1 point by anishathalye almost 2 years ago

1 comment

anishathalye almost 2 years ago
Sharing some context here: in grad school, I spent months writing custom data analysis code and training ML models to find errors in large-scale datasets like ImageNet, work that eventually resulted in this paper (https://arxiv.org/abs/2103.14749) and demo (https://labelerrors.com/).

Since then, I’ve been interested in building tools to automate this sort of analysis. We’ve finally gotten to the point where a web app can do automatically in a couple of hours what I spent months doing in Jupyter notebooks back in 2019 and 2020. It was really neat to see the software we built automatically produce the same figures and tables that are in our papers.

The blog post shared here is results-focused, talking about some of the data and dataset-level issues that a tool using data-centric AI algorithms can automatically find in ImageNet, which we used as a case study. Happy to answer any questions about the post or data-centric AI in general here!

P.S. all of our core algorithms are open-source, in case any of you are interested in checking out the code: https://github.com/cleanlab/cleanlab
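
For readers wondering what "automatically finding label errors" can look like in code, here is a minimal pure-Python sketch in the spirit of confident learning. It is an illustration only, not cleanlab's actual implementation: the function names, the toy labels, and the toy predicted probabilities are all made up for this example. The idea is to compute a per-class confidence threshold (the average probability the model assigns to a class on examples carrying that label), then flag any example that the model confidently assigns to a class other than its given label.

```python
def per_class_thresholds(labels, pred_probs, num_classes):
    """Mean predicted probability of class k over the examples labeled k."""
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for label, probs in zip(labels, pred_probs):
        sums[label] += probs[label]
        counts[label] += 1
    # A class with no labeled examples gets an infinite threshold,
    # so it can never count as a confident prediction in this sketch.
    return [
        sums[k] / counts[k] if counts[k] else float("inf")
        for k in range(num_classes)
    ]


def find_likely_label_errors(labels, pred_probs):
    """Return indices whose given label disagrees with a confident prediction."""
    num_classes = len(pred_probs[0])
    thresholds = per_class_thresholds(labels, pred_probs, num_classes)
    flagged = []
    for i, (label, probs) in enumerate(zip(labels, pred_probs)):
        # Classes the model predicts at or above that class's own threshold.
        confident = [k for k in range(num_classes) if probs[k] >= thresholds[k]]
        if not confident:
            continue  # the model is not confident about any class here
        best = max(confident, key=lambda k: probs[k])
        if best != label:
            flagged.append(i)
    return flagged


# Hypothetical toy data: six examples, two classes.
# Example 2 is labeled 0, but the model strongly predicts class 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]

print(find_likely_label_errors(labels, pred_probs))  # prints [2]
```

In practice the predicted probabilities should come from out-of-sample predictions (for example via cross-validation), since a model that has memorized its training labels will rarely disagree with them.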