科技回声

1 comment

Sharing some context here: in grad school, I spent months writing custom data analysis code and training ML models to find errors in large-scale datasets like ImageNet, work that eventually resulted in this paper (https://arxiv.org/abs/2103.14749) and demo (https://labelerrors.com/).

Since then, I've been interested in building tools to automate this sort of analysis. We've finally gotten to the point where a web app can do automatically in a couple of hours what I spent months doing in Jupyter notebooks back in 2019-2020. It was really neat to see the software we built automatically produce the same figures and tables that are in our papers.

The blog post shared here is results-focused, talking about some of the data and dataset-level issues that a tool using data-centric AI algorithms can automatically find in ImageNet, which we used as a case study. Happy to answer any questions about the post or data-centric AI in general here!

P.S. all of our core algorithms are open-source, in case any of you are interested in checking out the code: https://github.com/cleanlab/cleanlab
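
For anyone wondering what "automatically finding label errors" can look like in code, here is a minimal sketch of the confident-learning idea that the paper linked above builds on. It is a simplified pure-Python illustration under stated assumptions, not cleanlab's actual implementation: the function name `find_likely_label_errors` and the toy inputs are hypothetical, and the predicted probabilities are assumed to come from a model evaluated out-of-sample (for example via cross-validation).

```python
def find_likely_label_errors(labels, pred_probs):
    """Return indices of examples whose given label looks suspicious.

    labels:     list of int class ids (the given, possibly noisy labels)
    pred_probs: list of per-class probability lists, one per example
                (assumed to be out-of-sample model predictions)
    """
    num_classes = len(pred_probs[0])

    # Per-class threshold: the average predicted probability of class k
    # over the examples that were labeled k.
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for y, p in zip(labels, pred_probs):
        sums[y] += p[y]
        counts[y] += 1
    thresholds = [s / c if c else float("inf") for s, c in zip(sums, counts)]

    suspects = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes whose predicted probability clears that class's threshold.
        confident = [k for k in range(num_classes) if p[k] >= thresholds[k]]
        if confident:
            best = max(confident, key=lambda k: p[k])
            # Flag the example if the most likely confident class
            # disagrees with the given label.
            if best != y:
                suspects.append(i)
    return suspects


# Toy data: the example at index 2 is labeled 0 but the model strongly predicts 1.
toy_labels = [0, 0, 0, 1, 1, 1]
toy_pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]
print(find_likely_label_errors(toy_labels, toy_pred_probs))
```

The per-class thresholds are the design choice worth noting: a class the model is systematically less sure about gets a lower bar, so one global cutoff does not over-flag that class. The real library adds ranking, pruning strategies, and many other checks on top of this basic idea.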

Automated Data Quality at Scale
