Senior data scientists know that the greatest ROI in real-world ML projects comes from finding and fixing issues in the dataset rather than endlessly tinkering with models. Today this is mostly done manually via ad hoc scripts and Jupyter notebooks. In data-centric AI, we also use software that automatically detects data issues (mislabeled examples, outliers, etc.) to make all this more systematic: better coverage, reproducibility, and efficiency. While some companies are starting to offer commercial platforms for data-centric AI, cleanlab is fully open-source, is a complete software framework usable across many data types and ML tasks, and I've published all of the novel algorithms cleanlab uses to help you improve messy real-world ML datasets.

In one line of Python, cleanlab can automatically:

(1) find mislabeled data + train robust models
(2) detect outliers
(3) estimate consensus + annotator quality for datasets labeled by multiple annotators
(4) suggest which data is best to label or re-label next (active learning)

It has quick 5-minute tutorials for many types of data (image, text, tabular, audio, etc.) and ML tasks (classification, entity recognition, image/document tagging, etc.).

Engineers have used cleanlab at Google to clean and train robust models on speech data, at Amazon to estimate how often the Alexa device fails to wake, at Wells Fargo to train reliable financial prediction models, and at Microsoft, Tesla, Facebook, and elsewhere. Hopefully you'll find cleanlab useful in your ML applications; it's super easy to try out!
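To give a flavor of how the mislabeled-data detection works, here is a simplified, self-contained sketch of the confident-learning idea underlying it (this is an illustrative toy version using only NumPy, not cleanlab's actual implementation): given any classifier's out-of-sample predicted probabilities, an example is suspect if the model confidently predicts some class other than the given label, where "confident" means exceeding that class's average self-confidence.

```python
import numpy as np

def find_label_issues_sketch(labels, pred_probs):
    """Toy confident-learning rule: flag example i as a likely label issue if
    the model's probability for some OTHER class clears that class's average
    self-confidence threshold, while the given label's probability does not."""
    n_classes = pred_probs.shape[1]
    # Per-class threshold: mean predicted probability of class j
    # among examples that are labeled j.
    thresholds = np.array(
        [pred_probs[labels == j, j].mean() for j in range(n_classes)]
    )
    issues = []
    for i, label in enumerate(labels):
        confident = [j for j in range(n_classes) if pred_probs[i, j] >= thresholds[j]]
        # Suspect: the model is confident in some class, but not the given label.
        if confident and label not in confident:
            issues.append(i)
    return np.array(issues)

# Toy data: example 2 is labeled class 0, but the model strongly predicts class 1.
labels = np.array([0, 0, 0, 1, 1])
pred_probs = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],  # likely mislabeled
    [0.2, 0.8],
    [0.3, 0.7],
])
print(find_label_issues_sketch(labels, pred_probs))  # -> [2]
```

In cleanlab itself the one-liner is along the lines of `cleanlab.filter.find_label_issues(labels=labels, pred_probs=pred_probs)`, which applies a more refined version of this rule (with calibration and pruning) and works with any classifier's predicted probabilities.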