There are a lot of great articles about measuring agreement between data annotators on labels, like this one: https://towardsdatascience.com/the-definite-guide-for-creating-an-academic-level-dataset-with-industry-requirements-and-6db446a26cb2.

I see mentions in a lot of places of Cohen's kappa, Krippendorff's alpha, Fleiss' kappa, comparison against a predefined ground truth, etc.

If you're managing an annotation process in your organization, how do you evaluate your annotators, and what challenges have you faced in the process?

As a side note, is anyone using programmatic labeling on a real dataset? Thoughts?
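
For anyone who wants to try these metrics before committing to a tool, here is a minimal sketch of how the three agreement statistics can be computed in Python. It assumes scikit-learn, statsmodels, numpy, and the krippendorff package are installed, and the label matrix is toy data I made up purely for illustration, not a recommendation of any particular workflow.

    # Minimal sketch: pairwise and multi-rater agreement on toy nominal labels.
    # Assumed dependencies: pip install numpy scikit-learn statsmodels krippendorff
    import numpy as np
    from sklearn.metrics import cohen_kappa_score
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
    import krippendorff

    # Hypothetical data: 3 annotators x 8 items, categories 0/1/2.
    labels = np.array([
        [0, 1, 2, 0, 1, 2, 0, 1],   # annotator A
        [0, 1, 2, 0, 1, 1, 0, 1],   # annotator B
        [0, 1, 2, 0, 2, 2, 0, 1],   # annotator C
    ])

    # Cohen's kappa: chance-corrected agreement between exactly two annotators.
    print("Cohen's kappa (A vs B):", cohen_kappa_score(labels[0], labels[1]))

    # Fleiss' kappa: agreement among all annotators; needs item-by-category counts.
    counts, _ = aggregate_raters(labels.T)   # rows = items, columns = categories
    print("Fleiss' kappa:", fleiss_kappa(counts, method="fleiss"))

    # Krippendorff's alpha: also handles missing labels (np.nan) and other
    # measurement levels (ordinal, interval) via level_of_measurement.
    print("Krippendorff's alpha:",
          krippendorff.alpha(reliability_data=labels,
                             level_of_measurement="nominal"))

In practice the choice tends to come down to how many annotators label each item and whether there are missing labels: Cohen's kappa only covers the two-rater case, Fleiss' kappa handles multiple raters with complete data, and Krippendorff's alpha tolerates gaps.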