I often see subtle misuses of interrater reliability metrics.

For example, imagine you're running a Search Relevance task, where search raters label query/result pairs on a 5-point scale: Very Relevant (+2), Slightly Relevant (+1), Okay (0), Slightly Irrelevant (-1), Very Irrelevant (-2).

Marking "Very Relevant" vs. "Slightly Relevant" isn't a big difference, but "Very Relevant" vs. "Very Irrelevant" is. However, most IRR calculations don't take this kind of ordering into account, so the distinction simply gets ignored!

Cohen's kappa is a rather simplistic and flawed metric, but it's a good starting point for understanding interrater reliability metrics.
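To make that concrete, here's a minimal sketch (assuming scikit-learn, with made-up ratings from two hypothetical raters): plain Cohen's kappa counts every disagreement as equally severe, while the weighted variant penalizes a disagreement by how far apart the two labels sit on the ordinal scale.

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical ratings from two raters on the -2..+2 relevance scale.
    rater_a = [ 2,  2,  1,  0, -1, -2,  2,  1,  0, -2]
    rater_b = [ 1,  2,  2,  0, -2, -2, -2,  1,  1, -2]

    # Unweighted kappa: confusing +2 with +1 is penalized exactly as much
    # as confusing +2 with -2.
    print(cohen_kappa_score(rater_a, rater_b))

    # Linearly weighted kappa: disagreements are penalized in proportion
    # to the distance between the two labels, so near misses hurt less.
    print(cohen_kappa_score(rater_a, rater_b, weights="linear"))

The weighted version gives partial credit for near misses, which is exactly what an ordinal scale like this calls for.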