I work in this domain, dealing exclusively with recognition for assistants, which is different from dictation. We measure three things, top down:

- Whole-phrase intent recognition rate: run the transcribed phrase through a classifier to identify what the phrase is asking for, compare that to the expected intent, and compute an F1 score. Keep track of phrases that score poorly: they need to be improved. (A scoring sketch follows this list.)

- "Domain term" error rate: identify a list of key words that are important to the domain and must be recognized well, such as location names, products to buy, drug names, and terms of art. For every transcribed utterance, measure the F1 score for those terms, and track the alternatives in a confusion matrix. This yields a distilled list of the words the system gets wrong and what is heard instead. (See the alignment sketch below.)

- Overall word error rate, to provide a general view of model performance.
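For the first metric, a minimal sketch of the scoring step, assuming the transcripts have already been run through an intent classifier; the intent labels and predictions here are made up:

    # Sketch of intent-level scoring; labels and predictions are hypothetical.
    from sklearn.metrics import f1_score

    expected  = ["set_timer", "play_music", "set_timer", "play_music"]
    predicted = ["set_timer", "play_music", "play_music", "play_music"]

    # Macro-averaged F1 across intent classes.
    print("intent F1:", f1_score(expected, predicted, average="macro"))

    # Phrases whose intent was misread go on the improvement list.
    misses = [(e, p) for e, p in zip(expected, predicted) if e != p]
    print("misfires:", misses)

In practice you would aggregate per intent class so that one chronically misrecognized phrasing stands out.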
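The second and third metrics can both fall out of one token-level alignment between reference and hypothesis. A sketch under assumptions: whitespace-tokenized, already-normalized transcripts, and a hypothetical DOMAIN_TERMS list and utterance pairs:

    # Domain-term F1, confusion tracking, and overall WER from one
    # Levenshtein alignment. DOMAIN_TERMS and the utterances are hypothetical.
    from collections import Counter

    DOMAIN_TERMS = {"warfarin", "downtown"}  # hand-curated key words

    def align(ref, hyp):
        """Align token lists; return (op, ref_tok, hyp_tok) tuples,
        where op is 'ok', 'sub', 'del', or 'ins'."""
        m, n = len(ref), len(hyp)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # match / substitution
        ops, i, j = [], m, n  # backtrace to recover the edit operations
        while i > 0 or j > 0:
            if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
                ops.append(("ok" if ref[i - 1] == hyp[j - 1] else "sub",
                            ref[i - 1], hyp[j - 1]))
                i, j = i - 1, j - 1
            elif i > 0 and d[i][j] == d[i - 1][j] + 1:
                ops.append(("del", ref[i - 1], None))
                i -= 1
            else:
                ops.append(("ins", None, hyp[j - 1]))
                j -= 1
        return ops[::-1]

    confusions = Counter()  # (expected term, heard instead) -> count
    tp = fp = fn = errors = total = 0

    for ref_text, hyp_text in [
        ("refill my warfarin please", "refill my wolverine please"),
        ("directions to downtown", "directions to downtown"),
    ]:
        ref, hyp = ref_text.split(), hyp_text.split()
        total += len(ref)
        for op, r, h in align(ref, hyp):
            if op != "ok":
                errors += 1                    # feeds overall WER
            if r in DOMAIN_TERMS:
                if op == "ok":
                    tp += 1
                else:                          # term substituted or dropped
                    fn += 1
                    confusions[(r, h or "<deleted>")] += 1
            if op in ("sub", "ins") and h in DOMAIN_TERMS:
                fp += 1                        # term hallucinated elsewhere

    print("overall WER:", errors / total)                        # 1/7
    print("domain-term F1:", 2 * tp / max(2 * tp + fp + fn, 1))  # 2/3
    print("confusions:", dict(confusions))  # {('warfarin', 'wolverine'): 1}

Sorting the confusion counter by count gives the distilled list directly: the terms the model gets wrong most often, paired with what it hears instead.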