At Doximity, we go to great lengths to ensure the quality of our products aligns with the standards physicians require. Across industries, Large Language Models (LLMs) have become the backbone of numerous applications, driving advancements in everything from natural language processing to automated content creation. As we continue to develop products built on these LLMs, rigorous and comprehensive evaluation of their outputs becomes increasingly critical. Strap in as we explore the process for evaluating Doximity GPT, our HIPAA-compliant medical writing assistant, focusing on the importance of using "ground truths" to establish baseline metrics and to measure the relative performance of contender models.