As someone who built an EMR that sold to Epic, I think I can say with some authority that these studies don't suggest this is ready for the real world.<p>While tech workers are unregulated, clinicians are <i>highly</i> regulated. Ultimately the clinician takes on the responsibility and risk of relying on these computer systems to treat a patient; tech workers and their employers don't. Clinicians <i>do not take risks</i> with patients because they have to contend with malpractice lawsuits and licensing boards.<p>In my experience, anything that is even <i>slightly</i> inaccurate permanently reduces a clinician's trust in the system. This matters when it comes time to renew your contracts in one, three, or five years.<p>You can train the clinicians on your software and modify your UI to make it clear that a heuristic should only be taken as a suggestion, but that will also result in a support request <i>every time</i>. Those support requests have to be resolved pretty quickly because they're part of the SLA.<p>I just can't imagine any hospital renewing a contract when the answer to their support requests is some form of "LLMs hallucinate sometimes." I used to hire engineers from failed companies that built non-deterministic healthcare software.
> <i>We find strong evidence that accuracy on today's medical benchmarks is not the most significant factor when analyzing real-world patient data, an insight with implications for future medical LLMs.</i><p>I interpreted this as challenging whether answering PubMedQA questions as well as a physician is correlated with recommending successful care paths based on the results (and other outcomes) shown in the sample corpus of medical records.<p>The analogy is a joke I used to make about ML where it made for crappy self-driving cars but surprisingly good pedestrian and cyclist hunter-killer robots.<p>Really, LLMs aren't expert-system reasoners (yet), and if the medical records all contain the same meta-errors that ultimately kill patients, there's a GIGO problem where the failure mode of AI medical opinions is making the same errors faster and at greater scale. LLMs may be really good at finding how internally consistent an ontology made of language is, where the quality of their results is a function of that internal logical consistency.<p>There's probably a Pareto distribution of cases where AI is amazing for basic stuff like, "see a doctor" and then conspicuously terrible in some cases where a human is obviously better.
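To make that first point concrete, here's a minimal sketch of how you'd check whether benchmark accuracy actually tracks an outcome metric computed over a records corpus. The model names and numbers below are made up for illustration, not from the paper:<p><pre><code>
# Hypothetical per-model scores; a real test would use the paper's corpus.
from scipy.stats import spearmanr

models            = ["model_a", "model_b", "model_c", "model_d"]
pubmedqa_accuracy = [0.74, 0.78, 0.71, 0.80]  # benchmark performance
care_path_success = [0.62, 0.55, 0.66, 0.58]  # outcome metric on patient records

rho, p = spearmanr(pubmedqa_accuracy, care_path_success)
print(f"Spearman rho={rho:.2f}, p={p:.3f}")
# A weak or negative rho is what the paper's claim would predict: benchmark
# accuracy is not the most significant factor on real-world patient data.
</code></pre>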
>LLMD-8B achieves state of the art responses on PubMedQA over all models<p>Hang on -- while this is a cool result, beating a limited number of models that you chose to include in your comparison does not qualify LLMD-8B as SOTA. (For example, Claude 3 Sonnet scores 10 percentage points higher.)<p>>This result confirms the power of continued pretraining and suggests that records themselves have content useful for improving benchmark performance.<p>In support of this conclusion, it would be informative to include an ablation study, e.g. evaluating a continued pre-training data set of the same total size but omitting medical record content from the data mix.
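For what it's worth, a minimal sketch of that ablation setup (all names and proportions here are hypothetical, not from the paper): hold the total token budget fixed and vary only whether medical records are in the mix.<p><pre><code>
# Hypothetical ablation: same total token budget, with vs. without records.

TOKEN_BUDGET = 10_000_000_000  # identical budget for both runs

def build_mix(include_records: bool) -> dict:
    """Allocate the token budget across sources for continued pretraining."""
    if include_records:
        return {"medical_records":     int(0.4 * TOKEN_BUDGET),
                "clinical_guidelines": int(0.3 * TOKEN_BUDGET),
                "general_web":         int(0.3 * TOKEN_BUDGET)}
    # Ablation arm: drop records, backfill with the other sources
    return {"clinical_guidelines": int(0.5 * TOKEN_BUDGET),
            "general_web":         int(0.5 * TOKEN_BUDGET)}

with_records    = build_mix(include_records=True)
without_records = build_mix(include_records=False)

# Continued-pretrain one model on each mix, then evaluate both on PubMedQA.
# If the with-records run wins at equal budget, the gain is attributable to
# record content rather than to extra data volume.
print(with_records)
print(without_records)
</code></pre>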
Interesting that they don't compare to OpenBioLLM. The page 7 charts are quite weak.<p><a href="https://huggingface.co/aaditya/Llama3-OpenBioLLM-70B" rel="nofollow">https://huggingface.co/aaditya/Llama3-OpenBioLLM-70B</a>