Setting aside the sensationalist headline, the entire premise of the article is flawed. It's a case of not even being wrong. Of course you're going to get spurious results from poor data.<p>The root cause is the author's reliance on structured EMR data. We have found that structured data is at best 35% accurate. Sure, it's better than claims data, but it does not reach the level of quality necessary to inform clinical decision-making. The reason is that almost everything clinically relevant is captured in free-text fields: clinical notes. To build proper models from information in EMRs, you have to start by processing the narrative data, which is a hard problem.<p>Training models to interpret clinical notes requires clinical expertise. Clinicians record facts differently in different locations, there are many ways to say the same thing, and sometimes they skip underlying facts because some other fact implies them. Different specialties record things differently too. You really cannot just throw some data into a notebook and hope it works. Even with clinician input, we still find that high-quality results require ensemble models that combine multiple techniques; plain NLP doesn't work either.<p>Take, for example, non-alcoholic steatohepatitis (NASH), the leading cause of liver failure requiring liver transplant. NASH is a complication of non-alcoholic fatty liver disease (NAFLD), in which your liver has unusually large deposits of fat. NAFLD is not coded in structured data. To identify it from unstructured data, you have to extract concepts related to liver cancer, pre-diabetes, alcohol use, liver fibrosis, cirrhosis, jaundice, fatigue, and loss of appetite. To make a long story short, you cannot do these things using structured data or naive NLP approaches. 
The resulting F1 score is zero.<p>So maybe his point, "Data encodes clinical expertise," is worthwhile, but the rest of the article... not so much.<p>Source: My company, Verantos <a href="https://verantos.com" rel="nofollow">https://verantos.com</a>, specializes in generating high-validity evidence from data we abstract from EHRs using machine techniques.
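<p>For anyone wondering why an F1 can come out as exactly zero rather than merely low, here is a minimal sketch. It assumes a made-up cohort in which the condition is never coded in structured fields, so a structured-data query flags nobody. The function name and the counts are illustrative only and are not taken from our data.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Compute F1 from confusion counts; return 0.0 where it is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical cohort: 40 patients truly have the condition, but none of
# them carries a structured code, so the query returns no positives at all.
structured_only = f1_score(tp=0, fp=0, fn=40)
print(structured_only)
```

With no true positives, recall is zero, and the harmonic mean collapses to zero no matter how many true negatives there are.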