TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

© 2025 TechEcho. All rights reserved.

HealthBench – An evaluation for AI systems and human health

174 points by mfiguiere, 9 days ago

19 comments

programmertote 9 days ago

I have no doubt that a lot of garden-variety diagnoses and treatments can be done by an AI system that is fine-tuned and vetted for the task. I recently had to pay $93 for a virtual session with a physician to get a prescription for cough syrup, which I already knew what to take before talking to her because I had done some research/reading. Some may argue, "Doctors studied for years in med school and you shouldn't trust Google more than them", but knowing human fallibility, and knowing that many doctors do look things up on sites like https://www.wolterskluwer.com/en/solutions/uptodate to refresh/reaffirm their knowledge, I'd argue that if we are willing to take the risk, why shouldn't we be allowed to take it on our own? Why do I have to pay $93 (on top of the cough syrup, which cost ~$44) just so that a doctor can see me on Zoom for less than 5 minutes and submit an order for the med?

With healthcare prices increasing at breakneck speed, I am sure AI will take on a larger and larger role in diagnosing and treating common illnesses, and hopefully (though I doubt it) some of those savings will be passed on to patients.

P.S. In contrast to the US system, in my home city (Rangoon, Burma/Myanmar), I have multiple clinics near my home and a couple of pharmacies within two bus stops' distance. I can either buy most of the medications I need from the pharmacy (without a prescription) and take them on my own (why am I not allowed to take that risk?), or I can see a doctor at one of these clinics to confirm my diagnosis, pay him/her $10-$20 for the visit, and then head to the pharmacy to buy the medication. Of course, some medications, such as those containing opioids, will only be sold to me with a doctor's prescription, but a good number of other meds are available as long as I can afford them.
dcreater 9 days ago

Isn't there an obvious conflict of interest when the model maker is also the creator of a benchmark? I think that, at the very least, the benchmark should come from a separate business entity under the non-profit, or from the non-profit holding entity itself.
imiric 9 days ago

Good lord. The idea that a system that produces pseudo-random output without any semantic understanding can be relied on to give accurate health-related information is deeply flawed and troubling. It's one thing to use these things for finding patterns in data, for entertainment purposes, and for producing nonsensical code a human has to fix, but entirely different to rely on them for health diagnosis or advice. I shudder at the thought that a medical practitioner I go to will parrot whatever an LLM told them.

This insanity needs to be regulated yesterday.
iNic 9 days ago

I like that they include the "worst-case score at k samples". This is a much more realistic view of what will happen, because someone will get that 1/100 response.
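HealthBench's exact aggregation isn't spelled out in this thread, but the idea behind a "worst-case score at k samples" can be sketched by resampling. This is a minimal illustration of the metric's intuition, not OpenAI's implementation:

```python
import random

def worst_at_k(scores, k, trials=10_000, rng=None):
    """Estimate the expected worst (minimum) score when drawing k
    responses at random from a pool of per-response scores."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(trials):
        total += min(rng.choice(scores) for _ in range(k))
    return total / trials

# A pool with one bad response in a hundred: the average score looks
# great, but the worst-of-k estimate drops steadily as k grows.
pool = [1.0] * 99 + [0.0]
print(worst_at_k(pool, 1))    # ≈ 0.99 (just the mean)
print(worst_at_k(pool, 16))   # ≈ 0.85 (someone gets the bad draw)
```

This is exactly the commenter's point: the 1/100 failure barely moves the average, but it dominates the worst-case view once enough people sample the model.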
Zaheer 9 days ago

Impressive how well Grok performs in these tests. Grok feels 'underrated' relative to how much coverage other models (Gemini, Llama, etc.) get in the news.
mrcwinn 9 days ago

Happy to see this. I've struggled with an injury for the past five years. I've been to multiple sports-focused physicians and had various scans. Responses from doctors have ranged from "everything seems fine, can't really figure this out" to [completely wrong hypothesis]. Tried acupuncture. Tried a chiropractor. I remember one doctor, though, who had an interesting thought that seemed to make sense, but I'd been so discouraged by so many false starts and misplaced hopes that I didn't bother following up.

Finally I typed my entire history into o3-deep-research and let it rip for a while. It came back with a theory for the injury that matched that one doctor's, with diagrams of muscle groups and even illustrations of proposed exercises. I'm not out of the woods yet, but I am cautiously optimistic for the first time in a long time.
pants2 9 days ago

This appears to be a very thoughtful and helpful study. It's also impressive to see the improvement in performance over just the last year of model development: almost double.

I've found o3 and deep research to be very effective in guiding my health plan. One interesting anecdote: I got hit in the chest (right over the heart) quite hard a month or so ago. I prompted o3 with my ensuing symptoms and heart rate/oxygenation data from my Apple Watch, and it already knew my health history from previous conversations. It gave very good advice and properly diagnosed me with a costochondral sprain. It gave me a timeline to expect (which ended up being 100% accurate) and treatments/ointments to help.

IMO, it's a good idea to have a detailed prompt ready to go with your health history, height/weight, medications and supplements, etc., so that if anything happens to you, you have it handy to give to o3 to help with a diagnosis.
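One simple way to keep such a prompt ready to go; the field names and profile values here are purely illustrative, and the format is just one reasonable choice:

```python
def health_context_prompt(profile: dict, symptoms: str) -> str:
    """Assemble a reusable health-history preamble plus current symptoms
    into a single prompt string to paste into a chat session."""
    background = "\n".join(f"{field}: {value}" for field, value in profile.items())
    return (
        "Patient background (keep in mind for all answers):\n"
        f"{background}\n\n"
        f"Current symptoms: {symptoms}"
    )

# Hypothetical example profile, kept in a note and updated occasionally.
profile = {
    "Age/Sex": "38M",
    "Height/Weight": "180 cm / 80 kg",
    "Medications": "none",
    "Supplements": "vitamin D",
    "History": "costochondral sprain (resolved)",
}
print(health_context_prompt(profile, "chest pain after impact"))
```

Keeping the profile as structured fields rather than free prose makes it easy to update one line (say, a new medication) without rewriting the whole prompt.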
andy99 9 days ago

My sense is that these benchmarks are not realistic in terms of how the model is actually used. In my experience, people building specialized AI systems are not letting users just chat with a base model; they use some variant of RAG, plus guardrails, plus other machinery (like routing common questions to pre-written answers).

So what use case does this test setup reflect? Is there a relevant commercial use case here?
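The production pattern the comment describes can be sketched roughly as a layered pipeline. All names, canned answers, and guardrail rules below are hypothetical placeholders, not any real product's:

```python
# Layered handling: 1) route common questions to vetted pre-written
# answers, 2) apply crude guardrails, 3) only then fall through to a
# RAG-grounded model call. Real systems are far more sophisticated.

CANNED_ANSWERS = {  # pre-written, clinician-vetted answers
    "what is a normal resting heart rate":
        "For most adults, roughly 60-100 beats per minute at rest.",
}

BLOCKED_PHRASES = ("dosage override", "prescription without")  # toy guardrail

def answer(question: str, retrieve=None, generate=None) -> str:
    q = question.strip().lower().rstrip("?")
    if q in CANNED_ANSWERS:                      # 1. routing layer
        return CANNED_ANSWERS[q]
    if any(p in q for p in BLOCKED_PHRASES):     # 2. guardrail layer
        return "I can't help with that; please consult a clinician."
    context = retrieve(q) if retrieve else ""    # 3. RAG retrieval
    if generate:                                 # 4. grounded model call
        return generate(f"Context:\n{context}\n\nQuestion: {question}")
    return "No model configured."
```

Benchmarking the bare model, as the commenter notes, measures only the last layer of a stack like this.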
pizzathyme 9 days ago

Non-clinicians are using ChatGPT every day now to seek assistance (right or wrong) with real-life medical problems. This is a great evaluation set that could prevent a lot of harm.
unsupp0rted 9 days ago

Recently I uploaded a lab report to ChatGPT and asked it to summarize it.

It hallucinated serious cancer, along with all the associated details you'd normally find on a lab report. It had an answer to every question I had pre-asked about the report.

The report said the opposite: no cancer detected.
simianwords 9 days ago

I would really like a benchmark focusing purely on diagnosis: symptoms and patient history vs. the real diagnosis. Maybe name the model House M.D. 1.0 or something.

The other stuff is good to have, but ultimately a model that focuses on diagnosing medical conditions is going to be the most useful. Look, we aren't going to replace doctors anytime soon, but it is good to have a second opinion from an LLM purely for diagnosis. I would hope it captures patterns that weren't observed before. This is exactly the sort of game that AI can beat a human at: large-scale pattern recognition.
ziofill 9 days ago

If your condition can easily be resolved by waiting a little and letting your body recover, an honest doctor will tell you so. I wonder if an AI will ever risk *not* recommending that you see a doctor.
kypro 9 days ago

Why are all the labels in the "Worst-case HealthBench score at k samples" chart the same colour and the same shape? Completely unreadable.
srameshc 9 days ago

Is the Med-PaLM model that Google has been working on meant to be considered for comparison? If I'm not mistaken, it isn't publicly available.

> https://sites.research.google/med-palm/
GuinansEyebrows 9 days ago

I have zero trust in OpenAI's ability to do anything impartially. Why should we leave the judgement of a private tool up to the makers of that tool, especially when human lives are at stake?
NKosmatos 9 days ago

Most probably I'm going to get downvoted, but I'm gonna say it...

It's a pity they don't support the Greek language, keeping in mind that almost all medical terminology has Greek origins.

Anyhow, this is a step in the right direction, and it will surely aid many people looking for medical assistance via ChatGPT.
yapyap 9 days ago
Sam Altman does not care about “improving human health”
Quenby 9 days ago

I walked away from this with a feeling I can't quite put into words. I'm not a doctor, and I don't really understand medical AI, but a friend of mine who is a doctor has been relying more and more on ChatGPT lately: to look up guidelines, to organize his thoughts. He says it's not that it knows more than he does, but that it's fast, clear, and saves him time.

That got me thinking. I used to assume AI in healthcare was about replacing people. Now it feels more like an extension. Like doctors are getting an extra pair of hands, or a second consultation room that's always online. Maybe that's what progress looks like: human judgment is still there, but increasingly shaped by structured input. I don't know if that's good or bad. It just feels complicated.
ramon156 9 days ago

I don't want to be a conspiracy theorist, but could this be in preparation for Amazon's (to-be) health branch?