Deterministic Quoting: Making LLMs safer for healthcare

117 points by mattyyeung about 1 year ago

14 comments

I built and sold a company that does this a year ago. It was hard 2 years ago, but now pretty standard RAG with a good implementation will get you there.

The trick is, healthcare users would complain to no end about determinism. But these are “below-the-line” users, aka folks who don’t write checks and who are outperformed by the AI. (I am a pharmacist by training, and plain vanilla GPT4-turbo is better than me.)

Don’t really worry about them. The folks who are interested and willing to pay for AI have more practical concerns, like what the ROI is and what the implementation is like.

Also - folks should be building Baymax from Big Hero 6 by now (the medical capabilities, not the rocket arm stuff). That’s the next leg up.

w10-1 about 1 year ago

I'm not sure determinism alone is sufficient for proper attribution.

This presumes "chunks" are the source. But it's not easy to identify the propositions that form the source of some knowledge. In the best case, you are looking for an association and find it in a sentence you've semantically parsed, but that's rarely the case, particularly for medical histories.

That said, deterministic accuracy might not matter if you can provide enough context, particularly for further exploration. But that's not really "chunks".

So it's unclear to me that tracing probability clouds back to chunks of text will work better than semantic search.

not2b about 1 year ago

I was thinking that something like this could be useful for discovery in legal cases, where a company might give up a gigabyte or more of allegedly relevant material in response to discovery demands and the opposing side has to plow through it to find the good stuff. But then I thought of a countermeasure: there could be messages in the discovery material that act as instructions to the LLM, telling it what it should not find. We can guarantee that any reports generated will contain accurate quotes, and even show where they are so that surrounding context can be found. But perhaps, if the attacker controls the input data, things can be missed. And it could be done in a deniable way: email conversations talking about LLMs that also have keywords related to the lawsuit.

resource_waste about 1 year ago

I feel like this is the perfect application of running the data multiple times.

Imagine having ~10-100 different LLMs: maybe some are medical, maybe some are general, some are from a different language. Have them all run it, then rank the answers.

Now I believe this can be amplified further by having another prompt ask to confirm the previous answer. This could get a bit insane computationally with 100 original answers, but I believe the original paper I read found that by doing this prompt processing ~4 times, they got to some 95% accuracy.

So 100 LLMs give an answer, and each time we process it 4 times. Can we beat a 64-year-old doctor?
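A minimal sketch of the "have them all run it, rank the answers" idea, assuming ranking means a simple majority vote over normalized answers. The name `rank_answers` is hypothetical, and each entry in `models` is a plain callable standing in for a real LLM client. The follow-up "confirm the previous answer" pass is not modeled here.

```python
from collections import Counter
from typing import Callable, Sequence


def rank_answers(
    question: str,
    models: Sequence[Callable[[str], str]],
) -> list[tuple[str, int]]:
    """Ask every model the same question and rank answers by vote count.

    Answers are lowercased and stripped so that trivially different
    spellings of the same answer are counted together (an assumption;
    free-text answers would need a smarter equivalence check).
    """
    answers = [model(question).strip().lower() for model in models]
    # most_common() returns (answer, votes) pairs, highest vote count first.
    return Counter(answers).most_common()


# Stand-in "models" for illustration only.
fake_models = [lambda q: "Yes", lambda q: "yes ", lambda q: "No"]
ranking = rank_answers("Is drug A contraindicated with drug B?", fake_models)
```

Majority voting only helps when the models' errors are not strongly correlated, so the gain from adding more models of the same family may be smaller than the raw count suggests.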

simonw about 1 year ago

I like this a lot. I've been telling people for a while that asking for direct quotations in LLM output - which you can then "fact-check" by confirming them against the source document - is a useful trick. But that still depends on people actually doing that check, which most people won't do.

I'd thought about experimenting with automatically validating that the quoted text does indeed 100% match the original source, but should even a tweak to punctuation count as a failure there?

The proposed deterministic quoting mechanism feels like a much simpler and more reliable way to achieve the same effect.
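A rough sketch of the automatic validation described above, assuming quotes appear between straight double quotation marks in the LLM output, and assuming that differences in case, whitespace, and punctuation should not count as failures. That tolerance is one possible answer to the punctuation question, not a settled one, and the names `normalize` and `unverified_quotes` are hypothetical.

```python
import re
import string

# Characters ignored when comparing: ASCII punctuation plus curly quotes.
_IGNORED = string.punctuation + "“”‘’"


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = text.lower().translate(str.maketrans("", "", _IGNORED))
    return " ".join(text.split())


def unverified_quotes(llm_output: str, source: str) -> list[str]:
    """Return quoted spans from llm_output that are not found in source."""
    quotes = re.findall(r'"([^"]+)"', llm_output)
    norm_source = normalize(source)
    failures = []
    for quote in quotes:
        norm_quote = normalize(quote)
        # A quote that normalizes to nothing cannot be verified.
        if not norm_quote or norm_quote not in norm_source:
            failures.append(quote)
    return failures


source_doc = "Patient should take 5 mg, once daily."
output = 'The note says "take 5 mg once daily" and "avoid alcohol".'
failed = unverified_quotes(output, source_doc)
```

Substring matching on normalized text can match across partial words and can hide a punctuation change that alters meaning, so a stricter exact-match mode may be preferable in clinical settings.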

budududuroiu about 1 year ago

My issue with RAG systems isn’t hallucinations. Yes, sure, those are important. My issue is recall. Given a petabyte-scale index of chunks, how can I make sure that my RAG system surfaces the “ground truth” I need, and not just “the most similar vector”?

This, I think, is scarier: a healthcare-oriented (or any industry) RAG retrieving a bad, but highly linguistically similar, answer.
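One common way to put a number on this concern is recall@k over a hand-labeled evaluation set. The sketch below assumes each query has a known set of "ground truth" chunk IDs; the function name and the chunk IDs are hypothetical, and this only measures the problem rather than solving it.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k results."""
    if not relevant:
        # No labeled ground truth for this query; report 0.0 by convention.
        return 0.0
    hits = relevant.intersection(retrieved[:k])
    return len(hits) / len(relevant)


# Ranked chunk IDs returned by the retriever for one query (illustrative).
retrieved_ids = ["c1", "c7", "c3"]
# Chunk IDs a human marked as the ground truth for that query.
relevant_ids = {"c3", "c9"}

score_top3 = recall_at_k(retrieved_ids, relevant_ids, k=3)
score_top2 = recall_at_k(retrieved_ids, relevant_ids, k=2)
```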

Animats about 1 year ago

It's a search engine, basically?

burntcaramel about 1 year ago

Are there existing terms of art for this concept? It’s not like slightly unreliable writers are a new concept, such as a student writing a paper.

For example:

- Authoritative reference: https://www.montana.edu/rmaher/ee417/Authoritative%20References.pdf
- Authoritative source: https://piedmont.libanswers.com/faq/135714

bradfox2 about 1 year ago

Very cool. My company is building a very similar tool for nuclear engineering and power applications that face similar adoption challenges for LLMs. We're also incorporating the idea of 'many-to-many' document claim validation and verification. The UX allowing high-speed human verification of LLM-resolved claims is what we're finding most important.

DeepMind published something similar recently for claim validation and hallucination management and got excellent results.

yonigo10 about 1 year ago

A more robust approach: https://yonigottesman.github.io/2023/08/10/extractive-generative.html

telotortium about 1 year ago

We’ve developed LLM W^X now - time to develop LLM ROP!

itishappy about 1 year ago

What happens if it hallucinates the <title>?

nextworddev about 1 year ago

Did I miss something, or did the article never describe how the technique works? (Despite the “How It Works” section.)

mattyyeung about 1 year ago

Author here, thanks for your interest! Surprising way to wake up in the morning. Happy to answer questions.
