
Launch HN: Talc AI (YC S23) – Test Sets for AI

132 points by maxrmk over 1 year ago
Hey all! Max and Matt here from Talc AI. We do automated QA for anything built on top of an LLM. Check out our demo: https://talc.ai/demo

We've found that it's very difficult to know how well LLM applications (and especially RAG systems) are going to work in the wild. Many companies tackle this by having developers or contractors run tests manually. It's a slow process that holds back development, and it often results in unexpected behavior when the application ships.

We've dealt with similar problems before: Max was a staff engineer working on systematic technical solutions for privacy problems at Facebook, and Matt worked on ML ops on Facebook's election integrity team, helping run classifiers that handled trillions of data points. We learned that even the best predictive systems need to be deeply understood and trusted to be useful to product teams, and we set out to build the same understanding in AI.

To solve this, we take ideas from academia on how to benchmark the general capabilities of language models, and we apply them to generating domain-specific test cases that run against your actual prompts and code.

Consider an analogy: if you're a lawyer, we don't need to be lawyers to open up a legal textbook and test your knowledge of the content. Similarly, if you're building a legal AI application, we don't need to build your application to come up with an effective set of tests that can benchmark your performance.

To make this more concrete: when you pick a topic in the demo, we grab the associated Wikipedia page and extract a bunch of facts from it using a classic NLP technique called "named entity recognition". For example, if you picked FreeBASIC, we might extract the following line from it:

  Source of truth: "IDEs specifically made for FreeBASIC include FBide and FbEdit,[5] while more graphical options include WinFBE Suite and VisualFBEditor."

This line is our source of truth. We then use an LLM to work backwards from this fact into a question and answer:

  Question: "What programming language are the IDEs WinFBE Suite and FbEdit designed to support?"
  Reference Answer: "FreeBasic"

We can then evaluate accurately by comparing the reference answer and the original source of truth – this is how we generate "simple" questions in the demo.

In production we're building this same functionality on our customers' knowledge base instead of Wikipedia. We then employ a few different strategies to generate questions – these range from simple factual questions like "How much does the 2024 Chevy Tahoe cost?" to complex questions like "What would a mechanic have to do to fix the recall on my 2018 Golf?" These questions are based on facts extracted from your knowledge base and real customer examples.

This testing and grading process is fast – it's driven by a mixture of LLMs and traditional algorithms, and it can turn around in minutes. Our business model is pretty simple: we charge for each test created. If you opt to use our grading product as well, we charge for each example graded against the test.

We're excited to hear what the HN community thinks – please let us know in the comments if you have any feedback, questions or concerns!
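[Editor's note] A rough sketch of the "simple question" pipeline described above, for readers who want to see the mechanics: extract entity-bearing sentences as sources of truth, then ask an LLM to work backwards from each fact to a question/answer pair. It assumes spaCy for named entity recognition and an OpenAI chat model; the prompt wording, model name, and JSON output format are illustrative assumptions, not Talc's actual implementation.

  # Sketch only: sources of truth via NER, then fact -> Q/A via an LLM.
  import json
  import spacy
  from openai import OpenAI

  nlp = spacy.load("en_core_web_sm")
  client = OpenAI()

  def extract_fact_sentences(page_text: str) -> list[str]:
      # Keep sentences that mention named entities; these become sources of truth.
      doc = nlp(page_text)
      return [sent.text.strip() for sent in doc.sents if sent.ents]

  def fact_to_qa(fact: str) -> dict:
      # Work backwards from a fact into a question and a short reference answer.
      resp = client.chat.completions.create(
          model="gpt-4o-mini",  # illustrative model choice
          messages=[{
              "role": "user",
              "content": (
                  "Given this source-of-truth sentence, write one factual question "
                  "it answers and a short reference answer. Respond as JSON with "
                  f'keys "question" and "answer".\n\nSource of truth: {fact}'
              ),
          }],
          response_format={"type": "json_object"},
      )
      return json.loads(resp.choices[0].message.content)

  fact = ("IDEs specifically made for FreeBASIC include FBide and FbEdit, "
          "while more graphical options include WinFBE Suite and VisualFBEditor.")
  print(fact_to_qa(fact))  # e.g. {"question": "...", "answer": "FreeBASIC"}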

20 comments

Imnimo over 1 year ago
I tried the demo with Cal Ripken Jr. I was surprised by some of the complex questions:

> Which MLB player won the Sporting News MLB Rookie of the Year Award as a pitcher in 1980, and who did Cal Ripken Jr. surpass to hold the record for most home runs hit as a shortstop?

> What team did Britt Burns play for in the minor leagues before making his MLB debut, and in what year did Cal Ripken Jr. break the consecutive games played record?

> Who was the minor league pitching coordinator for the Houston Astros until 2010, and what significant baseball record did Cal Ripken Jr. break in 1995?

All five questions are a combination of a question about a Britt Burns fact and an unrelated Cal Ripken fact.

Why is this? Britt Burns doesn't seem to appear on the live Wikipedia page for Ripken. Does he appear on a cached version? Or is it forming complex questions by finding another page in the same category as Ripken and pulling more facts?
koeng over 1 year ago
I love your demo for this. It's one of the best demos I've ever come across in a Launch HN. Very easy to understand and use. It seems to struggle with more complex questions though. For example:

Question: Why does the pUC19 plasmid have a high copy number in bacterial cells?

Expected answer: The pUC19 plasmid has a high copy number due to the lack of the rop gene and a single point mutation in the origin of replication (ori) derived from the plasmid pMB1.

GPT response: The pUC19 plasmid has a high copy number in bacterial cells due to the presence of the pUC origin of replication, which allows for efficient and rapid replication of the plasmid.

Both are technically correct - the expected answer is simply more detailed about the pUC origin, but both would be considered correct. It seems difficult to test things like this, but maybe that's just not possible to really get right.

I wonder how well something like FutureHouse's wikicrow will work for summarizing knowledge better - https://www.futurehouse.org/wikicrow - and how that could be benchmarked against Talc.
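[Editor's note] One common way to handle this kind of near-miss is to judge whether the shorter answer is still consistent with the source of truth, instead of matching it against the reference string. A minimal sketch, assuming an OpenAI chat model; the prompt and model name are illustrative, and this is not necessarily how Talc grades.

  # Hypothetical LLM-as-judge grader: marks a response correct if it agrees
  # with the source of truth, even when it is less detailed than the reference.
  from openai import OpenAI

  client = OpenAI()

  def grade(source_of_truth: str, question: str, response: str) -> bool:
      verdict = client.chat.completions.create(
          model="gpt-4o-mini",  # illustrative model choice
          messages=[{
              "role": "user",
              "content": (
                  f"Source of truth: {source_of_truth}\n"
                  f"Question: {question}\n"
                  f"Candidate answer: {response}\n\n"
                  "Does the candidate answer agree with the source of truth, even if "
                  "it omits detail? Reply with exactly CORRECT or INCORRECT."
              ),
          }],
          temperature=0,  # reduce run-to-run variation in the verdict
      )
      return verdict.choices[0].message.content.strip().upper().startswith("CORRECT")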
typpo over 1 year ago
Congrats on the launch!

I've been interested in automatic test set generation because I find that the chore of writing tests is one of the reasons people shy away from evals. Recently landed eval test set generation for promptfoo (https://github.com/typpo/promptfoo), but it is non-RAG, so simpler than your implementation.

Was also eyeballing this paper, https://arxiv.org/abs/2401.03038, which outlines a method for generating asserts from prompt version history that may also be useful for these eval tools.
logiduck over 1 year ago
For the Chevy Tahoe example, you're referencing the dealership incident, but that wasn't a case of the implementation failing a positive test for fact extraction; it was a failure of the guardrails.

Aren't the guardrail tests much harder, since they're open-ended and have to guard against unknown prompt injections, while testing facts is much simpler?

I think a test suite that guards against that infinite surface area is more valuable than testing whether a question matches a reference answer.

Interested in how you view testing against giving a wrong answer outside of the predefined scope, as opposed to testing that all the test questions match a reference.
andy99 over 1 year ago
The first thing that popped into my head is: what do you do with the test results? Specifically, how do they feed back into model improvement in a way that avoids overfitting? Do you think having some kind of classical "holdout" question set is enough? Especially with RAG, given the levers that are available (prompt, chunking strategy, ...), if you define a bunch of test questions, do you end up overfitting to them, or to the current data set? How can findings be extrapolated to new situations?
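[Editor's note] For readers unfamiliar with the holdout idea raised here, one simple arrangement is to split the generated questions once, tune prompts and chunking against the development slice only, and look at the holdout slice sparingly. A sketch under that assumption, not something Talc has described.

  # Hypothetical dev/holdout split over generated test questions.
  import random

  def split_questions(questions: list[dict], holdout_frac: float = 0.2, seed: int = 0):
      rng = random.Random(seed)  # fixed seed keeps the split stable across runs
      shuffled = questions[:]
      rng.shuffle(shuffled)
      cut = int(len(shuffled) * (1 - holdout_frac))
      return shuffled[:cut], shuffled[cut:]  # (dev set, holdout set)

  # dev, holdout = split_questions(generated_questions)
  # Tune prompts/chunking against `dev`; check `holdout` accuracy only occasionally.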
nicolewhite over 1 year ago
Pretty neat!

I have a question about how you intend to deal with LLM applications where the output is more creative, e.g. an app where the user input is something like "write me a story about X" and the LLM app is using a higher temperature to get more creative responses. In these cases I don't think it's possible to represent the ideal output as a single string -- it would need to be a more complicated schema, like a list of constraints for the output, e.g. that it contains certain substrings.

TIA!
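[Editor's note] A minimal sketch of the constraint-list schema the commenter describes, where a test case carries a list of predicate checks instead of a single reference string. The class and field names are hypothetical, not Talc's API.

  # Hypothetical constraint-based test case for creative outputs.
  from dataclasses import dataclass, field
  from typing import Callable

  @dataclass
  class CreativeTestCase:
      prompt: str
      checks: list[Callable[[str], bool]] = field(default_factory=list)

      def grade(self, output: str) -> float:
          # Fraction of constraints the output satisfies.
          return sum(check(output) for check in self.checks) / len(self.checks)

  case = CreativeTestCase(
      prompt="Write me a short story about a lighthouse keeper.",
      checks=[
          lambda out: "lighthouse" in out.lower(),    # required substring
          lambda out: len(out.split()) >= 100,        # minimum length
          lambda out: "as an ai" not in out.lower(),  # no refusal boilerplate
      ],
  )
  # score = case.grade(llm_app(case.prompt))  # llm_app is the system under test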
moinism over 1 year ago
Congrats on the launch! Just tried the demo and it looks impressive. Good luck.

Are you by any chance hiring global-remote, full-stack/front-end devs? Would love to work with you guys.
bestai over 1 year ago
I think for your idea to have traction, (1) the questions should be selected by their importance and (2) the questions should be chained to allow new results. Just for inspiration, you could create a quiz for solving a puzzle and at the same time solve the puzzle by answering the questions. The big idea is using your tool to enhance step-by-step reasoning in LLMs.

I think you could use a text area for the user to indicate whether the quiz is about getting the main idea or about testing the details.

And for big clients, the system could be tailored so that the questions and structure reflect user intentions.
julesvr over 1 year ago
Congrats on the launch!

On your pricing model: since it's usage-based, don't you incentivize your customers to use your product as little as possible? Wouldn't it be better to have limited tiers with fixed annual/monthly recurring rates? Also, do you sell to enterprise? I assume they would like that setup even more, since the rates are predefined and they have a budget they have to stick to.

I'm currently developing my own pricing model and these are some issues I'm struggling with, so curious what you think.
tommykins over 1 year ago
As someone who uses machine learning to predict the presence of talc, I approve of this, even if I have no use case for it whatsoever.
tikkun over 1 year ago
I like the Chevy Tahoe callback - I'm assuming that's a reference to the Chevy dealership that used an LLM and had people doing prompt tricks to get the chatbot to offer them a Chevy Tahoe for $1.

The specificity in your writing above ("to make this more concrete") about how it works was also helpful for understanding the product.
pchunduri6 over 1 year ago
I just tried the demo, and it looks great! Congrats on the launch!

I have a couple of questions:

1) How often do you find that the LLM fails to generate correct question-answer pairs? The biggest challenge I'm facing with LLM-based evaluation is the variability in LLM performance. I've found that the same prompt results in different LLM responses over multiple runs. Do you have any insights on this issue and how to address it?

2) Sometimes, the domain expert generating the test set might not be well equipped to grade the answers. Consider a customer-facing chatbot application. The RAG app might be focused on very specific user information that might be hard for the test set creator to verify or attest to. Do you think there are ways to make this grading process easier?
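[Editor's note] On the variability point in (1), one simple mitigation, offered here only as a sketch rather than Talc's stated approach, is to repeat the nondeterministic step a few times and keep the majority verdict.

  # Hypothetical variance-reduction wrapper: repeat a nondeterministic LLM call
  # several times and keep the most common result.
  from collections import Counter
  from typing import Callable

  def majority_vote(run_once: Callable[[], str], runs: int = 5) -> str:
      # Tally the outcomes of repeated runs and return the most frequent one.
      outcomes = Counter(run_once() for _ in range(runs))
      return outcomes.most_common(1)[0][0]

  # Example with the hypothetical grade() sketch earlier in the thread:
  # verdict = majority_vote(lambda: "CORRECT" if grade(src, q, resp) else "INCORRECT")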
sherlock_h over 1 year ago
Looks interesting. How do you rate the correctness? Some complex LLM answers seemed to be correct but not in as much detail as the expected answer.

How do you generate the answers? Does the model have access to the original source of truth (like in RAG apps)?

And in your examples, what model do you actually use?
maho over 1 year ago
Is there a way I can give feedback on wrong labels? The easy questions seem to be correct most (all?) of the time, but I noticed a few errors in the labelling of the complex questions/answers. I would love to see this improve even further!
dkindler over 1 year ago
Here's an example where the GPT response was correct but was marked as incorrect: https://ibb.co/tMGxcf3
Robotenomics over 1 year ago
Very, very impressive. I ran a couple of tests, and on the complex questions it received 80%, although I would say the grading was harsh, as the answer could be said to be correct. I also found the questions generated rather simple, not complex.

On the second test it scored 100% incorrect on the complex questions! However, when I checked directly with GPT-4, based on the questions rendered, it answered 100% correctly. Could that be due to my custom settings in GPT-4? I will run it with university students. Fascinating work.
ore0s over 1 year ago
Congrats on the launch! How does this compare to https://www.patronus.ai/ ? They seem to offer a very similar solution for getting on top of unpredictable LLM output.
quadcore over 1 year ago
Now someone has to test Talc AI. I can do it.

Impressive demo and business idea, congrats, good luck!
bestai over 1 year ago
I think allowing other languages besides English could be a good idea.
Departed7405 over 1 year ago
Maybe it was already said, but I have a weird bug where the system said the last answer was incorrect, but in the overview it still says "100% correct".