科技回声 (Tech Echo)
Test Driven Development (TDD) for your LLMs? Yes please, more of that please

86 points by lewq, 6 months ago

8 comments

satisfice, 6 months ago
Posts like this make me sick at heart as a tester. This is a video of someone with no respect for or understanding of the mission, complexities and subtleties of testing, who tramples all over any concept of engineering ethics with his sleek tools and detached, impassive English accent.

How do you test things? Easy, he implies: Tell an LLM to test them and then assume everything will be okay! Also, STOP ASKING QUESTIONS!

There is zero critical thinking in this video beyond a speedo-level of coverage given by the first test idea that drifts into this guy's head. He's not testing, he's not engineering, he's just developing excuses to release a product.
heavyarms, 6 months ago
Whenever I see one of these posts, I click just to see if the proposed solution to testing the output of an LLM is to use the output of an LLM... and in almost all cases it is. It doesn't matter how many buzzwords and acronyms you use to describe what you're doing, at the end of the day it's turtles all the way down.

The issue is not the technology. When it comes to natural language (LLM responses that are sentences, prose, etc.) there is no actual standard by which you can even judge the output. There is no gold standard for natural language. Otherwise language would be boring. There is also no simple method for determining truth... philosophers have been discussing this for thousands of years and after all that effort we now know that... ¯\_(ツ)_/¯... and also, Earth is Flat and Birds Are Not Real.

Take, for example, the first sentence of my comment: "Whenever I see one of these posts, I click just to see if the proposed solution to testing the output of an LLM is to use the output of an LLM... and in almost all cases it is." This is absolutely true, in my own head, as my selective memory is choosing to remember that one time I clicked on a similar post on HN. But beyond the simple question of whether it is true or not, even an army of human fact checkers and literature majors could probably not come up with a definitive and logical analysis regarding the quality and veracity of my prose. Is it even a grammatically correct sentence structure... with the run-on ellipsis and whatnot... ??? Is it meant to be funny? Or snarky? Who knows ¯\_(ツ)_/¯ WTF is that random pile of punctuation marks in the middle of that sentence... does the LLM even have a token for that?
bdangubic, 6 months ago
I just mock the answers and assert on the mock of the answer - never fails!
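
For anyone who has not met this pattern in the wild, here is a minimal sketch of what the joke describes. The `summarize` function and the `complete` method name are hypothetical, made up for illustration. The test passes no matter what the canned answer says, because it only compares the mock's value with itself.

```python
from unittest.mock import Mock


def summarize(llm, text):
    """Hypothetical application code: forwards a prompt to an LLM client."""
    return llm.complete(f"Summarize: {text}")


def run_tautological_test(canned_answer):
    """Mock the answer, then assert on the mock of the answer."""
    llm = Mock()  # stands in for a real client; `complete` is an assumed method name
    llm.complete.return_value = canned_answer
    result = summarize(llm, "some long document")
    # Compares the canned value with itself, so it cannot fail on content.
    assert result == llm.complete.return_value
    # The only real information: the client was called exactly once.
    llm.complete.assert_called_once()
    return True


passes_good = run_tautological_test("a short, accurate summary")
passes_garbage = run_tautological_test("!!! not a summary at all !!!")
```

Both calls succeed, which is the point: a test like this checks the plumbing around the model, not the quality of anything the model says.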
jmathai, 6 months ago
I've been working on a prompt-to-application product[1] and one of the approaches we tried was test-driven development. We would have the LLM write tests based on a detailed description of the application. Then we would give the LLM the tests and the requirements and ask it to write the application.

The thinking is we could run the tests to verify that the requirements are functional (assuming it wrote the tests correctly in the first place; in many cases it did, fyi).

The problem was that it was too fickle. Sometimes the failing tests caught application bugs. But too often the LLM just couldn't get the tests to pass even though sometimes the application was working fine.

It resulted in a terrible user experience (users only see the latency of getting the application correctly written, or a failure if it gives up).

That being said, I think a lot of the issues folks like us find with LLMs are because we haven't figured out how and what to ask.

Ultimately, we found an alternative approach which gets at least 95% of the application working 100% of the time. And this is actually a MUCH better user experience than waiting forever to sometimes just get "Sorry, we couldn't create your application.".

[1] https://withlattice.com
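
The generate-then-verify loop described in this comment can be sketched roughly as follows. This is not the commenter's implementation: `fake_generate` is a deterministic stand-in for an LLM call, `run_tests` stands in for LLM-written tests, and `max_attempts` is an assumed retry budget. It only illustrates why the user sees either added latency (retries) or a hard failure (the `None` case).

```python
from typing import Callable, Optional


def generate_until_tests_pass(
    generate: Callable[[str, int], str],
    run_tests: Callable[[str], bool],
    requirements: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Ask for code until the tests pass or the retry budget is spent."""
    for attempt in range(max_attempts):
        candidate = generate(requirements, attempt)
        if run_tests(candidate):
            return candidate
    return None  # the "Sorry, we couldn't create your application." case


def fake_generate(requirements: str, attempt: int) -> str:
    """Hypothetical stand-in for an LLM: buggy first, correct afterwards."""
    if attempt == 0:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def run_tests(source: str) -> bool:
    """Stand-in for generated tests: load the candidate and check behaviour."""
    namespace = {}
    try:
        exec(source, namespace)  # acceptable here only because the input is our own stub
        add = namespace["add"]
        return add(2, 3) == 5 and add(-1, 1) == 0
    except Exception:
        return False


app = generate_until_tests_pass(fake_generate, run_tests, "add two numbers")
gave_up = generate_until_tests_pass(
    lambda req, attempt: "def add(a, b):\n    return 0\n",
    run_tests,
    "add two numbers",
)
```

Note that the loop trusts `run_tests` completely. If the tests themselves are wrong, a working application is rejected and the budget is burned anyway, which matches the fickleness described above.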
benatkin, 6 months ago
I read this blog post on my iPhone and when I went to the top to try and find out more about Helix, it had a giant link to install the Substack iOS app, which detracted from the experience. It might be a good idea to use a real CMS.

Here's the website: https://tryhelix.ai/
throwawaymaths, 6 months ago
It's inherently difficult because LLMs are necessarily probabilistic and, even worse, for any practical use the key step irreversibly discards most of the probabilities.
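
One way to read "the key step" is token sampling: the model produces a full distribution over next tokens, and the decoder keeps a single draw from it. The toy sketch below assumes that reading. The three-word vocabulary and the logits are made up, and no real model is involved.

```python
import math
import random
from collections import Counter


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["yes", "no", "maybe"]  # toy vocabulary (assumption)
logits = [2.0, 1.5, 0.5]        # made-up scores, not from a real model
probs = softmax(logits)


def sample_token(probs, rng):
    """Draw one index; the rest of the distribution is not returned."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
token = vocab[sample_token(probs, rng)]

# Repeating the same "prompt" many times yields different tokens,
# which is what makes exact-match assertions on outputs brittle.
draws = Counter(vocab[sample_token(probs, rng)] for _ in range(1000))
```

A caller who only sees `token` cannot tell whether it was a near-certain choice or a coin flip. That information lived in `probs` and is gone by the time the text reaches a test.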
justanotheratom, 6 months ago
There is a real need for this. I have to admit most of my testing right now is vibes-based. Problem is, these LLM evaluation platforms get in between me and my LLM.
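
A step up from vibes that needs no platform in the middle is a handful of plain, in-process checks on properties of the reply instead of its exact wording. The sketch below is only an illustration: the JSON shape and the `answer` field are assumptions, not anything from the linked post.

```python
import json
from typing import List


def check_reply(reply: str) -> List[str]:
    """Return the names of the checks that failed (empty list means pass)."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return ["valid_json"]
    if not isinstance(data, dict):
        return ["is_object"]
    failures = []
    if "answer" not in data:  # `answer` is a hypothetical required field
        failures.append("has_answer")
    elif not isinstance(data["answer"], str) or not data["answer"].strip():
        failures.append("answer_nonempty")
    return failures


ok = check_reply('{"answer": "42"}')
not_json = check_reply("Sure! Here is your answer: 42")
missing = check_reply('{"foo": 1}')
blank = check_reply('{"answer": "   "}')
```

Checks like these say nothing about whether the answer is true, but they run inside an ordinary test suite, right next to the code that calls the model.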
jasfi, 6 months ago
I'm working on an AI agents platform that intends to reduce the amount of code you need to write to get high-performing prompts working correctly.

The wait-list is at https://aiconstrux.com