
How to generate tested software packages using LLMs, a sandbox and a while loop

133 points by pierremenard, almost 2 years ago

10 comments

hellodanylo, almost 2 years ago
This should give a second life to Test-Driven Development.

One of the under-appreciated wisdoms of TDD is that in many problems there is a complexity asymmetry between finding a solution and (fully or partially) verifying it. Examples of asymmetric problems: inverting matrices, sorting an array, computing a function's gradient, compressing a byte stream, etc.

The human writes the easier part -- the test suite -- and the language model writes the harder part -- the solution. This can be a net gain in productivity.
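As a concrete illustration of that asymmetry, a partial verifier for sorting fits in a few lines, while the solution it checks is much harder to get right. A minimal sketch (the function name is illustrative, not from the article):

    # Illustration of the verify/solve asymmetry for sorting: the
    # human-written check is trivial; a correct sort is not.
    from collections import Counter

    def is_valid_sort(original: list, result: list) -> bool:
        """Verifier: result must be ordered and a permutation of the input."""
        ordered = all(a <= b for a, b in zip(result, result[1:]))
        return ordered and Counter(original) == Counter(result)

    # The harder half -- an actual sorting implementation -- is what the
    # language model would be asked to synthesize against this check.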
choeger, almost 2 years ago
Program synthesis by a language model is certainly cool, but I don't think it's really novel. After all, the mutator could just as well be a random AST transformer.

So what's the real deal here? Does the mutator arrive at a usable program much quicker? Or does it cheat by looking at the test cases?

As an example, if I specify an MD5 hash function on strings, what will that tool produce?
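For reference, the loop the submission title describes might look roughly like the sketch below. This is a hypothetical reconstruction, not the article's actual code; llm(prompt) and run_tests(code) are assumed helpers standing in for a completion call and a sandboxed test runner returning (passed, error_output):

    # Minimal sketch of a generate-and-test while loop. The llm() and
    # run_tests() helpers are hypothetical placeholders, not a real API.
    from typing import Optional

    def synthesize(spec: str, max_iters: int = 10) -> Optional[str]:
        prompt = f"Write a Python module satisfying this spec:\n{spec}"
        for _ in range(max_iters):
            candidate = llm(prompt)                # propose a solution
            passed, errors = run_tests(candidate)  # verify in a sandbox
            if passed:
                return candidate
            # Feed the failure output back so the model can repair it.
            prompt = (f"This code failed its tests.\nCode:\n{candidate}\n"
                      f"Errors:\n{errors}\nFix it so all tests pass.")
        return None  # gave up after max_iters attempts

Whether the model "cheats" depends on whether the test bodies themselves are included in that feedback prompt or only the failure output.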
Joker_vD, almost 2 years ago
Ah, so we're turning "a million monkeys at a million keyboards could produce the complete works of Shakespeare" into an actual, real thing? What a truly disruptive synergy.
DylanSp, almost 2 years ago
This definitely seems like a potentially powerful approach. That said, maybe I'm missing something obvious, but is the LLM generating both the tests and the implementation? If so, it seems like there could be issues caused by the generated tests not matching what's specified in the initial prompt. Manually writing highly focused unit tests doesn't seem like the best way to work with this, but being able to manually write some sort of high-level, machine-checkable spec might be useful.
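One shape such a high-level, machine-checkable spec could take (a hypothetical sketch, not from the article): the human states only an end-to-end property, here a round-trip law for a compress/decompress pair, and the model supplies the implementations that must satisfy it.

    # Hypothetical high-level spec: the human writes one round-trip
    # property; the LLM writes compress() and decompress() to satisfy it.
    import random

    def check_round_trip(compress, decompress, trials: int = 100) -> bool:
        for _ in range(trials):
            data = bytes(random.randrange(256)
                         for _ in range(random.randrange(1024)))
            if decompress(compress(data)) != data:  # property violated
                return False
        return True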
seeknotfind, almost 2 years ago
When given code that mixes languages, LLMs will typically proceed as if it executed correctly anyway. My theory is that this happens because correct examples are the majority on the web. Since LLMs were never trained on the error feedback humans saw while constructing those examples, there is no way they could understand it. We need to feed LLMs these input/error loops explicitly to improve performance or reduce iterations in this area; otherwise they'll be stuck making many small mistakes forever. Again, this is just my theory.
politician, almost 2 years ago
Immediately after seeing ChatGPT, I felt that evolving programs in an OODA loop was going to be a big deal, so it's great to see someone iterating on the concept.

I hope they try it with QuickCheck / property-based testing.
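For readers unfamiliar with the QuickCheck style: a property-based test generates random inputs and checks invariants, which fits the cheap-to-verify, hard-to-solve asymmetry discussed above. A minimal sketch using Hypothesis, Python's rough analogue of QuickCheck (my_sort is a hypothetical LLM-generated function under test):

    # QuickCheck-style property test using the Hypothesis library.
    from collections import Counter
    from hypothesis import given, strategies as st

    @given(st.lists(st.integers()))
    def test_my_sort(xs):
        result = my_sort(xs)  # hypothetical LLM-written implementation
        assert Counter(result) == Counter(xs)                   # permutation
        assert all(a <= b for a, b in zip(result, result[1:]))  # ordered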
amelius, almost 2 years ago
No need to fix things: just make it search for leaks or null-pointer dereferences and explain them to the user, and you have a million-dollar product.
luis_cho, almost 2 years ago
This takes programming by coincidence to a whole new level.
mike_hearn, almost 2 years ago
Very cool. I played around with something similar a few months ago (without the sandbox). The tricky part is finding a diff format that the AI can use intuitively to patch code in place, as otherwise you can blow out the token limits quite fast.

I wonder to what extent AI is going to kill comprehensibility of the infrastructure. Over the weekend I did a bit of hacking with a friend. He wants to move out of AWS to Hetzner to save money, but wasn't experienced with Linux sysadmin, so I was teaching him some stuff and simultaneously pitching a little prototype I made of a server config tool. It's sort of like a higher-level Ansible: you specify what packages and containers you want and the URLs they should be served on, and it goes ahead and configures unattended upgrades, Docker and Traefik to deliver that outcome. The idea was to extend it to configuring backups with BorgBackup and other common sysadmin-y tasks.

He was enthused! The next day, though, things changed. He'd signed up for GPT-4 and it'd just blasted out a pile of Ansible configs for him. He doesn't know Ansible, but it didn't matter; he just iterated with the AI a few times and now the problem is solved.

This makes me wonder if there's much point anymore in improving the usability of systems software (programming languages, databases, operating systems, clouds, anything driven by textual configuration). A basic assumption that underlies making better tools is that the user's time is valuable, and that intuitive and simple systems therefore have value. But the AI's time isn't valuable. It can read some poorly written docs, then the source code of the system, then people getting frustrated on StackOverflow, synthesize all that together and spit out whatever pile of inscrutable config files is needed, and it can do so within seconds.

Given this experience, if we extrapolate it forwards, then maybe within a couple of decades many organizations will be running on infrastructure that nobody understands at all. You get cases today where individual programs or workflows are only understood by one guy, but those are extremes and it's understood that it's a bad situation. Maybe in future it'll be normal, and to do anything companies will have to ask an AI to do it for them. We already see skills atrophy around things like UNIX sysadmin because nowadays everyone learns AWS instead, which is one reason they can charge so much money for it, so I think as people retire, knowledge of how to run Linux systems will steadily disappear. But conceivably people will stop learning the cloud too, and then to do anything with servers it'll be AI or the highway.

You can also apply this to programming languages. Why engage in a high-effort project like inventing a Java or Rust if an AI can spit out correct C++ for you, given enough tries at writing tests? Does the motivation to produce anything beyond incremental improvements to existing tools disappear now?

I keep flipping back and forth on the answer to this. On one hand, it seems pretty pointless to further develop this prototype now. Maybe Ansible or AI-written shell scripts are the end of the road for making Linux easier to use. On the other hand, AI is remarkably human-like. It also makes mistakes, and it also benefits from well-written docs, simple interfaces and good error messages. So maybe usability still has a purpose, albeit maybe we now need to master a new field of "usability studies for AI".
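On the diff-format point: one approach for keeping token counts down is to have the model emit small search/replace hunks instead of whole files. A minimal hypothetical sketch of applying such a patch (the format and the apply_patch name are illustrative, not from this comment or the article):

    # Hypothetical "search/replace" patch format for LLM edits: the model
    # emits only the lines to find and their replacement, not the whole
    # file, which keeps prompts and completions small.
    def apply_patch(source: str, search: str, replace: str) -> str:
        """Replace the first exact occurrence of `search` with `replace`."""
        if search not in source:
            raise ValueError("patch context not found; ask the model to retry")
        return source.replace(search, replace, 1)

Requiring the model to quote the original lines verbatim means a failed match is detected and can be retried, instead of silently corrupting the file.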
d4rkp4ttern, almost 2 years ago
Modal-labs seems like an interesting cloud compute solution that I hadn't heard of before. Could anyone summarize what its value props are? E.g., is it like GCP but better because …