Hallucinations in code are the least dangerous form of LLM mistakes

371 points by ulrischa · 3 months ago

64 comments

Terr_ · 3 months ago
[Recycled from an older dupe submission]

As much as I've agreed with the author's other posts/takes, I find myself resisting this one:

> I'll finish this rant with a related observation: I keep seeing people say “if I have to review every line of code an LLM writes, it would have been faster to write it myself!”

> Those people are loudly declaring that they have under-invested in the crucial skills of reading, understanding and reviewing code written by other people.

No, that does not follow.

1. Reviewing depends on what you know about the expertise (and trust) of the person writing it. Spending most of your day reviewing code written by familiar human co-workers is very different from the same time reviewing anonymous contributions.

2. Reviews are not just about the code's potential mechanics, but inferring and comparing the intent and approach of the writer. For LLMs, that ranges between non-existent and schizoid, and writing it yourself skips that cost.

3. Motivation is important; for some developers that means learning, understanding and creating. Not wanting to do code reviews all day doesn't mean you're bad at them. Also, reviewing an LLM's code has no social aspect.

However you do it, somebody else should still be reviewing the change afterwards.

notepad0x90 · 3 months ago
My fear is that LLM-generated code will look great to me, I won't understand it fully but it will work. But since I didn't author it, I wouldn't be great at finding bugs in it or logical flaws. Especially if you consider coding as piecing together things instead of implementing a well designed plan. Lots of pieces making up the whole picture, but a lot of those pieces are now put there by an algorithm making educated guesses.

Perhaps I'm just not that great of a coder, but I do have lots of code where if someone took a look at it, it might look crazy but it really is the best solution I could find. I'm concerned LLMs won't do that; they won't take risks a human would or understand the implications of a block of code beyond its application in that specific context.

Other times, I feel like I'm pretty good at figuring out things and struggling in a time-efficient manner before arriving at a solution. LLM-generated code is neat, but I still have to spend similar amounts of time, except now I'm doing more QA and clean-up work instead of debugging and figuring out new solutions, which isn't fun at all.

layer8 · 3 months ago
> Just because code looks good and runs without errors doesn’t mean it’s actually doing the right thing. No amount of meticulous code review—or even comprehensive automated tests—will demonstrably prove that code actually does the right thing. You have to run it yourself!

I would have stated this a bit differently: No amount of running or testing can prove the code correct. You actually have to reason through it. Running/testing is merely a sanity/spot check of your reasoning.

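A toy illustration of that point, using an invented median() example (nothing from the article or this thread): the code runs, a couple of spot checks pass, and only reasoning through the even-length case reveals the bug.

    # Invented example: spot checks pass, yet the code is wrong.
    def median(values):
        """Intended: return the statistical median of a non-empty list."""
        ordered = sorted(values)
        return ordered[len(ordered) // 2]  # wrong for even-length lists

    # These checks run without errors and pass...
    assert median([1, 3, 2]) == 2
    assert median([5]) == 5

    # ...but reasoning through the even-length case exposes the flaw:
    print(median([1, 2, 3, 4]))  # prints 3; the true median is 2.5
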
atomic128 · 3 months ago
Last week, The Primeagen and Casey Muratori carefully reviewed the output of a state-of-the-art LLM code generator.

They provided a task well-represented in the LLM's training data, so development should be easy. The task is presented as a cumulative series of modifications to a codebase:

https://www.youtube.com/watch?v=NW6PhVdq9R8

This is the actual reality of LLM code generators in practice: iterative development converging on useless code, with the LLM increasingly unable to make progress.

bigstrat2003 · 3 months ago
> Hallucinated methods are such a tiny roadblock that when people complain about them I assume they’ve spent minimal time learning how to effectively use these systems—they dropped them at the first hurdle.

This seems like a very flawed assumption to me. My take is that people look at hallucinations and say "wow, if it can't even get the easiest things consistently right, no way am I going to trust it with harder things".

t_mann · 3 months ago
Hallucinations themselves are not even the greatest risk posed by LLMs. A much greater risk (in simple terms of probability times severity), I'd say, is that chat bots can talk humans into harming themselves or others. Both of which have already happened, btw [0,1]. Still not sure if I'd call that the greatest overall risk, but my ideas for what could be even more dangerous I don't even want to share here.

[0] https://www.qut.edu.au/news/realfocus/deaths-linked-to-chatbots-show-we-must-urgently-revisit-what-counts-as-high-risk-ai

[1] https://www.theguardian.com/uk-news/2023/jul/06/ai-chatbot-encouraged-man-who-planned-to-kill-queen-court-told

AndyKelley · 3 months ago
> Chose boring technology. I genuinely find myself picking libraries that have been around for a while partly because that way it’s much more likely that LLMs will be able to use them.

This is an appeal against innovation.

> I’ll finish this rant with a related observation: I keep seeing people say “if I have to review every line of code an LLM writes, it would have been faster to write it myself!”

> Those people are loudly declaring that they have under-invested in the crucial skills of reading, understanding and reviewing code written by other people. I suggest getting some more practice in. Reviewing code written for you by LLMs is a great way to do that.

As someone who has spent [an incredible amount of time reviewing other people's code](https://github.com/ziglang/zig/pulls?q=is%3Apr+is%3Aclosed), my perspective is that reviewing code is fundamentally slower than writing it oneself. The purpose of reviewing code is mentorship, investing in the community, and building trust, so that those reviewees can become autonomous and eventually help out with reviewing.

You get none of that from reviewing code generated by an LLM.

verbify · 3 months ago
An anecdote: I was working for a medical centre, and had some code that was supposed to find the 'main' clinic of a patient.

The specification was to only look at clinical appointments, and find the most recent appointment. However, if the patient didn't have a clinical appointment, it was supposed to find the most recent appointment of any sort.

I wrote the code by sorting the data (first by clinical/non-clinical and then by date). I asked ChatGPT to document it. It misunderstood the code and got the sorting backwards.

I was pretty surprised, and after testing with foo-bar examples eventually realised that I had called the clinical/non-clinical column "Clinical", which confused the LLM.

This is the kind of mistake that is a lot worse than "code doesn't run" - being seemingly right but wrong is much worse than being obviously wrong.

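A minimal sketch of the kind of sort being described, with invented field names (the real schema isn't shown in the comment). Sorting descending on (is_clinical, date) means a non-clinical appointment only wins when no clinical one exists; flipping that order is exactly the "backwards" reading the LLM produced.

    # Invented field names; clinical appointments first, newest first within each group.
    from datetime import date

    appointments = [
        {"clinic": "Cardiology",  "is_clinical": True,  "date": date(2024, 11, 2)},
        {"clinic": "Admin desk",  "is_clinical": False, "date": date(2025, 1, 15)},
        {"clinic": "Dermatology", "is_clinical": True,  "date": date(2024, 6, 30)},
    ]

    appointments.sort(key=lambda a: (a["is_clinical"], a["date"]), reverse=True)
    print(appointments[0]["clinic"])  # "Cardiology": most recent clinical appointment wins
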
tombert · 3 months ago
I use ChatGPT to generate code a lot, and it's certainly useful, but it has given me issues that are not obvious.

For example, I had it generate some C code to be used with ZeroMQ a few months ago. The code looked absolutely fine, and it *mostly* worked fine, but it made a mistake with its memory allocation stuff that caused it to segfault sometimes, and corrupt memory other times.

Fortunately, this was such a small project and I already know how to write code, so it wasn't too hard for me to find and fix, though I am slightly concerned that some people are copy-pasting large swaths of code from ChatGPT that looks mostly fine but hides subtle bugs.

not2b · 3 months ago
If the hallucinated code doesn't compile (or, in an interpreted language, immediately throws exceptions), then yes, that isn't risky because that code won't be used. I'm more concerned about code that appears to work for some test cases but solves the wrong problem or inadequately solves the problem, and whether we have anyone on the team who can maintain that code long-term or document it well enough so others can.

henning · 3 months ago
> Hallucinated methods are such a tiny roadblock that when people complain about them I assume they’ve spent minimal time learning how to effectively use these systems—they dropped them at the first hurdle.

If I have to spend lots of time learning how to use something, fix its errors, review its output, etc., it may just be faster and easier to write it myself from scratch.

The burden of proof is not on me to justify why I choose not to use something. It's on the vendor to explain why I should turn the software development process into perpetually reviewing a junior engineer's hit-or-miss code.

It is nice that the author uses the word "assume" -- there is mixed data on actual productivity outcomes of LLMs. That is all you are doing -- making assumptions without conclusive data.

This is not nearly as strong an argument as the author thinks it is.

> As a Python and JavaScript programmer my favorite models right now are Claude 3.7 Sonnet with thinking turned on, OpenAI’s o3-mini-high and GPT-4o with Code Interpreter (for Python).

This is similar to Neovim users who talk about "productivity" while ignoring all the time spent tweaking dotfiles that could be spent doing your actual job. Every second I spend toying with models is me doing something that does not directly accomplish my goals.

> Those people are loudly declaring that they have under-invested in the crucial skills of reading, understanding and reviewing code written by other people. I suggest getting some more practice in. Reviewing code written for you by LLMs is a great way to do that.

You have no idea how much code I read, so how can you make such claims? Anyone who reads plenty of code knows that reading other people's code often feels harder than just writing it yourself.

The level of hostility towards just sitting down and thinking through something without having an LLM insert text into your editor is unwarranted and unreasonable. A better policy is: if you like using coding assistants, great. If you don't and you still get plenty of work done, great.

sevensor · 3 months ago
> you have to put a lot of work in to learn how to get good results out of these systems

That certainly punctures the hype. What are LLMs good for, if the best you can hope for is to spend years learning to prompt them for unreliable results?

fumeux_fume · 3 months ago
Least dangerous only within the limited context you defined of compilation errors. If I hired a programmer and found whole libraries they invented to save themselves the effort of finding a real solution, I would be much more upset than if I found subtle logical errors in their code. If you take the cynical view that hallucinations are just speed bumps that can be iterated away, then I would argue you are under-valuing the actual work I want the LLM to do for me. One time I was trying to get help with the AWS CLI or boto3, and no matter how many times I pasted the traceback to Claude or ChatGPT, it would apologize and then hallucinate the non-existent method or command again. At least with logical errors I can fix those! But all in all, I still agree with a lot in this post.

nojs · 3 months ago
> Hallucinated methods are such a tiny roadblock that when people complain about them I assume they’ve spent minimal time learning how to effectively use these systems—they dropped them at the first hurdle.

If you're writing code in Python against well-documented APIs, sure. But it's an issue for less popular languages and frameworks, when you can't immediately tell if the missing method is your fault due to a missing dependency, version issue, etc.

jccalhoun · 3 months ago
I am not a programmer and I don't use Linux. I've been working on a Python script for a Raspberry Pi for a few months. ChatGPT has been really helpful in showing me how to do things or debug errors.

Now I am at the point where I am cleaning up the code and making it pretty. My script is less than 300 lines and ChatGPT regularly just leaves out whole chunks of the script when it suggests improvements. The first couple of times this led to tons of head scratching over why some small change to make one thing more resilient would make something totally unrelated break.

Now I've learned to take ChatGPT's changes and diff them with the working version before I try to run it.

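That diff-before-run habit needs nothing beyond the Python standard library; here is a rough sketch with invented file names:

    # Rough sketch: compare the known-good script with the LLM's suggestion
    # so silently dropped chunks show up as "-" lines. File names are invented.
    import difflib
    from pathlib import Path

    working = Path("script_working.py").read_text().splitlines(keepends=True)
    suggested = Path("script_chatgpt.py").read_text().splitlines(keepends=True)

    diff = difflib.unified_diff(
        working, suggested,
        fromfile="script_working.py", tofile="script_chatgpt.py",
    )
    print("".join(diff))
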
burningion · 3 months ago
I think there's another category of error that Simon skips over, one that breaks this argument entirely: the hallucination where the model forgets a feature.

Rather than the positive (code compiles), the negative (forgetting a core feature) can be extremely difficult to spot. Worse still, the feature can slightly drift, based upon code that's expected to be outside of the dialogue / context window.

I've had multiple times where the model completely forgot about features in my original piece of code after it made a modification. I didn't notice these missing / subtle changes until much later.

fzeroracer · 3 months ago
> I’ll finish this rant with a related observation: I keep seeing people say “if I have to review every line of code an LLM writes, it would have been faster to write it myself!”

> Those people are loudly declaring that they have under-invested in the crucial skills of reading, understanding and reviewing code written by other people. I suggest getting some more practice in. Reviewing code written for you by LLMs is a great way to do that.

Not only is this a massive bundle of assumptions, but it's also just wrong on multiple angles. Maybe if you're only doing basic CRUDware you can spend five seconds and give a thumbs up, but in any complex system you should be spending time deeply reading code. Which is naturally going to take longer than using what knowledge you already have to throw out a solution.

greybox · 3 months ago
I've not yet managed to successfully write any meaningful contribution to a codebase with an LLM faster than I could have written it myself.

OK, sure, it writes test-code boilerplate for me.

Honestly, the kind of work I'm doing requires that I understand the code I'm reading more than it requires the ability to quickly churn out more of it.

I think an LLM is probably going to greatly speed up web development, or anything else where the impetus is on adding to a codebase quickly. As for maintaining older code, performing precise upgrades, and fixing bugs, so far I've seen zero benefits. And trust me, I would like my job to be easier! It's not like I haven't tried to use these.

cratermoon · 3 months ago
Increasingly I see apologists for LLMs sounding like people justifying fortune tellers and astrologists. The confidence games are in force, where the trick involves surreptitiously eliciting all the information the con artist needs from the mark, then playing it back to them as if it involves some deep and subtle insights.

chad1n · 3 months ago
The idea is correct: a lot of people (including myself sometimes) just let an "agent" run and do some stuff and then check later if it finished. This is obviously more dangerous than just the LLM hallucinating functions, since at least you can catch the latter, but the former depends on the tests of the project or your reviewer skills.

The real problem with hallucination is that we started using LLMs as search engines, so when one invents a function, you have to go and actually search the API on a real search engine.

jchw · 3 months ago
> The moment you run LLM generated code, any hallucinated methods will be instantly obvious: you’ll get an error. You can fix that yourself or you can feed the error back into the LLM and watch it correct itself.

Interestingly though, this only works if there *is* an error. There are cases where you will not get an error; consider a loosely typed programming language like JS or Python, or simply any programming language when some of the API interface is unstructured, like using stringly-typed information (e.g. Go struct tags). In some cases, this will just silently do nothing. In other cases, it might blow up at runtime, but that does still require you to hit the code path to trigger it, and maybe you don't have 100% test coverage.

So I'd argue hallucinations are not always safe, either. The scariest thing about LLMs in my mind is just the fact that they have completely different failure modes from humans, making it much harder to reason about exactly how "competent" they are: even humans are extremely difficult to compare with regards to competency, but when you throw in the alien behavior of LLMs, there's just no sense of it.

And btw, it is not true that feeding an error into an LLM will always result in it correcting the error. I've been using LLMs experimentally and even trying to guide them towards solving problems I know how to solve; sometimes they simply can't, and will just make a bigger and bigger mess. Due to the way LLMs confidently pretend to know the exact answer ahead of time, presumably due to the way they're trained, they will confidently do things that would make more sense to try and then undo when they don't work, like trying to mess with the linker order or add dependencies to a target to fix undefined reference errors (which are actually caused by e.g. ABI issues). I still think LLMs are a useful programming tool, but we could use a bit more reality. If LLMs were as good as people sometimes imply, I'd expect an explosion in quality software to show up. (There are exceptions of course. I believe the first versions of Stirling PDF were GPT-generated long ago.) I mean, machine-generated illustrations have flooded the Internet despite their shortcomings, but programming with AI assistance remains tricky and not yet the force multiplier it is often made out to be. I do not believe AI-assisted coding has hit its Stable Diffusion moment, if you will.

Now whether it will or not is another story. Seems like the odds aren't that bad, but I do question if the architectures we have today are really the ones that'll take us there. Either way, if it happens, I'll see you all at the unemployment line.

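A small illustration of that "silently does nothing" failure mode in Python, using an invented function rather than any real library API: an option name the model hallucinated is accepted and ignored, so nothing ever errors.

    # Invented API: unknown keyword options are quietly ignored, not rejected.
    def fetch_page(url, **options):
        timeout = options.get("timeout", 30)   # recognised option
        retries = options.get("retries", 0)    # recognised option
        # Anything else in `options` is simply never read.
        return f"GET {url} (timeout={timeout}, retries={retries})"

    # A hallucinated 'verify_ssl' option runs cleanly and has no effect at all.
    print(fetch_page("https://example.com", timeout=10, verify_ssl=False))
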
alexashka · 3 months ago
> My less cynical side assumes that nobody ever warned them that you have to put a lot of work in to learn how to get good results out of these systems

Why am I reminded of people who say you first have to become a biblical scholar before you can criticize the Bible?

loxs · 3 months ago
The worst for me so far has been the following:

1. I know that a problem requires a small amount of code, but I also know it's difficult to write (as I am not an expert in this particular subfield) and it will take me a long time, like maybe a day. Maybe it's not worth doing at all, as the effort is not worth the result.

2. So why not ask the LLM, right?

3. It gives me some code that doesn't do exactly what is needed, and I still don't understand the specifics, but now I have a false hope that it will work out relatively easily.

4. I spend a day until I finally manage to make it work the way it's supposed to work. Now I am also an expert in the subfield and I understand all the specifics.

5. After all, I was correct in my initial assessment of the problem; the LLM didn't really help at all. I could have taken the initial version from Stack Overflow and it would have been the same experience and would have taken the same amount of time. I still wasted a whole day on a feature of questionable value.

gojomo · 3 months ago
Such "hallucinations" can also be plausible & useful APIs that *oughtta* exist – de facto feature requests.

objectified · 3 months ago
> The moment you run LLM generated code, any hallucinated methods will be instantly obvious: you’ll get an error. You can fix that yourself or you can feed the error back into the LLM and watch it correct itself.

But that's for methods. For libraries, the scenario is different, and possibly a lot more dangerous. For example, the LLM generates code that imports a library that does not exist. An attacker notices this too while running tests against the LLM. The attacker decides to create these libraries on the public package registry and injects malware. A developer may think: "oh, this newly generated code relies on an external library, I will just install it," and gets owned, possibly without even knowing for a long time (as is the case with many supply chain attacks).

And no, I'm not looking for a way to dismiss the technology, I use LLMs all the time myself. But what I do think is that we might need something like a layer in between the code generation and the user that will catch things like this (or something like Copilot might integrate safety measures against this sort of thing).

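One rough sketch of the kind of in-between layer being suggested here (my own assumption of what it could look like, not an existing tool): flag top-level imports in generated code that are neither in the standard library nor provided by an installed distribution, before anyone runs pip install on a guess. Requires Python 3.10+ for the two lookups used.

    # Sketch only: list imports in generated code that nothing installed provides.
    import ast
    import sys
    from importlib.metadata import packages_distributions

    def suspicious_imports(source: str) -> set[str]:
        installed = packages_distributions()  # top-level module name -> distributions
        roots = set()
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                roots.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                roots.add(node.module.split(".")[0])
        return {m for m in roots
                if m not in installed and m not in sys.stdlib_module_names}

    generated = "import requests\nimport totally_real_auth_helper\n"
    print(suspicious_imports(generated))  # e.g. {'totally_real_auth_helper'}

Anything flagged deserves a manual look at the registry before it gets installed.
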
999900000999 · 3 months ago
I've probably spent about $25 on Claude Code so far.

I'm tempted to pay someone in Poland or wherever another $500 to just finish the project. Claude Code is like a temp who has a code quota to reach. After they reach it, they're done. You've reached the context limit.

A lot of stuff is just weird. For example, I'm basically building a website with Supabase. Claude does not understand the concept of shared style sheets; instead it will just re-implement the same style sheets over and over again on like every single page and subcomponent.

Multiple incorrect implementations of relatively basic concepts. Over-engineering all over the place.

A part of this might be on Supabase, though. I really want to create a FOSS project, so Firebase, while probably being a better fit, is out.

Not wanting to burn out, I took a break after a 4-hour Claude session. It's like reviewing code for a living.

However, I'm optimistic a competitor will soon emerge with better pricing. I would absolutely love to run three coding agents at once, maybe even a fourth that can run integration tests against the first three.

dzaima · 3 months ago
> Those people are loudly declaring that they have under-invested in the crucial skills of reading, understanding and reviewing code written by other people. I suggest getting some more practice in. Reviewing code written for you by LLMs is a great way to do that.

Even if one is very good at code review, I'd assume the vast majority of people would still end up with pretty different kinds of bugs they are better at finding while writing vs reviewing. Writing code and having it reviewed by a human gets both classes, whereas reviewing LLM code gets just one half of that. (Maybe this can be compensated-ish by LLM code review, maybe not.)

And I'd be wary of equating reviewing human vs LLM code; sure, the explicit goal of LLMs is to produce human-like text, but they also have prompting to request being "correct" over being "average human", so they shouldn't actually "intentionally" reproduce human-like bugs from training data, resulting in the main source of bugs being model limitations, thus likely producing a bug type distribution potentially very different to that of humans.

krupan · 3 months ago
Reading this article and then through the comments here, the overall argument I'm hearing is that we should let the AI write the code and we should focus on reviewing it and testing it. We should work towards becoming good at specifying a problem and then validating the solution.

Should we even be asking AI to write code? Shouldn't we just be building and training AI to solve these problems without writing any code at all? Replace every app with some focused, trained, and validated AI. Want to find the cheapest flights? Who cares what algorithm the AI uses to find them, just let it do that. Want to track your calorie intake, process payroll every two weeks, do your taxes, drive your car, keep airplanes from crashing into each other, encrypt your communications, predict the weather? Don't ask AI to clumsily write code to do these things. Just tell it to do them!

Isn't that the real promise of AI?

xlii · 3 months ago
> With code you get a powerful form of fact checking for free. Run the code, see if it works.

Um. No.

This is an oversimplification that falls apart in anything beyond a minimal system.

Over my career I've encountered plenty of reliability-caused consequences. Code that would run, but where the side effects of not processing something, processing it too slowly, or processing it twice would have serious consequences - financial and personal ones.

And those weren't "nuclear power plant management" kind of critical. I often reminisce about an educational game that was used at school, where losing a single save's progress meant a couple thousand dollars of reimbursement.

https://xlii.space/blog/network-scenarios/

This is a cheatsheet I made for my colleagues. This is the thing we need to keep in mind when designing the system I'm working on. Rarely does any LLM think about it. It's not a popular kind of engineering by any sort, but it's here.

As of today I've yet to name a single instance where ChatGPT-produced code actually would have saved me time. I've seen macro-generation code recommendations for Go (Go doesn't have macros), object mutations for Elixir (Elixir doesn't have objects, but immutable structs), list splicing in Fennel (Fennel doesn't have splicing), a language feature pragma ported from another language, and pure byte representation of memory in Rust where the code used UTF-8 string parsing to do it. My trust toward any non-ephemeral generated code is sub-zero.

It's exhausting and annoying. It feels like interacting with Calvin's (of Calvin and Hobbes) dad, but with all the humor taken away.

nottorp · 3 months ago
> I asked Claude 3.7 Sonnet "extended thinking mode" to review an earlier draft of this post [snip] It was quite helpful, especially in providing tips to make that first draft a little less confrontational!

So he's also using LLMs to steer his writing style towards the lowest common denominator :)

dhbradshaw · 3 months ago
The more leverage a piece of code has, the more good or damage it can do.

The more constraints we can place on its behavior, the harder it is to mess up.

If it's riskier code, constrain it more with better typing, testing, design, and analysis.

Constraints are to errors (including hallucinations) as water is to fire.

noodletheworld · 3 months ago
If you want to use LLMs for code, use them.

If you don't, don't.

However, this "let's move past hallucinations" discourse is just disingenuous.

The OP is conflating hallucinations, which are a fact and an undisputed failure mode of LLMs that no one has any solution for...

...and people not spending enough time and effort learning to use the tools.

I don't like it. It feels bad. It feels like a rage-bait piece, cast out of frustration that the OP doesn't *have* an answer for hallucinations, because *there isn't one*.

> Hallucinated methods are such a tiny roadblock that when people complain about them I assume they’ve spent minimal time learning how to effectively use these systems—they dropped them at the first hurdle.

People aren't stupid.

If they use a tool and it sucks, they'll stop using it and say "this sucks".

If people are saying "this sucks" about AI, it's because the LLM tool they're using sucks, not because they're idiots, or because there's a grand "anti-AI" conspiracy.

People are lazy; if the tool is good (e.g. Cursor), people will use it.

If they use it, and the first thing it does is hallucinate some BS (e.g. IntelliJ full-line completion), then you'll get people uninstalling it and leaving reviews like "blah blah hallucination blah blah. This sucks".

Which is literally what is happening. Right. Now.

To be fair, "blah blah hallucinations suck" *is* a common "anti-AI" trope that gets rolled out.

...but that's because *it is a real problem*.

Pretending "hallucinations are fine, people are the problem" is... it's just disingenuous and embarrassing from someone of this caliber.

tippytippytango · 3 months ago
Yep. LLMs can get all the unit tests to pass. But not the acceptance tests. The discouraging thing is you might have all green checks on the unit tests, but you can’t get the acceptance tests to pass without starting over.

tanepiper · 3 months ago
One thing I've found is that while I work with an LLM and it can do things way faster than me, the other side of it is that I'm quickly losing understanding of the deeper code.

If someone asks me a question about something I've worked on, I might be able to give an answer about some deep functionality.

At the moment I'm working with an LLM on a 3D game, and while it works, I would need to rebuild it to understand all the elements of it.

For me this is my biggest fear - not that LLMs can code, but that they do so at such a volume that in a generation or two no one will understand *how* the code works.

throwaway314155 · 3 months ago
> The real risk from using LLMs for code is that they’ll make mistakes that aren’t instantly caught by the language compiler or interpreter. And these happen all the time!

Are these not considered hallucinations still?

simonw · 3 months ago
I really like this theory from Kellan Elliott-McCrea: https://fiasco.social/@kellan/114092761910766291

> *I think a simpler explanation is that hallucinating a non-existent library is such an inhuman error it throws people. A human making such an error would be almost unforgivably careless.*

This might explain why so many people see hallucinations in generated code as an inexcusable red flag.

internet_points · 3 months ago
Even with boring tech that's been in the training set for ages (Rails), you can get some pretty funny hallucinations: https://bengarcia.dev/making-o1-o3-and-sonnet-3-7-hallucinate-for-everyone (fortunately this one was the very non-dangerous kind, making it very obvious; though I wonder how many non-obvious hallucinations entered the training set by the same process)

marcofloriano · 3 months ago
"Proving to yourself that the code works is your job. This is one of the many reasons I don't think LLMs are going to put software professionals out of work."

Good point

intrasight · 3 months ago
> You can fix that yourself or you can feed the error back into the LLM and watch it correct itself.

Well, those types of errors won't be happening next year, will they?

> No amount of meticulous code review—or even comprehensive automated tests—will demonstrably prove that code actually does the right thing. You have to run it yourself!

What rot. The test is the problem definition. If properly expressed, the code passing the test means the code is good.

why-el · 3 months ago
I am not so sure. Code by one LLM can be reviewed by another. Puppeteer-like solutions will exist pretty soon. "Given this change, can you confirm this spec?"

Even better, this can carry on for a few iterations. And both LLMs can be:

1. Budgeted ("don't exceed X amount")

2. Improved (another LLM can improve their prompts)

and so on. I think we are fixating on how *we* do things, not how this new world will do their *own* thing. That to me is the real danger.

01100011 · 3 months ago
Timely article. I really, really want AI to be better at writing code, and hundreds of reports suggest it works great if you're a web dev or a Python dev. Great! But I'm a C/C++ systems guy (working at a company making money off AI!), and the times I've tried to get AI to write the simplest of test applications against a popular API, it mostly failed. The code was incorrect, both using the API incorrectly and writing invalid C++. Attempts to reason with the LLMs (Grok 3, DeepSeek-R1) led further and further away from valid code. Eventually both systems stopped responding.

I've also tried Cursor, with similar mixed results.

But I'll say that we are getting tremendous pressure at work to use AI to write code. I've discussed it with fellow engineers and we're of the opinion that the managerial desire is so great that we are better off keeping our heads down and reporting success vs. saying the emperor wears no clothes.

It really feels like the billionaire class has fully drunk the Kool-Aid and needs AI to live up to the hype.

svaha1728 · 3 months ago
If X, AWS, Meta, and Google would just dump their code into an ML training set, we could really get on with disrupting things.

zeroCalories · 3 months ago
I've definitely had these types of issues while writing code with LLMs. When relying on an LLM to write something I don't fully understand, I will basically default to a form of TDD, making sure that the code behaves according to some spec. If I can't write a spec, then that's an issue.

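As a minimal sketch of that spec-first habit (slugify is an invented example, not anything from the thread): write the behavioural checks yourself, then let the LLM's implementation earn its way past them.

    # Invented example: the spec is hand-written; the implementation must satisfy it.
    def slugify(title: str) -> str:
        # Candidate implementation (could come from an LLM); replace until the spec passes.
        return "-".join(title.lower().split())

    def test_slugify_spec():
        assert slugify("Hello World") == "hello-world"
        assert slugify("  spaces   everywhere ") == "spaces-everywhere"
        assert slugify("MiXeD CaSe") == "mixed-case"

    if __name__ == "__main__":
        test_slugify_spec()
        print("spec passed")
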
sublinear · 3 months ago
> Compare this to hallucinations in regular prose, where you need a critical eye, strong intuitions and well developed fact checking skills to avoid sharing information that’s incorrect and directly harmful to your reputation

Ah, so you mean... actually doing work. Yeah, writing code has the same difficulty, you know. It's not enough to merely get something to compile and run without errors.

> With code you get a powerful form of fact checking for free. Run the code, see if it works.

No, this would be coding by coincidence. Even the most atrociously bad prose writers don't exactly go around just saying random words from a dictionary or vaguely (mis)quoting Shakespeare hoping to be understood.

myaccountonhn · 3 months ago
Another danger is spotted in the later paragraphs:

> I genuinely find myself picking libraries that have been around for a while partly because that way it’s much more likely that LLMs will be able to use them.

People will pick solutions that have a lot of training data, rather than the best solution.

Ozzie_osman · 3 months ago
I'm excited to see LLMs get much better at testing. They are already good at writing unit tests (as always, you have to review them carefully). But imagine an LLM that can see your code changes _and_ can generate and execute automated and manual tests based on the change.

AdieuToLogic · 3 months ago
Software is the manifestation of a solution to a problem.

Any entity, human or otherwise, lacking understanding of the problem being solved will, by definition, produce systems which contain some combination of defects, logic errors, and inapplicable functionality for the problem at hand.

antfarm · 3 months ago
LLM generated code is legacy code.

tigerlily · 3 months ago
When you go from the adze to the chainsaw, be mindful that you still need to sharpen the chainsaw, top up the chain bar oil, and wear chaps.

Edit: oh, and steel-capped boots.

Edit 2: and a face shield and ear defenders. I'm all tuckered out like Grover in his own alphabet.

mediumsmart · 3 months ago
As a non-programmer I only get little programs or scripts that do something from the LLM. If they do the thing, it means the code is tested, flawless and done. I would never let them have to deal with other humans' input, of course.

Ozzie_osman · 3 months ago
Great article, but it doesn't talk about the potentially _most_ dangerous form of mistakes: an adversarial LLM trying to inject vulnerabilities. I expect this to become a vector soon as people figure out ways to accomplish this.

davesque · 3 months ago
I thought he was going to say the real danger is hallucination of facts, but no.

amelius · 3 months ago
I don't agree. What if the LLM takes a two-step approach, where it first determines a global architecture and then fills in the code? (Where it hallucinates in the first step.)

DeathArrow · 3 months ago
I agree with the author. But can't the risk be minimized somehow by asking LLM A to generate code and LLM B to write integration tests?

al2o3cr · 3 months ago
> My cynical side suspects they may have been looking for a reason to dismiss the technology and jumped at the first one they found.

MY cynical side suggests the author is an LLM fanboi who prefers not to think that hallucinating easy stuff strongly implies hallucinating harder stuff, and therefore jumps at the first reason to dismiss the criticism.

devmor · 3 months ago
I don't really understand what the point or tone of this article is.

It says that hallucinations are not a big deal and that there are great dangers that are hard to spot in LLM-generated code... and then presents tips on fixing hallucinations, with a general theme of positivity towards using LLMs to generate code and no more time dedicated to the other dangers.

It sure gives the impression that the article itself was written by an LLM and barely edited by a human.

TheRealPomax · 3 months ago
> No amount of meticulous code review—or even comprehensive automated tests—will demonstrably prove that code actually does the right thing. You have to run it yourself!

Absolutely not. If your testing requires a human to do testing, your testing has already failed. Your tests *do* need to include both positive and negative tests, though. If your tests don't include "things should crash and burn given ..." your tests are incomplete.

> If you’re using an LLM to write code without even running it yourself, what are you doing?

Running code through tests *is literally running the code*. Have code coverage turned on, so that you get yelled at for LLM code that you don't have tests for, and CI/CD that refuses to accept code that has no tests. By all means push to master on your own projects, but for production code, you better have checks in place that don't allow not-fully-tested code (coverage, unit, integration, and ideally, docs) to land.

The real problem comes from LLMs happily not just giving you code but *also* test cases. The same prudence applies as with test cases someone added to a PR/MR: just because there are tests doesn't mean they're good tests, or enough tests; review them in the assumption that they're testing the wrong thing entirely.

ggm · 3 months ago
I'm just here to whine, almost endlessly, that the word "hallucination" is a term of art chosen deliberately because it helps promote a sense that AGI exists, by using language which implies reasoning and consciousness. I personally dislike this. I think we were mistaken in allowing AI proponents to repurpose language in that way.

It's not hallucinating, Jim, it's statistical coding errors. It's floating-point rounding mistakes. It's the wrong cell in the Excel table.

cenriqueortiz · 3 months ago
Code testing is “human in the loop” for LLM generated code.

marcofloriano · 3 months ago
"If you’re using an LLM to write code without even running it yourself, what are you doing?"

Hallucinating

0dayz · 3 months ago
Personally I believe the worst thing about LLMs is their abysmal ability to architect code. It's why I use LLMs more like a Google than a so-called coding buddy, because there were so many times I had to rewrite an entire file because the LLM had added so many extra unmanageable functions, even deciding to solve problems I hadn't asked it to solve.

tiberriver256 · 3 months ago
Wait until he hears about YOLO mode and 'vibe' coding.

Then the biggest mistake it could make is running `gh repo delete`.

cryptoegorophy · 3 months ago
Just ask another LLM to proofread?

sunami-ai · 3 months ago
I asked o3-mini-high (an investor is paying for Pro; I personally would not) to critique the developer UX of D3's "join" concept (how, when you select an empty set, then when you update you enter/exit, lol), and it literally said "I'm sorry. I can't help you with that." The only thing missing was calling me Dave.