LLMs understand nullability

170 points by mattmarcus about 1 month ago

19 comments

lsy about 1 month ago
The article puts scare quotes around "understand" etc. to try to head off critiques around the lack of precision or scientific language, but I think this is a really good example of where casual use of these terms can get pretty misleading.

Because code LLMs have been trained on the syntactic form of the program and not its execution, it's not correct — even if the correlation between variable annotations and requested completions was *perfect* (which it's not) — to say that the model "understands nullability", because nullability means that under execution the variable in question can become null, which is not a state that it's possible for a model trained only on a million programs' syntax to "understand". You could get the same result if e.g. "Optional" means that the variable becomes poisonous and checking "> 0" is eating it, and "!= None" is an antidote. Human programmers *can* understand nullability because they've hopefully *run* programs and understand the semantics of making something null.

The paper could use precise, scientific language (e.g. "the presence of nullable annotation tokens correlates to activation of vectors corresponding to, and emission of, null-check tokens with high precision and accuracy") which would help us understand what we can rely on the LLM to do and what we can't. But it seems like there is some subconscious incentive to muddy how people see these models in the hopes that we start ascribing things to them that they aren't capable of.
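To make the distinction concrete, the correlation in question is roughly the following (a hypothetical snippet, not one from the paper): given a nullable annotation, the model tends to emit the matching null-check tokens.

    from typing import Optional

    def describe_age(age: Optional[int]) -> str:
        # The measurable claim is that "Optional[int]" correlates with emitting
        # a check like the one below, not that the model knows what happens at
        # runtime when age actually is None.
        if age is None:
            return "unknown"
        return f"{age} years old"
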
sega_sai about 1 month ago
One thing that is exciting in the text is the attempt to move away from asking whether the LLM 'understands' (which I would argue is an ill-posed question) and instead rephrase it in terms of something that can actually be measured.

It would be good to list a few possible ways of interpreting 'understanding of code'. It could possibly include: 1) type inference for the result, 2) nullability, 3) runtime asymptotics, 4) what the code does.
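As a hypothetical illustration, each of those is a separate, checkable question about the same snippet:

    def mean(xs: list[float]) -> float | None:
        if not xs:
            return None
        return sum(xs) / len(xs)

    # 1) Type inference: what type does mean([1.0, 2.0]) have?   float | None
    # 2) Nullability: can the result be None?                    yes, for an empty list
    # 3) Runtime asymptotics: how does it scale with len(xs)?    O(n)
    # 4) What the code does: the arithmetic mean, or None for []
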
btown about 1 month ago
There seems to be a typo in OP's "Visualizing Our Results" - but things make perfect sense if red is non-nullable, green is nullable.

I'd be really curious to see where the "attention" heads of the LLM look when evaluating the nullability of any given token. Does it just trust the Optional[int] return type signature of the function, or does it also skim through the function contents to understand whether that's correct?

It's fascinating to me to think that the senior developer skillset of being able to skim through complicated code, mentally make note of different tokens of interest where assumptions may need to be double-checked, and unravel that cascade of assumptions to track down a bug, is something that LLMs already excel at.

Sure, nullability is an example where static type checkers do well, and it makes the article a bit silly on its own... but there are all sorts of assumptions that aren't captured well by type systems. There's been a ton of focus on LLMs for code generation; I think that LLMs for debugging makes for a fascinating frontier.
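The kind of case I have in mind (a hypothetical example): a signature that says Optional[int] while the body can never actually return None. Which one does the model follow?

    from typing import Optional

    def find_port(config: dict) -> Optional[int]:
        # Annotated as Optional, but this particular body never returns None.
        # A model that only trusts the signature would still hedge with a None
        # check at call sites; one that reads the body might not.
        return int(config.get("port", 8080))
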
gopiandcode about 1 month ago
The visualisation of how the model sees nullability was fascinating.

I'm curious if this probing of nullability could be composed with other LLM/ML-based Python-typing tools to improve their accuracy.

Maybe even focusing on interfaces such as nullability, rather than precise types, would work better with a duck-typed language like Python than inferring types directly (i.e. we don't really care if a variable is an int specifically, but rather that it supports __add__ or __sub__ etc., that it is numeric).
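Something like a structural protocol rather than a concrete type is what I mean (a sketch using typing.Protocol; the names are illustrative):

    from typing import Protocol

    class SupportsArithmetic(Protocol):
        def __add__(self, other): ...
        def __sub__(self, other): ...

    def diff(a: SupportsArithmetic, b: SupportsArithmetic):
        # All we care about is that the values support + and -, not that they are ints.
        return a - b
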
kazinator about 1 month ago
LLMs "understand" nullability to the extent that texts they have been trained on contain examples of nullability being used in code, together with remarks about it in natural language. When the right tokens occur in your query, other tokens get filled in from that data in a clever way. That's all there is to it.

The LLM will not understand, and is incapable of developing an understanding of, a concept not present in its training data.

If you try to teach it the basics of the misunderstood concept in your chat, it will reflect back a verbal acknowledgement, restated in different words, with some smoothly worded embellishments which look like the external trappings of understanding. It's only a mirage though.

The LLM will code anything, no matter how novel, if you give it detailed enough instructions and clarifications. That's just a language translation task from pseudo-code to code. Being a language model, it's designed for that.

The LLM is like the bar waiter who has picked up on economics and politics talk, and is able to interject with something clever sounding, to the surprise of the patrons. Gee, how does he or she understand the workings of the International Monetary Fund, and what the hell are they doing working in this bar?
stared about 1 month ago
Once LLMs fully understand nullability, they will cease to use it.

Tony Hoare called it "a billion-dollar mistake" (https://en.wikipedia.org/wiki/Tony_Hoare#Apologies_and_retractions), and Rust made core design choices precisely to avoid this mistake.

In practical AI-assisted coding in TypeScript I have found that it is good to add Cursor Rules to avoid anything nullable, unless it is a well-designed choice. In my experience, it makes the code much better.
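What that looks like in practice, sketched in Python for the sake of this thread (an illustrative example, not my actual rule): prefer APIs that fail loudly over ones that force every caller to check for a missing value.

    from typing import Optional

    # Nullable by default: every caller has to remember the None check.
    def lookup_user_nullable(users: dict, user_id: str) -> Optional[dict]:
        return users.get(user_id)

    # Nullability avoided unless it is a deliberate choice: fail loudly instead.
    def lookup_user(users: dict, user_id: str) -> dict:
        if user_id not in users:
            raise KeyError(f"unknown user: {user_id}")
        return users[user_id]
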
gwern about 1 month ago
> Interestingly, for models up to 1 billion parameters, the loss actually starts to increase again after reaching a minimum. This might be because as training continues, the model develops more complex, non-linear representations that our simple linear probe can't capture as well. Or it might be that the model starts to overfit on the training data and loses its more general concept of nullability.

Double descent?
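For reference, the "simple linear probe" here is, as I understand it, just a linear classifier fit on hidden activations, along these lines (toy data, hypothetical shapes):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    nullable_acts = rng.normal(size=(500, 768))      # activations over nullable variables
    nonnullable_acts = rng.normal(size=(500, 768))   # activations over non-nullable ones

    X = np.vstack([nullable_acts, nonnullable_acts])
    y = np.array([1] * 500 + [0] * 500)

    probe = LogisticRegression(max_iter=1000).fit(X, y)
    # The quoted curve tracks this probe's loss; a linear probe only finds the
    # concept if it is (close to) linearly represented in the activations.
    print(probe.score(X, y))
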
nonameiguess about 1 month ago
As every fifth thread becomes some discussion of LLM capabilities, I think we need to shift the way we talk about this to be less like how we talk about software and more like how we talk about people.

"LLM" is a valid category of thing in the world, but it's not a thing like Microsoft Outlook that has well-defined capabilities and limitations. It's frustrating reading these discussions that constantly devolve into one person saying they tried something that either worked or didn't, then 40 replies from other people saying they got the opposite result, possibly with a different model, different version, slight prompt altering, whatever it is.

LLMs possibly have the capability to understand nullability, but that doesn't mean every instance of every model will consistently understand that or anything else. This is the same way humans operate. Humans can run a 4-minute mile. Humans can run a 10-second 100 meter dash. Humans can develop and prove novel math theorems. But not all humans, not all the time; performance depends upon conditions, timing, luck, and there has probably never been a single human who can do all three. It takes practice in one specific discipline to get really good at that, and this practice competes with or even limits other abilities. For LLMs, this manifests in differences with the way they get fine-tuned and respond to specific prompt sequences that should all be different ways of expressing the same command or query but nonetheless produce different results. This is very different from the way we are used to machines and software behaving.
apples_oranges about 1 month ago
Sounds like the process used to update/jailbreak LLMs so that they don't deny requests and always answer; there, too, refusal turns out to be a single direction. (Article about it: https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction)

Would be fun if they also "cancelled the nullability direction"... the LLMs would probably start hallucinating new explanations for what is happening in the code.
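"Cancelling" a direction is mechanically just removing the activation's projection onto it, something like this (a sketch, assuming the direction has already been extracted as a vector):

    import numpy as np

    def ablate_direction(activation: np.ndarray, direction: np.ndarray) -> np.ndarray:
        # Remove the component of the activation lying along `direction`,
        # the same trick the linked post applies to the refusal direction.
        d = direction / np.linalg.norm(direction)
        return activation - np.dot(activation, d) * d

    x = np.array([1.0, 2.0, 3.0])
    d = np.array([0.0, 0.0, 1.0])
    print(ablate_direction(x, d))  # [1. 2. 0.]
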
plaineyjaney about 1 month ago
This is really interesting! Intuitively it's hard to grasp that you can just subtract two average states and get a direction describing the model's perception of nullability.
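For what it's worth, the mechanics really are just a difference of means, roughly like this toy sketch (not the paper's code):

    import numpy as np

    rng = np.random.default_rng(0)
    # Toy activations collected over nullable vs. non-nullable variables.
    nullable_acts = rng.normal(loc=0.5, size=(200, 768))
    nonnullable_acts = rng.normal(loc=-0.5, size=(200, 768))

    # The "nullability direction" is the difference of the two averages.
    direction = nullable_acts.mean(axis=0) - nonnullable_acts.mean(axis=0)

    # Projecting a new activation onto it gives a scalar "how nullable" score.
    new_activation = rng.normal(size=768)
    print(new_activation @ direction)
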
amelius about 1 month ago
I'm curious what happens if you run the LLM with variable names that occur often with nullable variables, but then use them with code that has a non-nullable variable.
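Something like this hypothetical case, where the name suggests nullability but the value never is:

    # A name that usually co-occurs with nullable values...
    maybe_user = {"id": 1, "name": "Ada"}  # ...bound to a value that can never be None here.

    # Does the model still want to insert "if maybe_user is not None:" before this,
    # following the name, or does it follow the actual non-nullable value?
    print(maybe_user["name"])
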
kmod about 1 month ago
I found this overly handwavy, but I discovered that there is a non-"gentle" version of this page which is more explicit:

https://dmodel.ai/nullability/
timewizard about 1 month ago
"Validate a phone number."

The code is entirely wrong. It validates something that's close to a NANP number but isn't actually a NANP number. In particular, the area code cannot start with 0, nor can the central office code. There are several numbers, like 911, which have special meaning and cannot appear in either position.

You'd get better results if you went to Stack Overflow and stole the correct answer yourself. Would probably be faster too.

This is why "non-technical code writing" is a terrible idea. The underlying concept is explicitly technical. What are we even doing?
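For comparison, a stricter check along those lines might look like this (a sketch; real NANP validation has more rules than this captures):

    import re

    # Area code and central office code must start with 2-9 and must not be
    # N11 service codes (211, 311, ..., 911).
    NANP_RE = re.compile(r"^\(?([2-9](?!11)\d{2})\)?[-. ]?([2-9](?!11)\d{2})[-. ]?(\d{4})$")

    def looks_like_nanp(number: str) -> bool:
        return NANP_RE.match(number) is not None

    print(looks_like_nanp("(212) 555-0123"))  # True
    print(looks_like_nanp("911-555-0123"))    # False: 911 cannot be an area code
    print(looks_like_nanp("055-555-0123"))    # False: area code cannot start with 0
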
tanvach about 1 month ago
Dear future authors: please run multiple iterations and report the *probability*.

From: 'Keep training it, though, and eventually it will learn to insert the None test'

To: 'Keep training it, though, and eventually the probability of inserting the None test goes up to xx%'

The former is just horse poop; we all know LLMs generate big variance in output.
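In code terms, something like this (a sketch; `sample` stands in for whatever produces a completion for the prompt):

    from typing import Callable

    def none_check_rate(sample: Callable[[str], str], prompt: str, n: int = 100) -> float:
        # Report "the None test appears in xx% of n samples" rather than a single
        # yes/no anecdote from one completion.
        completions = [sample(prompt) for _ in range(n)]
        hits = sum(("is None" in c) or ("is not None" in c) for c in completions)
        return hits / n
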
ashoeafoot about 1 month ago
Null-state programming is an antipattern; use railway oriented programming instead.
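A minimal Python sketch of the railway style (an assumed illustration: errors travel on their own track instead of hiding in nullable values):

    from dataclasses import dataclass
    from typing import Callable, Union

    @dataclass
    class Ok:
        value: object

    @dataclass
    class Err:
        message: str

    Result = Union[Ok, Err]

    def bind(result: Result, step: Callable[[object], Result]) -> Result:
        # Failures short-circuit past later steps instead of becoming None.
        return step(result.value) if isinstance(result, Ok) else result

    def parse_int(s: str) -> Result:
        return Ok(int(s)) if s.isdigit() else Err(f"not a number: {s!r}")

    def reciprocal(n) -> Result:
        return Ok(1 / n) if n != 0 else Err("division by zero")

    print(bind(parse_int("4"), reciprocal))     # Ok(value=0.25)
    print(bind(parse_int("zero"), reciprocal))  # Err(message="not a number: 'zero'")
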
thmorriss about 1 month ago
very cool.
EncomLab about 1 month ago
This is like claiming a photoresistor-controlled night light "understands when it is dark" or that a bimetallic strip thermostat "understands temperature". You can say those words, and it's syntactically correct, but entirely incorrect semantically.
nativeit about 1 month ago
We’re all just elementary particles being clumped together in energy gradients, therefore my little computer project is sentient—this is getting absurd.
casenmgreen about 1 month ago
LLMs understand nothing.

They are not reasoning.