> LLMs (which are paid by the token)

Straightforwardly true, and yet I'd never thought about it like this before, i.e. that there's a perverse incentive for LLM vendors to tune for verbose outputs. We rightly raise eyebrows at the idea of developers being paid per volume of code, but it's the default for LLMs.
It's funny because I wouldn't consider the comment that they highlight in their post a nitpick.

Something that has an impact on the long-term maintainability of code is definitely not nitpicky, and in the majority of cases "define a type" fits this category, as it makes refactors and extensions MUCH easier.

On top of that, I think the approach they went with is a huge mistake. The same comment can be a nitpick on one CR but crucial on another; clustering them is destined to result in false positives and false negatives.

I'm not sure I'd want to use a product to review my code for which 1) I cannot customize the rules, and 2) the rules chosen by the creators seem poor.

To be honest, I wouldn't want to use any AI-based code reviewer at all. We have one at work (FAANG, so something with a large dedicated team) and it has not once produced a useful comment, while being factually wrong many times.
After failing with three reasonable ideas, they solved the problem with an idea that previously would have seemed unlikely to work (get similarity to past known nits and use a threshold of 3 similar nits above a certain cutoff similarity score for filtering). A lot of applications of ML have a similar hacky, trial-and-error flavor. Intuition is often built in retrospect and may not transfer well to other domains. I would have guessed that finetuning would also work, but I agree with the author that it would be more expensive and less portable across different models.
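A minimal sketch of that filtering step as I read it from the post (the embedding model, cutoff value, and the store of downvoted comments are placeholders, not the authors' actual implementation):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_probable_nit(comment_emb: np.ndarray,
                    past_nit_embs: list[np.ndarray],
                    sim_cutoff: float = 0.85,   # placeholder cutoff
                    min_matches: int = 3) -> bool:
    """Suppress a generated comment if it closely resembles at least
    `min_matches` previously downvoted ("nit") comments."""
    matches = sum(1 for e in past_nit_embs
                  if cosine_sim(comment_emb, e) >= sim_cutoff)
    return matches >= min_matches

# Usage: embed each new comment and each stored nit with the same model,
# then drop any comment where is_probable_nit(...) returns True.
```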
Our AI code review bot (Codacy) is just an LLM that compiles all linter rules, and it might be the most annoying, useless thing. For example, it will ding your PR for not considering Opera browser limitations on a backend NodeJS PR.

Furthermore, in most of the code reviews I perform I rarely leave commentary at all. There are so many frameworks and libraries today that solve whatever problem that, unless someone adds complex code or puts a file in a goofy spot, it's an instant approval. So an AI bot doesn't help with what is a minimal, non-problem task.
Unless they were using a model too stupid for this operation, the fundamental problem is usually solvable via prompting.

Usually the issue is that the models have a bias for action, so you need to give them an acceptable action to take when there isn't a good comment to make - some other output/determination.

I've seen this in many other similar applications.
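One way to give the model that escape hatch is to make "no comment" an explicit, first-class output. A rough illustration (the prompt wording and sentinel are made up, not the poster's exact approach):

```python
REVIEW_PROMPT = """\
Review the diff below. For each issue that materially affects correctness,
security, or long-term maintainability, write one comment.
If nothing meets that bar, respond with exactly: NO_COMMENTS
Do not invent minor style feedback just to produce output.
"""

def parse_review(llm_output: str) -> list[str]:
    # Treat the sentinel as a valid, expected outcome rather than forcing
    # the model to always say something.
    if llm_output.strip() == "NO_COMMENTS":
        return []
    return [line for line in llm_output.splitlines() if line.strip()]
```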
I'm in the market for PR review bots, as the nitpicking issue is real. So far I have tried Coderabbit, but it adds too much noise to PRs, and only a very small percentage of its comments are actually useful. I specifically instructed it to ignore nitpicks, but it still added such comments. Their cringy ASCII-art comments make it even harder to take them seriously.

I recently signed up for Korbit AI, but it's too soon to provide feedback. Honestly, I'm getting a bit fed up with experimenting with different PR bots.

Question for the author: in what ways is your solution better than Coderabbit and Korbit AI?
"We picked the latter, which also gave us our performance metric - percentage of generated comments that the author actually addresses."<p>This metric would go up if you leave almost no comments. Would it not be better to find a metric that rewards you for generating many comments which are addressed, not just having a high relevance?<p>You even mention this challenge yourselves: "Sadly, even with all kinds of prompting tricks, we simply could not get the LLM to produce fewer nits without also producing fewer critical comments."<p>If that was happening, that doesn't sound like it would be reflected in your performance metric.
I'm surprised the authors didn't try the "dumb" version of the solution they went with: instead of using fancy cosine similarity to create implicit clusters, just ask it to classify the comment along a few dimensions and then do your own filtering on that (or create your own 0-100 scoring!). It seems like you would have more control that way and would actually derive some rich(er) data to fine-tune on. They are already almost doing this: all the examples in the article start with "style"!

I have seen this pattern a few times, actually, where you want the AI to mimic some heuristic humans use. You never want to ask it for the heuristic directly; just create the constituent data so you can do some simple regression or whatever on top of it and control the cutoff yourself.
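Something like this is what I mean - the dimension names, weights, JSON schema, and threshold are all made up for illustration:

```python
import json

# Ask the LLM to emit structured judgements instead of a single verdict, e.g.:
#   {"category": "style", "bug_risk": 2, "security": 0, "maintainability": 4}
DIMENSION_WEIGHTS = {"bug_risk": 3.0, "security": 4.0, "maintainability": 1.5}

def keep_comment(llm_json: str, keep_threshold: float = 6.0) -> bool:
    """Combine per-dimension ratings (0-5) into a score we control,
    so the cutoff lives in our code rather than in the prompt."""
    judgement = json.loads(llm_json)
    score = sum(weight * judgement.get(dim, 0)
                for dim, weight in DIMENSION_WEIGHTS.items())
    return score >= keep_threshold
```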
We do something internally[0] but specifically for security concerns.

We've found that by having the LLM provide a "severity" level (simply low, medium, high), we're able to filter out all the nitpicky feedback.

It's important to note that this severity level should be specified at the end of the LLM's response, not the beginning or middle.

There's still an issue of context, where the LLM will provide a false positive due to unseen aspects of the larger system (e.g. "make sure to sanitize X input").

We haven't found the bot to be overbearing, but mostly because we auto-delete past comments when changes are pushed.

[0] https://magicloops.dev/loop/3f3781f3-f987-4672-8500-bacbeefca6db/view
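A rough sketch of that filter (the trailing tag format is an assumption on my part; the linked setup may parse it differently):

```python
import re

SEVERITY_RE = re.compile(r"severity:\s*(low|medium|high)\s*$", re.IGNORECASE)

def keep_finding(llm_response: str) -> bool:
    """Drop findings the model itself rates 'low'. Putting the severity at the
    end lets the model reason first and attach the label afterwards."""
    match = SEVERITY_RE.search(llm_response.strip())
    if not match:
        return True  # be conservative: keep anything we can't parse
    return match.group(1).lower() in {"medium", "high"}
```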
I can't fathom a world where an LLM would be able to review code in any meaningful way, at all.

It should not substitute for a human, and it probably wastes more effort than it saves, by a wide margin.
We found that in general it's pretty hard to make the LLM just stop without human intervention. You can see with things like Cline that if the LLM has to check its own work, it'll keep making 'improvements' in a loop: removing all comments, adding all comments, etc. It needs to generate something, and it seems too eager to be helpful to give you nothing.
Prompt:

If the comment could be omitted without affecting the code's functionality, but is stylistic or otherwise can be ignored, then preface the comment with

NITPICK

I'm guessing you've tried something like the above and then filtered for the preface, as you mentioned the LLM being bad at understanding what is and isn't important.
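The corresponding post-processing is basically a one-line filter (sketch; it assumes the model reliably emits the prefix, which is exactly the part the article found unreliable):

```python
def drop_nitpicks(comments: list[str]) -> list[str]:
    # Discard anything the model has self-labelled as a nitpick.
    return [c for c in comments
            if not c.lstrip().upper().startswith("NITPICK")]
```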
I am not sure about using UPPERCASE in prompts for emphasis. I feel intuitively that uppercase is less “understandable” for LLMs because it is more likely to be tokenized as a sequence of characters. I have no data to back this up, though.
I’d be curious to hear more on Attempt 2. It sounds like the approach was basically to ask an llm for a score for each comment. Adding specifics to this prompt might go a long way? Like, what specifically is the rationale for this change, is this likely to be a functional bug, is it a security issue, how does it impact maintainability over the long run, etc.; basically I wonder if asking about more specific criteria and trying to define what you mean by nits can help the LLM give you more reliable scores.
"As a last resort we tried machine learning ..."<p>- Hilarious that a cutting edge solution (document embedding and search) from 5-6 years ago was their last resort.<p>- Doubly hilarious that "throw more AI at it" surprised them when it didn't work.
> Attempt 2: LLM-as-a-judge

Wouldn't this be achievable with a classifier model? Maybe even a combo of getting the embedding and then putting it through a classifier? Kind of like how GANs work.

Edit: I read the article before the comment section, silly me lol
Here's an idea: have the LLM output each comment with a "severity" score ranging from 0-100, or maybe a set of possible values ("trivial", "small", "high"). Let it get everything off its chest, outputting the nitpicks but recognizing they're minor. Filter the output to only contain comments above a given threshold.

It's hard to avoid thinking of a pink elephant, but easy enough to consciously recognize it's not relevant to the task at hand.
Is there a name for the activity of trying out different strategies to improve the output of AI?

I found this article surprisingly enjoyable and interesting, and I'd like to find more like it.
A whole article without mentioning the name of the LLM. It's not like Sonnet and O1 have the same modalities.

Anything you do today might become irrelevant tomorrow.
What about false positives?

As I see it, the solution assumes the embeddings only capture the form: say, if developers previously downvoted suggestions to wrap code in unnecessary try..catch blocks, then similar suggestions will be successfully blocked in the future, regardless of the module/class etc. (i.e. a kind of generalization).

But what if enough suggestions regarding class X (or module X) get downvoted, and then the mechanism starts assuming class X/module X doesn't need review at all? I mean the case where a lot of such embeddings end up clustering around the class itself (or a function), not around the general form of the comment.

How do you prevent this? Or is it unlikely to happen? The only metric I've found in the article is the percentage of addressed suggestions that made it to the end user.
> Essentially we needed to teach LLMs (which are paid by the token) to only generate a small number of high quality comments.

The solution of filtering after the comment is generated doesn't seem to address the "paid by the token" piece.
I find these retro blog posts from people building LLM solutions super useful for helping me try out new prompting, eval, etc. techniques. Which blog posts have you all found most useful? Would love a link.
I think it would initially have gone better had they not used "nits" but rather "nitpicks", i.e. something that's in the dictionary that the chatbot is likely to understand.
Why review bots? Why not a bot that runs as part of the validation suite? And then you dismiss and follow up on what you want.

You can run that locally.
> $0.45/file capped at $50/dev/month

Wow, this is really expensive... especially given that the core of this technology is open source and target customers can set it up themselves, self-hosted.
> Please don't use HN primarily for promotion. It's ok to post your own stuff part of the time, but the primary use of the site should be for curiosity.

https://news.ycombinator.com/newsguidelines.html
Reading the other comments here, am I the only one thinking that this is just busywork? Just get rid of the thing; it's a solution without a problem.
TLDR:

> Giving few-shot examples to the generator didn't work.

> Using an LLM-judge (with no training) didn't work.

> Using an embedding + KNN-classifier (lots of training data) worked.

I don't know why they didn't try fine-tuning the LLM-judge, or at least give it some few-shot examples.

But it shows that embeddings can make very simple classifiers work well.
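For reference, the "embedding + KNN-classifier" step can be only a few lines with scikit-learn. A sketch with placeholder data (in practice the vectors would be embeddings of past review comments, labelled by whether the author addressed them):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Placeholder training data standing in for embedded past comments:
# label 1 = author addressed the comment, 0 = ignored/downvoted nit.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 384))          # e.g. sentence-embedding vectors
y_train = rng.integers(0, 2, size=200)

clf = KNeighborsClassifier(n_neighbors=3, metric="cosine")
clf.fit(X_train, y_train)

# Keep only freshly generated comments the classifier predicts as "likely addressed".
new_comment_embs = rng.normal(size=(5, 384))
keep_mask = clf.predict(new_comment_embs) == 1
```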