> LLMs (which are paid by the token)

Straightforwardly true, and yet I'd never thought about it like this before, i.e. that there's a perverse incentive for LLM vendors to tune for verbose outputs. We rightly raise eyebrows at the idea of developers being paid per volume of code, but it's the default for LLMs.
It's funny because I wouldn't consider the comment that they highlight in their post a nitpick.

Something that has an impact on the long-term maintainability of code is definitely not nitpicky, and in the majority of cases "define a type" fits this category, as it makes refactors and extensions MUCH easier.

On top of that, I think the approach they went with is a huge mistake. The same comment can be a nitpick on one CR but crucial on another; clustering them is destined to result in false positives and false negatives.

I'm not sure I'd want to use a product to review my code for which 1) I cannot customize the rules, and 2) the rules chosen by the creators seem poor.

To be honest, I wouldn't want to use any AI-based code reviewer at all. We have one at work (FAANG, so something with a large dedicated team) and it has not once produced a useful comment, while being factually wrong many times.
After failing with three reasonable ideas, they solved the problem with an idea that previously would have seemed unlikely to work (get similarity to past known nits and use a threshold of 3 similar nits above a certain cutoff similarity score for filtering). A lot of applications of ML have a similar hacky, trial-and-error flavor. Intuition is often built in retrospect and may not transfer well to other domains. I would have guessed that finetuning would also work, but I agree with the author that it would be more expensive and less portable across different models.
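A minimal sketch of that filtering step as I read it from the post (the embedding model, cutoff value, and the store of downvoted comments are placeholders, not the authors' actual implementation):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_probable_nit(comment_emb: np.ndarray,
                    past_nit_embs: list[np.ndarray],
                    sim_cutoff: float = 0.85,   # placeholder cutoff
                    min_matches: int = 3) -> bool:
    """Suppress a generated comment if it closely resembles at least
    `min_matches` previously downvoted ("nit") comments."""
    matches = sum(1 for e in past_nit_embs
                  if cosine_sim(comment_emb, e) >= sim_cutoff)
    return matches >= min_matches

# Usage: embed each new comment and each stored nit with the same model,
# then drop any comment where is_probable_nit(...) returns True.
```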
Our AI code review bot (Codacy) is just an LLM that compiles all linter rules, and it might be the most annoying, useless thing. For example, it will ding your PR for not considering Opera browser limitations on a backend NodeJS PR.

Furthermore, in most of the code reviews I perform I rarely leave commentary at all. There are so many frameworks and libraries today that solve whatever problem that, unless someone adds complex code or puts a file in a goofy spot, it's an instant approval. So an AI bot doesn't help with what is a minimal, non-problem task.
Unless they were using a model too stupid for this operation, the fundamental problem is usually solvable via prompting.

Usually the issue is that the models have a bias for action, so you need to give them an acceptable action to take when there isn't a good comment to make - some other output/determination.

I've seen this in many other similar applications.
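One way to give the model that escape hatch is to make "no comment" an explicit, first-class output. A rough illustration (the prompt wording and sentinel are made up, not the poster's exact approach):

```python
REVIEW_PROMPT = """\
Review the diff below. For each issue that materially affects correctness,
security, or long-term maintainability, write one comment.
If nothing meets that bar, respond with exactly: NO_COMMENTS
Do not invent minor style feedback just to produce output.
"""

def parse_review(llm_output: str) -> list[str]:
    # Treat the sentinel as a valid, expected outcome rather than forcing
    # the model to always say something.
    if llm_output.strip() == "NO_COMMENTS":
        return []
    return [line for line in llm_output.splitlines() if line.strip()]
```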
I'm in the market for PR review bots, as the nitpicking issue is real. So far I have tried Coderabbit, but it adds too much noise to PRs, and only a very small percentage of its comments are actually useful. I specifically instructed it to ignore nitpicks, but it still added such comments. Their cringy ASCII-art comments make it even harder to take them seriously.

I recently signed up for Korbit AI, but it's too soon to provide feedback. Honestly, I'm getting a bit fed up with experimenting with different PR bots.

Question for the author: in what ways is your solution better than Coderabbit and Korbit AI?
"We picked the latter, which also gave us our performance metric - percentage of generated comments that the author actually addresses."<p>This metric would go up if you leave almost no comments. Would it not be better to find a metric that rewards you for generating many comments which are addressed, not just having a high relevance?<p>You even mention this challenge yourselves: "Sadly, even with all kinds of prompting tricks, we simply could not get the LLM to produce fewer nits without also producing fewer critical comments."<p>If that was happening, that doesn't sound like it would be reflected in your performance metric.
I'm surprised the authors didn't try the "dumb" version of the solution they went with: instead of using fancy cosine similarity to create implicit clusters, just ask it to classify the comment along a few dimensions and then do your own filtering on that (or create your own 0-100 scoring!). It seems like you would have more control that way and would actually derive some rich(er) data to fine-tune on. They are already almost doing this: all the examples in the article start with "style"!

I have seen this pattern a few times, actually, where you want the AI to mimic some heuristic humans use. You never want to ask it for the heuristic directly; just create the constituent data so you can do some simple regression or whatever on top of it and control the cutoff yourself.
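Something like this is what I mean - the dimension names, weights, JSON schema, and threshold are all made up for illustration:

```python
import json

# Ask the LLM to emit structured judgements instead of a single verdict, e.g.:
#   {"category": "style", "bug_risk": 2, "security": 0, "maintainability": 4}
DIMENSION_WEIGHTS = {"bug_risk": 3.0, "security": 4.0, "maintainability": 1.5}

def keep_comment(llm_json: str, keep_threshold: float = 6.0) -> bool:
    """Combine per-dimension ratings (0-5) into a score we control,
    so the cutoff lives in our code rather than in the prompt."""
    judgement = json.loads(llm_json)
    score = sum(weight * judgement.get(dim, 0)
                for dim, weight in DIMENSION_WEIGHTS.items())
    return score >= keep_threshold
```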
We do something internally[0] but specifically for security concerns.

We've found that by having the LLM provide a "severity" level (simply low, medium, high), we're able to filter out all the nitpicky feedback.

It's important to note that this severity level should be specified at the end of the LLM's response, not the beginning or middle.

There's still an issue of context, where the LLM will provide a false positive due to unseen aspects of the larger system (e.g. "make sure to sanitize X input").

We haven't found the bot to be overbearing, but mostly because we auto-delete past comments when changes are pushed.

[0] https://magicloops.dev/loop/3f3781f3-f987-4672-8500-bacbeefca6db/view
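A rough sketch of that filter (the trailing tag format is an assumption on my part; the linked setup may parse it differently):

```python
import re

SEVERITY_RE = re.compile(r"severity:\s*(low|medium|high)\s*$", re.IGNORECASE)

def keep_finding(llm_response: str) -> bool:
    """Drop findings the model itself rates 'low'. Putting the severity at the
    end lets the model reason first and attach the label afterwards."""
    match = SEVERITY_RE.search(llm_response.strip())
    if not match:
        return True  # be conservative: keep anything we can't parse
    return match.group(1).lower() in {"medium", "high"}
```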
I can't fathom a world where an LLM would be able to review code in any meaningful way, at all.

It should not substitute for a human, and it probably wastes more effort than it saves, by a wide margin.
We found that in general it's pretty hard to make the LLM just stop without human intervention. You can see with things like Cline that if the LLM has to check its own work, it'll keep making 'improvements' in a loop: removing all comments, adding all comments, etc. It needs to generate something, and it seems too eager to be helpful to give you nothing.
Prompt:

If the comment could be omitted without affecting the code's functionality, but is stylistic or otherwise can be ignored, then preface the comment with

NITPICK

I'm guessing you've tried something like the above and then filtered for the preface, as you mentioned the LLM being bad at understanding what is and isn't important.
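The corresponding post-processing is basically a one-line filter (sketch; it assumes the model reliably emits the prefix, which is exactly the part the article found unreliable):

```python
def drop_nitpicks(comments: list[str]) -> list[str]:
    # Discard anything the model has self-labelled as a nitpick.
    return [c for c in comments
            if not c.lstrip().upper().startswith("NITPICK")]
```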
I am not sure about using UPPERCASE in prompts for emphasis. I feel intuitively that uppercase is less “understandable” for LLMs because it is more likely to be tokenized as a sequence of characters. I have no data to back this up, though.
I’d be curious to hear more on Attempt 2. It sounds like the approach was basically to ask an llm for a score for each comment. Adding specifics to this prompt might go a long way? Like, what specifically is the rationale for this change, is this likely to be a functional bug, is it a security issue, how does it impact maintainability over the long run, etc.; basically I wonder if asking about more specific criteria and trying to define what you mean by nits can help the LLM give you more reliable scores.
"As a last resort we tried machine learning ..."<p>- Hilarious that a cutting edge solution (document embedding and search) from 5-6 years ago was their last resort.<p>- Doubly hilarious that "throw more AI at it" surprised them when it didn't work.
> Attempt 2: LLM-as-a-judge

Wouldn't this be achievable with a classifier model? Maybe even a combo of getting the embedding and then putting it through a classifier? Kind of like how GANs work.

Edit: I read the article before the comment section, silly me lol
Here's an idea: have the LLM output each comment with a "severity" score ranging from 0-100, or maybe a set of possible values ("trivial", "small", "high"). Let it get everything off its chest, outputting the nitpicks but recognizing they're minor. Filter the output to only contain comments above a given threshold.

It's hard to avoid thinking of a pink elephant, but easy enough to consciously recognize it's not relevant to the task at hand.
Is there a name for the activity of trying out different strategies to improve the output of AI?

I found this article surprisingly enjoyable and interesting, and I'd like to find more like it.
A whole article without mentioning the name of the LLM. It's not like Sonnet and O1 have the same modalities.

Anything you do today might become irrelevant tomorrow.
What about false positives?

As I see it, the solution assumes the embeddings only capture the form: say, if developers previously downvoted suggestions to wrap code in unnecessary try..catch blocks, then similar suggestions will be successfully blocked in the future, regardless of the module/class etc. (i.e. a kind of generalization).

But what if enough suggestions regarding class X (or module X) get downvoted, and then the mechanism starts assuming class X/module X doesn't need review at all? I mean the case where a lot of such embeddings end up clustering around the class itself (or a function), not around the general form of the comment.

How do you prevent this? Or is it unlikely to happen? The only metric I've found in the article is the percentage of addressed suggestions that made it to the end user.
> Essentially we needed to teach LLMs (which are paid by the token) to only generate a small number of high quality comments.

The solution of filtering after the comment is generated doesn't seem to address the "paid by the token" piece.
I find these retro blog posts from people building LLM solutions super useful for helping me try out new prompting, eval, etc. techniques. Which blog posts have you all found most useful? Would love a link.
I think it would initially have gone better had they not used "nits" but rather "nitpicks", i.e. something that's in the dictionary that the chatbot is likely to understand.
Why review bots? Why not a bot that runs as part of the validation suite? And then you dismiss and follow up on what you want.

You can run that locally.
> $0.45/file capped at $50/dev/month

Wow, this is really expensive... especially given that the core of this technology is open source and target customers can set it up themselves, self-hosted.
> Please don't use HN primarily for promotion. It's ok to post your own stuff part of the time, but the primary use of the site should be for curiosity.

https://news.ycombinator.com/newsguidelines.html
Reading the other comments here, am I the only one thinking that this is just busywork? Just get rid of the thing; it's a solution without a problem.
TLDR:

> Giving few-shot examples to the generator didn't work.

> Using an LLM-judge (with no training) didn't work.

> Using an embedding + KNN-classifier (lots of training data) worked.

I don't know why they didn't try fine-tuning the LLM-judge, or at least give it some few-shot examples.

But it shows that embeddings can make very simple classifiers work well.
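For reference, the "embedding + KNN-classifier" step can be only a few lines with scikit-learn. A sketch with placeholder data (in practice the vectors would be embeddings of past review comments, labelled by whether the author addressed them):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Placeholder training data standing in for embedded past comments:
# label 1 = author addressed the comment, 0 = ignored/downvoted nit.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 384))          # e.g. sentence-embedding vectors
y_train = rng.integers(0, 2, size=200)

clf = KNeighborsClassifier(n_neighbors=3, metric="cosine")
clf.fit(X_train, y_train)

# Keep only freshly generated comments the classifier predicts as "likely addressed".
new_comment_embs = rng.normal(size=(5, 384))
keep_mask = clf.predict(new_comment_embs) == 1
```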