> I think it's as close to the perfect benchmark as you can get.<p>Depends what you are benchmarking for... If you are benchmarking the ability of the solution to solve LeetCode challenges, that is different to the ability of GPT4 to help everyday programmers knock out business logic or diagnose bugs.<p>My experience of GPT4 is that it's significantly better at the latter than GPT3.5.<p>Additionally, the real test for me is: "Can an average programmer using GPT4 as a tool solve Advent of Code faster than an equally skilled programmer without an LLM?"
> ChatGPT never did this: its debugging skills are completely non-existent. If it encounters an error it will simply rewrite entire functions, or more often the entire program, from scratch.<p>Well, it's going to need to rewrite functions to add debug output, since it has no edit capabilities, but I tried this and it absolutely added debug info, which it then used to debug issues.<p>In an inner part it added:<p><pre><code> debug_info.append({
     'Hand': hand,
     'Bid': bid,
     'Type': hand_info[0],
     'Sorted Hand': hand_info[1],
     'Rank': rank,
     'Score': score
 })
</code></pre>
I don't know quite what's happening, but I feel like people constantly say it can't do something, and the very first thing I try (just asking it to do the thing) usually works.<p>I gave it the hands in the problem statement, the expected result, and the explanation as to why (copy-pasted). It ran the code, looked at the debug output, identified the problem and rewrote the function. I'm not saying it immediately solved the problem, but it easily added debug information, ran the code, looked at the output and interpreted it.
Of the few I attempted this year, I actually used ChatGPT to help me decipher just what the hell the waffling, rambling, windy, unclear, and ultimately superfluous challenge requirements were.<p>To be clear, I didn't have ChatGPT generate the solutions; I used a prompt along the lines of "take this text and extract its requirements and generate bullet points, also make the example inputs and outputs clear".<p>I did the same for the previous year too, when ChatGPT hadn't been out for long.<p>I find that a lot more enjoyable and less tedious.
What strikes me about ChatGPT is the blatantly wrong answers it can give. I asked ChatGPT to solve an augmented matrix using Gaussian elimination, and it failed spectacularly at this straightforward task.
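For reference, a minimal sketch of what a correct solution does: Gaussian elimination with partial pivoting on an augmented matrix [A | b]. The 2x2 system at the end is only an illustration, not the matrix from the original attempt:<p><pre><code>def solve_augmented(aug):
    """Gaussian elimination with partial pivoting on an augmented matrix
    [A | b]; returns the solution vector x (assumes a unique solution)."""
    n = len(aug)
    m = [row[:] for row in aug]          # work on a copy
    for col in range(n):
        # pick the row with the largest pivot and swap it up
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        # eliminate the column below the pivot
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    # back-substitution
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

# e.g. x + y = 3, 2x - y = 0  ->  x = 1, y = 2
print(solve_augmented([[1, 1, 3], [2, -1, 0]]))
</code></pre>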
<a href="https://m.youtube.com/@mzikmund" rel="nofollow">https://m.youtube.com/@mzikmund</a><p>This creator has some excellent videos of chat GPT attempting advent of code.<p>He uses it in a more generous format where he is often giving it multiple attempts and trying to coax the correct answer out of it. He definitely has more success with it than the article, but it is hard to tell how much of that success is due generous assistance and prompting, and so it is hard to know how much the model has actually improved year over year.
I have an observation not related to ChatGPT but to the debugging skills mentioned in the article: indeed, I've always felt that most teaching is done on perfect, working code. I've never seen an exercise for developing debugging skills. For example: "This Dijkstra implementation is finding the wrong path," and then work with students to pinpoint the off-by-one error. I think it would reveal so much more about the implementation details than just explaining how it works. It could be its own topic to explore race conditions, edge cases and so on.
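For illustration, a minimal sketch of what such an exercise could look like; the graph and the planted bug here are invented for this example, not taken from any real course material:<p><pre><code>import heapq

def dijkstra_buggy(graph, start):
    # Exercise: this returns the wrong distance for some graphs. Why?
    # (Planted bug: nodes are marked visited when *pushed* instead of
    # when *popped*, so a later, shorter path can never win.)
    dist = {start: 0}
    visited = {start}
    heap = [(0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        for v, w in graph.get(u, []):
            if v not in visited:
                visited.add(v)
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

graph = {'A': [('B', 10), ('C', 1)], 'C': [('B', 1)]}
print(dijkstra_buggy(graph, 'A'))  # {'A': 0, 'B': 10, 'C': 1} -- B should be 2
</code></pre>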
I believe this presents a good challenge, as the article mentioned: the problems are unique :) None of them are found in ChatGPT's training dataset.
Are these problems unique enough that they couldn't have circulated before the AoC where they appear now?<p>I notice that for "novel" code that Copilot hasn't seen before, it's mostly useless, but when I'm writing a hobby project that is a path tracer (of which there are probably 1000 implementations on GitHub), it's excellent. Which isn't surprising: it has seen the exact same function I'm writing, written 100 times in every language imaginable. There are books on the topic, etc.
Seeing the other comments, I'm still left wondering whether this is a limit of scale or a limit of the technology. How much more can LLMs scale up over the next decade?
What would be really interesting would be to repeatedly attempt to get ChatGPT to solve the problems.<p>By that I mean try to solve the Day 1 problem at the point of release, then try to solve it in a fresh ChatGPT session the next day, and then get it to solve it again the day after that, and so on for the next couple of months.<p>Do that for each day it runs and then see what patterns emerge. Part of me expects it to get better at solving each problem as the rest of us write our solutions and make them available on the web in some form for it to harvest, but it would be interesting to see if that is the case.
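A rough sketch of that experiment, assuming the openai>=1.0 Python client; the model name and prompt are placeholders, and checking the saved answers against the puzzle would still be manual:<p><pre><code>import datetime
import pathlib
from openai import OpenAI

client = OpenAI()

def attempt(day: int, puzzle_text: str, outdir: str = "attempts") -> pathlib.Path:
    # Run once per day (e.g. from cron) in a fresh session; save each answer
    # with a date stamp so the attempts can be compared over time.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Solve this Advent of Code puzzle in Python:\n\n{puzzle_text}",
        }],
    )
    stamp = datetime.date.today().isoformat()
    path = pathlib.Path(outdir) / f"day{day:02d}-{stamp}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(response.choices[0].message.content)
    return path
</code></pre>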
Claim chowder:<p>> The model can unquestionably code. [<a href="https://news.ycombinator.com/item?id=38205052">https://news.ycombinator.com/item?id=38205052</a>]<p>Seems pretty clear most of the fantastic results from before were overfitting. Like every other case. It's amazing to me how these models can be caught red-handed being overfitted again and again and again, and people don't get the memo.
> Problems start very easy on day 1 (sometimes as easy as just asking for a program that sums all numbers in the input) and progress towards more difficult ones, but they never get very hard: a CS graduate should be able to solve all problems, except maybe 1 or 2, in a couple of hours each.<p>I think this wildly overestimates the programming skills of the average CS graduate. My estimate of the fraction of CS graduates able to do that is closer to 1%.
> I don't pay for ChatGPT Plus, I only have a paid API key so I used instead a command line client, chatgpt-cli and manually ran the output programs.<p>But this is very different from, and likely much worse than, paid ChatGPT. Code Interpreter does its own debugging-and-revising loop.<p>It's bizarre to me that this author wouldn't pay $20 one time to evaluate the higher-quality product, the one most people would use if they cared about code quality. Am I missing something?
AoC questions this year were _deliberately written_ to be confusing to LLMs; it's not failing because it's worse, it's failing because the questions were written to make it harder for models :]<p>Edit: apparently not, the author is just really good at coming up with AI-resistant puzzles. In testing, ChatGPT did much better on last year's puzzles.
The comments about debugging make me sad.<p>It shows that people fundamentally do not understand the tools they are using.<p>Given a book of numbers, here are two tasks:<p>1) copy out the entire book, but replace every prime number with 7.<p>2) write down the list of prime numbers in the book.<p>Which one is easier?<p>LLMs have to generate tokens one at a time, and it's very, very difficult to perfectly generate a set of input tokens <i>except for some tokens</i>.<p>Since you are almost certainly randomising the probabilities to some degree (that's what temperature does), you're also asking for both deterministic <i>and random</i> outputs.<p>TLDR: ask LLMs what is wrong with the code.<p>Ask for a diff.<p>Don't ask an LLM to refactor, bug-fix or annotate code…<p>That's extremely naive usage.<p>Back to my stupid analogy: "please copy out this book, but fix the numbers which are 'wrong'".<p>I can hardly complain when I get terrible results, can I?
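As a rough sketch of that usage pattern, assuming the openai>=1.0 Python client (the model name and prompt wording are illustrative only):<p><pre><code>from openai import OpenAI

client = OpenAI()

def ask_for_diff(code: str, error: str) -> str:
    # Ask what is wrong and request a unified diff, rather than a full rewrite.
    prompt = (
        "Here is a Python program and the error it produces.\n"
        "Explain what is wrong, then give the fix as a unified diff only; "
        "do not rewrite the whole program.\n\n"
        f"Program:\n{code}\n\nError:\n{error}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # reduce sampling randomness for a more reproducible patch
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
</code></pre>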
My impression of the 2023 AoC was that Wastl spent considerable effort on making it less LLM-friendly this year. Some days, this seems to have been done by adding extra conditions and complications that make the task more difficult for LLMs to parse. Other tasks required studying the input data, which is difficult to achieve with an unsupervised LLM. Finally, the first couple of days seemed a lot more difficult this year than in previous years, possibly to deter ChatGPT users from filling up the leaderboard right away (though this year December started on a weekend, which could also be a contributing factor).