This kind of mirrors my experience with LLMs. If I ask them non-original problems (make this API, write this test, update this function that must have been written hundreds of times by developers around the world, etc.), they work very well. Some minor changes here and there, but it saves time.<p>When I ask them to code things they have never heard of (I am working on an online sports game), they fail catastrophically. The LLM should know the sport, and what I ask is pretty clear to anyone who understands the game (I tested against actual people and it was obvious what to expect), but the LLM failed miserably. Even worse when I ask them to write some designs in CSS for the game. It seems that if you take them outside the 3-column layout, Bootstrap, or the overused landing page, LLMs fail miserably.<p>They work very well for the known cases, but as soon as you want them to do something original, they just can't.
The article and comments here _really_ underestimate the current state of LLMs (or overestimate how hard AoC 2024 was)<p>Here's a much better analysis from someone who got 45 stars using LLMs. <a href="https://www.reddit.com/r/adventofcode/comments/1hnk1c5/results_of_a_multiyear_llm_experiment/" rel="nofollow">https://www.reddit.com/r/adventofcode/comments/1hnk1c5/resul...</a><p>All the top 5 players on the final leaderboard <a href="https://adventofcode.com/2024/leaderboard" rel="nofollow">https://adventofcode.com/2024/leaderboard</a> used LLMs for most of their solutions.<p>LLMs can solve all days except 12, 15, 17, 21, and 24
After looking at the charts I was like "Whoa, damn, that Jerpint model seems amazing. Where do I get that??" I spent some time trying to find it on Huggingface before I realized...
Since you did not give the models a chance to test their code and correct any mistakes, I think a more accurate comparison would be against you submitting answers without testing (or even running!) your code first.
I’m adjacent to some people who do AoC competitively, and it’s clear that many of the top 10 and maybe half of the top 100 this year were heavily LLM-assisted or done entirely by LLMs in a loop. They won first place on many days. It was disappointing to the community that people cheated and went against the community’s wishes, but it’s clear LLMs can do much better than described here.
Half the time I ask Gemini questions about the C++ standard library, it fabricates non-existent types and functions. I'm honestly impressed it was able to solve any of the AoC problems.
I’m both surprised and not surprised. I’m surprised because these sorts of problems, with very clear prompts and fairly clear algorithmic requirements, are exactly what I’d expect LLMs to perform best at.<p>But I’m not surprised because I’ve seen them fail on many problems even with lots of prompt engineering and test cases.
With no prompt engineering this seems like a weird comparison. I wouldn’t expect anyone to be able to one-shot most of the AoC problems. A fair fight would at least use something like Cursor’s agent in YOLO mode, which can review a command’s output, add logs, etc.
I'm a bit of an AI skeptic, and I think I had the opposite reaction to the author. Even though this is far from welcoming our AI overlords, I am surprised that they are this good.
I'd be interested to know how o1 compares. On many days, after I completed the AoC puzzles, I put the questions into o1 and it seemed to do really well.
At first I was like "What is this jerpint model that's beating the competition so soundly?" then it hit me, lol.<p>Anyhow this is like night and day compared to last year, and it's impressive that Sonnet is now apparently 50% as good as a professional human at this sort of thing.
I like the idea, but I feel like the execution left a bit to be desired.<p>My gut tells me you can get much better results from the models with better prompting. The whole "You are solving the 2024 advent of code challenge." form of prompting is just adding noise with no real value. Based on my empirical experience, that likely hurts performance instead of helping.<p>The time limit feels arbitrary and adds nothing to the benchmark. I don't understand why you wouldn't include o1 in the list of models.<p>There's just a lot here that doesn't feel very scientific about this analysis...
Wanted to try with o1 and o1-mini, but it looks like there's no code available, although I guess I could just ask 3.5 Sonnet/o1 to make the evaluation suite ;)
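Something like this would probably be enough as a starting point (rough sketch only; it assumes the OpenAI Python client, a made-up prompt, and the same 300-second runtime limit mentioned in the post):<p>
<pre><code># Rough sketch of a harness: ask a model for a single-shot solution,
# run it against the puzzle input, and enforce a 300-second limit.
import subprocess
import sys
from pathlib import Path

from openai import OpenAI  # assumes `pip install openai` and OPENAI_API_KEY set

client = OpenAI()

PROMPT = (
    "Solve this Advent of Code puzzle in Python. "
    "Read the puzzle input from a file named input.txt and print the answer.\n\n"
)

def evaluate(puzzle_text: str, workdir: Path, model: str = "o1-mini") -> str:
    # workdir is expected to contain the puzzle's input.txt
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT + puzzle_text}],
    )
    code = response.choices[0].message.content
    # (In practice you'd also strip any markdown code fences from the reply.)
    script = workdir / "solution.py"
    script.write_text(code)
    # Same 300-second runtime cap as the post; raises TimeoutExpired on overrun
    result = subprocess.run(
        [sys.executable, script.name],
        cwd=workdir, capture_output=True, text=True, timeout=300,
    )
    return result.stdout.strip()
</code></pre><p>Swap in the Anthropic client for the 3.5 Sonnet runs; the structure stays the same.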
I think a major mistake was giving parts 1 and 2 all at once. I had great results having it solve part 1, then part 2. I think I got 4o to one-shot parts 1 and then 2 up to about day 12. It started to struggle a bit after that, and I got bored with it at day 18. It did way better than I expected; I don't understand why the author is disappointed. This shit is magic.
I think this is a terrible analysis with a weak conclusion.<p>There's zero mention of how long it took the LLM to write the code vs the human. You have a 300 second runtime limit, but what was your coding time limit? The machine spat out code in, what, a few seconds? And how long did your solutions take to write?<p>Advent of code problems take me longer to just <i>read</i> than it takes an LLM to have a proposed solution ready for evaluation.<p>> <i>they didn’t perform nearly as well as I’d expect</i><p>Is this a joke, though? A machine takes a problem description written as floridly hyperventilated as advent problems are, and, without any opportunity for automated reanalysis, it understands the exact problem domain, it understands exactly what's being asked, correctly models the solution, and spits out a correct single-shot solution on 20 of them in no time flat, often with substantially better running time than your own solutions, and that's disappointing?<p>> <i>a lot of the submissions had timeout errors, which means that their solutions might work if asked more explicitly for efficient solutions. However the models should know very well what AoC solutions entail</i><p>You made up an arbitrary runtime limit and then kept that limit a secret, and you were surprised when the solutions didn't adhere to the secret limit?<p>> <i>Finally, some of the submissions raised some Exceptions, which would likely be fixed with a human reviewing this code and asking for changes.</i><p>How many of your solutions got the correct answer on the first try without going back and fixing something?
Genuinely terrible prompt. Not only is the structure poor, it also contains grammatical errors. I'm confident you could at least double their score by improving your prompting significantly.