I think this is a terrible analysis with a weak conclusion.

There's zero mention of how long it took the LLM to write the code versus the human. You have a 300-second runtime limit, but what was your coding time limit? The machine spat out code in, what, a few seconds? And how long did your solutions take to write?

Advent of Code problems take me longer to just *read* than it takes an LLM to have a proposed solution ready for evaluation.

> they didn’t perform nearly as well as I’d expect

Is this a joke, though? A machine takes a problem description written in the floridly hyperventilated style of Advent problems and, without any opportunity for automated reanalysis, understands the exact problem domain, understands exactly what's being asked, correctly models the solution, and spits out a correct single-shot solution on 20 of them in no time flat, often with substantially better running time than your own solutions. And that's disappointing?

> a lot of the submissions had timeout errors, which means that their solutions might work if asked more explicitly for efficient solutions. However the models should know very well what AoC solutions entail

You made up an arbitrary runtime limit, kept that limit a secret, and then were surprised when the solutions didn't adhere to it?

> Finally, some of the submissions raised some Exceptions, which would likely be fixed with a human reviewing this code and asking for changes.

How many of your solutions got the correct answer on the first try, without going back and fixing something?