> I think it's as close to the perfect benchmark as you can get.<p>Depends what you are benchmarking for... If you are benchmarking the ability of the solution to solve LeetCode challenges, that is different to the ability of GPT4 to help everyday programmers knock out business logic or diagnose bugs.<p>My experience of GPT4 is that it's significantly better at the latter than GPT3.5.<p>Additionally, the real test for me is: "Can an average programmer using GPT4 as a tool solve Advent of Code faster than an equally skilled programmer without an LLM?"
> ChatGPT never did this: its debugging skills are completely non-existent. If it encounters an error it will simply rewrite entire functions, or more often the entire program, from scratch.<p>Well, it's going to need to rewrite functions to add debug output, since it has no edit capabilities, but I tried this and it absolutely added debug info, which it then used to debug issues.<p>In an inner part it added:<p><pre><code> debug_info.append({
     'Hand': hand,
     'Bid': bid,
     'Type': hand_info[0],
     'Sorted Hand': hand_info[1],
     'Rank': rank,
     'Score': score
 })
</code></pre>
I don't know quite what's happening, but I feel like people constantly say it can't do something, and the very first thing I try (just asking it to do the thing) usually works.<p>I gave it the hands in the problem statement, the expected result, and the explanation as to why (copy-pasted). It ran the code, looked at the debug output, identified the problem and rewrote the function. I'm not saying it immediately solved the problem, but it easily added debug information, ran the code, looked at the output and interpreted it.
Of the few I attempted this year, I actually used ChatGPT to help me decipher just what the hell the waffling, rambling, windy, unclear, and ultimately superfluous challenge requirements were.<p>To be clear, I didn't have ChatGPT generate the solutions; I used a prompt along the lines of "take this text and extract its requirements and generate bullet points, also make the example inputs and outputs clear".<p>I did the same for the previous year too, when ChatGPT hadn't been out for long.<p>I find that a lot more enjoyable and less tedious.
What strikes me about ChatGPT is the blatantly wrong answers it can give. I asked ChatGPT to solve an augmented matrix using Gaussian elimination, and it failed spectacularly at this straightforward task.
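For reference, a minimal sketch of what a correct solution does: Gaussian elimination with partial pivoting on an augmented matrix [A | b]. The 2x2 system at the end is only an illustration, not the matrix from the original attempt:<p><pre><code>def solve_augmented(aug):
    """Gaussian elimination with partial pivoting on an augmented matrix
    [A | b]; returns the solution vector x (assumes a unique solution)."""
    n = len(aug)
    m = [row[:] for row in aug]          # work on a copy
    for col in range(n):
        # pick the row with the largest pivot and swap it up
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        # eliminate the column below the pivot
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    # back-substitution
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

# e.g. x + y = 3, 2x - y = 0  ->  x = 1, y = 2
print(solve_augmented([[1, 1, 3], [2, -1, 0]]))
</code></pre>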
<a href="https://m.youtube.com/@mzikmund" rel="nofollow">https://m.youtube.com/@mzikmund</a><p>This creator has some excellent videos of chat GPT attempting advent of code.<p>He uses it in a more generous format where he is often giving it multiple attempts and trying to coax the correct answer out of it. He definitely has more success with it than the article, but it is hard to tell how much of that success is due generous assistance and prompting, and so it is hard to know how much the model has actually improved year over year.
I have an observation not related to ChatGPT but to the debugging skills mentioned in the article: indeed, I've always felt that most teaching is done on perfect, working code. I've never seen an exercise for developing debugging skills. For example: "This Dijkstra implementation is finding the wrong path," and then work with students to pinpoint the off-by-one error. I think it would reveal so much more about the implementation details than just explaining how it works. It could be its own topic to explore race conditions, edge cases and so on.
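For illustration, a minimal sketch of what such an exercise could look like; the graph and the planted bug here are invented for this example, not taken from any real course material:<p><pre><code>import heapq

def dijkstra_buggy(graph, start):
    # Exercise: this returns the wrong distance for some graphs. Why?
    # (Planted bug: nodes are marked visited when *pushed* instead of
    # when *popped*, so a later, shorter path can never win.)
    dist = {start: 0}
    visited = {start}
    heap = [(0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        for v, w in graph.get(u, []):
            if v not in visited:
                visited.add(v)
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

graph = {'A': [('B', 10), ('C', 1)], 'C': [('B', 1)]}
print(dijkstra_buggy(graph, 'A'))  # {'A': 0, 'B': 10, 'C': 1} -- B should be 2
</code></pre>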
I believe this presents a good challenge, as the article mentioned: the problems are unique :) None of them are found in ChatGPT's training dataset.
Are these problems unique enough that they couldn't have circulated before the AoC where they appear now?<p>I notice that for "novel" code that Copilot hasn't seen before, it's mostly useless, but when I'm writing a hobby project that is a path tracer (of which there are probably 1000 implementations on GitHub), it's excellent. Which isn't surprising: it has seen the exact same function I'm writing, written 100 times in every language imaginable. There are books on the topic, etc.
Seeing the other comments, I'm still left wondering whether this is a limit of scale or a limit of the technology. How much more can LLMs scale up over the next decade?
What would be really interesting would be to repeatedly attempt to get ChatGPT to solve the problems.<p>By that I mean try to solve the Day 1 problem at the point of release, then try to solve it in a fresh ChatGPT session the next day, and then get it to solve it again the day after that, and so on for the next couple of months.<p>Do that for each day it runs and then see what patterns emerge. Part of me expects it to get better at solving each problem as the rest of us write our solutions and make them available on the web in some form for it to harvest, but it would be interesting to see if that is the case.
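A rough sketch of that experiment, assuming the openai>=1.0 Python client; the model name and prompt are placeholders, and checking the saved answers against the puzzle would still be manual:<p><pre><code>import datetime
import pathlib
from openai import OpenAI

client = OpenAI()

def attempt(day: int, puzzle_text: str, outdir: str = "attempts") -> pathlib.Path:
    # Run once per day (e.g. from cron) in a fresh session; save each answer
    # with a date stamp so the attempts can be compared over time.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Solve this Advent of Code puzzle in Python:\n\n{puzzle_text}",
        }],
    )
    stamp = datetime.date.today().isoformat()
    path = pathlib.Path(outdir) / f"day{day:02d}-{stamp}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(response.choices[0].message.content)
    return path
</code></pre>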
Claim chowder:<p>> The model can unquestionably code. [<a href="https://news.ycombinator.com/item?id=38205052">https://news.ycombinator.com/item?id=38205052</a>]<p>Seems pretty clear most of the fantastic results from before were overfitting. Like every other case. It's amazing to me how these models can be caught red-handed being overfitted again and again and again, and people don't get the memo.
> Problems start very easy on day 1 (sometimes as easy as just asking for a program that sums all numbers in the input) and progress towards more difficult ones, but they never get very hard: a CS graduate should be able to solve all problems, except maybe 1 or 2, in a couple of hours each.<p>I think this wildly overestimates the programming skills of the average CS graduate. My estimate of the fraction of CS graduates able to do that is closer to 1%.
> I don't pay for ChatGPT Plus, I only have a paid API key so I used instead a command line client, chatgpt-cli and manually ran the output programs.<p>But this is very different from, and likely much worse than, paid ChatGPT. Code Interpreter does its own debugging-and-revising loop.<p>It's bizarre to me that this author wouldn't pay $20 one time to evaluate the higher-quality product, the one most people would use if they cared about code quality. Am I missing something?
AoC questions this year were _deliberately written_ to be confusing to LLMs; it's not failing because it's worse, it's failing because the questions were written to make it harder for models :]<p>Edit: apparently not, the author is just really good at coming up with AI-resistant puzzles. In testing, ChatGPT did much better on last year's puzzles.
The comments about debugging make me sad.<p>It shows that people fundamentally do not understand the tools they are using.<p>Given a book of numbers, here are two tasks:<p>1) copy out the entire book, but replace every prime number with 7.<p>2) write down the list of prime numbers in the book.<p>Which one is easier?<p>LLMs have to generate tokens one at a time, and it's very, very difficult to perfectly generate a set of input tokens <i>except for some tokens</i>.<p>Since you are almost certainly randomising the probabilities to some degree (that's what temperature does), you're also asking for both deterministic <i>and random</i> outputs.<p>TLDR: ask LLMs what is wrong with the code.<p>Ask for a diff.<p>Don't ask an LLM to refactor, bug-fix or annotate code…<p>That's extremely naive usage.<p>Back to my stupid analogy: "please copy out this book, but fix the numbers which are 'wrong'".<p>I can hardly complain when I get terrible results, can I?
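As a rough sketch of that usage pattern, assuming the openai>=1.0 Python client (the model name and prompt wording are illustrative only):<p><pre><code>from openai import OpenAI

client = OpenAI()

def ask_for_diff(code: str, error: str) -> str:
    # Ask what is wrong and request a unified diff, rather than a full rewrite.
    prompt = (
        "Here is a Python program and the error it produces.\n"
        "Explain what is wrong, then give the fix as a unified diff only; "
        "do not rewrite the whole program.\n\n"
        f"Program:\n{code}\n\nError:\n{error}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # reduce sampling randomness for a more reproducible patch
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
</code></pre>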
My impression of the 2023 AoC was that Wastl spent considerable effort on making it less LLM-friendly this year. Some days, this seems to have been done by adding extra conditions and complications that make the task more difficult for LLMs to parse. Other tasks required studying the input data, which is difficult to achieve with an unsupervised LLM. Finally, the first couple of days seemed a lot more difficult this year than in previous years, possibly to deter ChatGPT users from filling up the leaderboard right away (though this year December started on a weekend, which could also be a contributing factor).