Tried my standard go-to for testing: asked it to generate a Voronoi diagram using p5.js. For the sake of job security, I'm relieved to see it still can't do a relatively simple task with ample representation in Google search results. Granted, p5.js is somewhat niche, but not terribly so; it's arguably the most popular library for creative coding.<p>In case you're wondering, I tried o1-preview, and while it did work, I was initially perplexed as to why the result looked pixelated. It turns out many of the p5.js examples online use a relatively simple approach: for each pixel, find the cell center it's closest to, more or less. It works, but it's a pretty crude approach.<p>Now, granted, you're probably not doing creative coding at your job, so this may not matter much, but to me it was an example of pretty poor generalization. Curiously, Claude has no problem whatsoever generating a Voronoi diagram as an SVG, but writing a script to generate the same diagrams using a particular library eluded it. It knows how to do one thing but generalizes poorly when attempting something similar.<p>It's really hard to get a sense of capabilities when you're faced with experiences like this, all the while it's somehow able to solve 46% of real-world Python pull requests from a certain dataset. In case you're wondering, one paper (<a href="https://cs.paperswithcode.com/paper/swe-bench-enhanced-coding-benchmark-for-llms" rel="nofollow">https://cs.paperswithcode.com/paper/swe-bench-enhanced-codin...</a>) found that 94% of the pull requests in SWE-bench were created before the knowledge cutoff dates of the latest LLMs, so there's almost certainly some degree of data leakage.
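<p>For context, the "crude" per-pixel approach described above can be sketched in plain JavaScript roughly like this (the function and variable names here are illustrative, not from any particular p5.js example):

```javascript
// Brute-force Voronoi: assign each pixel to its nearest seed point.
// This is the pixelated approach: O(width * height * seeds) work, and the
// output is a raster grid of cell indices rather than crisp polygons.

function nearestSeed(x, y, seeds) {
  let best = 0;
  let bestDist = Infinity;
  for (let i = 0; i < seeds.length; i++) {
    const dx = x - seeds[i].x;
    const dy = y - seeds[i].y;
    const d = dx * dx + dy * dy; // squared distance suffices for comparison
    if (d < bestDist) {
      bestDist = d;
      best = i;
    }
  }
  return best;
}

// Assign every pixel of a width-by-height grid to a Voronoi cell index;
// in a p5.js sketch you'd then color each pixel by its cell.
function voronoiGrid(width, height, seeds) {
  const cells = new Array(width * height);
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      cells[y * width + x] = nearestSeed(x, y, seeds);
    }
  }
  return cells;
}
```

A crisp vector output (like the SVG case mentioned above) instead computes the actual cell polygons, e.g. via Fortune's sweep-line algorithm or a Delaunay-triangulation library, which is why the SVG result doesn't look pixelated.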