51 points by rvivek 3 months ago

We help companies hire & upskill developers. A customer recently asked: What % of HackerRank problems can LLMs solve? That got us thinking: how should hiring evolve when AI can translate natural language to code?

Our belief: AI will handle much of code generation, so developers will be assessed more on SDLC skills with AI assistants.

To explore this, we're benchmarking LLMs on real-world software dev scenarios, starting with 65 unseen problems across 10 domains. Beyond correctness, we evaluated consistency, an often overlooked aspect of AI reliability. We're open-sourcing the dataset on Huggingface and expanding it to cover more domains, ambiguous specs, and harder challenges.

Would love the HN community's take on this!

5 comments

bobnamob 3 months ago

Seems like a very limited subset of software development to be basing a benchmark on.

Where's the kernel dev? Where's the embedded dev? Where's the throwaway python script?

sosuke 3 months ago

No huggingface models or did I just miss them? Edit: they mention doing open models at some point at the bottom of the page

danpalmer 3 months ago

> To mimic real-world development, HackerRank's ASTRA Benchmark Dataset includes, on average, 12 source code and configuration files per question as model inputs.

How is 12 files "real-world development"? My hobby project currently has 142 files and most non-trivial changes would involve adding a new file. My small work project has 79 and similarly, any non-trivial changes will need to add a file. These are small codebases. My previous team was ~450k lines across ~thousands of files, and we managed that pretty effectively with 6 engineers.

Getting the right answer out of an LLM for these sorts of tasks is fine if you give it little enough context that it's an effectively greenfield task as most of these problems end up being. But giving them a whole codebase and expecting the right answer, or the process of choosing the right subset to give them, are still big unsolved problems.

At this point it honestly feels a bit like gaslighting, suggesting that a 12 file NodeJS server is representative of software engineering.

rushingcreek 3 months ago

Would love to see how DeepSeek R1 compares to O1 here.

rokhayakebe 3 months ago

How will programming change when we reach 99-100%?

ASTRA: HackerRank's coding benchmark for LLMs