We help companies hire and upskill developers. A customer recently asked: what percentage of HackerRank problems can LLMs solve? That got us thinking: how should hiring evolve when AI can translate natural language into code?<p>Our belief: AI will handle much of code generation, so developers will increasingly be assessed on SDLC skills exercised with AI assistants.<p>To explore this, we’re benchmarking LLMs on real-world software development scenarios, starting with 65 unseen problems across 10 domains. Beyond correctness, we also evaluate consistency, an often overlooked aspect of AI reliability. We’re open-sourcing the dataset on Hugging Face and expanding it to cover more domains, ambiguous specs, and harder challenges.<p>Would love the HN community’s take on this!
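For anyone who wants to poke at the data, here is a minimal sketch of loading it with the Hugging Face datasets library. The dataset ID below is an assumed placeholder, not a confirmed repo name, so check the actual listing once it’s published:

    # Sketch only: "hackerrank/astra-benchmark" is an assumed ID, not confirmed.
    from datasets import load_dataset

    ds = load_dataset("hackerrank/astra-benchmark", split="train")
    print(ds[0].keys())  # inspect the schema before relying on field names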
Seems like a very limited subset of software development to be basing a benchmark on.<p>Where’s the kernel dev? Where’s the embedded dev? Where’s the throwaway Python script?
> To mimic real-world development, HackerRank’s ASTRA Benchmark Dataset includes, on average, 12 source code and configuration files per question as model inputs.<p>How is 12 files "real-world development"? My hobby project currently has 142 files, and most non-trivial changes would involve adding a new file. My small work project has 79, and similarly, any non-trivial change will need to add a file. These are <i>small codebases</i>. My previous team managed ~450k lines across a few thousand files pretty effectively with 6 engineers.<p>Getting the right answer out of an LLM for these sorts of tasks is fine if you give it so little context that it’s effectively a greenfield task, which is what most of these problems end up being. But giving a model a whole codebase and expecting the right answer, or choosing the right subset of that codebase to hand over, are still big unsolved problems.<p>At this point it honestly feels a bit like gaslighting to suggest that a 12-file NodeJS server is representative of software engineering.
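To make that concrete, here is a toy sketch of what "choosing the right subset" looks like as a naive heuristic: rank files by keyword overlap with the task description and keep the top dozen. The names here (rank_files, top_k) are hypothetical illustration, and a real tool would need dependency graphs, embeddings, edit history, and more; the point is how much signal a simple cut like this throws away:

    # Toy heuristic, not a real context selector: score each file by how many
    # words it shares with the task description, then keep the top_k files.
    import os, re

    def rank_files(repo_root: str, task: str, top_k: int = 12):
        task_terms = set(re.findall(r"\w+", task.lower()))
        scored = []
        for dirpath, _, filenames in os.walk(repo_root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, encoding="utf-8") as f:
                        words = set(re.findall(r"\w+", f.read().lower()))
                except (UnicodeDecodeError, OSError):
                    continue  # skip binaries and unreadable files
                scored.append((len(task_terms & words), path))
        scored.sort(reverse=True)  # highest overlap first
        return [path for _, path in scored[:top_k]]

Whatever lands in those 12 slots is all the model ever sees of the other few thousand files.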