ASTRA: HackerRank's coding benchmark for LLMs

51 points by rvivek 3 months ago
We help companies hire & upskill developers. A customer recently asked: What % of HackerRank problems can LLMs solve? That got us thinking: how should hiring evolve when AI can translate natural language to code?

Our belief: AI will handle much of code generation, so developers will be assessed more on SDLC skills with AI assistants.

To explore this, we’re benchmarking LLMs on real-world software dev scenarios, starting with 65 unseen problems across 10 domains. Beyond correctness, we evaluated consistency, an often overlooked aspect of AI reliability. We’re open-sourcing the dataset on Huggingface and expanding it to cover more domains, ambiguous specs, and harder challenges.

Would love the HN community’s take on this!
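A rough sketch of what scoring correctness and consistency separately could look like. The post does not define either metric, so this assumes each problem is run several times, correctness is the mean score across runs, and consistency is summarized by the spread (population standard deviation) of those scores per problem. The function name, problem ids, and scores below are all hypothetical and are not taken from the ASTRA dataset.

```python
from statistics import mean, pstdev


def summarize_runs(scores_by_problem):
    """Summarize repeated runs of a model on a set of problems.

    scores_by_problem maps a problem id to a list of scores in [0, 1],
    one score per independent run (an assumed layout, not ASTRA's format).
    """
    # Correctness: average each problem over its runs, then average problems.
    per_problem_mean = [mean(runs) for runs in scores_by_problem.values()]
    # Consistency: how much the score moves between runs of the same problem.
    # A lower spread means the model behaves more consistently.
    per_problem_spread = [pstdev(runs) for runs in scores_by_problem.values()]
    return {
        "average_score": mean(per_problem_mean),
        "average_spread": mean(per_problem_spread),
    }


# Made-up scores for two hypothetical problems.
example = {
    "problem_a": [1.0, 1.0, 1.0],       # always solved: spread 0.0
    "problem_b": [0.0, 1.0, 0.0, 1.0],  # solved half the time: spread 0.5
}
print(summarize_runs(example))
```

Two models with the same average score can differ a lot in spread, which is why reporting the two numbers separately is useful.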

5 comments

bobnamob 3 months ago
Seems like a very limited subset of software development to be basing a benchmark on.

Where’s the kernel dev? Where’s the embedded dev? Where’s the throwaway Python script?
sosuke 3 months ago
No Hugging Face models, or did I just miss them? Edit: they mention doing open models at some point at the bottom of the page.
danpalmer 3 months ago
> To mimic real-world development, HackerRank’s ASTRA Benchmark Dataset includes, on average, 12 source code and configuration files per question as model inputs.

How is 12 files "real-world development"? My hobby project currently has 142 files and most non-trivial changes would involve adding a new file. My small work project has 79 and similarly, any non-trivial changes will need to add a file. These are *small codebases*. My previous team was ~450k lines across ~thousands of files, and we managed that pretty effectively with 6 engineers.

Getting the right answer out of an LLM for these sorts of tasks is fine if you give it little enough context that it's an effectively greenfield task as most of these problems end up being. But giving them a whole codebase and expecting the right answer, or the process of choosing the right subset to give them, are still big unsolved problems.

At this point it honestly feels a bit like gaslighting, suggesting that a 12 file NodeJS server is representative of software engineering.
rushingcreek 3 months ago
Would love to see how DeepSeek R1 compares to O1 here.
rokhayakebe 3 months ago
How will programming change when we reach 99-100%?