TechEcho

Genie: Best AI Software Engineer

23 points by 0xedb 9 months ago

9 comments

pacomerh 9 months ago
These tools remind me of the Dreamweaver era of solving problems no matter how the code looks.
uzumak 9 months ago
looks like they trained their model on SWE-bench and tried to submit https://github.com/swe-bench/experiments/pull/45
mronetwo 9 months ago
Sorry for not discussing the product itself, but...

I'm just not seeing a machine that is "likely correct", constantly interrupting the "operator", as that much of a win. I have seen some software influencers reflect on how much more fun it is to code after dropping the LLM assistant.

All of these feel like offerings to the Productivity God. As a salary guy I'll never get excited that I can do more during my work day. It's already easy to hit my capacity.
henning 9 months ago
So most of the time it still gets it wrong. And then when it gets it right it will still be subtly wrong. What a waste of electrical power and time.
difosfor 9 months ago
God I hate pages that hijack scrolling.
Y_Y 9 months ago
Any external verification of the benchmark results?
ramon156 9 months ago
I thought this was going to be a blog post and just turned out to be a "use our product!" jumpscare. I'll gladly pass.
Bjorkbat 9 months ago
Something I'm kind of curious about is the degree to which eval performance might be due to parts of the SWE-bench dataset getting into the latest LLM models.

A while back someone on Twitter seemed to confirm that Claude-3.5 was aware of the GitHub issues inside the dataset by mentioning them, but I couldn't find the original post.

30% performance on the full SWE-bench benchmark is quite the leap, but just how "real" of an achievement is this? Anecdotal reports mention that GPT-4o is marginally better than GPT-4 Turbo at best, and yet agents leveraging the LLM did perform better.

What would happen if SWE-bench was updated, top to bottom, with completely new GitHub issues? Would all these agents just completely shit the bed?
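One way to probe the contamination concern raised here would be to score agents only on issues filed after a model's training cutoff. This is a minimal sketch with hypothetical records (the instance IDs and dates are made up; real SWE-bench instances do carry an `instance_id` and creation metadata):

```python
from datetime import date

# Hypothetical records standing in for benchmark instance metadata.
instances = [
    {"instance_id": "repoA__issue-1", "created": date(2019, 3, 25)},
    {"instance_id": "repoB__issue-2", "created": date(2023, 3, 10)},
    {"instance_id": "repoC__issue-3", "created": date(2024, 6, 1)},
]

def post_cutoff(instances, cutoff):
    """Keep only issues filed after the model's training cutoff,
    so a high score can't come from memorized issue text."""
    return [i for i in instances if i["created"] > cutoff]

# With an assumed cutoff of 2023-04-30, only the 2024 issue survives.
fresh = post_cutoff(instances, date(2023, 4, 30))
print([i["instance_id"] for i in fresh])  # ['repoC__issue-3']
```

A "fresh" split like this wouldn't rule out subtler leakage (e.g. similar bugs recurring across repos), but a large score drop on post-cutoff issues would be strong evidence of memorization.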
log101 9 months ago
"…a human reasoning lab"

*closes the tab*