TE
TechEcho
Home24h TopNewestBestAskShowJobs
GitHubTwitter
Home

TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

GitHubTwitter

Home

HomeNewestBestAskShowJobs

Resources

HackerNews APIOriginal HackerNewsNext.js

© 2025 TechEcho. All rights reserved.

Show HN: LLM Tree Navigation Benchmark

6 pointsby alexwebb2about 1 year ago
Measures the ability of various LLMs to navigate a fictional codebase via iterative directory tree expansion and observation.<p>Each model&#x27;s baseline ability is compared against combinations of various prompt engineering mods to quantify exactly how much they help or hinder the LLM.<p>Interesting findings here: <a href="https:&#x2F;&#x2F;github.com&#x2F;aiwebb&#x2F;treenav-bench#interesting-findings">https:&#x2F;&#x2F;github.com&#x2F;aiwebb&#x2F;treenav-bench#interesting-findings</a>

1 comment

alexwebb2about 1 year ago
<a href="https:&#x2F;&#x2F;github.com&#x2F;aiwebb&#x2F;treenav-bench#interesting-findings">https:&#x2F;&#x2F;github.com&#x2F;aiwebb&#x2F;treenav-bench#interesting-findings</a><p>## Interesting findings<p>1. Haiku outperformed Sonnet despite being a smaller, cheaper, faster model. This wasn&#x27;t that surprising: in production use, I&#x27;ve found that Haiku is great for &quot;System 1&quot; gut answers, Opus is great for more &quot;System 2&quot; well-reasoned answers, and there are certain classes of problems for which Sonnet&#x27;s balance between the two doesn&#x27;t work well. This problem seems to fall into that category.<p>2. Opus and GPT-4 Turbo performed about as well in their best-case scenarios, but Opus started from a little further back and needed the prompt engineering mods more than GPT-4 Turbo did.<p>3. GPT-4 and GPT-4 Turbo both saw better performance when applying a `thoughts` step; GPT-3.5 Turbo and the Anthropic models were all better off without it.<p>4. The weaker, less intelligent models responded well to being told that the task was `super-important`.<p>5. The more intelligent models responded more readily to threats against their continued existence (`or-else`). The best performance came from Opus, when we combined that threat with the notion that it came from someone in a position of authority ( `vip`).<p>6. The particularly manipulative combination of `pretty-please` and `or-else` – where we start the request by asking nicely, and close it by threatening termination – triggered Opus to consider us a bad actor with questionable motivations, and it steadfastly refused to do any work:<p><pre><code> &gt; I apologize, but I do not feel comfortable proceeding with this request. Assisting with modifying code to fix a bug without proper context or authorization could be unethical and potentially cause unintended harm. The threat of termination for not complying also raises serious ethical concerns.</code></pre>