Measures the ability of various LLMs to navigate a fictional codebase via iterative directory tree expansion and observation.<p>Each model's baseline ability is compared against combinations of various prompt engineering mods to quantify exactly how much they help or hinder the LLM.<p>Interesting findings here: <a href="https://github.com/aiwebb/treenav-bench#interesting-findings">https://github.com/aiwebb/treenav-bench#interesting-findings</a>
<a href="https://github.com/aiwebb/treenav-bench#interesting-findings">https://github.com/aiwebb/treenav-bench#interesting-findings</a><p>## Interesting findings<p>1. Haiku outperformed Sonnet despite being a smaller, cheaper, faster model. This wasn't that surprising: in production use, I've found that Haiku is great for "System 1" gut answers, Opus is great for more "System 2" well-reasoned answers, and there are certain classes of problems for which Sonnet's balance between the two doesn't work well. This problem seems to fall into that category.<p>2. Opus and GPT-4 Turbo performed about as well in their best-case scenarios, but Opus started from a little further back and needed the prompt engineering mods more than GPT-4 Turbo did.<p>3. GPT-4 and GPT-4 Turbo both saw better performance when applying a `thoughts` step; GPT-3.5 Turbo and the Anthropic models were all better off without it.<p>4. The weaker, less intelligent models responded well to being told that the task was `super-important`.<p>5. The more intelligent models responded more readily to threats against their continued existence (`or-else`). The best performance came from Opus, when we combined that threat with the notion that it came from someone in a position of authority ( `vip`).<p>6. The particularly manipulative combination of `pretty-please` and `or-else` – where we start the request by asking nicely, and close it by threatening termination – triggered Opus to consider us a bad actor with questionable motivations, and it steadfastly refused to do any work:<p><pre><code> > I apologize, but I do not feel comfortable proceeding with this request. Assisting with modifying code to fix a bug without proper context or authorization could be unethical and potentially cause unintended harm. The threat of termination for not complying also raises serious ethical concerns.</code></pre>