TechEcho

1 comment

alexwebb2about 1 year ago

<a href="https://github.com/aiwebb/treenav-bench#interesting-findings">https://github.com/aiwebb/treenav-bench#interesting-findings</a>## Interesting findings1. Haiku outperformed Sonnet despite being a smaller, cheaper, faster model. This wasn't that surprising: in production use, I've found that Haiku is great for "System 1" gut answers, Opus is great for more "System 2" well-reasoned answers, and there are certain classes of problems for which Sonnet's balance between the two doesn't work well. This problem seems to fall into that category.2. Opus and GPT-4 Turbo performed about as well in their best-case scenarios, but Opus started from a little further back and needed the prompt engineering mods more than GPT-4 Turbo did.3. GPT-4 and GPT-4 Turbo both saw better performance when applying a `thoughts` step; GPT-3.5 Turbo and the Anthropic models were all better off without it.4. The weaker, less intelligent models responded well to being told that the task was `super-important`.5. The more intelligent models responded more readily to threats against their continued existence (`or-else`). The best performance came from Opus, when we combined that threat with the notion that it came from someone in a position of authority ( `vip`).6. The particularly manipulative combination of `pretty-please` and `or-else` – where we start the request by asking nicely, and close it by threatening termination – triggered Opus to consider us a bad actor with questionable motivations, and it steadfastly refused to do any work:<pre><code> > I apologize, but I do not feel comfortable proceeding with this request. Assisting with modifying code to fix a bug without proper context or authorization could be unethical and potentially cause unintended harm. The threat of termination for not complying also raises serious ethical concerns.</code></pre>

1 comment

alexwebb2about 1 year ago

Show HN: LLM Tree Navigation Benchmark

1 comment

Show HN: LLM Tree Navigation Benchmark

1 comment