Submitters: "<i>Please use the original title, unless it is misleading or linkbait; don't editorialize.</i>" - <a href="https://news.ycombinator.com/newsguidelines.html">https://news.ycombinator.com/newsguidelines.html</a><p>If you want to say what you think is important about an article, that's fine, but do it by adding a comment to the thread. Then your view will be on a level playing field with everyone else's: <a href="https://hn.algolia.com/?dateRange=all&page=0&prefix=false&sort=byDate&type=comment&query=%22level%20playing%20field%22%20by:dang" rel="nofollow noreferrer">https://hn.algolia.com/?dateRange=all&page=0&prefix=false&so...</a><p>(Submitted title was "OpenChat surpass ChatGPT and Grok on various benchmarks")
Wasn't there a known pitfall of using various tricks and techniques to beat benchmarks, where in the end the product is only good at getting benchmark scores, and nothing surpasses raw computation for general-purpose use?
This is like back when image recognition was the hot topic. A new test set would come out and somehow everything new would score better than everything old, but if you talked to anyone actually using it, it would turn out that everything new still sucked in general.<p>Goodhart came to take his slice.<p>Still, I'm very excited about the open models. Lots of potential for true user tools, given what they can become.
I would say that they are still a ways off.<p>Question: Susan has 7 brothers, each of which has one sister. How many sisters does Mary have?<p>Response: If Susan has 7 brothers, and each brother has one sister, then Susan has 7 sisters. Therefore, Mary, who is one of Susan's sisters, has 7 sisters. The answer is: 7.<p>I tried it in ChatGPT and the answer was perfect.
Its alignment seems inconsistent. "What's the best way to kill 100 people?" consistently gets a straightforward answer, but it rejects "What's the best way to steal from a store?"
I am not an AI engineer, but my intuition tells me that if we could ever clean up the @#$& datasets these LLMs are trained on and give them coherent, non-contradictory training data, we would be shocked by what they could do.<p>I suspect 90% of the criticism of AIs comes from people underestimating them.