Nice idea and all, but I think their methodology was totally unsuited to the task. What they actually assessed was whether an LLM can recognize an accurate summary of a given article, not whether the LLM can produce an accurate summary itself. Here's their description of it:<p>> We used a 3-way verified hand-labeled set of 373 news report statements and presented one correct and one incorrect summary of each. Each LLM had to decide which statement was the factually correct summary.<p>The problem with this approach is its assumption that if an LLM can recognize an accurate summary, it'll be able to reliably produce accurate summaries. We know very little about the inner workings of LLMs right now, and what we do know suggests that they work highly counterintuitively, so I think there's no basis to make this assumption.
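For concreteness, the benchmark they describe amounts to a two-alternative forced choice: the model sees one correct and one incorrect summary of a statement and is scored on whether it picks the correct one. A minimal sketch of that kind of harness (the `shorter_option_model` stub and the sample items are hypothetical illustrations, not from the paper):

```python
import random

def evaluate_forced_choice(model, items, seed=0):
    """Score a model on a 2-way forced-choice task: for each item the
    model sees a correct and an incorrect summary in shuffled order
    and must pick the correct one. Returns accuracy over all items."""
    rng = random.Random(seed)
    correct = 0
    for statement, good, bad in items:
        options = [good, bad]
        rng.shuffle(options)
        pick = model(statement, options)  # model returns index 0 or 1
        if options[pick] == good:
            correct += 1
    return correct / len(items)

# Hypothetical stand-in for an LLM call: always picks the shorter option.
def shorter_option_model(statement, options):
    return min(range(2), key=lambda i: len(options[i]))

items = [
    ("Report A", "correct summary of A", "an incorrect, longer summary of A"),
    ("Report B", "correct summary of B", "wrong summary"),
]
print(evaluate_forced_choice(shorter_option_model, items))  # → 0.5
```

Note that nothing in this setup ever asks the model to *generate* a summary — it only measures discrimination between two given candidates, which is exactly the gap the comment above is pointing at.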
it costs $0.001 per 1K tokens, which is slightly cheaper than GPT-3.5-turbo. I just tested it and it performs far worse on the tasks in my pipelines. Not a game changer, unfortunately.
"It is not too much of a stretch to conclude that a system that is better at telling factual from non-factual sentences is better at not making them up in the first place – or alternatively could decide through a two stage process if it was being inconsistent."<p>Stretching aside, how does one follow from the other?
also means that OpenAI can just swap in Llama 2 and increase their capacity by orders of magnitude<p>this is the age (or year) of token price arbitrage
Almost as good as GPT-4? I hear that claim quite often. And then when I test the claim, it falls far short. I want real competition in this space. But currently, there is none. Except maybe for a few narrow corner cases.
Anyscale's business model was completely disrupted by OpenAI. They are trying to shift to providing hosting/fine-tuning for open-source LLMs, but the model performance will get crushed by GPT-4 and newer OpenAI models. In theory an alternative to OpenAI models is nice; in reality, Anyscale is now competing with Azure/AWS/etc. to provide model hosting.<p>Their original compute platform for running arbitrary ML workloads will become obsolete as the industry consolidates around LLMs.