
Phi-3 Technical Report

411 points by varunvummadi about 1 year ago

18 comments

modeless about 1 year ago
Everyone needs to take these benchmark numbers with a big grain of salt. According to what I've read, Phi-2 was much worse than its benchmark numbers suggested. This model follows the same training strategy. Nobody should be assuming these numbers will translate directly into a high ranking on the LMSYS leaderboard, or usefulness in everyday tasks. Let's not dethrone Llama 3 until some real-world testing can be done.

That said, I don't think it's impossible for a small model to be very good. I see their "synthetic data" as essentially a way of distilling GPT-4 into smaller models. It would be exciting if a large fraction of the performance of huge models could be transferred to small ones! If true, then Chinchilla-optimal training could make sense again, as you could optimally train a ginormous model and then distill it afterward for efficient inference.
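The "distilling GPT-4 via synthetic data" idea above is essentially sequence-level distillation: a large teacher model writes training text, and a small student is trained on it with the ordinary next-token loss. The sketch below only illustrates that pattern under stated assumptions; the model names, prompts, and tiny training loop are placeholders, not the paper's actual pipeline.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    teacher_name = "gpt2-large"  # stand-in for a large teacher (the real teacher would sit behind an API)
    student_name = "gpt2"        # stand-in for a small student model

    tok = AutoTokenizer.from_pretrained(teacher_name)
    teacher = AutoModelForCausalLM.from_pretrained(teacher_name).eval()
    student = AutoModelForCausalLM.from_pretrained(student_name)

    # 1) The teacher generates synthetic training text from seed prompts.
    prompts = [
        "Explain photosynthesis to a ten-year-old.",
        "Write a short Python function that reverses a list.",
    ]
    synthetic_texts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        with torch.no_grad():
            out = teacher.generate(ids, max_new_tokens=128, do_sample=True, top_p=0.9)
        synthetic_texts.append(tok.decode(out[0], skip_special_tokens=True))

    # 2) The student trains on the synthetic corpus with the standard LM loss.
    opt = torch.optim.AdamW(student.parameters(), lr=1e-5)
    for text in synthetic_texts:
        batch = tok(text, return_tensors="pt", truncation=True, max_length=512)
        loss = student(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        opt.step()
        opt.zero_grad()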
oersted about 1 year ago
Incredible, rivals Llama 3 8B with 3.8B parameters, less than a week after Llama 3's release.

And on LMSYS English, Llama 3 8B is on par with GPT-4 (not GPT-4-Turbo), as well as Mistral-Large.

Source: https://chat.lmsys.org/?leaderboard (select English in the dropdown)

So we now have an open-source LLM approximately equivalent in quality to GPT-4 that can run on phones? Kinda? Wild.

(I'm sure there's a lot of nuance to it, for one these benchmarks are not so hard to game, we'll see how the dust settles, but still...)

Phi-3-mini 3.8b: 71.2
Phi-3-small 7b: 74.9
Phi-3-medium 14b: 78.2
Phi-2 2.7b: 58.8
Mistral 7b: 61.0
Gemma 7b: 62.0
Llama-3-In 8b: 68.0
Mixtral 8x7b: 69.9
GPT-3.5 1106: 75.3

(these are averages across all tasks for each model, but looking at individual scores shows a similar picture)
visarga about 1 year ago
This shows the power of synthetic content - 3.3 trillion tokens! This approach can make a model even smaller and more efficient than training on organic text, and it will not be able to regurgitate NYT articles because it hasn't seen any of them. This is how copyright infringement claims can be placated.
pkoiralap about 1 year ago
They have started putting some of the models on Hugging Face: https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3
mythz about 1 year ago
I'll believe it when I try it for myself; Phi-2 was the clear worst of the 20 LLMs we evaluated (it was also the smallest, so that was expected).

But it was slow for its size, generated the longest responses with the most hallucinations, and also produced the most empty responses. It was also the model ranked with the lowest-quality answers.
ein0p about 1 year ago
Tried it: as soon as you ask something outside the head of the likely training data distribution it starts hallucinating like crazy. This isn’t surprising to me as a researcher: you need the associative memories of a larger model to cover the tail with at least something. That said, it’ll likely work well at specific narrow tasks once fine tuned. Just don’t expect it to really “beat GPT-3.5” at the general chat use case
brcmthrowaway about 1 year ago
If I were Apple I'd be quaking in my boots. They are getting too far behind to ever catch up. Nokia in 2010 vibes.
m3kw9 about 1 year ago
Phi-2 was useless for practical purposes, except to show your friends that it can write a poem. Llama 3 8B was slightly better but is still in the same category, and it's complete trash at coding vs GPT-4. Llama 3 400B "iS OPen SoURce!", but no, you will need to pay for access, because most people cannot practically afford an A100 and set it up properly.

What I'm trying to say is that user experience is now as key as model smarts, and these barely-touching-GPT-4 models cannot beat OpenAI right now as a whole package.
abidlabs about 1 year ago
Hugging Face paper page and discussion: https://huggingface.co/papers/2404.14219
blackoil about 1 year ago
Has anyone used these or similar models with fine-tuning and RAG? How is the performance over a narrow domain for simple queries? Is it good enough for, say, an informational chat bot?
anticensor about 1 year ago
This paper broke ArXiv's HTML generator: https://github.com/arXiv/html_feedback/issues/1090
ur-whale about 1 year ago
That&#x27;s a whole lot of Zhangs!
smartmic about 1 year ago
Hm, around 84 authors on one "scientific" paper. I wonder if this says something about (a) the quality of its content, (b) where academic (?) paper publishing is headed, (c) nothing at all, or (d) something else entirely.
simonw about 1 year ago
I'm getting a bit skeptical of MMLU at this point. As far as I can tell it's a set of multiple-choice questions that hasn't been updated since 2020. We have to trust the model providers not to deliberately or accidentally train on it for those scores to be useful.
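For context on how MMLU-style scores are typically produced: a common way to evaluate a causal LM on multiple-choice questions is to append each answer option to the prompt and pick the option the model assigns the highest log-likelihood. The sketch below only illustrates that scoring pattern; the model name and the sample question are placeholders rather than items from MMLU, and real harnesses add few-shot prompting and length normalization.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "gpt2"  # stand-in for whichever model is being evaluated
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id).eval()

    question = "Question: What is the powerhouse of the cell?\nAnswer:"
    options = [" The mitochondrion", " The ribosome", " The nucleus", " The Golgi apparatus"]

    def option_logprob(prompt, option):
        # Sum log-probabilities of the option tokens, conditioned on the prompt.
        # Assumes the prompt's tokenization is a prefix of prompt+option (a reasonable
        # approximation for BPE tokenizers when the option starts with a space).
        prompt_ids = tok(prompt, return_tensors="pt").input_ids
        full_ids = tok(prompt + option, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(full_ids).logits
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
        option_positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
        return sum(log_probs[i, full_ids[0, i + 1]].item() for i in option_positions)

    scores = [option_logprob(question, o) for o in options]
    print(options[scores.index(max(scores))])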
Havoc about 1 year ago
Both previous Phi models have been epic letdowns when I actually tried them myself, so I have quite low confidence in this being reflective of the real world. Will try it anyway, though.
homarp about 1 year ago
The weights have been released: 4k context https://huggingface.co/microsoft/Phi-3-mini-4k-instruct and 128k context https://huggingface.co/microsoft/Phi-3-mini-128k-instruct
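If you want to try the released weights locally, a minimal sketch with the Hugging Face transformers library might look like the following; the dtype and generation settings are illustrative assumptions, not an official recipe.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "microsoft/Phi-3-mini-4k-instruct"
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # assumption: hardware with bf16 support; use float32 otherwise
        trust_remote_code=True,      # the repo ships custom modeling code
    )

    messages = [{"role": "user", "content": "Summarize the Phi-3 technical report in one sentence."}]
    input_ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
    output_ids = model.generate(input_ids, max_new_tokens=128)
    print(tok.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))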
hackerlight about 1 year ago
Fewer tokens than Llama 3 (3.3T vs 15T), yet a better outcome. No doubt the training data is more information-dense. The interesting thing is the use of synthetic data, which they don't talk about.
maximsicora about 1 year ago
insane