This post is misleading, in a way that is hard to do accidentally.<p><pre><code> - They compare the performance of this model to the weakest 7B Code Llama variant. The base Code Llama 7B Python model scores 38.4% on HumanEval, versus the non-Python model, which only scores 33%.
 - They compare their instruct-tuned model to non-instruct-tuned models. Instruction tuning can add 20% or more to HumanEval performance. For example, WizardLM 7B scores 55% on HumanEval [1], and I've trained a 7B model that scores 62% [2].
 - For another example of the effect of instruction tuning: the StableCode instruct-tuned model benchmarks at 26%, not the 20% they cite for the base model [3]
 - StarCoder, when prompted properly, scores 40% on HumanEval [4] (see the sketch after this list for what "prompted properly" typically means)
- They do not report their base model performance (as far as I can tell)
</code></pre>
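For context on where these HumanEval numbers come from: they are typically pass@1 scores over 164 Python problems, usually produced with something like OpenAI's human-eval harness, and the prompt format matters a great deal. Below is a minimal sketch of that loop; the checkpoint name, the generate() helper, and the instruction template are illustrative assumptions, not the setup used in the post or by any of the models above.<p><pre><code># Sketch of HumanEval sampling with the openai/human-eval harness.
# Assumptions: the model checkpoint, generation settings, and the instruct
# template below are illustrative, not taken from the post.
from human_eval.data import read_problems, write_jsonl
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "your-org/your-7b-code-model"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def generate(prompt, max_new_tokens=384):
    # Greedy decoding for simplicity; reported scores are usually pass@1.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Return only the continuation, which is what HumanEval expects as the completion.
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def base_prompt(problem):
    # Base models are scored on the raw signature + docstring.
    return problem["prompt"]

def instruct_prompt(problem):
    # Generic Alpaca-style wrapper; every instruct model has its own template,
    # and evaluating without the right template can cost many points.
    return ("Below is an instruction that describes a task.\n\n"
            "### Instruction:\nComplete the following Python function.\n\n"
            + problem["prompt"] + "\n\n### Response:\n")

problems = read_problems()
samples = [dict(task_id=task_id,
                completion=generate(base_prompt(problems[task_id])))
           for task_id in problems]
write_jsonl("samples.jsonl", samples)
# Score the output with:  evaluate_functional_correctness samples.jsonl
</code></pre>Swapping base_prompt for instruct_prompt (plus stripping the echoed signature from the response) is the kind of difference that moves these benchmarks by double digits, which is why comparing an instruct-tuned model against base-model numbers is apples to oranges.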
This is interesting work, and a good contribution, but it's important to compare similar models.<p>[1] <a href="https://github.com/nlpxucan/WizardLM">https://github.com/nlpxucan/WizardLM</a><p>[2] <a href="https://huggingface.co/vikp/llama_coder" rel="nofollow noreferrer">https://huggingface.co/vikp/llama_coder</a><p>[3] <a href="https://stability.ai/blog/stablecode-llm-generative-ai-coding" rel="nofollow noreferrer">https://stability.ai/blog/stablecode-llm-generative-ai-coding</a><p>[4] <a href="https://github.com/huggingface/blog/blob/main/starcoder.md">https://github.com/huggingface/blog/blob/main/starcoder.md</a>