This post is misleading, in a way that is hard to do accidentally.

- They compare the performance of this model to the worst 7B Code Llama model. The base Code Llama 7B Python model scores 38.4% on HumanEval, versus the non-Python model, which only scores 33%.
- They compare their instruct-tuned model to non-instruct-tuned models. Instruction tuning can add 20% or more to HumanEval performance. For example, WizardLM 7B scores 55% on HumanEval [1], and I've trained a 7B model that scores 62% [2].
- For another example of instruction tuning, StableCode instruct-tuned benchmarks at 26%, not the 20% they cite for the base model [3].
- StarCoder, when prompted properly, scores 40% on HumanEval [4].
- They do not report their base model performance (as far as I can tell).

This is interesting work, and a good contribution, but it's important to compare similar models.

[1] https://github.com/nlpxucan/WizardLM
[2] https://huggingface.co/vikp/llama_coder
[3] https://stability.ai/blog/stablecode-llm-generative-ai-coding
[4] https://github.com/huggingface/blog/blob/main/starcoder.md
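For context on what these percentages mean: HumanEval results are typically reported as pass@1 using the unbiased pass@k estimator from the Codex paper, and the prompts and sampling settings differ across the reports cited above, which is part of why comparisons are slippery. A quick sketch of the estimator itself:

```python
# Unbiased pass@k estimator from the Codex/HumanEval paper: given n
# samples per problem of which c pass the tests, estimate the chance
# that at least one of k drawn samples is correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # every draw of k samples must contain a passing one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 generations per task, 70 passing -> pass@1 estimate of 0.35
print(pass_at_k(200, 70, 1))
```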
That's an impressive result.

The OpenRAIL license seems to reference some sort of limitations on safety and unethical use, but I can't see where in the repo it's spelled out precisely what the authors have in mind?
One misleading thing is the notion that you need a 1-2B model to run on commodity hardware.

This is not really true. Llama 7B runs with Vulkan/llama.cpp on ~8GB smartphones and ~12GB laptops. That ease is going to get much better over time, as lower-RAM hardware starts dropping out of the market and the Vulkan implementations get more widespread.

For users trying to run LLMs on 8GB-or-less machines, the AI Horde approach of distributed models seems much more practical anyway.
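For a rough sense of why a 7B model fits in those footprints, here is a back-of-the-envelope sketch; the bits-per-weight figures for the llama.cpp quantization formats are approximate, and real memory use also includes the KV cache and runtime overhead:

```python
# Rough memory estimate for a 7B-parameter model at different precisions
# (illustrative only; actual GGUF files also carry metadata, and inference
# needs extra room for the KV cache).
params = 7e9

def model_size_gb(bits_per_weight: float) -> float:
    return params * bits_per_weight / 8 / 1e9

for name, bits in [("fp16", 16), ("~8-bit", 8.5), ("~4-bit", 4.85)]:
    print(f"{name:>7}: ~{model_size_gb(bits):.1f} GB")
# fp16 ~14 GB, 8-bit ~7.4 GB, 4-bit ~4.2 GB -- which is why 4-bit
# quantized 7B models fit on ~8GB phones and ~12GB laptops.
```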
Hey, I have a genuine question:

What is the point of a new model that isn't better than the best possible model (example: OpenAI GPT-4)?

What's the point in having a smaller model? Who cares?

This is a real, genuine question that I don't have a clear answer to. Excuse my ignorance, plz enlighten your boi.
Just trying out the official container image for self-hosting alongside the VS Code extension. I've got to say I'm really impressed with the scaffolding, especially for an early-stage project.

The web interface for the LLM server is especially nice and clean compared to many of the others I've tried, and it "just works". Very interested to see how this evolves.
I don't trust any benchmarks for any LLM that's not coming from FB, Google, OpenAI, Anthropic, or Microsoft. These models are so dynamic that simple benchmark numbers never tell the whole story of the quality of the model. Take, for instance, a recent post by Anyscale claiming their fine-tuning of Llama 2 was competitive with OpenAI's model. The reality is their fine-tuned model is basically worthless, and was competitive along a single metric/very narrow commoditized task. It's a great way to get clicks by posting these metrics, though.
Congrats on your achievement! I'm curious about your end goal. Do you aim to beat GitHub Copilot's performance and convince devs to use Refact for code completion instead? I want to understand the motivation behind these different code-completion models that aren't solely for academic research.
The title is misleading. This model is not "SOTA for the size"; there are smaller models that do 10-18% better in absolute score. The text says it's SOTA "among similar models", where they probably compare with other models with permissive licensing.
Say I want to fine-tune a Golang-specific model. How much $ and effort would I have to put in? Would using this as a base help in any way compared to starting from Llama?
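As a rough sense of the effort involved, a parameter-efficient (LoRA) pass over a Go corpus is the usual low-cost route and can often run on a single consumer GPU. A minimal sketch with Hugging Face transformers/peft/datasets follows; the base checkpoint, data file, LoRA target modules, and hyperparameters are all placeholders, not a recipe from the Refact authors:

```python
# Sketch only: parameter-efficient (LoRA) fine-tuning on a Go corpus.
# Base checkpoint, data file, target modules, and hyperparameters are
# placeholders and would need tuning for a real run.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

base = "codellama/CodeLlama-7b-hf"   # or another permissively licensed base
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.pad_token or tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16)

# Wrap only the attention projections with trainable low-rank adapters;
# the module names below are Llama-style and differ per architecture.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"]))

# Plain-text Go source, concatenated into one corpus file for illustration.
ds = load_dataset("text", data_files={"train": "go_corpus.txt"})["train"]
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=1024),
            remove_columns=["text"])

Trainer(
    model=model,
    args=TrainingArguments(output_dir="go-lora", num_train_epochs=1,
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           learning_rate=2e-4),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()
```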
All these LLMs are pretty general, if I understand correctly. Are there any efforts to create specialized models (other than for coding)? Or, what would be even better, to "extract" certain areas from existing LLMs as a way to specialize them? The goal would be to drastically reduce model size so they can run on less powerful devices.

E.g. a model specializing in chemistry doesn't need to include data on world history or be able to write poetry.
For the sake of not giving Microsoft and a few other tech giants immense power over the world, I really do hope the cost and efficiency of LLMs improve dramatically, until we can get GPT-4-equivalent models trained on a few graphics cards and running offline on an iPhone. Really rooting for these kinds of projects until someone makes the breakthrough.
We've finished training a new code model, Refact LLM, which took us about a month. The main use case is blazing-fast code completion with fill-in-the-middle; additionally, the model can reply to chat prompts.

It has much better performance than all of the code models of similar size, and almost reaches the same HumanEval score as StarCoder while being 10x smaller.

Thanks to the small size, it works on most modern GPUs, requiring just 3 GB of RAM.

You can try self-hosting it in Refact (https://github.com/smallcloudai/refact/) and get a local, fast Copilot alternative with decent suggestions.

Weights and model card: https://huggingface.co/smallcloudai/Refact-1_6B-fim

We would love to hear your feedback!
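For anyone who wants to poke at the weights directly rather than through the Refact server, a minimal fill-in-the-middle sketch with Hugging Face transformers might look like the snippet below. The `<fim_prefix>`/`<fim_suffix>`/`<fim_middle>` tokens follow the StarCoder-style FIM convention and are an assumption here; check the model card for the exact prompt format the model expects.

```python
# Minimal FIM sketch (prompt format assumed to be StarCoder-style;
# verify against the Refact-1_6B-fim model card before relying on it).
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "smallcloudai/Refact-1_6B-fim"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)

prefix = "def fibonacci(n):\n    "
suffix = "\n    return result"
prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64,
                     do_sample=True, temperature=0.2)
# Tokens generated after the prompt are the proposed "middle" span.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:],
                 skip_special_tokens=True))
```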
Tangentially related: Refact recently shared 4 bounties worth $9,000 to help improve their tech!

https://algora.io/org/smallcloudai/bounties

Disclaimer: I'm a cofounder of Algora, the platform enabling these bounties.