I think the best way to try this out is with LLaVA, the text+image model (like GPT-4 Vision). Here are steps to do that on macOS (which should work the same on other platforms too, though I haven't tried that yet):<p>1. Download the 4.26GB llamafile-server-0.1-llava-v1.5-7b-q4 file from <a href="https://huggingface.co/jartine/llava-v1.5-7B-GGUF/blob/main/llamafile-server-0.1-llava-v1.5-7b-q4" rel="nofollow noreferrer">https://huggingface.co/jartine/llava-v1.5-7B-GGUF/blob/main/...</a>:<p><pre><code> wget https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llamafile-server-0.1-llava-v1.5-7b-q4
</code></pre>
2. Make that binary executable, by running this in a terminal:<p><pre><code> chmod 755 llamafile-server-0.1-llava-v1.5-7b-q4
</code></pre>
3. Run your new executable, which will start a web server on port 8080:<p><pre><code> ./llamafile-server-0.1-llava-v1.5-7b-q4
</code></pre>
4. Navigate to <a href="http://127.0.0.1:8080/" rel="nofollow noreferrer">http://127.0.0.1:8080/</a> to upload an image and start chatting with the model about it in your browser.<p>Screenshot here: <a href="https://simonwillison.net/2023/Nov/29/llamafile/" rel="nofollow noreferrer">https://simonwillison.net/2023/Nov/29/llamafile/</a>
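If you'd rather script it than use the browser: the llamafile server is built on llama.cpp's server example, so (assuming this build exposes the usual /completion endpoint, which I haven't double-checked) you should also be able to query it from the command line, roughly like this:<p><pre><code> curl http://127.0.0.1:8080/completion -H "Content-Type: application/json" \
   -d '{"prompt": "Describe a llama in one sentence.", "n_predict": 64}'
</code></pre>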
Extremely cool and Justine Tunney / jart does incredible portability work [0], but I'm kind of struggling with the use-cases for this one.<p>I make a small macOS app [1] which runs llama.cpp with a SwiftUI front-end. For the first version of the app I was obsessed with the single download -> chat flow and making 0 network connections. I bundled a model with the app and you could just download, open, and start using it. Easy! But as soon as I wanted to release a UI update to my TestFlight beta testers, I was causing them to download another 3GB. All 3 users complained :). My first change after that was decoupling the default model download and the UI so that I can ship app updates that are about 5MB. It feels like someone using this tool is going to hit the same problem pretty quickly when they want to get the latest llama.cpp updates (ggerganov SHIIIIPS [2]). Maybe there are cases where that doesn't matter; I'd love to hear where people think this could be useful.<p>[0]: <a href="https://justine.lol/cosmopolitan/" rel="nofollow noreferrer">https://justine.lol/cosmopolitan/</a><p>[1]: <a href="https://www.freechat.run" rel="nofollow noreferrer">https://www.freechat.run</a><p>[2]: <a href="https://github.com/ggerganov/llama.cpp">https://github.com/ggerganov/llama.cpp</a>
Hmmm no matter what image I upload, it's somehow not seeing it.<p>Got a chuckle out of this, though:
---<p>User: Describe the image in the most detailed way possible<p>Llama: The image is a close-up of an empty, dark room with no visible objects or people. It appears to be a black background photo that highlights the darkness and emptiness of the space. This could potentially represent an artistic representation or an actual setting where lighting has been used to emphasize the contrast between light and shadow. The focus on this particular area suggests that it might have some significance, whether symbolic or practical in nature. Overall, the image evokes a sense of mystery and intrigue due to its lack of context and content.<p>User: is it just a black square?<p>Llama: Yes, it is just a black square.
Simon Willison has a great post on this <a href="https://simonwillison.net/2023/Nov/29/llamafile/" rel="nofollow noreferrer">https://simonwillison.net/2023/Nov/29/llamafile/</a>
It’s intriguing to observe how this model is hallucinating. I presented it with Albert Bierstadt’s painting “In the Mountains” (<a href="https://www.wikiart.org/en/albert-bierstadt/in-the-mountains-1867" rel="nofollow noreferrer">https://www.wikiart.org/en/albert-bierstadt/in-the-mountains...</a>). To my surprise, it maintains that cows are scattered throughout the scene, despite being told otherwise.<p>User:<p>Llama: The image features a serene landscape with mountains in the background and a river flowing through it. A picturesque scene unfolds as two trees are visible near the water, adding to the natural beauty of the environment. In addition to these elements, there is an interesting detail: several cows can be seen standing on or around the grassy area next to the body of water. The presence of these animals creates a sense of tranquility and harmony with nature in this idyllic setting.<p>User: There are no cows. Could it be that you've mistaken rocks for cows?<p>Llama: No, the rocks in this image are not mistaken for cows. They can be seen near a lake with mountains in the background and trees surrounding it. The painting captures a serene landscape where nature is beautifully displayed. This picturesque scene showcases both the majestic mountain range and the tranquil body of water, creating an idyllic atmosphere that evokes feelings of peacefulness and relaxation.<p>User: There are no cows!<p>Llama: The image depicts a serene scene of mountains and water. A beautiful lake is surrounded by majestic snow-capped mountains, creating an idyllic landscape. In the foreground, there are trees adding to the natural beauty of the environment.<p>A few birds can be seen flying in the sky above the lake, further enhancing the peaceful atmosphere. The presence of a boat on the water adds another element of leisure and enjoyment to this picturesque scene. Overall, it is an inviting image that showcases nature's wonders with its combination of mountains, lake, trees, birds, and boats.
Related: <a href="https://hacks.mozilla.org/2023/11/introducing-llamafile/" rel="nofollow noreferrer">https://hacks.mozilla.org/2023/11/introducing-llamafile/</a> and <a href="https://twitter.com/justinetunney/status/1729940628098969799" rel="nofollow noreferrer">https://twitter.com/justinetunney/status/1729940628098969799</a><p>(via <a href="https://news.ycombinator.com/item?id=38463456">https://news.ycombinator.com/item?id=38463456</a> and <a href="https://news.ycombinator.com/item?id=38464759">https://news.ycombinator.com/item?id=38464759</a>, but we merged the comments hither)
Wow, it has CUDA support even though it's built with Cosmopolitan? Awesome, I see Cosmopolitan just this month added some support for dynamic linking specifically to enable GPUs! This is amazing, I'm glad they found a way to do this. <a href="https://github.com/jart/cosmopolitan/commit/5e8c928f1a37349a8c72f0b6aae5e535eace3f41">https://github.com/jart/cosmopolitan/commit/5e8c928f1a37349a...</a><p>I see it unfortunately requires the CUDA developer toolkit to be installed. It's totally possible to distribute CUDA apps that run without any dependencies installed other than the Nvidia driver. If they could figure <i>that</i> out it would be a game changer.
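For what it's worth, a minimal sketch of the usual approach (not specific to llamafile): statically link the CUDA runtime so the only run-time dependency is the Nvidia driver, e.g. via nvcc's -cudart flag:<p><pre><code> nvcc -o myapp myapp.cu -cudart static
</code></pre><p>Whether that approach can be reconciled with how llamafile/Cosmopolitan links in GPU support is a separate question.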
> Stick that file on a USB stick and stash it in a drawer as insurance against a future apocalypse. You’ll never be without a language model ever again.<p><3
> Windows also has a maximum file size limit of 2GB for executables. You need to have llamafile and your weights be separate files on the Windows platform.<p>The 4GB .exe ran fine on my Windows 10 64-bit system.
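For reference, the split setup the README describes would presumably look something like this on Windows (the -m flag is llama.cpp's standard way of pointing at an external GGUF weights file; the filenames here are just illustrative):<p><pre><code> llamafile-server-0.1.exe -m llava-v1.5-7b-q4.gguf
</code></pre>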
Great! Worked easily on desktop Linux, first try. It appears to execute with zero network connections. I added a 1200x900 photo from a journalism project and asked "please describe this photo". In 4GB of RAM, it took between two and three minutes to execute with CPU-only support. The response was of mixed value. On the one hand, it described "several people appear in the distance", but no, it was brush and trees in the distance, no other people. There was a single figure of a woman walking with a phone in the foreground, which was correctly described by this model. The model did detect 'an atmosphere suggesting a natural disaster' and that is accurate.<p>Thanks to Mozilla and Justine Tunney for this very easy, local experiment today!
Can confirm that this runs on an ancient i3 NUC under Ubuntu 20.04. It emits a token every five or six seconds, which is "ask a question then go get coffee" speed. Still, very cool.
From a technical standpoint, this project is really fascinating. I can see a lot of use cases for getting something up fast locally for an individual user.<p>But for anyone in a production/business setting, it would be tough to see this being viable. It seems like it would be a non-starter for most medium-to-large companies' IT teams. The great thing about a Dockerfile is that it can be inspected and the install process is relatively easy to understand.
> you pass the --n-gpu-layers 35 flag (or whatever value is appropriate) to enable GPU<p>This is a bit like specifying how large your strings will be to a C program. That was maybe acceptable in the old days, but not really anymore.
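For context, a sketch of the invocation being complained about, using the binary name from the instructions above (35 is just the value suggested for this particular model, not something the program works out for you):<p><pre><code> ./llamafile-server-0.1-llava-v1.5-7b-q4 --n-gpu-layers 35
</code></pre>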
This puts a super great evil happy grin on my face. I am going to add it in the next version of L2E OS! Thank you jart, thank you mozilla! Love you folks!
Small field test: I uploaded a picture of a typical small Korean Buddhist temple, with a stone pagoda in front. Anyone at all familiar with East Asian Buddhism would instantly recognize both the pagoda and the temple behind it as Korean.<p>Llamafile: "The image features a tall, stone-like structure with many levels and carved designs on it. It is situated in front of an Asian temple building that has several windows. In the vicinity, there are two cars parked nearby – one closer to the left side of the scene and another further back towards the right edge. . ."<p>ChatGPT4:"The photo depicts a traditional Korean stone pagoda, exhibiting a tiered tower with multiple levels, each diminishing in size as they ascend. It is an example of East Asian pagodas, which are commonly found within the precincts of Buddhist temples. . . The building is painted in vibrant colors, typical of Korean temples, with green being prominent."<p>No comparison, alas.
Nice, but you are leaving some performance on the table (if you have a GPU).<p>Exllama + GPTQ is the way to go.<p>llama.cpp and GGUF are great on CPUs.<p>More data:
<a href="https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/" rel="nofollow noreferrer">https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacp...</a>
Which smaller model gives good output and works best with this? I am looking to run this on lower-end systems.<p>I wonder if someone has already tried <a href="https://github.com/jzhang38/TinyLlama">https://github.com/jzhang38/TinyLlama</a>; it could save me some time :)
This is not to be dismissive, but there is a security risk if we keep using the current abstraction: arbitrary objects serialized to disk (most commonly Python pickle files), with no way to trace back and check that the model file hasn't been tampered with.
I like the idea of putting it in one file, but not an executable file. Using CBOR (MessagePack has a 4GB bytestring limit) and providing a small utility to copy out the executable portion and run it would be a win. No 4GB limit, and it could use delta updates.
Fantastic.<p>For those of us who swim in the Microsoft ecosystem and do not compile Linux apps from code, which Linux distro would run this without fixing a huge number of dependencies?<p>It seems like someone would have included llama.cpp in their distro, ready to run.<p>Yes, I'm an idiot.
If my reading is correct, this literally just distributes an LLM model and code, and you need to do some tasks, like building, to make it actually run, right?<p>And for this, you need to have additional tools installed?
I want to replicate the ROM personality of McCoy Pauley that Case steals in Neuromancer by tuning an LLM to speak like him, and dumping a llamafile of him onto a USB stick.
This is really impressive. I am glad locally hosted LLMs are a thing. It would be disastrous if, e.g., "OpenAI" got a monopoly on these programs.<p>The model seems worse than the original ChatGPT at coding. However, the model is quite small. It certainly could be an NPC in some game.<p>I guess I need to buy a new computer soon, to be able to run these in their big variants.
Incredible, up and running offline at 104ms per token with no additional configuration. Worked with various permutations of questions and outputs. The fact this is so readily available is wonderful. Using xdg, I made a nice little shortcut to drop in that automatically fires this off, opens up a web browser, and begins.
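In case anyone wants to replicate that, here is a rough sketch of such a launcher script (assumptions: xdg-open for the browser, the binary name from the instructions above, and a fixed sleep as a crude stand-in for waiting until the model has loaded):<p><pre><code> #!/bin/sh
 # start the llamafile server in the background
 ./llamafile-server-0.1-llava-v1.5-7b-q4 &
 # give it a moment to load the model, then open the web UI
 sleep 5
 xdg-open http://127.0.0.1:8080/
</code></pre>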
Does it use Metal on macOS (Apple Silicon)? And if not, how does it compare performance-wise against regular llama.cpp? It's not necessarily an advantage to pack everything (huge quantized 4-bit? model and code) into a single file, or at least it wasn't when llama.cpp was gaining speed almost daily.
This is the first time I'm able to get a chat model to work this easily. That said, I can't see myself using it, as it is very limited in UI, quality, and context length (in and out) vs ChatGPT.
The technical details in the README are quite an interesting read:<p><a href="https://github.com/mozilla-Ocho/llamafile#technical-details">https://github.com/mozilla-Ocho/llamafile#technical-details</a>
Noob question - how might I call this from my Python script? Say, as a GPT-3.5 Turbo replacement of sorts.
Is there an option without a GUI?<p>This is great, thank you, very user-friendly (exhibit A: me)