I think the best way to try this out is with LLaVA, the text+image model (like GPT-4 Vision). Here are steps to do that on macOS (which should work the same on other platforms too, though I haven't tried that yet):<p>1. Download the 4.26GB llamafile-server-0.1-llava-v1.5-7b-q4 file from <a href="https://huggingface.co/jartine/llava-v1.5-7B-GGUF/blob/main/llamafile-server-0.1-llava-v1.5-7b-q4" rel="nofollow noreferrer">https://huggingface.co/jartine/llava-v1.5-7B-GGUF/blob/main/...</a>:<p><pre><code> wget https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llamafile-server-0.1-llava-v1.5-7b-q4
</code></pre>
2. Make that binary executable, by running this in a terminal:<p><pre><code> chmod 755 llamafile-server-0.1-llava-v1.5-7b-q4
</code></pre>
3. Run your new executable, which will start a web server on port 8080:<p><pre><code> ./llamafile-server-0.1-llava-v1.5-7b-q4
</code></pre>
4. Navigate to <a href="http://127.0.0.1:8080/" rel="nofollow noreferrer">http://127.0.0.1:8080/</a> to upload an image and start chatting with the model about it in your browser.<p>Screenshot here: <a href="https://simonwillison.net/2023/Nov/29/llamafile/" rel="nofollow noreferrer">https://simonwillison.net/2023/Nov/29/llamafile/</a>
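If you'd rather script it than use the browser: the llamafile server is built on llama.cpp's server example, so (assuming this build exposes the usual /completion endpoint, which I haven't double-checked) you should also be able to query it from the command line, roughly like this:<p><pre><code> curl http://127.0.0.1:8080/completion -H "Content-Type: application/json" \
   -d '{"prompt": "Describe a llama in one sentence.", "n_predict": 64}'
</code></pre>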
Extremely cool and Justine Tunney / jart does incredible portability work [0], but I'm kind of struggling with the use-cases for this one.<p>I make a small macOS app [1] which runs llama.cpp with a SwiftUI front-end. For the first version of the app I was obsessed with the single download -> chat flow and making 0 network connections. I bundled a model with the app and you could just download, open, and start using it. Easy! But as soon as I wanted to release a UI update to my TestFlight beta testers, I was causing them to download another 3GB. All 3 users complained :). My first change after that was decoupling the default model download and the UI so that I can ship app updates that are about 5MB. It feels like someone using this tool is going to hit the same problem pretty quickly when they want to get the latest llama.cpp updates (ggerganov SHIIIIPS [2]). Maybe there are cases where that doesn't matter; I'd love to hear where people think this could be useful.<p>[0]: <a href="https://justine.lol/cosmopolitan/" rel="nofollow noreferrer">https://justine.lol/cosmopolitan/</a><p>[1]: <a href="https://www.freechat.run" rel="nofollow noreferrer">https://www.freechat.run</a><p>[2]: <a href="https://github.com/ggerganov/llama.cpp">https://github.com/ggerganov/llama.cpp</a>
Hmmm no matter what image I upload, it's somehow not seeing it.<p>Got a chuckle out of this, though:
---<p>User: Describe the image in the most detailed way possible<p>Llama: The image is a close-up of an empty, dark room with no visible objects or people. It appears to be a black background photo that highlights the darkness and emptiness of the space. This could potentially represent an artistic representation or an actual setting where lighting has been used to emphasize the contrast between light and shadow. The focus on this particular area suggests that it might have some significance, whether symbolic or practical in nature. Overall, the image evokes a sense of mystery and intrigue due to its lack of context and content.<p>User: is it just a black square?<p>Llama: Yes, it is just a black square.
Simon Willison has a great post on this <a href="https://simonwillison.net/2023/Nov/29/llamafile/" rel="nofollow noreferrer">https://simonwillison.net/2023/Nov/29/llamafile/</a>
It’s intriguing to observe how this model is hallucinating. I presented it with Albert Bierstadt’s painting “In the Mountains” (<a href="https://www.wikiart.org/en/albert-bierstadt/in-the-mountains-1867" rel="nofollow noreferrer">https://www.wikiart.org/en/albert-bierstadt/in-the-mountains...</a>). To my surprise, it maintains that cows are scattered throughout the scene, despite being told otherwise.<p>User:<p>Llama: The image features a serene landscape with mountains in the background and a river flowing through it. A picturesque scene unfolds as two trees are visible near the water, adding to the natural beauty of the environment. In addition to these elements, there is an interesting detail: several cows can be seen standing on or around the grassy area next to the body of water. The presence of these animals creates a sense of tranquility and harmony with nature in this idyllic setting.<p>User: There are no cows. Could it be that you've mistaken rocks for cows?<p>Llama: No, the rocks in this image are not mistaken for cows. They can be seen near a lake with mountains in the background and trees surrounding it. The painting captures a serene landscape where nature is beautifully displayed. This picturesque scene showcases both the majestic mountain range and the tranquil body of water, creating an idyllic atmosphere that evokes feelings of peacefulness and relaxation.<p>User: There are no cows!<p>Llama: The image depicts a serene scene of mountains and water. A beautiful lake is surrounded by majestic snow-capped mountains, creating an idyllic landscape. In the foreground, there are trees adding to the natural beauty of the environment.<p>A few birds can be seen flying in the sky above the lake, further enhancing the peaceful atmosphere. The presence of a boat on the water adds another element of leisure and enjoyment to this picturesque scene. Overall, it is an inviting image that showcases nature's wonders with its combination of mountains, lake, trees, birds, and boats.
Related: <a href="https://hacks.mozilla.org/2023/11/introducing-llamafile/" rel="nofollow noreferrer">https://hacks.mozilla.org/2023/11/introducing-llamafile/</a> and <a href="https://twitter.com/justinetunney/status/1729940628098969799" rel="nofollow noreferrer">https://twitter.com/justinetunney/status/1729940628098969799</a><p>(via <a href="https://news.ycombinator.com/item?id=38463456">https://news.ycombinator.com/item?id=38463456</a> and <a href="https://news.ycombinator.com/item?id=38464759">https://news.ycombinator.com/item?id=38464759</a>, but we merged the comments hither)
Wow, it has CUDA support even though it's built with Cosmopolitan? Awesome, I see Cosmopolitan just this month added some support for dynamic linking specifically to enable GPUs! This is amazing, I'm glad they found a way to do this. <a href="https://github.com/jart/cosmopolitan/commit/5e8c928f1a37349a8c72f0b6aae5e535eace3f41">https://github.com/jart/cosmopolitan/commit/5e8c928f1a37349a...</a><p>I see it unfortunately requires the CUDA developer toolkit to be installed. It's totally possible to distribute CUDA apps that run without any dependencies installed other than the Nvidia driver. If they could figure <i>that</i> out it would be a game changer.
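For what it's worth, a minimal sketch of the usual approach (not specific to llamafile): statically link the CUDA runtime so the only run-time dependency is the Nvidia driver, e.g. via nvcc's -cudart flag:<p><pre><code> nvcc -o myapp myapp.cu -cudart static
</code></pre><p>Whether that approach can be reconciled with how llamafile/Cosmopolitan links in GPU support is a separate question.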
> Stick that file on a USB stick and stash it in a drawer as insurance against a future apocalypse. You’ll never be without a language model ever again.<p><3
> Windows also has a maximum file size limit of 2GB for executables. You need to have llamafile and your weights be separate files on the Windows platform.<p>The 4GB .exe ran fine on my Windows 10 64-bit system.
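For reference, the split setup the README describes would presumably look something like this on Windows (the -m flag is llama.cpp's standard way of pointing at an external GGUF weights file; the filenames here are just illustrative):<p><pre><code> llamafile-server-0.1.exe -m llava-v1.5-7b-q4.gguf
</code></pre>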
Great! Worked easily on desktop Linux, first try. It appears to execute with zero network connections. I added a 1200x900 photo from a journalism project and asked "please describe this photo". In 4GB of RAM, it took between two and three minutes to execute with CPU-only support. The response was of mixed value. On the one hand, it described "several people appear in the distance", but no, it was brush and trees in the distance, no other people. There was a single figure of a woman walking with a phone in the foreground, which was correctly described by this model. The model did detect 'an atmosphere suggesting a natural disaster' and that is accurate.<p>Thanks to Mozilla and Justine Tunney for this very easy, local experiment today!
Can confirm that this runs on an ancient i3 NUC under Ubuntu 20.04. It emits a token every five or six seconds, which is "ask a question then go get coffee" speed. Still, very cool.
From a technical standpoint, this project is really fascinating. I can see a lot of use cases for getting something up fast locally for an individual user.<p>But for anyone in a production/business setting, it would be tough to see this being viable. It seems like it would be a non-starter for most medium-to-large companies' IT teams. The great thing about a Dockerfile is that it can be inspected and the install process is relatively easy to understand.
> you pass the --n-gpu-layers 35 flag (or whatever value is appropriate) to enable GPU<p>This is a bit like specifying how large your strings will be to a C program. That was maybe acceptable in the old days, but not really anymore.
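For context, a sketch of the invocation being complained about, using the binary name from the instructions above (35 is just the value suggested for this particular model, not something the program works out for you):<p><pre><code> ./llamafile-server-0.1-llava-v1.5-7b-q4 --n-gpu-layers 35
</code></pre>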
This puts a super great evil happy grin on my face. I am going to add it in the next version of L2E OS! Thank you jart, thank you mozilla! Love you folks!
Small field test: I uploaded a picture of a typical small Korean Buddhist temple, with a stone pagoda in front. Anyone at all familiar with East Asian Buddhism would instantly recognize both the pagoda and the temple behind it as Korean.<p>Llamafile: "The image features a tall, stone-like structure with many levels and carved designs on it. It is situated in front of an Asian temple building that has several windows. In the vicinity, there are two cars parked nearby – one closer to the left side of the scene and another further back towards the right edge. . ."<p>ChatGPT4:"The photo depicts a traditional Korean stone pagoda, exhibiting a tiered tower with multiple levels, each diminishing in size as they ascend. It is an example of East Asian pagodas, which are commonly found within the precincts of Buddhist temples. . . The building is painted in vibrant colors, typical of Korean temples, with green being prominent."<p>No comparison, alas.
Nice, but you are leaving some performance on the table (if you have a GPU).<p>Exllama + GPTQ is the way to go.<p>llama.cpp and GGUF are great on CPUs.<p>More data:
<a href="https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/" rel="nofollow noreferrer">https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacp...</a>
Which smaller model gives good output and works best with this? I am looking to run this on lower-end systems.<p>I wonder if someone has already tried <a href="https://github.com/jzhang38/TinyLlama">https://github.com/jzhang38/TinyLlama</a>; it could save me some time :)
This is not to be dismissive, but there is a security risk if we keep using the current abstraction: arbitrary objects serialized to disk (most commonly Python pickle files), with no way to trace back and check that the model file hasn't been tampered with.
I like the idea of putting it in one file, but not an executable file. Using CBOR (MessagePack has a 4GB bytestring limit) and providing a small utility to copy out the executable portion and run it would be a win. No 4GB limit, and it could use delta updates.
Fantastic.<p>For those of us who swim in the Microsoft ecosystem and do not compile Linux apps from code, which Linux distro would run this without fixing a huge number of dependencies?<p>It seems like someone would have included llama.cpp in their distro, ready to run.<p>Yes, I'm an idiot.
If my reading is correct, this literally just distributes an LLM model and code, and you need to do some tasks, like building, to make it actually run, right?<p>And for this, you need to have additional tools installed?
I want to replicate the ROM personality of McCoy Pauley that Case steals in Neuromancer by tuning an LLM to speak like him, and dumping a llamafile of him onto a USB stick.
This is really impressive. I am glad locally hosted LLMs are a thing. It would be disastrous if, e.g., "OpenAI" got a monopoly on these programs.<p>The model seems worse than the original ChatGPT at coding. However, the model is quite small. It certainly could be an NPC in some game.<p>I guess I need to buy a new computer soon, to be able to run these in their big variants.
Incredible, up and running offline at 104ms per token with no additional configuration. Worked with various permutations of questions and outputs. The fact this is so readily available is wonderful. Using xdg, I made a nice little shortcut to drop in that automatically fires this off, opens up a web browser, and begins.
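In case anyone wants to replicate that, here is a rough sketch of such a launcher script (assumptions: xdg-open for the browser, the binary name from the instructions above, and a fixed sleep as a crude stand-in for waiting until the model has loaded):<p><pre><code> #!/bin/sh
 # start the llamafile server in the background
 ./llamafile-server-0.1-llava-v1.5-7b-q4 &
 # give it a moment to load the model, then open the web UI
 sleep 5
 xdg-open http://127.0.0.1:8080/
</code></pre>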
Does it use Metal on macOS (Apple Silicon)? And if not, how does it compare performance-wise against regular llama.cpp? It's not necessarily an advantage to pack everything (huge quantized 4-bit? model and code) into a single file, or at least it wasn't when llama.cpp was gaining speed almost daily.
This is the first time I'm able to get a chat model to work this easily. That said, I can't see myself using it, as it is very limited in UI, quality, and context length (in and out) vs ChatGPT.
The technical details in the README are quite an interesting read:<p><a href="https://github.com/mozilla-Ocho/llamafile#technical-details">https://github.com/mozilla-Ocho/llamafile#technical-details</a>
Noob question - how might I call this from my Python script? Say, as a GPT-3.5 Turbo replacement of sorts.
Is there an option without a GUI?<p>This is great, thank you, very user-friendly (exhibit A: me)