I'm all for Rust and WASM, but if you look at the code it's just 150 lines of a basic Rust command-line script. All the heavy lifting is done by a single line passing the model to the WASI-NN backend, which in this case is provided by the WasmEdge runtime, which incidentally is C++, not Rust.<p>Rust brings essentially zero advantage here; the backend could be called from Python or anything else.
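For reference, the shape of that code is roughly the following. This is a sketch from memory of the wasmedge-wasi-nn bindings, not the project's actual source; the crate and method names, the "default" model alias (registered with the runtime at startup), and the output buffer size are all assumptions on my part:

    use std::io::Read;
    use wasmedge_wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding, TensorType};

    fn main() {
        // Read the prompt from stdin.
        let mut prompt = String::new();
        std::io::stdin().read_to_string(&mut prompt).expect("read stdin");

        // The one line doing the heavy lifting: hand the GGUF model (preloaded
        // by the runtime under the "default" alias) to whatever ggml backend
        // the host happens to ship.
        let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
            .build_from_cache("default")
            .expect("load model");

        // Everything else is plumbing: prompt bytes in, generated bytes out.
        let mut ctx = graph.init_execution_context().expect("init context");
        ctx.set_input(0, TensorType::U8, &[1], prompt.as_bytes())
            .expect("set input");
        ctx.compute().expect("compute");

        let mut out = vec![0u8; 4096];
        let n = ctx.get_output(0, &mut out).expect("get output") as usize;
        println!("{}", String::from_utf8_lossy(&out[..n]));
    }

Everything around that is prompt plumbing, which is exactly why the choice of wrapper language buys you so little here.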
Whoa! Great work. To other folks checking it out, it still requires downloading the weights, which are pretty large. But they essentially made a fully portable, no-dependency llama.cpp, in 2MB.<p>If you're an app developer this might be the easiest way to package an inference engine in a distributable file (the weights are already portable and can be downloaded on-demand — the inference engine is really the part you want to lock down).
This is just wrapping llama.cpp, right?
I’m sorry, but I’m pretty tired of projects wrapping x.cpp.<p>I’ve been developing a Rust + WebGPU ML framework for the past 6 months, and I’ve quickly learned how impressive GG’s work is.<p>It’s still early stages, but you can check it out here:
<a href="https://www.ratchet.sh/" rel="nofollow noreferrer">https://www.ratchet.sh/</a>
<a href="https://github.com/FL33TW00D/whisper-turbo">https://github.com/FL33TW00D/whisper-turbo</a>
Mmm…<p>The wasi-nn API that this relies on (<a href="https://github.com/WebAssembly/wasi-nn">https://github.com/WebAssembly/wasi-nn</a>) is a proposal that relies on sending arbitrary chunks to some vendor implementation. The API is literally: set input, compute, get output.<p>…and that is totally non-portable.<p>The reason <i>this</i> works is because it’s relying on the abstraction already implemented in llama.cpp that allows it to take a gguf model and map it to multiple hardware targets, which you can see has been lifted as-is into WasmEdge here: <a href="https://github.com/WasmEdge/WasmEdge/tree/master/plugins/wasi_nn/thirdparty/ggml">https://github.com/WasmEdge/WasmEdge/tree/master/plugins/was...</a><p>So…<p>> Developers can refer to this project to write their machine learning application in a high-level language using the bindings, compile it to WebAssembly, and run it with a WebAssembly runtime that supports the wasi-nn proposal, such as WasmEdge.<p>…is total rubbish; no, you can’t.<p>This isn’t portable.<p>It’s not sandboxed.<p>It’s not a HAL.<p>If you have a wasm binary you <i>might</i> be able to run it <i>if</i> the version of the runtime you’re using <i>happens</i> to implement the specific ggml backend you need, which it probably doesn’t… because there’s literally no requirement for it to do so.<p>…and if it does, you’re just calling the llama.cpp ggml code, so it’s only as safe as that library is.<p>There’s a lot of “so portable” and “such Rust” talk in this article which really seems misplaced; this doesn’t seem to have the benefits of either of those two things.<p>Let’s imagine you have some new hardware with a WASI runtime on it: can you run your model on it? Does it have GPU support?<p>Well, it turns out the answer is “go and see if llama.cpp compiles on that platform with GPU support, and whether the runtime you’re using happens to have a ggml plugin in it and happens to have a copy of that version of ggml vendored in it, and if not, then no”.<p>…at which point, wtf are you even using WASI for?<p>Cross-platform GPU support <i>is</i> hard, but this… I dunno. It seems absolutely ridiculous.<p>Imagine if WebGPU was just “post some binary chunk to the GPU and maybe it’ll draw something or whatever if it’s the right binary chunk for the current hardware.”<p>That’s what this is.
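To make the “set input, compute, get output” point concrete, the interface a wasm module actually gets is roughly the following shape. This is paraphrased into Rust from memory of the proposal, so the names, enum variants and types are approximate, not the actual WIT/WITX:

    // Rough shape of the wasi-nn host interface, paraphrased into Rust from
    // memory of the proposal; names and types are approximate.

    pub type Graph = u32;                 // opaque handle returned by the host
    pub type GraphExecutionContext = u32; // ditto

    pub enum GraphEncoding {
        OpenVino,
        Onnx,
        TensorFlow,
        PyTorch,
        TensorFlowLite,
        Ggml, // the one the WasmEdge ggml plugin answers to
    }

    pub enum ExecutionTarget {
        Cpu,
        Gpu,
        Tpu,
        Auto,
    }

    pub struct Tensor<'a> {
        pub dimensions: &'a [u32],
        pub tensor_type: u8,  // element-type tag
        pub data: &'a [u8],   // arbitrary bytes; the backend decides what they mean
    }

    pub struct NnError; // error codes elided

    pub trait WasiNn {
        // "Here are some byte buffers and an encoding tag; hope the host has a
        // matching backend compiled in."
        fn load(builder: &[&[u8]], encoding: GraphEncoding, target: ExecutionTarget)
            -> Result<Graph, NnError>;
        fn init_execution_context(graph: Graph) -> Result<GraphExecutionContext, NnError>;
        fn set_input(ctx: GraphExecutionContext, index: u32, tensor: &Tensor) -> Result<(), NnError>;
        fn compute(ctx: GraphExecutionContext) -> Result<(), NnError>;
        // Copies whatever the backend produced into a buffer you sized by guessing.
        fn get_output(ctx: GraphExecutionContext, index: u32, out: &mut [u8]) -> Result<u32, NnError>;
    }

Nothing in there specifies what the bytes mean, what operators exist, or what hardware is available; “portability” is whatever backend the host happens to have compiled in.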
The way things are going, we'll see more efficient and faster ways to run the transformer architecture on the edge, but I'm afraid we're approaching the limit, because you can't just Rust your way out of the VRAM requirements, which are the main bottleneck in loading large-enough models. One might say "small models are getting better, look at Mistral vs. Llama 2", but small models are also approaching their capacity (there's only so much you can put in 7B parameters).<p>I don't know man, this approach to AI doesn't "feel" like it'll lead to AGI; it's too inefficient.
> the Mac OS build of the GGML plugin uses the Metal API to run the inference workload on M1/M2/M3’s built-in neural processing engines<p>I don't think that's accurate (someone please correct me...)<p>GGML's use of the Metal API means it runs on the M1/M2/M3 <i>GPU</i>, not the Neural Engine.<p>Which is all good, but for the sake of being pedantic...
I hate this kind of clickbait marketing suggesting the project delivers 1/100 of the size or 100x-35000x the speed of other solutions because it uses a different language for a wrapper around the core library, while completely neglecting the tooling and community expertise built around other solutions.<p>First of all, the project is based on llama.cpp[1], which does the heavy work of loading and running multi-GB model files on GPU/CPU, and the inference speed is not limited by the wrapper choice (there are other wrappers in Go, Python, Node, Rust, etc., or one can use llama.cpp directly). The size of the binary is also not that important when common quantized model files are often in the range of 5GB-40GB and require a beefy GPU or a motherboard with 16-64GB of RAM.<p>[1] <a href="https://github.com/ggerganov/llama.cpp">https://github.com/ggerganov/llama.cpp</a>
If a large part of the size is essentially the trained weights of a model, how can one reduce the size by orders of magnitude (without losing any accuracy)?
> The core Rust source code is very simple. It is only 40 lines of code. The Rust program manages the user input, tracks the conversation history, transforms the text into the llama2’s chat template, and runs the inference operations using the WASI NN API.<p>TL;DR a 2MB executable that reads stdin and calls WASI-NN
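For anyone curious what those ~40 lines amount to, a minimal sketch of the loop is below. This is my own reconstruction, not the project's code: the `infer` helper is a hypothetical stand-in for the WASI-NN load/set_input/compute/get_output calls, and the llama2 [INST]/<<SYS>> template spacing is from memory.

    use std::io::{self, BufRead, Write};

    // Hypothetical stand-in for the WASI-NN calls (load / set_input /
    // compute / get_output); it just echoes something printable.
    fn infer(prompt: &str) -> String {
        format!("<model reply to a {}-byte prompt>", prompt.len())
    }

    fn main() {
        let system = "You are a helpful assistant.";
        // (user, assistant) turns, so each request carries the whole history.
        let mut history: Vec<(String, String)> = Vec::new();

        print!("You: ");
        io::stdout().flush().unwrap();
        for line in io::stdin().lock().lines() {
            let user = line.unwrap();

            // Rebuild the llama2-style chat prompt from the history each turn
            // (roughly the [INST]/<<SYS>> format; exact spacing may differ).
            let mut prompt = format!("<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n");
            for (u, a) in &history {
                prompt.push_str(&format!("{u} [/INST] {a} </s><s>[INST] "));
            }
            prompt.push_str(&format!("{user} [/INST]"));

            let answer = infer(&prompt);
            println!("Bot: {answer}");
            history.push((user, answer));

            print!("You: ");
            io::stdout().flush().unwrap();
        }
    }

That's the whole program: read stdin, build the templated prompt, hand it to the host's backend.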
It looks like this is Rust for the application, wrapped around a WASM port of llama.cpp, which in turn uses an implementation of WASI-NN for the actual NN compute. It would be interesting to see how this compares to TFLite, the new stuff in the PyTorch ecosystem, etc.
I'm getting lost in all of this.<p>I'm using llama.cpp and mlc-llm, both on my two-year-old mobile Ryzen APU with 64GB of RAM. The first doesn't use the GPU at all (I tried plenty of options, nothing worked), but llama 34B runs: painfully slow, but it does run. The second runs on top of Vulkan; I didn't take any precise measurements, but its limit looks like 32GB of RAM (so no llama 34B). It does offload the CPU, but unfortunately performance seems similar to the CPU path (that's my perception, I didn't take measurements here either).<p>So... will I get any benefit from switching to the Rust/WebAssembly version???
Very cool, but unless I missed it, could someone please explain why not just compile a native Rust application? Is the Wasm part needed for GPU acceleration (whatever the user's GPU is)?
> the binary application (only 2MB) is completely portable across devices with heterogeneous hardware accelerators.<p>What does “heterogeneous hardware accelerators” mean in practice?
The binary size is not really important in this case; llama.cpp shouldn't be that far from this. What matters, as we all know, is how much GPU memory we need.
Congrats on the work... it's an impressive demo!<p>It may be worth looking into adding support for it in the Wasmer WebAssembly runtime [1]. (Note: I work at Wasmer!)<p>[1] <a href="https://wasmer.io/">https://wasmer.io/</a>
Wow, this is a “holy shit” moment for Rust in AI applications if this works as described. Also, so long Mojo!<p>EDIT:<p>Looks like I’m wrong, but I appreciate getting schooled by all the HNers with low-level expertise. Lots to go and learn about now.
I'm confused about the title rewrite from “Fast and Portable Llama2 Inference on the Heterogeneous Edge”, which more clearly communicates what this article is about: a Wasm version of llama.cpp.<p>I feel like editorializing to highlight the fact that it's 2MB and runs on a Mac misses some of the core aspects of the project and write-up.