I'm all for Rust and WASM, but if you look at the code it's just 150 lines of a basic Rust command-line script. All the heavy lifting is done by a single line passing the model to the WASI-NN backend, which in this case is provided by the WasmEdge runtime, which incidentally is C++, not Rust.<p>Rust brings essentially zero advantage here; the backend could be called from Python or anything else.
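For reference, the shape of that code is roughly the following. This is a sketch from memory of the wasmedge-wasi-nn bindings, not the project's actual source; the crate and method names, the "default" model alias (registered with the runtime at startup), and the output buffer size are all assumptions on my part:

    use std::io::Read;
    use wasmedge_wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding, TensorType};

    fn main() {
        // Read the prompt from stdin.
        let mut prompt = String::new();
        std::io::stdin().read_to_string(&mut prompt).expect("read stdin");

        // The one line doing the heavy lifting: hand the GGUF model (preloaded
        // by the runtime under the "default" alias) to whatever ggml backend
        // the host happens to ship.
        let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
            .build_from_cache("default")
            .expect("load model");

        // Everything else is plumbing: prompt bytes in, generated bytes out.
        let mut ctx = graph.init_execution_context().expect("init context");
        ctx.set_input(0, TensorType::U8, &[1], prompt.as_bytes())
            .expect("set input");
        ctx.compute().expect("compute");

        let mut out = vec![0u8; 4096];
        let n = ctx.get_output(0, &mut out).expect("get output") as usize;
        println!("{}", String::from_utf8_lossy(&out[..n]));
    }

Everything around that is prompt plumbing, which is exactly why the choice of wrapper language buys you so little here.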
Whoa! Great work. To other folks checking it out, it still requires downloading the weights, which are pretty large. But they essentially made a fully portable, no-dependency llama.cpp, in 2MB.<p>If you're an app developer this might be the easiest way to package an inference engine in a distributable file (the weights are already portable and can be downloaded on-demand — the inference engine is really the part you want to lock down).
This is just wrapping llama.cpp, right?
I’m sorry, but I’m pretty tired of projects wrapping x.cpp.<p>I’ve been developing a Rust + WebGPU ML framework for the past 6 months, and I’ve quickly learned how impressive GG’s work is.<p>It’s still early stages, but you can check it out here:
<a href="https://www.ratchet.sh/" rel="nofollow noreferrer">https://www.ratchet.sh/</a>
<a href="https://github.com/FL33TW00D/whisper-turbo">https://github.com/FL33TW00D/whisper-turbo</a>
Mmm…<p>The wasi-nn API that this relies on (<a href="https://github.com/WebAssembly/wasi-nn">https://github.com/WebAssembly/wasi-nn</a>) is a proposal that relies on sending arbitrary chunks to some vendor implementation. The API is literally: set input, compute, get output.<p>…and that is totally non-portable.<p>The reason <i>this</i> works is because it’s relying on the abstraction already implemented in llama.cpp that allows it to take a gguf model and map it to multiple hardware targets, which you can see has been lifted as-is into WasmEdge here: <a href="https://github.com/WasmEdge/WasmEdge/tree/master/plugins/wasi_nn/thirdparty/ggml">https://github.com/WasmEdge/WasmEdge/tree/master/plugins/was...</a><p>So…<p>> Developers can refer to this project to write their machine learning application in a high-level language using the bindings, compile it to WebAssembly, and run it with a WebAssembly runtime that supports the wasi-nn proposal, such as WasmEdge.<p>…is total rubbish; no, you can’t.<p>This isn’t portable.<p>It’s not sandboxed.<p>It’s not a HAL.<p>If you have a wasm binary you <i>might</i> be able to run it <i>if</i> the version of the runtime you’re using <i>happens</i> to implement the specific ggml backend you need, which it probably doesn’t… because there’s literally no requirement for it to do so.<p>…and if it does, you’re just calling the llama.cpp ggml code, so it’s only as safe as that library is.<p>There’s a lot of “so portable” and “such Rust” talk in this article which really seems misplaced; this doesn’t seem to have the benefits of either of those two things.<p>Let’s imagine you have some new hardware with a WASI runtime on it: can you run your model on it? Does it have GPU support?<p>Well, it turns out the answer is “go and see if llama.cpp compiles on that platform with GPU support, and whether the runtime you’re using happens to have a ggml plugin in it and happens to have a copy of that version of ggml vendored in it, and if not, then no”.<p>…at which point, wtf are you even using WASI for?<p>Cross-platform GPU support <i>is</i> hard, but this… I dunno. It seems absolutely ridiculous.<p>Imagine if WebGPU was just “post some binary chunk to the GPU and maybe it’ll draw something or whatever if it’s the right binary chunk for the current hardware.”<p>That’s what this is.
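To make the “set input, compute, get output” point concrete, the interface a wasm module actually gets is roughly the following shape. This is paraphrased into Rust from memory of the proposal, so the names, enum variants and types are approximate, not the actual WIT/WITX:

    // Rough shape of the wasi-nn host interface, paraphrased into Rust from
    // memory of the proposal; names and types are approximate.

    pub type Graph = u32;                 // opaque handle returned by the host
    pub type GraphExecutionContext = u32; // ditto

    pub enum GraphEncoding {
        OpenVino,
        Onnx,
        TensorFlow,
        PyTorch,
        TensorFlowLite,
        Ggml, // the one the WasmEdge ggml plugin answers to
    }

    pub enum ExecutionTarget {
        Cpu,
        Gpu,
        Tpu,
        Auto,
    }

    pub struct Tensor<'a> {
        pub dimensions: &'a [u32],
        pub tensor_type: u8,  // element-type tag
        pub data: &'a [u8],   // arbitrary bytes; the backend decides what they mean
    }

    pub struct NnError; // error codes elided

    pub trait WasiNn {
        // "Here are some byte buffers and an encoding tag; hope the host has a
        // matching backend compiled in."
        fn load(builder: &[&[u8]], encoding: GraphEncoding, target: ExecutionTarget)
            -> Result<Graph, NnError>;
        fn init_execution_context(graph: Graph) -> Result<GraphExecutionContext, NnError>;
        fn set_input(ctx: GraphExecutionContext, index: u32, tensor: &Tensor) -> Result<(), NnError>;
        fn compute(ctx: GraphExecutionContext) -> Result<(), NnError>;
        // Copies whatever the backend produced into a buffer you sized by guessing.
        fn get_output(ctx: GraphExecutionContext, index: u32, out: &mut [u8]) -> Result<u32, NnError>;
    }

Nothing in there specifies what the bytes mean, what operators exist, or what hardware is available; “portability” is whatever backend the host happens to have compiled in.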
The way things are going, we'll see more efficient and faster ways to run the transformer architecture on the edge, but I'm afraid we're approaching the limit, because you can't just Rust your way out of the VRAM requirements, which are the main bottleneck in loading large-enough models. One might say "small models are getting better, look at Mistral vs. Llama 2", but small models are also approaching their capacity (there's only so much you can put in 7B parameters).<p>I don't know man, this approach to AI doesn't "feel" like it'll lead to AGI; it's too inefficient.
> the Mac OS build of the GGML plugin uses the Metal API to run the inference workload on M1/M2/M3’s built-in neural processing engines<p>I don't think that's accurate (someone please correct me...)<p>GGML's use of the Metal API means it runs on the M1/M2/M3 <i>GPU</i>, not the Neural Engine.<p>Which is all good, but for the sake of being pedantic...
I hate this kind of clickbait marketing suggesting the project delivers 1/100 of the size or 100x-35000x the speed of other solutions because it uses a different language for a wrapper around the core library, while completely neglecting the tooling and community expertise built around other solutions.<p>First of all, the project is based on llama.cpp[1], which does the heavy work of loading and running multi-GB model files on GPU/CPU, and the inference speed is not limited by the wrapper choice (there are other wrappers in Go, Python, Node, Rust, etc., or one can use llama.cpp directly). The size of the binary is also not that important when common quantized model files are often in the range of 5GB-40GB and require a beefy GPU or a motherboard with 16-64GB of RAM.<p>[1] <a href="https://github.com/ggerganov/llama.cpp">https://github.com/ggerganov/llama.cpp</a>
If a large part of the size is essentially the trained weights of a model, how can one reduce the size by orders of magnitude (without losing any accuracy)?
> The core Rust source code is very simple. It is only 40 lines of code. The Rust program manages the user input, tracks the conversation history, transforms the text into the llama2’s chat template, and runs the inference operations using the WASI NN API.<p>TL;DR a 2MB executable that reads stdin and calls WASI-NN
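For anyone curious what those ~40 lines amount to, a minimal sketch of the loop is below. This is my own reconstruction, not the project's code: the `infer` helper is a hypothetical stand-in for the WASI-NN load/set_input/compute/get_output calls, and the llama2 [INST]/<<SYS>> template spacing is from memory.

    use std::io::{self, BufRead, Write};

    // Hypothetical stand-in for the WASI-NN calls (load / set_input /
    // compute / get_output); it just echoes something printable.
    fn infer(prompt: &str) -> String {
        format!("<model reply to a {}-byte prompt>", prompt.len())
    }

    fn main() {
        let system = "You are a helpful assistant.";
        // (user, assistant) turns, so each request carries the whole history.
        let mut history: Vec<(String, String)> = Vec::new();

        print!("You: ");
        io::stdout().flush().unwrap();
        for line in io::stdin().lock().lines() {
            let user = line.unwrap();

            // Rebuild the llama2-style chat prompt from the history each turn
            // (roughly the [INST]/<<SYS>> format; exact spacing may differ).
            let mut prompt = format!("<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n");
            for (u, a) in &history {
                prompt.push_str(&format!("{u} [/INST] {a} </s><s>[INST] "));
            }
            prompt.push_str(&format!("{user} [/INST]"));

            let answer = infer(&prompt);
            println!("Bot: {answer}");
            history.push((user, answer));

            print!("You: ");
            io::stdout().flush().unwrap();
        }
    }

That's the whole program: read stdin, build the templated prompt, hand it to the host's backend.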
It looks like this is Rust for the application, wrapped around a WASM port of llama.cpp, which in turn uses an implementation of WASI-NN for the actual NN compute. It would be interesting to see how this compares to TFLite, the new stuff in the PyTorch ecosystem, etc.
I'm getting lost in all of this.<p>I'm using llama.cpp and mlc-llm, both on my two-year-old mobile Ryzen APU with 64GB of RAM. The first doesn't use the GPU at all (I tried plenty of options, nothing worked), but llama 34B runs: painfully slow, but it does run. The second runs on top of Vulkan; I didn't take any precise measurements, but its limit looks like 32GB of RAM (so no llama 34B). It does offload the CPU, but unfortunately performance seems similar to the CPU path (that's my perception, I didn't take measurements here either).<p>So... will I get any benefit from switching to the Rust/WebAssembly version???
Very cool, but unless I missed it, could someone please explain why not just compile a native Rust application? Is the Wasm part needed for GPU acceleration (whatever the user's GPU is)?
> the binary application (only 2MB) is completely portable across devices with heterogeneous hardware accelerators.<p>What does “heterogeneous hardware accelerators” mean in practice?
The binary size is not really important in this case; llama.cpp shouldn't be that far from this. What matters, as we all know, is how much GPU memory we need.
Congrats on the work... it's an impressive demo!<p>It may be worth looking into adding support for it in the Wasmer WebAssembly runtime [1]. (Note: I work at Wasmer!)<p>[1] <a href="https://wasmer.io/">https://wasmer.io/</a>
Wow, this is a “holy shit” moment for Rust in AI applications if this works as described. Also, so long Mojo!<p>EDIT:<p>Looks like I’m wrong, but I appreciate getting schooled by all the HNers with low-level expertise. Lots to go and learn about now.
I'm confused about the title rewrite from “Fast and Portable Llama2 Inference on the Heterogeneous Edge”, which more clearly communicates what this article is about: a Wasm version of llama.cpp.<p>I feel like editorializing to highlight the fact that it's 2MB and runs on a Mac misses some of the core aspects of the project and write-up.