I've counted three different Rust LLaMA implementations on the r/rust subreddit this week:<p><a href="https://github.com/Noeda/rllama/">https://github.com/Noeda/rllama/</a> (pure Rust+OpenCL)<p><a href="https://github.com/setzer22/llama-rs/">https://github.com/setzer22/llama-rs/</a> (ggml based)<p><a href="https://github.com/philpax/ggllama">https://github.com/philpax/ggllama</a> (also ggml based)<p>There's also a discussion in a GitHub issue on setzer's repo about collaborating a bit on these separate efforts: <a href="https://github.com/setzer22/llama-rs/issues/4">https://github.com/setzer22/llama-rs/issues/4</a>
Anyone know if these LLaMA models can have a large pile of context fed in? E.g. to have the "AI" act like ChatGPT with a specific knowledge base you feed in?<p>I.e., imagine you feed in the last year of your chatlogs, and then ask the assistant queries about them. Compound that with your wiki, itinerary, etc. Is this possible with LLaMA? Where might it fail in doing this?<p><i>(and yes, I know this is basically autocomplete on steroids. I'm still curious hah)</i>
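For what it's worth, a rough sketch of what "feeding in a knowledge base" usually looks like in practice: you just prepend the documents to the prompt, and the hard limit is the model's context window (2048 tokens for the original LLaMA weights). Everything below is hypothetical illustration, not any real crate's API; `count_tokens` is a crude stand-in for a tokenizer.

    /// Very crude token estimate: ~4 characters per token on average (assumption).
    fn count_tokens(text: &str) -> usize {
        text.len() / 4 + 1
    }

    /// Build a prompt by prepending documents until the (assumed) 2048-token
    /// LLaMA context window would overflow -- which is exactly where "a year of
    /// chatlogs" fails unless you add retrieval or summarization on top.
    fn build_prompt(docs: &[String], question: &str) -> String {
        const CONTEXT_WINDOW: usize = 2048; // original LLaMA context length, in tokens
        const ANSWER_BUDGET: usize = 256;   // leave room for the generated answer

        let mut prompt = String::new();
        for doc in docs {
            let candidate = format!("{prompt}{doc}\n");
            if count_tokens(&candidate) + count_tokens(question) + ANSWER_BUDGET > CONTEXT_WINDOW {
                break; // out of room: the remaining documents simply don't fit
            }
            prompt = candidate;
        }
        prompt.push_str(&format!("Question: {question}\nAnswer:"));
        prompt
    }

    fn main() {
        let docs = vec!["2023-01-01: discussed the deploy outage...".to_string()];
        println!("{}", build_prompt(&docs, "What caused the outage?"));
    }

So it "works" as long as the relevant slice of your data fits in a couple of thousand tokens; beyond that you need some way of selecting which chunks to stuff in.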
I feel like <a href="https://github.com/ggerganov/llama.cpp/issues/171">https://github.com/ggerganov/llama.cpp/issues/171</a> is a better approach here?<p>With how fast llama.cpp is changing, this seems like a lot of churn for no reason.
Great job porting the C++ code! Seems like the reasoning was to provide the code as a library to embed in an HTTP server; can't wait to see that happen and try it out.<p>Looking at how the inference runs, this shouldn't be a big problem, right? <a href="https://github.com/setzer22/llama-rs/blob/main/llama-rs/src/main.rs#L42">https://github.com/setzer22/llama-rs/blob/main/llama-rs/src/...</a>
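To gesture at what "library + HTTP server" could look like, here's a minimal sketch using only std::net, with the actual llama-rs API replaced by a hypothetical `infer` function, since the real crate interface is still in flux:

    use std::io::{Read, Write};
    use std::net::TcpListener;

    // Placeholder for whatever inference entry point llama-rs ends up exposing;
    // this is NOT the real API, just a stand-in for the sketch.
    fn infer(prompt: &str) -> String {
        format!("(model output for: {prompt})")
    }

    fn main() -> std::io::Result<()> {
        let listener = TcpListener::bind("127.0.0.1:8080")?;
        for stream in listener.incoming() {
            let mut stream = stream?;
            // Read the request naively and treat everything after the headers as
            // the prompt (no real HTTP parsing; fine for a sketch).
            let mut buf = [0u8; 4096];
            let n = stream.read(&mut buf)?;
            let request = String::from_utf8_lossy(&buf[..n]);
            let prompt = request.split("\r\n\r\n").nth(1).unwrap_or("").to_string();

            let completion = infer(&prompt);
            let response = format!(
                "HTTP/1.1 200 OK\r\nContent-Length: {}\r\n\r\n{}",
                completion.len(),
                completion
            );
            stream.write_all(response.as_bytes())?;
        }
        Ok(())
    }

The interesting part is everything the sketch skips: keeping the multi-gigabyte weights loaded once across requests and streaming tokens back, which is exactly why a library interface matters more than the CLI.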
Can someone a lot smarter than me give a basic explanation as to why something like this can run at a respectable speed on a CPU, whereas Stable Diffusion is next to useless on one? (That is to say, 10-100x slower, whereas I have not seen GPU-based LLaMA go 10-100x faster than the demo here.) I had assumed there were similar algorithms at play.
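Not an authoritative answer, but the usual back-of-the-envelope: generating one token touches essentially every weight once, so single-user LLM decoding is bound by memory bandwidth rather than raw compute, while Stable Diffusion's big batched convolutions are compute-bound and benefit enormously from GPU parallelism. All the figures below are rough assumptions, not benchmarks:

    fn main() {
        // Rough assumptions: ~4 GB for a 4-bit quantized 7B model, ~40 GB/s for
        // dual-channel desktop RAM, ~900 GB/s for high-end GPU memory.
        let model_bytes = 4.0e9;
        let cpu_bandwidth = 40.0e9;
        let gpu_bandwidth = 900.0e9;

        // tokens/s ~= bandwidth / model size, since each token re-reads the weights.
        println!("CPU: ~{:.0} tokens/s", cpu_bandwidth / model_bytes);
        println!("GPU: ~{:.0} tokens/s", gpu_bandwidth / model_bytes);
        // The GPU advantage here is "only" the bandwidth ratio (~20x), whereas
        // Stable Diffusion's compute-bound convolutions map onto thousands of GPU
        // cores, so the CPU/GPU gap there is far larger.
    }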
Funny that he had a hard time converting llama.cpp to expose a web server… I was just asking GPT-4 to write one for me… will hopefully have a PR ready soon
Can anyone more knowledgeable in this space explain what is meant by inference?<p>From what I know, LLaMA is built in Python, presumably with PyTorch. Does this Rust port make use of a Python process, or is the LLaMA algorithm fully written in Rust?
From the readme, to preempt the moaning: "I just like collecting imaginary internet points, in the form of little stars, that people seem to give to me whenever I embark on pointless quests for rewriting X thing, but in Rust."<p>OK? Just don't. Let us have this. :)