Woo! This sounds like it will make it easier to run in normal mode (i.e., not interactive) and manage the chat history yourself, since there's less penalty for a full program reload. Currently my Perl IRC bot wrapper for llama.cpp just open2's the program in interactive mode (-i) and reads/writes llama.cpp's stdout/stdin, to get the load-time savings of having it manage history and keep state. In one-shot mode there'd still be the "extra" inference time of re-processing the full history each time instead of carrying state the way interactive mode does, but the model load time matters just as much.

For me personally this matters most because right now, when llama.cpp runs out of its 2048-token context, it segfaults, and that causes difficulties. In interactive mode, if it goes off the rails and generates 1000 tokens of nonsense, that nonsense eats up context for the next line from chat. In normal mode, where it runs once and all history has to be supplied manually, this can be avoided.
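For the curious, the wrapper approach is roughly the following -- a minimal Python sketch of the same open2-style idea (the binary path, model file, and reverse-prompt handling are illustrative assumptions, not my actual bot):

    import subprocess

    # Start llama.cpp once in interactive mode and keep talking to it over pipes,
    # the same way the Perl bot does with open2. Paths and flags are assumptions.
    proc = subprocess.Popen(
        ["./main", "-m", "models/7B/ggml-model-q4_0.bin",
         "-i", "-r", "User:", "-p", "Transcript of a chat.\nUser:"],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE,
        text=True, bufsize=1,
    )

    def ask(line):
        # Send one chat line, then read until the reverse prompt comes back,
        # i.e. until the model hands control back to us.
        proc.stdin.write(line + "\n")
        proc.stdin.flush()
        out = []
        while True:
            ch = proc.stdout.read(1)   # char-by-char: interactive output isn't line-buffered
            if not ch:                 # EOF: the child exited (e.g. the 2048-token segfault)
                break
            out.append(ch)
            if "".join(out).endswith("User:"):
                break
        return "".join(out)

    print(ask("hello from irc"))

The trade-off is exactly the one described above: the child process keeps the model and history in memory, but a runaway generation permanently eats context.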
"It's a ~200 LOC change that only took me a few hours and it worked the first time."<p>jart is on another plane of existence—and 100% earned this flex
What's the best way to download and get set up with this stuff atm? I.e., let's say I want to run the currently available variations of LLaMA -- 7B, 13B, and 30B [1] -- is there a current summary of how to acquire them, possibly quantize them, etc.? Would I download an already-quantized version or do it myself?

I ran Alpaca 7B Q4 almost instantly because they provided curl commands to download it. Super simple. But it seems most projects aren't doing that because it's prone to attracting Facebook's gaze. So... what's recommended?

I happened to find this [2], but I think those are the non-quantized raw models? Not sure yet.

[1]: Won't bother with 65B; it can't fit in memory, I believe.
[2]: https://github.com/shawwn/llama-dl/blob/main/llama.sh

Edit: I forgot about https://github.com/cocktailpeanut/dalai -- I suspect this is best in breed atm? Though a Docker container would be nice to wrangle all the dependencies.
There are some cool ideas in here. I've long been curious why people don't use mmap to re-use all those wonderful pages that got loaded (without reparsing the disk data).
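In case it helps, here's a minimal sketch of that idea in Python on a Unix-like system (the file name is just a placeholder): mapping the weights read-only means a second run finds the pages already warm in the OS page cache, and multiple processes share them instead of each making their own copy.

    import mmap, os

    # Map the weights file instead of read()ing it into the heap.
    # After the first run the pages live in the OS page cache, so a
    # second run "loads" nearly instantly. File name is a placeholder.
    fd = os.open("ggml-model-q4_0.bin", os.O_RDONLY)
    size = os.fstat(fd).st_size
    buf = mmap.mmap(fd, size, prot=mmap.PROT_READ)   # no copy, lazily paged in

    # Tensor data can be viewed directly out of `buf` at known offsets;
    # the kernel pages it in on demand and shares those pages across processes.
    header = buf[:4]
    print(len(buf), header)

That seems to be essentially what the change discussed here exploits: the file never gets copied into the process heap, so "loading" becomes paging.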
I welcome all progress, but I don't see why these models aren't simply run behind a thin Python server that loads the model into memory once, so you can curl it instantly whenever you want.
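To make that concrete, here's a rough stdlib-only sketch of such a thin server; load_model and generate are stand-ins for whatever bindings you actually use -- the point is only that the expensive load happens once at startup:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    def load_model():
        # placeholder: load the weights here (the slow, once-only part)
        return object()

    MODEL = load_model()

    def generate(model, prompt):
        # placeholder: run inference against the in-memory model
        return "echo: " + prompt

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            prompt = self.rfile.read(length).decode()
            reply = generate(MODEL, prompt).encode()
            self.send_response(200)
            self.send_header("Content-Length", str(len(reply)))
            self.end_headers()
            self.wfile.write(reply)

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()

Then from anywhere: curl -d 'Hello' http://127.0.0.1:8080/ -- no model reload per request. My guess is most people are shelling out to the llama.cpp binary directly instead, which re-reads the weights on every invocation; that per-invocation load cost is exactly what the mmap change reduces.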
Can someone break this down? Since this seems to do inference without having the entire model loaded into memory, could this be a way to relax the memory requirements of the 65B model?
If you want to avoid Twitter, this issue discusses the changes:

https://github.com/ggerganov/llama.cpp/issues/91