Woo! This sounds like it will make it easier to run in normal mode (i.e., not interactive) and manage the chat history yourself, since there's less penalty for a full program reload. Currently my Perl IRC bot wrapper for llama.cpp just open2's the program in interactive mode (-i) and reads/writes llama.cpp's stdout/stdin, to get the load-time savings of having it manage history and keep state. In one-shot mode there'd still be the "extra" inference time of re-processing the full history each time instead of carrying state the way interactive mode does, but the model load time matters just as much.

For me personally this matters most because right now, when llama.cpp runs out of its 2048-token context, it segfaults, and that causes difficulties. In interactive mode, if it goes off the rails and generates 1000 tokens of nonsense, that nonsense eats up context for the next line from chat. In normal mode, where it runs once and all history has to be supplied manually, this can be avoided.
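For the curious, the wrapper approach is roughly the following -- a minimal Python sketch of the same open2-style idea (the binary path, model file, and reverse-prompt handling are illustrative assumptions, not my actual bot):

    import subprocess

    # Start llama.cpp once in interactive mode and keep talking to it over pipes,
    # the same way the Perl bot does with open2. Paths and flags are assumptions.
    proc = subprocess.Popen(
        ["./main", "-m", "models/7B/ggml-model-q4_0.bin",
         "-i", "-r", "User:", "-p", "Transcript of a chat.\nUser:"],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE,
        text=True, bufsize=1,
    )

    def ask(line):
        # Send one chat line, then read until the reverse prompt comes back,
        # i.e. until the model hands control back to us.
        proc.stdin.write(line + "\n")
        proc.stdin.flush()
        out = []
        while True:
            ch = proc.stdout.read(1)   # char-by-char: interactive output isn't line-buffered
            if not ch:                 # EOF: the child exited (e.g. the 2048-token segfault)
                break
            out.append(ch)
            if "".join(out).endswith("User:"):
                break
        return "".join(out)

    print(ask("hello from irc"))

The trade-off is exactly the one described above: the child process keeps the model and history in memory, but a runaway generation permanently eats context.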
"It's a ~200 LOC change that only took me a few hours and it worked the first time."<p>jart is on another plane of existence—and 100% earned this flex
What's the best way to download and get set up with this stuff atm? I.e., let's say I want to run the currently available variations of LLaMA -- 7B, 13B, and 30B [1] -- is there a current summary of how to acquire them, possibly quantize them, etc.? Would I download an already-quantized version or do it myself?

I ran Alpaca 7B Q4 almost instantly because they provided curl commands to download it. Super simple. But it seems most projects aren't doing that because it's prone to attracting Facebook's gaze. So... what's recommended?

I happened to find this [2], but I think those are the non-quantized raw models? Not sure yet.

[1]: Won't bother with 65B; it can't fit in memory, I believe.
[2]: https://github.com/shawwn/llama-dl/blob/main/llama.sh

Edit: I forgot about https://github.com/cocktailpeanut/dalai -- I suspect this is best in breed atm? Though a Docker container would be nice to wrangle all the dependencies.
There are some cool ideas in here. I've long been curious why people don't use mmap to re-use all those wonderful pages that got loaded (without reparsing the disk data).
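In case it helps, here's a minimal sketch of that idea in Python on a Unix-like system (the file name is just a placeholder): mapping the weights read-only means a second run finds the pages already warm in the OS page cache, and multiple processes share them instead of each making their own copy.

    import mmap, os

    # Map the weights file instead of read()ing it into the heap.
    # After the first run the pages live in the OS page cache, so a
    # second run "loads" nearly instantly. File name is a placeholder.
    fd = os.open("ggml-model-q4_0.bin", os.O_RDONLY)
    size = os.fstat(fd).st_size
    buf = mmap.mmap(fd, size, prot=mmap.PROT_READ)   # no copy, lazily paged in

    # Tensor data can be viewed directly out of `buf` at known offsets;
    # the kernel pages it in on demand and shares those pages across processes.
    header = buf[:4]
    print(len(buf), header)

That seems to be essentially what the change discussed here exploits: the file never gets copied into the process heap, so "loading" becomes paging.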
I welcome all progress, but I don't see why these models aren't simply run behind a thin Python server that loads the model into memory once, so you can curl it instantly whenever you want.
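To make that concrete, here's a rough stdlib-only sketch of such a thin server; load_model and generate are stand-ins for whatever bindings you actually use -- the point is only that the expensive load happens once at startup:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    def load_model():
        # placeholder: load the weights here (the slow, once-only part)
        return object()

    MODEL = load_model()

    def generate(model, prompt):
        # placeholder: run inference against the in-memory model
        return "echo: " + prompt

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            prompt = self.rfile.read(length).decode()
            reply = generate(MODEL, prompt).encode()
            self.send_response(200)
            self.send_header("Content-Length", str(len(reply)))
            self.end_headers()
            self.wfile.write(reply)

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()

Then from anywhere: curl -d 'Hello' http://127.0.0.1:8080/ -- no model reload per request. My guess is most people are shelling out to the llama.cpp binary directly instead, which re-reads the weights on every invocation; that per-invocation load cost is exactly what the mmap change reduces.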
Can someone break this down? Since this seems to do inference without having the entire model loaded into memory, could this be a way to relax the memory requirements of the 65B model?
If you want to avoid Twitter, this issue discusses the changes:

https://github.com/ggerganov/llama.cpp/issues/91