There was a recent thread on explaining Mamba <a href="https://news.ycombinator.com/item?id=39501982">https://news.ycombinator.com/item?id=39501982</a> (<a href="https://www.kolaayonrinde.com/blog/2024/02/11/mamba.html" rel="nofollow">https://www.kolaayonrinde.com/blog/2024/02/11/mamba.html</a>)<p>There was another thread on the same topic, probably a better explanation: <a href="https://news.ycombinator.com/item?id=39482428">https://news.ycombinator.com/item?id=39482428</a> (<a href="https://jackcook.com/2024/02/23/mamba.html" rel="nofollow">https://jackcook.com/2024/02/23/mamba.html</a>)
To those curious about the tradeoffs between transformer and state space model layers, I highly recommend Sasha Rush's video on it: <a href="https://www.youtube.com/watch?v=dKJEpOtVgXc" rel="nofollow">https://www.youtube.com/watch?v=dKJEpOtVgXc</a>
Has anyone gotten this to work on Linux using one or two 4090s? I get stuck at "Loading checkpoint shards: 71%" and then it bails. Weirdly, nvidia-smi shows plenty of VRAM available. My machine has 256 GB of RAM, so I don't think that's the problem either. Really excited to try this one.
It's great to see a full production-level model using Mamba. But when it comes to long-context benchmarks, I'd love to see accuracy as well as throughput. I was under the impression that Mamba has huge increases in throughput at the cost of modest losses in accuracy when using long contexts.
> Jamba boasts an extensive context window of 256K tokens, equivalent to around 210 pages of text, while fitting up to 140K tokens on a single 80GB GPU.<p>I realize this is a big improvement, but it’s striking how inefficient LLMs are: you need 80 GB of GPU memory to analyze less than 1 megabyte of data. That’s a lot of bloat! Hopefully there’s a lot of room for algorithmic improvements.
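To put rough numbers on that, here is a back-of-envelope sketch in Python. The layer count, head count, and head dimension are hypothetical placeholders, not Jamba's actual config; the point is that the raw text is well under a megabyte, while a full-attention KV cache over 140K tokens runs to tens of GB, and that cache (not the text) is what replacing most attention layers with Mamba layers shrinks.

    # Back-of-envelope only: KV cache vs. raw text size. Model weights are not
    # counted, and all dimensions are made-up placeholders, not Jamba's config.
    def kv_cache_bytes(seq_len, n_attn_layers, n_kv_heads, head_dim, bytes_per_elem=2):
        # K and V tensors per attention layer, fp16 -> 2 bytes per element
        return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

    tokens = 140_000                      # tokens said to fit on one 80 GB GPU
    raw_text = tokens * 4                 # very roughly ~4 bytes of text per token

    all_attention = kv_cache_bytes(tokens, n_attn_layers=32, n_kv_heads=8, head_dim=128)
    one_in_eight  = kv_cache_bytes(tokens, n_attn_layers=4,  n_kv_heads=8, head_dim=128)

    print(f"raw text:                ~{raw_text / 1e6:.1f} MB")
    print(f"KV cache, all attention: ~{all_attention / 1e9:.1f} GB")
    print(f"KV cache, 1-in-8 attn:   ~{one_in_eight / 1e9:.1f} GB")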
I’m pretty sure computational chemists have been combining NNs with Kalman filters for a while now… I recall the issue was that it was slow due to the N^2 size of the covariance matrix.
Jamba-v0.1-hybrid-MoE (16x6B?) is like giving a big NOS boost to a Mixtral-8x7B-tier LLM. If it really delivers a 256K context that is 3x longer, faster, and cheaper than anything else, it should mean an end to the One Model To Rule Them All mindset for now. The big boys will have to offer some version of it as a separate-but-close sidekick integration to their hero offering.
On a side note: working over longer contexts also reminds me of MemGPT (<a href="https://github.com/cpacker/MemGPT">https://github.com/cpacker/MemGPT</a>).
I think a similar concept could be applied to Mamba-architecture models too.
Does this mean that I can continue a chat without needing to send a full transcript? This feels like it could make inference a lot cheaper for multi-step dialogs.
Would a 192 GB RAM Mac Studio, or even a 7950X with 192 GB of RAM, be practical for running this model for inference and possibly fine-tuning? Especially if I don't need very low latency, e.g. 1 token per second is fine for inference. I also have two 3090s.
I'm glad we're seeing exploration into scaling post-transformer LLM architectures, but I'm disappointed that it <i>has</i> a context window. That was kind of the selling point of Mamba (and SSM models in general), right? Linear scaling because state + input = next_state + output?
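For reference, a minimal sketch of that recurrence (NumPy, tiny made-up dimensions; Mamba additionally makes the matrices input-dependent and computes this as a hardware-friendly scan). The point is that the state has a fixed size, so per-token compute and memory don't grow with how much has already been seen:

    import numpy as np

    d_state, d_in = 16, 4
    A = 0.95 * np.eye(d_state)                 # state transition (discretized)
    B = 0.10 * np.random.randn(d_state, d_in)  # input -> state
    C = 0.10 * np.random.randn(d_in, d_state)  # state -> output

    h = np.zeros(d_state)                      # fixed-size state: the whole "context"
    for x in np.random.randn(100_000, d_in):   # stream as many tokens as you like
        h = A @ h + B @ x                      # state + input -> next state
        y = C @ h                              # ...and the next output; O(1) per token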
Please link to the original post:<p><a href="https://www.ai21.com/blog/announcing-jamba" rel="nofollow">https://www.ai21.com/blog/announcing-jamba</a><p>Jamba looks <i>fabulous</i>. Good performance for its size <i>and</i> much more efficient than the available open alternatives.<p>The key idea: one out of every eight transformer blocks in Jamba applies dot-product attention with quadratic cost, while the other seven apply a Mamba layer with linear cost. And the entire model is a mixture of experts (MoE), so only ~12B parameters are active at once for inference.<p>Thank you to the folks at AI21 for making Jamba available!
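A toy sketch of that interleaving (PyTorch; the dimensions, layer count, the simple recurrent stand-in for Mamba, and the plain MLP in place of the sparse MoE are all illustrative assumptions, not AI21's code):

    import torch
    import torch.nn as nn

    class ToyRecurrentMixer(nn.Module):
        """Stand-in for a Mamba layer: fixed-size state, linear in sequence length."""
        def __init__(self, d_model, d_state=16):
            super().__init__()
            self.in_proj = nn.Linear(d_model, d_state)
            self.out_proj = nn.Linear(d_state, d_model)
            self.decay = nn.Parameter(torch.full((d_state,), 0.9))

        def forward(self, x):                      # x: (batch, seq, d_model)
            h = torch.zeros(x.size(0), self.decay.numel(), device=x.device)
            ys = []
            for t in range(x.size(1)):             # O(seq_len), constant state size
                h = self.decay * h + self.in_proj(x[:, t])
                ys.append(self.out_proj(h))
            return torch.stack(ys, dim=1)

    class Block(nn.Module):
        def __init__(self, d_model, use_attention):
            super().__init__()
            self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, 4, batch_first=True) if use_attention else None
            self.mixer = None if use_attention else ToyRecurrentMixer(d_model)
            # a real Jamba block would use a sparse MoE feed-forward here
            # (only ~12B of the total parameters active per token)
            self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                     nn.Linear(4 * d_model, d_model))

        def forward(self, x):
            h = self.norm1(x)
            h = self.attn(h, h, h, need_weights=False)[0] if self.attn else self.mixer(h)
            x = x + h
            return x + self.mlp(self.norm2(x))

    d_model, n_layers = 64, 16
    stack = nn.Sequential(*[Block(d_model, use_attention=(i % 8 == 0)) for i in range(n_layers)])
    print(stack(torch.randn(2, 32, d_model)).shape)   # only 2 of 16 blocks pay quadratic attention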
@dang this is blogspam for the official post: <a href="https://www.ai21.com/blog/announcing-jamba" rel="nofollow">https://www.ai21.com/blog/announcing-jamba</a>