There was a recent thread on explaining Mamba <a href="https://news.ycombinator.com/item?id=39501982">https://news.ycombinator.com/item?id=39501982</a> (<a href="https://www.kolaayonrinde.com/blog/2024/02/11/mamba.html" rel="nofollow">https://www.kolaayonrinde.com/blog/2024/02/11/mamba.html</a>)<p>There was another thread on the same topic, probably a better explanation: <a href="https://news.ycombinator.com/item?id=39482428">https://news.ycombinator.com/item?id=39482428</a> (<a href="https://jackcook.com/2024/02/23/mamba.html" rel="nofollow">https://jackcook.com/2024/02/23/mamba.html</a>)
To those curious about the tradeoffs between transformer and state space model layers, I highly recommend Sasha Rush's video on it: <a href="https://www.youtube.com/watch?v=dKJEpOtVgXc" rel="nofollow">https://www.youtube.com/watch?v=dKJEpOtVgXc</a>
Has anyone gotten this to work on Linux using one or two 4090s? I get stuck at "Loading checkpoint shards: 71%" and then it bails. Weirdly, nvidia-smi shows plenty of VRAM available. My machine has 256 GB of RAM, so I don't think that's the problem either. Really excited to try this one.
It's great to see a full production-level model using Mamba. But when it comes to long-context benchmarks, I'd love to see accuracy as well as throughput. I was under the impression that Mamba has huge increases in throughput at the cost of modest losses in accuracy when using long contexts.
> Jamba boasts an extensive context window of 256K tokens, equivalent to around 210 pages of text, while fitting up to 140K tokens on a single 80GB GPU.<p>I realize this is a big improvement, but it’s striking how inefficient LLMs are: you need 80 GB of GPU memory to analyze less than 1 megabyte of data. That’s a lot of bloat! Hopefully there’s a lot of room for algorithmic improvements.
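To put rough numbers on that, here is a back-of-envelope sketch in Python. The layer count, head count, and head dimension are hypothetical placeholders, not Jamba's actual config; the point is that the raw text is well under a megabyte, while a full-attention KV cache over 140K tokens runs to tens of GB, and that cache (not the text) is what replacing most attention layers with Mamba layers shrinks.

    # Back-of-envelope only: KV cache vs. raw text size. Model weights are not
    # counted, and all dimensions are made-up placeholders, not Jamba's config.
    def kv_cache_bytes(seq_len, n_attn_layers, n_kv_heads, head_dim, bytes_per_elem=2):
        # K and V tensors per attention layer, fp16 -> 2 bytes per element
        return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

    tokens = 140_000                      # tokens said to fit on one 80 GB GPU
    raw_text = tokens * 4                 # very roughly ~4 bytes of text per token

    all_attention = kv_cache_bytes(tokens, n_attn_layers=32, n_kv_heads=8, head_dim=128)
    one_in_eight  = kv_cache_bytes(tokens, n_attn_layers=4,  n_kv_heads=8, head_dim=128)

    print(f"raw text:                ~{raw_text / 1e6:.1f} MB")
    print(f"KV cache, all attention: ~{all_attention / 1e9:.1f} GB")
    print(f"KV cache, 1-in-8 attn:   ~{one_in_eight / 1e9:.1f} GB")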
I’m pretty sure computational chemists have been combining NNs with Kalman filters for a while now… I recall the issue was that it was slow due to the N^2 size of the covariance matrix.
Jamba-v0.1-hybrid-MoE (16x6B?) is like giving a big NOS boost to a Mixtral-8x7B-tier LLM. If it really delivers a 256K context that is 3x longer, faster, and cheaper than anything else, it should mean an end to the One Model To Rule Them All mindset for now. The big boys will have to offer some version of it as a separate-but-close sidekick integration to their hero offering.
On a side note: working over longer contexts also reminds me of MemGPT (<a href="https://github.com/cpacker/MemGPT">https://github.com/cpacker/MemGPT</a>).
I think a similar concept could be applied to Mamba-architecture models too.
Does this mean that I can continue a chat without needing to send a full transcript? This feels like it could make inference a lot cheaper for multi-step dialogs.
Would a 192 GB RAM Mac Studio, or even a 7950X with 192 GB of RAM, be practical for running this model for inference and possibly fine-tuning? Especially if I don't need very low latency, e.g. 1 token per second is fine for inference. I also have two 3090s.
I'm glad we're seeing exploration into scaling post-transformer LLM architectures, but I'm disappointed that it <i>has</i> a context window. That was kind of the selling point of Mamba (and SSM models in general), right? Linear scaling because state + input = next_state + output?
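For reference, a minimal sketch of that recurrence (NumPy, tiny made-up dimensions; Mamba additionally makes the matrices input-dependent and computes this as a hardware-friendly scan). The point is that the state has a fixed size, so per-token compute and memory don't grow with how much has already been seen:

    import numpy as np

    d_state, d_in = 16, 4
    A = 0.95 * np.eye(d_state)                 # state transition (discretized)
    B = 0.10 * np.random.randn(d_state, d_in)  # input -> state
    C = 0.10 * np.random.randn(d_in, d_state)  # state -> output

    h = np.zeros(d_state)                      # fixed-size state: the whole "context"
    for x in np.random.randn(100_000, d_in):   # stream as many tokens as you like
        h = A @ h + B @ x                      # state + input -> next state
        y = C @ h                              # ...and the next output; O(1) per token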
Please link to the original post:<p><a href="https://www.ai21.com/blog/announcing-jamba" rel="nofollow">https://www.ai21.com/blog/announcing-jamba</a><p>Jamba looks <i>fabulous</i>. Good performance for its size <i>and</i> much more efficient than the available open alternatives.<p>The key idea: one out of every eight transformer blocks in Jamba applies dot-product attention with quadratic cost, while the other seven apply a Mamba layer with linear cost. And the entire model is a mixture of experts (MoE), so only ~12B parameters are active at once for inference.<p>Thank you to the folks at AI21 for making Jamba available!
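A toy sketch of that interleaving (PyTorch; the dimensions, layer count, the simple recurrent stand-in for Mamba, and the plain MLP in place of the sparse MoE are all illustrative assumptions, not AI21's code):

    import torch
    import torch.nn as nn

    class ToyRecurrentMixer(nn.Module):
        """Stand-in for a Mamba layer: fixed-size state, linear in sequence length."""
        def __init__(self, d_model, d_state=16):
            super().__init__()
            self.in_proj = nn.Linear(d_model, d_state)
            self.out_proj = nn.Linear(d_state, d_model)
            self.decay = nn.Parameter(torch.full((d_state,), 0.9))

        def forward(self, x):                      # x: (batch, seq, d_model)
            h = torch.zeros(x.size(0), self.decay.numel(), device=x.device)
            ys = []
            for t in range(x.size(1)):             # O(seq_len), constant state size
                h = self.decay * h + self.in_proj(x[:, t])
                ys.append(self.out_proj(h))
            return torch.stack(ys, dim=1)

    class Block(nn.Module):
        def __init__(self, d_model, use_attention):
            super().__init__()
            self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, 4, batch_first=True) if use_attention else None
            self.mixer = None if use_attention else ToyRecurrentMixer(d_model)
            # a real Jamba block would use a sparse MoE feed-forward here
            # (only ~12B of the total parameters active per token)
            self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                     nn.Linear(4 * d_model, d_model))

        def forward(self, x):
            h = self.norm1(x)
            h = self.attn(h, h, h, need_weights=False)[0] if self.attn else self.mixer(h)
            x = x + h
            return x + self.mlp(self.norm2(x))

    d_model, n_layers = 64, 16
    stack = nn.Sequential(*[Block(d_model, use_attention=(i % 8 == 0)) for i in range(n_layers)])
    print(stack(torch.randn(2, 32, d_model)).shape)   # only 2 of 16 blocks pay quadratic attention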
@dang this is blogspam for the official post: <a href="https://www.ai21.com/blog/announcing-jamba" rel="nofollow">https://www.ai21.com/blog/announcing-jamba</a>