Was wondering: if I were to buy the cheapest hardware (e.g. a PC) to run Llama 2 70B for personal use at a reasonable speed, what would that hardware be? Any experience or recommendations?
Anything with 64GB of memory will run a quantized 70B model. What else you need depends on what counts as acceptable speed for you. With a decent CPU but no GPU assistance, expect output on the order of 1 token per second, and excruciatingly slow prompt ingestion. Any decent Nvidia GPU will dramatically speed up ingestion, but for fast generation you need 48GB of VRAM to fit the entire model. That means 2x RTX 3090 or better. That should generate faster than you can read.

Edit: the above is about PCs. Macs are much faster at CPU generation, but not nearly as fast as big GPUs, and their ingestion is still slow.
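For concreteness, here's a minimal sketch of full/partial GPU offload using llama-cpp-python; the model filename and layer setting are assumptions, so adjust them for whatever quantized file you actually have:

```python
# Minimal sketch with llama-cpp-python (pip install llama-cpp-python).
# Filename below is hypothetical -- point it at your own quantized 70B file.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-70b.Q4_K_M.gguf",  # assumed ~4-bit quantized weights
    n_ctx=4096,       # context window
    n_gpu_layers=-1,  # offload everything if you have ~48GB VRAM; use a smaller
                      # number to partially offload and keep the rest on CPU
)

out = llm("Q: What is the capital of France? A:", max_tokens=64)
print(out["choices"][0]["text"])
```

With no GPU offload at all you get the ~1 tok/s CPU experience described above; the more layers you can fit in VRAM, the closer you get to reading speed.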
I built a DIY PC with used GPUs (2x RTX 3090) for around 2300€ earlier this year. You can probably do it for slightly less now (I also added 128GB RAM and NVLink). You can generate text at >10 tok/s with that setup.
Make sure to get a PSU with more than 1000W.
Air cooling is a challenge, but it's possible.

Recommended reading: Tim Dettmers's GPU guide: https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/
If you have a lot of money (but not H100/A100 money), get 4090s, as they're currently the best bang for your buck on the CUDA side (according to George Hotz). If you're broke, get multiple second-hand 3090s: https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/. If you're unwilling to spend any money at all and just want to play around with Llama 70B, look into Petals: https://github.com/bigscience-workshop/petals
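Petals exposes a transformers-style Python API; this is roughly the usage from their README, but the exact imports and model name are from memory, so double-check the repo:

```python
# Rough sketch of Petals usage -- treat model name and imports as assumptions.
import torch
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "meta-llama/Llama-2-70b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Weights are served by other peers in the swarm; only a small part runs locally.
model = AutoDistributedModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)

inputs = tokenizer("A recipe for a good homelab:", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```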
We bought an A6000 48GB (as mentioned by someone else) and it works great for $3800. The power requirements are modest compared to consumer GPUs, too. We looked at the Ada version, but even used they cost a lot more, and you're buying speed, not usability. I would rather buy another A6000 and have 96GB of VRAM to fine-tune with. That's just me, though; everyone needs to rank their needs against what they can afford.
A 192GB Mac Studio should be able to run an unquantized 70B, and I think it would cost less than a multi-GPU setup built from Nvidia cards. I haven't actually done the math, though.
If you factor in electricity costs over a long enough time period, the Mac might come out even cheaper!
The only info I can provide is the table I've seen at https://github.com/jmorganca/ollama, where it states one needs "32 GB to run the 13B models." I would assume you may need a GPU for this.

Related: could someone please point me in the right direction on how to run Wizard Vicuna Uncensored or Llama 2 13B locally on Linux? I've been searching for a guide and haven't found anything suitable for a beginner like myself. In the GitHub repo I referenced, the download is only for Mac at the moment. I have a MacBook Pro M1 I can use, though it's running Debian.

Thank you.
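Not a full guide, but here's a minimal CPU-only starting point on Linux via llama-cpp-python. The repo and filename below are just one plausible quantized 13B build; treat them as assumptions and pick whichever build you trust:

```python
# pip install llama-cpp-python huggingface_hub
# Downloads an assumed quantized 13B chat model and runs it on CPU only.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

path = hf_hub_download(
    repo_id="TheBloke/Llama-2-13B-chat-GGUF",   # assumed repo name
    filename="llama-2-13b-chat.Q4_K_M.gguf",    # assumed quantized file
)

llm = Llama(model_path=path, n_ctx=2048)  # no GPU offload: runs on CPU and RAM
print(llm("Explain what quantization does to a model:", max_tokens=128)["choices"][0]["text"])
```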
I've been able to run it fine using llama.cpp on my 2019 iMac with 128GB of RAM. It's not super fast, but it works fine for "send it a prompt, look at the reply a few minutes later", and all it cost me was a few extra sticks of RAM.
You can run on CPU and regular RAM, but a GPU is quite a bit faster.

You need about a gig of RAM/VRAM per billion parameters (plus some headroom for a context window). Lower precision doesn't really affect quality.

When Ethereum flipped from proof of work to proof of stake, a lot of used high-end cards hit the market.

Four of them in a cheap server would do the trick. It would be a great business model for some cheap colo to stand up a crap-ton of those and rent whole servers to everyone here.

In the meantime, if you're interested in a cheap server as described above, post in this thread.
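A quick back-of-the-envelope check of that rule of thumb (the GB-per-billion figure corresponds to roughly 8-bit weights; 4-bit quantization halves it again). Numbers are approximate, overhead is a guess:

```python
# Rough memory estimate: weights + a few GB of headroom for KV cache / context.
def model_gb(params_billion: float, bits_per_weight: int, overhead_gb: float = 4.0) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9 + overhead_gb

for bits in (16, 8, 4):
    print(f"70B at {bits}-bit: ~{model_gb(70, bits):.0f} GB")
# ~144 GB at fp16, ~74 GB at 8-bit, ~39 GB at 4-bit -- which is why 48GB of VRAM
# or 64GB of system RAM is the usual target for a quantized 70B.
```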
I don't think it is the cheapest, but the tinybox is an option: https://tinygrad.org/
I feel as if the cheapest way of running these kinds of models would be to keep the whole cache/memory on the hard drive rather than in RAM. Then you could just use CPU power instead of splurging thousands on RAM and a GPU with enough VRAM.

It might or might not run at reasonable speeds, but I'd reason that it avoids a kind of sunk-cost irony: deciding, at any point, that ChatGPT would have sufficed for your task. It's rare, but it can happen.

If you want to take this silly logic further, you can theoretically run any sized model on any computer. You could even attempt this dumb idea on a computer running Windows 95. I don't care how long it would take; if it takes seven and a half million years for 42 tokens, I would still call it a success!
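For what it's worth, llama.cpp already leans in this direction: by default it memory-maps the weight file, so the OS pages weights in from disk as needed instead of requiring everything in RAM up front. A minimal sketch with llama-cpp-python (path is hypothetical, and expect it to be painfully slow if the model doesn't fit in RAM):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-70b.Q4_K_M.gguf",  # hypothetical path
    use_mmap=True,    # memory-map the weights instead of copying them into RAM
    use_mlock=False,  # don't pin pages, so the OS is free to evict them back to disk
)
```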
If it's only for a short time, use a price calculator to decide whether it's worth renting GPUs from a cloud provider. You can get immediate, temporary access to far more computing power than you could ever hope to buy outright.
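For example, a toy break-even calculation; every number here is a placeholder, so plug in real quotes before deciding:

```python
# All prices are assumptions, not quotes.
hours_needed = 200   # assumed amount of experimentation time
cloud_rate = 2.00    # assumed $/hour for a 48GB-class GPU instance
buy_price = 2300     # roughly the used 2x3090 build mentioned earlier in the thread

rent_cost = hours_needed * cloud_rate
print(f"Renting: ${rent_cost:.0f} vs buying: ${buy_price}")
print("Break-even at", buy_price / cloud_rate, "hours (ignoring electricity and resale value)")
```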