I did something similar but using a K80 and an M40 I dug up from eBay for pennies. Be advised though: stay as far away as possible from the K80 - the drivers were one of the most painful tech things I've ever had to endure, even if 24GB of VRAM for 50 bucks sounds incredibly appealing. That said, I had a decent-ish HP workstation lying around with a 1200-watt power supply, so I had somewhere to put the two cards. The one thing to note here is that these types of GPUs have no cooling of their own. My solution was to 3D print a bunch of brackets, attach several Noctua fans, and have them blow at full speed 24/7. Surprisingly it worked way better than I expected - I've never gone above 60 degrees. As a side effect, the CPUs also benefit from this hack: at idle they sit in the mid-20s. Mind you, the Noctua fans are located on the front and the back of the case: the ones on the front act as intake, the ones on the back as exhaust, and there are two more inside the case stuck right in front of the GPUs.

The workstation was refurbished for just over 600 bucks, plus another 120 bucks for the GPUs and another ~60 for the fans.

Edit: and before someone asks - no, I have not uploaded the STLs anywhere, partly because I haven't had the time and partly because this is a very niche use case, though I might: the back (exhaust) bracket came out brilliant on the first try - it was a sub-millimeter fit. Then I got cocky, thought I'd also nail the intake on the first try, and ended up re-printing it 4 times.
For the same price ($1799) you could buy a Mac Mini with 48GB of unified memory and an M4 Pro. It would probably use less power, be much quieter to run, and likely outperform this setup in tokens per second. I enjoyed the write-up still, but I would probably just buy a Mac in this situation.
I’d really love to build a machine for local LLMs. I’ve tested models on my MBP M3 Max with 128GB of RAM and it’s really cool, but I’d like a dedicated local server. I’d also like an excuse to play with Proxmox, as I’ve just run raw Linux servers or UnRaid w/ containers in the past.

I have OpenWebUI and LibreChat running on my local “app server” and I’m quite enjoying that, but every time I price out a beefier box I feel like the ROI just isn’t there, especially for an industry that is moving so fast.

Privacy is not something to ignore at all, but the cost of inference online is very hard to beat, especially when I’m still learning how best to use LLMs.
The thing is, though: the locally hosted models on such hardware are cute toys, and sure, they write funny jokes and, importantly, perform private tasks I would never consider passing to non-self-hosted models, but they pale in comparison to the models accessible over APIs (Claude 3.5 Sonnet, OpenAI, etc.).
If I could run deepseek-r1-671b locally without breaking the bank, I would. But for now, opex > capex at the consumer level.
The author mentions it but I want to expand on it: Apple is a seriously good option here, specifically the M4 Mac Mini.

What makes Apple attractive is (as the author mentions) that RAM is shared between main and video RAM, whereas Nvidia is quite intentionally segmenting the market and charging huge premiums for high-VRAM cards. Here are some options:

1. Base $599 Mac Mini: 16GB of RAM. Stocked in store.

2. $999 Mac Mini: 24GB of RAM. Stocked in store.

3. Add RAM to either of the above up to 32GB. It's not cheap at $200/8GB, but you can buy a Mac Mini with 32GB of shared RAM for $999, substantially cheaper than the author's PC build but with less storage (although you can upgrade that too).

4. M4 Pro: $1399 w/ 24GB of RAM. Stocked in store. You can customize this all the way to 64GB of RAM for +$600, so $1999 in total. That is amazing value for this kind of workload.

5. The Mac Studio is really the ultimate option. Way more cores, and you can go all the way to 192GB of unified memory (for a $6000 machine). The problem here is that the Mac Studio is old, still on the M2 architecture. An M4 Ultra update is expected sometime this year, possibly late this year.

6. You can get into clustering these (e.g. [1]).

7. There are various MacBook Pro options, the highest of which is a 16" MacBook Pro with 128GB of unified memory for $4999.

But the main takeaway is that the M4 Mac Mini is fantastic value.

Some more random thoughts:

- Some Mac Minis have Thunderbolt 5 ("TB5"), which is up to either 80Gbps or 120Gbps bidirectional (I've seen it quoted as both);

- Mac Minis have the option of 10GbE (+$200);

- The Mac Mini has 2 USB3 ports and either 3 TB4 or 3 TB5 ports.

[1]: https://blog.exolabs.net/day-2/
The middle ground is to rent a GPU VPS as needed. You can get an H100 for $2/h. Not quite the same privacy as fully local offline, but better than a SaaS API and good enough for me. Hopefully in a year or three it will truly be cost-effective to run something useful locally, and then I can switch.
I was wondering if anyone here has experimented with running a cluster of SBCs for LLM inference? E.g. the Radxa ROCK 5C has 32GB of memory plus an NPU and only costs about 300 euros. I'm not super up to date on the architecture of modern LLMs, but as far as I understand you should be able to split the layers between multiple nodes? It is not that much data that needs to be sent between them, right? I guess you won't get quite the same performance as a modern Mac or Nvidia GPU, but it could be quite acceptable and possibly a cheap way of getting a lot of memory.

On the other hand, I am wondering what the state of the art is in CPU + GPU inference. Prompt processing is both compute and memory constrained, but I think token generation afterwards is mostly memory bound. Are there any tools that support loading a few layers at a time into a GPU for initial prompt processing and then switch to CPU inference for token generation? Last time I experimented it was possible to run some layers on the GPU and some on the CPU, but to me it seems more efficient to run everything on the GPU initially (a few layers at a time, so they fit in VRAM) and then switch to the CPU for the memory-bound token generation.
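For reference, the static split I'm describing looks roughly like this in llama-cpp-python (the GGUF file name and layer count are placeholders; as far as I know there's no tool that does the dynamic GPU-then-CPU handoff sketched above):

    # Static CPU/GPU split: the first n_gpu_layers layers run on the GPU,
    # everything else stays on the CPU.
    from llama_cpp import Llama

    llm = Llama(
        model_path="some-model-q4_k_m.gguf",  # placeholder local GGUF file
        n_gpu_layers=20,   # tune until VRAM is full; -1 offloads all layers
        n_ctx=4096,
    )

    out = llm("Briefly explain pipeline vs. tensor parallelism.", max_tokens=128)
    print(out["choices"][0]["text"])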
Great breakdown! The "own your own AI" at home approach is a terrific hobby if you like to tinker, but you are going to spend a ton of time and money on hardware that will be underutilized most of the time. If you want to go nuts, check out Mitko Vasilev's dream machine. It makes no sense unless you have a very clear use case that only needs small models or can live with really slow token generation speeds.

If the goal, however, is not to tinker but to really build and learn AI, it is going to be financially better to rent GPUs/TPUs as needs arise.
2 x Nvidia Tesla P40 cards for €660 is not something I consider to be "on a budget".

People can play with "small" or "medium" models on less powerful and cheaper cards. An Nvidia GeForce RTX 3060 with "only" 12GB of VRAM can be found for around €200-250 on the second-hand market (and around €300-350 new).

In my opinion, 48GB of VRAM is overkill to call it "on a budget"; for me this setup is nice, but it's for semi-professional or professional usage.

There is of course a trade-off in using medium or small models, but being "on a budget" is also about making trade-offs.
As others have said, a high powered Mac could be used for the same purpose at a comparable price and lower power usage. Which makes me wonder: why doesn't Apple get into the enterprise AI chip game and compete with Nvidia? They could design their own ASIC for it with all their hardware & manufacturing knowledge. Maybe they already are.
The problem for me with making such an investment is that next month a better model will be released. It will require either more or less RAM than the current best model, making it either not runnable or leaving an expensive, overbuilt machine underused.

Using cloud infrastructure should help with this issue. It may cost much more per run, but money can be saved if usage is intermittent.

How are HN users handling this?
Pay attention to IO bandwidth if you’re building a machine with multiple GPUs like this!

In this setup the model is sharded between cards, so data must be shuffled through a PCIe 3.0 x16 link, which is limited to ~16 GB/s max. For reference, that's an order of magnitude lower than the ~350 GB/s memory bandwidth of the Tesla P40 cards being used.

The author didn't mention NVLink, so I'm presuming it wasn't used, but I believe these cards would support it.

Building on a budget is really hard. In my experience 5-15 tok/s is a bit too slow for use cases like coding, but I admit once you've had a taste of 150 tok/s it's hard to go back (I've been spoiled by an RTX 4090 with vLLM).
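To put rough numbers on the tok/s point, here's the napkin math I use (all figures are assumptions for a hypothetical Q4-quantized ~70B model, not measurements from this build):

    # Back-of-the-envelope limits for token generation on 2x P40.
    total_weights_gb = 40   # assumed size of a Q4-ish 70B model, split across cards
    vram_bw_gbs = 350       # Tesla P40 memory bandwidth, roughly
    pcie_gbs = 16           # PCIe 3.0 x16, roughly

    # Each generated token streams the full weight set out of VRAM once,
    # so memory bandwidth sets a ceiling on tokens/second:
    s_per_token = total_weights_gb / vram_bw_gbs
    print(f"VRAM-bound ceiling: ~{1 / s_per_token:.0f} tok/s ({s_per_token * 1000:.0f} ms/token)")

    # For comparison, anything forced across the PCIe link moves ~20x slower:
    print(f"Moving the same 40 GB over PCIe 3.0 x16: ~{total_weights_gb / pcie_gbs:.1f} s")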
> Another important finding: Terry is by far the most popular name for a tortoise, followed by Turbo and Toby. Harry is a favorite for hares. All LLMs are loving alliteration.

Mode collapse. One reason the tuned (or tuning-contaminated) models are bad for creative writing: every protagonist and place seems to be named the same thing.
This article is coming out at an interesting time for me.

We probably have different definitions of "budget", but I just ordered a super janky eGPU setup for my very dated 8th-gen Intel NUC, with an M.2-to-PCIe adapter, a PSU, and a refurbished Intel A770, for about $350 all-in - not bad considering that's about the cost of a proper Thunderbolt eGPU enclosure alone.

The overall idea: the A770 seems like a really good budget LLM GPU since it has more memory (16GB) and more memory bandwidth (512GB/s) than a 4070, but costs a tiny fraction. The M.2-to-PCIe adapter should also give it a bit more bandwidth to the rest of the system than Thunderbolt, so hopefully it'll make for a decent gaming experience too.

If the eGPU part of the setup doesn't work out for some reason, I'll probably just bite the bullet, order the rest of the PC for a couple hundred more, and return the adapter (I got it off Amazon instead of AliExpress specifically so I could do this), looking to end up somewhere around 600 bux total. I think that's probably a more reasonable price of entry for something like this for most people.

Curious if anyone else has experience with the A770 for LLMs? I've been looking at Intel's https://github.com/intel/ipex-llm project and it looked pretty promising; that's what made me pull the trigger in the end. Am I making a huge mistake?
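For what it's worth, my reading of the ipex-llm README suggests usage roughly like the sketch below. I haven't been able to run it yet (the hardware hasn't arrived), so treat the import path, the load_in_4bit flag, and the "xpu" device handling as assumptions on my part:

    # Rough, untested sketch of ipex-llm on an Arc GPU: load a Hugging Face
    # model with 4-bit quantization, move it to the Intel "xpu" device, generate.
    import torch
    from transformers import AutoTokenizer
    from ipex_llm.transformers import AutoModelForCausalLM  # assumed import path

    model_id = "Qwen/Qwen2.5-7B-Instruct"  # placeholder model
    model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)
    model = model.to("xpu")
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    inputs = tokenizer("What is an eGPU good for?", return_tensors="pt").to("xpu")
    with torch.inference_mode():
        output = model.generate(inputs.input_ids, max_new_tokens=64)
    print(tokenizer.decode(output[0], skip_special_tokens=True))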
I doubt it is that efficient. Even though it has 48GB of VRAM, it's more than twice as slow as a single 3090.

In my budget AI setup I use a Ryzen 7840-based mini PC with a USB4 port and connect a 3090 to it via an eGPU adapter (ADT-Link UT3G). It cost me about $1000 total, and I can easily achieve 35 t/s with qwen2.5-coder-32b using ollama.
From the article:

> In the future, I fully expect to be able to have a frank and honest discussion about the Tiananmen events with an American AI agent, but the only one I can afford will have assumed the persona of Father Christmas who, while holding a can of Coca-Cola, will intersperse the recounting of the tragic events with a joyful "Ho ho ho... Didn't you know? The holidays are coming!"

How unfortunate that people are discounting the likelihood that American AI agents will avoid saying things their masters think should not be said.
Anyone want to take bets on when the big 3 (OpenAI, Meta, and Google) will quietly remove anything to do with DEI, trans people, or global warming?
They'll start out changing all mentions of "Gulf of Mexico" to "Gulf of America", but then what?
Does using 2x24GB VRAM mean that the model can be fully loaded into memory if it's between 24 and 48 GB in size? I somehow doubt it, at least ollama wouldn't work like that I think. But does anyone know?
How far does this money go in cloud compute or OpenAI/Anthropic API credits? For ~$1700 I can accomplish way more without local hardware. Don't get me wrong, I enjoy tinkering and building projects like this - but it doesn't make financial sense to me here. Unless of course you live 100% off-grid and have Stallman-level privacy concerns.

Of course I do want my own local GPU compute setup, but the juice just isn't worth the squeeze.
A lot of people build personal deep learning machines. The economics and convenience can definitely work out... I am confused, however, by "dummy GPU" - I searched for "dummy" for an explanation but didn't find one. Most modern motherboards/CPUs include integrated graphics, so I'm not sure what this would be for.

My personal DL machine has a 24-core CPU, 128GB RAM, 2 x 3060 GPUs, and 2 x 2TB NVMe drives in a RAID 1 array. I <3 it.
This is just a limited recreation of the ancient mikubox from https://rentry.org/lmg-build-guides

It's funny to see people independently "discover" these builds that are a year-plus old.

Everyone is sleeping on these guides, but I guess the stink of 4chan scares people away?
One reason to bother with private AI: cloud AI ToS for consumers may have legal clauses about usage of prompt and context data, e.g. data that is not already on the Internet. Enterprise customers can exclude their data from future training.

https://stratechery.com/2025/deep-research-and-knowledge-value/

> Unless, of course, the information that matters is not on the Internet. This is why I am not sharing the Deep Research report that provoked this insight: I happen to know some things about the industry in question — which is not related to tech, to be clear — because I have a friend who works in it, and it is suddenly clear to me how much future economic value is wrapped up in information not being public. In this case the entity in question is privately held, so there aren’t stock market filings, public reports, barely even a webpage! And so AI is blind.

(edited for clarity)
I bought an M4 Mac Mini (the cheapest one) at Costco for $559, and while I don't know exactly how many tokens per second, it seems to generate text from llama 3.2 (through ollama) as fast as ChatGPT.
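If you want an actual number, ollama's HTTP API reports token counts and timings; a quick way to estimate tokens per second (assuming the default local port, 11434):

    # Ask the local ollama server for a completion and compute tokens/second
    # from the eval counters it returns.
    import requests

    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.2",
              "prompt": "Write a limerick about a Mac Mini.",
              "stream": False},
    ).json()

    # eval_count = generated tokens, eval_duration = generation time in nanoseconds
    tps = r["eval_count"] / (r["eval_duration"] / 1e9)
    print(f"{tps:.1f} tokens/s")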
I can run the 17B DeepSeek models (I know, these smaller ones are not actually DeepSeek) on my old 1080 Ti gaming desktop with 64GB of RAM. Not exactly speedy, but pretty neat nonetheless.
You can run 32B and even 70B (a bit slow) models on an M4 Pro Mac Mini with 48GB of RAM, out of the box using Ollama. If you enjoy putting together a desktop, that's understandable.

https://deepgains.substack.com/p/running-deepseek-locally-for-free