
Training AI models might not need enormous data centres

90 points by jkuria 4 months ago

8 comments

jkuria 4 months ago
https://archive.is/kRfd2
openrisk 4 months ago
Open-source public models trained on kosher data are substantially de-risking the AI hype. It makes a lot of sense to push this approach as far as it can go. It's similar to SETI@home etc., but potentially with far more impact.
gnabgib 4 months ago
Related:

New Training Technique for Highly Efficient AI Methods (2 points, 5 hours ago) https://news.ycombinator.com/item?id=42690664

DiLoCo: Distributed Low-Communication Training of Language Models (46 points, 1 year ago, 14 comments) https://news.ycombinator.com/item?id=38549337
aimanbenbaha 4 months ago
This bottleneck is exactly why open source has been handed a golden opportunity to lead the training of cutting-edge models.

Federated learning lowers the barrier to entry and expands the ecosystem, letting more participants share compute and/or datasets so that small players can train models.

DiLoCo, introduced by Douillard, minimizes communication overhead by averaging weight updates. What the article misses, though, is that each GPU in the distributed cluster still needs enough VRAM to hold a full copy of the model to complete training. That's where DisTrO comes in: it reduces inter-GPU communication even further with a decoupling technique (DeMo) that shares only the fast-moving parts of the optimizer across the GPU cluster.

> And what if the costs could drop further still? The dream for developers pursuing truly decentralised ai is to drop the need for purpose-built training chips entirely. Measured in teraflops, a count of how many operations a chip can do in a second, one of Nvidia's most capable chips is roughly as powerful as 300 or so top-end iPhones. But there are a lot more iPhones in the world than gpus. What if they (and other consumer computers) could all be put to work, churning through training runs while their owners sleep?

This aligns with DisTrO's techniques: according to its authors, it could also allow consumer devices such as desktop gaming PCs to join the compute cluster and share workloads. There is also an open-source implementation called exo that splits models across idle local devices, but it is limited to inference.

It might still be relevant, since the article mentions that DiLoCo made the model respond better to instruction prompts and reasoning questions never encountered during pre-training, and Arthur seems to think test-time training will make his approach the norm.

Sources:
DisTrO: https://github.com/NousResearch/DisTrO
DeMo: https://arxiv.org/pdf/2411.19870
Exo: https://github.com/exo-explore/exo
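A minimal sketch of the averaging idea described above (long runs of local training followed by infrequent exchange of averaged weight deltas), assuming PyTorch and a toy classification loss; the function name, hyperparameters, and the simplified outer step are illustrative rather than the paper's exact recipe:

```python
# Hypothetical sketch of one DiLoCo-style communication round: each worker
# trains locally for many steps, then only the averaged weight delta
# ("pseudo-gradient") is exchanged and applied as an outer update.
import copy
import itertools
import torch
import torch.nn.functional as F

def diloco_round(global_model, worker_loaders, inner_steps=500,
                 inner_lr=1e-4, outer_lr=0.7):
    global_params = [p.detach().clone() for p in global_model.parameters()]
    avg_delta = [torch.zeros_like(p) for p in global_params]

    for loader in worker_loaders:               # each worker / GPU island
        local = copy.deepcopy(global_model)     # still needs the full model in VRAM
        opt = torch.optim.AdamW(local.parameters(), lr=inner_lr)
        batches = itertools.cycle(loader)
        for _ in range(inner_steps):            # many local steps, no communication
            x, y = next(batches)
            loss = F.cross_entropy(local(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
        # Accumulate this worker's share of the averaged pseudo-gradient.
        for d, gp, lp in zip(avg_delta, global_params, local.parameters()):
            d += (gp - lp.detach()) / len(worker_loaders)

    # Outer step: move the global weights along the averaged delta.
    with torch.no_grad():
        for p, gp, d in zip(global_model.parameters(), global_params, avg_delta):
            p.copy_(gp - outer_lr * d)
    return global_model
```

Note that each worker still materializes a full replica of the model, which is the VRAM limitation mentioned above; DisTrO/DeMo target the communication term, not that memory requirement.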
m3kw9 4 months ago
It's talking about training 10B-parameter "capable models" with less compute using new techniques, but top models will always need more.
whazor 4 months ago
You could consider an LLM a very lossy compression artifact: terabytes of input data end up in a model under 100 gigabytes. It is quite remarkable what such a model can do, even fabricating new output that was not in the input data.

However, in my naïveté, I wonder whether vastly simpler algorithms could be used to end up with similar results. Regular compression techniques run at speeds of up to 700 MB/s.
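A rough back-of-the-envelope check of that ratio, with illustrative round numbers (the exact corpus and checkpoint sizes are assumptions, not measurements):

```python
# Illustrative compression-ratio arithmetic for the "lossy compression" framing.
corpus_bytes = 10e12   # assume ~10 TB of training text
model_bytes = 100e9    # checkpoint "under 100 gigabytes", per the comment
ratio = corpus_bytes / model_bytes
print(f"effective 'compression' ratio: ~{ratio:.0f}x")   # ~100x
# For comparison, lossless compressors like gzip typically manage roughly
# 3-5x on text, so the analogy only holds if you accept very lossy recall.
```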
neom 4 months ago
I think the problem is that people are going to start playing with this: everyone is going to train in their own things, businesses are going to want to train different architectures for different business functions, etc. I went on my first real adventure with training last night; $3,200 and a lot of fun later (whoops), the tooling has become very easy to use, and I presume it will only get easier. If I want to train in even, say, 10-ish gigs, wouldn't I still want a data centre, even with a powerful laptop or DiLoCo? It seems unlikely that DiLoCo is enough.

(Edit: I may also not be accounting enough for using a pre-trained general model next to a fine-tuned specialized model?)
FrustratedMonky 4 months ago
Are there no lessons from protein folding that could be used here?

There was a distributed protein-folding project a couple of decades ago. I remember there were even protein-folding apps that could run on game consoles when they weren't being used for games.

But maybe protein-folding code is more parallelizable across machines than AI models are.