Apple downloads ~45 TB of models per day from our S3 bucket

553 points by julien_c over 5 years ago

34 comments

cs702 over 5 years ago
"Almost everyone" working on NLP uses one of Hugging Face's pretrained models at one point or another: https://github.com/huggingface/pytorch-transformers
It's so damn convenient, and so nicely done.
And they keep doing neat things like this one: https://github.com/huggingface/swift-coreml-transformers
Kudos to Julien Chaumond et al. for their work!
nurettin over 5 years ago
This is probably Apple's continuous integration tests, lazily written to download the whole thing every time someone merges a commit.
CobrastanJorji over 5 years ago
If you host large, publicly available data in a cloud blob service but don't have a budget for it, one option is the "Requester Pays" feature that Amazon and Google provide. The data stays available for anyone to download, but they pay the download cost themselves.
The tradeoff is that your data becomes significantly more irritating to access: it's no longer just a matter of plugging a URL into a program, and everyone who wants your dataset needs to set up a billing account with Amazon or Google.
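A minimal sketch of what Requester Pays might look like on the S3 side, assuming the third-party boto3 library and working AWS credentials. The bucket, key, and destination names passed in are hypothetical; the helper names are made up for illustration. boto3 is imported lazily so the pure helper can be used without it.

```python
from typing import Any, Dict


def requester_pays_extra_args() -> Dict[str, Any]:
    """Extra arguments a downloader sends to accept the transfer charges."""
    return {"RequestPayer": "requester"}


def enable_requester_pays(bucket: str) -> None:
    """Bucket owner side: switch the bucket to Requester Pays."""
    import boto3  # third-party; assumed to be installed and configured

    s3 = boto3.client("s3")
    s3.put_bucket_request_payment(
        Bucket=bucket,
        RequestPaymentConfiguration={"Payer": "Requester"},
    )


def download_as_requester(bucket: str, key: str, dest: str) -> None:
    """Downloader side: fetch an object and agree to pay for the transfer."""
    import boto3  # third-party; assumed to be installed and configured

    s3 = boto3.client("s3")
    s3.download_file(bucket, key, dest, ExtraArgs=requester_pays_extra_args())
```

This also illustrates the tradeoff described above: the downloader needs authenticated AWS credentials, so a plain anonymous URL no longer works.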
dharmon over 5 years ago
I don't have high hopes for his business prospects if this is how he handles one of the richest companies in the world clearly having a high need for something his company offers.
Maybe spend less time on Twitter and more on your business model?
emeraldd over 5 years ago
This looks kind of interesting:
https://github.com/huggingface/pytorch-pretrained-BigGAN/blob/6ae20a35a051816d66811d85597033623a8ac888/pytorch_pretrained_biggan/model.py#L29
When you look further down you find:
https://github.com/huggingface/pytorch-pretrained-BigGAN/blob/6ae20a35a051816d66811d85597033623a8ac888/pytorch_pretrained_biggan/model.py#L253-L281
And that's just a quick search for s3 in the repo. It would not surprise me in the least to discover a `from_pretrained` that points at one of the s3 resources being pulled. There's probably other stuff like that in the code that could be causing equally nasty heartache, especially if non-persistent containers are involved...
(This is a WAG, aka Wild A** Guess)
EDIT: Dug a little more and found:
https://github.com/search?q=org%3Ahuggingface+s3&type=Code
Unless I'm mistaken here, there's a crap ton of code that could be downloading models at runtime... which seems significantly less than ideal...
btown over 5 years ago
A brief reminder: Whenever you publish code or documentation that might be used/scraped by the outside world, ALWAYS use a domain you own. If you're on Cloudflare you can instantly (and for free) create Page Rules to use Cloudflare as a CDN, redirect to another CDN, or black-hole or reroute traffic anywhere you want.
lacker over 5 years ago
Well, you could contact them and make a very-likely-to-succeed case that they should pay you some money, or you could complain about it on Twitter.
paxys over 5 years ago
That's about $4000/month in bandwidth costs, assuming retail pricing.
FYI he is bragging, not complaining. There are a dozen ways to reduce or eliminate this problem.
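For a rough sense of scale, the arithmetic can be sketched as below. The per-GB rate is an assumption (a flat $0.09/GB, in the neighborhood of first-tier list pricing for S3 internet egress; real pricing is tiered and often negotiated), and 1 TB is taken as 1,000 GB. The 45 TB/day volume comes from the thread title.

```python
GB_PER_TB = 1000  # decimal terabytes; an assumption


def egress_cost_usd(tb_transferred: float, usd_per_gb: float) -> float:
    """Cost of transferring `tb_transferred` terabytes at a flat per-GB rate."""
    return tb_transferred * GB_PER_TB * usd_per_gb


ASSUMED_USD_PER_GB = 0.09  # assumed flat rate, not a quoted price

daily = egress_cost_usd(45, ASSUMED_USD_PER_GB)
monthly = egress_cost_usd(45 * 30, ASSUMED_USD_PER_GB)

print(f"per day:     ${daily:,.0f}")
print(f"per 30 days: ${monthly:,.0f}")
```

Under these assumptions the per-day figure lands near $4,000, which suggests the number quoted above corresponds to a daily rather than a monthly bill at flat list price. Tiered or negotiated rates would lower both figures.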
ebg13 over 5 years ago
If you don't want someone else to do something that costs you money, you're going to have a bad time if you don't prevent them from doing it.
alphagrep12345 over 5 years ago
What does Hugging Face do? Do they implement models from papers and make them available for free?
rhacker over 5 years ago
I'm guessing someone at Apple internally distributed a Dockerfile that pulls that down.
fitzroy over 5 years ago
In a few weeks he can just point Apple's IP range to a shared iCloud folder.
StreamBright over 5 years ago
Requester Pays is the feature they are looking for.
https://docs.aws.amazon.com/AmazonS3/latest/dev/configure-requester-pays-console.html
hi41 over 5 years ago
I read the Twitter post but did not understand what is happening. Can someone please explain?
cpach over 5 years ago
Isn’t this a use case where BitTorrent would shine?
peterwwillis over 5 years ago
If your CI/CD is re-downloading and re-building everything on every single run, you are not only being wasteful, you're actually more likely to have an outage due to not storing dependency artifacts needed for deploy. Use a local artifact store to be more resilient to failures of servers you don't control (and also save everyone money and time).
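The idea of a local artifact store can be sketched as a download-once cache. This is a minimal illustration using only the standard library, not how any particular CI system or library does it; `cached_fetch` and the cache layout are hypothetical, and a real setup would also want checksum verification and a cache directory that persists across runs.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path
from typing import Callable


def _download(url: str, dest: Path) -> None:
    """Stream the response body at `url` into the file `dest`."""
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)


def cached_fetch(
    url: str,
    cache_dir: Path,
    fetch: Callable[[str, Path], None] = _download,
) -> Path:
    """Return a local copy of `url`, downloading only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the cache entry on the URL so each artifact gets a stable filename.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()
    target = cache_dir / name
    if target.exists():
        return target  # cache hit: no network traffic at all
    # Download to a temporary name first so an interrupted transfer
    # never leaves a truncated file under the final name.
    tmp = target.with_suffix(".part")
    fetch(url, tmp)
    tmp.replace(target)
    return target
```

If `cache_dir` lives on a persistent volume or a shared artifact server rather than inside a throwaway container, repeated runs reuse the stored copy instead of hitting the upstream bucket again.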
soared over 5 years ago
Charge, them, money?
jijji over 5 years ago
Hosting terabytes of data on an S3 bucket where people would download 45TB per month ($0.023/GB == $1000+/month) sounds like a really expensive way to distribute your data to people...
mrfusion over 5 years ago
What’s the backstory on this? (Is it something I should already know?)
yalogin over 5 years ago
Isn't it likely that someone wrote a script to test for some regression and it keeps running in a loop? I would almost bet that is the case.
tnolet over 5 years ago
Is this what they call product market fit?
codesternews over 5 years ago
Looks like an open source company. What's their business model? Does anyone know how they earn money?
idlewords over 5 years ago
This is what success looks like if you charge money for a good or service.
ChuckMcM over 5 years ago
And now the Twitter post is gone? I'm guessing the west coast woke up and someone at Apple said "Wait, you could infer some proprietary information with that information ..."
cuillevel3 over 5 years ago
Are those full downloads or just HEAD or range requests from some CI?
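The distinction matters for the byte count: a HEAD request transfers only headers, and a Range request transfers only the requested slice of the body. A minimal sketch of how the three kinds of request differ on the client side, using only the standard library (the helper names are hypothetical, and nothing is sent over the network here):

```python
import urllib.request


def full_request(url: str) -> urllib.request.Request:
    """Plain GET: the server sends the whole object."""
    return urllib.request.Request(url)


def head_request(url: str) -> urllib.request.Request:
    """HEAD: the server sends headers (size, ETag, ...) but no body."""
    return urllib.request.Request(url, method="HEAD")


def range_request(url: str, start: int, end: int) -> urllib.request.Request:
    """GET with a Range header: asks for bytes start..end inclusive."""
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
```

Passing any of these to `urllib.request.urlopen` would issue the request. A CI job that only sent HEAD requests to check whether a model had changed would generate many requests but very little transfer volume.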
dlasek over 5 years ago
They're the ones that made Amazon get those Data Trucks lol
z3t4 over 5 years ago
Apple is probably doing "continuous integration" where all assets are re-downloaded from the Internet in each iteration. Tip: put your stuff on GitHub :P
ajay-d over 5 years ago
Aren’t all the authors of that paper from Apple?
ecnahc515 over 5 years ago
Why can't they use CloudFront?
half-kh-hacker over 5 years ago
It's surprising that nobody here's mentioned Wasabi, since they have free egress.
master_yoda_1 over 5 years ago
So these jokers at Apple published a paper using code from Hugging Face.
dymk over 5 years ago
Need to distribute large static content? Looks like a good job for a torrent.
bryan_w over 5 years ago
Have you considered Cloudflare?
kelnos over 5 years ago
If a company the size of Apple finds this that useful, perhaps you should consider charging for your service, rather than just complaining on Twitter about the free usage you appear to have willingly given away?
Or perhaps you have reached out to them, but are for some reason still complaining on Twitter to drum up PR or something?
Regardless, this posting is ridiculously context-free to the point of being click-baity. (But hey, good job, I clicked on it anyway.)