Apple downloads ~45 TB of models per day from our S3 bucket

553 points by julien_c, over 5 years ago

34 comments

cs702, over 5 years ago
"Almost everyone" working on NLP uses one of huggingface's pretrained models at one point or another, sooner or later: https://github.com/huggingface/pytorch-transformers

It's so damn convenient, and so nicely done.

And they keep doing neat things like this one: https://github.com/huggingface/swift-coreml-transformers

Kudos to Julien Chaumond et al for their work!
nurettin, over 5 years ago
This is probably Apple's continuous integration tests, lazily written to download the whole thing every time someone merges a commit.
CobrastanJorji, over 5 years ago
If you host large, publicly available data in a cloud blob service, but you don't have a budget for it, one option is to use the "Requester Pays" feature that Amazon and Google provide. This makes the data available to anyone to download, but they need to pay the download cost themselves.

The tradeoff is that your data becomes significantly more irritating to access: it's no longer just a matter of plugging a URL into a program, and everyone who wants your dataset needs to set up a billing account with Amazon or Google.
dharmon, over 5 years ago
I don't have high hopes for his business prospects if this is how he handles one of the richest companies in the world clearly having a high need for something his company offers.

Maybe spend less time on Twitter and more on your business model?
emeraldd, over 5 years ago
This looks kind of interesting:

https://github.com/huggingface/pytorch-pretrained-BigGAN/blob/6ae20a35a051816d66811d85597033623a8ac888/pytorch_pretrained_biggan/model.py#L29

When you look further down you find:

https://github.com/huggingface/pytorch-pretrained-BigGAN/blob/6ae20a35a051816d66811d85597033623a8ac888/pytorch_pretrained_biggan/model.py#L253-L281

And that's just a quick search for s3 in the repo. It would not surprise me in the least to discover a `from_pretrained` that points at one of the s3 resources being pulled. There's probably other stuff like that as well in the code that could be causing equally nasty heartache .. especially if non-persistent containers are involved....

(This is a WAG aka Wild A** Guess)

EDIT: Dug a little more and found:

https://github.com/search?q=org%3Ahuggingface+s3&type=Code

Unless I'm mistaken here, there's a crap ton of code that could be downloading models at runtime ... Which seems significantly less than ideal ...
btown, over 5 years ago
A brief reminder: Whenever you publish code or documentation that might be used/scraped by the outside world, ALWAYS use a domain you own. If you're on Cloudflare you can instantly (and for free) create Page Rules to use Cloudflare as a CDN, redirect to another CDN, or black-hole or reroute traffic anywhere you want.
lacker, over 5 years ago
Well, you could contact them and make a very-likely-to-succeed case that they should pay you some money, or you could complain about it on Twitter.
paxys, over 5 years ago
That's about $4000/month in bandwidth costs, assuming retail pricing.

FYI he is bragging, not complaining. There are a dozen ways to reduce or eliminate this problem.
ebg13, over 5 years ago
If you don't want someone else to do something that costs you money, you're going to have a bad time if you don't prevent them from doing it.
alphagrep12345, over 5 years ago
What does Hugging Face do? Do they implement models from papers and make them available for free?
rhacker, over 5 years ago
I'm guessing someone at Apple internally distributed a Dockerfile that pulls that down.
fitzroy, over 5 years ago
In a few weeks he can just point Apple's IP range to a shared iCloud folder.
StreamBright, over 5 years ago
Requester Pays is the feature they are looking for.

https://docs.aws.amazon.com/AmazonS3/latest/dev/configure-requester-pays-console.html
hi41, over 5 years ago
I read the Twitter post but did not understand what is happening. Can someone please explain?
cpach, over 5 years ago
Isn’t this a use case where BitTorrent would shine?
peterwwillis, over 5 years ago
If your CI/CD is re-downloading and re-building everything on every single run, you are not only being wasteful, you're actually more likely to have an outage due to not storing dependency artifacts needed for deploy. Use a local artifact store to be more resilient to failures of servers you don't control (and also save everyone money and time).
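
The local artifact store suggested above can be sketched as a small download cache. This is only a minimal illustration, not how any particular CI system or the Hugging Face libraries behave: `cached_fetch`, the cache directory layout, and the injectable `download` callable are all hypothetical names chosen for the example.

```python
import hashlib
import urllib.request
from pathlib import Path
from typing import Callable


def _download(url: str) -> bytes:
    """Default downloader: fetch the URL body over the network."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def cached_fetch(url: str, cache_dir: Path,
                 download: Callable[[str], bytes] = _download) -> Path:
    """Return a local path for `url`, downloading only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the entry on a hash of the URL so any URL maps to a safe filename.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    target = cache_dir / key
    if not target.exists():
        # Write under a temporary name first so an interrupted download
        # never leaves a truncated file under the final name.
        tmp = cache_dir / (key + ".part")
        tmp.write_bytes(download(url))
        tmp.replace(target)
    return target
```

Pointing `cache_dir` at a directory that survives between runs (rather than one inside a throwaway container) is what makes the second and later runs skip the network entirely.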
soared, over 5 years ago
Charge, them, money?
jijji, over 5 years ago
Hosting terabytes of data on an S3 bucket where people would download 45TB per month ($0.023/GB == $1000+/month) sounds like a really expensive way to distribute your data to people...
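
To put the figures in this thread side by side, here is a small arithmetic sketch. It takes the $0.023/GB rate quoted above at face value (whether that is the right rate for data transfer is not established here), assumes 1 TB = 1000 GB and a 30-day month, and uses a hypothetical helper name.

```python
GB_PER_TB = 1000  # assumption: decimal terabytes


def transfer_cost(terabytes: float, usd_per_gb: float) -> float:
    """Cost in USD of moving `terabytes` at a flat per-GB rate."""
    return terabytes * GB_PER_TB * usd_per_gb


RATE = 0.023  # USD per GB, the figure quoted in the comment above

# 45 TB in total (the "per month" reading in the comment above).
cost_45_tb = transfer_cost(45, RATE)

# 45 TB every day for 30 days (the "per day" reading from the thread title).
cost_30_days = transfer_cost(45 * 30, RATE)

print(f"45 TB once:         ${cost_45_tb:,.0f}")
print(f"45 TB/day, 30 days: ${cost_30_days:,.0f}")
```

At that flat rate, 45 TB comes to roughly $1,035, which matches the "$1000+" figure. Sustained for 30 days at the daily volume given in the title, the same rate gives roughly $31,050.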
mrfusion, over 5 years ago
What’s the backstory on this? (Is it something I should already know?)
yalogin, over 5 years ago
Isn't it likely that someone wrote a script for testing some regression and it keeps running in a loop? I would almost bet that's the case.
tnolet, over 5 years ago
Is this what they call product market fit?
codesternews, over 5 years ago
Looks like an open source company. What's their business model? Does anyone know how they earn money?
idlewords, over 5 years ago
This is what success looks like if you charge money for a good or service.
ChuckMcM, over 5 years ago
And now the Twitter post is gone? I'm guessing the west coast woke up and someone at Apple said "Wait, you could infer some proprietary information with that information ..."
cuillevel3, over 5 years ago
Are those full downloads or just HEAD or range requests from some CI?
dlasek, over 5 years ago
They're the ones that made Amazon get those Data Trucks lol
z3t4, over 5 years ago
Apple are probably doing "continuous integration" where all assets are re-downloaded from the Internet in each iteration. Tip: put your stuff on GitHub :P
ajay-d, over 5 years ago
Aren’t all the authors of that paper from Apple?
ecnahc515, over 5 years ago
Why can't they use CloudFront?
half-kh-hacker, over 5 years ago
It's surprising that nobody here's mentioned Wasabi, since they have free egress.
master_yoda_1, over 5 years ago
So these jokers at Apple publish a paper by using code from huggingface.
dymk, over 5 years ago
Need to distribute large static content? Looks like a good job for a torrent.
bryan_w, over 5 years ago
Have you considered Cloudflare?
kelnos, over 5 years ago
If a company the size of Apple finds this that useful, perhaps you should consider charging for your service, rather than just complaining on Twitter about the free usage you appear to have willingly given away?

Or perhaps you have reached out to them, but are for some reason still complaining on Twitter to drum up PR or something?

Regardless, this posting is ridiculously context-free to the point of being click-baity. (But hey, good job, I clicked on it anyway.)