Apple downloads ~45 TB of models per day from our S3 bucket

553 points by julien_c over 5 years ago

34 comments

cs702 over 5 years ago

"Almost everyone" working on NLP uses one of Hugging Face's pretrained models sooner or later: https://github.com/huggingface/pytorch-transformers It's so damn convenient, and so nicely done. And they keep doing neat things like this one: https://github.com/huggingface/swift-coreml-transformers Kudos to Julien Chaumond et al. for their work!

nurettin over 5 years ago

This is probably Apple's continuous integration tests, lazily written to download the whole thing every time someone merges a commit.

CobrastanJorji over 5 years ago

If you host large, publicly available data in a cloud blob service, but you don't have a budget for it, one option is to use the "Requester Pays" feature that Amazon and Google provide. This makes the data available to anyone to download, but they need to pay the download cost themselves. The tradeoff is that your data becomes significantly more irritating to access: it's no longer just a matter of plugging a URL into a program, and everyone who wants your dataset needs to set up a billing account with Amazon or Google.

dharmon over 5 years ago

I don't have high hopes for his business prospects if this is how he handles one of the richest companies in the world clearly having a high need for something his company offers. Maybe spend less time on Twitter and more on your business model?

emeraldd over 5 years ago

This looks kind of interesting: https://github.com/huggingface/pytorch-pretrained-BigGAN/blob/6ae20a35a051816d66811d85597033623a8ac888/pytorch_pretrained_biggan/model.py#L29

When you look further down you find: https://github.com/huggingface/pytorch-pretrained-BigGAN/blob/6ae20a35a051816d66811d85597033623a8ac888/pytorch_pretrained_biggan/model.py#L253-L281

And that's just a quick search for S3 in the repo. It would not surprise me in the least to discover a `from_pretrained` that points at one of the S3 resources being pulled. There's probably other stuff like that in the code that could be causing equally nasty heartache, especially if non-persistent containers are involved. (This is a WAG, aka Wild A Guess.)

EDIT: Dug a little more and found: https://github.com/search?q=org%3Ahuggingface+s3&type=Code

Unless I'm mistaken here, there's a crap ton of code that could be downloading models at runtime, which seems significantly less than ideal.

btown over 5 years ago

A brief reminder: Whenever you publish code or documentation that might be used/scraped by the outside world, ALWAYS use a domain you own. If you're on Cloudflare you can instantly (and for free) create Page Rules to use Cloudflare as a CDN, redirect to another CDN, or black-hole or reroute traffic anywhere you want.

lacker over 5 years ago

Well, you could contact them and make a very-likely-to-succeed case that they should pay you some money, or you could complain about it on Twitter.

paxys over 5 years ago

That's about $4000/month in bandwidth costs, assuming retail pricing. FYI he is bragging, not complaining. There are a dozen ways to reduce or eliminate this problem.

ebg13 over 5 years ago

If you don't want someone else to do something that costs you money, you're going to have a bad time if you don't prevent them from doing it.

alphagrep12345 over 5 years ago

What does Hugging Face do? Do they implement models from papers and make them available for free?

rhacker over 5 years ago

I'm guessing someone at Apple internally distributed a Dockerfile that pulls that down.

fitzroy over 5 years ago

In a few weeks he can just point Apple's IP range to a shared iCloud folder.

StreamBright over 5 years ago

Requester Pays is the feature they are looking for: https://docs.aws.amazon.com/AmazonS3/latest/dev/configure-requester-pays-console.html

hi41 over 5 years ago

I read the Twitter post but did not understand what is happening. Can someone please explain?

cpach over 5 years ago

Isn’t this a use case where BitTorrent would shine?

peterwwillis over 5 years ago

If your CI/CD is re-downloading and re-building everything on every single run, you are not only being wasteful, you're actually more likely to have an outage due to not storing dependency artifacts needed for deploy. Use a local artifact store to be more resilient to failures of servers you don't control (and also save everyone money and time).
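
A minimal sketch of the "local artifact store" idea, assuming a plain directory as the store. The names `cached_fetch`, `cache_dir`, and `fetch` are hypothetical and do not come from any particular CI system or from Hugging Face's code; `fetch` stands in for whatever actually performs the download.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def cached_fetch(url: str, cache_dir: Path, fetch: Callable[[str], bytes]) -> Path:
    """Return a local path for `url`, calling `fetch` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the entry on a hash of the URL so any URL maps to a safe filename.
    target = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if not target.exists():
        data = fetch(url)
        # Write to a temporary name first, then rename, so an interrupted
        # write never leaves a truncated file that looks like a cache hit.
        tmp = target.with_name(target.name + ".part")
        tmp.write_bytes(data)
        tmp.replace(target)
    return target


if __name__ == "__main__":
    calls = []

    def fake_fetch(url: str) -> bytes:
        # Hypothetical downloader used only for illustration; no network access.
        calls.append(url)
        return b"model weights"

    with tempfile.TemporaryDirectory() as d:
        cached_fetch("https://example.com/model.bin", Path(d), fake_fetch)
        cached_fetch("https://example.com/model.bin", Path(d), fake_fetch)
        print(len(calls))  # the second call is served from the cache
```

For this to help in CI, the cache directory would need to live on storage that survives between runs (for example a mounted volume), which is exactly what non-persistent containers lack by default.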

soared over 5 years ago

Charge, them, money?

jijji over 5 years ago

Hosting terabytes of data on an S3 bucket where people would download 45TB per month ($0.023/GB == $1000+/month) sounds like a really expensive way to distribute your data to people...
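
A rough sketch of the arithmetic behind that figure. Note that the headline says 45 TB per day, while the comment works from 45 TB per month, so both readings are shown. The flat $0.023/GB rate is simply the number quoted in the comment; treating it as the applicable transfer price, and assuming a 30-day month and decimal terabytes, are assumptions made here for illustration (real cloud pricing is tiered).

```python
def monthly_transfer_cost(tb_per_month: float, usd_per_gb: float) -> float:
    """Flat-rate estimate: decimal terabytes per month times a per-GB price."""
    return tb_per_month * 1000 * usd_per_gb


RATE = 0.023  # USD per GB, as quoted in the comment (assumed flat)

per_month_reading = monthly_transfer_cost(45, RATE)     # 45 TB per month
per_day_reading = monthly_transfer_cost(45 * 30, RATE)  # 45 TB per day, 30-day month

print(f"45 TB/month: ${per_month_reading:,.0f}")
print(f"45 TB/day:   ${per_day_reading:,.0f}")
```

At this assumed rate the per-month reading comes to about $1,035, consistent with the "$1000+/month" above, and the per-day reading to about $31,050 per month.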

mrfusion over 5 years ago

What’s the backstory on this? (Is it something I should already know?)

yalogin over 5 years ago

Isn't it likely that someone wrote a script for testing some regression and it keeps running in a loop? I can almost bet that will be the case.

tnolet over 5 years ago

Is this what they call product-market fit?

codesternews over 5 years ago

Looks like an open source company. What's their business model? Does anyone know how they earn money?

idlewords over 5 years ago

This is what success looks like if you charge money for a good or service.

ChuckMcM over 5 years ago

And now the Twitter post is gone? I'm guessing the west coast woke up and someone at Apple said "Wait, you could infer some proprietary information from that ..."

cuillevel3 over 5 years ago

Are those full downloads or just HEAD or range requests from some CI?

dlasek over 5 years ago

They're the ones that made Amazon get those Data Trucks lol

z3t4 over 5 years ago

Apple are probably doing "continuous integration" where all assets are re-downloaded from the Internet in each iteration. Tip: put your stuff on GitHub :P

ajay-d over 5 years ago

Aren’t all the authors of that paper from Apple?

ecnahc515 over 5 years ago

Why can't they use CloudFront?

half-kh-hacker over 5 years ago

It's surprising that nobody here's mentioned Wasabi, since they have free egress.

master_yoda_1 over 5 years ago

So these jokers at Apple published a paper using code from Hugging Face.

dymk over 5 years ago

Need to distribute large static content? Looks like a good job for a torrent.

bryan_w over 5 years ago

Have you considered Cloudflare?

kelnos over 5 years ago

If a company the size of Apple finds this that useful, perhaps you should consider charging for your service, rather than just complaining on Twitter about the free usage you appear to have willingly given away?

Or perhaps you have reached out to them, but are for some reason still complaining on Twitter to drum up PR or something?

Regardless, this posting is ridiculously context-free to the point of being click-baity. (But hey, good job, I clicked on it anyway.)
