"Almost everyone" working on NLP uses one of hugginface's pretrained models at one point or another, sooner or later: <a href="https://github.com/huggingface/pytorch-transformers" rel="nofollow">https://github.com/huggingface/pytorch-transformers</a><p>It's so damn convenient, and so nicely done.<p>And they keep doing neat things like this one: <a href="https://github.com/huggingface/swift-coreml-transformers" rel="nofollow">https://github.com/huggingface/swift-coreml-transformers</a><p>Kudos to Julien Chaumond et al for their work!
If you host large, publicly available data in a cloud blob service, but you don't have a budget for it, one option is to use the "Requester Pays" feature that Amazon and Google provide. This makes the data available for anyone to download, but they need to pay the download cost themselves.<p>The tradeoff is that your data becomes significantly more irritating to access: it's no longer just a matter of plugging a URL into a program, and everyone who wants your dataset needs to set up a billing account with Amazon or Google.
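A minimal sketch of what this looks like from the downloader's side, assuming Python, the third-party boto3 library, and AWS credentials that are already configured. The bucket, key, and file names in the usage comment are hypothetical placeholders. `RequestPayer="requester"` is the argument boto3's `get_object` takes to acknowledge that the caller will be billed for the transfer.

```python
def requester_pays_args(bucket: str, key: str) -> dict:
    """Build get_object arguments that acknowledge the requester is billed."""
    return {"Bucket": bucket, "Key": key, "RequestPayer": "requester"}


def download(bucket: str, key: str, dest: str) -> None:
    """Download one object from a Requester Pays bucket to a local file."""
    # boto3 is third-party; it is imported lazily so the helper above
    # can be used (and tested) without it installed.
    import boto3

    s3 = boto3.client("s3")
    response = s3.get_object(**requester_pays_args(bucket, key))
    with open(dest, "wb") as f:
        # Stream the body in chunks instead of loading it all into memory.
        for chunk in response["Body"].iter_chunks():
            f.write(chunk)


# Hypothetical usage (names are made up for illustration):
# download("example-public-dataset", "models/weights.bin", "weights.bin")
```

Without the `RequestPayer` argument, a request to a Requester Pays bucket is rejected, which is the "more irritating to access" part: a plain URL fetch no longer works.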
I don't have high hopes for his business prospects if this is how he handles one of the richest companies in the world clearly having a high need for something his company offers.<p>Maybe spend less time on Twitter and more on your business model?
This looks kind of interesting:<p><a href="https://github.com/huggingface/pytorch-pretrained-BigGAN/blob/6ae20a35a051816d66811d85597033623a8ac888/pytorch_pretrained_biggan/model.py#L29" rel="nofollow">https://github.com/huggingface/pytorch-pretrained-BigGAN/blo...</a><p>When you look further down you find:<p><a href="https://github.com/huggingface/pytorch-pretrained-BigGAN/blob/6ae20a35a051816d66811d85597033623a8ac888/pytorch_pretrained_biggan/model.py#L253-L281" rel="nofollow">https://github.com/huggingface/pytorch-pretrained-BigGAN/blo...</a><p>And that's just a quick search for s3 in the repo. It would not surprise me in the least to discover a `from_pretrained` that points at one of the S3 resources being pulled. There's probably other stuff like that in the code that could be causing equally nasty heartache, especially if non-persistent containers are involved...<p>(This is a WAG, aka Wild A** Guess)<p>EDIT: Dug a little more and found:<p><a href="https://github.com/search?q=org%3Ahuggingface+s3&type=Code" rel="nofollow">https://github.com/search?q=org%3Ahuggingface+s3&type=Code</a><p>Unless I'm mistaken here, there's a crap ton of code that could be downloading models at runtime, which seems significantly less than ideal...
A brief reminder: Whenever you publish code or documentation that might be used/scraped by the outside world, ALWAYS use a domain you own. If you're on Cloudflare you can instantly (and for free) create Page Rules to use Cloudflare as a CDN, redirect to another CDN, or black-hole or reroute traffic anywhere you want.
That's about $4000/month in bandwidth costs, assuming retail pricing.<p>FYI he is bragging, not complaining. There are a dozen ways to reduce or eliminate this problem.
If you don't want someone else to do something that costs you money, you're going to have a bad time if you don't prevent them from doing it.
Requester Pays is the feature they are looking for.<p><a href="https://docs.aws.amazon.com/AmazonS3/latest/dev/configure-requester-pays-console.html" rel="nofollow">https://docs.aws.amazon.com/AmazonS3/latest/dev/configure-re...</a>
If your CI/CD is re-downloading and re-building everything on every single run, you are not only being wasteful, you're actually more likely to have an outage due to not storing dependency artifacts needed for deploy. Use a local artifact store to be more resilient to failures of servers you don't control (and also save everyone money and time).
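A rough sketch of the idea, assuming Python and a plain directory on the build machine as the "artifact store". The function names and the cache layout are made up for illustration; a real setup would more likely use a proper artifact repository or the CI system's own cache.

```python
import hashlib
import urllib.request
from pathlib import Path
from typing import Callable


def fetch_url(url: str) -> bytes:
    """Fetch the raw bytes at a URL (the expensive step we want to avoid repeating)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def cached_fetch(
    url: str,
    cache_dir: Path,
    fetch: Callable[[str], bytes] = fetch_url,
) -> Path:
    """Return a local path for url, downloading only if it is not cached yet."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the cache entry on a hash of the URL so any URL maps to a safe file name.
    target = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if target.exists():
        return target  # cache hit: no network traffic at all

    data = fetch(url)
    # Write to a temporary name first so an interrupted run never
    # leaves a truncated file under the final name.
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_bytes(data)
    tmp.replace(target)
    return target
```

The second and later builds read the file from `cache_dir` and never touch the upstream server, so an upstream outage (or a surprised bucket owner) no longer breaks the deploy.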
Hosting terabytes of data on an S3 bucket where people would download 45TB per month ($0.023/GB == $1000+/month) sounds like a really expensive way to distribute your data to people...
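For anyone who wants to check the arithmetic, here is a small sketch in Python. It uses only numbers quoted in this thread (45TB per month, $0.023/GB, and the $4000/month estimate from another comment); treating 1 TB as 1000 GB is an assumption, and none of these rates should be read as an authoritative price list.

```python
GB_PER_TB = 1000  # assumption: decimal terabytes


def monthly_cost(tb_per_month: float, usd_per_gb: float) -> float:
    """Monthly cost in USD for a given transfer volume and per-GB rate."""
    return round(tb_per_month * GB_PER_TB * usd_per_gb, 2)


def implied_rate(usd_per_month: float, tb_per_month: float) -> float:
    """Per-GB rate in USD implied by a monthly bill and a transfer volume."""
    return usd_per_month / (tb_per_month * GB_PER_TB)


# 45 TB at the $0.023/GB rate quoted above.
print(monthly_cost(45, 0.023))

# Per-GB rate that the $4000/month estimate would imply for the same 45 TB.
print(round(implied_rate(4000, 45), 3))
```

The first figure comes out at roughly $1035, matching the "$1000+/month" above. The $4000/month estimate corresponds to a rate of roughly $0.089/GB for the same volume, so the two comments are assuming different per-GB prices.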
And now the Twitter post is gone? I'm guessing the west coast woke up and someone at Apple said "Wait, you could infer some proprietary information from that ..."
Apple are probably doing "continuous integration" where all assets are re-downloaded from the Internet in each iteration. Tip: put your stuff on GitHub :P
If a company the size of Apple finds this that useful, perhaps you should consider charging for your service, rather than just complaining on Twitter about the free usage you appear to have willingly given away?<p>Or perhaps you have reached out to them, but are for some reason still complaining on Twitter to drum up PR or something?<p>Regardless, this posting is ridiculously context-free to the point of being click-baity. (But hey, good job, I clicked on it anyway.)