I founded a company where we train a lot of machine learning models on music. We aren't quite at AssemblyAI's scale yet, but here is how I built my company's first on-premises GPU cluster to get us started:

1. Purchase GPU machines from Lambda Labs. I went with machines that have 256 GB of CPU RAM, 24-core AMD Threadrippers, two NVIDIA RTX 3090s, and 10 Gbps Ethernet. You might want to choose even more expensive GPUs.

2. Make sure your electrical circuits have sufficient capacity to run your GPU machines at peak power consumption. I gave each machine its own US residential electrical circuit (a rough power-budget sketch follows this list). If you are hosting your GPU servers in a data center, look into whether they can get you enough electrical power for Lambda Labs's 8-GPU machines. When talking with a data center's sales team, make sure they understand how much electrical power you need; they might charge you a lot of money if you ask for much more power than they usually install in a cabinet. Try to negotiate with multiple data centers to see who can give you the best offer.

3. Purchase storage machines from 45Drives. I recommend buying their 30-drive machines and setting up a ZFS pool of ten 3-drive mirrors (see the pool-layout sketch after this list). Do not bother with raidz, because your read and write speeds will be too slow, bottlenecking your ETL and training jobs.

4. Serve files from your storage machines to your GPU machines using NFS. I like to use MergerFS to merge mounts from different NFS servers. Alternatively, you might want to use Ceph, MinIO, or Lustre.

5. Buy Intel NUCs to run miscellaneous services (like monitoring) that you wouldn't want to colocate with your storage or GPU machines. They are small, cheap, and don't require a lot of electrical power. I bought a couple of NUCs with 64 GB of RAM and a 1 TB NVMe SSD each, then purchased external 10 Gbps Ethernet adapters that plug into each NUC's 40 Gbps Thunderbolt 3 port.

6. Buy 10 Gbps network switches. MikroTik has affordable 4-port, 8-port, and 16-port 10 Gbps switches. These switches have SFP+ cages rather than RJ45 ports, so if your machines have copper 10GBASE-T NICs you may need SFP+ copper modules or DAC cables. I really like MikroTik's balance of quality and affordability, so I also buy routers and other networking equipment from them.

7. If possible, train models small enough that each one needs only a single machine. For this reason, you might want to buy one 10-GPU machine instead of five 2-GPU machines. There are Amdahl's Law-style coordination costs to using multiple machines to train the same model. When I do large hyperparameter searches over many candidate models, I minimize these coordination costs and maximize throughput by limiting each model to one machine (see the Ray Tune sketch below). Of course, this is impossible if you are like AssemblyAI and need 48 V100s to train a single model.

8. If you do need to train a single model across multiple machines, I've heard good things about Horovod, but I'm also excited about Ray.io, which offers user-friendly distributed training wrappers around TensorFlow's MultiWorkerMirroredStrategy, PyTorch's DistributedDataParallel, and Horovod (which itself can train TensorFlow, PyTorch, or MXNet models). A minimal DistributedDataParallel sketch follows at the end.
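
To put rough numbers on the circuit sizing in step 2, here is the kind of back-of-the-envelope math I mean. The wattages below (350 W per RTX 3090, 280 W for a 24-core Threadripper, a 15 A / 120 V residential circuit, and an 80% continuous-load derating) are assumptions for illustration, not measurements from my machines; check your own hardware's specs and local electrical code.

```python
# Rough power-budget sketch for step 2. All wattages and the 15 A / 120 V
# circuit are illustrative assumptions, not measured values.
GPU_PEAK_W = 350       # assumed peak draw per RTX 3090
CPU_PEAK_W = 280       # assumed peak draw for a 24-core Threadripper
OTHER_W = 150          # assumed RAM, drives, fans, and PSU overhead

CIRCUIT_AMPS = 15      # typical US residential branch circuit
CIRCUIT_VOLTS = 120
DERATE = 0.8           # 80% rule of thumb for continuous loads

machine_peak_w = 2 * GPU_PEAK_W + CPU_PEAK_W + OTHER_W
usable_circuit_w = CIRCUIT_AMPS * CIRCUIT_VOLTS * DERATE

print(f"Estimated peak draw per 2-GPU machine: {machine_peak_w} W")   # ~1130 W
print(f"Usable power per circuit: {usable_circuit_w:.0f} W")          # 1440 W
print("Fits on its own circuit:", machine_peak_w <= usable_circuit_w)
```

The same arithmetic shows why an 8-GPU box needs data-center power: eight 3090-class GPUs alone would draw roughly 2,800 W at these assumed figures, well past a single residential circuit.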
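
For the ZFS layout in step 3, here is a small sketch that prints the `zpool create` command for 30 drives arranged as ten 3-drive mirrors. The pool name `tank` and the `bayNN` device aliases are placeholders, not the layout of any real machine; on an actual 45Drives chassis you would point at stable paths such as `/dev/disk/by-id/...` or a vdev alias map.

```python
# Sketch of step 3's pool layout: 30 drives split into ten 3-drive mirror vdevs.
# Pool name and device aliases are placeholders.
drives = [f"/dev/disk/by-vdev/bay{i:02d}" for i in range(30)]

# Group the drives into ten mirrors of three; reads and writes then stripe
# across all ten vdevs, which is the throughput a raidz layout would give up.
vdevs = [drives[i:i + 3] for i in range(0, len(drives), 3)]

cmd = ["zpool", "create", "tank"]
for group in vdevs:
    cmd += ["mirror", *group]

print(" ".join(cmd))
# prints: zpool create tank mirror <bay00> <bay01> <bay02> mirror <bay03> ...
```

Keep in mind that 3-drive mirrors trade capacity for speed and redundancy: you get the usable space of 10 drives out of 30.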
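
Step 7's one-model-per-machine rule is easy to enforce if your search scheduler knows about machine boundaries. I'm not pasting my actual search code here, but as a rough sketch with Ray Tune's classic `tune.run` API, each trial can claim all 24 cores and both GPUs of one 2-GPU box, so no trial ever spans machines. The `train_model` function, metric, and search space are placeholders.

```python
# Hypothetical step 7 sketch with Ray Tune (classic tune.run API): every trial
# reserves one whole 2-GPU / 24-core machine, so no model spans nodes.
from ray import tune

def train_model(config):
    # Placeholder trainable; a real one would build a model and train it on
    # the two local GPUs using config["lr"] and config["batch_size"].
    tune.report(loss=config["lr"] * config["batch_size"])

analysis = tune.run(
    train_model,
    config={
        "lr": tune.loguniform(1e-5, 1e-2),
        "batch_size": tune.choice([32, 64, 128]),
    },
    num_samples=50,
    # One full machine per trial: all of its CPU cores and both GPUs.
    resources_per_trial={"cpu": 24, "gpu": 2},
)
print(analysis.get_best_config(metric="loss", mode="min"))
```

Because each trial is a whole machine, the cluster's throughput is just (number of machines) / (time per model), with no cross-node gradient traffic.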
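
Since I mentioned PyTorch's DistributedDataParallel in step 8, here is a minimal multi-node sketch of it on its own, without the Horovod or Ray wrappers. The model, dataset, and hyperparameters are placeholders; the script assumes it is launched with `torchrun`, which sets the rank and rendezvous environment variables it reads.

```python
# Minimal multi-node DistributedDataParallel sketch for step 8.
# Assumes launch via torchrun, which sets RANK, LOCAL_RANK, WORLD_SIZE,
# MASTER_ADDR, and MASTER_PORT in the environment.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group(backend="nccl")          # NCCL for GPU training
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and data; swap in your real network and dataset.
    model = torch.nn.Linear(128, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    data = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(data)               # shards data across ranks
    loader = DataLoader(data, batch_size=32, sampler=sampler)

    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                     # reshuffle shards per epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            opt.zero_grad()
            loss_fn(model(x), y).backward()          # DDP all-reduces gradients
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

On two 2-GPU machines you would start this with something like `torchrun --nnodes=2 --nproc_per_node=2 --rdzv_backend=c10d --rdzv_endpoint=<head-node-ip>:29500 train.py` on each box (host and port are placeholders).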