I founded a company where we train a lot of machine learning models on music. We aren't quite at AssemblyAI's scale yet, but here is how I built my company's first on-premises GPU cluster to get us started:

1. Purchase GPU machines from Lambda Labs. I went with machines that have 256 GB of CPU RAM, 24-core AMD Threadrippers, two NVIDIA RTX 3090s, and 10 Gbps Ethernet. You might want to choose even more expensive GPUs.

2. Make sure your electrical circuits have sufficient capacity to run your GPU machines at peak power consumption. I gave each machine its own US residential electrical circuit (a rough power-budget sketch follows this list). If you are hosting your GPU servers in a data center, look into whether they can get you enough electrical power for Lambda Labs's 8-GPU machines. When talking with a data center's sales team, make sure they understand how much electrical power you need; they might charge you a lot of money if you ask for much more power than they usually install in a cabinet. Try to negotiate with multiple data centers to see who can give you the best offer.

3. Purchase storage machines from 45Drives. I recommend buying their 30-drive machines and setting up a ZFS pool of ten 3-drive mirrors (see the pool-layout sketch after this list). Do not bother with raidz, because your read and write speeds will be too slow, bottlenecking your ETL and training jobs.

4. Serve files from your storage machines to your GPU machines using NFS. I like to use MergerFS to merge mounts from different NFS servers. Alternatively, you might want to use Ceph, MinIO, or Lustre.

5. Buy Intel NUCs to run miscellaneous services (like monitoring) that you wouldn't want to colocate with your storage or GPU machines. They are small, cheap, and don't require a lot of electrical power. I bought a couple of NUCs with 64 GB of RAM and a 1 TB NVMe SSD each, then purchased external 10 Gbps Ethernet adapters that plug into each NUC's 40 Gbps Thunderbolt 3 port.

6. Buy 10 Gbps network switches. MikroTik has affordable 4-port, 8-port, and 16-port 10 Gbps switches. These switches have SFP+ cages rather than RJ45 ports, so if your machines have copper 10GBASE-T NICs you may need SFP+ copper modules or DAC cables. I really like MikroTik's balance of quality and affordability, so I also buy routers and other networking equipment from them.

7. If possible, train models small enough that each one needs only a single machine. For this reason, you might want to buy one 10-GPU machine instead of five 2-GPU machines. There are Amdahl's Law-style coordination costs to using multiple machines to train the same model. When I do large hyperparameter searches over many candidate models, I minimize these coordination costs and maximize throughput by limiting each model to one machine (see the Ray Tune sketch below). Of course, this is impossible if you are like AssemblyAI and need 48 V100s to train a single model.

8. If you do need to train a single model across multiple machines, I've heard good things about Horovod, but I'm also excited about Ray.io, which offers user-friendly distributed training wrappers around TensorFlow's MultiWorkerMirroredStrategy, PyTorch's DistributedDataParallel, and Horovod (which itself can train TensorFlow, PyTorch, or MXNet models). A minimal DistributedDataParallel sketch follows at the end.
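
To put rough numbers on the circuit sizing in step 2, here is the kind of back-of-the-envelope math I mean. The wattages below (350 W per RTX 3090, 280 W for a 24-core Threadripper, a 15 A / 120 V residential circuit, and an 80% continuous-load derating) are assumptions for illustration, not measurements from my machines; check your own hardware's specs and local electrical code.

```python
# Rough power-budget sketch for step 2. All wattages and the 15 A / 120 V
# circuit are illustrative assumptions, not measured values.
GPU_PEAK_W = 350       # assumed peak draw per RTX 3090
CPU_PEAK_W = 280       # assumed peak draw for a 24-core Threadripper
OTHER_W = 150          # assumed RAM, drives, fans, and PSU overhead

CIRCUIT_AMPS = 15      # typical US residential branch circuit
CIRCUIT_VOLTS = 120
DERATE = 0.8           # 80% rule of thumb for continuous loads

machine_peak_w = 2 * GPU_PEAK_W + CPU_PEAK_W + OTHER_W
usable_circuit_w = CIRCUIT_AMPS * CIRCUIT_VOLTS * DERATE

print(f"Estimated peak draw per 2-GPU machine: {machine_peak_w} W")   # ~1130 W
print(f"Usable power per circuit: {usable_circuit_w:.0f} W")          # 1440 W
print("Fits on its own circuit:", machine_peak_w <= usable_circuit_w)
```

The same arithmetic shows why an 8-GPU box needs data-center power: eight 3090-class GPUs alone would draw roughly 2,800 W at these assumed figures, well past a single residential circuit.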
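
For the ZFS layout in step 3, here is a small sketch that prints the `zpool create` command for 30 drives arranged as ten 3-drive mirrors. The pool name `tank` and the `bayNN` device aliases are placeholders, not the layout of any real machine; on an actual 45Drives chassis you would point at stable paths such as `/dev/disk/by-id/...` or a vdev alias map.

```python
# Sketch of step 3's pool layout: 30 drives split into ten 3-drive mirror vdevs.
# Pool name and device aliases are placeholders.
drives = [f"/dev/disk/by-vdev/bay{i:02d}" for i in range(30)]

# Group the drives into ten mirrors of three; reads and writes then stripe
# across all ten vdevs, which is the throughput a raidz layout would give up.
vdevs = [drives[i:i + 3] for i in range(0, len(drives), 3)]

cmd = ["zpool", "create", "tank"]
for group in vdevs:
    cmd += ["mirror", *group]

print(" ".join(cmd))
# prints: zpool create tank mirror <bay00> <bay01> <bay02> mirror <bay03> ...
```

Keep in mind that 3-drive mirrors trade capacity for speed and redundancy: you get the usable space of 10 drives out of 30.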
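
Step 7's one-model-per-machine rule is easy to enforce if your search scheduler knows about machine boundaries. I'm not pasting my actual search code here, but as a rough sketch with Ray Tune's classic `tune.run` API, each trial can claim all 24 cores and both GPUs of one 2-GPU box, so no trial ever spans machines. The `train_model` function, metric, and search space are placeholders.

```python
# Hypothetical step 7 sketch with Ray Tune (classic tune.run API): every trial
# reserves one whole 2-GPU / 24-core machine, so no model spans nodes.
from ray import tune

def train_model(config):
    # Placeholder trainable; a real one would build a model and train it on
    # the two local GPUs using config["lr"] and config["batch_size"].
    tune.report(loss=config["lr"] * config["batch_size"])

analysis = tune.run(
    train_model,
    config={
        "lr": tune.loguniform(1e-5, 1e-2),
        "batch_size": tune.choice([32, 64, 128]),
    },
    num_samples=50,
    # One full machine per trial: all of its CPU cores and both GPUs.
    resources_per_trial={"cpu": 24, "gpu": 2},
)
print(analysis.get_best_config(metric="loss", mode="min"))
```

Because each trial is a whole machine, the cluster's throughput is just (number of machines) / (time per model), with no cross-node gradient traffic.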
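
Since I mentioned PyTorch's DistributedDataParallel in step 8, here is a minimal multi-node sketch of it on its own, without the Horovod or Ray wrappers. The model, dataset, and hyperparameters are placeholders; the script assumes it is launched with `torchrun`, which sets the rank and rendezvous environment variables it reads.

```python
# Minimal multi-node DistributedDataParallel sketch for step 8.
# Assumes launch via torchrun, which sets RANK, LOCAL_RANK, WORLD_SIZE,
# MASTER_ADDR, and MASTER_PORT in the environment.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group(backend="nccl")          # NCCL for GPU training
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and data; swap in your real network and dataset.
    model = torch.nn.Linear(128, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    data = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(data)               # shards data across ranks
    loader = DataLoader(data, batch_size=32, sampler=sampler)

    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                     # reshuffle shards per epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            opt.zero_grad()
            loss_fn(model(x), y).backward()          # DDP all-reduces gradients
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

On two 2-GPU machines you would start this with something like `torchrun --nnodes=2 --nproc_per_node=2 --rdzv_backend=c10d --rdzv_endpoint=<head-node-ip>:29500 train.py` on each box (host and port are placeholders).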