Or, do your inference using an AVX-512 CPU:

https://NN-512.com (open source, free software, no dependencies)

With batch size 1, NN-512 is easily 2x faster than TensorFlow, doing 27 ResNet50 inferences per second on a c5.xlarge instance. For less common networks, like DenseNet or ResNeXt, the performance gap is wider.

Even if you allow TensorFlow a larger ResNet50 batch size, NN-512 is still easily 1.3x faster.

If you need a few dozen inferences per second per server, this is the cheapest way to get them. And you're not depending on a proprietary solution whose parent company could go out of business within a year.

If you need Transformers instead of convolutions, Fabrice Bellard's LibNC is a good option: https://bellard.org/libnc/
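
For context, NN-512 is a compiler: you feed it a text description of the network and it emits dependency-free C99 that you build like any other C file. Here's a rough sketch of what a batch-size-1 driver for generated ResNet50 code might look like; note that the header name and the Create/Infer/Destroy entry points are hypothetical placeholders, the real names are whatever the generated header declares:

    /* Hypothetical driver for NN-512-generated code. The type and
       function names (ResNet50Net, ResNet50Create, ResNet50Infer,
       ResNet50Destroy) are placeholders, not NN-512's actual API;
       check the generated header for the real entry points. */
    #include <stdio.h>
    #include <stdlib.h>
    #include "ResNet50.h"  /* hypothetical: header emitted by NN-512 */

    int main(void) {
        /* Batch size 1: one 224x224 RGB image as float32. */
        float *in = calloc(3 * 224 * 224, sizeof *in);
        float out[1000];  /* 1000 ImageNet class scores */

        /* ... fill `in` with a preprocessed image ... */

        ResNet50Net *net = ResNet50Create("resnet50.weights");  /* hypothetical */
        ResNet50Infer(net, in, out);                            /* hypothetical */

        /* Argmax over the class scores. */
        int best = 0;
        for (int i = 1; i < 1000; i++)
            if (out[i] > out[best]) best = i;
        printf("class %d, score %f\n", best, out[best]);

        ResNet50Destroy(net);  /* hypothetical */
        free(in);
        return 0;
    }

Since the generated code is plain C99 with no dependencies, building is just compiling this driver together with the generated .c file, with something like gcc -O3 -march=native on an AVX-512 machine.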