There is certainly a lot of hype around AI chips, but I'm very skeptical of the payoff. There are several technical concerns I have with any "AI" chip that ultimately push you toward something more general purpose (and not really an "AI" chip, just something good at low-precision matmul):

* For inference, how do you efficiently move your data to the chip? In general most of the time is spent in matmul, and there are lots of exciting DSPs, mobile GPUs, etc. that require a fair amount of jumping through hoops to get your data to the ML coprocessor (a sketch of what that dispatch looks like is at the end of this comment). If you're doing anything low latency, good luck, because you need tight control of the OS (or to bypass it entirely). Will this lead to a battle between chip makers? It seems more likely to be a battle between end-to-end platforms.

* For training, do you have an efficient data flow with distributed compute? For the foreseeable future, any large model (or small model with lots of data) needs to be distributed, and the bottlenecks that introduces limit the gains from a new specialized architecture unless the distributed computing is good. Again, better chips don't really solve this; the solution comes from the platform. I've noticed many training loops have terrible GPU utilization, particularly with TensorFlow and V100s. Why does this happen? The GPU is plenty fast, but summary ops add CPU time that limits throughput, data pipelines don't actually pipeline their transformations, slow disks bottleneck reads, and transfers to the GPU aren't staged or pipelined (see the input pipeline sketch below). And then there's still a bit of an open question of how to best pipeline transfers from the GPU. Is there a simulator feeding data? Then you have a whole new can of worms to train fast.

* For your chip architecture, do you have the right abstractions to train the next architecture efficiently? Backprop trains some wonderful nets, but given the cost of a new chip ($50-100M) and the time it takes to build one (18 months minimum), how confident are you that the chip will still be relevant to the needs of your teams? This generally points you toward something more general purpose, which may leave some efficiency on the table. Eventually you end up at a low-precision matmul core, which is the same thing everyone is moving toward or already doing, whether you call yourself a GPU, DSP, or TPU (which is quite similar to a DSP). The mixed-precision sketch at the end shows how software already targets exactly that core.

I'm an HPC/graphics engineer turned deep learning engineer; I've worked with GPUs since 2006 and neural net chips since 2010 (before even AlexNet!!), so I'm a bit of an outlier here, having seen so many perspectives. From my point of view the computational fabric already exists, we're just not using it well :)
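
To make the inference point concrete, here's roughly what dispatching to an ML coprocessor looks like through a TFLite delegate. This is a hedged sketch: the model path is a placeholder, and the delegate library shown happens to be the Coral Edge TPU's; every vendor ships their own.

    import numpy as np
    import tensorflow as tf

    # Hand the graph to a vendor coprocessor via a TFLite delegate. The
    # delegate library is platform-specific (this one is Coral's Edge TPU);
    # "model.tflite" is a placeholder path.
    delegate = tf.lite.experimental.load_delegate("libedgetpu.so.1")
    interpreter = tf.lite.Interpreter(
        model_path="model.tflite",
        experimental_delegates=[delegate])
    interpreter.allocate_tensors()

    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]

    # Each invoke copies the input into the interpreter's buffers and the
    # result back out: data movement the chip itself can't hide, and the
    # part that gets painful at low latency.
    interpreter.set_tensor(inp["index"],
                           np.zeros(inp["shape"], dtype=inp["dtype"]))
    interpreter.invoke()
    result = interpreter.get_tensor(out["index"])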
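
On the GPU-utilization point, here's a minimal sketch of an input pipeline that actually pipelines, using tf.data. Assumptions are loud here: the file pattern and feature spec are hypothetical, and the exact knobs (shard count, shuffle buffer, batch size) depend entirely on your disks and preprocessing cost.

    import tensorflow as tf

    AUTOTUNE = tf.data.experimental.AUTOTUNE

    def parse_example(serialized):
        # Hypothetical feature spec; decode and normalize on the CPU.
        spec = {
            "image": tf.io.FixedLenFeature([], tf.string),
            "label": tf.io.FixedLenFeature([], tf.int64),
        }
        parsed = tf.io.parse_single_example(serialized, spec)
        image = tf.io.decode_jpeg(parsed["image"], channels=3)
        image = tf.image.resize(image, [224, 224]) / 255.0
        return image, parsed["label"]

    dataset = (
        tf.data.Dataset.list_files("/data/train-*.tfrecord")  # placeholder path
        # Read several shards concurrently so one slow disk can't stall reads.
        .interleave(tf.data.TFRecordDataset,
                    cycle_length=8, num_parallel_calls=AUTOTUNE)
        .shuffle(10_000)
        # Run the CPU-side transformations in parallel, not serially per step.
        .map(parse_example, num_parallel_calls=AUTOTUNE)
        .batch(256)
        # Overlap preprocessing of the next batch with compute on the current one.
        .prefetch(AUTOTUNE)
        # Stage batches in GPU memory so the host-to-device copy overlaps too.
        .apply(tf.data.experimental.prefetch_to_device("/gpu:0"))
    )

The other half of the fix is keeping summary ops out of the hot loop (or running them every N steps) so they stop adding CPU time to every iteration.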
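
And on the low-precision matmul convergence: this is already how you target those units from software. A sketch with TensorFlow's mixed-precision Keras API, where compute runs in float16 on the matmul cores while variables stay float32; the layer sizes here are arbitrary.

    import tensorflow as tf
    from tensorflow.keras import layers, mixed_precision

    # Compute in float16 (which is what the tensor cores / matmul units eat),
    # keep variables in float32 for numerically stable updates.
    mixed_precision.set_global_policy("mixed_float16")

    model = tf.keras.Sequential([
        layers.Dense(4096, activation="relu", input_shape=(1024,)),
        layers.Dense(10),
        # Final softmax forced to float32 so the loss stays numerically sane.
        layers.Activation("softmax", dtype="float32"),
    ])

    # Loss scaling rescales gradients so they don't underflow in float16.
    opt = mixed_precision.LossScaleOptimizer(tf.keras.optimizers.Adam())
    model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")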