tl;dr

1. The typical deep neural network tutorial introduces deep networks as compositions of nonlinearities and affine transforms.

2. In fact, a deep network with ReLU activations simplifies to a linear combination of affine transformations with compact support. But why would affine transformations be useful?

3. After recent discussions on Twitter, it occurred to me that the reason they work is that they are actually first-order Taylor approximations of a suitable analytic function.

4. What is really cool about this is that, by this logic, partial derivatives, i.e. Jacobians, are computational primitives for both inference and learning (see the sketch after this list).

5. I think this also provides insight into how deep networks approximate functions: they approximate the intrinsic geometry of a relation using piecewise-linear functions.

This works because a suitable polynomial approximation exists and all polynomials are locally Lipschitz.
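To make points 2–4 concrete, here is a minimal sketch (my own, not from the post) using JAX: a small ReLU MLP with illustrative random weights, where `jax.jacobian` gives the local affine slope. Inside a single activation region the network coincides exactly with its first-order Taylor expansion, so the Jacobian really does serve as the computational primitive. The architecture, layer sizes, and names (`mlp`, `params`) are assumptions chosen for illustration.

```python
import jax
import jax.numpy as jnp

# A small ReLU MLP -- architecture and random weights are illustrative assumptions.
def mlp(params, x):
    for W, b in params[:-1]:
        x = jax.nn.relu(W @ x + b)
    W, b = params[-1]
    return W @ x + b

key = jax.random.PRNGKey(0)
sizes = [3, 8, 8, 2]
keys = jax.random.split(key, len(sizes) - 1)
params = [
    (jax.random.normal(k, (n_out, n_in)), jnp.zeros(n_out))
    for k, n_in, n_out in zip(keys, sizes[:-1], sizes[1:])
]

x0 = jnp.array([0.5, -1.0, 0.3])
f0 = mlp(params, x0)
J0 = jax.jacobian(mlp, argnums=1)(params, x0)  # local affine slope at x0

# Within the ReLU region containing x0, the network equals its first-order
# Taylor expansion f(x0) + J(x0) @ (x - x0), with no remainder term --
# provided the perturbation is small enough to stay in the same region.
dx = 1e-3 * jax.random.normal(jax.random.PRNGKey(1), x0.shape)
exact = mlp(params, x0 + dx)
taylor = f0 + J0 @ dx
print(jnp.max(jnp.abs(exact - taylor)))  # ~0 up to float rounding
```

The same Jacobian object drives learning as well: backpropagation is just the chain rule applied to these local linear maps, which is the sense in which partial derivatives are primitives for both inference and training.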