The most interesting thing I've seen on AD is "The simple essence of automatic differentiation" (2018) [1]. See past discussion [2], and talk [3]. I think the main idea is that by compiling to categories and pairing up a function with its derivative, the pair becomes trivially composable in forward mode, and the whole structure is easily converted to reverse mode afterwards.

[1]: https://dl.acm.org/doi/10.1145/3236765

[2]: https://news.ycombinator.com/item?id=18306860

[3]: Talk at Microsoft Research: https://www.youtube.com/watch?v=ne99laPUxN4 Other presentations listed here: https://github.com/conal/essence-of-ad
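Roughly, and simplifying a lot (the names below are mine, and I'm collapsing the paper's linear-map derivative into an ordinary function), the pairing looks something like this in Haskell:

```haskell
-- A differentiable function returns both its result and its derivative
-- at the point of evaluation (a stand-in for the paper's linear map).
newtype D a b = D (a -> (b, a -> b))

-- Composition is just the chain rule: run f, run g on f's result,
-- and compose the two derivatives.
compD :: D b c -> D a b -> D a c
compD (D g) (D f) = D $ \a ->
  let (b, f') = f a
      (c, g') = g b
  in (c, g' . f')

-- Two sample primitives.
dSqr :: D Double Double
dSqr = D $ \x -> (x * x, \dx -> 2 * x * dx)

dSin :: D Double Double
dSin = D $ \x -> (sin x, \dx -> cos x * dx)

-- sin (x^2) and its derivative, without writing the chain rule by hand.
sinOfSquare :: D Double Double
sinOfSquare = dSin `compD` dSqr
```

Evaluating `sinOfSquare` at 2.0 gives sin 4 together with a derivative function that returns 4 * cos 4; the chain rule falls out of `compD` rather than appearing at each call site. Reverse mode then comes, roughly, from choosing a representation of the derivative in which those linear maps compose in the opposite order.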
My professor has talked about this. He thinks that the real gem of the deep learning revolution is the ability to take the derivative of arbitrary code and use that to optimize. Deep learning is just one application of that, but there are tons more.
Nice article, but the intro is a little lengthy.

I have one remark, though: if your language already allows for automatic differentiation, why bother with a neural network in the first place?

I think you should have a good reason for choosing a neural network to approximate the inverse function, and for giving it exactly that number of layers. For instance, why shouldn't a simple polynomial suffice? Could it be that your neural network ends up as an approximation of the Taylor expansion of your inverse function?
The nice thing about differentiable programming is that we aren't limited to gradient descent: we can use all sorts of other optimizers, some of which offer quadratic convergence instead of linear!
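As a minimal sketch of that point (assuming nothing about the article's setup; dual numbers and 1-D Newton's method in Haskell, all names mine): nesting forward-mode derivatives gives the second derivative, which is exactly what Newton's method needs for its quadratic convergence.

```haskell
{-# LANGUAGE RankNTypes #-}

-- Dual numbers: a value together with its derivative. Running ordinary
-- numeric code on Dual values yields exact first derivatives.
data Dual a = Dual a a

instance Num a => Num (Dual a) where
  Dual x dx + Dual y dy = Dual (x + y) (dx + dy)
  Dual x dx - Dual y dy = Dual (x - y) (dx - dy)
  Dual x dx * Dual y dy = Dual (x * y) (x * dy + dx * y)
  negate (Dual x dx)    = Dual (negate x) (negate dx)
  abs    (Dual x dx)    = Dual (abs x) (signum x * dx)
  signum (Dual x _)     = Dual (signum x) 0
  fromInteger n         = Dual (fromInteger n) 0

instance Fractional a => Fractional (Dual a) where
  Dual x dx / Dual y dy = Dual (x / y) ((dx * y - x * dy) / (y * y))
  fromRational r        = Dual (fromRational r) 0

-- First derivative of f at x.
diff :: Num a => (Dual a -> Dual a) -> a -> a
diff f x = let Dual _ d = f (Dual x 1) in d

-- One Newton step for minimising f: x - f'(x) / f''(x).
-- The second derivative comes from nesting duals: diff (diff f).
newtonStep :: (forall a. Fractional a => a -> a) -> Double -> Double
newtonStep f x = x - diff f x / diff (diff f) x

-- Minimise (x - 3)^2 + 1 starting from 0.
main :: IO ()
main = print (take 4 (iterate (newtonStep (\x -> (x - 3) * (x - 3) + 1)) 0))
```

Starting from x0 = 0 on (x - 3)^2 + 1, the Newton iterate jumps straight to the minimiser 3, where a fixed-step gradient descent would only approach it geometrically.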
Does someone have an example where the ability to "differentiate" a program gets you something interesting?

I understand perfectly what it means for a neural network, but how about more abstract things?

I'm not even sure the implementation, as currently presented, actually means something. What is the derivative of a function like List, or Sort, or GroupBy? These articles all assume that it somehow just looks like the derivative from calculus.

Approximating everything as some non-smooth real function doesn't seem entirely morally correct. A program is more discrete or synthetic. I think it should be a bit more algebraically flavoured, like differentials over a ring.
At first glance, this approach appears to re-invent an applied-mathematics approach to optimal control. There, one writes a generalized Hamiltonian, from which forward- and backward-in-time paths can be iterated.

The Pontryagin maximum (or minimum, if you define your objective function with a minus sign) principle is the essence of that approach to optimal control.
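To spell that out (standard textbook form, nothing specific to the article): for a running cost L(x, u) and dynamics ẋ = f(x, u), one writes

```latex
H(x, u, \lambda) = L(x, u) + \lambda^{\top} f(x, u),
\qquad
\dot{x} = \frac{\partial H}{\partial \lambda} = f(x, u)
  \quad \text{(integrated forward in time)},
\qquad
\dot{\lambda} = -\frac{\partial H}{\partial x}
  \quad \text{(integrated backward in time)},
\qquad
u^{*}(t) \in \arg\min_{u} H\bigl(x^{*}(t), u, \lambda^{*}(t)\bigr).
```

The backward-in-time costate equation is the continuous-time counterpart of the adjoint/reverse-mode gradient pass, which is presumably why the two approaches look like re-inventions of each other.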