There is a recent 5-page theoretical paper on this topic that I thought was pretty interesting, and it tackles both deep nets and recurrent nets: <a href="http://arxiv.org/abs/1509.08101" rel="nofollow">http://arxiv.org/abs/1509.08101</a><p>Here is the abstract:<p>This note provides a family of classification problems, indexed by a positive integer k, where all shallow networks with fewer than exponentially (in k) many nodes exhibit error at least 1/6, whereas a deep network with 2 nodes in each of 2k layers achieves zero error, as does a recurrent network with 3 distinct nodes iterated k times. The proof is elementary, and the networks are standard feedforward networks with ReLU (Rectified Linear Unit) nonlinearities.
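<p>If I remember the construction right, the key trick is that a tiny ReLU layer can compute a "tent map" (2x on [0, 1/2], 2(1-x) on [1/2, 1]), and composing it k times produces a triangle wave with exponentially many oscillations, which a shallow net would need exponentially many units to match. This is my own illustrative sketch of that idea, not code from the paper (the names <code>tent_relu</code> and <code>deep_net</code> are mine):

```python
def relu(x):
    """Rectified Linear Unit."""
    return max(0.0, x)

def tent_relu(x):
    # Tent map on [0, 1]: m(x) = 2x for x <= 1/2, 2(1 - x) for x >= 1/2.
    # For x in [0, 1] this is exactly 2*relu(x) - 4*relu(x - 1/2),
    # i.e. one layer with 2 ReLU nodes.
    return 2 * relu(x) - 4 * relu(x - 0.5)

def deep_net(x, k):
    # Compose the 2-node tent layer k times. The result is a triangle
    # wave with 2**(k-1) peaks on [0, 1] -- exponentially many
    # oscillations from only O(k) nodes total.
    for _ in range(k):
        x = tent_relu(x)
    return x

if __name__ == "__main__":
    k = 3
    # The composed map hits 1.0 at every odd dyadic point (2j+1)/2**k.
    for j in range(2 ** (k - 1)):
        x = (2 * j + 1) / 2 ** k
        print(f"f_{k}({x}) = {deep_net(x, k)}")
```

The intuition for the lower bound is then that a shallow ReLU net with few units is piecewise linear with few pieces, so it can't track a target that flips exponentially often; the actual proof and constants are in the paper.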