This is really clever. If I understand correctly, they set up a network that encodes an image down to a representation consisting of parameters for a rendering engine. To ensure that this is the representation actually learned, the decoding stage re-renders the image subject to transformations applied to those parameters, with decoding driven by an initial reduction phase after rendering. I.e. it is like an autoencoder, but the innermost reduced representation is forced to correspond to a graphics rendering engine by manipulating the related transformation parameters.

Not only is this interesting as a way of learning to generate images, it is also a novel way to force a semantic internal representation, rather than leaving it to a regularisation strategy and interpreting a sparse encoding post hoc. It forces the internal representation to be inherently "tweakable."
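To make that concrete, here is a minimal sketch of the idea in PyTorch. Everything about it is illustrative rather than the paper's actual architecture: the layer sizes, the choice of a single latent unit standing in for "azimuth", and the way a known parameter change `delta` is injected between encoder and decoder.

```python
# Minimal sketch of an autoencoder whose latent code is treated as
# "graphics code": a designated unit that the training loop perturbs
# by a known scene transformation before decoding. Sizes and the
# perturbation scheme are illustrative only.
import torch
import torch.nn as nn

class InverseGraphicsAE(nn.Module):
    def __init__(self, code_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, code_dim),   # assumes 64x64 input
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 32 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x, delta):
        z = self.encoder(x).clone()  # z[:, 0] is the designated "azimuth"
        z[:, 0] = z[:, 0] + delta    # apply the known scene transformation
        return self.decoder(z)

# Training pairs: (view, transformed view, known parameter change).
model = InverseGraphicsAE()
x = torch.rand(4, 1, 64, 64)         # input views
x_rot = torch.rand(4, 1, 64, 64)     # same scenes after rotating by delta
delta = torch.full((4,), 0.1)        # known azimuth offset per pair
loss = nn.functional.mse_loss(model(x, delta), x_rot)
loss.backward()                      # pressure for z[:, 0] to act like azimuth
```

Because the reconstruction target is the transformed view, the only way for the network to drive the loss down is to make that latent unit actually behave like the corresponding rendering parameter.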
Very cool work, I'm happy to see more people thinking about deep networks along these lines.
It seems that this is very similar to recent work put on arXiv back in November:

"Learning to Generate Chairs with Convolutional Neural Networks"
http://arxiv.org/abs/1411.5928

They also have a very cool video of the generation process:
https://youtu.be/QCSW4isBDL0

It's very interesting to see two groups independently developing almost identical networks for inverse graphics tasks, both using pose, shape, and view parameters to guide learning. I think that continuing in this direction could provide a lot of insight into how these deep networks work, and lead to new improvements for recognition tasks too.

@tejask - You should probably cite the above paper, and thanks for providing code! Awesome!
This is very nice; however, I wish they had used a traditional rendering technique (e.g. raytracing) for the decoder stage. Computing the gradient would have been more difficult, but maybe not too bad with some form of automatic differentiation. Done that way, the renderings could scale to any resolution (post-learning) and employ all kinds of niceties such as depth of field, sub-surface scattering, etc. Instead we're left with these very blocky, quantized convolution-style images.
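For what it's worth, the autodiff part is easy to prototype today. Below is a toy sketch using PyTorch autograd in which a small analytic "sphere shader" stands in for a real raytracer; the function, its parameters, and the optimization setup are all made up for illustration. Gradients flow through the rendering math, so scene parameters can be fit by gradient descent.

```python
# Toy illustration of differentiating through a renderer with autograd.
# A real raytracer is far more involved; this analytic sphere shader
# just stands in for the decoder stage.
import torch

def render_sphere(center, radius, size=64):
    """Orthographic render of a diffuse-lit sphere as a size x size image."""
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, size), torch.linspace(-1, 1, size),
        indexing="ij")
    d2 = (xs - center[0]) ** 2 + (ys - center[1]) ** 2
    inside = torch.clamp(radius ** 2 - d2, min=0.0)
    z = torch.sqrt(inside + 1e-8)            # surface height above the plane
    light = torch.tensor([-0.5, -0.5, 1.0])  # fixed light, upper-left
    light = light / light.norm()
    normal = torch.stack([xs - center[0], ys - center[1], z], dim=-1)
    normal = normal / (normal.norm(dim=-1, keepdim=True) + 1e-8)
    shade = torch.clamp((normal * light).sum(-1), min=0.0)  # Lambertian
    return shade * (inside > 0).float()

# Scene parameters are the "graphics code" we optimize by gradient descent.
center = torch.tensor([0.3, -0.2], requires_grad=True)
radius = torch.tensor(0.4, requires_grad=True)
target = render_sphere(torch.tensor([0.0, 0.0]), torch.tensor(0.6)).detach()

opt = torch.optim.Adam([center, radius], lr=0.05)
for step in range(100):
    opt.zero_grad()
    loss = ((render_sphere(center, radius) - target) ** 2).mean()
    loss.backward()                # gradients flow through the renderer
    opt.step()
```

A production raytracer adds hard discontinuities (visibility, occlusion) that make the gradients much trickier, which is presumably part of why they stuck with a learned convolutional decoder.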
Reminds me of being blown away in 2007 by Vetter and Blanz chasing a similar aim: https://m.youtube.com/watch?v=jrutZaYoQJo
Whoa. Basically like http://www.di.ens.fr/willow/pdfscurrent/pami09a.pdf except it skips the (explicit) 3D mesh reconstruction altogether and goes straight to the rendered output.
So, in essence, this network can learn to "unproject" images.

Since projection is a lossy operation, a projected image has potentially multiple inverses. That makes me wonder how this system deals with the situation where two or more inverses exist and are equally likely.
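To spell out the lossiness with a standard pinhole model (nothing here is from the paper): every 3D point along a viewing ray lands on the same image coordinate, so unprojection is ambiguous without some prior.

```python
# Tiny demonstration that pinhole projection is many-to-one: distinct
# 3D points on the same viewing ray produce the identical image point,
# so the inverse ("unprojection") is ambiguous on its own.
import numpy as np

def project(p, f=1.0):
    """Pinhole projection of a 3D point p = (x, y, z) onto the image plane."""
    x, y, z = p
    return np.array([f * x / z, f * y / z])

near = np.array([1.0, 2.0, 4.0])
far = near * 2.5                    # a different point on the same ray
print(project(near))                # [0.25 0.5 ]
print(project(far))                 # [0.25 0.5 ]  -- identical projection
```

Presumably the network resolves such ties the way any learned inverse does: the training distribution over scenes acts as a prior, and the decoder commits to the explanation that distribution makes most likely.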