On the whole this is useful, although I think it's a little unfair to Theano in places.

* Performance

I feel they should score compilation/startup time and runtime performance separately here. Theano's compilation step can be slow the first time around. (In my personal experience it's not enough to add significant friction at development time, but YMMV -- I hear it can struggle with some more complex architectures like deep stacked RNNs.)

Its compilation process gives some unique advantages though --
for example it can generate and compile custom kernels for fused elementwise operations, which can give runtime speed advantages that aren't achievable by simply stacking layers with pre-canned kernels. Some of its graph optimisations are pretty useful too. In short, smarter compilation can save you from having to implement your own kernels to get good performance on non-standard architectures. If you're doing research, that can matter.

* Architecture

The architecture of Theano's main public API is clean and elegant IMO, which is what matters most.

When it comes to extensibility, firstly, you don't need to implement custom Ops very often -- certainly not as often as you might implement a custom Layer in Torch. That's because Theano ships with lots of fundamental tensor operations that you can compose, *and* a compiler that can optimise the resulting graph well.

About the idea that it's hacky that "the whole code base is Python where C/CUDA code is packaged as Python string": if you want to generate new CUDA kernels programmatically then you're going to want to use some high-level language to do it. As noted above, Theano gets some unique advantages from being able to do this. At some conceptual cost I'm sure it'd be possible to handle this code generation in a slightly cleaner way, but I don't really see anyone else in this area doing it significantly better, so given the constraints I think it's a bit subjective and slightly unfair to call it "hacky".

I also think this is something that matters more for framework developers than for users. In my experience, in the relatively rare situations where you do need to implement a custom Op, it's usually as a performance optimisation, and you can get away with something relatively simple and problem-specific -- essentially a thin Python wrapper around some fixed kernel code.

The CGT project (which seems to be aiming to be a better Theano) has some valid and more detailed criticism of the architecture of the compiler, which I think is fairer: http://rll.berkeley.edu/cgt/#whynottheano

I'm also hoping that in due course Tensorflow will come closer to parity with some of Theano's compiler smarts, at which point I'll be eager to switch, as Tensorflow has some other advantages -- multi-GPU for one.
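To make the elementwise-fusion point concrete, here's a minimal sketch of the kind of thing I mean (assuming Theano is installed; the expression itself is just illustrative). You build the expression out of stock tensor ops, and the compiler fuses the elementwise chain into a single kernel rather than launching one kernel and materialising one intermediate array per op:

    import numpy as np
    import theano
    import theano.tensor as T

    x = T.vector('x')

    # A chain of elementwise operations composed from built-in tensor ops.
    # Theano's optimiser fuses these into one elementwise kernel at
    # compilation time instead of evaluating them one op at a time.
    y = T.tanh(x) * T.exp(-x ** 2) + 1

    # Compilation (the potentially slow step discussed above) happens here.
    f = theano.function([x], y)

    print(f(np.asarray([0.0, 0.5, 1.0], dtype=theano.config.floatX)))

If you inspect the compiled graph with theano.printing.debugprint(f), you should see the chain collapsed into a single Elemwise Composite node -- that's the fusion doing its work.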
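And to illustrate what I mean by a thin Python wrapper, here's a rough sketch of the shape a custom Op takes (the Op and its 'scale' parameter are hypothetical, and I've given it a pure-Python perform() for clarity; a performance-oriented version would also supply the C/CUDA source, returned from c_code() as a string, which is where the "code as Python string" business comes in):

    import numpy as np
    import theano
    import theano.tensor as T

    class ClipScale(theano.Op):
        """Hypothetical Op: clip input to [0, 1], then multiply by a constant."""
        __props__ = ('scale',)

        def __init__(self, scale):
            self.scale = scale

        def make_node(self, x):
            x = T.as_tensor_variable(x)
            # Output has the same type (dtype/broadcastable pattern) as the input.
            return theano.Apply(self, [x], [x.type()])

        def perform(self, node, inputs, output_storage):
            # NumPy fallback implementation; a fixed C/CUDA kernel would
            # normally live alongside this for speed.
            (x,) = inputs
            out = np.clip(x, 0.0, 1.0) * self.scale
            output_storage[0][0] = out.astype(x.dtype)

    x = T.matrix('x')
    f = theano.function([x], ClipScale(2.0)(x))

The point is that the Python side is mostly plumbing -- declaring the output type and handing results back -- so even when you do need a custom Op, the effort mostly goes into the kernel itself.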