Lead author of DSSTNE here...

1. DSSTNE was designed two years ago specifically for product recommendations from Amazon's catalog. At that time there was no TensorFlow, only Theano and Torch. DSSTNE differentiated itself from those two frameworks by optimizing for sparse data and for neural networks that span multiple GPUs. What it is not (currently) is yet another framework for running AlexNet/VGG/GoogleNet etc., but about 500 lines of code plus cuDNN could change that if the demand exists. Implementing Krizhevsky's "one weird trick" is mostly trivial, since the harder model-parallel part has already been written.

2. DSSTNE does not yet explicitly support RNNs, but it does support shared weights, and that's more than enough to build an unrolled RNN. We tried a few, in fact. cuDNN 5 could be used to add LSTM support in a couple hundred lines of code, but since (I believe) the LSTM in cuDNN is a black box, it cannot be spread across multiple GPUs. It's not too hard to write from the ground up, though.

3. A huge number of collaborators and people behind the scenes made this happen. I'd love to acknowledge them openly, but I'm not sure they want their names known.

4. Say what you want about Amazon (they're not perfect), but they let us build this from the ground up and have now given it away. Google, on the other hand, hired me away from NVIDIA in 2011 (another one of those offers I couldn't refuse), blind-allocated me into search, and would not let me work with GPUs, despite my being one of the founding members of NVIDIA's CUDA team, because they had not yet seen them as useful. I didn't stay there long. DSSTNE is 100% fresh code, warts and all, and I thank Amazon both for letting me work on a project like this and for open-sourcing the code.

5. NetCDF is a nice, efficient format for big data files. What other formats would you suggest we support here?

6. I was boarding a plane when they finally released this. I will be benchmarking it in the next few days. TL;DR spoilers: near-perfect scaling for hidden layers with 1000 or so hidden units per GPU in use, and effectively free sparse input layers, because both the activation and the weight-gradient calculations have custom sparse kernels.

7. The JSON format made sense in 2014, but IMO what this engine needs now is a TensorFlow graph importer. Since the engine builds networks from a rather simple underlying C struct, this isn't particularly hard, but it does require supporting some additional functionality to be 100% compatible.

8. I left Amazon 4 months ago after getting an offer I couldn't refuse. I was the sole GPU coder on this project. I can count the people I'd trust with an engine like this on two hands, and most of them are already building deep learning engines elsewhere. I'm happy to add whatever functionality is desired here. CNN and RNN support seem like two good first steps, and the spec already accounts for them.

9. Ditto for a Python interface, easily implemented IMO through the Python C/C++ extension mechanism (https://docs.python.org/2/extending/extending.html); there's a rough sketch of what that could look like at the end of this comment.

Anyway, it's late, and it turned out to be a fantastic day, seeing the project on which I spent nearly two years go OSS.
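
P.S. Re: point 9, here is a minimal sketch of what a hand-rolled binding could look like using the Python 2 C API from the docs link above. The module name "dsstne" and the "train" entry point are hypothetical placeholders I made up for illustration, not the engine's actual API; a real binding would forward to the engine's own entry points.

    /* Minimal sketch of a Python 2 extension module (see the docs link in
       point 9). "dsstne" and "train" are hypothetical names, not part of
       the actual engine. */
    #include <Python.h>

    /* Hypothetical wrapper: take a config-file path from Python and hand it
       to an (assumed) C-linkage entry point into the engine. */
    static PyObject* dsstne_train(PyObject* self, PyObject* args)
    {
        const char* config_path;
        if (!PyArg_ParseTuple(args, "s", &config_path))
            return NULL;
        /* ... call into the engine here with config_path ... */
        Py_RETURN_NONE;
    }

    static PyMethodDef DsstneMethods[] = {
        {"train", dsstne_train, METH_VARARGS,
         "Train a network from a JSON config (hypothetical)."},
        {NULL, NULL, 0, NULL}
    };

    /* Python 2 module initialization. */
    PyMODINIT_FUNC initdsstne(void)
    {
        Py_InitModule("dsstne", DsstneMethods);
    }

Compile that against the Python headers (e.g. with distutils) and "import dsstne" works from any Python script; prediction and data-loading calls would follow the same pattern.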