In short, these guys are using deep neural nets to find good hyperparameters for training other deep neural nets. This works as well as a Gaussian process[1] but is more scalable and can be parallelized, allowing for faster hyperparameter optimization.<p>--<p>[1] For example, Spearmint: <a href="https://github.com/JasperSnoek/spearmint" rel="nofollow">https://github.com/JasperSnoek/spearmint</a>
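To make the idea concrete, here's a rough Python sketch of what a neural-net surrogate for hyperparameter search can look like. This is not the paper's exact method, just the general shape of surrogate-model optimization with a neural net in place of a GP; the ensemble is a cheap stand-in for predictive uncertainty, and train_and_eval is a made-up placeholder for actually training the target model.

    # Rough sketch of neural-net-based hyperparameter search, NOT the paper's
    # method: an ensemble of small MLPs acts as a cheap stand-in for a Bayesian
    # surrogate, and expected improvement picks the next point to evaluate.
    # `train_and_eval` is a hypothetical placeholder for training the real model.
    import numpy as np
    from scipy.stats import norm
    from sklearn.neural_network import MLPRegressor

    def train_and_eval(hp):
        # placeholder: would train the target net with these hyperparameters
        # and return its validation loss; here just a synthetic bowl plus noise
        log_lr, dropout = hp
        return (log_lr + 3.0) ** 2 + (dropout - 0.3) ** 2 + 0.05 * np.random.randn()

    rng = np.random.default_rng(0)
    bounds = np.array([[-6.0, -1.0],   # log10 learning rate
                       [0.0, 0.9]])    # dropout rate

    # seed the surrogate with a few random evaluations
    X = rng.uniform(bounds[:, 0], bounds[:, 1], size=(5, 2))
    y = np.array([train_and_eval(x) for x in X])

    for _ in range(20):
        # spread across ensemble members approximates predictive uncertainty
        nets = [MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000,
                             random_state=s).fit(X, y) for s in range(5)]
        cand = rng.uniform(bounds[:, 0], bounds[:, 1], size=(1000, 2))
        preds = np.stack([n.predict(cand) for n in nets])
        mu, sigma = preds.mean(0), preds.std(0) + 1e-9

        # expected improvement acquisition (we are minimizing validation loss)
        best = y.min()
        z = (best - mu) / sigma
        ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
        x_next = cand[np.argmax(ei)]

        X = np.vstack([X, x_next])
        y = np.append(y, train_and_eval(x_next))

    print("best setting found:", X[np.argmin(y)], "val loss:", y.min())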
I haven't read the paper, just skimmed it, but isn't this sort of an unreasonable comparison? Full GPs are O(N^3) because they invert a covariance matrix that allows every datapoint to be related to every other datapoint. There's a fair amount of literature on sparse approximate Gaussian processes, which iirc are roughly O(N*M^2), where N is the number of datapoints and M is the size of some set of active (inducing) points, which is tunable. That seems like the natural point of comparison to their neural net approach.<p>Broadly, in their deep net they've chosen some architecture that fixes the number of parameters. Either the neural net or the sparse GP approach can claim runtime that is linear in the data size, and both are doing less work than the full GP for essentially the same reason.
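If it helps make the complexity argument concrete, here's a toy numpy sketch of the difference (synthetic 1-D data, inducing points chosen as a random subset, subset-of-regressors posterior mean only; nothing taken from the paper):

    # Toy sketch of the cost difference described above: the full GP mean needs
    # an O(N^3) solve against an N x N kernel matrix, while a subset-of-regressors
    # style approximation with M inducing points needs roughly O(N*M^2 + M^3).
    import numpy as np

    def rbf(A, B, lengthscale=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale ** 2)

    rng = np.random.default_rng(0)
    N, M, noise = 2000, 50, 0.1
    X = rng.uniform(-3, 3, size=(N, 1))
    y = np.sin(X[:, 0]) + noise * rng.standard_normal(N)
    Xs = np.linspace(-3, 3, 100)[:, None]            # test inputs

    # full GP posterior mean: solve with the N x N kernel matrix, O(N^3)
    Knn = rbf(X, X) + noise ** 2 * np.eye(N)
    mean_full = rbf(Xs, X) @ np.linalg.solve(Knn, y)

    # sparse approximation: M inducing points, here a random subset of the data
    Z = X[rng.choice(N, M, replace=False)]
    Kmm = rbf(Z, Z)
    Knm = rbf(X, Z)                                  # N x M, the big-but-thin matrix
    A = Knm.T @ Knm + noise ** 2 * Kmm               # M x M, O(N*M^2) to form
    mean_sparse = rbf(Xs, Z) @ np.linalg.solve(A + 1e-8 * np.eye(M), Knm.T @ y)

    print("max |full - sparse| on the test grid:", np.abs(mean_full - mean_sparse).max())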