If AutoML-Zero is going to be more than a grid-like method, then I think it should try to learn a probabilistic distribution over (method, problem, efficiency) and use it to discover features for problems with an auto-encoder whose loss function is a metric over the (method, efficiency) space. That means using transfer learning from related problems, where the similarity of problems is based on the (method, efficiency) difference.

Problem P1 is locally similar to P2 if (method, efficiency, P1), measured in computation time, is similar to (method, efficiency, P2) for each method in a local space of methods. The system should learn to embed both problems and methods, much like learning words and context words in NLP, or matrix factorization in recommendation systems (a small sketch of this appears after the list below). To sample the (problem, method, efficiency) space one needs huge resources.

Added: to compare a pair of (method, problem), some standardization should be used. For linear problems related to solving linear systems, the condition number of the coefficient matrix should be used as a feature for standardization; in SAT, for example, a heuristic based on the number of clauses and variables should be used to estimate the complexity and normalize problems (also sketched after the list). So the preprocessing step should use the best known heuristic for solving the problem and estimating its complexity, as both a feature and a means of normalization. Heuristics combined with deep learning for TSP are approaching SOTA (but Concorde is still better).

Finally, perhaps some encoding of how the heuristic was obtained could be used as a feature of the problem (heuristic from minimum spanning tree, branch and bound, dynamic programming, recurrence, memoization, hill climbing, ...), as an enumerative type.

So some problems for preprocessing are:
1) What is a good heuristic for solving this problem.
2) What is a good heuristic for bounding or estimating its complexity.
3) How can you use those heuristics to standardize or normalize its complexity.
4) How big should the problem be so that the asymptotic complexity dominates the noise of small instances.
5) How do you encode the different types of heuristics.
6) How do you weigh sequential versus parallel methods for solving the problem.

Finally, I wonder whether, once a problem is auto-encoded, some kind of curvature could be defined; that curvature should be related to the average complexity of a local space of problems, and transitions, as in graph problems, should also be featured. The idea is to give the system a seed set of features so it can combine them or discover new, better ones. Curvature could be used for clustering problems, that is, for classifying types of problems. For example, all preprocessed problems for solving a linear system should be normalized to have similar efficiency when using the family F of learning methods; otherwise a feature is introduced for further normalization. For example, some problems could require estimating the number of local extrema and the flat (zero-curvature) extent of those zones.
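To make the matrix-factorization analogy above concrete, here is a minimal sketch (my own illustration, not anything from the AutoML-Zero paper) of learning joint problem and method embeddings from observed (problem, method, efficiency) triples in plain NumPy. The synthetic data, dimensions, and names are all made up; in practice the log-runtime table would come from actually benchmarking (method, problem) pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_problems, n_methods, dim = 50, 20, 8

# Synthetic "efficiency" table: log-runtime of each method on each problem.
true_p = rng.normal(size=(n_problems, dim))
true_m = rng.normal(size=(n_methods, dim))
log_runtime = true_p @ true_m.T + 0.1 * rng.normal(size=(n_problems, n_methods))

# Factorize the table into problem and method embeddings by gradient descent,
# in the spirit of word/context embeddings or recommender-system factorization.
P = 0.01 * rng.normal(size=(n_problems, dim))   # problem embeddings
M = 0.01 * rng.normal(size=(n_methods, dim))    # method embeddings
lr, reg = 0.005, 1e-3

for step in range(3000):
    err = P @ M.T - log_runtime                 # reconstruction error
    P, M = P - lr * (err @ M + reg * P), M - lr * (err.T @ P + reg * M)

# Two problems are "locally similar" when nearby methods perform similarly on
# them, which the factorization turns into closeness of their embeddings.
def problem_similarity(i, j):
    a, b = P[i], P[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(problem_similarity(0, 1))
```

The learned method vectors could then supply the metric over (method, efficiency) space that the auto-encoder's loss would use, and the problem vectors are the candidate transfer-learning features.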
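Along the same lines, a rough sketch of the preprocessing/standardization step: the condition number of the coefficient matrix as a complexity feature for linear-system problems, the clause-to-variable ratio as a crude SAT hardness proxy, and the provenance of the heuristic as an enumerated feature. The particular normalization formula is my assumption, purely for illustration.

```python
import numpy as np
from enum import Enum

class HeuristicOrigin(Enum):
    """How the complexity-estimating heuristic was obtained (categorical feature)."""
    MINIMUM_SPANNING_TREE = 1
    BRANCH_AND_BOUND = 2
    DYNAMIC_PROGRAMMING = 3
    RECURRENCE = 4
    MEMOIZATION = 5
    HILL_CLIMBING = 6

def linear_system_features(A):
    """Features for solving A x = b: size and condition number of the coefficient matrix."""
    return {"n": A.shape[0], "log_cond": float(np.log10(np.linalg.cond(A)))}

def sat_features(n_vars, n_clauses):
    """Crude SAT hardness proxy: random 3-SAT is hardest near a clause/variable ratio of ~4.26."""
    return {"n_vars": n_vars, "clause_var_ratio": n_clauses / n_vars}

def normalize_runtime(runtime_s, features):
    """Divide measured runtime by an assumed heuristic complexity estimate so that
    problems of different intrinsic difficulty become comparable."""
    if "log_cond" in features:
        scale = (features["n"] ** 3) * (1.0 + features["log_cond"])
    else:
        scale = features["n_vars"] * features["clause_var_ratio"]
    return runtime_s / max(scale, 1e-9)

A = np.random.default_rng(1).normal(size=(100, 100))
feats = linear_system_features(A)
feats["heuristic_origin"] = HeuristicOrigin.DYNAMIC_PROGRAMMING  # illustrative provenance tag
print(feats, normalize_runtime(0.8, feats))
```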
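And one entirely speculative reading of the closing curvature idea: treat the local variability of average complexity around a problem's embedding as a curvature-like score, then cluster problems on it to classify problem types. The neighbourhood definition and the score itself are assumptions of mine, not something defined above.

```python
import numpy as np

def curvature_scores(P, log_runtime, k=5):
    """Curvature-like proxy: spread of average complexity within each problem's
    local neighbourhood in embedding space (an assumed definition)."""
    mean_eff = log_runtime.mean(axis=1)                             # average log-runtime per problem
    dists = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)  # pairwise embedding distances
    scores = np.empty(len(P))
    for i in range(len(P)):
        neighbours = np.argsort(dists[i])[1:k + 1]                  # k nearest problems, excluding self
        scores[i] = np.var(mean_eff[neighbours])
    return scores

def cluster_by_curvature(scores, n_clusters=3, iters=50):
    """Tiny 1-D k-means over the curvature scores, grouping problem types."""
    rng = np.random.default_rng(0)
    centers = rng.choice(scores, size=n_clusters, replace=False)
    for _ in range(iters):
        labels = np.argmin(np.abs(scores[:, None] - centers[None, :]), axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = scores[labels == c].mean()
    return labels

# Stand-alone demo with random stand-ins for the embeddings and runtime table
# (in practice, reuse P and log_runtime from the factorization sketch above).
rng = np.random.default_rng(2)
P_demo = rng.normal(size=(50, 8))
runtime_demo = rng.normal(size=(50, 20))
print(cluster_by_curvature(curvature_scores(P_demo, runtime_demo)))
```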