> Erkut Aykutlug and Mark Peng used XGBoost with creative feature engineering whereas AutoML uses both neural network and gradient boosting tree (TFBT) with automatic feature engineering and hyperparameter tuning.

It's hilarious that a gradient boosted decision tree beat Google's fancy AutoML-generated neural networks.
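For a sense of scale, the winning approach is bread-and-butter Kaggle tooling; here is a minimal sketch of an XGBoost binary classifier on an already-engineered feature table (the data and hyperparameters below are placeholders, not the winners' actual configuration):

    import xgboost as xgb
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    # Stand-in data: in the competition this would be the provided
    # feature table and binary label column
    X, y = make_classification(n_samples=10_000, n_features=50, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=0
    )

    # Illustrative hyperparameters only; the winning team's settings
    # came from their own (creative) feature engineering and tuning
    model = xgb.XGBClassifier(n_estimators=500, learning_rate=0.05, max_depth=6)
    model.fit(X_train, y_train)

    print("validation AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))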
I think it’s a bit of an overstatement to call this an end-to-end solution.

What they are starting with here is a single table of data with all the features already defined and an existing binary label column. Typically, when this type of data is collected in the field, it is much more fine-grained (i.e., many observations collected over time) and unlabeled (e.g., how do we define a true example? How many false examples do we select?).

The competition description even goes as far as to say “We have chosen a dataset that you can get started with easily”.

So, yes, this is a cool demonstration of Google's product, but the success in the competition might not extend to the problems real businesses face when trying to apply ML to a problem like this.

That being said, I do think AutoML can help with these problems as it is extended to handle data that isn’t already in a single table.

For example, I’m a developer of an open-source library called Featuretools (https://github.com/Featuretools/featuretools) that tries to automate feature engineering for temporal and relational datasets. Basically, it helps data scientists prepare real-world data into the form this competition starts with.
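To give a flavor of what that looks like, here's a minimal sketch using the library's bundled demo data (argument names vary a bit between releases, so treat this as illustrative rather than definitive):

    import featuretools as ft

    # Load a small demo EntitySet: related tables of customers,
    # sessions, and transactions
    es = ft.demo.load_mock_customer(return_entityset=True)

    # Deep Feature Synthesis: automatically build a single feature table
    # per customer by aggregating and transforming across the related tables
    feature_matrix, feature_defs = ft.dfs(
        entityset=es,
        target_entity="customers",  # called "target_dataframe_name" in newer releases
        max_depth=2,
    )

    print(feature_matrix.head())

The resulting feature matrix is the kind of single-table, one-row-per-entity input that this competition handed to contestants on day one.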
I'm interested to know how easy this is for regular people (software engineers with just a little knowledge of data science) to use.

This part stands out:

"our team spent most of the time monitoring jobs and waiting for them to finish. Our solution for second place on the final leaderboard required 1 hour on 2500 CPUs"

Before I got to this part, I had assumed using AutoML would involve only reformatting the training/validation data and then letting a single job run its course. Why does something that's 'automatic' need people to run multiple jobs?

Anyone know why they used CPUs instead of GPUs/TPUs? If they're distributing the computation over hundreds of CPUs, then it's clear the computations can be done in parallel.
How do these AutoML solutions (like H2O) work in practice? Anyone willing to share their experience?

I wonder how automatic machine learning tools like these will shape the "data science" roles in the future. Obviously, the most cutting-edge research will always be done by specialised human experts, but perhaps tools like these will lower the bar required for the bulk of mainstream ML work.
Anthony, can Kaggle make this dataset public, or make the competition public and enable post-competition submissions? It would be beneficial for AutoML research.
Looks like H2O achieved a score of 0.61312 against Google's 0.61598, just training on a single machine:

https://twitter.com/ledell/status/1116533416155963392
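For anyone curious what "just training on a single machine" looks like, the standard H2O AutoML workflow is only a few lines; a minimal sketch (the file path, label name, and time budget are placeholders, not the settings behind that score):

    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()

    # Placeholder path and label column: the competition data is not public
    train = h2o.import_file("train.csv")
    y = "target"
    x = [c for c in train.columns if c != y]
    train[y] = train[y].asfactor()  # treat the label as categorical for classification

    # One-hour budget chosen for illustration only
    aml = H2OAutoML(max_runtime_secs=3600, seed=1)
    aml.train(x=x, y=y, training_frame=train)

    print(aml.leaderboard.head())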
Let's just hope this does not become the 'Excel' of the ML space. Then anyone will start 'coding' some godawful models and use them in critical day-to-day infrastructure...

Don't get me wrong, I'm all for democratizing ML, but sometimes these tools become fully-automatic, high-caliber footguns.