> Erkut Aykutlug and Mark Peng used XGBoost with creative feature engineering whereas AutoML uses both neural network and gradient boosting tree (TFBT) with automatic feature engineering and hyperparameter tuning.

It's hilarious that a gradient boosted decision tree beat Google's fancy AutoML-generated neural networks.
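For a sense of scale, the winning approach is bread-and-butter Kaggle tooling; here is a minimal sketch of an XGBoost binary classifier on an already-engineered feature table (the data and hyperparameters below are placeholders, not the winners' actual configuration):

    import xgboost as xgb
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    # Stand-in data: in the competition this would be the provided
    # feature table and binary label column
    X, y = make_classification(n_samples=10_000, n_features=50, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=0
    )

    # Illustrative hyperparameters only; the winning team's settings
    # came from their own (creative) feature engineering and tuning
    model = xgb.XGBClassifier(n_estimators=500, learning_rate=0.05, max_depth=6)
    model.fit(X_train, y_train)

    print("validation AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))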
I think it’s a bit of an overstatement to call this an end-to-end solution.

What they are starting with here is a single table of data with all the features already defined and an existing binary label column. Typically, when this type of data is collected in the field, it is much more fine-grained (i.e., many observations collected over time) and unlabeled (e.g., how do we define a true example? How many false examples do we select?).

The competition description even goes as far as to say “We have chosen a dataset that you can get started with easily”.

So, yes, this is a cool demonstration of Google's product, but the success in the competition might not extend to the problems real businesses face when trying to apply ML to a problem like this.

That being said, I do think AutoML can help with these problems as it is extended to handle data that isn’t already in a single table.

For example, I’m a developer of an open-source library called Featuretools (https://github.com/Featuretools/featuretools) that tries to automate feature engineering for temporal and relational datasets. Basically, it helps data scientists prepare real-world data into the form this competition starts with.
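To give a flavor of what that looks like, here's a minimal sketch using the library's bundled demo data (argument names vary a bit between releases, so treat this as illustrative rather than definitive):

    import featuretools as ft

    # Load a small demo EntitySet: related tables of customers,
    # sessions, and transactions
    es = ft.demo.load_mock_customer(return_entityset=True)

    # Deep Feature Synthesis: automatically build a single feature table
    # per customer by aggregating and transforming across the related tables
    feature_matrix, feature_defs = ft.dfs(
        entityset=es,
        target_entity="customers",  # called "target_dataframe_name" in newer releases
        max_depth=2,
    )

    print(feature_matrix.head())

The resulting feature matrix is the kind of single-table, one-row-per-entity input that this competition handed to contestants on day one.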
I'm interested to know how easy this is for regular people (software engineers with just a little knowledge of data science) to use.

This part stands out:

"our team spent most of the time monitoring jobs and waiting for them to finish. Our solution for second place on the final leaderboard required 1 hour on 2500 CPUs"

Before I got to this part, I had assumed using AutoML would involve only reformatting the training/validation data and then letting a single job run its course. Why does something that's 'automatic' need people to run multiple jobs?

Anyone know why they used CPUs instead of GPUs/TPUs? If they're distributing the computation over hundreds of CPUs, then it's clear the computations can be done in parallel.
How do these AutoML solutions (like H2O) work in practice? Anyone willing to share their experience?

I wonder how automatic machine learning tools like these will shape the "data science" roles in the future. Obviously, the most cutting-edge research will always be done by specialised human experts, but perhaps tools like these will lower the bar required for the bulk of mainstream ML work.
Anthony, can Kaggle make this dataset public, or make the competition public and enable post-competition submissions? It would be beneficial for AutoML research.
Looks like H2O achieved a score of 0.61312 against Google's 0.61598, just training on a single machine:

https://twitter.com/ledell/status/1116533416155963392
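For anyone curious what "just training on a single machine" looks like, the standard H2O AutoML workflow is only a few lines; a minimal sketch (the file path, label name, and time budget are placeholders, not the settings behind that score):

    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()

    # Placeholder path and label column: the competition data is not public
    train = h2o.import_file("train.csv")
    y = "target"
    x = [c for c in train.columns if c != y]
    train[y] = train[y].asfactor()  # treat the label as categorical for classification

    # One-hour budget chosen for illustration only
    aml = H2OAutoML(max_runtime_secs=3600, seed=1)
    aml.train(x=x, y=y, training_frame=train)

    print(aml.leaderboard.head())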
Let's just hope this does not become the 'Excel' of the ML space. Then anyone will start 'coding' some godawful models and use them in critical day-to-day infrastructure...

Don't get me wrong, I'm all for democratizing ML, but sometimes these tools become fully-automatic, high-caliber footguns.