An End-to-End AutoML Solution for Tabular Data at KaggleDays

96 points by antgoldbloom, about 6 years ago

10 comments

ipsum2, about 6 years ago
> Erkut Aykutlug and Mark Peng used XGBoost with creative feature engineering whereas AutoML uses both neural network and gradient boosting tree (TFBT) with automatic feature engineering and hyperparameter tuning.

It's hilarious that gradient boosted decision trees beat Google's fancy AutoML-generated neural networks.
kmax12, about 6 years ago
I think it's a bit of an overstatement to call this an end-to-end solution.

What they are starting with here is a single table of data with all the features already defined and an existing binary label column. Typically, when this type of data is collected in the field, it is much more fine-grained (i.e., many observations collected over time) and unlabeled (e.g., how do we define a true example? How many false examples do we select?).

The competition description even goes as far as to say "We have chosen a dataset that you can get started with easily".

So, yes, this is a cool demonstration of Google's product, but the success in the competition might not extend to the problems real businesses face when trying to apply ML to a problem like this.

That being said, I do think AutoML can help with these problems as it is extended to handle data that isn't in a single table already.

For example, I'm a developer of an open-source library called Featuretools (https://github.com/Featuretools/featuretools) that tries to automate feature engineering for temporal and relational datasets. Basically, it helps data scientists prepare real-world data into the form this competition starts with.
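
As a rough illustration of the gap kmax12 describes, here is a minimal sketch of collapsing fine-grained, unlabeled observations into the single labeled table that the competition starts from. It uses only the Python standard library and is not the Featuretools API. The event records, the `build_table` helper, the cutoff date, and the label definition ("did the customer have any event on or after the cutoff?") are all hypothetical choices made for this example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical fine-grained observations: one row per customer event.
events = [
    {"customer": "a", "day": date(2019, 1, 3), "amount": 20.0},
    {"customer": "a", "day": date(2019, 1, 9), "amount": 35.0},
    {"customer": "b", "day": date(2019, 1, 5), "amount": 5.0},
    {"customer": "a", "day": date(2019, 2, 1), "amount": 50.0},
]


def build_table(events, cutoff):
    """Collapse events into one feature row per customer.

    Features use only events strictly before `cutoff`. The label is an
    assumed definition: 1 if the customer has any event on or after
    `cutoff`, else 0.
    """
    history = defaultdict(list)  # customer -> amounts before the cutoff
    active_later = set()         # customers seen on or after the cutoff
    for e in events:
        if e["day"] < cutoff:
            history[e["customer"]].append(e["amount"])
        else:
            active_later.add(e["customer"])

    rows = []
    for customer in sorted(history):
        amounts = history[customer]
        rows.append({
            "customer": customer,
            "n_events": len(amounts),
            "total_amount": sum(amounts),
            "mean_amount": sum(amounts) / len(amounts),
            "label": int(customer in active_later),
        })
    return rows


table = build_table(events, cutoff=date(2019, 2, 1))
for row in table:
    print(row)
```

Even in this toy case, the cutoff date and the label definition are decisions someone has to make before any AutoML tool can run, which is the part the competition dataset had already settled.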
hooloovoo_zoo, about 6 years ago
Slightly misleading, as it wasn't a traditional competition. The same page shows it at about the 75th percentile in "real" ones.
rahimnathwani, about 6 years ago
I'm interested to know how easy this is for regular people (software engineers with just a little knowledge of data science) to use.

This part stands out:

"our team spent most of time monitoring jobs and waiting for them to finish. Our solution for second place on the final leaderboard required 1 hour on 2500 CPUs"

Before I got to this part, I had assumed using AutoML would involve only reformatting the training/validation data, and then letting a single job run its course. Why does something that's 'automatic' need people to run multiple jobs?

Anyone know why they used CPUs instead of GPUs/TPUs? If they're distributing the computation over 100s of CPUs, then it's clear the computations can be done in parallel.
filleokus, about 6 years ago
How do these AutoML solutions (like h2o) work in practice? Anyone willing to share their experience?

I wonder how automatic machine learning tools like these will shape "data science" roles in the future. Obviously, the most cutting-edge research will always be done by specialised human experts, but perhaps tools like these will lower the bar required for the bulk of mainstream ML work.
pplonski86, about 6 years ago
Anthony, can Kaggle make this dataset public, or make the competition public and enable post-competition submissions? It would be beneficial for AutoML research.
villux, about 6 years ago
Does someone know what process they use for feature engineering?
martingoodson, about 6 years ago
Looks like H2O achieved a score of 0.61312 against Google's 0.61598, just training on a single machine:

https://twitter.com/ledell/status/1116533416155963392
kelvin0, about 6 years ago
Let's just hope this does not become the 'Excel' of the ML space. Then anyone will start 'coding' some godawful models and use them in critical day-to-day infrastructure ...

Don't get me wrong, I'm all for democratizing ML, but sometimes these tools become fully-automatic, high-caliber footguns.
ptah, about 6 years ago
Can AutoML be replicated outside of Google Cloud?