TechEcho
A Look into Machine Learning's First Cheating Scandal

246 points by metachris over 9 years ago

10 comments

TuringTest over 9 years ago
Long story short:

- The LSVRC "visual recognition" competition has a rule that limits each contestant to running their entries against the ImageNet dataset twice per week.

- The Baidu team ran their tests much more often, claiming they understood the limit to apply per person and not per team.

- The rule is in place because more frequent submissions distort the quality of the test: they shift its emphasis from true algorithmic advances to overlearning (adapting too much to the specific data in the test dataset), which can influence the whole machine learning discipline given this contest's high profile. (*)

- As a result, the whole Baidu company has been banned from the competition for a year.

(*) "The Baidu team’s oversubmissions tilted the balance of forward progress on the LSVRC from algorithmic advances to hyperparameter optimization."
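To make the overlearning point concrete, here is a minimal sketch in Python. It assumes a toy setup that is not from the article: every "model" is a coin flip with no real skill, and the function name and parameters are hypothetical. It shows that simply keeping the best of many submissions against one fixed test set reports a score above chance.

```python
import random

def best_of_k_random_guessers(n_test=1000, k=200, seed=0):
    """Score k coin-flip 'models' on one fixed test set.

    Returns (accuracy of the first submission, best accuracy over all k).
    Hypothetical toy setup: no model has any real skill.
    """
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_test)]
    scores = []
    for _ in range(k):
        preds = [rng.randint(0, 1) for _ in range(n_test)]
        acc = sum(p == y for p, y in zip(preds, labels)) / n_test
        scores.append(acc)
    return scores[0], max(scores)

first, best = best_of_k_random_guessers()
print(f"single submission: {first:.3f}, best of 200: {best:.3f}")
```

The gap between the two numbers comes purely from selection on the test set, which is why a cap on submissions limits how much of a reported gain can be this kind of artifact.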
not_that_noob over 9 years ago
To translate to a more familiar domain, think about the SATs. After a period of study, students take the SAT, where they get back a composite score of how they did, but never the actual answers to the questions on the test.

Now imagine a student can take the test repeatedly over the space of a few days, and can use the score to reverse engineer the answers to the questions. They can put in random answers and note which ones cause the score to go up. Of course the real-life SATs don't allow this, and they change up the questions to prevent this sort of cheating. If this were possible, our enterprising/cheating student could derive the complete answer key over time, noting the changes in scores for each run. And once they have the key, they can ace the test. No longer is it a test of their aptitude, but rather of their knowledge of the answer key.

This scandal is analogous to this albeit contrived example. With an ML test set, it's not possible to change the data, because you want it standardized so you can evaluate the improvements that new approaches may bring. It's the only way to have a meaningful yardstick to measure against. Thus, the only way to prevent such gaming is to restrict multiple submissions, so that you can't do 'hyperparameter optimization', i.e. overlearn on the test set.

That's why it's cheating: it's not a measure of how well your algorithm did, but rather of how well you reverse-engineered the answer key. It's a huge disservice to the field, and the people who did this should be ashamed of themselves.
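The contrived SAT example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of what any contestant did: the multiple-choice format, `make_oracle`, and `recover_key` are all hypothetical names, and the scorer reveals nothing except the total number of correct answers.

```python
import random

def make_oracle(answer_key):
    """Return (score, calls): score() reveals only the total number correct."""
    calls = {"n": 0}

    def score(submission):
        calls["n"] += 1
        return sum(a == b for a, b in zip(submission, answer_key))

    return score, calls

def recover_key(score, n_questions, choices="ABCD"):
    """Recover the key one question at a time by watching the score change."""
    guess = [choices[0]] * n_questions
    base = score(guess)
    for i in range(n_questions):
        for c in choices[1:]:
            trial = guess.copy()
            trial[i] = c
            s = score(trial)
            if s > base:  # score went up, so c is the right answer for question i
                guess, base = trial, s
                break
        # If no alternative raised the score, choices[0] was already correct.
    return guess

rng = random.Random(0)
key = [rng.choice("ABCD") for _ in range(50)]
score, calls = make_oracle(key)
recovered = recover_key(score, len(key))
print(recovered == key, "submissions used:", calls["n"])
```

The full key falls out after at most one baseline submission plus three per question, which is why the number of allowed submissions, and not the secrecy of the answers alone, is what protects a fixed test set.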
philh over 9 years ago
Idle speculation:

> "The key sentence here is, 'Please note that you cannot make more than 2 submissions per week.' It is our understanding that this is to say that one individual can at most upload twice per week. Otherwise, if the limit was set for a team, the proper text should be 'your team' instead," Wu wrote.

I wonder whether, to a native Chinese speaker, this really does sound like it's talking about individual people, and saying "you" when one means "your team" seems really bizarre. Can any Chinese speakers weigh in?

(Even stipulating this, the affair still sounds more like malice than incompetence on the part of Wu.)
dasboth over 9 years ago
Congratulations to the author for both of his last 2 posts making it to the HN front page!

This explains the need to drill home the train-test idea from the last post. I hadn't thought about this before, but multiple submissions do amount to multiple peeks at your held-out test set, which is a huge ML no-no.

I don't know much about LSVRC, but doesn't the way Kaggle works prevent this? AFAIR you get a "public" test score which is used for the leaderboards, but once the deadline for submissions is up, each submission is evaluated on a held-out test set, giving you a "private" score. Now that I think about it, I'm not sure how that works; I guess the accuracy they show you as your public score is only on part of the submitted rows? Regardless of how that's done, could the LSVRC organisers not do something similar?
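One way to picture the public/private arrangement is the sketch below. It assumes the commenter's guess is right that the public score is computed on a fixed subset of the submitted rows; the function names, the 30% fraction, and the toy data are hypothetical and not taken from Kaggle's documentation.

```python
import random

def split_public_private(n_rows, public_fraction=0.3, seed=0):
    """Fix, once and in secret, which test rows feed the public score."""
    rng = random.Random(seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)
    cut = int(n_rows * public_fraction)
    return sorted(idx[:cut]), sorted(idx[cut:])

def accuracy(preds, labels, rows):
    """Accuracy of preds against labels, restricted to the given row indices."""
    return sum(preds[i] == labels[i] for i in rows) / len(rows)

# Toy data: contestants always submit predictions for every row.
rng = random.Random(1)
labels = [rng.randint(0, 1) for _ in range(100)]
preds = [y if rng.random() < 0.8 else 1 - y for y in labels]

public_rows, private_rows = split_public_private(len(labels))
public_score = accuracy(preds, labels, public_rows)    # shown during the contest
private_score = accuracy(preds, labels, private_rows)  # revealed after the deadline
print(f"public: {public_score:.2f}, private: {private_score:.2f}")
```

Because contestants never learn which rows are public, tuning against the public score gives no direct feedback on the private rows that decide the final ranking.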
hbogert over 9 years ago
Couldn't the LSVRC just limit the number of submissions? Why rely on contestants' competence or good intentions for this?
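A server-side limit of the kind asked about here could look roughly like the following sketch. It is purely illustrative: the `TeamSubmissionLimiter` class and the account-to-team mapping are hypothetical, and nothing in the thread says how the LSVRC evaluation server actually worked. The point it shows is that the count has to be keyed by team, not by account, for "2 submissions per week" to mean what the organisers intended.

```python
from collections import defaultdict, deque

WEEK_SECONDS = 7 * 24 * 3600

class TeamSubmissionLimiter:
    """Allow at most max_per_week submissions per team in any 7-day window."""

    def __init__(self, account_to_team, max_per_week=2):
        self.account_to_team = account_to_team  # hypothetical registration data
        self.max_per_week = max_per_week
        self.history = defaultdict(deque)       # team -> recent submission times

    def try_submit(self, account, now):
        """Return True and record the submission if the team is under its limit."""
        team = self.account_to_team[account]
        recent = self.history[team]
        # Drop submissions that are a week old or older.
        while recent and now - recent[0] >= WEEK_SECONDS:
            recent.popleft()
        if len(recent) >= self.max_per_week:
            return False
        recent.append(now)
        return True

limiter = TeamSubmissionLimiter({"alice": "team1", "bob": "team1"})
print(limiter.try_submit("alice", now=0))   # first submission for team1
print(limiter.try_submit("bob", now=10))    # second, from a teammate
print(limiter.try_submit("alice", now=20))  # third within the week is refused
```

The hard part in practice is the mapping itself: the limiter is only as good as the organisers' knowledge of which accounts belong to which team.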
HappyTypist over 9 years ago
> there are almost no papers focusing on 3 or 4-layer CNN’s these days, for example

What are the 'best practices' for the number of hidden layers in a CNN? 1 or 2 hidden layers?
Flockster over 9 years ago
So what would a better dataset look like? Does bigger always equal harder? What are the criteria to measure that?

And wouldn't video datasets be somewhat easier to analyse, given that you have multiple frames of the same object?
jmount over 9 years ago
Nice article on why it is cheating (in a mathematical sense, independent of language) to get scores from the hold-out leaderboard too many times (plus some methods to mitigate the effect): http://arxiv.org/abs/1502.04585
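As a rough illustration of what a mitigation can look like, the sketch below implements a leaderboard that only releases a new score when a submission beats the previously released best by more than a fixed margin, and rounds what it releases. This is a simplified sketch in that general spirit and is not claimed to reproduce the linked paper's exact method; the class name and the `step` parameter are hypothetical.

```python
class MarginLeaderboard:
    """Release a new score only when it beats the last released one by > step."""

    def __init__(self, labels, step=0.01):
        self.labels = labels
        self.step = step
        self.best = None  # last released score, or None before any submission

    def submit(self, preds):
        """Return the released score, which may be unchanged from last time."""
        acc = sum(p == y for p, y in zip(preds, self.labels)) / len(self.labels)
        if self.best is None or acc > self.best + self.step:
            # Round to a multiple of step so small fluctuations stay hidden.
            self.best = round(acc / self.step) * self.step
        return self.best

labels = [1] * 20
board = MarginLeaderboard(labels, step=0.1)
print(board.submit([1] * 12 + [0] * 8))  # first submission is always released
print(board.submit([1] * 13 + [0] * 7))  # small gain: released score unchanged
print(board.submit([1] * 16 + [0] * 4))  # large gain: new score released
```

Each submission then leaks at most a coarse signal, so a score-watching strategy that relies on tiny changes between runs gets much less information per submission.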
mikeskim over 9 years ago
As long as you make the public leaderboard set small and the private, one-shot leaderboard set very large, the number of submissions matters very little in the final rankings. The only real issue is hand-labeling the public leaderboard set to augment training data.
king_of_nouns over 9 years ago
Meh... I'm not so sure about this.

Didn't people claim "cheating" back when the first compilers started doing data flow analysis too?