The rules seem pretty clear that consent is required from any person appearing in any external dataset that is used. The winners scraped data from YouTube videos, so I am not sure what the issue is.<p>The more worrying takeaway is that the winners scraped videos from people who clearly had no intention of their videos being used for a deepfake detection algorithm, yet they did not think through the ethical considerations of using that data (did everyone in the video even have a say in the video being uploaded?). I think Kaggle disqualifying the team is the right move (even if it's a painful one for the winners).
IMO the real issue is that Facebook wanted a commercially usable product and assumed Kaggle had all the safeguards for that, but no:<p>Because of GDPR and friends, Facebook can't store those photos, even though their licenses are permissive and respect Kaggle's rules.<p>This clearly shows what Kaggle is: a way to get very cheap, high-quality data-science work. It's not for hiring people, not for truly helping the research community, not for helping people learn. Nope, just cheap workers.<p>It really feels like Facebook has its whole deepfake detection strategy riding here! They put something like $2M on the table to solve an issue that will(?) plague their whole multi-billion-dollar platform.
I think the issue here is that Kaggle's statement that the top teams broke the rules is just very opaque. They stated the teams broke the rules on external data. The article then goes on to talk about what data the teams used, what licenses it carries, and what documentation the teams were asked to provide. But it is almost impossible to know what FB/Kaggle's concerns were without them stating those concerns specifically. Clearly, whatever the issue was, it didn't affect every team, so it may be that details of the licenses the disqualified teams relied on weren't good enough. As I say, though, it's very difficult to tell, and it's hard to think of a reason Facebook would arbitrarily disqualify teams for no good reason. It's perfectly possible FB was concerned about image rights or something else, but people seem perfectly happy just assuming some grand conspiracy.
For those who aren't aware, many Kaggle competitions allow external data (this one did) but require disclosure, and often there is some back-and-forth to clarify the exact details of what is used.<p>In this case the disqualified participants are well respected and haven't previously been involved in any dubious behavior. They properly disclosed what they were doing, and while other clarifications were issued, none stated that personal releases would be required for CC-BY data.<p>Obviously this is a ridiculous requirement. There's no way for that team to obtain such releases, and they did take proper care to use data that Facebook could reasonably use. It's unreasonable for FB/Kaggle to expect participants in a data science competition to somehow know what Facebook's data ethics department is demanding this week beyond what is legally required.
Why would written consent be needed from people appearing in pictures with a CC-BY license? Was this just an overreaction, or is there an actual legal risk for Facebook in using those pictures without additional consent?
As a machine learning researcher, where exactly am I supposed to get a dataset that complies with Facebook's/Kaggle's rules in this case?<p>No one is disputing that the team was disqualified fair and square. But this rule – where you must get consent from every single person appearing in your training data – seems neither standard nor sensible.<p>Firstly, as someone else pointed out, copyright doesn't apply here at all. You can use whatever training data you want as long as your model is sufficiently transformative. OpenAI used terabytes of copyrighted music to train Jukebox; they certainly didn't get a license from every musician.<p>Beyond that – big companies don't play by this rule! If a BigCo wants to train on some data, you can bet they'll use it. When's the last time Google sent you an email asking, "Are you OK with us using your Flickr photos to help improve Google Image Search?"<p>So my question is simple: in the context of this competition, where should I go to get a decent dataset? The winners were disqualified for doing exactly what I would have done. What's the alternative?<p>Also, yes, ethics are a concern. If you're concerned about ethics, <i>aim that concern at big companies</i>, not at small fries like us who are merely trying to win some cash. Again, no one disputes that they were disqualified for valid reasons. But it has nothing to do with ethics and everything to do with the artificial constraints imposed by this competition.
As someone who previously competed on Kaggle, this seems like a reasonable decision. In previous contests it was pretty clear that if you wanted to use third-party data, you should get pre-clearance for it from Kaggle or the contest organizers.<p>The disqualified competitors here seem to have assumed that CC-BY means you can do whatever you want with the data, when actually that's far from true. CC-BY covers only copyright and doesn't address other rights (e.g., model releases, GDPR, etc.).
> and each individual participant further waives all rights to have damages multiplied or increased.<p>What about divided? By a fraction? :trollface: Does that fall under "increased"?
This competition should not be about scraping and tagging skills (impressive as those may be).<p>So maybe they'll get to win on the lack of clarity in the specifications, but that would be unfortunate.
It's unfortunate that the title leads with the "backlash" to the thing that happened rather than the thing itself ("Kaggle disqualifies participants over usage of external data"). This framing suggests that the decision about the case has already been made by a plurality, and with fervor. In reality, this article is the first time many HN readers are learning about this at all.<p>While I'm sure it's accidental in this case, I see this all over the news and suspect attempts to steer public opinion by condemning people or institutions in the headline before the news is actually reported. A forum of independent thinkers should insist on not having the news presented to them in a potentially manipulative manner.
One possible reason for the "no external datasets" rule might be that the data Facebook uses to judge the competition is also drawn from the same publicly available sources. If that's the case, then anyone training on those same datasets would effectively have trained on the test set, so to speak, which obviously would not lead to good outcomes when the model is run against future data.
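If that kind of train/test leakage is the concern, it is at least mechanically checkable. Here is a minimal Python sketch (the directory paths are hypothetical, not anything from the competition) of how an organizer might flag exact-duplicate overlap between a team's disclosed external dataset and a private hold-out set; a real pipeline would also need perceptual hashing to catch re-encoded or cropped copies of the same video.<p>

    import hashlib
    from pathlib import Path

    def file_hashes(directory):
        """Return the SHA-256 hash of every file under `directory`."""
        return {
            hashlib.sha256(p.read_bytes()).hexdigest()
            for p in Path(directory).rglob("*")
            if p.is_file()
        }

    # Hypothetical paths: the external data a team disclosed,
    # and the organizer's private hold-out set.
    external = file_hashes("external_dataset/")
    holdout = file_hashes("private_test_set/")

    # Any non-empty intersection means the team has (perhaps
    # inadvertently) trained on files used for evaluation.
    overlap = external & holdout
    print(f"{len(overlap)} files appear in both sets")

<p>Exact hashing only catches byte-identical files, which is why leakage of this sort is hard to rule out from the organizer's side and why a blanket "no external datasets" rule is the simpler safeguard.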