As a machine learning researcher, where exactly am I supposed to get a dataset that complies with Facebook's/Kaggle's rules in this case?<p>No one is disputing that the team was disqualified fair and square. But this rule – where you must get consent from every single person appearing in your training data – seems neither standard nor sensible.<p>Firstly, as someone else pointed out, copyright doesn't apply here at all. You can use whatever training data you want as long as your model is sufficiently transformative. OpenAI used terabytes of copyrighted music to train OpenAI Jukebox; they certainly didn't get a license from every musician.<p>Beyond that – big companies don't play by this rule! If a BigCo wants to train on some data, you can bet they'll use it. When's the last time Google sent you an email like "Are you OK with us using your Flickr photos to help improve Google Image Search?"<p>So my question is simple: in the context of this competition, where should I go to get a decent dataset? The winners were disqualified for doing exactly what I would have done. What's the alternative?<p>Also, yes, ethics are a concern. But if you're concerned about ethics, <i>aim that concern at big companies</i>, not at us small fries who are merely trying to win some cash. Again, no one disputes that they were disqualified for valid reasons. But the disqualification has nothing to do with ethics and everything to do with the artificial constraints imposed by this competition.