A few services are popping up that aim to provide data repositories for analysis/ML (Kaggle, data.world, /r/datasets).<p>As someone who likes making analyses from random datasets, I have a few issues with these types of services:<p>1) There is often no indication of the distribution rights of the data, or of whether the data was obtained ethically from the source (i.e. following the ToS). I made this mistake when I used an OKCupid dataset released on an Open Data Repository; it turned out the data had been scraped with a logged-in account, and the dataset was taken down via a DMCA takedown.<p>2) There is no indication of the <i>quality</i> of the data, and as a result, cleaning the data for accuracy may take an absurd amount of time. Some datasets may not be salvageable.<p>3) Bandwidth. Good datasets tend to be large, since more data makes for better models, and these sites may not be able to serve that much data. (BigQuery public datasets solve this problem, however.)