I actually used GH Archive to mine GitHub data! Two notes:<p>- The easiest way to access the data is through Google Cloud Platform -> BigQuery -> githubarchive. Google lets you run SQL queries over 1 TB of the data for free, so you can filter or aggregate down to the data you want, then download it.<p>- This is the sad part: GitHub data is notoriously noisy, and not really valuable for data mining. [1] My work was on predicting GitHub collaborator skill from open-source collaboration data. Filtering out bots, and people who use GitHub like a version of Google Drive, was very difficult.<p>[1]: <a href="https://kblincoe.github.io/publications/2014_MSR_Promises_Perils.pdf" rel="nofollow">https://kblincoe.github.io/publications/2014_MSR_Promises_Pe...</a>
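<p>To make the "filter or aggregate, then download" step concrete, here is a minimal sketch that only builds the SQL text for one day of events. It is not the query I actually ran. The helper name build_daily_push_query, the per-day table name githubarchive.day.YYYYMMDD, and the columns type and actor.login are assumptions about the public dataset layout, so check them against the schema in the BigQuery console first. The '[bot]' suffix check is a deliberately naive stand-in for real bot filtering, which takes far more than this.

```python
from datetime import date


def build_daily_push_query(day: date, min_pushes: int = 1) -> str:
    """Return BigQuery standard SQL counting push events per user for one day.

    Assumptions (verify against the dataset schema before running):
      - one table per day, named githubarchive.day.YYYYMMDD
      - columns `type` and `actor.login` exist on that table
    """
    # Hypothetical table naming scheme: one table per calendar day.
    table = f"githubarchive.day.{day:%Y%m%d}"
    return (
        "SELECT actor.login AS login, COUNT(*) AS pushes\n"
        f"FROM `{table}`\n"
        "WHERE type = 'PushEvent'\n"
        # Naive bot filter: only drops logins that end in '[bot]'.
        "  AND NOT ENDS_WITH(actor.login, '[bot]')\n"
        "GROUP BY login\n"
        f"HAVING COUNT(*) >= {int(min_pushes)}\n"
        "ORDER BY pushes DESC"
    )


if __name__ == "__main__":
    # Paste the printed SQL into the BigQuery console, then export the result.
    print(build_daily_push_query(date(2015, 1, 1), min_pushes=5))
```

Aggregating inside BigQuery like this keeps the download small, since only one row per user leaves the warehouse instead of the raw event stream.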