About the data:<p>- DB size: 543 million rows<p>- Data size: 173 GB (uncompressed)<p>- Stored in MySQL<p>- 200+ million tweets from 13+ million users<p>- Collected in 1 week<p>- Operating costs: $100+<p>- Rackspace Cloud - 1 CentOS server with 8 GB RAM<p>- Java, memcache, MySQL, and Perl for core processing<p>- JS and PHP for analytics & visualization<p>* Download the data at this URL:
http://www.archive.org/details/2011-06-calufa-twitter-sql
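A quick back-of-the-envelope check on what the figures above imply. This is pure arithmetic on the stated numbers (200M+ tweets, 13M+ users, 1 week of collection), so the results are lower bounds:

```python
# Rough throughput implied by the stats above.
tweets = 200_000_000          # "200+ Million tweets" (lower bound)
users = 13_000_000            # "13+ Million users" (lower bound)
seconds_per_week = 7 * 24 * 3600

tweets_per_second = tweets / seconds_per_week
tweets_per_user = tweets / users

print(f"~{tweets_per_second:.0f} tweets/sec sustained")  # ~331 tweets/sec
print(f"~{tweets_per_user:.1f} tweets per user")         # ~15.4 tweets/user
```

So the scraper sustained at least ~330 tweets/sec for a week on a single 8 GB server, which helps explain the memcache layer in front of MySQL.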
Twitter changed their ToS to explicitly disallow distributing Twitter dumps like this: <a href="http://chronicle.com/blogs/profhacker/the-end-of-twapperkeeper-and-what-to-do-about-it/31582" rel="nofollow">http://chronicle.com/blogs/profhacker/the-end-of-twapperkeep...</a><p>I was part of the Web Ecology Project (and 140kit.com), both of which gave large Twitter datasets to researchers.
Thanks! I'm more interested in the scraper. Is it open source? If so, where can we download it? If not, can you write about your experience building it?
Neat! Here are some tips for creating a kick-ass graph visualization: <a href="http://www.martinlaprise.info/2010/02/15/visualize-your-own-twitter-graph-part-2/" rel="nofollow">http://www.martinlaprise.info/2010/02/15/visualize-your-own-...</a>
All that is meaningless chatter between people and information about bathroom habits. Perhaps if we pooled that distributed effort into something constructive, the world would be a better place.