Start mining - FREE 100MM tweet db

12 points by calufa about 14 years ago

Sup, I have been collecting tweets for 4 days now, using an app that I have been coding for the last 5 months. The reason I built this app is that I wanted to make user-based recommendations and do other types of data mining using Mahout, and I didn't find enough data for my experiments.

About the app: I am using a single 8GB-RAM CentOS server hosted on the Rackspace cloud, at a cost of less than 15 dollars per day. It can process up to 100 (90 - 105) Twitter profiles per second. It runs with an average of 2GB of RAM and 90% CPU. It's completely fault tolerant. It can process other social networks as well, using a simple parse template.

I was able to collect 90+ million tweets from more than 6 million users (the db has 20MM users) using Java, memcache, MySQL, PHP (visualization), a non-ACID architecture, and an object-like structure (no-sql?).

I hope this dataset helps you get into the big data world.

The current SQL dump is too big (66GB) to put on one of my servers, so please Skype me (calufaxp) or email me at calufa{a}gmail.com if you want the data. BTW, the data is FREE...

If anyone has a server where I can upload this SQL dump and let others download it, let me know.
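For anyone wondering what "user-based recommendations" could look like on follow data like this, here is a minimal Python sketch. It is not the app's code and it does not use Mahout; it just scores accounts by how similar the people who follow them are to you, using Jaccard similarity. The `following` dict, the account names, and the `k` / `top_n` parameters are all made up for illustration.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(following, user, k=2, top_n=3):
    """Suggest accounts for `user`, weighted by what similar users follow."""
    mine = following.get(user, set())
    # Rank every other user by how much their follow list overlaps with ours.
    neighbours = sorted(
        (
            (jaccard(mine, theirs), other)
            for other, theirs in following.items()
            if other != user
        ),
        reverse=True,
    )[:k]
    scores = defaultdict(float)
    for sim, other in neighbours:
        if sim == 0.0:
            continue  # no overlap, so this neighbour tells us nothing
        # Only score accounts the user does not follow already.
        for candidate in following[other] - mine - {user}:
            scores[candidate] += sim
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [candidate for candidate, _ in ranked][:top_n]


# Toy data, invented for this example (not taken from the dump).
following = {
    "alice": {"nasa", "bbc", "github"},
    "bob": {"nasa", "bbc", "wired"},
    "carol": {"nasa", "github", "python"},
    "dave": {"espn"},
}

print(recommend(following, "alice"))  # ['python', 'wired']
```

On real data the follow sets would not fit in a plain dict like this, which is where something like Mahout comes in, but the idea is the same.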

9 comments

calufa almost 14 years ago

Download the db here. Please don't abuse it: http://scramblermedia.com/twitter.sql.gz

calufa almost 14 years ago

BTW, the SQL dump contains:

- bio data (7MM)
- tweets (90MM)
- followers (10MM)
- following (10MM)
- location (7MM)
- profileName (10MM)
- relationships (100MM)
- websites (4.5MM)
- users (20MM)

-- 350+MM rows total --
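If you want a rough idea of what is in a dump this size before importing it, you can stream the gzipped file and count `INSERT` statements per table. This is only a sketch: it assumes a mysqldump-style text file where each statement starts a line with ``INSERT INTO `table` ``, which has not been checked against this particular dump. With extended inserts, one statement can hold many rows, so these are statement counts, not row counts. `count_insert_statements` is a hypothetical helper name.

```python
import gzip
import re
from collections import Counter

# Matches the start of an INSERT statement, with or without backticks
# around the table name. The dump's exact layout is an assumption here.
INSERT_RE = re.compile(r"^INSERT INTO `?(\w+)`?")


def count_insert_statements(path):
    """Count INSERT statements per table in a gzipped SQL dump."""
    counts = Counter()
    # Read line by line so the whole dump never has to fit in memory.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as dump:
        for line in dump:
            match = INSERT_RE.match(line)
            if match:
                counts[match.group(1)] += 1
    return counts
```

Usage would be something like `count_insert_statements("twitter.sql.gz").most_common()`, which lists the tables with the most statements first.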

cstrouse almost 14 years ago

If you upload it to my server I will help you seed it from two locations. Email me for details.

calufa almost 14 years ago

Thanks to Jason for putting this up on the archive.org site: http://www.archive.org/details/2011-05-calufa-twitter-sql

jparicka almost 14 years ago

http://codebiatch.com/ .. the file is still uploading if it's not in there yet. Good luck with your project!

fhsdfh almost 14 years ago

Can someone help a novice and explain what types of things can be achieved with such a dump?

uptown almost 14 years ago

Thanks for the data. Guess it's time to see whether my ISP has a data cap or not.

JoachimSchipper almost 14 years ago

So, you are scraping Twitter (likely violating their ToS) to get users' Tweets (likely violating their copyright) and now posting about it on HN? When Twitter is selling chunks of its stream, e.g. via InfoChimps?

I don't want to be mean, but this doesn't strike me as a very good idea.

mikelbring about 14 years ago

Throw it on a torrent?
