I created a website a couple of years ago (thatstoday.com) where members can add RSS feeds and read them in our "news reader". The years went by and we've indexed all feed articles in our db (and in Lucene).<p>Now I'm not sure what to do with all the data, since the db is getting large and difficult to handle.<p>Should I delete it or create a new service? What do you guys suggest?<p>Thanks,
Rob
> the db is getting large and difficult to handle.<p>Did you store the raw content in the database? If so, you might consider writing files instead. Blobs like that, much like pictures, are better stored as files. In your database, keep the relational data: url, time, adder, etc., properly indexed (probably by adder, time and maybe keywords). Then a 6M-row table is quite a small thing for any RDBMS (as long as your SELECTs filter on indexed columns).
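A minimal sketch of that split, using SQLite and an in-memory table for brevity; the table and column names (`article`, `adder`, `blob`) are hypothetical stand-ins for whatever schema the site actually uses, and the blob directory is an assumption:

```python
import hashlib
import sqlite3
from pathlib import Path

# Raw article bodies live on disk, keyed by content hash; the DB keeps
# only metadata, indexed on the columns the SELECTs filter by.
BLOB_DIR = Path("/tmp/article_blobs")
BLOB_DIR.mkdir(parents=True, exist_ok=True)

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE article (
        id    INTEGER PRIMARY KEY,
        url   TEXT NOT NULL,
        adder INTEGER NOT NULL,   -- user who added the feed
        added TEXT NOT NULL,      -- ISO-8601 timestamp
        blob  TEXT NOT NULL       -- path to the body on disk
    );
    CREATE INDEX idx_article_adder_added ON article (adder, added);
""")

def store(url: str, adder: int, added: str, body: str) -> int:
    """Write the body to a file; keep only its path in the DB."""
    name = hashlib.sha256(body.encode()).hexdigest()
    path = BLOB_DIR / name
    path.write_text(body)
    cur = db.execute(
        "INSERT INTO article (url, adder, added, blob) VALUES (?, ?, ?, ?)",
        (url, adder, added, str(path)),
    )
    return cur.lastrowid

store("http://example.com/feed/1", 42, "2010-06-01T12:00:00", "<p>hello</p>")
row = db.execute("SELECT blob FROM article WHERE adder = ?", (42,)).fetchone()
print(Path(row[0]).read_text())  # body comes back from disk, not the DB
```

The content-hash filename also deduplicates identical bodies for free.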
You've got the basis for a "you-like-this-so-perhaps-you-will-want-to-read-this" recommendation engine. Perform some n-gram analysis on the corpus (mentioned in another reply); do some basic cosine-similarity analysis against the feeds people subscribe to and see what pops up. Try other techniques (from e.g. ICWSM[1]; the last time I did something like this was April 2007); iterate; analyse results; publish.<p>At least you'll have fun (YMMV)...<p>[1] <a href="http://www.icwsm.org/[2007-2011]" rel="nofollow">http://www.icwsm.org/[2007-2011]</a>
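A toy sketch of the n-gram-plus-cosine idea, assuming character trigrams as the features (a real system would use word n-grams or TF-IDF weights); the sample titles are made up:

```python
import math
from collections import Counter

def ngrams(text: str, n: int = 3) -> Counter:
    """Character n-gram counts; crude but language-agnostic."""
    t = text.lower()
    return Counter(t[i:i + n] for i in range(len(t) - n + 1))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Compare an article a user liked against unread candidates.
liked = ngrams("apple announces new iphone model")
candidates = [
    "iphone rumours ahead of apple event",
    "local council debates parking rules",
]
ranked = sorted(candidates, key=lambda c: cosine(liked, ngrams(c)), reverse=True)
print(ranked[0])
```

In practice you'd build one profile vector per subscriber from everything they've read, then rank the incoming articles against it.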
For God's sake don't delete it!
First of all, dump it to files instead of the DB (or use some NoSQL document store, such as MongoDB; the structure of RSS is non-relational anyway, I suppose).
Second of all: is your data clean? If not, you might need to strip boilerplate (such as HTML markup) from it.
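A minimal cleaning pass using only the stdlib `html.parser`, assuming "boilerplate" here means the HTML tags plus script/style blocks (dedicated libraries do a much more thorough job):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of script/style tags."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def strip_html(raw: str) -> str:
    p = TextExtractor()
    p.feed(raw)
    # Normalise runs of whitespace left behind by removed tags.
    return " ".join(" ".join(p.parts).split())

clean = strip_html("<div><script>var x=1;</script><p>Hello, <b>world</b>!</p></div>")
print(clean)
```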
Then you can process it with some tools. There are good NLP toolkits available, such as GATE; have a look at them.
You can do a great deal of things with it:
- detect entities (companies? products?) and do some classification of documents
- detect events (iPhone announcements, etc.)
- if you have the time and date (hopefully you do), you can do trending-topic analysis (what was hot in June 2010)
- you probably can't sell the data itself, as the article content isn't yours, but you may be able to sell derived data (analyses, etc.)
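A bare-bones sketch of the trending-topics idea: bucket titles by month and count words. The mini-corpus and stopword list are made up; real trending detection would compare each month's counts against a baseline rather than take raw frequencies:

```python
from collections import Counter, defaultdict

# Hypothetical (ISO date, title) pairs standing in for the articles table.
articles = [
    ("2010-06-03", "iphone 4 antenna problems reported"),
    ("2010-06-07", "apple unveils iphone 4 at wwdc"),
    ("2010-06-21", "iphone 4 preorders break records"),
    ("2010-07-02", "world cup quarter finals preview"),
]

STOPWORDS = {"at", "the", "a", "of", "in"}

by_month = defaultdict(Counter)
for date, title in articles:
    month = date[:7]  # "YYYY-MM"
    by_month[month].update(w for w in title.split() if w not in STOPWORDS)

# What was hot in June 2010?
print(by_month["2010-06"].most_common(3))
```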
Write an algorithm to match articles from different sources that cover the same story. That way you can auto-hide news the reader has presumably already seen elsewhere. Clustering news from different sources would be a killer feature for me :)
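One cheap way to sketch that matching, assuming Jaccard similarity on title words is a good-enough signal (the threshold and sample titles are invented; production systems would use body text and better features):

```python
def jaccard(a: set, b: set) -> float:
    """Overlap of two word sets, 0.0 to 1.0."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(titles, threshold=0.4):
    """Greedy single-pass clustering: attach each title to the first
    existing cluster whose representative word set is similar enough."""
    clusters = []  # list of (representative word set, [titles])
    for t in titles:
        words = set(t.lower().split())
        for rep, members in clusters:
            if jaccard(words, rep) >= threshold:
                members.append(t)
                break
        else:
            clusters.append((words, [t]))
    return [members for _, members in clusters]

titles = [
    "Apple unveils iPhone 4 at WWDC keynote",
    "iPhone 4 unveiled by Apple at WWDC",
    "Oil spill reaches Gulf coast beaches",
]
groups = cluster(titles)
for group in groups:
    print(group)
```

Anything in the same group as an article the user has read can be auto-hidden.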
Depending on how far back the data goes, you could try to spot language trends through time, or make a pretty graph of the average article length through time. I'm not that inspired.<p>I hope this gets a follow-up; I'm sure someone can think of something awesome to do with it.<p>If you decide to open-source it (copyrights?), use BitTorrent: it's the perfect tool for the job.
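The average-length-over-time graph is a one-pass aggregation; a sketch with made-up (year, word count) samples and a crude text-mode bar chart:

```python
from collections import defaultdict

# Hypothetical (year, word_count) pairs extracted from the archive.
samples = [(2008, 420), (2008, 380), (2009, 510), (2009, 490), (2010, 600)]

totals = defaultdict(lambda: [0, 0])  # year -> [sum of words, article count]
for year, words in samples:
    totals[year][0] += words
    totals[year][1] += 1

for year in sorted(totals):
    s, n = totals[year]
    avg = s / n
    print(f"{year}: {'#' * int(avg // 50)} {avg:.0f} words")
```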
I suggest you change your privacy policy (I'm sure you have one) to tell users you will open the database to researchers, then run the open database as a donation-based service. That way you can cover hosting costs and might earn some extra money.
Donate it to <a href="http://Archive.org" rel="nofollow">http://Archive.org</a>. They can reconstruct disappeared web sites from it, preserving the web's past.