I have 6.000.000 RSS articles in a DB. What can I do?

30 points by robertpohl almost 13 years ago
I created a website a couple of years ago (thatstoday.com) where members can add RSS feeds and read them in our "news reader". The years went by and we've indexed all feed articles in our db (and in Lucene).

Now I'm not sure what to do with all the data, since the db is getting large and difficult to handle.

Should I delete it or create a new service? What do you guys suggest?

Thanks, Rob

19 comments

gbog almost 13 years ago
> the db is getting large and difficult to handle.

Did you store the raw content in the database? If so, you might consider writing files instead. These blobs, like pictures, are better stored as files. In your database, you should keep the relational data: url, time, adder, etc., properly indexed (probably by adder, time and maybe keywords). Then a 6M-row table is quite a small thing for any RDBMS (if your SELECTs are filtering on indexed columns).
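A minimal sketch of that split, assuming SQLite and a local articles/ directory (both illustrative choices, not anything the original service necessarily uses): raw bodies go to files, only the indexed metadata stays in the table.

    import hashlib
    import pathlib
    import sqlite3

    BLOB_DIR = pathlib.Path("articles")        # hypothetical directory for raw content
    BLOB_DIR.mkdir(exist_ok=True)

    conn = sqlite3.connect("feeds.db")         # hypothetical database file
    conn.execute("""
        CREATE TABLE IF NOT EXISTS article (
            id        INTEGER PRIMARY KEY,
            url       TEXT UNIQUE,
            added_by  TEXT,                     -- the "adder"
            added_at  TEXT,                     -- ISO timestamp
            blob_path TEXT                      -- where the raw content lives on disk
        )""")
    # Index the columns the SELECTs actually filter on.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_article_adder ON article(added_by)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_article_time  ON article(added_at)")

    def store(url, added_by, added_at, raw_html):
        path = BLOB_DIR / (hashlib.sha1(url.encode()).hexdigest() + ".html")
        path.write_text(raw_html, encoding="utf-8")
        conn.execute(
            "INSERT OR IGNORE INTO article (url, added_by, added_at, blob_path) VALUES (?, ?, ?, ?)",
            (url, added_by, added_at, str(path)))
        conn.commit()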
bulte-rs almost 13 years ago
You've got the basis for a "you-like-this-so-perhaps-you-will-want-to-read-this" recommendation engine. Perform some n-gram analysis on the corpus (as mentioned in another reply); do some basic cosine-similarity analysis with the feeds people subscribe to and see what pops up. Try other techniques (e.g. from ICWSM[1]; the last time I did something like this was April 2007); iterate; analyse results; publish.

At least you'll have fun (YMMV)...

[1] http://www.icwsm.org/[2007-2011]
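A rough sketch of the n-gram/cosine-similarity step in plain Python, with made-up article text; at 6M articles you'd want TF-IDF weighting and a real vectorizer, but the shape of the computation is the same.

    import math
    import re
    from collections import Counter

    def ngrams(text, n=2):
        words = re.findall(r"[a-z']+", text.lower())
        return Counter(zip(*(words[i:] for i in range(n))))

    def cosine(a, b):
        dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    # Hypothetical data: what a member already reads vs. candidate articles.
    profile = ngrams("apple shows off the new iphone camera at its keynote")
    candidates = {
        "art-1": ngrams("leaked photos of the new iphone camera surface online"),
        "art-2": ngrams("gardening tips for a small balcony"),
    }
    for art_id, vec in sorted(candidates.items(), key=lambda kv: cosine(profile, kv[1]), reverse=True):
        print(art_id, round(cosine(profile, vec), 3))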
xSwag almost 13 years ago
You should dump a copy of the database somewhere so we can all take a look at it and perhaps analyse it.
drewcrawford almost 13 years ago
There's no e-mail address in your profile. Please contact me at the e-mail address in mine.
huragok almost 13 years ago
Sell it or lease it. I'm sure there's some value in the aggregation of feeds (though I can't imagine what besides user habits).
haddr almost 13 years ago
For God's sake don't delete it! First of all, dump it to files instead of the DB (or use some NoSQL document storage, such as MongoDB; the structure of RSS is actually non-relational, I suppose). Second of all: is your data clean? If not, you might need to strip any boilerplate (such as HTML code). Then you can process it with some tools. There are some good NLP tools available, such as GATE; you may have a look at them. You can do a great deal of things there:

- detect some entities (companies? products?) and do some classification of documents
- you can detect some events (iPhone announcements, etc.)
- if you have time & date (hope you have) you can do some trending topics analysis (what was hot in June 2010)
- probably you can't sell the data as the content of articles is not yours, but you may sell some derived data (analysis, etc.)
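As one concrete sketch of the trending-topics idea: strip markup, tokenize, and count the most frequent words per month. The table and column names are hypothetical, and a real toolkit like GATE would replace the naive tokenizer.

    import re
    import sqlite3
    from collections import Counter, defaultdict

    STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "on", "is", "it", "that"}

    def strip_html(raw):
        return re.sub(r"<[^>]+>", " ", raw)        # crude boilerplate/markup removal

    conn = sqlite3.connect("feeds.db")             # hypothetical database file
    by_month = defaultdict(Counter)
    for added_at, raw in conn.execute("SELECT added_at, content FROM article_content"):
        month = added_at[:7]                       # "2010-06" from an ISO timestamp
        words = re.findall(r"[a-z']{3,}", strip_html(raw).lower())
        by_month[month].update(w for w in words if w not in STOPWORDS)

    # What was hot in June 2010?
    print(by_month["2010-06"].most_common(10))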
raverbashing almost 13 years ago
Natural language processing, n-gram studies. Or just some kind of webservice where people can look this up.
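If the "look this up" route sounds appealing, a minimal lookup endpoint could be as small as this Flask sketch; Flask and the search_articles placeholder are illustrative, not part of the existing site, and in practice the query would go to the Lucene index already in place.

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    def search_articles(query):
        # Placeholder: a real implementation would query the existing Lucene index.
        return [{"url": "http://example.com/article", "title": "example", "query": query}]

    @app.route("/search")
    def search():
        return jsonify(search_articles(request.args.get("q", "")))

    if __name__ == "__main__":
        app.run(port=8000)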
spobo almost 13 years ago
Write an algorithm to match articles from different sources that are about the same story. That way you can autohide news that you've supposedly already read from another source. Clustering news from different sources would be a killer feature for me :)
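A back-of-the-envelope version of that matching, using Jaccard similarity on title words; the threshold and sample headlines are made up, and a real pass over the corpus would use the article bodies and something like minhash or TF-IDF.

    import re

    def tokens(title):
        return set(re.findall(r"[a-z']+", title.lower()))

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    def same_story(title_a, title_b, threshold=0.4):
        return jaccard(tokens(title_a), tokens(title_b)) >= threshold

    # Hypothetical headlines from two different feeds.
    print(same_story("Apple unveils the new iPhone at WWDC",
                     "New iPhone unveiled by Apple at WWDC keynote"))  # True
    print(same_story("Apple unveils the new iPhone at WWDC",
                     "SpaceX launches cargo mission to the ISS"))      # False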
waxjar almost 13 years ago
Depending on how far back the data goes, you could try to spot language trends through time or make a pretty graph of the average article length through time. I'm not that inspiring.

I hope this'll get a follow-up; I'm sure someone can think of something awesome to do with it.

If you decide to open source it (copyrights?), use BitTorrent: it's the perfect tool for this job.
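The average-length-over-time graph is a single aggregation pass, roughly like this; the table and column names are hypothetical, and plotting is left to whatever tool is handy.

    import sqlite3
    from collections import defaultdict

    conn = sqlite3.connect("feeds.db")                  # hypothetical database file
    totals = defaultdict(lambda: [0, 0])                # year -> [total words, article count]
    for added_at, body in conn.execute("SELECT added_at, content FROM article_content"):
        year = added_at[:4]
        totals[year][0] += len(body.split())
        totals[year][1] += 1

    for year in sorted(totals):
        words, count = totals[year]
        print(year, round(words / count, 1), "avg words per article")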
kngspook almost 13 years ago
How big is the DB dump?

I know I would be interested in downloading it, and just poking it for interesting stats...
ramigb almost 13 years ago
I suggest you change your privacy policy (I am sure you have one) and tell the users that you will open the database for researchers, then make the open database a donation-based service. This way you can pay to host it and might get some good extra money.
bromagosa almost 13 years ago
Are they all open? If so, I'd contact Wikimedia and see what use they can give them.
rocky1138 almost 13 years ago
Do you have the rights to distribute the articles in your database?
evanwolf almost 13 years ago
Donate it to http://Archive.org. They can reconstruct disappeared web sites from it, preserving the web's past.
tzaman almost 13 years ago
If you think it could be useful to someone, try to sell it
ahmedaly almost 13 years ago
Maybe I can offer you hosting for free, if this is your problem. Please email me at ahmed(at)svwebdev.com
gauravvijay almost 13 years ago
I can sponsor the S3 storage but with limited IO
aw4y almost 13 years ago
make it open!
lucamartinetti almost 13 years ago
A bunch of interesting things. It is a nice NLP corpus. Put a dump on S3 and make it public.
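Publishing a dump that way is a couple of calls with boto3; the bucket name and file are placeholders, the dump should be compressed first, and public access also depends on the bucket's policy allowing it.

    import boto3

    s3 = boto3.client("s3")
    # Hypothetical bucket and dump file.
    s3.upload_file("rss_articles.sql.gz", "my-public-dumps", "rss_articles.sql.gz",
                   ExtraArgs={"ACL": "public-read"})
    print("https://my-public-dumps.s3.amazonaws.com/rss_articles.sql.gz")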