I have 6.000.000 RSS articles in a DB. What can I do?

30 points by robertpohl, almost 13 years ago
I created a website a couple of years ago (thatstoday.com) where members can add RSS feeds and read them in our "news reader". The years went by and we've indexed all feed articles in our db (and in Lucene).

Now I'm not sure what to do with all the data, since the db is getting large and difficult to handle.

Should I delete it or create a new service? What do you guys suggest?

Thanks, Rob

19 comments

gbog, almost 13 years ago
> the db is getting large and difficult to handle.

Did you store the raw content in the database? If so, you might consider writing files instead. These blobs, like pictures, are better stored as files. In your database, you should keep relational data: url, time, adder, etc., properly indexed (probably by adder, time and maybe keywords). Then a 6M-row table is quite a small thing for any RDBMS (if your SELECTs are filtering on indexed columns).
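A minimal sketch of that split, assuming SQLite for the metadata and plain files for the article bodies. The table name, column names, and helper functions are hypothetical and not taken from the original site; any RDBMS with a composite index would work the same way.

```python
import hashlib
import sqlite3
from pathlib import Path


def open_store(db_path: str) -> sqlite3.Connection:
    """Create the metadata table and its index if they do not exist yet."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS articles (
               id INTEGER PRIMARY KEY,
               url TEXT NOT NULL UNIQUE,
               added_by TEXT NOT NULL,
               published_at TEXT NOT NULL,  -- ISO 8601 timestamp
               body_path TEXT NOT NULL      -- where the raw content lives on disk
           )"""
    )
    # Composite index so lookups by adder and time never scan the whole table.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_articles_adder_time "
        "ON articles (added_by, published_at)"
    )
    return conn


def add_article(conn, blob_dir: Path, url, added_by, published_at, body: str) -> Path:
    """Write the body to a file named after the URL hash; store only metadata."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    # Two-level fan-out keeps any single directory from holding millions of files.
    path = blob_dir / digest[:2] / f"{digest}.html"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(body, encoding="utf-8")
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO articles (url, added_by, published_at, body_path) "
            "VALUES (?, ?, ?, ?)",
            (url, added_by, published_at, str(path)),
        )
    return path


def articles_by(conn, added_by, since):
    """Return (url, body_path) rows for one adder, filtered on the indexed columns."""
    return conn.execute(
        "SELECT url, body_path FROM articles "
        "WHERE added_by = ? AND published_at >= ? ORDER BY published_at",
        (added_by, since),
    ).fetchall()
```

With the bodies out of the table, each row is a few hundred bytes, so backups and schema changes stay cheap even at several million rows.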
bulte-rs, almost 13 years ago
You've got the basis for a "you-like-this-so-perhaps-you-will-want-to-read-this" recommendation engine. Perform some n-gram analysis on the corpus (mentioned in another reply); do some basic cosine-similarity analysis with the feeds people subscribe to and see what pops up. Try other techniques (e.g. from ICWSM [1]); iterate; analyse results; publish. (The last time I did something like this was April 2007.)

At least you'll have fun (YMMV)...

[1] http://www.icwsm.org/[2007-2011]
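A rough sketch of the cosine-similarity idea, assuming a plain bag-of-words profile built from the text of a user's subscribed feeds. The function names and the `recommend` interface are made up for illustration; a real system would likely add TF-IDF weighting and stopword removal.

```python
import math
import re
from collections import Counter


def term_counts(text: str) -> Counter:
    """Lowercase the text and count word tokens (a crude bag-of-words vector)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(subscribed_texts, candidates, top_n=3):
    """Rank candidate articles (title -> text) by similarity to the user's profile."""
    profile = Counter()
    for text in subscribed_texts:
        profile.update(term_counts(text))
    scored = [
        (cosine_similarity(profile, term_counts(text)), title)
        for title, text in candidates.items()
    ]
    scored.sort(reverse=True)
    # Drop candidates that share no terms with the profile at all.
    return [title for score, title in scored[:top_n] if score > 0]
```

Calling `recommend(texts_from_my_feeds, {"title": "article text", ...})` returns the titles whose vocabulary overlaps most with what the user already reads.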
xSwag, almost 13 years ago
You should dump a copy of the database somewhere so we can all take a look at it and perhaps analyse it.
drewcrawford, almost 13 years ago
There's no e-mail address in your profile. Please contact me at the e-mail address in mine.
huragok, almost 13 years ago
Sell it or lease it. I'm sure there's some value in the aggregation of feeds (though I can't imagine what besides user habits).
haddr, almost 13 years ago
For God's sake don't delete it!

First of all, dump it to files instead of the DB. (Or use some NoSQL document store, such as MongoDB. The structure of RSS is actually non-relational, I suppose.)

Second of all: is your data clean? If not, you might need to clean out any boilerplate (such as HTML code). Then you can process it with some tools. There are some good NLP tools available, such as GATE; you may want to have a look at them. You can do a great deal of things there:
- detect some entities (companies? products?) and do some classification of documents
- detect some events (iPhone announcements, etc.)
- if you have time & date (hope you have), do some trending topics analysis (what was hot in June 2010)
- you probably can't sell the data, as the content of the articles is not yours, but you may sell some derived data (analysis, etc.)
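The cleaning and trending steps above can be sketched with the Python standard library alone. This assumes each article is available as an ISO date string plus an HTML body; `strip_html` and `monthly_terms` are hypothetical helper names, and this is plain term counting, not the entity or event detection a tool like GATE would provide.

```python
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(markup: str) -> str:
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def monthly_terms(articles, stopwords=frozenset()):
    """Map 'YYYY-MM' -> Counter of terms, given (iso_date, html) pairs."""
    by_month = defaultdict(Counter)
    for iso_date, markup in articles:
        words = re.findall(r"[a-z0-9]+", strip_html(markup).lower())
        by_month[iso_date[:7]].update(w for w in words if w not in stopwords)
    return by_month
```

Something like `monthly_terms(rows)["2010-06"].most_common(20)` would then give a first answer to "what was hot in June 2010", provided a reasonable stopword set is passed in.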
raverbashing, almost 13 years ago
Natural language processing, n-gram studies. Or just some kind of web service where people can look this up.
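For anyone unfamiliar with n-gram studies, here is a small sketch of counting word n-grams across a set of article texts. The tokenizer is deliberately naive and the function names are hypothetical.

```python
import re
from collections import Counter


def ngrams(text: str, n: int):
    """Yield word n-grams as tuples, e.g. n=2 gives bigrams."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n])


def top_ngrams(articles, n=2, k=10):
    """Count n-grams across an iterable of article texts; return the k most common."""
    counts = Counter()
    for text in articles:
        counts.update(ngrams(text, n))
    return counts.most_common(k)
```

Because `top_ngrams` accepts any iterable, the texts can be streamed from disk or a database cursor rather than loaded into memory all at once.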
spobo, almost 13 years ago
Write an algorithm to match articles from different sources that are about the same story. That way you can autohide news that you've supposedly already read from another source. Clustering news from different sources would be a killer feature for me :)
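One simple way to approach this, sketched under the assumption that articles about the same story share a lot of wording: compare sets of word shingles with Jaccard similarity and group greedily. The names and the 0.3 threshold are illustrative guesses, not tuned values.

```python
import re


def shingles(text: str, size: int = 3) -> set:
    """Set of overlapping word sequences; short texts fall back to a single shingle."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) <= size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_stories(articles, threshold=0.3):
    """Greedy single-pass clustering: join the first cluster whose seed is similar enough.

    `articles` is a list of (article_id, text); returns a list of id lists.
    """
    clusters = []  # each entry: (seed_shingles, [article_id, ...])
    for article_id, text in articles:
        sig = shingles(text)
        for seed, members in clusters:
            if jaccard(sig, seed) >= threshold:
                members.append(article_id)
                break
        else:
            # No existing cluster matched, so this article seeds a new one.
            clusters.append((sig, [article_id]))
    return [members for _, members in clusters]
```

Comparing every article against every cluster seed will not scale to millions of rows as written; restricting comparisons to a time window, or using MinHash with locality-sensitive hashing, would be the usual next step.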
waxjar, almost 13 years ago
Depending on how far back the data goes, you could try to spot language trends through time or make a pretty graph of the average article length through time. I'm not that inspiring.

I hope this'll get a follow-up; I'm sure someone can think of something awesome to do with it.

If you decide to open source it (copyrights?), use BitTorrent: it's the perfect tool for this job.
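The average-length idea is only a few lines, assuming each article comes with an ISO date and its plain text. The function name is hypothetical, and plotting is left to whatever charting tool is at hand.

```python
from collections import defaultdict


def average_length_by_year(articles):
    """Map year -> mean word count, given (iso_date, text) pairs."""
    totals = defaultdict(lambda: [0, 0])  # year -> [word_total, article_count]
    for iso_date, text in articles:
        bucket = totals[iso_date[:4]]
        bucket[0] += len(text.split())
        bucket[1] += 1
    # Sorted so the result can be fed straight into a line chart.
    return {year: words / count for year, (words, count) in sorted(totals.items())}
```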
kngspook, almost 13 years ago
How big is the DB dump?

I know I would be interested in downloading it and just poking at it for interesting stats...
ramigb, almost 13 years ago
I suggest you change your privacy policy (I am sure you have one) and tell the users that you will open the database to researchers. Make the open database a donation-based service; this way you can pay to host it and might get some good extra money.
bromagosa, almost 13 years ago
Are they all open? If so, I'd contact Wikimedia and see what use they can make of them.
rocky1138, almost 13 years ago
Do you have the rights to distribute the articles in your database?
evanwolf, almost 13 years ago
Donate it to http://Archive.org. They can reconstruct disappeared web sites from it, preserving the web's past.
tzaman, almost 13 years ago
If you think it could be useful to someone, try to sell it.
ahmedaly, almost 13 years ago
Maybe I can offer you hosting for free, if that is your problem. Please email me at ahmed(at)svwebdev.com
gauravvijay, almost 13 years ago
I can sponsor the S3 storage, but with limited I/O.
aw4y, almost 13 years ago
Make it open!
lucamartinetti, almost 13 years ago
A bunch of interesting things. It is a nice NLP corpus. Put a dump on S3 and make it public.