科技回声

19 comments

gbog, nearly 13 years ago

> the db is getting large and difficult to handle.

Did you store the raw content in the database? If so, you might consider writing files instead. Blobs like these, as with pictures, are better stored as files. In your database, you should keep the relational data: url, time, adder, etc., properly indexed (probably by adder, time and maybe keywords). Then a 6M-row table is quite a small thing for any RDBMS (as long as your SELECTs filter on indexed columns).
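A minimal sketch of that layout, assuming SQLite and a local directory for the article bodies. The table name, the column names and the helpers `init_db`, `store_article` and `articles_by_adder` are made up for illustration; they are not taken from the original poster's setup.

```python
import hashlib
import sqlite3
from pathlib import Path


def init_db(conn):
    # Metadata only; the article body itself lives on disk.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS articles (
               id INTEGER PRIMARY KEY,
               url TEXT NOT NULL,
               added_at TEXT NOT NULL,   -- ISO 8601 timestamp
               adder TEXT NOT NULL,
               body_path TEXT NOT NULL
           )"""
    )
    # Index the columns that SELECTs are expected to filter on.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_articles_adder_time "
        "ON articles (adder, added_at)"
    )


def store_article(conn, root, url, added_at, adder, body):
    # Name the file after a hash of the URL and shard by prefix,
    # so that no single directory ends up holding millions of files.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = Path(root) / digest[:2] / f"{digest}.html"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(body, encoding="utf-8")
    conn.execute(
        "INSERT INTO articles (url, added_at, adder, body_path) "
        "VALUES (?, ?, ?, ?)",
        (url, added_at, adder, str(path)),
    )
    conn.commit()
    return path


def articles_by_adder(conn, adder, since):
    # Filters on (adder, added_at), which the index above covers.
    rows = conn.execute(
        "SELECT url, added_at, body_path FROM articles "
        "WHERE adder = ? AND added_at >= ? ORDER BY added_at",
        (adder, since),
    )
    return rows.fetchall()
```

With the bodies out of the table, each row is only a few short strings, so queries on the indexed columns never touch the article text.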

bulte-rs, nearly 13 years ago

You've got the basis for a "you-like-this-so-perhaps-you-will-want-to-read-this" recommendation engine. Perform some n-gram analysis on the corpus (mentioned in another reply); do some basic cosine-similarity analysis against the feeds people subscribe to and see what pops up. Try other techniques (from e.g. ICWSM [1]; the last time I did something like this was April 2007); iterate; analyse results; publish.

At least you'll have fun (YMMV)...

[1] http://www.icwsm.org/[2007-2011]
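As a rough illustration of the n-gram plus cosine-similarity idea, here is a sketch that assumes plain word bigrams and raw counts, with no TF-IDF weighting. The function names and the shape of the `candidates` dict are hypothetical.

```python
import math
import re
from collections import Counter


def ngrams(text, n=2):
    # Count word n-grams (bigrams by default) in lowercased text.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(profile_texts, candidates, top_k=3):
    # profile_texts: texts from feeds the user already subscribes to.
    # candidates: {title: text} of articles the user has not seen yet.
    profile = Counter()
    for text in profile_texts:
        profile.update(ngrams(text))
    scored = [(cosine(profile, ngrams(text)), title)
              for title, text in candidates.items()]
    scored.sort(reverse=True)
    return [title for score, title in scored[:top_k] if score > 0]
```

Raw counts let frequent filler phrases dominate the score, so a real attempt would probably add weighting or stop-word removal before trusting the ranking.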

xSwag, nearly 13 years ago

You should dump a copy of the database somewhere so we can all take a look at it and perhaps analyse it.

drewcrawford, nearly 13 years ago

There's no e-mail address in your profile. Please contact me at the e-mail address in mine.

huragok, nearly 13 years ago

Sell it or lease it. I'm sure there's some value in the aggregation of feeds (though I can't imagine what besides user habits).

haddr, nearly 13 years ago

For God's sake, don't delete it!

First of all, dump it to files instead of a DB. (Or use some NoSQL document store, such as MongoDB. The structure of RSS is actually non-relational, I suppose.)

Second of all: is your data clean? If not, you might need to strip any boilerplate (such as HTML code) from it. Then you can process it with some tools. There are good NLP tools available, such as GATE. You may want to have a look at them. You can do a great deal of things there:

- detect some entities (companies? products?) and do some classification of documents
- detect some events (iPhone announcements, etc.)
- if you have time and date (I hope you have), do some trending-topics analysis (what was hot in June 2010)
- you probably can't sell the data, as the content of the articles is not yours, but you may be able to sell some derived data (analysis, etc.)
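The cleaning and trending-topics steps could be sketched roughly like this, using only the Python standard library. It assumes each article is available as an `(iso_date, html)` pair. The helper names and the tiny stop-word list are illustrative, and a toolkit such as GATE would go far beyond this.

```python
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_html(markup):
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    # Collapse runs of whitespace left behind by the removed tags.
    return " ".join(" ".join(parser.parts).split())


# Deliberately tiny, illustrative stop-word list.
STOPWORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"})


def trending_terms(articles, top_k=3):
    # articles: iterable of (iso_date, html); returns {"YYYY-MM": [terms]}.
    counts = defaultdict(Counter)
    for iso_date, markup in articles:
        month = iso_date[:7]
        words = re.findall(r"[a-z0-9]+", strip_html(markup).lower())
        counts[month].update(w for w in words if w not in STOPWORDS)
    return {month: [term for term, _ in counter.most_common(top_k)]
            for month, counter in counts.items()}
```

Raw monthly counts mostly surface words that are always common. To find what was actually "hot", one would likely compare each month against a baseline built from the other months.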

raverbashing, nearly 13 years ago

Natural language processing, n-gram studies. Or just some kind of web service where people can look this up.

spobo, nearly 13 years ago

Write an algorithm to match articles from different sources that are about the same story. That way you can autohide news that you've supposedly already read from another source. Clustering news from different sources would be a killer feature for me :)
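One simple way to approach this kind of matching is to compare word-bigram sets with Jaccard similarity and group articles greedily. This is only a sketch of one possible technique. The 0.3 threshold is an arbitrary assumption and the function names are hypothetical.

```python
import re


def shingles(text, n=2):
    # Set of word n-grams ("shingles") in the lowercased text.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    # Size of the intersection divided by size of the union.
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_stories(articles, threshold=0.3):
    # articles: list of (source, text). Returns clusters as lists of indices.
    clusters = []  # each entry: (shingles of the first member, [indices])
    for i, (_, text) in enumerate(articles):
        current = shingles(text)
        for representative, members in clusters:
            if jaccard(representative, current) >= threshold:
                members.append(i)
                break
        else:
            # No existing cluster is close enough; start a new one.
            clusters.append((current, [i]))
    return [members for _, members in clusters]
```

Every article after the first in a cluster could then be hidden once one of them has been read. The greedy loop compares each article against every cluster, so at 6M articles it would need an index such as MinHash to stay practical.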

waxjar, nearly 13 years ago

Depending on how far back the data goes, you could try to spot language trends through time or make a pretty graph of the average article length through time. I'm not that inspiring.

I hope this'll get a follow-up, I'm sure someone can think of something awesome to do with it.

If you decide to open source it (copyrights?), use BitTorrent: it's the perfect tool for this job.
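The article-length idea needs very little code. Here is a sketch that assumes each article is an `(iso_date, text)` pair and that whitespace-separated words are a good enough length measure. The function name is made up.

```python
from collections import defaultdict


def average_length_by_year(articles):
    # year -> [total word count, number of articles]
    totals = defaultdict(lambda: [0, 0])
    for iso_date, text in articles:
        year = iso_date[:4]
        totals[year][0] += len(text.split())
        totals[year][1] += 1
    return {year: total / count
            for year, (total, count) in sorted(totals.items())}
```

The resulting dict maps each year to a mean word count and can be handed to any plotting library for the graph.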

kngspook, nearly 13 years ago

How big is the DB dump?

I know I would be interested in downloading it, and just poking at it for interesting stats...

ramigb, nearly 13 years ago

I suggest you change your privacy policy (I am sure you have one) and tell the users that you will open the database to researchers. Then make the open database a donation-based service; this way you can pay to host it and might make some good extra money.

bromagosa, nearly 13 years ago

Are they all open? If so, I'd contact Wikimedia and see what use they can make of them.

rocky1138, nearly 13 years ago

Do you have the rights to distribute the articles in your database?

evanwolf, nearly 13 years ago

Donate it to http://Archive.org. They can reconstruct disappeared web sites from it, preserving the web's past.

tzaman, nearly 13 years ago

If you think it could be useful to someone, try to sell it

ahmedaly, nearly 13 years ago

Maybe I can offer you hosting for free, if that is your problem. Please email me at ahmed(at)svwebdev.com

gauravvijay, nearly 13 years ago

I can sponsor the S3 storage but with limited IO

aw4y, nearly 13 years ago

make it open!

lucamartinetti, nearly 13 years ago

A bunch of interesting things. It is a nice NLP corpus. Put a dump on S3 and make it public.

I have 6.000.000 RSS articles in a DB. What can I do?
