I created a website a couple of years ago (thatstoday.com) where members can add RSS feeds and read them in our "news reader". The years went by and we've indexed all feed articles in our db (and in Lucene).<p>Now I'm not sure what to do with all the data, since the db is getting large and difficult to handle.<p>Should I delete it or create a new service? What do you guys suggest?<p>Thanks,
Rob
> the db is getting large and difficult to handle.<p>Did you store the raw content in the database? If so, you might consider writing files instead. Blobs like that, much like pictures, are better stored as files. In your database, keep the relational data: url, time, adder, etc., properly indexed (probably by adder, time and maybe keywords). Then a 6M-row table is quite a small thing for any RDBMS (as long as your SELECTs filter on indexed columns).
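A minimal sketch of that split, using SQLite and an in-memory table for brevity; the table and column names (`article`, `adder`, `blob`) are hypothetical stand-ins for whatever schema the site actually uses, and the blob directory is an assumption:

```python
import hashlib
import sqlite3
from pathlib import Path

# Raw article bodies live on disk, keyed by content hash; the DB keeps
# only metadata, indexed on the columns the SELECTs filter by.
BLOB_DIR = Path("/tmp/article_blobs")
BLOB_DIR.mkdir(parents=True, exist_ok=True)

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE article (
        id    INTEGER PRIMARY KEY,
        url   TEXT NOT NULL,
        adder INTEGER NOT NULL,   -- user who added the feed
        added TEXT NOT NULL,      -- ISO-8601 timestamp
        blob  TEXT NOT NULL       -- path to the body on disk
    );
    CREATE INDEX idx_article_adder_added ON article (adder, added);
""")

def store(url: str, adder: int, added: str, body: str) -> int:
    """Write the body to a file; keep only its path in the DB."""
    name = hashlib.sha256(body.encode()).hexdigest()
    path = BLOB_DIR / name
    path.write_text(body)
    cur = db.execute(
        "INSERT INTO article (url, adder, added, blob) VALUES (?, ?, ?, ?)",
        (url, adder, added, str(path)),
    )
    return cur.lastrowid

store("http://example.com/feed/1", 42, "2010-06-01T12:00:00", "<p>hello</p>")
row = db.execute("SELECT blob FROM article WHERE adder = ?", (42,)).fetchone()
print(Path(row[0]).read_text())  # body comes back from disk, not the DB
```

The content-hash filename also deduplicates identical bodies for free.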
You've got the basis for a "you-like-this-so-perhaps-you-will-want-to-read-this" recommendation engine. Perform some n-gram analysis on the corpus (mentioned in another reply); do some basic cosine-similarity analysis against the feeds people subscribe to and see what pops up. Try other techniques (from e.g. ICWSM[1]; the last time I did something like this was April 2007); iterate; analyse results; publish.<p>At least you'll have fun (YMMV)...<p>[1] <a href="http://www.icwsm.org/[2007-2011]" rel="nofollow">http://www.icwsm.org/[2007-2011]</a>
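A toy sketch of the n-gram-plus-cosine idea, assuming character trigrams as the features (a real system would use word n-grams or TF-IDF weights); the sample titles are made up:

```python
import math
from collections import Counter

def ngrams(text: str, n: int = 3) -> Counter:
    """Character n-gram counts; crude but language-agnostic."""
    t = text.lower()
    return Counter(t[i:i + n] for i in range(len(t) - n + 1))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Compare an article a user liked against unread candidates.
liked = ngrams("apple announces new iphone model")
candidates = [
    "iphone rumours ahead of apple event",
    "local council debates parking rules",
]
ranked = sorted(candidates, key=lambda c: cosine(liked, ngrams(c)), reverse=True)
print(ranked[0])
```

In practice you'd build one profile vector per subscriber from everything they've read, then rank the incoming articles against it.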
For God's sake don't delete it!
First of all, dump it to files instead of the DB (or use some NoSQL document store, such as MongoDB; the structure of RSS is non-relational anyway, I suppose).
Second of all: is your data clean? If not, you might need to strip boilerplate (such as HTML markup) from it.
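A minimal cleaning pass using only the stdlib `html.parser`, assuming "boilerplate" here means the HTML tags plus script/style blocks (dedicated libraries do a much more thorough job):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of script/style tags."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def strip_html(raw: str) -> str:
    p = TextExtractor()
    p.feed(raw)
    # Normalise runs of whitespace left behind by removed tags.
    return " ".join(" ".join(p.parts).split())

clean = strip_html("<div><script>var x=1;</script><p>Hello, <b>world</b>!</p></div>")
print(clean)
```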
Then you can process it with some tools. There are good NLP toolkits available, such as GATE; have a look at them.
You can do a great deal of things with it:
- detect entities (companies? products?) and do some classification of documents
- detect events (iPhone announcements, etc.)
- if you have the time and date (hopefully you do), you can do trending-topic analysis (what was hot in June 2010)
- you probably can't sell the data itself, as the article content isn't yours, but you may be able to sell derived data (analyses, etc.)
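A bare-bones sketch of the trending-topics idea: bucket titles by month and count words. The mini-corpus and stopword list are made up; real trending detection would compare each month's counts against a baseline rather than take raw frequencies:

```python
from collections import Counter, defaultdict

# Hypothetical (ISO date, title) pairs standing in for the articles table.
articles = [
    ("2010-06-03", "iphone 4 antenna problems reported"),
    ("2010-06-07", "apple unveils iphone 4 at wwdc"),
    ("2010-06-21", "iphone 4 preorders break records"),
    ("2010-07-02", "world cup quarter finals preview"),
]

STOPWORDS = {"at", "the", "a", "of", "in"}

by_month = defaultdict(Counter)
for date, title in articles:
    month = date[:7]  # "YYYY-MM"
    by_month[month].update(w for w in title.split() if w not in STOPWORDS)

# What was hot in June 2010?
print(by_month["2010-06"].most_common(3))
```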
Write an algorithm to match articles from different sources that cover the same story. That way you can auto-hide news the reader has presumably already seen elsewhere. Clustering news from different sources would be a killer feature for me :)
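One cheap way to sketch that matching, assuming Jaccard similarity on title words is a good-enough signal (the threshold and sample titles are invented; production systems would use body text and better features):

```python
def jaccard(a: set, b: set) -> float:
    """Overlap of two word sets, 0.0 to 1.0."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(titles, threshold=0.4):
    """Greedy single-pass clustering: attach each title to the first
    existing cluster whose representative word set is similar enough."""
    clusters = []  # list of (representative word set, [titles])
    for t in titles:
        words = set(t.lower().split())
        for rep, members in clusters:
            if jaccard(words, rep) >= threshold:
                members.append(t)
                break
        else:
            clusters.append((words, [t]))
    return [members for _, members in clusters]

titles = [
    "Apple unveils iPhone 4 at WWDC keynote",
    "iPhone 4 unveiled by Apple at WWDC",
    "Oil spill reaches Gulf coast beaches",
]
groups = cluster(titles)
for group in groups:
    print(group)
```

Anything in the same group as an article the user has read can be auto-hidden.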
Depending on how far back the data goes, you could try to spot language trends through time, or make a pretty graph of the average article length through time. I'm not that inspired.<p>I hope this gets a follow-up; I'm sure someone can think of something awesome to do with it.<p>If you decide to open-source it (copyrights?), use BitTorrent: it's the perfect tool for the job.
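The average-length-over-time graph is a one-pass aggregation; a sketch with made-up (year, word count) samples and a crude text-mode bar chart:

```python
from collections import defaultdict

# Hypothetical (year, word_count) pairs extracted from the archive.
samples = [(2008, 420), (2008, 380), (2009, 510), (2009, 490), (2010, 600)]

totals = defaultdict(lambda: [0, 0])  # year -> [sum of words, article count]
for year, words in samples:
    totals[year][0] += words
    totals[year][1] += 1

for year in sorted(totals):
    s, n = totals[year]
    avg = s / n
    print(f"{year}: {'#' * int(avg // 50)} {avg:.0f} words")
```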
I suggest you change your privacy policy (I'm sure you have one) to tell users you will open the database to researchers, then run the open database as a donation-based service. That way you can cover hosting costs and might earn some extra money.
Donate it to <a href="http://Archive.org" rel="nofollow">http://Archive.org</a>. They can reconstruct disappeared web sites from it, preserving the web's past.