科技回声

9 comments

billturner, over 16 years ago

Look at Sam Ruby's Venus (which I've used): <a href="http://intertwingly.net/code/venus/" rel="nofollow">http://intertwingly.net/code/venus/</a> (Python). Or his Mars version (which I haven't used): <a href="http://intertwingly.net/code/mars/" rel="nofollow">http://intertwingly.net/code/mars/</a> (written in Ruby, and newer).

petercooper, over 16 years ago

Disclaimer: I built, ran (for two years), and sold a Web app that processed tens of thousands of feeds each hour and distributed summaries based on those feeds hundreds of millions of times per month.

That out of the way, the difficulty varies with the scale somewhat. At large scale, you run into all sorts of issues, including arbitrary blocks from feed providers, database locking, etc. If you're really just doing "100s" on an "hourly" basis, hopefully you'll stay well under that level, but if you think it'll need to scale up quickly, the decisions you make now will need to be different than if it's going to stay small.

I can't provide any code here, just some quick pointers.

Our crawler (which is still running under the new owner) was entirely custom and written in Ruby. It performed very well. Instead of using a specific feed parsing library, it uses Hpricot (the Ruby library) and a set of custom-built rules for parsing RSS and Atom. The reason for this is that we wanted speed, reliability (no shifting libraries), and it HAD to work with invalid (and even extremely broken) feeds. Many "proper" RSS and Atom parsers have issues with busted feeds. Put it this way, though: Ruby is definitely up to the task, as long as you rely on a parsing library (Hpricot, in this case) and aren't just using regular expressions or something ;-)

One nasty thing you'll need to deal with is knowing whether items in feeds are new or not. You could delete all items associated with a feed before processing that feed each time.. but what if you want to keep an archive of older items? What if you need to maintain database performance? How are you going to track what's new, what was deleted, etc?

I used a hash that was either based on each item's GUID and the feed's ID OR (if no GUID present) the item's link and title. Unfortunately this was not failsafe. If someone changed the description of an item, the change wouldn't get picked up! And.. not all feeds use GUIDs, and some feeds have GUIDs that change when descriptions change.. some don't :)

Feed formats are really, really dirty, despite being specified officially. All sorts of nasty publishing systems are mangling the formats and you need to be able to deal with it. These are issues that go far beyond choosing a feed parsing library; it's about the organization of items. You need to do a lot of sanitizing to be 100% effective. You'll find feeds that use wholly inappropriate date / time formats, and the content provider will not care. You need to be able to deal with that. Oh, and watch out for feeds that have wacky dates way into the future.. which can then end up "stuck" at the top of your list of items if you're ordering by date ;-)

This all just scratches the surface of how tricky it is. I was doing it full-time for over two years and even now I feel I've only seen half the picture. You either strive for 100% effectiveness of processing and parsing feeds and drive yourself nuts, or settle for 90% and sleep at night ;-)
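For anyone who wants the item-tracking idea above in concrete form, here is a minimal Python sketch. It is not the original crawler's code (that was Ruby, and the comment does not share it). The identity key follows the scheme described: GUID plus feed ID, else link plus title. Everything else is an assumption added for illustration: the function names `item_key`, `content_hash`, `classify`, and `effective_date`, the separate description hash used to notice edited items, the in-memory dict standing in for a database table, and the 24-hour tolerance for future dates.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def _sha1(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def item_key(feed_id, guid=None, link="", title=""):
    """Identity hash: GUID + feed ID, else (no GUID) link + title."""
    if guid:
        return _sha1(f"guid\x1f{feed_id}\x1f{guid}")
    return _sha1(f"link\x1f{link}\x1f{title}")


def content_hash(description):
    """Hypothetical second hash, used only to notice edited descriptions."""
    return _sha1(description or "")


def classify(store, key, digest):
    """Record the item in `store` (key -> content hash) and report its status."""
    previous = store.get(key)
    store[key] = digest
    if previous is None:
        return "new"
    return "unchanged" if previous == digest else "changed"


def effective_date(published, fetched_at, max_skew=timedelta(hours=24)):
    """Date to sort by: fall back to fetch time if the published date is
    missing or further in the future than `max_skew` (an assumed tolerance)."""
    if published is None or published > fetched_at + max_skew:
        return fetched_at
    return published


if __name__ == "__main__":
    seen = {}
    key = item_key(7, guid="tag:example.com,2008:post-1")
    print(classify(seen, key, content_hash("first draft")))   # new
    print(classify(seen, key, content_hash("first draft")))   # unchanged
    print(classify(seen, key, content_hash("edited text")))   # changed

    now = datetime(2008, 10, 30, 12, 0, tzinfo=timezone.utc)
    wacky = datetime(2030, 1, 1, tzinfo=timezone.utc)
    print(effective_date(wacky, now))  # falls back to the fetch time
```

Keeping the identity key and the content hash separate means an edited description shows up as "changed" rather than as a brand-new item. It does not help with feeds whose GUIDs change on every edit; those will still look like new items under this scheme.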

ridertech, over 16 years ago

Sorry, I'm looking to build an app that aggregates feeds, not a normal "consumer app"

nreece, over 16 years ago

SimplePie - <a href="http://simplepie.org" rel="nofollow">http://simplepie.org</a>

ridertech, over 16 years ago

Thanks Peter! I was looking into FeedTools... <a href="http://sporkmonger.com/2008/2/1/feedtools-0-2-27" rel="nofollow">http://sporkmonger.com/2008/2/1/feedtools-0-2-27</a> But I'll probably just build something custom w/ Ruby. I was hoping someone else had already done the work and open sourced it ;)

eventhough, over 16 years ago

Zend Framework has an RSS reader. <a href="http://framework.zend.com/manual/en/zend.feed.html" rel="nofollow">http://framework.zend.com/manual/en/zend.feed.html</a>

qhoxie, over 16 years ago

I use Google Reader with great success, but there are tons of options out there, both desktop and web.

lpgauth, over 16 years ago

Yahoo Pipes!

TweedHeads, over 16 years ago

Magpie: <a href="http://magpierss.sourceforge.net" rel="nofollow">http://magpierss.sourceforge.net</a>

Ask YC: How to Build an RSS Aggregator?
