Disclaimer: I built, ran (for two years), and sold a Web app that processed tens of thousands of feeds each hour and distributed summaries based on those feeds hundreds of millions of times per month.<p>That out of the way: the difficulty varies somewhat with scale. At a large scale you run into all sorts of issues, including arbitrary blocks from feed providers, database locking, and so on. If you're really just doing "100s" of feeds on an "hourly" basis, hopefully you'll stay well under that level, but if you think it'll need to scale up quickly, the decisions you make now will be different from those you'd make if it were going to stay small.<p>I can't provide any code here, just some quick pointers.<p>Our crawler (which is still running under the new owner) was entirely custom and written in Ruby. It performed very well. Instead of using a specific feed-parsing library, it uses Hpricot (the Ruby HTML/XML parsing library) and a set of custom-built rules for parsing RSS and Atom. The reason for this is that we wanted speed, reliability (no shifting libraries), and it HAD to work with invalid (and even extremely broken) feeds - many "proper" RSS and Atom parsers choke on busted feeds. Put it this way, though: Ruby is definitely up to the task, as long as you rely on a parsing library (Hpricot, in this case) and aren't just using regular expressions or something ;-)<p>One nasty thing you'll need to deal with is knowing whether items in a feed are new or not. You <i>could</i> delete all items associated with a feed before processing that feed each time... but what if you want to keep an archive of older items? What if you need to maintain database performance? How are you going to track what's new, what was deleted, and so on?<p>I used a hash that was <i>either</i> based on each item's GUID and the feed's ID, OR (if no GUID was present) on the item's link and title. Unfortunately this was not failsafe. If someone changed the description of an item, the change wouldn't get picked up! And...
not all feeds use GUIDs - and some feeds have GUIDs that change when descriptions change, while others don't :)<p>Feed formats are really, really dirty, despite being officially specified. All sorts of nasty publishing systems mangle the formats, and you need to be able to deal with it. These are issues that go far beyond choosing a feed-parsing library - it's about the organization of items. You need to do a lot of sanitizing to be 100% effective. You'll find feeds that use wholly inappropriate date/time formats, and the content provider will not care; you need to handle that too. Oh, and watch out for feeds with wacky dates way into the future... which can end up "stuck" at the top of your list of items if you're ordering by date ;-)<p>This all just scratches the surface of how tricky it is. I was doing it full-time for over two years, and even now I feel I've only seen half the picture. You can either strive for 100% effectiveness in processing and parsing feeds and drive yourself nuts - or settle for 90% and sleep at night ;-)
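To make the "generic parser plus custom rules" idea concrete, here's a minimal sketch. It isn't the original crawler's code: that used Hpricot, which is lenient about broken markup, while this sketch uses Ruby's stdlib REXML (which is strict, so it only demonstrates the rule structure, not the broken-feed tolerance). All names are illustrative.

```ruby
require "rexml/document"

# Treat RSS <item> and Atom <entry> elements uniformly, with our own rules,
# rather than relying on a feed library's idea of a "valid" feed.
# (The original crawler used Hpricot for this; REXML stands in here and,
# unlike Hpricot, will raise on truly malformed markup.)
def extract_items(xml)
  doc = REXML::Document.new(xml)
  items = []
  # One rule set covers both formats: RSS wraps items in <item>, Atom in <entry>.
  ["//item", "//entry"].each do |path|
    REXML::XPath.each(doc, path) { |el| items << el }
  end
  items.map do |el|
    title = el.elements["title"]
    link  = el.elements["link"]
    {
      title: title && title.text.to_s.strip,
      # Atom puts the URL in an href attribute; RSS uses the text node.
      link:  link && (link.attributes["href"] || link.text.to_s.strip)
    }
  end
end
```

The point of owning the rules is that you can keep adding special cases for mangled feeds without waiting on (or fighting with) a library's notion of correctness.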
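The item-identity hash described above can be sketched like this (a rough reconstruction, not the original code; the separator scheme and SHA1 are assumptions):

```ruby
require "digest"

# Identity key for a feed item: prefer the feed-scoped GUID; fall back to
# link + title when no GUID exists. Note the caveat from the text: if a
# publisher edits an item's description, this key does not change, so the
# edit goes unnoticed - and GUIDs that themselves change on edit will make
# the same item look new.
def item_key(feed_id, guid: nil, link: nil, title: nil)
  basis = if guid && !guid.empty?
            "#{feed_id}|guid|#{guid}"
          else
            "#{feed_id}|linktitle|#{link}|#{title}"
          end
  Digest::SHA1.hexdigest(basis)
end
```

In use, you'd look each key up in a persistent store (a database table, not an in-memory hash) to decide whether the item is new before inserting it.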
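For the date mess, defensive parsing plus clamping is one way to cope. This is an assumed helper, not the original code: it tries RSS's RFC 2822 format, then Atom's ISO 8601, then a last-ditch guess, and clamps future dates so they can't pin an item to the top of a date-ordered list.

```ruby
require "time"

# Parse a feed item's published date defensively. Falls back to "now" when
# the date is unparseable, and clamps dates from the future.
def sanitize_pub_date(raw, now: Time.now)
  parsers = [
    ->(s) { Time.rfc2822(s) },  # RSS's official date format
    ->(s) { Time.iso8601(s) },  # Atom's official date format
    ->(s) { Time.parse(s) }     # last-ditch guess for mangled formats
  ]
  t = nil
  parsers.each do |p|
    begin
      t = p.call(raw)
      break
    rescue ArgumentError
      next                      # wrong format; try the next parser
    end
  end
  t ||= now                     # hopeless; don't let one feed kill the run
  t > now ? now : t             # clamp "wacky dates way into the future"
end
```

Whether to fall back to "now" or to skip the item is a judgment call; the important part is deciding deliberately instead of letting a garbage date propagate into your ordering.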