I am currently working on what I hope will be a startup (lean, bootstrapped, etc.) and I am dealing with thousands of feeds.<p>At present I am downloading feeds in batches of 5-10, each batch fetched across a set of threads from Ruby using FeedZirra (<a href="https://github.com/pauldix/feedzirra" rel="nofollow">https://github.com/pauldix/feedzirra</a>), and then parsing them.<p><i>Has anyone been in a similar situation and done something particularly innovative they care to share?</i> I plan on ranking feeds by frequency of updates after some analysis, but in the meantime I am resigned to pulling everything down as quickly as possible.<p>I would love to use Superfeedr for this, but the cost is prohibitive for me and I do not want to stump up the cash for the credits whilst still in development (although I could move to it in the future).<p>Not so bothered about the technology/language - this is a hodgepodge of Ruby, Ramaze, MySQL, Solr and good old file system storage.<p>Thanks in advance, and appreciation for any and all comments!
You might want to take a look at Samuel Clay's NewsBlur project: <a href="https://github.com/samuelclay/NewsBlur" rel="nofollow">https://github.com/samuelclay/NewsBlur</a> and see how he handles this problem.
Check out <a href="http://www.feedparser.org/" rel="nofollow">http://www.feedparser.org/</a>. It's for Python and pretty robust; it handles ETags and Last-Modified headers, is well documented, and has loads of unit tests.
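For anyone who wants to see it, a minimal sketch of feedparser's conditional-GET support (the feed URL is just a placeholder):

    import feedparser

    FEED_URL = "http://example.com/feed.xml"  # placeholder

    # First fetch: no cached validators yet.
    d = feedparser.parse(FEED_URL)
    etag, modified = d.get("etag"), d.get("modified")

    # Later fetch: send the cached validators back. A 304 means nothing
    # changed, so there is no body to re-download or re-parse.
    d2 = feedparser.parse(FEED_URL, etag=etag, modified=modified)
    if d2.get("status") == 304:
        print("feed unchanged, nothing to do")
    else:
        for entry in d2.entries:
            print(entry.get("title"))

Persist the etag/modified values per feed and hand them back on every poll.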
I've done my fair share of it, and while I don't know that we ultimately tackled it 100%, there were plenty of gotchas.<p>1) Respect ETags / Last-Modified headers. This will save you a ton of bandwidth, for one, and keep you from getting banned by the feeds you're pulling. It's important. What I ended up doing was a different method for new feeds vs. ones I already knew about -- on the initial parse (and subsequent ones too), I would check for an ETag or Last-Modified indicator. If I couldn't detect anything, I set the poll frequency to something like half an hour. This kept me from slamming servers that didn't properly implement ETags, while still checking headers more frequently on the ones that did.<p>2) Hang on to your sockets. Opening / closing sockets is expensive for this particular task. What ended up working for us was queueing entries and reusing the same urllib handle across however many feeds needed polling at a time. Otherwise, we were flooding the box with open sockets. (There's a rough sketch of points 1 and 2 after this comment.)<p>3) Use a task queue. My environment was Python, so I had the beautiful RabbitMQ and Celery to work with. We never ended up having to scale, but the intention was that, with a distributed task queue, we could just add other nodes to do the fetching tasks if we needed to. (See the Celery sketch below.)
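Very rough sketch of points 1 and 2, in Python since that was our stack. I'm using a requests.Session here as a stand-in for the urllib handle we kept around, and the URLs/intervals are made up, so treat it as a shape rather than our actual code:

    import requests

    session = requests.Session()   # reuses TCP connections across polls (point 2)

    DEFAULT_INTERVAL = 30 * 60     # ~half an hour for feeds with no validators (point 1)
    FAST_INTERVAL = 5 * 60         # made-up faster interval for well-behaved feeds

    def poll(url, etag=None, last_modified=None):
        headers = {}
        if etag:
            headers["If-None-Match"] = etag
        if last_modified:
            headers["If-Modified-Since"] = last_modified

        resp = session.get(url, headers=headers, timeout=30)

        if resp.status_code == 304:
            # Nothing changed: no body to download or parse.
            return None, etag, last_modified, FAST_INTERVAL

        new_etag = resp.headers.get("ETag")
        new_modified = resp.headers.get("Last-Modified")

        # Feeds that expose validators can be checked often; the rest get the
        # slow default so we don't slam them.
        interval = FAST_INTERVAL if (new_etag or new_modified) else DEFAULT_INTERVAL
        return resp.text, new_etag, new_modified, interval

Store the returned validators and interval per feed, and schedule the next poll with them.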
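And for point 3, the Celery end can stay tiny. The broker URL, rate limit and poll interval below are illustrative, not what we actually ran:

    from celery import Celery
    import feedparser

    # RabbitMQ as the broker; any box running a worker can pick up fetch jobs.
    app = Celery("feeds", broker="amqp://guest@localhost//")

    @app.task(rate_limit="10/s")   # crude politeness cap
    def fetch_feed(url, etag=None, modified=None):
        d = feedparser.parse(url, etag=etag, modified=modified)
        if d.get("status") != 304:
            for entry in d.entries:
                pass  # hand entries off to parsing / storage here
        # Re-enqueue with whatever validators the server handed back.
        fetch_feed.apply_async(
            args=[url, d.get("etag"), d.get("modified")],
            countdown=30 * 60,   # made-up poll interval
        )

Adding fetch capacity then just means starting more workers pointed at the same broker.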