Ask YC: How to Build an RSS Aggregator?

11 points by ridertech, over 16 years ago
What do you recommend for reading and parsing 100s of RSS/Atom feeds on an hourly basis?

I'm able to write a custom script in PHP or preferably Rails, but wondering if there is a sweet app or tutorial that others have used and liked.

9 comments

billturner, over 16 years ago
Look at Sam Ruby's Venus (which I've used): http://intertwingly.net/code/venus/ (Python)

Or, his Mars version (haven't used): http://intertwingly.net/code/mars/ (written in Ruby, and newer)
petercooper, over 16 years ago
Disclaimer: I built, ran (for two years), and sold a Web app that processed tens of thousands of feeds each hour and distributed summaries based on those feeds hundreds of millions of times per month.

That out of the way, the difficulty varies somewhat with the scale. At a large scale, you run into all sorts of issues, including arbitrary blocks from feed providers, dealing with database locking, etc. If you're really just doing "100s" on an "hourly" basis, hopefully you'll stay well under that level, but if you think it'll need to scale up quickly, the decisions you make now will need to be different than if it's going to stay small.

I can't provide any code here, but just some quick pointers.

Our crawler (which is still running under the new owner) was entirely custom and written in Ruby. It performed very well. Instead of using a specific feed parsing library, it uses Hpricot (the Ruby library) and a set of custom-built rules for parsing RSS and Atom. The reason for this is that we wanted speed, reliability (no shifting libraries), and it HAD to work with invalid (and even extremely broken) feeds. Many "proper" RSS and Atom parsers have issues with busted feeds. Put it this way, though: Ruby is definitely up to the task, as long as you rely on a parsing library (Hpricot, in this case) and aren't just using regular expressions or something ;-)

One nasty thing you'll need to deal with is knowing whether items in feeds are new or not. You *could* delete all items associated with a feed before processing that feed each time, but what if you want to keep an archive of older items? What if you need to maintain database performance? How are you going to track what's new, what was deleted, etc.?

I used a hash that was *either* based on each item's GUID and the feed's ID OR (if no GUID was present) the item's link and title. Unfortunately this was not failsafe. If someone changed the description of an item, the change wouldn't get picked up! And not all feeds use GUIDs, and some feeds have GUIDs that change when descriptions change while others don't :)

Feed formats are really, really dirty, despite being specified officially. All sorts of nasty publishing systems are mangling the formats and you need to be able to deal with it. These are issues that go far beyond choosing a feed parsing library; it's about the organization of items. You need to do a lot of sanitizing to be 100% effective. You'll find feeds that use wholly inappropriate date/time formats, and the content provider will not care. You need to be able to deal with that. Oh, and watch out for feeds that have wacky dates way into the future, which can then end up "stuck" at the top of your list of items if you're ordering by date ;-)

This all just scratches the surface of how tricky it is. I was doing it full-time for over two years and even now I feel I've only seen half the picture. You either strive for 100% effectiveness at processing and parsing feeds and drive yourself nuts, or settle for 90% and sleep at night ;-)
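
For anyone who wants to see the two tricks above in concrete form, here is a rough sketch of the item-identity hash and the future-date guard. It is written in Python rather than the Ruby the original crawler used, and it is not petercooper's code: the function names `item_key` and `clamp_published`, the SHA-1 choice, the one-hour tolerance, and treating naive timestamps as UTC are all assumptions made for illustration.

```python
import hashlib
from datetime import datetime, timedelta, timezone
from typing import Optional


def item_key(feed_id: int, guid: Optional[str], link: str, title: str) -> str:
    """Identity hash for a feed item (hypothetical helper).

    Uses feed ID + GUID when a GUID is present, otherwise link + title,
    mirroring the scheme described in the comment above.
    """
    if guid and guid.strip():
        basis = f"guid\x1f{feed_id}\x1f{guid.strip()}"
    else:
        basis = f"link\x1f{link.strip()}\x1f{title.strip()}"
    # \x1f (unit separator) keeps the joined fields from running together.
    return hashlib.sha1(basis.encode("utf-8")).hexdigest()


def clamp_published(
    published: Optional[datetime],
    fetched_at: datetime,
    tolerance: timedelta = timedelta(hours=1),  # assumed tolerance
) -> datetime:
    """Return a timestamp that is safe to sort by (hypothetical helper).

    Missing dates, and dates further in the future than `tolerance`,
    fall back to the time the feed was fetched. `fetched_at` must be
    timezone-aware.
    """
    if published is None:
        return fetched_at
    if published.tzinfo is None:
        # Assumption: treat naive timestamps as UTC.
        published = published.replace(tzinfo=timezone.utc)
    if published > fetched_at + tolerance:
        return fetched_at
    return published


if __name__ == "__main__":
    seen = set()
    now = datetime.now(timezone.utc)
    key = item_key(7, None, "http://example.com/post", "Hello")
    if key not in seen:
        seen.add(key)
        print("new item", key, clamp_published(None, now).isoformat())
```

As the comment warns, a key like this is not failsafe: an edited description leaves the key unchanged, and a feed whose GUIDs change on every edit will look like it is publishing new items.
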
ridertech, over 16 years ago
Sorry, I'm looking to build an app that aggregates feeds, not a normal "consumer app"
nreece, over 16 years ago
SimplePie - http://simplepie.org
ridertech, over 16 years ago
Thanks Peter! I was looking into FeedTools... http://sporkmonger.com/2008/2/1/feedtools-0-2-27

But I'll probably just build something custom w/ Ruby. I was hoping someone else had already done the work and open sourced it ;)
eventhough, over 16 years ago
Zend Framework has an RSS reader. http://framework.zend.com/manual/en/zend.feed.html
qhoxie, over 16 years ago
I use Google Reader with great success, but there are tons of options out there, both desktop and web.
lpgauth, over 16 years ago
Yahoo Pipes!
TweedHeads, over 16 years ago
Magpie

http://magpierss.sourceforge.net