
Ask HN: Best framework for parsing thousands of feeds?

6 points by kez, over 14 years ago
I am currently working on what I hope will be a startup (lean, bootstrapped, etc.), and I am dealing with thousands of feeds.

Presently I am downloading feeds in batches of 5-10 across threads from Ruby using Feedzirra (https://github.com/pauldix/feedzirra) and then parsing them.

*Has anyone been in a similar situation and done something particularly innovative they care to share?* I plan on ranking feeds by frequency of updates after some analysis, but in the meantime I am resigned to pulling everything down as quickly as possible.

I would love to use Superfeedr for this, but the cost is prohibitive for me and I do not want to stump up the cash for the credits while in development (although I could move to it in the future).

I am not too bothered about the technology/language: the current stack is a hodgepodge of Ruby, Ramaze, MySQL, Solr, and good old file system storage.

Advance thanks and appreciation for any and all comments!
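[Editor's note: as a rough illustration of the batched-fetch approach the question describes, here is a minimal Python sketch using a thread pool and feedparser as stand-ins for the Ruby/Feedzirra stack; the URLs and batch size are made-up placeholders, not taken from the post.]

    # Hypothetical sketch: fetch feeds concurrently in small batches.
    from concurrent.futures import ThreadPoolExecutor
    import feedparser

    FEED_URLS = ["https://example.com/feed1.xml", "https://example.com/feed2.xml"]
    BATCH_SIZE = 10  # roughly the 5-10 feeds per batch mentioned above

    def fetch(url):
        # feedparser downloads and parses the feed in one call
        return url, feedparser.parse(url)

    def fetch_all(urls):
        results = {}
        with ThreadPoolExecutor(max_workers=BATCH_SIZE) as pool:
            for url, parsed in pool.map(fetch, urls):
                results[url] = parsed
        return results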

3 comments

swanson, over 14 years ago
You might want to take a look at Samuel Clay's NewsBlur project (https://github.com/samuelclay/NewsBlur) and see how he handles this problem.
dclaysmith, over 14 years ago
Check out http://www.feedparser.org/. It's for Python and pretty robust; it handles ETags and Last-Modified headers, is well documented, and has loads of unit tests.
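[Editor's note: a short sketch of the conditional-fetch support this comment refers to. feedparser accepts etag and modified arguments and reports HTTP 304 when the server says nothing changed; the in-memory cache dict here is a hypothetical stand-in for real persistence.]

    import feedparser

    cache = {}  # url -> (etag, modified)

    def poll(url):
        etag, modified = cache.get(url, (None, None))
        d = feedparser.parse(url, etag=etag, modified=modified)
        if getattr(d, "status", None) == 304:
            return None  # unchanged since last poll; no bandwidth wasted
        # remember the validators the server sent back, if any
        cache[url] = (getattr(d, "etag", None), getattr(d, "modified", None))
        return d.entries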
bmelton, over 14 years ago
I've done my fair share of this, and while I don't know that we ultimately tackled it 100%, there were plenty of gotchas.

1) Respect ETags / Last-Modified headers. This will save you a ton of bandwidth, for one, and keep you from getting banned by the feeds you're pulling. It's important. What I ended up doing was a different method for new feeds vs. ones I already knew about: on the initial parse (and subsequent ones too), I would check for an ETag or Last-Modified indicator. If I couldn't detect either, I set the poll frequency to something like half an hour. This kept me from slamming servers that didn't properly implement ETags, while I could check headers more frequently on the ones that did.

2) Hang on to your sockets. Opening and closing sockets is expensive for this particular task. What ended up working for us was queueing entries and reusing the same urllib handle for as many feeds as needed polling at a time. Otherwise, we were flooding the box with open sockets.

3) Use a task queue. My environment was Python, so I had the beautiful RabbitMQ and Celery to work with. We never ended up having to scale, but the intention was that, with a distributed task queue, we could just add more nodes to do the fetching if we needed to.
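[Editor's note: a minimal sketch of the distributed task queue in point 3, assuming Celery with a RabbitMQ broker. The broker URL, app name, and task body are illustrative assumptions, not details from the comment.]

    from celery import Celery
    import feedparser

    # broker URL is a placeholder; point it at your RabbitMQ instance
    app = Celery("feeds", broker="amqp://guest@localhost//")

    @app.task
    def fetch_feed(url):
        # Each worker node pulls URLs off the queue and parses them;
        # scaling out means starting more workers, nothing else changes.
        return len(feedparser.parse(url).entries)

    # Enqueue work from anywhere:
    #   fetch_feed.delay("https://example.com/feed.xml")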