Using jQuery and node.js to scrape html pages in 5 lines

133 points by Ainab, over 14 years ago

9 comments

drats, over 14 years ago
Strange popup: "Hello, i see you are coming from hacker news.

the article you clicked on was most certainly not submitted by nodejitsu.

news.ycombinator has a long history of squashing articles and submitters that aren't funded by y-comb.

most of this is done through their "silent" banning and censoring mechanisms, that leave people not even realizing they have been silenced.

i hope you enjoy this article, and remember that HN is extremely biased and that you should keep your horizons broad."

While I would agree that HN is biased towards YC-funded projects, I would not agree that it is biased against non-YC projects or news. In fact, the majority of the items on HN are non-YC. This also holds for submitters and commenters for the year or more I've been here.

On a different note, Hpricot is not representative of Ruby scraping anymore; nokogiri (http://nokogiri.org/) is where it's at. It has a Hpricot translation layer if you need to switch. Even though I decided to standardize on Python for everything else, I still go back to Ruby just for nokogiri when it comes to scraping.
robinduckett, over 14 years ago
Hey guys. The Nodejitsu team and Marak (http://www.github.com/Marak), the guy behind Nodejitsu, are perma-banned from HN and can't respond to your queries.

He sends his regards, and if you'd like to contact him, visit the #Node.js IRC channel on Freenode.
il, over 14 years ago
I have a question: Does scraping like this execute JavaScript on the scraped page? Am I able to access the output of JavaScript/AJAX on that page?

As far as I know, this is impossible with any other server-side scraping technology.

If so, that would be amazingly useful for a couple of my side projects, and much easier than parsing their JavaScript code and extracting the info I need.
fmw, over 14 years ago
The article lists BeautifulSoup as the Python choice for scraping, but it isn't the only option. I'm using http://scrapy.org/, for example, which is a scraping framework that uses lxml and XPath by default.
fizx, over 14 years ago
This reminds me, I ported the core ideas of the parsley scraping language to jQuery.

http://github.com/fizx/pquery#readme
tcarnell, over 14 years ago
Interesting. When I built http://cQuery.com (Content Query Engine), I investigated a number of HTML parsing and content extraction options. I had played with Rhino and John Resig's env.js (http://ejohn.org/blog/bringing-the-browser-to-the-server/) to run jQuery server-side.

For portability, performance and flexibility, I finally settled on writing my own HTML parser and CSS selection engine from scratch.
knowtheory, over 14 years ago
The article reads "The challenge with using these libraries is that they all have their own quirks that can make working with HTML, CSS and Javascript challenging."

And that's true only if you want to do page manipulation exclusively in JavaScript. I'm perfectly happy with my page manipulation in Ruby w/ Nokogiri. Here's an example:

(code formatting on HN sucks, so it's on my blog, apologies)

http://blog.knowtheory.net/post/1074676060/xml-manipulation-in-6-lines-of-ruby
forsaken, over 14 years ago
Site appears down. Is node popular enough yet for the "Node doesn't scale" talk? :)
jfager, over 14 years ago
Ignoring the drama: my current favorite scraping combo is NekoHtml underneath Scala's completely kickass combo of pattern matching and XML literals.