科技回声

9 comments

drats, over 14 years ago

Strange popup: "Hello, i see you are coming from hacker news. the article you clicked on was most certainly not submitted by nodejitsu. news.ycombinator has a long history of squashing articles and submitters that aren't funded by y-comb. most of this is done through their "silent" banning and censoring mechanisms, that leave people not even realizing they have been silenced. i hope you enjoy this article, and remember that HN is extremely biased and that you should keep your horizons broad."

While I would agree that HN is biased towards YC-funded projects, I would not agree that it is biased against non-YC projects or news. In fact, the majority of the items on HN are non-YC. The same goes for submitters and commenters in the year or more I've been here.

On a different note: Hpricot is not representative of Ruby scraping anymore. nokogiri (http://nokogiri.org/) is where it's at, and it has an Hpricot translation layer if you need to switch. Even though I decided to standardize on Python for everything else, I still go back to Ruby just for nokogiri when it comes to scraping.

robinduckett, over 14 years ago

Hey guys. The Nodejitsu team and Marak (http://www.github.com/Marak), the guy behind Nodejitsu, are perma-banned from HN and can't respond to your queries.

He sends his regards, and if you'd like to contact him, visit the #Node.js IRC channel on Freenode.

il, over 14 years ago

I have a question: does scraping like this execute JavaScript on the scraped page? Am I able to access the output of JavaScript/AJAX on that page?

As far as I know, this is impossible with any other server-side scraping technology.

If so, that would be amazingly useful for a couple of my side projects, and much easier than parsing their JavaScript code and extracting the info I need.
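
To illustrate the distinction this question turns on, here is a minimal Python sketch that uses only the standard library's html.parser. The PAGE snippet, the TextCollector class, and the "price" id are made up for illustration and do not come from the article. It only shows that a static parser sees a script's source text and never its output. It says nothing about how the article's approach behaves.

```python
from html.parser import HTMLParser

# Hypothetical page: the <div> is empty in the markup and would only be
# filled in by the script if something actually executed it.
PAGE = """<html><body>
<div id="price"></div>
<script>document.getElementById("price").textContent = "42";</script>
</body></html>"""


class TextCollector(HTMLParser):
    """Collects non-blank text nodes, grouped by the enclosing tag name."""

    def __init__(self):
        super().__init__()
        self._stack = []        # currently open tags
        self.text_by_tag = {}   # tag name -> list of text chunks

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if self._stack and text:
            self.text_by_tag.setdefault(self._stack[-1], []).append(text)


collector = TextCollector()
collector.feed(PAGE)
collector.close()

# The parser never runs the script, so the <div> has no text,
# while the script's source is visible as plain data.
div_text = collector.text_by_tag.get("div", [])
script_text = collector.text_by_tag.get("script", [])
```

Here div_text stays empty while script_text holds the raw source, so the value the script would write into the page is not available to a purely static parse.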

fmw, over 14 years ago

The article lists BeautifulSoup as the Python choice for scraping, but it isn't the only one. I'm using http://scrapy.org/, for example, which is a scraping framework that uses lxml and XPath by default.

fizx, over 14 years ago

This reminds me, I ported the core ideas of the parsley scraping language to jQuery.

http://github.com/fizx/pquery#readme

tcarnell, over 14 years ago

Interesting. When I built http://cQuery.com (Content Query Engine), I investigated a number of HTML parsing and content extraction options. I had played with Rhino and John Resig's env.js (http://ejohn.org/blog/bringing-the-browser-to-the-server/) to run jQuery server-side.

For portability, performance and flexibility, I finally settled on writing my own HTML parser and CSS selection engine from scratch.

knowtheory, over 14 years ago

The article reads: "The challenge with using these libraries is that they all have their own quirks that can make working with HTML, CSS and Javascript challenging."

That's true only if you want to do your page manipulation solely in JavaScript. I'm perfectly happy with my page manipulation in Ruby w/ Nokogiri. Here's an example (code formatting on HN sucks, so it's on my blog, apologies):

http://blog.knowtheory.net/post/1074676060/xml-manipulation-in-6-lines-of-ruby

forsaken, over 14 years ago

Site appears down. Is node popular enough yet for the "Node doesn't scale" talk? :)

jfager, over 14 years ago

Ignoring the drama: my current favorite scraping combo is NekoHtml underneath Scala's completely kickass combo of pattern matching and XML literals.

Using jQuery and node.js to scrape html pages in 5 lines
