10 comments

boie0025, more than 10 years ago

I had to write scrapers in Ruby for a very large application that scraped all kinds of government information from various states. After a lot of pain working with very procedural scrapers, we found that a modified producer/consumer pattern worked well. Making classes for the producers (each class described one page to be scraped, with methods that matched the modeled data) allowed for easy maintenance. We then created consumers that could be passed any of the page-specific producer classes and knew how to persist the scraped data.

Once I had a good pattern in place, I could easily create subclasses for the data type I was trying to scrape, basically pointing each of the modeled-data methods to an XPath that was specific to that page.
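A minimal sketch of how such a pattern might look, not the commenter's actual code. All class names, the `field` helper, and the sample markup are hypothetical. REXML is used only to keep the example dependency-free; a real scraper would more likely parse pages with Nokogiri.

```ruby
require "rexml/document"

# Producer base class: each modeled-data method is backed by an XPath
# that a page-specific subclass declares. (Hypothetical design.)
class PageProducer
  # Declare a modeled field and the XPath that locates it on the page.
  def self.field(name, xpath)
    fields[name] = xpath
    define_method(name) do
      node = REXML::XPath.first(@doc, xpath)
      node&.text&.strip
    end
  end

  # Per-subclass registry of declared fields.
  def self.fields
    @fields ||= {}
  end

  def initialize(markup)
    @doc = REXML::Document.new(markup)
  end

  # Collect every declared field into a plain hash.
  def to_h
    self.class.fields.keys.map { |name| [name, public_send(name)] }.to_h
  end
end

# One subclass per page layout; only the XPaths differ.
class ExampleLicensePage < PageProducer
  field :holder_name, "//td[@class='name']"
  field :license_number, "//td[@class='number']"
end

# Consumer: accepts any producer and persists its data.
# Here "persisting" just appends to an in-memory array.
class RecordConsumer
  attr_reader :records

  def initialize
    @records = []
  end

  def consume(producer)
    @records << producer.to_h
  end
end

SAMPLE_PAGE = <<~HTML
  <html><body><table><tr>
    <td class="name"> Jane Doe </td>
    <td class="number">A-1234</td>
  </tr></table></body></html>
HTML

consumer = RecordConsumer.new
consumer.consume(ExampleLicensePage.new(SAMPLE_PAGE))
p consumer.records
```

Under this layout, a change to a page's markup should only require updating the XPaths in that page's subclass; the consumer stays untouched.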

Doctor_Fegg, more than 10 years ago

I'd suggest going with mechanize from the off - not just, as the article says, "[when] the site you’re scraping requires you to login first, for those instances I recommend looking into mechanize".

Mechanize allows you to write clean, efficient scraper code without all the boilerplate. It's the nicest scraping solution I've yet encountered.

wnm, more than 10 years ago

I recommend having a look at capybara [0]. It is built on top of nokogiri, and is actually a tool for writing acceptance tests. But it can also be used for web scraping: you can open websites, click on links, fill in forms, find elements on a page (via XPath or CSS), get their values, etc. I prefer it over nokogiri because of its nice DSL and good documentation [1]. It can also execute JavaScript, which is sometimes handy for scraping.

I've spent a lot of time working on web scrapers for two of my projects, http://themescroller.com (dead) and http://www.remoteworknewsletter.com, and I think the holy grail is to build a Rails app around your scraper. You can write your scrapers as libs, and then make them executable as rake tasks, or even cron jobs. And because it's a Rails app, you can save all scraped data as actual models and have them persisted in a database. With Rails it's also super easy to build an API around your data, or build a quick backend for it via Rails scaffolds.

[0] https://github.com/jnicklas/capybara
[1] http://www.rubydoc.info/github/jnicklas/capybara/

joshmn, more than 10 years ago

I always see people using something like HTTParty or open-uri for pulling down the page. My preferred (by far) is typhoeus, as it supports parallel requests and wraps around libcurl.

https://github.com/typhoeus/typhoeus

jstoiko, more than 10 years ago

I'd suggest taking a look at Scrapy (http://scrapy.org). It is built on top of Twisted (asynchronous) and uses XPath, which makes your "scraping" code a lot more readable.

pkmishra, more than 10 years ago

Scraping is generally easy, but the challenges come when you are scraping large amounts of unstructured data, and in how well you respond proactively to page changes. Scrapy is very good. I couldn't find a similar tool in Ruby, though.

k__, more than 10 years ago

Can anyone list some good resources about scraping, with gotchas etc.?

programminggeek, more than 10 years ago

Why not just use something like Watir or Selenium?

richardpetersen, more than 10 years ago

How do you get the script to save the json file?
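One common way to do this, sketched with Ruby's standard `json` library. The `records` data, the `save_as_json` helper, and the output path are hypothetical and not taken from the article's script.

```ruby
require "json"
require "tmpdir"

# Hypothetical helper: serialize the scraped records and write them to a file.
def save_as_json(records, path)
  File.write(path, JSON.pretty_generate(records))
end

# Hypothetical scraped data; a real script would build this from parsed pages.
records = [
  { "title" => "Example listing", "url" => "http://example.com/1" }
]

# Written to the system temp directory here; any writable path would do.
output_path = File.join(Dir.tmpdir, "scraped.json")
save_as_json(records, output_path)
```

`JSON.pretty_generate` produces indented output that is easier to inspect by hand; `records.to_json` should work as well if compact output is preferred.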

mychaelangelo, more than 10 years ago

Thanks for sharing this - great scraping intro for us newbies (I'm new to Ruby and RoR).

Web scraping with Ruby
