Hi. Wondering if there are any crawler experts on here who can help me. We want to build a crawler to visit some sites that have forms, lists of items, and detail pages for those items. They're all in the real estate market, and we want to capture the properties and pull out the latest ones. I'm being told that we need to create a specific crawler for each site, but I was wondering if we could create a generic crawler that has some kind of plug-in or pattern-matching file (that we build manually) for each site. Anyone who is super-skilled in this area, I'd appreciate some advice. We're using Python, by the way. One caveat: I'm not the tech guy. I tried to program and failed, but I do understand what we need and have a very good understanding of technology; I'm just inept at taking my ideas and doing anything with them :)
Use Beautiful Soup (the best Python scraping library I know of). Maybe combine it with mechanize for navigating between the pages.
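Roughly what that combination looks like, as a minimal sketch - the URL, form field names, and CSS classes here are all invented placeholders, since every site will differ:

```python
# Sketch: mechanize drives a search form, Beautiful Soup parses the results.
# The site URL, form field, and "listing" class are made up for illustration.
import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.set_handle_robots(True)            # respect robots.txt
br.open("http://example-realty.com/search")

br.select_form(nr=0)                  # first form on the page
br["city"] = "Springfield"            # hypothetical form field
response = br.submit()

soup = BeautifulSoup(response.read(), "html.parser")
for item in soup.find_all("div", class_="listing"):   # hypothetical markup
    link = item.find("a")
    if link:
        print(link.get_text(strip=True), link.get("href"))
```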
Don't try to create your own pattern-matching file. Write a generic crawler class and subclass it for each site. In the end you should just need to write a couple of short site-specific functions per site (see the sketch below).

It doesn't take long once you get going - that is, until you run into sites that are unnavigable piles of JavaScript and unstructured HTML.
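Roughly the shape I mean - the URLs and selectors are invented, and a real version needs error handling and throttling:

```python
# Sketch: a generic crawler base class; each site only overrides two methods.
import urllib.request
from bs4 import BeautifulSoup


class SiteCrawler:
    start_url = None

    def fetch(self, url):
        with urllib.request.urlopen(url) as resp:
            return BeautifulSoup(resp.read(), "html.parser")

    def listing_links(self, soup):
        raise NotImplementedError   # site-specific

    def parse_detail(self, soup):
        raise NotImplementedError   # site-specific

    def crawl(self):
        index = self.fetch(self.start_url)
        for link in self.listing_links(index):
            yield self.parse_detail(self.fetch(link))


class ExampleRealty(SiteCrawler):
    # Everything below is made up for illustration.
    start_url = "http://example-realty.com/listings"

    def listing_links(self, soup):
        return [a["href"] for a in soup.select("a.property-link")]

    def parse_detail(self, soup):
        return {
            "address": soup.select_one("h1.address").get_text(strip=True),
            "price": soup.select_one("span.price").get_text(strip=True),
        }
```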
I've developed a Twisted Python crawler that does something very similar to that. The possibility that it would work well seemed dubious at first, but I've been pleasantly surprised with the results.

Email me at johnwehr@gmail.com - I'd be happy to discuss the technology and progress I've made so far.
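For a sense of what the Twisted side looks like, here's a bare-bones fetch-only sketch (not my actual crawler; URLs are placeholders and there's no error handling or throttling):

```python
# Sketch: fetch a few pages concurrently with Twisted's Agent.
from twisted.internet import reactor, defer
from twisted.web.client import Agent, readBody

agent = Agent(reactor)

def fetch(url):
    d = agent.request(b"GET", url.encode("utf-8"))
    d.addCallback(readBody)            # resolve to the response body bytes
    return d

urls = ["http://example-realty.com/listings?page=%d" % n for n in range(1, 4)]
dl = defer.DeferredList([fetch(u) for u in urls])
dl.addCallback(lambda results: [print(len(body), "bytes") for ok, body in results if ok])
dl.addBoth(lambda _: reactor.stop())
reactor.run()
```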
You might talk to the guys at New Idea Engineering about their xpump technology: http://www.ideaeng.com/ds/xpump.html
I used it a few years ago to process all of the hardware data sheets on the Cisco website and extract 13 parameters such as height, width, depth, weight, power consumption (AC and DC), etc. Because Cisco's products come from many different acquisitions, the datasheets were in many different formats, which sounds similar to your problem.

One point I would make is that a fast crawler is not always the best for this type of application: crawling at about the speed a user would click through pages is friendlier to a site and less likely to have them take steps to block your access.
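The throttling part is trivial to add. Something along these lines, where the delay range is just my guess at "user speed":

```python
# Sketch: fetch pages politely, pausing a few seconds between requests,
# roughly the pace of a person clicking through a site.
import random
import time
import urllib.request

def polite_fetch(urls, min_delay=3.0, max_delay=8.0):
    for url in urls:
        with urllib.request.urlopen(url) as resp:
            yield url, resp.read()
        time.sleep(random.uniform(min_delay, max_delay))
```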
I work for http://streeteasy.com . We are experts in this area - especially in real estate ;) Feel free to contact me at ch AT streeteasy.com. It is *not* an easy task to build what you are looking for. We've been building and improving our system for two years now (Ruby on Rails). Scraping the data is just one part of the problem. Validating the data is also a big issue: these sites often have incorrect or stale information. MLSs are good, but they may have restrictions on what you can do with the data, or (as in NYC) they may not even exist.
You can build a generic crawler that pulls pages from sites quickly and then process the pages offline with whatever language you'd like. It's better to have a distributed way of doing things. Plus, there are standards you need to comply with when crawling someone's website, like not crawling them too fast, and checking their robots.txt file to make sure you're only fetching "allowable" pages. Then, once you've pulled their data off, you process it offline and do whatever you need to do with it. It's not a simple procedure, but it's doable if you're willing to spend the time to do it properly.
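The robots.txt check is easy with the standard library - a quick sketch, with a placeholder user agent name and site:

```python
# Sketch: consult robots.txt before fetching a page.
from urllib.robotparser import RobotFileParser

USER_AGENT = "MyRealEstateCrawler"   # placeholder name

rp = RobotFileParser()
rp.set_url("http://example-realty.com/robots.txt")
rp.read()

url = "http://example-realty.com/listings"
if rp.can_fetch(USER_AGENT, url):
    print("allowed to fetch", url)
else:
    print("disallowed by robots.txt:", url)
```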
I've been thinking that there must be an easy way to tie the Emacs web browser, macros, and regexes together to make powerfully customizable crawlers, but I haven't really investigated. Does anyone know of this being done?
I've been working on this for a little while now. You can definitely write a plug-in or pattern-matching file for each site; building a specific crawler for each website doesn't make sense.

The custom bits you need are the ones that fill in the form and then extract the results. For scraping results, Ruby's Scrubyt is the best I've found, as you can write templates for each type of page.
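Since you're on Python, the "pattern-matching file" could be as simple as a per-site template of CSS selectors fed to one generic extractor. A sketch, with invented selectors:

```python
# Sketch: one generic extractor driven by a per-site "template" of selectors.
# The selectors are invented; each real site gets its own template.
from bs4 import BeautifulSoup

SITE_TEMPLATES = {
    "example-realty": {
        "item": "div.listing",
        "fields": {
            "address": "h2.address",
            "price": "span.price",
            "link": "a.detail",
        },
    },
}

def extract(html, template):
    soup = BeautifulSoup(html, "html.parser")
    for node in soup.select(template["item"]):
        record = {}
        for name, selector in template["fields"].items():
            match = node.select_one(selector)
            record[name] = match.get_text(strip=True) if match else None
        yield record
```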
If you need to do form stepping, you need to look at something like Perl's Mechanize package. (Ruby has one too.)

Spend time reading articles related to Mechanize. Your resulting code is going to be fairly terse, so you don't really need to spend much time worrying about making a generic crawler.
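Python's mechanize (mentioned above) works the same way, and the form-stepping code really does stay terse. A rough sketch with made-up field names - submit a search, then walk the result pages via their "Next" links:

```python
# Sketch: form stepping and pagination with Python's mechanize.
import mechanize

br = mechanize.Browser()
br.open("http://example-realty.com/search")
br.select_form(nr=0)
br["min_price"] = "100000"      # hypothetical form field
br.submit()

while True:
    print(br.geturl())          # process the current results page here
    try:
        br.follow_link(text="Next")
    except mechanize.LinkNotFoundError:
        break
```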
I'm with screen-scraper (http://www.screen-scraper.com/), and we've dealt a lot with scraping real estate data. Building a generic crawler for this kind of thing is quite a bit more complicated than it might seem. You might give our software and services a look, though. Our app integrates quite nicely with Python.
See this post from a couple weeks ago: http://news.ycombinator.com/item?id=96057

Business idea? Selling smart crawlers to YC News readers? There seems to be recurring interest :)
Also, there should be some way of getting MLS (Multiple Listing Service) data from a service via RSS or something, somewhere - which is a TON nicer than crawling several people's webpages.
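If such a feed does exist, pulling the latest listings is trivial compared to scraping. A sketch with a hypothetical feed URL:

```python
# Sketch: read new listings from an RSS/Atom feed with feedparser.
# The feed URL is hypothetical; whether an MLS offers one varies a lot.
import feedparser

feed = feedparser.parse("http://example-mls.com/listings.rss")
for entry in feed.entries:
    print(entry.get("published", "n/a"), entry.title, entry.link)
```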