Somewhat tangentially, here is something I have long thought would be a very useful project, but unfortunately haven't had the time to build:

It's a scraping/crawling tool suite, based on WebKit, with good scriptable plugin support (not just JS, but exposing the DOM to other languages too). It would consist of a few main parts.

1) What I call the Scrape-builder. This is essentially a fancy web browser, but it has a rich UI for selecting portions of a web page, exposing the appropriate DOM elements, and recording how to find those elements in the page. By "expose," I mean put into some sort of editor/IDE; it could be raw HTML or some sort of description language. In the editor, the elements one wants to scrape can then be selected and put into some sort of object for later processing. This can include some scripting to mangle the data as needed, as well as interactions with the JavaScript on the page, such as recording click macros (well, event firing and such). The point of this component is to let content experts and non- or novice programmers easily arrange for the "interesting" data to be selected for scraping.

2) The second component of the suite is the scraping engine. It uses the descriptions + macros + scripts from the Scrape-builder to actually pull data from pages and turn it into data objects (roughly sketched below), which can then be put on a queue for later processing by backend systems/code. The scraping engine is basically a stripped-down WebKit with the rendering/layout/display bits compiled out: it just builds the DOM and executes the page's JavaScript in order to scrape the selected bits. It is driven by the spidering engine.

3) The spidering engine determines which pages to point the scraping engine at. It can be fed by external code, or by scripts from the scraping engine as a feedback mechanism (some links on a page may be part of the current scrape, others may just be fodder for a later one). Think of it as a work queue for the scraping engines (also sketched below).

The main use cases I see for this are specialized search and aggregation engines that want to get at the interesting bits of sites which don't expose a good API, or where the data is well formatted but hard to semantically infer without human intervention. Sure, it wouldn't be as efficient in terms of code execution as, say, custom scraping scripts, but it would allow much faster response to page changes and make better use of programmer time by taking care of the boilerplate or almost-boilerplate parts of scraping scenarios.
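
To make the data-object idea from (1) and (2) a bit more concrete, here is a very rough Python sketch of what a Scrape-builder description and a scraping pass over a parsed page could look like. The rule format, the class names, and the use of ElementTree's limited XPath are all placeholders of my own invention; a real engine would be walking a headless WebKit DOM and replaying recorded events instead.

```python
# Rough sketch (Python, stdlib only) of the "data object" a Scrape-builder
# might emit and how a scraping engine could apply it.  The rule format and
# class names are assumptions for illustration, not a real API.
from __future__ import annotations
from dataclasses import dataclass, field
from xml.etree import ElementTree


@dataclass
class ScrapeRule:
    """One 'interesting bit' selected in the Scrape-builder UI."""
    name: str                      # key in the resulting data object
    path: str                      # how to find the element (XPath-ish here)
    attribute: str | None = None   # pull an attribute instead of the text


@dataclass
class ScrapeDescription:
    """Everything the builder hands to the scraping engine."""
    rules: list[ScrapeRule] = field(default_factory=list)
    click_macro: list[str] = field(default_factory=list)  # recorded events, replayed before scraping


def apply_description(doc_root: ElementTree.Element, desc: ScrapeDescription) -> dict:
    """Turn one parsed page into a plain data object for the backend queue."""
    record = {}
    for rule in desc.rules:
        element = doc_root.find(rule.path)
        if element is None:
            record[rule.name] = None
        elif rule.attribute:
            record[rule.name] = element.get(rule.attribute)
        else:
            record[rule.name] = (element.text or "").strip()
    return record


if __name__ == "__main__":
    page = ElementTree.fromstring(
        "<html><body>"
        "<h1 class='title'>Widget 3000</h1>"
        "<span class='price'>$19.99</span>"
        "<a class='next' href='/page/2'>next</a>"
        "</body></html>"
    )
    desc = ScrapeDescription(rules=[
        ScrapeRule("title", ".//h1[@class='title']"),
        ScrapeRule("price", ".//span[@class='price']"),
        ScrapeRule("next_page", ".//a[@class='next']", attribute="href"),
    ])
    print(apply_description(page, desc))
    # {'title': 'Widget 3000', 'price': '$19.99', 'next_page': '/page/2'}
```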
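
And the spidering engine from (3), reduced to its essence: a deduplicating work queue where each scrape can feed more URLs back in. The `scrape` callable stands in for the scraping engine above, and the seed URLs plus the tiny fake site are made up so the sketch runs on its own, without any network access.

```python
# Rough sketch of the spidering engine as a work queue with a feedback loop:
# each scrape returns a data object plus any follow-up URLs it wants visited.
from collections import deque
from typing import Callable, Iterable


def spider(seeds: Iterable[str],
           scrape: Callable[[str], tuple[dict, list[str]]],
           max_pages: int = 100) -> list[dict]:
    """Drive the scraping engine: pop a URL, scrape it, queue any links it feeds back."""
    queue = deque(seeds)
    seen = set(queue)
    results = []
    while queue and len(results) < max_pages:
        url = queue.popleft()
        record, follow_links = scrape(url)   # the scraping engine does the real work
        results.append(record)
        for link in follow_links:            # feedback: links chosen by the scrape scripts
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return results


if __name__ == "__main__":
    # Fake two-page site so the sketch runs stand-alone.
    fake_site = {
        "http://example.com/1": ({"title": "page 1"}, ["http://example.com/2"]),
        "http://example.com/2": ({"title": "page 2"}, []),
    }
    print(spider(["http://example.com/1"], lambda url: fake_site[url]))
    # [{'title': 'page 1'}, {'title': 'page 2'}]
```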