This looks like it might come in handy.<p>I started working with Web scraping roughly around '95 (initially for a personalized-newspaper metaphor for Web software agent reporting), and wrote HtmlChewer, an HTML parser in Java designed for that purpose. A while later, I moved my rapid R&D work to Scheme, where I wrote the `htmlprag` permissive parser, now known as the `html-parsing` package in Racket and other Scheme dialects.<p>By the time I was using Scheme, my scraping usually started with XPath, to get a starting-point subtree of the DOM, then used a mix of arbitrary code and sometimes a proprietary pattern-based destructuring DSL to extract info from the subtree. And sometimes I ran filtering/transformation algorithms across a big free-form-ish text subtree (e.g., for simplifying the articles a custom crawler scraped, to build a labeled corpus for an ML research project).<p>Of course, Web scraping has always had resilience problems, even as the Web changed dramatically.<p>In general, my scraping methods usually end up hand-crafted (and this started before in-browser development tools with element pickers and DOM editors), and much of the guesswork/art of it was in coming up with queries and transforms that seemed like they might keep working the next time the site changed its HTML. In 2004 I made a small tool to automate a "starting point" for hand-crafting such an XPath query: <a href="https://www.neilvandyke.org/racket/webscraperhelper/" rel="nofollow">https://www.neilvandyke.org/racket/webscraperhelper/</a>
I recently discovered <a href="https://github.com/hermit-crab/ScrapeMate#readme" rel="nofollow">ScrapeMate</a> and <a href="https://github.com/cantino/selectorgadget" rel="nofollow">selectorgadget</a>, both available as Chrome extensions, which can come in handy for quick scraping.<p>There may also be opportunities to find better selectors using machine learning.
Would be nice if the README showed an actual example; right now it just trails off:<p><pre><code>  let element = robulaPlus.getElementByXPath('/html/body/div/span/a',
      document);
  robulaPlus.getRobustXPath(element, document);
  // what's the result?</code></pre>
Is there a name for the concept of automatically generating (potentially with machine learning?) selectors?<p>I feel like I’ve seen similar projects come across HN, but I’m at a loss for what to search for.
The algorithm described in the paper is outlined as follows (just for my curiosity):<p>"The algorithm starts with a generic XPath locator that returns all nodes (‘//*’) and then it iteratively refines the locator until only the element of interest is selected. In such iterative refinement, ROBULA+ applies seven refinement transformations, according to a set of heuristic XPath specialization steps."<p>The algorithm seems to be a specialized heuristic for XPath generation.
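For concreteness, here is a minimal runnable sketch of that refinement loop in JavaScript. It implements only three of the seven transformations, in simplified form; robustXPathSketch, refine, and selectsOnly are illustrative names, not the library's API.<p><pre><code>  // Check whether an XPath selects exactly the target element.
  function selectsOnly(xpath, target, doc) {
    const r = doc.evaluate(xpath, doc, null,
        XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
    return r.snapshotLength === 1 && r.snapshotItem(0) === target;
  }

  // Three simplified specialization steps (of the paper's seven).
  function refine(xpath, target) {
    const out = [];
    // transfConvertStar: '//*' -> '//h1'
    if (xpath.startsWith('//*')) {
      out.push('//' + target.tagName.toLowerCase() + xpath.slice(3));
    }
    // transfAddAttribute: add one [@attr='value'] predicate
    // (ignores quote escaping for brevity)
    for (const a of target.attributes) {
      out.push(xpath + "[@" + a.name + "='" + a.value + "']");
    }
    // transfAddLevel: prepend another generic ancestor level
    out.push('//*' + xpath.slice(1));
    return out;
  }

  // Breadth-first refinement, starting from the generic '//*'.
  function robustXPathSketch(target, doc) {
    const candidates = ['//*'];
    while (candidates.length > 0) {
      const xpath = candidates.shift();
      if (selectsOnly(xpath, target, doc)) {
        return xpath; // selects only the element of interest
      }
      candidates.push(...refine(xpath, target));
    }
  }</code></pre>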
I have been using pattern matching for web scraping. I think it is more robust than XPath, and at least more reliable at detecting invalid input.<p>Let's look at some of the Robula test cases:<p>Input:<p><pre><code> <head></head><body><h1 class="false"></h1><h1 class="false"></h1><h1 class="true"></h1><h1 class="false"></h1></body>
</code></pre>
Task: get the true element, <h1 class="true"></h1><p>XPath:<p><pre><code> //*[@class='true']
</code></pre>
Pattern matching:<p><pre><code> <h1 class="true">{.}</h1>
</code></pre>
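(As an aside, the XPath side of these test cases can be checked in a browser console with the standard document.evaluate API:)<p><pre><code>  // Evaluate the XPath from the example above against the page.
  const result = document.evaluate(
      "//*[@class='true']", document, null,
      XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
  for (let i = 0; i < result.snapshotLength; i++) {
    console.log(result.snapshotItem(i)); // logs the matching h1
  }</code></pre>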
Input:<p><pre><code> <head></head><body><h1 class="false" title="foo"></h1><h1 class="false" title="bar"></h1><h1 class="true" title="foo"></h1><h1 class="true" title="bar"></h1></body>
</code></pre>
Get <h1 class="true" title="foo"></h1><p>XPath:<p><pre><code> //*[@class='true' and @title='foo']
</code></pre>
Pattern matching:<p><pre><code> <h1 class="true" title="foo">{.}</h1>
</code></pre>
As you see, you do not need a new syntax for attributes. Input and pattern are the same!<p>Input:<p><pre><code> <h1></h1><h1></h1><h1></h1><h1></h1>
</code></pre>
Get the third element.<p>XPath:<p><pre><code> //*[3]
</code></pre>
Pattern matching:<p><pre><code> <h1></h1><h1></h1><h1>{.}</h1>
</code></pre>
Input:<p><pre><code> <head></head><body><h1></h1><h1></h1><div><h1></h1></div><h1></h1></body>
</code></pre>
Get the h1 in the div.<p>XPath:<p><pre><code> //div/*
</code></pre>
Pattern matching:<p><pre><code> <div><h1>{.}</h1></div>
</code></pre>
This last example gets at the real point of pattern matching: every part of the pattern must match. If the div is missing, it will report "div not found". If the h1 is missing inside the div, it will report "h1 not found". XPath, by contrast, will only report "found these elements" or "found nothing".
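To make that difference concrete, here is a minimal sketch of the step-by-step matching with standard DOM APIs (matchDivH1 is illustrative, not part of any library):<p><pre><code>  // Mimics matching the pattern <div><h1>{.}</h1></div> step by step,
  // reporting which part failed instead of just "found nothing".
  function matchDivH1(doc) {
    const div = doc.querySelector('div');
    if (!div) throw new Error('div not found');
    const h1 = div.querySelector('h1');
    if (!h1) throw new Error('h1 not found in div');
    return h1; // the {.} capture
  }

  // The equivalent XPath query can only report found / found nothing:
  // doc.evaluate('//div/*', doc, null,
  //     XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue</code></pre>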