
Robula+: an algorithm to generate robust XPath-based locators

77 points by kamocyc almost 5 years ago

7 comments

neilv almost 5 years ago
This looks like it might come in handy.

I started working with Web scraping roughly around '95 (initially for a personalized-newspaper metaphor for Web software agent reporting), and wrote HtmlChewer, an HTML parser in Java designed for that purpose. A while later, I moved my rapid R&D work to Scheme, where I wrote the `htmlprag` permissive parser, now known as the `html-parsing` package in Racket and other Scheme dialects.

By the time I was using Scheme, my scraping usually started with XPath, to get a starting-point subtree of the DOM, then used a mix of arbitrary code and sometimes a proprietary pattern-based destructuring DSL to extract info from the subtree. And sometimes filtering/transformation algorithms across a big free-form-ish text subtree (e.g., for simplifying the articles of a site a custom crawler scraped, for building a labeled corpus for an ML research project).

Of course we've always had resilience problems for Web scraping, even as the Web changed dramatically.

In general, my scraping methods usually end up hand-crafted (and this started before in-browser development tools with element pickers and DOM editors), and much of the guesswork/art of it was in coming up with queries and transforms that seemed like they might keep working the next time the site changed its HTML. In 2004 I did make a small tool to automate a "starting point" for hand-crafting such an XPath query: https://www.neilvandyke.org/racket/webscraperhelper/
santa_boy almost 5 years ago
I recently discovered ScrapeMate (https://github.com/hermit-crab/ScrapeMate#readme) and selectorgadget (https://github.com/cantino/selectorgadget), both available as Chrome extensions, which can come in handy for quick scraping.

There are opportunities for better selectors that could possibly be found using machine learning (?)
kabacha almost 5 years ago
It would be nice if the readme showed an actual example; right now it just trails off:

    let element = robulaPlus.getElementByXPath('/html/body/div/span/a', document);
    robulaPlus.getRobustXPath(element, document);  // what's the result?
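One way to answer that question locally, sketched below on the assumption that the package exports a `RobulaPlus` class and that `getRobustXPath` returns an XPath string (neither is confirmed by the snippet above), is to evaluate the returned locator against the document and check that it selects the same element:

    // Assumed import and class name; adjust to whatever the package actually exports.
    import { RobulaPlus } from 'robula-plus';

    const robulaPlus = new RobulaPlus();
    const element = robulaPlus.getElementByXPath('/html/body/div/span/a', document);
    const robust = robulaPlus.getRobustXPath(element, document);   // assumed to be a string
    console.log('robust locator:', robust);

    // Round-trip check: does the generated locator select the element we started from?
    const roundTrip = document.evaluate(robust, document, null,
        XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;
    console.log('selects the same element:', roundTrip === element);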
WalterGR almost 5 years ago
Is there a name for the concept of automatically generating (potentially with machine learning?) selectors?

I feel like I’ve seen similar projects come across HN, but I’m at a loss for what to search for.
neilv almost 5 years ago
Non-paywall article PDF: https://www.researchgate.net/publication/299336358_Robula_An_algorithm_for_generating_robust_XPath_locators_for_web_testing
kamocyc almost 5 years ago
The algorithm described in the paper is outlined as follows (just for my curiosity):

"The algorithm starts with a generic XPath locator that returns all nodes ('//*') and then it iteratively refines the locator until only the element of interest is selected. In such iterative refinement, ROBULA+ applies seven refinement transformations, according to a set of heuristic XPath specialization steps."

So the algorithm seems to be a set of specialized heuristics for XPath generation.
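To make that loop concrete, here is a minimal TypeScript sketch of the refinement idea, assuming a browser DOM (document.evaluate). The `specialize` helper is hypothetical and only illustrates two of the seven transformations (add the tag name, add an attribute predicate), so it is not ROBULA+'s actual heuristic ordering:

    // Does this locator select exactly the target element and nothing else?
    function selectsOnly(xpath: string, target: Element, doc: Document): boolean {
        const res = doc.evaluate(xpath, doc, null,
            XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
        return res.snapshotLength === 1 && res.snapshotItem(0) === target;
    }

    // Hypothetical stand-in for the specialization steps: returns refined candidates.
    function specialize(xpath: string, target: Element): string[] {
        const refined: string[] = [];
        const tag = target.tagName.toLowerCase();
        if (xpath.startsWith('//*')) {
            // transformation: replace the generic '*' with the element's tag name
            refined.push('//' + tag + xpath.slice(3));
        }
        for (const attr of Array.from(target.attributes)) {
            const pred = `[@${attr.name}='${attr.value}']`;
            if (!xpath.includes(pred)) {
                // transformation: add one attribute predicate to the first step
                refined.push(xpath.replace(/^(\/\/[^\/\[]+)/, `$1${pred}`));
            }
        }
        return refined;
    }

    function generateRobustXPath(target: Element, doc: Document, maxRounds = 20): string {
        let candidates = ['//*'];                        // start with the most generic locator
        for (let round = 0; round < maxRounds; round++) {
            const next = new Set<string>();
            for (const xpath of candidates) {
                if (selectsOnly(xpath, target, doc)) {
                    return xpath;                        // locator is unique: done
                }
                specialize(xpath, target).forEach(x => next.add(x));
            }
            candidates = Array.from(next);
        }
        throw new Error('no unique locator found with this reduced transformation set');
    }

As the quoted paragraph says, the real algorithm applies all seven transformations with a heuristic priority; this sketch only shows the shape of the iterative-refinement loop.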
benibela almost 5 years ago
I have been using pattern matching for web scraping. I think it is more robust than XPath; at least it is more reliable at detecting invalid input.

Let's look at some of the Robula test cases.

Input:

    <head></head><body><h1 class="false"></h1><h1 class="false"></h1><h1 class="true"></h1><h1 class="false"></h1></body>

Task: get the true element, <h1 class="true"></h1>

XPath:

    //*[@class='true']

Pattern matching:

    <h1 class="true">{.}</h1>

Input:

    <head></head><body><h1 class="false" title="foo"></h1><h1 class="false" title="bar"></h1><h1 class="true" title="foo"></h1><h1 class="true" title="bar"></h1></body>

Task: get <h1 class="true" title="foo"></h1>

XPath:

    //*[@class='true' and @title='foo']

Pattern matching:

    <h1 class="true" title="foo">{.}</h1>

As you see, you do not need a new syntax for attributes. Input and pattern are the same!

Input:

    <h1></h1><h1></h1><h1></h1><h1></h1>

Task: get the third element.

XPath:

    //*[3]

Pattern matching:

    <h1></h1><h1></h1><h1>{.}</h1>

Input:

    <head></head><body><h1></h1><h1></h1><div><h1></h1></div><h1></h1></body>

Task: get the h1 in the div.

XPath:

    //div/*

Pattern matching:

    <div><h1>{.}</h1></div>

This last example is actually getting to the point of pattern matching: every part of the pattern must match. If the div is missing, it will report "div not found". If the h1 is missing in the div, it will report "h1 not found". But the XPath will just report "found these elements" or "found nothing".
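A minimal TypeScript sketch of that last point (not benibela's actual pattern-matching engine, just hand-rolled structural checks): the per-step checks can name exactly which part of the page went missing, whereas the equivalent XPath evaluation only yields an empty result:

    // Hypothetical illustration of the error-reporting difference, assuming a browser DOM.
    function getH1InDiv(doc: Document): string {
        // XPath version: if the page changes, all you learn is "found nothing".
        const viaXPath = doc.evaluate('//div/*', doc, null,
            XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;
        console.log('XPath result:', viaXPath ? viaXPath.textContent : 'found nothing');

        // Structural version: each expected element is checked and named,
        // so a failure says exactly which part of the structure is missing.
        const div = doc.querySelector('div');
        if (!div) throw new Error('div not found');
        const h1 = div.querySelector('h1');
        if (!h1) throw new Error('h1 not found inside the div');
        return h1.textContent ?? '';
    }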