This is very nice!<p>For reasoning about tree-based data such as HTML, I also highly recommend the declarative programming language Prolog. HTML documents map naturally to Prolog terms and can be readily reasoned about with built-in language mechanisms. For instance, here is the sample query from the htmlq README, fetching all elements with id <i>get-help</i> from <a href="https://www.rust-lang.org" rel="nofollow">https://www.rust-lang.org</a>, using Scryer Prolog and its SGML and HTTP libraries in combination with the XPath-inspired query language from library(xpath):<p><pre><code> ?- http_open("https://www.rust-lang.org", Stream, []),
    load_html(stream(Stream), DOM, []),
    xpath(DOM, //(*(@id="get-help")), E).
</code></pre>
Yielding:<p><pre><code> E = element(div,[class="flex flex-colum ...",id="get-help"],["\n ",element(h4,[],["Get help!"]),"\n ",element(ul,[],["\n ...",element(li,[],[element(a,[... = ...],[...])]),"\n ...",element(li,[],[...]),...|...]),"\n ...",element(div,[class="la ..."],["\n ...",element(label,[...],[...]),...|...]),"\n ..."])
; false.
</code></pre>
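As the answer shows, each element of the document is represented as a Prolog term element(Name, Attributes, Content), where Attributes is a list of Key=Value pairs and Content is a list of child nodes (text and nested elements). As a small illustration (the fragment is made up):<p><pre><code> %% The HTML fragment <a href="/learn">Learn</a>
 %% corresponds to the Prolog term:
 element(a, [href="/learn"], ["Learn"])
</code></pre>
Such terms can be inspected and decomposed with ordinary unification; no separate DOM API is needed.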
The selector //(*(@id="get-help")) matches all HTML elements whose <i>id</i> attribute is <i>get-help</i>; on backtracking, all solutions are reported.<p>The other example from the README, extracting all <i>links</i> from the page, can be expressed in Scryer Prolog like this:<p><pre><code> ?- http_open("https://www.rust-lang.org", Stream, []),
    load_html(stream(Stream), DOM, []),
    xpath(DOM, //a(@href), Link),
    portray_clause(Link),
    false.
</code></pre>
This query uses forced backtracking to write all links to standard output, yielding:<p><pre><code> "/".
"/tools/install".
"/learn".
"https://play.rust-lang.org/".
"/tools".
"/governance".
"/community".
"https://blog.rust-lang.org/".
"/learn/get-started".
etc.</code></pre>
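If a list of links is preferable to enumerating them on backtracking, findall/3 collects all solutions. As a sketch, assuming Scryer Prolog's default representation of double-quoted strings as lists of characters, one can additionally keep only absolute links via append/3 from library(lists) and remove duplicates with sort/2:<p><pre><code> ?- http_open("https://www.rust-lang.org", Stream, []),
    load_html(stream(Stream), DOM, []),
    findall(L, (xpath(DOM, //a(@href), L),
                append("https://", _, L)), Ls0),
    sort(Ls0, Ls).
</code></pre>
Ls then unifies with the sorted list of distinct absolute links, such as "https://blog.rust-lang.org/" and "https://play.rust-lang.org/".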