Htmlq: like jq, but for html

961 points by jabo over 3 years ago

41 comments

triska over 3 years ago
This is very nice!

For reasoning about tree-based data such as HTML, I also highly recommend the declarative programming language Prolog. HTML documents map naturally to Prolog terms and can be readily reasoned about with built-in language mechanisms. For instance, here is the sample query from the htmlq README, fetching all elements with id get-help from https://www.rust-lang.org, using Scryer Prolog and its SGML and HTTP libraries in combination with the XPath-inspired query language from library(xpath):

    ?- http_open("https://www.rust-lang.org", Stream, []),
       load_html(stream(Stream), DOM, []),
       xpath(DOM, //(*(@id="get-help")), E).

Yielding:

    E = element(div,[class="flex flex-colum ...",id="get-help"],["\n ",element(h4,[],["Get help!"]),"\n ",element(ul,[],["\n ...",element(li,[],[element(a,[... = ...],[...])]),"\n ...",element(li,[],[...]),...|...]),"\n ...",element(div,[class="la ..."],["\n ...",element(label,[...],[...]),...|...]),"\n ..."])
    ;  false.

The selector //(*(@id="get-help")) is used to obtain all HTML elements whose id attribute is get-help. On backtracking, all solutions are reported.

The other example from the README, extracting all links from the page, can be obtained with Scryer Prolog like this:

    ?- http_open("https://www.rust-lang.org", Stream, []),
       load_html(stream(Stream), DOM, []),
       xpath(DOM, //a(@href), Link),
       portray_clause(Link),
       false.

This query uses forced backtracking to write all links on standard output, yielding:

    "/".
    "/tools/install".
    "/learn".
    "https://play.rust-lang.org/".
    "/tools".
    "/governance".
    "/community".
    "https://blog.rust-lang.org/".
    "/learn/get-started".
    etc.
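For comparison, the same two queries (elements with a given id, and all link targets) can be sketched with nothing but Python's standard library. This is only an illustration, not how htmlq or Scryer Prolog work internally; the class name `IdAndLinkCollector` and the sample markup are made up for the example, and the page is not fetched over HTTP.

```python
from html.parser import HTMLParser


class IdAndLinkCollector(HTMLParser):
    """Collect tags carrying a given id, plus every href found on <a> tags."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.matches = []  # (tag, attribute dict) for each element with the id
        self.links = []    # href values in document order

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if attr_map.get("id") == self.wanted_id:
            self.matches.append((tag, attr_map))
        if tag == "a" and attr_map.get("href") is not None:
            self.links.append(attr_map["href"])


# Hypothetical sample markup, loosely modelled on the output quoted above.
SAMPLE_HTML = """
<div class="flex" id="get-help">
  <h4>Get help!</h4>
  <ul>
    <li><a href="/learn">Learn</a></li>
    <li><a href="https://play.rust-lang.org/">Playground</a></li>
  </ul>
</div>
"""

collector = IdAndLinkCollector("get-help")
collector.feed(SAMPLE_HTML)
print(collector.matches)
print(collector.links)
```

Unlike the Prolog term, this event-based approach records only the opening tag and its attributes, not the element's subtree; building a full tree would need a real DOM or tree parser.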
dfederschmidt over 3 years ago
This looks very useful, big fan of all the ^[a-z]+q$ utilities out there. But as a user, I would probably want to use XPath[0] notation here. Maybe that is just me. A quick search revealed xidel[1] which seems to be similar, but supports XPath.

[0] https://en.wikipedia.org/wiki/XPath
[1] https://github.com/benibela/xidel
gizdan over 3 years ago
Once upon a time I was using pup[0] for this sort of thing; later I switched to cascadia[1], which seemed much more advanced.

Comparing the two repos, it seems pup is dead, but cascadia may not be.

These tools, including htmlq, seem to sell themselves as "jq for html", which is far from the truth. jq is closer to awk, in that you can do just about everything with JSON. Cascadia, htmlq, and pup seem closer to grep for HTML. They can essentially only select data from an HTML source.

[0] https://github.com/EricChiang/pup
[1] https://github.com/suntong/cascadia
harperlee over 3 years ago
Nice!

This is the kind of obvious tool that, once it exists, makes it hard to grok that it did not exist earlier and that it took until now to appear.
srg0 over 3 years ago
"htmlq: like jq, but for HTML"

"jq is like sed for JSON data"

sed: "While in some ways similar to an editor which permits scripted edits (_such as ed_), sed works by making only one pass over the input(s)"

ed: "ed is a line-oriented text editor".

Defining software by reference to other software is somewhat confusing. Potential users come from different backgrounds (I had no idea what jq is), and it is not clear what the defining features of each project are. Is jq line oriented? Is htmlq operating in a single pass?
desktopninja over 3 years ago
Very nice tool. I've long spoiled myself with PowerShell's Invoke-WebRequest, e.g.:

    # What is the latest release of apache-tomcat?
    $LINKS = $(Invoke-WebRequest -Uri 'https://tomcat.apache.org/download-80.cgi' | Select-Object -ExpandProperty Links)
    # Find the in-page anchor (e.g. "#8.5.x") and strip the leading "#"
    $LATEST = $($LINKS | Where-Object -Property href -Match '#8.5.[0-9]+').href.substring(1)
    # Find the download link for the matching .zip archive
    $FETCH = $($LINKS | Where-Object -Property href -Match "apache-tomcat-${LATEST}.zip$").href
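The two-step filtering in that snippet (find the version anchor, then find the matching archive link) can be sketched in Python as follows. The list of hrefs is hypothetical stand-in data rather than the real contents of the Tomcat download page, and unlike the PowerShell original the dots in the patterns are escaped so they match literally.

```python
import re

# Hypothetical hrefs standing in for the links on the download page.
hrefs = [
    "#8.5.70",
    "https://example.invalid/tomcat-8/v8.5.70/bin/apache-tomcat-8.5.70.zip",
    "https://example.invalid/tomcat-8/v8.5.70/bin/apache-tomcat-8.5.70.tar.gz",
]

# Step 1: find the in-page anchor such as "#8.5.70" and drop the leading "#".
anchor = next(h for h in hrefs if re.search(r"#8\.5\.[0-9]+", h))
latest = anchor[1:]

# Step 2: find the link whose name ends in apache-tomcat-<version>.zip.
zip_pattern = re.compile(r"apache-tomcat-" + re.escape(latest) + r"\.zip$")
fetch = next(h for h in hrefs if zip_pattern.search(h))

print(latest)
print(fetch)
```

Using `next(...)` takes the first match only; a page listing several matching anchors would need an explicit choice of which one counts as the latest.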
dredmorbius over 3 years ago
See also the html-xml-utils from w3c.

hxextract and hxselect perform similar extract functions.

hxclean and hxnormalize (combined) will pretty-print HTML.

https://www.w3.org/Tools/HTML-XML-utils/
ducktective over 3 years ago
Looks nice! Any comparisons with pup?

https://github.com/ericchiang/pup
soheil over 3 years ago
I'd use something like this script that you can put together yourself:

    #!/usr/bin/env ruby
    require 'nokogiri'
    p Nokogiri::HTML(STDIN.read).css(ARGV[0]).text

Just save it as /usr/local/bin/hq and do chmod +x !$

Then you can do:

    curl -s "https://news.ycombinator.com/news" | hq "tr:first-child .storylink"

It uses Nokogiri[0], which is much more battle tested and works with CSS and XPath selectors.

[0] https://nokogiri.org/tutorials/parsing_an_html_xml_document.html#from-a-string
pkrumins over 3 years ago
Call it "hq".
d--b over 3 years ago
Just being that guy: is there a reason you didn't call it hq?
mcovalt over 3 years ago
I'd like to see a tool using lol-html [0] and their CSS selector API as a streaming HTML editor.

[0] https://github.com/cloudflare/lol-html
chefandy over 3 years ago
If anyone is looking for a good library to do this in Python, PyQuery works well:

https://pythonhosted.org/pyquery/
hyperpallium2 over 3 years ago
From examples, this is only like jq in the sense that the q stands for the same thing. Even the way it does that is different.

An xmlq that was really like jq would be fun, about 20 years ago.
andybak over 3 years ago
> like jq

"jq is a lightweight and flexible command-line JSON processor"
jillesvangurp over 3 years ago
If you make the html well formed, xpath also works great. Great stuff if you ever need to pick html apart. Used this quite a bit when microformats were still a thing together with jtidy.

Jq is very loosely inspired by that, I guess. Might come full circle here and use some XSL transformations ...
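As a rough illustration of XPath on well-formed markup, Python's standard `xml.etree.ElementTree` supports a limited XPath subset that covers the two README-style queries. The sample document below is hypothetical and already well formed; real-world HTML would first need a tidying step, which is not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed markup.
XHTML = """<html><body>
<div id="get-help"><h4>Get help!</h4>
<ul><li><a href="/learn">Learn</a></li><li><a href="/tools">Tools</a></li></ul>
</div>
</body></html>"""

root = ET.fromstring(XHTML)

# All elements, at any depth, whose id attribute equals "get-help".
help_elements = root.findall(".//*[@id='get-help']")

# The href of every <a> element that has one.
hrefs = [a.get("href") for a in root.findall(".//a[@href]")]

print([e.tag for e in help_elements])
print(hrefs)
```

ElementTree implements only part of XPath 1.0, so more involved expressions would need a fuller XPath engine.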
mro_name over 3 years ago
it's statically linkable rust, isn't it? Awesome. I'm looking for a successor to

$ xmllint --html --xpath …

that doesn't choke on inline svg.
pabs3 over 3 years ago
I tend to reach for XPath selectors before CSS ones when querying HTML.
necovek over 3 years ago
Nice, I expected something based on XPath (like xpd), but web developers dealing with HTML are infinitely more familiar with CSS selectors, so a great choice!
bandie91 over 3 years ago
shameless self promotion: parsel[0] is a python script in front of the identically named python lib, and extracts parts of the HTML by CSS selector. the advantage of it compared to most similar tools is that you can navigate in the DOM tree up and down to find precisely what you want if the HTML is poorly marked up, or the searched parts are not close to each other.

[0] https://github.com/bAndie91/tools/blob/master/usr/bin/parsel
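To make the idea of moving up and down the tree concrete, here is a minimal standard-library sketch. It does not use parsel and says nothing about how that script is implemented; the markup, the `parent_of` map, and the `label_for` helper are all hypothetical names chosen for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup where the label and the value sit in sibling cells.
DOC = """<table>
<tr><td>Name</td><td><span class="value">htmlq</span></td></tr>
<tr><td>Language</td><td><span class="value">Rust</span></td></tr>
</table>"""

root = ET.fromstring(DOC)

# ElementTree elements hold no parent pointer, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def label_for(value_span):
    """Walk up from a value <span> to its row, then down to the first cell."""
    cell = parent_of[value_span]  # up: <span> -> <td>
    row = parent_of[cell]         # up: <td> -> <tr>
    return row.find("td").text    # down: first <td> child of that row


pairs = {
    label_for(span): span.text
    for span in root.iter("span")
    if span.get("class") == "value"
}
print(pairs)
```

The point of the pattern is that the value is easy to select but carries no label of its own, so the label has to be reached by going up to a common ancestor and back down.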
parhamn over 3 years ago
I've been looking for a library that can find the best set of selectors to most consistently find the element you're looking for in a page.

Any pointers to something that exists? Interestingly I've also found very little for dom extraction in the OS ML space.
oauea over 3 years ago
https://jsoup.org/ has been around for a long time and seems a bit more mature & maintained than this two-code-files 2-year-old repo. Highly recommend.
downWidOutaFite over 3 years ago
Why? I find xpath's syntax much simpler and more regular than jq's.
ludovicianul over 3 years ago
And a Java version with pre-compiled binaries: https://github.com/ludovicianul/hq
firefoxd over 3 years ago
Super useful. You've created a fantastic tool here. Thank you.
notorandit over 3 years ago
Next is xmlq: https://github.com/dscape/xmlq
elif over 3 years ago
When I saw the title I thought this was some machine learning-specific rmq/0mq message passing tech called HT. Very excited to zero.
systemvoltage over 3 years ago
This is nifty! Python + bs4 takes some googling to remember how to parse a webpage. This is just straightforward, thanks so much.
gravypod over 3 years ago
This looks really cool! I'd love to see a generic query language/tool library for structured data.
purplecats over 3 years ago
brilliant. does this spin up a heavy DOM implementation in the background or do something lighter such as regexp?
jhatemyjob over 3 years ago
Crazy how a 300-line codebase manages to amass 2000 stars on Github and 700 upvotes on HN. Amazing ROI.
abledon over 3 years ago
is anyone else using the https://github.com/json-path/JsonPath over the jq route?

I hope we standardize on some jq query language, like we have with a base set of SQL syntax
teitoklien over 3 years ago
Maybe call it hq ?
gigatexal over 3 years ago
This is very cool. This will make scraping the web even easier!
bamdadd over 3 years ago
is there a brew install command ?
rendall over 3 years ago
What is jq?
who-shot-jr over 3 years ago
Good work!
Snd_ over 3 years ago
This is great! Thanks
unityByFreedom over 3 years ago
Why not just jquery?
avereveard over 3 years ago
what's wrong with using html tidy + xmllint ?
zatkin over 3 years ago
Why not incorporate this into jq itself, like perhaps adding some command line arguments to switch to HTML mode?