Htmlq: like jq, but for HTML

961 points by jabo over 3 years ago

41 comments

triska over 3 years ago

This is very nice!

For reasoning about tree-based data such as HTML, I also highly recommend the declarative programming language Prolog. HTML documents map naturally to Prolog terms and can be readily reasoned about with built-in language mechanisms. For instance, here is the sample query from the htmlq README, fetching all elements with id get-help from <a href="https://www.rust-lang.org" rel="nofollow">https://www.rust-lang.org</a>, using Scryer Prolog and its SGML and HTTP libraries in combination with the XPath-inspired query language from library(xpath):

<pre><code>?- http_open("https://www.rust-lang.org", Stream, []),
   load_html(stream(Stream), DOM, []),
   xpath(DOM, //(*(@id="get-help")), E).
</code></pre>

Yielding:

<pre><code>E = element(div,[class="flex flex-colum ...",id="get-help"],["\n ",element(h4,[],["Get help!"]),"\n ",element(ul,[],["\n ...",element(li,[],[element(a,[... = ...],[...])]),"\n ...",element(li,[],[...]),...|...]),"\n ...",element(div,[class="la ..."],["\n ...",element(label,[...],[...]),...|...]),"\n ..."])
; false.
</code></pre>

The selector //(*(@id="get-help")) is used to obtain all HTML elements whose id attribute is get-help. On backtracking, all solutions are reported.

The other example from the README, extracting all links from the page, can be obtained with Scryer Prolog like this:

<pre><code>?- http_open("https://www.rust-lang.org", Stream, []),
   load_html(stream(Stream), DOM, []),
   xpath(DOM, //a(@href), Link),
   portray_clause(Link),
   false.
</code></pre>

This query uses forced backtracking to write all links on standard output, yielding:

<pre><code>"/".
"/tools/install".
"/learn".
"https://play.rust-lang.org/".
"/tools".
"/governance".
"/community".
"https://blog.rust-lang.org/".
"/learn/get-started".
etc.
</code></pre>

dfederschmidt over 3 years ago

This looks very useful, big fan of all the ^[a-z]+q$ utilities out there. But as a user, I would probably want to use XPath[0] notation here. Maybe that is just me. A quick search revealed xidel[1] which seems to be similar, but supports XPath.

[0] <a href="https://en.wikipedia.org/wiki/XPath" rel="nofollow">https://en.wikipedia.org/wiki/XPath</a>
[1] <a href="https://github.com/benibela/xidel" rel="nofollow">https://github.com/benibela/xidel</a>

gizdan over 3 years ago

Once upon a time I was using pup[0] for this sort of thing; later I switched to cascadia[1], which seemed much more advanced.

Comparing the two repos, it seems pup is dead, but cascadia may not be.

These tools, including htmlq, seem to sell themselves as "jq for html", which is far from the truth. jq is closer to awk, in that you can do just about everything with JSON. Cascadia, htmlq, and pup seem closer to grep for HTML. They can essentially only select data from an HTML source.

[0] <a href="https://github.com/EricChiang/pup" rel="nofollow">https://github.com/EricChiang/pup</a>
[1] <a href="https://github.com/suntong/cascadia" rel="nofollow">https://github.com/suntong/cascadia</a>

harperlee over 3 years ago

Nice! This is the kind of obvious tool that, once it exists, you can't really grok the fact that it did not exist earlier, and that it took until now to appear.

srg0 over 3 years ago

"htmlq: like jq, but for HTML"

"jq is like sed for JSON data"

sed: "While in some ways similar to an editor which permits scripted edits (_such as ed_), sed works by making only one pass over the input(s)"

ed: "ed is a line-oriented text editor".

Defining software through a reference to other software is somewhat confusing. Potential users come from different backgrounds (I had no idea what jq is), and it is not clear what the defining features of each project are. Is jq line oriented? Is htmlq operating in a single pass?

desktopninja over 3 years ago

Very nice tool. I've long spoiled myself with PowerShell's Invoke-WebRequest, e.g.:

<pre><code># What is the latest release of apache-tomcat?
# Collect every link on the download page.
$LINKS=$(Invoke-WebRequest -Uri 'https://tomcat.apache.org/download-80.cgi' | Select-Object -ExpandProperty Links)
# The in-page anchor (e.g. #8.5.x) carries the version; drop the leading '#'.
$LATEST=$($Links | Where-Object -Property href -Match '#8.5.[0-9]+').href.substring(1)
# Pick the download link for the matching zip archive.
$FETCH=$($Links | Where-Object -Property href -match "apache-tomcat-${LATEST}.zip$").href
</code></pre>

dredmorbius over 3 years ago

See also the html-xml-utils from W3C.

hxextract and hxselect perform similar extract functions.

hxclean and hxnormalize (combined) will pretty-print HTML.

<a href="https://www.w3.org/Tools/HTML-XML-utils/" rel="nofollow">https://www.w3.org/Tools/HTML-XML-utils/</a>

ducktective over 3 years ago

Looks nice! Any comparisons with pup?

<a href="https://github.com/ericchiang/pup" rel="nofollow">https://github.com/ericchiang/pup</a>

soheil over 3 years ago

I'd use something like this script that you can put together yourself:

<pre><code>#!/usr/bin/env ruby
require 'nokogiri'
p Nokogiri::HTML(STDIN.read).css(ARGV[0]).text
</code></pre>

Just save it to a file in your /usr/local/bin/hq and do chmod +x !$

Then you can do:

<pre><code>curl -s "https://news.ycombinator.com/news" | hq "tr:first-child .storylink"
</code></pre>

It uses Nokogiri[0], which is much more battle tested and works with CSS and XPath selectors.

[0] <a href="https://nokogiri.org/tutorials/parsing_an_html_xml_document.html#from-a-string" rel="nofollow">https://nokogiri.org/tutorials/parsing_an_html_xml_document....</a>

pkrumins over 3 years ago

Call it "hq".

d--b over 3 years ago

Just being that guy: is there a reason you didn't call it hq?

mcovalt over 3 years ago

I'd like to see a tool using lol-html [0] and their CSS selector API as a streaming HTML editor.

[0] <a href="https://github.com/cloudflare/lol-html" rel="nofollow">https://github.com/cloudflare/lol-html</a>

chefandy over 3 years ago

If anyone is looking for a good library to do this in Python, PyQuery works well: <a href="https://pythonhosted.org/pyquery/" rel="nofollow">https://pythonhosted.org/pyquery/</a>

hyperpallium2 over 3 years ago

From examples, this is only like jq in the sense that the q stands for the same thing. Even the way it does that is different.

An xmlq that was really like jq would be fun, about 20 years ago.

andybak over 3 years ago

> like jq

"jq is a lightweight and flexible command-line JSON processor"

jillesvangurp over 3 years ago

If you make the HTML well formed, XPath also works great. Great stuff if you ever need to pick HTML apart. I used this quite a bit, together with jtidy, when microformats were still a thing.

jq is very loosely inspired by that, I guess. Might come full circle here and use some XSL transformations ...
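
To illustrate the point about well-formed markup, here is a minimal sketch using only Python's standard library. It assumes the input has already been tidied into well-formed XML (the commenter used jtidy for that step; it is not shown here). The `PAGE` snippet is made up for illustration, and `xml.etree.ElementTree` supports only a subset of XPath, so this is a demonstration of the idea rather than a full XPath workflow.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed snippet: every tag is closed and every
# attribute is quoted, which is what an XML parser requires.
PAGE = """
<html>
  <body>
    <div id="get-help">
      <h4>Get help!</h4>
      <ul>
        <li><a href="/learn">Documentation</a></li>
        <li><a href="https://users.rust-lang.org">Forum</a></li>
      </ul>
    </div>
    <div id="other"><a href="/tools">Tools</a></div>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)

# Attribute predicates are part of the XPath subset ElementTree understands.
help_box = root.find(".//*[@id='get-help']")

# Collect the href of every link below the selected element.
links = [a.get("href") for a in help_box.findall(".//a[@href]")]
print(links)
```

Real-world HTML with unclosed tags would make `ET.fromstring` raise a parse error, which is why the tidy step matters.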

mro_name over 3 years ago

It's statically linkable Rust, isn't it? Awesome. I'm looking for a successor to

$ xmllint --html --xpath …

that doesn't choke on inline SVG.

pabs3 over 3 years ago

I tend to reach for XPath selectors before CSS ones when querying HTML.

necovek over 3 years ago

Nice, I expected something based on XPath (like xpd), but web developers dealing with HTML are infinitely more familiar with CSS selectors, so a great choice!

bandie91 over 3 years ago

shameless self promotion: parsel[0] is a python script in front of the identically named python lib, and extracts parts of the HTML by CSS selector. the advantage of it compared to most similar tools is that you can navigate in the DOM tree up and down to find precisely what you want if the HTML is poorly marked up, or the searched parts are not close to each other.

[0] <a href="https://github.com/bAndie91/tools/blob/master/usr/bin/parsel" rel="nofollow">https://github.com/bAndie91/tools/blob/master/usr/bin/parsel</a>
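
The idea of moving up and down the tree can be sketched with Python's standard library alone. This is not how parsel does it; it is an illustrative stand-in that assumes well-formed input, and the `PAGE` table, `parent_of` map, and `value_for` helper are all made-up names. The sketch finds a poorly marked-up value by locating its label, stepping up to the enclosing row, and then stepping back down to the neighbouring cell.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup where the value cell has no class or id of its own.
PAGE = """
<table>
  <tr><td class="label">Price</td><td>42 EUR</td></tr>
  <tr><td class="label">Stock</td><td>7</td></tr>
</table>
""".strip()

root = ET.fromstring(PAGE)

# ElementTree elements carry no parent pointer, so build a child -> parent map once.
parent_of = {child: parent for parent in root.iter() for child in parent}


def value_for(label):
    """Return the text of the cell that follows the label cell, or None."""
    for cell in root.iter("td"):
        if cell.get("class") == "label" and cell.text == label:
            row = parent_of[cell]      # navigate up to the enclosing <tr>
            cells = list(row)          # navigate back down to its children
            position = cells.index(cell)
            if position + 1 < len(cells):
                return cells[position + 1].text
    return None


print(value_for("Price"))
```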

parhamn over 3 years ago

I've been looking for a library that can find the best set of selectors to most consistently find the element you're looking for in a page.

Any pointers to something that exists? Interestingly I've also found very little for DOM extraction in the OS ML space.

oauea over 3 years ago

<a href="https://jsoup.org/" rel="nofollow">https://jsoup.org/</a> has been around for a long time and seems a bit more mature & maintained than this two-code-files 2-year-old repo. Highly recommend.

downWidOutaFite over 3 years ago

Why? I find xpath's syntax much simpler and more regular than jq's.

ludovicianul over 3 years ago

And a Java version with pre-compiled binaries: <a href="https://github.com/ludovicianul/hq" rel="nofollow">https://github.com/ludovicianul/hq</a>

firefoxd over 3 years ago

Super useful. You've created a fantastic tool here. Thank you.

notorandit over 3 years ago

Next is xmlq: <a href="https://github.com/dscape/xmlq" rel="nofollow">https://github.com/dscape/xmlq</a>

elif over 3 years ago

When I saw the title I thought this was some machine learning-specific rmq/0mq message passing tech called HT. Very excited to zero.

systemvoltage over 3 years ago

This is nifty! Python + bs4 takes some googling to remember how to parse a webpage. This is just straightforward, thanks so much.
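
For comparison, here is roughly what the "extract all links" task looks like in plain Python. This sketch deliberately swaps bs4 for the standard library's `html.parser`, so it runs without extra packages; `LinkCollector`, `extract_links`, and the `SAMPLE` markup are made-up names for illustration, and it covers only the href-listing case, not general CSS selection.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(html_text):
    """Return all link targets found in html_text, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


# Hypothetical input; note the uppercase tag, the anchor without href,
# and the unclosed elements, which the tolerant parser accepts.
SAMPLE = '<p><a href="/learn">Learn</a> <a name="x">no href</a> <A HREF="/tools">Tools<br></p>'
print(extract_links(SAMPLE))
```

Reading the page from standard input (`sys.stdin.read()`) instead of `SAMPLE` would make this usable behind `curl` in a pipeline.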

gravypod over 3 years ago

This looks really cool! I'd love to see a generic query language/tool library for structured data.

purplecats over 3 years ago

brilliant. does this spin up a heavy DOM implementation in the background or do something lighter such as regexp?

jhatemyjob over 3 years ago

Crazy how a 300-line codebase manages to amass 2000 stars on Github and 700 upvotes on HN. Amazing ROI.

abledon over 3 years ago

is anyone else using the <a href="https://github.com/json-path/JsonPath" rel="nofollow">https://github.com/json-path/JsonPath</a> over the jq route?

I hope we standardize on some jq query language, like we have with a base set of SQL syntax

teitoklien over 3 years ago

Maybe call it hq?

gigatexal over 3 years ago

This is very cool. This will make scraping the web even easier!

bamdadd over 3 years ago

is there a brew install command?

rendall over 3 years ago

What is jq?

who-shot-jr over 3 years ago

Good work!

Snd_ over 3 years ago

This is great! Thanks

unityByFreedom over 3 years ago

Why not just jquery?

avereveard over 3 years ago

what's wrong with using html tidy + xmllint?

zatkin over 3 years ago

Why not incorporate this into jq itself, like perhaps adding some command line arguments to switch to HTML mode?
