TechEcho

Parsing Awk Is Tricky

110 points by oliverkwebb 9 months ago

9 comments

benhoyt 9 months ago
Brian Kernighan sent Gawk maintainer Arnold Robbins an email linking to this blog post with the comment "Hindsight has a lot of benefits, it would appear."

Peter Weinberger (quoted with permission) responded:

> That's interesting, Here's some thoughts/recollections. (remember that human memory is fallible.)

> 1. Using whitespace for string concatenation, in retrospect, was probably not the ideal choice (but '+' would not have worked).

> 2. Syntax choices were in part driven by the desire for our local C programmers to find it familiar.

> 3. As creatures of a specific time and place awk shared with C the (then endearing, now irritating) property of being underspecified.

> I think that collectively we understood YACC reasonably well. We tortured the grammar until the parser came close to doing what we wanted, and then we stopped. The tools then were more primitive, but they did fit in 64K of memory.

Al Aho also replied (quoted with permission):

> Peter's observation about torturing the grammar is apt! As awk grew in its early years, the grammar evolved with it and I remember agonizing to make changes to the grammar to keep it under control (understanding and minimizing the number of yacc-generated parsing-action conflicts) as awk evolved. I found yacc's ability to point out parsing-action conflicts very helpful during awk's development. Good grammar design was very much an art in those days (maybe even today).

It's fun to hear the perspectives of the original AWK creators. I've had some correspondence with Kernighan and Weinberger before, but I think that's the first time I've been on an email thread with all three of A, W, and K.
mmsc 9 months ago
Awk is something that I think every programmer, and especially every sysadmin, should learn. I like the comparison at the end and have never heard of nnawk or bbawk before.

I recently made a dashboard to compare the output of four versions of awk side by side, since not all awk scripts will run the same on each version: https://megamansec.github.io/awk-compare/ I'll have to add those :)
RodgerTheGreat 9 months ago
I think this is a good illustration of why parser-generator middleware like yacc is fundamentally misguided: it creates totally unnecessary gaps between design intent and the action of the parser. In a hand-rolled recursive descent parser, or even a set of PEG productions, ambiguities and complex lookahead or backtracking leap out at the programmer immediately.
teleforce 9 months ago
If you think AWK is hard to parse, then try C++. It is so hard to parse, and therefore so slow to compile, that it most probably inspired this programmer joke, one of the most popular XKCDs of all time [1].

Then along came modern fast-compiling languages like Go and D. D is a breath of fresh air: even though it is a complex language like C++ and Rust, it manages to compile very fast. It even has the RDMD facility, which can act as a compiled REPL as you interact with the prompt, similar to interpreted programming languages like Python and Matlab.

According to its author, the main reason D compiles so fast (as long as you avoid CTFE) is that its design decisions avoid the notorious symbols that complicate the symbol table, as happened in C++ with the popular << and >> overloading for I/O and shifting. That Rust came much later than C++ and D but is still slow to compile is bewildering, to say the least.

[1] Compiling: https://xkcd.com/303/
kazinator 9 months ago
If you are parsing awk, you must treat any run of whitespace that contains a newline as a visible token, which you have to reference in various places in the grammar. Your implementation will likely benefit from a switch in the lexical analyzer that sometimes turns off the visible newline.
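
A minimal Python sketch of that idea, not taken from any real awk implementation. The `Lexer` class, the `newline_visible` flag, the crude whitespace-split `WORD` tokens, and the `CONTINUES` set are all hypothetical names and simplifications chosen for illustration:

```python
import re

WS_RE = re.compile(r"[ \t\n]+")      # a run of blanks, possibly with newlines
WORD_RE = re.compile(r"[^ \t\n]+")   # crude stand-in for real awk tokens

# Tokens after which a newline is assumed NOT to end the statement
# (an illustrative subset, stated here as an assumption).
CONTINUES = {",", "{", "&&", "||", "do", "else"}


class Lexer:
    def __init__(self, src):
        self.src = src
        self.pos = 0
        self.newline_visible = True  # the switch the parser can flip

    def next_token(self):
        while self.pos < len(self.src):
            ws = WS_RE.match(self.src, self.pos)
            if ws:
                self.pos = ws.end()
                # Whitespace containing a newline becomes one NEWLINE token,
                # but only while the switch is on.
                if "\n" in ws.group() and self.newline_visible:
                    return ("NEWLINE", "\n")
                continue
            word = WORD_RE.match(self.src, self.pos)
            self.pos = word.end()
            return ("WORD", word.group())
        return ("EOF", "")


def tokens(src):
    """Drive the lexer, hiding newlines right after a continuing token."""
    lexer = Lexer(src)
    out = []
    while True:
        tok = lexer.next_token()
        if tok[0] == "EOF":
            return out
        out.append(tok)
        lexer.newline_visible = tok[1] not in CONTINUES
```

With this sketch, `tokens("a &&\n b\nc")` swallows the newline after `&&` but reports the one after `b`, so the grammar only has to mention NEWLINE where it can actually terminate something.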
ufo 9 months ago
Another tricky bit is deciding whether "/" is the division operator or the start of a regular expression.

IIRC, awk does this in a context-sensitive manner, by looking at the previous token.
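
A rough Python sketch of the previous-token approach, under stated assumptions: the token kinds, the `OPERAND_END` and `KEYWORDS` sets, and the `tokenize` function are hypothetical, and real awk implementations handle more cases (for example, a "/" inside a bracket expression within a regex).

```python
import re

# Token kinds that end an operand; a "/" right after one of these is division.
# This set is an assumption of the sketch, not awk's actual rule table.
OPERAND_END = {"NUMBER", "NAME", "STRING", "REGEX", "RPAREN", "RBRACKET"}

# Keywords that can be followed directly by an expression (hypothetical subset).
KEYWORDS = {"print", "printf", "return"}

TOKEN_RE = re.compile(r'''
    (?P<NUMBER>\d+(?:\.\d+)?)
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<STRING>"(?:\\.|[^"\\])*")
  | (?P<RPAREN>\))
  | (?P<RBRACKET>\])
  | (?P<OP>&&|\|\||[-+*%^!<>=~,;{}(\[$])
''', re.VERBOSE)
REGEX_RE = re.compile(r"/(?:\\.|[^/\\\n])*/")  # simplified regex literal
WS_RE = re.compile(r"[ \t\n]+")


def tokenize(src):
    """Return (kind, text) pairs, classifying each "/" by the previous token."""
    out = []
    pos = 0
    prev_kind = None
    while pos < len(src):
        ws = WS_RE.match(src, pos)
        if ws:
            pos = ws.end()
            continue
        if src[pos] == "/":
            if prev_kind in OPERAND_END:
                kind, text = "DIV", "/"          # an operand just ended
            else:
                m = REGEX_RE.match(src, pos)     # otherwise expect a regex
                if m is None:
                    raise SyntaxError(f"unterminated regex at offset {pos}")
                kind, text = "REGEX", m.group()
        else:
            m = TOKEN_RE.match(src, pos)
            if m is None:
                raise SyntaxError(f"unexpected character at offset {pos}")
            kind, text = m.lastgroup, m.group()
            if kind == "NAME" and text in KEYWORDS:
                kind = "KEYWORD"                 # keywords do not end an operand
        out.append((kind, text))
        pos += len(text)
        prev_kind = kind
    return out
```

For example, `tokenize("a / b")` yields a `DIV` token, while `tokenize("$0 ~ /foo/")` yields a `REGEX` token, because `~` does not end an operand.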
jangliss 9 months ago
Surely it is AWKward?
librasteve 9 months ago
just use raku
v3ss0n 9 months ago
Reading awk as a human is hard too. And the performance of awk is crap: it is a lot slower than most interpreted languages out there. I replaced all the awk scripts with Python and everything is a lot faster.