Parsing Awk Is Tricky

110 points by oliverkwebb 9 months ago

9 comments

benhoyt 9 months ago
Brian Kernighan sent Gawk maintainer Arnold Robbins an email linking to this blog post with the comment "Hindsight has a lot of benefits, it would appear."

Peter Weinberger (quoted with permission) responded:

> That's interesting, Here's some thoughts/recollections. (remember that human memory is fallible.)

> 1. Using whitespace for string concatenation, in retrospect, was probably not the ideal choice (but '+' would not have worked).

> 2. Syntax choices were in part driven by the desire for our local C programmers to find it familiar.

> 3. As creatures of a specific time and place awk shared with C the (then endearing, now irritating) property of being underspecified.

> I think that collectively we understood YACC reasonably well. We tortured the grammar until the parser came close to doing what we wanted, and then we stopped. The tools then were more primitive, but they did fit in 64K of memory.

Al Aho also replied (quoted with permission):

> Peter's observation about torturing the grammar is apt! As awk grew in its early years, the grammar evolved with it and I remember agonizing to make changes to the grammar to keep it under control (understanding and minimizing the number of yacc-generated parsing-action conflicts) as awk evolved. I found yacc's ability to point out parsing-action conflicts very helpful during awk's development. Good grammar design was very much an art in those days (maybe even today).

It's fun to hear the perspectives of the original AWK creators. I've had some correspondence with Kernighan and Weinberger before, but I think that's the first time I've been on an email thread with all three of A, W, and K.
mmsc 9 months ago
Awk is something that I think every programmer and especially every sysadmin should learn. I like the comparison at the end and have never heard of nnawk or bbawk before.

I recently made a dashboard to compare the output of four versions of awk side by side, since not all awk scripts will run the same on each version: https://megamansec.github.io/awk-compare/ I'll have to add those :)
RodgerTheGreat 9 months ago
I think this is a good illustration of why parser-generator middleware like yacc is fundamentally misguided: it creates totally unnecessary gaps between design intent and the action of the parser. In a hand-rolled recursive descent parser, or even a set of PEG productions, ambiguities and complex lookahead or backtracking leap out at the programmer immediately.
teleforce 9 months ago
If you think AWK is hard to parse, try C++. It is so hard to parse, and therefore so slow to compile, that it most probably inspired this funny programmer skit, one of the most popular XKCDs of all time [1].

Then along came modern fast-compiling languages like Go and D. D is a breath of fresh air: even though it is a complex language like C++ and Rust, it manages to compile very fast. It even has the RDMD facility, which can act as a compiled REPL as you interact with the prompt, similar to interpreted programming languages like Python and Matlab.

According to its author, the main reason D compiles very fast (as long as you avoid CTFE) is that its language design avoids the notorious symbols that complicate the symbol table, as happened in C++ with the popular << and >> overloading for I/O and shifting. But the fact that Rust came much later than C++ and D yet is still slow to compile is bewildering, to say the least.

[1] Compiling: https://xkcd.com/303/
kazinator 9 months ago
If you are parsing awk, you must treat any run of whitespace that contains a newline as a visible token, which you have to reference in various places in the grammar. Your implementation will likely benefit from a switch in the lexical analyzer that sometimes turns off the visible newline.
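As a rough illustration of the idea in this comment, here is a minimal Python sketch of a lexer that collapses any whitespace run containing a newline into a single NEWLINE token, with a flag that suppresses it. All names (`Lexer`, `newline_visible`, `CONTINUATION`, `lex_with_newlines`) are hypothetical, and the set of tokens after which the newline is hidden is a simplified assumption, not the full rule of any real awk.

```python
import re

# Simplified token pattern: two-character logical operators, words/numbers,
# or any single punctuation character. Not a complete awk lexer.
TOKEN_RE = re.compile(r"&&|\|\||\w+|[^\s\w]")
WS_RE = re.compile(r"\s+")

# Assumption for this sketch: a newline right after one of these tokens
# does not end the statement, so it should stay invisible.
CONTINUATION = {",", "{", "&&", "||"}


class Lexer:
    def __init__(self, src):
        self.src = src
        self.pos = 0
        self.newline_visible = True  # the "switch" a parser could toggle

    def next_token(self):
        while self.pos < len(self.src):
            ws = WS_RE.match(self.src, self.pos)
            if ws:
                self.pos = ws.end()
                # A whole whitespace run containing a newline is one token.
                if "\n" in ws.group() and self.newline_visible:
                    return ("NEWLINE", "\n")
                continue
            m = TOKEN_RE.match(self.src, self.pos)
            self.pos = m.end()
            return ("TOK", m.group())
        return ("EOF", "")


def lex_with_newlines(src):
    """Drive the lexer, standing in for a parser that flips the switch."""
    lexer = Lexer(src)
    out = []
    while True:
        kind, text = lexer.next_token()
        if kind == "EOF":
            return out
        out.append((kind, text))
        # Hide the next newline only right after a continuation token.
        lexer.newline_visible = text not in CONTINUATION
```

With this sketch, `lex_with_newlines("a &&\n b\n")` yields the tokens `a`, `&&`, `b`, then one NEWLINE: the line break after `&&` is swallowed, while the one after `b` is reported. In a real parser the grammar actions, not the driver loop, would decide when to flip the flag.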
ufo 9 months ago
Another tricky bit is deciding whether "/" is the division operator or the start of a regular expression.

IIRC, awk does this in a context-sensitive manner, by looking at the previous token.
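The previous-token approach described here can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `lex_slashes`, the token kinds, and the `OPERAND_ENDERS` set are all hypothetical, and real awk implementations may resolve the ambiguity differently (for example, by having the parser tell the lexer what it expects).

```python
import re

# Assumption for this sketch: if the previous token could end an operand,
# a following "/" is division; otherwise it starts a regex literal.
OPERAND_ENDERS = {"NUMBER", "NAME", "RPAREN", "RBRACKET"}

TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+(?:\.\d+)?)
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<RPAREN>\))
  | (?P<RBRACKET>\])
  | (?P<OP>[-+*%~(\[{},;=<>!$])
""", re.VERBOSE)


def lex_slashes(src):
    tokens = []
    prev_kind = None
    i = 0
    while i < len(src):
        ch = src[i]
        if ch in " \t":
            i += 1
            continue
        if ch == "/":
            if prev_kind in OPERAND_ENDERS:
                tokens.append(("DIV", "/"))
                prev_kind = "DIV"
                i += 1
            else:
                # Scan to the closing "/", skipping backslash escapes.
                j = i + 1
                while j < len(src) and src[j] != "/":
                    j += 2 if src[j] == "\\" else 1
                if j >= len(src):
                    raise SyntaxError("unterminated regex literal")
                tokens.append(("REGEX", src[i + 1:j]))
                prev_kind = "REGEX"
                i = j + 1
            continue
        m = TOKEN_RE.match(src, i)
        if m is None:
            raise SyntaxError(f"unexpected character {ch!r}")
        prev_kind = m.lastgroup
        tokens.append((prev_kind, m.group()))
        i = m.end()
    return tokens
```

For example, `lex_slashes("a / b")` produces a DIV token because the slash follows a NAME, while `lex_slashes("$0 ~ /ab+/")` produces a REGEX token because the slash follows the `~` operator. The sketch ignores strings, bracket expressions inside regexes, and other cases a real lexer has to handle.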
jangliss 9 months ago
Surely it is AWKward?
librasteve 9 months ago
just use raku
v3ss0n 9 months ago
Reading awk as a human is hard too. And the performance of awk is crap: it is a lot slower than most interpreted languages out there. I replaced all my awk scripts with Python and everything is a lot faster.