Modernizing AWK, a 45-year-old language, by adding CSV support

258 points by benhoyt, about 3 years ago

28 comments

I often use csvquote [1] whenever I need to process CSV with a command-line tool that doesn't support it. For example: <pre><code>csvquote test.csv | awk -F, '{print $1, $2}' | csvquote -u</code></pre> [1] <a href="https://github.com/dbro/csvquote" rel="nofollow">https://github.com/dbro/csvquote</a>
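To illustrate the idea behind this pipeline, here is a minimal Python sketch of the sanitize/restore round trip: commas and newlines inside quoted fields are swapped for non-printing characters so that a naive splitter sees only real delimiters, and then swapped back afterwards. The function names and the two sentinel characters are assumptions chosen for illustration; this is not csvquote's actual code.

```python
# Hypothetical sentinels for illustration; not taken from csvquote's source.
FIELD_SENTINEL = "\x1f"
RECORD_SENTINEL = "\x1e"


def sanitize(text: str) -> str:
    """Hide commas and newlines that occur inside double-quoted fields."""
    out = []
    in_quotes = False
    for ch in text:
        if ch == '"':
            # An escaped quote ("") toggles twice, so the state is preserved.
            in_quotes = not in_quotes
            out.append(ch)
        elif in_quotes and ch == ",":
            out.append(FIELD_SENTINEL)
        elif in_quotes and ch == "\n":
            out.append(RECORD_SENTINEL)
        else:
            out.append(ch)
    return "".join(out)


def restore(text: str) -> str:
    """Put the original commas and newlines back."""
    return text.replace(FIELD_SENTINEL, ",").replace(RECORD_SENTINEL, "\n")


if __name__ == "__main__":
    raw = 'name,note\n"Smith, Jane","line one\nline two"\n'
    safe = sanitize(raw)
    # split("\n") is used on purpose: str.splitlines() would also break on \x1e.
    for line in safe.split("\n"):
        if line:
            print(len(line.split(",")), "fields")
    print(restore(safe) == raw)
```

Between `sanitize` and `restore`, any line-oriented tool that splits on commas can work on the data, which is the role awk plays in the pipeline above.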

db65edfc7996, about 3 years ago

I have grown fond of using miller [0] to handle command-line data processing. It handles the standard tabular formats (csv, tsv, json) and has all of the standard data cleanup options. It works on streams, so most operations are not limited by memory. [0]: <a href="https://github.com/johnkerl/miller" rel="nofollow">https://github.com/johnkerl/miller</a>

mro_name, about 3 years ago

I recently learned via <a href="https://news.ycombinator.com/item?id=31257248" rel="nofollow">https://news.ycombinator.com/item?id=31257248</a> that ASCII has had the idea of records and fields all along. It's just not used; the workaround, CSV, is used instead. No improvement in CSV handling will ever improve on that.
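For readers who have not seen these control characters in use, the following Python sketch shows one way tabular data could be written and read with ASCII's Unit Separator (0x1F) between fields and Record Separator (0x1E) after each record. The `encode` and `decode` helpers are hypothetical names for illustration, and rejecting fields that contain a separator is an assumption of this sketch rather than part of any standard.

```python
US = "\x1f"  # ASCII Unit Separator: placed between fields
RS = "\x1e"  # ASCII Record Separator: placed after each record


def encode(rows):
    """Serialize rows of strings using US and RS, with no quoting or escaping."""
    for row in rows:
        for field in row:
            if US in field or RS in field:
                raise ValueError("field contains a separator character")
    return "".join(US.join(row) + RS for row in rows)


def decode(data):
    """Parse data produced by encode() back into rows of strings."""
    records = data.split(RS)
    if records and records[-1] == "":
        records.pop()  # drop the empty piece after the final RS
    return [record.split(US) for record in records]


if __name__ == "__main__":
    rows = [["id", "note"], ["1", "a, b"], ["2", "two\nlines"]]
    print(decode(encode(rows)) == rows)
```

Commas, quotes, and newlines pass through untouched because none of them has a special meaning in this encoding.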

adamgordonbell, about 3 years ago

It's somewhat of a chore to use, but gawkextlib has a CSV extension, so you can do this in gawk if the extension is loaded.
<pre><code>@include "csv"
BEGIN { CSVMODE = 1 }
{ print $2 }
</code></pre>
<a href="https://earthly.dev/blog/awk-csv/#gawkextlib" rel="nofollow">https://earthly.dev/blog/awk-csv/#gawkextlib</a>

jph, about 3 years ago

Ben this is great, thank you. Would you consider adding Unicode Separated Values (USV)? <a href="https://github.com/sixarm/usv" rel="nofollow">https://github.com/sixarm/usv</a> USV is like CSV and simpler because of no escaping and no quoting. I can donate $50 to you or your charity of choice as a token of thanks and encouragement.

cb321, about 3 years ago

When you have "format wars", the best idea is usually to have a converter program change the data to the easiest-to-work-with format, unless this incurs a space explosion as with some image/video formats.

With CSV-like data, bulk conversion from quoted-escaped RFC4180 CSV to a simpler-to-parse format is the best plan for several reasons. First, it may "catch on", help Microsoft/R/whoever embrace the format and in doing so squash many bugs written by "data analyst/scientist coders". Second, in a shell "a|b" runs programs a & b in parallel on multi-core and allows things like csv2x|head -n10000|b or popen("csv2x foo.csv"). Third, bulk conversion to a random access file where literal delimiters cannot occur as non-delimiters allows trivial file segmentation to be nCores times faster (under often satisfied assumptions). There are some D tools for this bulk conversion in <a href="https://github.com/eBay/tsv-utils" rel="nofollow">https://github.com/eBay/tsv-utils</a> and a much smaller stand-alone Nim tool <a href="https://github.com/c-blake/nio/blob/main/utils/c2tsv.nim" rel="nofollow">https://github.com/c-blake/nio/blob/main/utils/c2tsv.nim</a> . Optional quoting was always going to be a PITA due to its non-locality. What if there is no quote anywhere? Fourth, by using a program as the unit of modularity in this case, you make things programming language agnostic. Someone could go to town and write a pure SIMD/AVX512 converter in assembly even and solve the problem "once and for all" on a given CPU. The problem is actually just simple enough that this smells possible.

I am unaware of any "document" that "standardizes" this escaped/lossless TSV format. { Maybe call it "DSV" for delimiter separated values where "delimiters actually separate"? Ironically redundant. ;-) } Someone want to write an RFC or point to one? It can be just as "general/lossless" (see <a href="https://news.ycombinator.com/item?id=31352170" rel="nofollow">https://news.ycombinator.com/item?id=31352170</a>).

Of course, if you are going to do a lot of data processing against some data, it is even better to parse all the way down to binary so that you never have to parse again (well, unless you call CPUs loading registers "parsing"), which is what database systems have been doing since the 1960s.
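As a rough sketch of the bulk-conversion idea in this comment, the Python below reads quoted CSV with the standard `csv` module and writes a TSV in which literal tabs, newlines, carriage returns, and backslashes inside fields are backslash-escaped, so a tab or newline in the output is always a real delimiter. The escape convention and the function names (`csv_to_tsv`, `tsv_rows`) are assumptions for illustration; they are not taken from tsv-utils or c2tsv.

```python
import csv
import io

# Assumed escape convention for this sketch: backslash, then a letter.
_UNESCAPE = {"t": "\t", "n": "\n", "r": "\r", "\\": "\\"}


def escape(field: str) -> str:
    """Escape characters that would otherwise look like TSV delimiters."""
    # Backslashes must be doubled first so later escapes stay unambiguous.
    return (field.replace("\\", "\\\\")
                 .replace("\t", "\\t")
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def unescape(field: str) -> str:
    """Reverse escape()."""
    out = []
    i = 0
    while i < len(field):
        if field[i] == "\\" and i + 1 < len(field) and field[i + 1] in _UNESCAPE:
            out.append(_UNESCAPE[field[i + 1]])
            i += 2
        else:
            out.append(field[i])
            i += 1
    return "".join(out)


def csv_to_tsv(csv_text: str) -> str:
    """Convert quoted CSV text into escaped TSV, one record per line."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    return "".join("\t".join(escape(f) for f in row) + "\n" for row in reader)


def tsv_rows(tsv_text: str):
    """Parse escaped TSV with plain string splitting, no quote tracking."""
    return [[unescape(f) for f in line.split("\t")]
            for line in tsv_text.split("\n")[:-1]]


if __name__ == "__main__":
    source = 'a,b\r\n"x, y","multi\nline"\r\n'
    print(tsv_rows(csv_to_tsv(source)))
```

Because every output record sits on exactly one line, downstream tools can split the file at any newline, which is what makes the parallel segmentation mentioned above straightforward.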

rgoulter, about 3 years ago

I'd be curious to see a comparison with the csvkit suite. <a href="https://csvkit.readthedocs.io/en/latest/index.html" rel="nofollow">https://csvkit.readthedocs.io/en/latest/index.html</a>

parasense, about 3 years ago

I always just use awk to process csv files.<pre><code> awk -F '^"|","|"$|,' '{print $2,$3}' whatever.csv </code></pre> The above works perfectly well, it handles quoted fields, or even just unquoted fields.... This snippet is taken from a presentation I give on AWK and BASH scripting. That's the thing about AWK, it already does everything. No need to extend it much at all.
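To see what a separator regex like the one above does with different inputs, the sketch below applies the same pattern with Python's `re.split` and compares the result with the standard `csv` parser. Python's regex engine is only a stand-in for awk's field splitting here (the two differ in details such as alternation rules), and the helper names are hypothetical. Under those assumptions, simple quoted fields split as intended, while a quoted field that itself contains a comma is broken in two.

```python
import csv
import re

# Same alternatives as the awk -F pattern in the comment above.
AWK_STYLE_SEPARATOR = re.compile(r'^"|","|"$|,')


def split_like_fs(line: str):
    """Split a line on the separator regex, roughly as awk's FS would."""
    return AWK_STYLE_SEPARATOR.split(line)


def split_with_csv_parser(line: str):
    """Split the same line with a quote-aware CSV parser."""
    return next(csv.reader([line]))


if __name__ == "__main__":
    for line in ('"a","b","c"', '"Smith, Jane","42","x"'):
        print(split_like_fs(line))
        print(split_with_csv_parser(line))
```

The leading empty string produced by the regex split corresponds to the empty `$1` that the awk one-liner skips by printing `$2` and `$3`.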

andi999, about 3 years ago

Can you also set the decimal separator? Some countries use ',' in numbers like 10,5

malkocoglu, about 3 years ago

Modernizing ... by adding CSV support !?!

hawski, about 3 years ago

It is a nice addition, but I would like to see this taken further: a structural regular expression awk. It has been waiting to be implemented for 35 years now.

asicsp, about 3 years ago

> A big thank-you to the library of the University of Antwerp, who sponsored this feature. They’re one of two major teams or projects I know of that use GoAWK – the other one is the Benthos stream processor.

That's great to hear.

Are you planning to add support for xml, json, etc next? Something like Python's `json` module that gives you a dictionary object.

rufugee, about 3 years ago

What’s the best resource for learning modern awk these days? I’ve used it for decades, but only via memorized snippets…

altairprime, about 3 years ago

For non-awk tools, csvformat (from csvkit) will unquote and re-delimit a CSV file (-D\034 -U -B) into something that UNIX pipes can handle (cut -d\034, etc). It’s worth setting up as an alias, and you can store \034 in $D or whatever for convenience.

adolph, about 3 years ago

For anything down and dirty, what's wrong with -F'"'? For anything fancy there are plenty of things like the below. eBay's TSV Utilities: Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more. Includes csv to tsv: <a href="https://github.com/eBay/tsv-utils" rel="nofollow">https://github.com/eBay/tsv-utils</a> HT: <a href="https://simonwillison.net/" rel="nofollow">https://simonwillison.net/</a>

gpvos, about 3 years ago

During a recent HN discussion on pipes and text versus structured objects to transfer data between programs, I started wondering if CSV wouldn't be a nice middle ground.

ognyankulev, about 3 years ago

Instead of yet another limited parser, it would be best if universal tabular data parsing is supported by allowing one to specify all important parsing parameters, as described in <a href="https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/#dialect-descriptions" rel="nofollow">https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/#di...</a>

tyingq, about 3 years ago

GNU awk also has a csv extension that comes with gawkextlib. I think it may even be installed on many Linux distros by default.

torginus, about 3 years ago

I can't tell whether the UNIX people have lost their way, or whether the demands of modern shell scripts just cannot be met by the typical shell philosophy, that is, piping together the output of small, orthogonal utilities.

The emergence and constantly increasing complexity of small, bespoke DSLs like this one or jq do not inspire confidence in me.

uhtred, about 3 years ago

Why not use FPAT: <a href="https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html" rel="nofollow">https://www.gnu.org/software/gawk/manual/html_node/Splitting...</a>
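For context, FPAT tells gawk what a field looks like rather than what separates fields. The Python sketch below imitates that "split by content" approach with `re.findall`, using a pattern in the spirit of the linked manual section: a field is either a double-quoted string or a run of non-comma characters. The pattern, its ordering (the quoted alternative is listed first because Python takes the first matching alternative), and the function name are assumptions of this sketch, not gawk's implementation.

```python
import re

# A field is a quoted string, or else a run of characters that are not commas.
FIELD_PATTERN = re.compile(r'"[^"]+"|[^,]+')


def split_by_content(line: str):
    """Return the fields of a line by matching field contents, not separators."""
    return FIELD_PATTERN.findall(line)


if __name__ == "__main__":
    print(split_by_content('Robbins,Arnold,"1234 A Pretty Street, NE",MyTown'))
    # Empty fields are a known weak spot of this simple pattern.
    print(split_by_content("a,,c"))
```

In this sketch the quotes stay attached to the field, escaped quotes and embedded newlines are not handled, and empty fields are skipped entirely, which hints at why a simple content pattern alone falls short of a full CSV parser.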

wef, about 3 years ago

Here's another library I've been using for several years: <a href="http://lorance.freeshell.org/csv/" rel="nofollow">http://lorance.freeshell.org/csv/</a>

forgotpwd16, about 3 years ago

A good and useful addition. There's a mention of CSVMODE, a gawk library. I wonder if it could be extended to support the functionality that goawk's `-i csv` has.

motohagiography, about 3 years ago

Thank you! In non-technical environments, shell scripting with awk is a superpower, and it's almost always on csv data.

mattewong, about 3 years ago

thanks for this. am looking at the benchmarks. how do I get huge.csv? Don't see how to fetch or generate

skanga, about 3 years ago

I always use mawk for its performance. This may be worth a try ...

igtztorrero, about 3 years ago

one step further towards GoUnix, Gonux or any name Gophers like!

WFHRenaissance, about 3 years ago

Always had CSV support, it's called `awk --field-separator=","`.

w0de0, about 3 years ago

-F ','