'sort' and 'uniq' should also be near the top of the list. And once you're doing more on the command line, 'join' and 'comm' can help you merge data from multiple files.
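Something like this, with invented file names (both tools expect their inputs to be sorted):

    # lines common to both files plus lines unique to each (inputs must be sorted)
    comm ids_a.txt ids_b.txt

    # merge two CSVs on their first column as the join key (also wants sorted input)
    join -t, -1 1 -2 1 users.csv orders.csv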
I'd add jq (https://stedolan.github.io/jq/) to the list. JSON data is so common, and jq makes working with it a breeze.
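To give a flavour (the file names and field names below are made up), pulling a field out of every record, or counting records by a field, is a one-liner:

    # print the "name" field of every element of a JSON array
    jq -r '.[].name' users.json

    # frequency count by "status", assuming an array of objects
    jq 'group_by(.status) | map({status: .[0].status, count: length})' events.json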
Folks may want to have a look at https://www.gnu.org/software/datamash/manual/datamash.html
I suppose it violates the Unix philosophy of one tool doing one thing well, but it may nevertheless be useful. See also the examples page: https://www.gnu.org/software/datamash/examples/
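A rough sketch of the kind of thing it does (the sample file and column numbers are invented; -t, -H and -g are GNU datamash's delimiter, header and grouping options):

    # mean of column 3 grouped by column 1, comma-delimited input with a header row
    datamash -t, -H -g 1 mean 3 < sales.csv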
I would be a little bit shocked if any of the data scientists at my day job didn't know all seven of these, so, I guess that's an accurate title.
I cannot recommend this enough:

The Awk Programming Language - Aho, Kernighan, Weinberger

https://ia802309.us.archive.org/25/items/pdfy-MgN0H1joIoDVoIC7/The_AWK_Programming_Language.pdf

The book is amazingly well written, and is invaluable.
<a href="http://visidata.org/" rel="nofollow">http://visidata.org/</a> is a nice one for quickly getting an overview of some tabular data – you can even just stick it at the end of your pipe. If<p>bzcat foo.bz2|sort|uniq -c|sort -nr | awk -f munge.awk |blah<p>produces a tsv, then<p>bzcat foo.bz2|sort|uniq -c|sort -nr | awk -f munge.awk |blah|vd<p>makes that tsv an interactive (if you think ncurses is interactive) spreadsheet with plotting and pivot tables and mouse support :)<p>You can also save your keypresses in vd to a file and then re-run them at a later stage – I've got some scripts to re-run an analysis and then run vd on it and immediately set all columns to floats and open the frequency table so I can see if I managed to lower the median this time.
If you have a lot of files that may be processed by a `find` command and speed is important, it’s worth knowing about the plus-sign variation of the `-exec` expression. The command in the original article

    find . -name setup.py -type f -exec grep -Hn boto3 {} \;

could be written as

    find . -name setup.py -type f -exec grep -Hn boto3 {} +
The difference is that the first version (the `-exec` expression is terminated with a semicolon) forks a new process to run the `grep` command for each individual file “found” by the preceding expressions. So, if there were 50 such `setup.py` files, the `grep` command would be invoked 50 times. Sometimes this is the desired behaviour, but in this case `grep` can accept multiple pathnames as arguments.

With the second version (the expression is terminated with a plus sign), the pathnames of the files are collected into sets so that the `grep` command is only called once for each set (similar to how the `xargs` utility works to avoid exceeding the limits on the number of arguments that can be passed to a command). This is much more efficient because only 1 `grep` child process is forked – instead of 50.

This functionality was added to the POSIX specification [1] a number of years ago and I’ve been using it for at least 10 years on GNU/Linux systems. I imagine it should be available on other Unix-like environments (including BSD [2]) that data scientists are likely to be using – though the last time I had to work on a friend’s Mac the installed versions of the BSD utilities were quite old.

[1]: http://pubs.opengroup.org/onlinepubs/9699919799/

[2]: https://man.openbsd.org/find.1
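For comparison, a roughly equivalent way to get the same batching with `xargs` (NUL-delimited so pathnames containing spaces survive):

    # find prints NUL-terminated pathnames; xargs batches them and
    # runs grep once per batch instead of once per file
    find . -name setup.py -type f -print0 | xargs -0 grep -Hn boto3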
I have an example-based tutorial for all these commands plus other CLI text processing commands:

https://github.com/learnbyexample/Command-line-text-processing
Problem: Given a CSV file, we want to know the number of columns just by analyzing its header.

    $ head -n 1 data.csv | awk -F ',' '{print NF}'
Or spare a process:

    awk -F ',' 'NR == 1 {print NF; exit}' data.csv
One of numerous weak points in this article.
> Prints on the screen (or to the standard output) the contents of files. Simple like that.

While it's not exactly false, it's also not a good explanation for cat. If you just want to operate on the contents of a single file, you should use redirection. The cat utility is for concatenating files.
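For example (file names here are just placeholders), a single file can be redirected straight into the command, while cat earns its keep when there really are several files to concatenate:

    # single file: no cat needed, redirect it in
    wc -l < access.log

    # several files: concatenating is what cat is for
    cat access.log.1 access.log.2 > access-all.log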
tldr: grep, cat, find, head/tail, wc, awk, and shuf, with bonuses of xargs and man.

I've never needed shuf, and awk is a bit out of place in the list, but head and tail have saved me from many a large file. The interesting data is usually in head, tail, or grep anyway.