If you're interested in this space, a great resource is https://www.datascienceatthecommandline.com/ (a free guide that accompanies an O'Reilly book).
A plug for my tools: to visualize data coming in from a pipe, you can pipe it to

https://github.com/dkogan/feedgnuplot

It's very useful in conjunction with other tools that provide filtering and manipulation. For instance (the first one is mine):

https://github.com/dkogan/vnlog

https://www.gnu.org/software/datamash/

https://csvkit.readthedocs.io/

https://github.com/johnkerl/miller

https://github.com/eBay/tsv-utils-dlang

http://harelba.github.io/q/

https://github.com/BatchLabs/charlatan

https://github.com/dinedal/textql

https://github.com/BurntSushi/xsv

https://github.com/dbohdan/sqawk

https://stedolan.github.io/jq/

https://github.com/benbernard/RecordStream
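As a quick sketch of the feedgnuplot workflow (flags from memory of its README; treat --lines and --terminal as illustrative rather than gospel), plotting a stream of numbers straight from a pipe looks roughly like:

    # plot a sine wave as ASCII art directly in the terminal
    seq 1 100 | awk '{print sin($1/10)}' | feedgnuplot --lines --terminal 'dumb 80,40'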
Regarding the multiple mentions of UUOC in this thread:

- The original award started in 1995. Even though the Pentium was already out, I think it's safe to say that was the era of 486 PCs. In 2019, for day-to-day shell work (meaning no GBs of file processing or anything like that), isn't invoking UUOC and pointing out inefficiencies an example of premature optimization [1]?

- Isn't readability a matter of subjectivity, such that for some folks 'cat file' is more readable than '<file' or a direct use of a processing command (like grep, tail, head, etc.) [2]? (The whole Stack Overflow page is fairly illuminating [3].)

[1] http://wiki.c2.com/?PrematureOptimization

[2] https://chat.stackoverflow.com/rooms/182573/discussion-on-answer-by-jonathan-leffler-useless-use-of-cat

[3] https://stackoverflow.com/questions/11710552
Not really where the author is heading, but I like to configure a matplotlib backend that renders graphics in the terminal, so when I'm SSHed into a remote system I can get inline plots.
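A minimal sketch of that setup, assuming the third-party itermplot backend (which renders plots inline in iTerm2; the package choice is my example, other terminal backends exist):

    # install the backend on the remote machine (package name is an assumption)
    pip install itermplot

    # point matplotlib at it for this session
    export MPLBACKEND="module://itermplot"

    # now plt.show() draws inline in the terminal, even over SSH
    python -c 'import matplotlib.pyplot as plt; plt.plot([1, 4, 9, 16]); plt.show()'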
Here are some ways you could simplify some of the tasks in the article, saving on typing:

    cat data.csv | sed 's/"//g'

can be simplified by doing this instead (note that tr needs -d to delete characters):

    cat data.csv | tr -d '"'
This awk command:

    cat sales.csv | awk -F',' '{print $1}' | sort | uniq

can be replaced with a simpler (IMO) cut instead:

    cat sales.csv | cut -d , -f 1 | sort | uniq
When using head or tail like this:

    head -n 3

you don't need the -n:

    head -3
Also shout out to jq, xsv, and zsh (extended glob), all nice complements to the typical command line utils.
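For anyone unfamiliar with those, here's roughly what each brings to a pipeline (file and field names are made up for illustration):

    # jq: pull a field out of a stream of JSON objects
    cat events.json | jq -r '.user'

    # xsv: select columns from a CSV, with quoting handled correctly
    xsv select name,amount sales.csv

    # zsh extended glob: all .csv files recursively, excluding backups
    setopt extendedglob
    wc -l **/*.csv~*backup*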
When I was at a genetics lab, I was helping some researchers with something and spent 3 days writing a Perl script, which kept failing. I sent an email to one of the guys who wrote the paper the research was based on, and he said: why not try awk like this? With a little work, I turned 3 days of Perl into a one-line awk script that was faster than anything else for the job at the time. That was an inspirational moment for me about the fundamental power of the Unix philosophy and the core utilities in Linux.

Good introductory article here!
This is a great list and well-written. As a data professional, I use these commands all the time and my job would be much harder without them. I also learned a few new things here (`tee` and `comm`).

I was lucky that my first job was as a support engineer at a data-centric tech company, which is where I learned these. I've often thought about how to teach them to data analysts coming from a non-engineering background. This is comprehensive but clear and would be a perfect resource for training someone like that. Thank you!
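For readers who haven't met those two, a rough sketch of what they do (file names here are illustrative):

    # tee: write the stream to a file AND keep it flowing down the pipe
    grep ERROR app.log | tee errors.txt | wc -l

    # comm: compare two sorted inputs; -12 keeps only lines common to both
    comm -12 <(sort list_a.txt) <(sort list_b.txt)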
I'll just leave one of my past comments [1] here.

[1] https://news.ycombinator.com/item?id=17324222

P.S.: Not essential, but it really becomes a joy when, as a touch typist, I have turned on vi mode in the shell (e.g., with 'set -o vi'). My fingers never have to leave the home row while I do my shell piping work from start to finish (no mouse, no arrow keys, etc.).
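For anyone who wants to try it, the per-shell incantations (put each in the matching rc file):

    # bash (~/.bashrc):
    set -o vi

    # zsh (~/.zshrc):
    bindkey -v

    # or, for every readline-based program, add to ~/.inputrc:
    #   set editing-mode vi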
Why do people say Linux in place of *nix?

Even worse, most of the tools (cat, grep, awk) are Unix commands, redeveloped by the GNU project for most of the GNU/Linux distros.
Very useful article. Learned a couple of new things here.

While reading, a thought jumped out at me: if I know most of this, does that make me a data scientist?

But then I quickly recovered: surely knowing some of the tools someone could use in a certain domain does not make you an expert at that domain.

Might just be a case of same ingredients, different recipes.
This is also useful:
<a href="https://www.gnu.org/software/datamash/" rel="nofollow">https://www.gnu.org/software/datamash/</a>
I wouldn't use awk for simple things such as

    cat sales.csv | awk -F',' '{print $1}'

but I'd prefer

    cut -d, -f1 sales.csv
Useless use of cat detected!

*Remember, nearly all cases where you have:*

    cat file | some_command and its args ...

*you can rewrite it as:*

    <file some_command and its args ...

*and in some cases, such as this one, you can move the filename to the arglist as in:*

    some_command and its args ... file

— Randal L. Schwartz (http://porkmail.org/era/unix/award.html#cat)
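Applied to one of the article's pipelines (my own rewrite, not from the award page), all three forms produce identical output:

    # useless use of cat
    cat sales.csv | cut -d, -f1

    # redirect stdin instead
    <sales.csv cut -d, -f1

    # or just pass the filename as an argument
    cut -d, -f1 sales.csv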
Hi, I wrote the article. A few people commented noting that I included "Data Science" in the title, but the content doesn't include any statistics or machine learning, which is closer to the core definition of 'data science'. I still think the title is appropriate, since any kind of low-fidelity data science task you do on some ad-hoc data (log files, heaps of text, web pages) is going to start with setting up a processing pipeline that involves these commands.

I could have renamed it "An intro to text processing" or "An intro to data processing", but then the people who need to see this content wouldn't associate the title with something they're interested in, so they'd never benefit from it. The list of commands was chosen specifically with the question "What Linux commands would someone answering data science/business intelligence questions use?" in mind. These commands are also among the ones that are usually already installed on every system.
Ugly UUOC (Useless Use Of Cat). Damn, people. I appreciate your will to share, but please share good content and stop spreading bad shell patterns...