An introduction to data processing on the Linux command line

203 points by robertelder over 5 years ago

22 comments

tjlav5 over 5 years ago

If you're interested in this space, a great resource can be found at <a href="https://www.datascienceatthecommandline.com/" rel="nofollow">https://www.datascienceatthecommandline.com/</a> (a free guide to go along with an O'Reilly book)

dima55 over 5 years ago

A plug for my tools:

To visualize data coming in from a pipe, you can pipe it to <a href="https://github.com/dkogan/feedgnuplot" rel="nofollow">https://github.com/dkogan/feedgnuplot</a>

Very useful in conjunction with other tools that provide filtering and manipulation. For instance (the first one is mine):

<a href="https://github.com/dkogan/vnlog" rel="nofollow">https://github.com/dkogan/vnlog</a>
<a href="https://www.gnu.org/software/datamash/" rel="nofollow">https://www.gnu.org/software/datamash/</a>
<a href="https://csvkit.readthedocs.io/" rel="nofollow">https://csvkit.readthedocs.io/</a>
<a href="https://github.com/johnkerl/miller" rel="nofollow">https://github.com/johnkerl/miller</a>
<a href="https://github.com/eBay/tsv-utils-dlang" rel="nofollow">https://github.com/eBay/tsv-utils-dlang</a>
<a href="http://harelba.github.io/q/" rel="nofollow">http://harelba.github.io/q/</a>
<a href="https://github.com/BatchLabs/charlatan" rel="nofollow">https://github.com/BatchLabs/charlatan</a>
<a href="https://github.com/dinedal/textql" rel="nofollow">https://github.com/dinedal/textql</a>
<a href="https://github.com/BurntSushi/xsv" rel="nofollow">https://github.com/BurntSushi/xsv</a>
<a href="https://github.com/dbohdan/sqawk" rel="nofollow">https://github.com/dbohdan/sqawk</a>
<a href="https://stedolan.github.io/jq/" rel="nofollow">https://stedolan.github.io/jq/</a>
<a href="https://github.com/benbernard/RecordStream" rel="nofollow">https://github.com/benbernard/RecordStream</a>

haddr over 5 years ago

Command-line tools are powerful beasts (e.g. awk), and they have always been central to data preprocessing. But do we now need to call it data science?

fizixer over 5 years ago

Regarding the multiple mentions of UUOC in this thread:

- The original award started in 1995. Even though the Pentium was already out, I think it is safe to say that was the era of 486 PCs. In 2019, for day-to-day shell work (meaning no GBs of file processing or anything like that), isn't invoking UUOC and pointing out inefficiencies an example of premature optimization [1]?

- Isn't readability subjective, and for some folks isn't 'cat file' more readable than '<file' or a direct use of a processing command (like grep, tail, head, etc.) [2]? (The whole Stack Overflow page is fairly illuminating [3].)

[1] <a href="http://wiki.c2.com/?PrematureOptimization" rel="nofollow">http://wiki.c2.com/?PrematureOptimization</a>
[2] <a href="https://chat.stackoverflow.com/rooms/182573/discussion-on-answer-by-jonathan-leffler-useless-use-of-cat" rel="nofollow">https://chat.stackoverflow.com/rooms/182573/discussion-on-an...</a>
[3] <a href="https://stackoverflow.com/questions/11710552" rel="nofollow">https://stackoverflow.com/questions/11710552</a>

mark_l_watson over 5 years ago

Not really where the author is heading, but I like to configure a matplotlib backend that renders graphics in the terminal, so that when I am SSHed into a remote system I can get inline plots.

ibern over 5 years ago

Here are some ways you could simplify some of the tasks in the article, saving on typing:

<pre><code> cat data.csv | sed 's/"//g' </code></pre>

can be simplified by doing this instead:

<pre><code> cat data.csv | tr -d '"' </code></pre>

This awk command:

<pre><code> cat sales.csv | awk -F',' '{print $1}' | sort | uniq </code></pre>

can be replaced with a simpler (IMO) cut instead:

<pre><code> cat sales.csv | cut -d , -f 1 | sort | uniq </code></pre>

When using head or tail like this:

<pre><code> head -n 3 </code></pre>

you don't need the -n:

<pre><code> head -3 </code></pre>

Also, shout out to jq, xsv, and zsh (extended glob), all nice complements to the typical command-line utils.
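
To check that the shorter forms really do the same job, here is a minimal sketch. The contents of data.csv are made up for illustration (the article's real file is not shown in this thread), and the same file is reused for the cut example in place of sales.csv.

```shell
# Hypothetical sample data standing in for the article's data.csv.
tmpdir=$(mktemp -d)
printf '%s\n' '"id","name"' '"1","apple"' '"2","pear"' > "$tmpdir/data.csv"

# Both commands strip every double quote; tr needs the -d (delete) flag.
with_sed=$(sed 's/"//g' "$tmpdir/data.csv")
with_tr=$(tr -d '"' < "$tmpdir/data.csv")
[ "$with_sed" = "$with_tr" ] && echo "sed and tr agree"

# awk and cut both pull out the first comma-separated field.
with_awk=$(awk -F',' '{print $1}' "$tmpdir/data.csv" | sort | uniq)
with_cut=$(cut -d , -f 1 "$tmpdir/data.csv" | sort | uniq)
[ "$with_awk" = "$with_cut" ] && echo "awk and cut agree"

rm -r "$tmpdir"
```

Neither awk -F',' nor cut understands quoted fields that contain commas, so both forms are only safe on simple CSV like the sample above.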

arminiusreturns over 5 years ago

When I was at a genetics lab, I was helping some researchers on something and spent 3 days writing a Perl script, which kept failing. I sent an email to one of the guys who wrote the paper the research was being based on, and he said, why not try awk like this? With a little work, I turned 3 days of Perl into a 1-line awk that was faster than anything else for the job at the time. That was an inspirational moment for me about the fundamental power of the Unix philosophy and the core utilities in Linux.

Good introductory article here!

mjirv over 5 years ago

This is a great list and well-written. As a data professional, I use these commands all the time and my job would be much harder without them. I also learned a few new things here (`tee` and `comm`).

I was lucky that my first job was as a support engineer at a data-centric tech company, which is where I learned these. I've often thought about how to teach them to data analysts coming from a non-engineering background. This is comprehensive but clear and would be a perfect resource for training someone like that. Thank you!

fizixer over 5 years ago

I'll just leave one of my past comments [1] here.

[1] <a href="https://news.ycombinator.com/item?id=17324222" rel="nofollow">https://news.ycombinator.com/item?id=17324222</a>

P.S.: Not essential, but it really becomes a joy when, as a touch typist, I have turned on vi mode in the shell (e.g., with 'set -o vi'). My fingers never have to leave the home row while I do my shell piping work from start to finish. (no mouse, no arrow keys, etc.)

pferde over 5 years ago

Huh, so it turns out that I've been a 'data scientist' for over 20 years. Who knew?

rodrigo975 over 5 years ago

Why do people say Linux in place of *nix?

Even worse, most of the tools (cat, grep, awk) are Unix commands, redeveloped by the GNU project and shipped in most GNU/Linux distros.

wolfhumble over 5 years ago

Very nice video, and I like the way you combine it with text and examples! :-) Looking forward to reading the other articles on your page as well!

hackerm0nkey over 5 years ago

Very useful article. Learned a couple of new things here.

While reading, a thought jumped out at me: I know most of this, so would that make me a data scientist?

But I quickly recovered from that thought: surely knowing some of the tools someone could use in a certain domain does not make you an expert in that domain.

It might just be a case of same ingredients, different recipes.

pedro84 over 5 years ago

This is a little more awk-ish:

awk -F, '$2 == "F" {$0=(($1-32)*5/9)",C"} {print}'
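
For anyone who wants to try that one-liner, here is a small sketch. The input layout (a temperature value, then a unit letter, separated by a comma) is an assumption inferred from the awk program itself; the article's actual data is not reproduced in this thread.

```shell
# Hypothetical readings in "value,unit" form.
# Rows whose second field is F are rewritten as Celsius; others pass through.
converted=$(printf '%s\n' '212,F' '20,C' '32,F' |
  awk -F, '$2 == "F" {$0=(($1-32)*5/9)",C"} {print}')

echo "$converted"
# prints:
# 100,C
# 20,C
# 0,C
```

The first rule only fires on matching rows and replaces the whole record ($0); the bare {print} then emits every row, converted or not.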

pnutjam over 5 years ago

This is still useful information for data scientists who end up on Linux.

oburb over 5 years ago

This is also useful: <a href="https://www.gnu.org/software/datamash/" rel="nofollow">https://www.gnu.org/software/datamash/</a>

mnaydin over 5 years ago

I wouldn't use awk for simple things such as

<pre><code> cat sales.csv | awk -F',' '{print $1}' </code></pre>

but I'd prefer

<pre><code> cut -d, -f1 sales.csv</code></pre>

teddyh over 5 years ago

Useless use of cat detected!

Remember, nearly all cases where you have:

<pre><code> cat file | some_command and its args ... </code></pre>

you can rewrite it as:

<pre><code> <file some_command and its args ... </code></pre>

and in some cases, such as this one, you can move the filename to the arglist as in:

<pre><code> some_command and its args ... file </code></pre>

-- Randal L. Schwartz (<a href="http://porkmail.org/era/unix/award.html#cat" rel="nofollow">http://porkmail.org/era/unix/award.html#cat</a>)
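
The three forms in that quote can be compared side by side. This is only a sketch: grep -c stands in for "some_command", and the temporary file and its contents are made up for illustration.

```shell
# Hypothetical sample file standing in for "file" in the quote.
tmpfile=$(mktemp)
printf '%s\n' alpha beta alphabet > "$tmpfile"

# 1. The criticised pattern: cat feeds grep through a pipe (one extra process).
piped=$(cat "$tmpfile" | grep -c alpha)

# 2. Redirect the file to stdin; the redirection may come before the command.
redirected=$(< "$tmpfile" grep -c alpha)

# 3. Pass the filename as an argument, which grep accepts directly.
as_argument=$(grep -c alpha "$tmpfile")

echo "$piped $redirected $as_argument"   # prints: 2 2 2
rm "$tmpfile"
```

All three give the same count here. One difference worth knowing: with forms 1 and 2 the command only sees an anonymous stream, while with form 3 it knows the filename.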

lonelappde over 5 years ago

Good intro to data processing.

tsort and comm were news to me.
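
For readers who are also new to these two commands, a minimal sketch follows. The file names and contents are invented for illustration. Note that comm expects both inputs to be sorted already, and tsort reads pairs meaning "left must come before right".

```shell
tmpdir=$(mktemp -d)
printf '%s\n' apple banana cherry > "$tmpdir/a.txt"
printf '%s\n' banana cherry date  > "$tmpdir/b.txt"

# comm -12 suppresses columns 1 and 2, leaving lines common to both files.
common=$(comm -12 "$tmpdir/a.txt" "$tmpdir/b.txt")

# comm -23 suppresses columns 2 and 3, leaving lines unique to the first file.
only_a=$(comm -23 "$tmpdir/a.txt" "$tmpdir/b.txt")

# tsort turns "before after" pairs into one ordering that respects them all.
order=$(printf '%s\n' 'download clean' 'clean report' | tsort)

echo "$common"; echo "$only_a"; echo "$order"
rm -r "$tmpdir"
```

With this input the common lines are banana and cherry, the line unique to a.txt is apple, and the only valid ordering is download, clean, report.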

c06n over 5 years ago

Can somebody explain the advantage of doing it on the command line vs in Python or R? What would a practical use case look like?

robertelder over 5 years ago

Hi, (I wrote the article). A few people commented noting that I included "Data Science" in the title, but the content doesn't include any statistics or machine learning, which are closer to the core definition of 'data science'. I still think the title is appropriate since any kind of low-fidelity data science task you do on some ad hoc data (log files, heaps of text, web pages) is going to start with setting up a processing pipeline that involves these commands. I could have re-named it "An intro to text processing" or "An intro to data processing", but then the people who need to see this content won't associate the title with something they're interested in, so they never benefit from it. The list of commands was chosen specifically with the question "What Linux commands would someone answering data science/business intelligence questions use?" in mind. These commands are also among the ones that are usually already installed on every system.
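
As a rough illustration of the kind of pipeline described above, here is a sketch that finds the most frequent status code in some log lines. The log format ("path status") and the data are hypothetical and are not taken from the article.

```shell
# Hypothetical log lines in "path status" form.
# cut picks the status column, sort | uniq -c counts each value,
# sort -rn puts the largest count first, and awk tidies the output.
top_status=$(printf '%s\n' '/a 200' '/b 404' '/c 200' '/d 500' '/e 200' |
  cut -d ' ' -f 2 | sort | uniq -c | sort -rn | head -n 1 |
  awk '{print $2, $1}')

echo "$top_status"   # prints: 200 3
```

Every command in it is part of a standard install, which is the point being made about these tools being available almost everywhere.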

netmonk over 5 years ago

Ugly UUOC (Useless Use Of Cat). Damn, people, please: I appreciate your willingness to share, but share good content and stop spreading bad shell patterns.
