I use unix in the same way and for the same purpose as described in the blog, but I have come to the opinion that once you get into the describe-and-visualize phase, it's much easier to just drop into R. Reading in the kind of file being worked on here is often as simple as<p>foo <- read.csv("foo.csv")<p>Getting summary descriptive statistics, item counts, scatter plots and histograms is often as easy as<p>summary(foo)<p>table(foo$col)<p>plot(foo$xcol, foo$ycol)<p>hist(foo$col)<p>I think that is a lot simpler than a 4 or 5 command pipeline that can be mistake-prone to edit when you want to change column names or things like that. I still do these kinds of things in the shell sometimes, and I don't know if I can put my finger on when exactly I would drop into R vs write out a pipeline, but there IS a line somewhere...
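For comparison, here is a rough sketch of the kind of pipeline that table(foo$col) replaces. The file contents and the column position (field 2) are made up for illustration, and it assumes a simple CSV with a header row and no quoted commas.

```shell
# Hypothetical sample data standing in for foo.csv (assumption: plain CSV,
# one header row, no commas inside quoted fields).
dir=$(mktemp -d)
printf 'id,col\n1,a\n2,b\n3,a\n' > "$dir/foo.csv"

# Skip the header, keep field 2, then count occurrences of each value,
# most frequent first. Roughly what table(foo$col) gives you in R.
counts=$(tail -n +2 "$dir/foo.csv" | cut -d, -f2 | sort | uniq -c | sort -rn)
printf '%s\n' "$counts"

rm -rf "$dir"
```

Switching to a different column means editing the -f2 in the middle of the pipeline, which is exactly the kind of edit that is easy to get wrong.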
A quote: "... As if this wasn't enough, he [i.e. Tukey] also <i>invented</i> what is probably the most influential algorithm of all time." (emphasis added)<p>No, Tukey did not "invent" the FFT. He rediscovered it, as did a number of others over the years since -- who else? -- Gauss first created it.<p><a href="http://en.wikipedia.org/wiki/Fast_Fourier_transform" rel="nofollow">http://en.wikipedia.org/wiki/Fast_Fourier_transform</a><p>A quote: "This method (and the general idea of an FFT) was popularized by a publication of J. W. Cooley and J. W. Tukey in 1965,[2] but it was later discovered (Heideman & Burrus, 1984) that those two authors had independently re-invented an algorithm known to Carl Friedrich Gauss around 1805 (and subsequently rediscovered several times in limited forms)."
Hits close to home. I do a lot of data conversion, arrangement and manipulation on the CLI.
When some coworker inherits any of those tasks and I explain how to do it, the answer tends to be "Aaaaallright, I'll use Excel".
Up for unix and "EDA is the lingua franca of data science". What you can do and discard on the unix CLI takes many times longer on certain GUI-based OSes.
head -3 data* | cat
has the same result as
head -3 data*<p>Pipe sends stdout to stdin of the next process. cat sends stdin back to stdout. Piping to cat is rarely eventful (unless you use a flag like cat -n).
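A quick way to check this yourself, as a sketch. The two data files are invented for the example, and I use head -n 3, which is the portable spelling of head -3.

```shell
# Hypothetical sample files matching the data* glob.
dir=$(mktemp -d)
printf 'a\nb\nc\nd\n' > "$dir/data1"
printf 'w\nx\ny\nz\n' > "$dir/data2"

# Same command with and without the trailing cat.
plain=$(head -n 3 "$dir"/data*)
piped=$(head -n 3 "$dir"/data* | cat)

# cat only changes the stream when given a flag such as -n (number lines).
numbered=$(head -n 3 "$dir"/data* | cat -n)

if [ "$plain" = "$piped" ]; then
    echo "piping to plain cat changed nothing"
fi

rm -rf "$dir"
```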
He writes<p><pre><code> (head -5; tail -5) <data
</code></pre>
but that's a bit misleading. These don't work.<p><pre><code> seq 20 | (head -5; tail -5)
(head -5; tail -5) < <(seq 20)
</code></pre>
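Both give just the first five lines. On a pipe, head (at least GNU head) typically reads a whole buffer and cannot seek back, so nothing is left for tail. With a regular file the redirect can work because head is able to reposition the file offset to just after the lines it printed.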
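One possible workaround for piped input, sketched below, is to spool stdin to a temporary file and let head and tail each open that file themselves. The function name head_tail is made up for this example, and it assumes mktemp is available.

```shell
# head_tail N: print the first N and last N lines of stdin.
# (Hypothetical helper; not something from the blog post.)
head_tail() {
    n=${1:-5}
    tmp=$(mktemp) || return 1
    cat > "$tmp"           # spool the whole stream to a regular file
    head -n "$n" "$tmp"    # each command opens the file on its own,
    tail -n "$n" "$tmp"    # so they do not share a read offset
    rm -f "$tmp"
}

seq 20 | head_tail 5
```

If the input has fewer than 2N lines, the two parts overlap and some lines are printed twice.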