TechEcho

Seven Unix Commands Every Data Scientist Should Know

109 points by lerax, about 6 years ago

17 comments

cybersol, about 6 years ago
'sort' and 'uniq' should also be near the top of the list. And once you're doing more on the command line, 'join' and 'comm' can help you merge data from multiple files.
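A minimal sketch of those suggestions on made-up data (`comm` works similarly to `join`, reporting lines unique to each of two sorted inputs):

```shell
# sort groups identical lines together so uniq -c can count them
printf 'apple\nbanana\napple\napple\n' | sort | uniq -c | sort -rn

# join merges two files on a shared key column (both must be sorted on that key)
printf '1 alice\n2 bob\n' > ids.txt
printf '1 42\n3 99\n'     > scores.txt
join ids.txt scores.txt          # prints "1 alice 42": keys present in both files
rm ids.txt scores.txt
```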
aviraldg, about 6 years ago
I'd add jq (https://stedolan.github.io/jq/) to the list. JSON data is so common, and jq makes working with it a breeze.
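For instance, two common jq idioms on a toy array (the block skips itself on machines without jq installed):

```shell
# skip gracefully on machines without jq
command -v jq >/dev/null 2>&1 || exit 0

# extract one field from every element of a JSON array
echo '[{"name":"ada","age":36},{"name":"lin","age":50}]' | jq -r '.[].name'

# filter objects by a condition, keeping the JSON structure
echo '[{"name":"ada","age":36},{"name":"lin","age":50}]' | jq '[.[] | select(.age > 40)]'
```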
foundart, about 6 years ago
Folks may want to have a look at https://www.gnu.org/software/datamash/manual/datamash.html I suppose it violates the Unix philosophy of one tool doing one thing well but it may nevertheless be useful. See also the examples page https://www.gnu.org/software/datamash/examples/
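As a taste of what datamash does, a sketch of a per-group aggregate on tab-separated input (skipped if GNU datamash is not installed):

```shell
# skip gracefully if GNU datamash is not installed
command -v datamash >/dev/null 2>&1 || exit 0

# mean of column 2 for each group in column 1
# (input must already be sorted/grouped by the group column)
printf 'a\t1\na\t3\nb\t10\n' | datamash -g 1 mean 2
```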
fwip, about 6 years ago
I would be a little bit shocked if any of the data scientists at my day job didn't know all seven of these, so, I guess that's an accurate title.
msravi, about 6 years ago
I cannot recommend this enough:

The Awk Programming Language - Aho, Kernighan, Weinberger

https://ia802309.us.archive.org/25/items/pdfy-MgN0H1joIoDVoIC7/The_AWK_Programming_Language.pdf

The book is amazingly well written, and is invaluable.
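The book's core idea fits in one example: an awk program is a list of pattern { action } pairs run against every input line, with fields already split for you. A small sketch on made-up input:

```shell
# each "pattern { action }" pair runs against every line of input
printf 'alice 30\nbob 25\ncarol 41\n' | awk '
  $2 > 28 { print $1 }           # lines whose 2nd field exceeds 28
  END     { print "lines:", NR } # END block runs after the last line
'
```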
ams6110, about 6 years ago
They left out *rm*, used to clean up all their files when they are done so other users can work.
unhammer, about 6 years ago
http://visidata.org/ is a nice one for quickly getting an overview of some tabular data – you can even just stick it at the end of your pipe. If

    bzcat foo.bz2 | sort | uniq -c | sort -nr | awk -f munge.awk | blah

produces a tsv, then

    bzcat foo.bz2 | sort | uniq -c | sort -nr | awk -f munge.awk | blah | vd

makes that tsv an interactive (if you think ncurses is interactive) spreadsheet with plotting and pivot tables and mouse support :)

You can also save your keypresses in vd to a file and then re-run them at a later stage – I've got some scripts to re-run an analysis and then run vd on it and immediately set all columns to floats and open the frequency table so I can see if I managed to lower the median this time.
Anthony-G, about 6 years ago
If you have a lot of files that may be processed by a `find` command and speed is important, it's worth knowing about the plus-sign variation of the `-exec` expression. The command in the original article

    find . -name setup.py -type f -exec grep -Hn boto3 {} \;

could be written as

    find . -name setup.py -type f -exec grep -Hn boto3 {} +

The difference is that the first version (the `-exec` expression is terminated with a semicolon) forks a new process to run the `grep` command for each individual file "found" by the preceding expressions. So, if there were 50 such `setup.py` files, the `grep` command would be invoked 50 times. Sometimes this is desired behaviour but in this case, `grep` can accept multiple pathnames as arguments.

With the second version (expression is terminated with a plus-sign), the pathnames of the files are collected into sets so that the `grep` command is only called once for each set (similar to how the `xargs` utility works to avoid exceeding the limits on the number of arguments that can be passed to a command). This is much more efficient because only 1 `grep` child process is forked – instead of 50.

This functionality was added to the POSIX specification [1] a number of years ago and I've been using it for at least 10 years on GNU/Linux systems. I imagine it should be available on other Unix-like environments (including BSD [2]) that data scientists are likely to be using – though the last time I had to work on a friend's Mac the installed versions of the BSD utilities were quite old.

[1]: http://pubs.opengroup.org/onlinepubs/9699919799/
[2]: https://man.openbsd.org/find.1
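The `xargs` analogy above can be made concrete; this sketch builds a throwaway directory so it is runnable anywhere, and uses `-print0`/`-0` so filenames with spaces or newlines survive the pipe:

```shell
# same batching idea as "-exec ... +", expressed with xargs
d=$(mktemp -d)
echo 'import boto3' > "$d/setup.py"

# NUL-separated names: safe for spaces/newlines; grep is forked once per batch
find "$d" -name setup.py -type f -print0 | xargs -0 grep -Hn boto3

rm -r "$d"
```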
asicsp, about 6 years ago
I have an example-based tutorial for all these commands plus other CLI text-processing commands:

https://github.com/learnbyexample/Command-line-text-processing
dredmorbius, about 6 years ago
Problem: Given a CSV file, we want to know the number of columns just by analyzing its header.

    $ head -n 1 data.csv | awk -F ',' '{print NF}'

Or spare a process:

    awk -F ',' 'NR == 1 {print NF; exit}' data.csv

One of numerous weak points to this article.
yakshaving_jgt, about 6 years ago
> Prints on the screen (or to the standard output) the contents of files. Simple like that.

While it's not exactly false, it's also not a good explanation for cat. If you just want to operate on the contents of a single file, you should use redirection. The cat utility is for concatenating files.
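Side by side, on a throwaway file (the classic "useless use of cat"):

```shell
tmp=$(mktemp)
printf 'import boto3\n' > "$tmp"

cat "$tmp" | grep boto3    # works, but forks cat just to feed one file to grep
grep boto3 < "$tmp"        # redirection: grep reads the file directly

cat "$tmp" "$tmp"          # concatenation: the job cat is actually named for

rm "$tmp"
```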
6keZbCECT2uB, about 6 years ago
tl;dr: grep, cat, find, head / tail, wc, awk, shuf, with bonuses of xargs and man.

I've never needed shuf, and awk is a bit out of place in the list, but head and tail have saved me from many a large file. The interesting data is usually in head, tail, or grep anyway.
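For what it's worth, the three peeking commands above in one sketch; shuf is GNU coreutils, so the block skips itself where it's absent (e.g. stock macOS):

```shell
# shuf ships with GNU coreutils; skip gracefully where it's missing
command -v shuf >/dev/null 2>&1 || exit 0

seq 1 1000 | shuf -n 2    # two lines sampled uniformly at random
seq 1 1000 | head -n 3    # first three lines of a large stream
seq 1 1000 | tail -n 3    # last three lines
```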
sgillen, about 6 years ago
Is there a real advantage to using awk over Python for most tasks? Or is it just a little faster/more convenient if you already know it?
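One way to frame the trade-off: for line-oriented work, awk's implicit read-split-loop saves most of the boilerplate. The same column sum both ways (the Python half skips itself if python3 is absent):

```shell
# sum column 2: awk gives you the line loop and field splitting for free
printf 'a 1\nb 2\nc 3\n' | awk '{s += $2} END {print s}'    # prints 6

# skip the comparison on machines without python3
command -v python3 >/dev/null 2>&1 || exit 0

# the same steps, spelled out in Python
printf 'a 1\nb 2\nc 3\n' | python3 -c '
import sys
print(sum(int(line.split()[1]) for line in sys.stdin))'
```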
colechristensen, about 6 years ago
Is "data science" so undeveloped that pipes and grep need to be on an everyone-should-know list?
pumanoir, about 6 years ago
Any book recommendations to understand and master the use of UNIX commands?
stakhanov, about 6 years ago
Let's not forget cut for dealing with CSV.
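A quick sketch of cut on made-up CSV; worth noting that cut splits naively on the delimiter, so it is only safe for CSV without quoted fields containing embedded commas:

```shell
# -d sets the delimiter, -f picks fields (columns)
printf 'id,name,score\n1,ada,90\n2,lin,85\n' | cut -d, -f2     # name / ada / lin
printf 'id,name,score\n1,ada,90\n2,lin,85\n' | cut -d, -f1,3   # keep columns 1 and 3
```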
noobermin, about 6 years ago
s/Data Scientist/Unix User/