'sort' and 'uniq' should also be near the top of the list. And once you're doing more on the command line, 'join' and 'comm' can help you merge data from multiple files.
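Something like this, with invented file names (both tools expect their inputs to be sorted):

    # lines common to both files plus lines unique to each (inputs must be sorted)
    comm ids_a.txt ids_b.txt

    # merge two CSVs on their first column as the join key (also wants sorted input)
    join -t, -1 1 -2 1 users.csv orders.csv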
I'd add jq (https://stedolan.github.io/jq/) to the list. JSON data is so common, and jq makes working with it a breeze.
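To give a flavour (the file names and field names below are made up), pulling a field out of every record, or counting records by a field, is a one-liner:

    # print the "name" field of every element of a JSON array
    jq -r '.[].name' users.json

    # frequency count by "status", assuming an array of objects
    jq 'group_by(.status) | map({status: .[0].status, count: length})' events.json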
Folks may want to have a look at https://www.gnu.org/software/datamash/manual/datamash.html
I suppose it violates the Unix philosophy of one tool doing one thing well, but it may nevertheless be useful. See also the examples page: https://www.gnu.org/software/datamash/examples/
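A rough sketch of the kind of thing it does (the sample file and column numbers are invented; -t, -H and -g are GNU datamash's delimiter, header and grouping options):

    # mean of column 3 grouped by column 1, comma-delimited input with a header row
    datamash -t, -H -g 1 mean 3 < sales.csv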
I would be a little bit shocked if any of the data scientists at my day job didn't know all seven of these, so, I guess that's an accurate title.
I cannot recommend this enough:

The Awk Programming Language - Aho, Kernighan, Weinberger

https://ia802309.us.archive.org/25/items/pdfy-MgN0H1joIoDVoIC7/The_AWK_Programming_Language.pdf

The book is amazingly well written, and is invaluable.
<a href="http://visidata.org/" rel="nofollow">http://visidata.org/</a> is a nice one for quickly getting an overview of some tabular data – you can even just stick it at the end of your pipe. If<p>bzcat foo.bz2|sort|uniq -c|sort -nr | awk -f munge.awk |blah<p>produces a tsv, then<p>bzcat foo.bz2|sort|uniq -c|sort -nr | awk -f munge.awk |blah|vd<p>makes that tsv an interactive (if you think ncurses is interactive) spreadsheet with plotting and pivot tables and mouse support :)<p>You can also save your keypresses in vd to a file and then re-run them at a later stage – I've got some scripts to re-run an analysis and then run vd on it and immediately set all columns to floats and open the frequency table so I can see if I managed to lower the median this time.
If you have a lot of files that may be processed by a `find` command and speed is important, it’s worth knowing about the plus-sign variation of the `-exec` expression. The command in the original article

    find . -name setup.py -type f -exec grep -Hn boto3 {} \;

could be written as

    find . -name setup.py -type f -exec grep -Hn boto3 {} +
The difference is that the first version (the `-exec` expression is terminated with a semicolon) forks a new process to run the `grep` command for each individual file “found” by the preceding expressions. So, if there were 50 such `setup.py` files, the `grep` command would be invoked 50 times. Sometimes this is the desired behaviour, but in this case `grep` can accept multiple pathnames as arguments.

With the second version (the expression is terminated with a plus sign), the pathnames of the files are collected into sets so that the `grep` command is only called once for each set (similar to how the `xargs` utility works to avoid exceeding the limits on the number of arguments that can be passed to a command). This is much more efficient because only 1 `grep` child process is forked – instead of 50.

This functionality was added to the POSIX specification [1] a number of years ago and I’ve been using it for at least 10 years on GNU/Linux systems. I imagine it should be available on other Unix-like environments (including BSD [2]) that data scientists are likely to be using – though the last time I had to work on a friend’s Mac the installed versions of the BSD utilities were quite old.

[1]: http://pubs.opengroup.org/onlinepubs/9699919799/

[2]: https://man.openbsd.org/find.1
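For comparison, a roughly equivalent way to get the same batching with `xargs` (NUL-delimited so pathnames containing spaces survive):

    # find prints NUL-terminated pathnames; xargs batches them and
    # runs grep once per batch instead of once per file
    find . -name setup.py -type f -print0 | xargs -0 grep -Hn boto3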
I have an example-based tutorial for all these commands plus other CLI text processing commands:

https://github.com/learnbyexample/Command-line-text-processing
Problem: Given a CSV file, we want to know the number of columns just by analyzing its header.

    $ head -n 1 data.csv | awk -F ',' '{print NF}'
Or spare a process:

    awk -F ',' 'NR == 1 {print NF; exit}' data.csv
One of numerous weak points in this article.
> Prints on the screen (or to the standard output) the contents of files. Simple like that.

While it's not exactly false, it's also not a good explanation for cat. If you just want to operate on the contents of a single file, you should use redirection. The cat utility is for concatenating files.
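For example (file names here are just placeholders), a single file can be redirected straight into the command, while cat earns its keep when there really are several files to concatenate:

    # single file: no cat needed, redirect it in
    wc -l < access.log

    # several files: concatenating is what cat is for
    cat access.log.1 access.log.2 > access-all.log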
tldr: grep, cat, find, head/tail, wc, awk, and shuf, with bonuses of xargs and man.

I've never needed shuf, and awk is a bit out of place in the list, but head and tail have saved me from many a large file. The interesting data is usually in head, tail, or grep anyway.