Gary Bernhardt[1] gave a great talk about practical problem solving with the unix shell: "The Unix Chainsaw"[2].<p>"Half-assed is OK when you only need half of an ass."<p>In the talk, he gives several demonstrations of a key aspect of <i>why</i> unix pipelines are so practically useful: you build them <i>interactively</i>. A complicated 4-line pipeline started as a single command that was gradually refined into something that actually solves a complicated problem. The talk demonstrates the part that isn't included in the usual tutorials or "cool 1-line command" lists: the cycle of "Try something. Hit up to get the command back. Make one iterative change and try again."<p>[1] You might know him from his other hilarious talks like "The Birth & Death of JavaScript" or "Wat".<p>[2] <a href="https://www.youtube.com/watch?v=sCZJblyT_XM" rel="nofollow">https://www.youtube.com/watch?v=sCZJblyT_XM</a>
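As a sketch of that cycle (with a hypothetical access.log in common log format, where field 7 is the request path), each step recalls the previous command and adds one stage:<p><pre><code> $ grep ' 404 ' access.log
 $ grep ' 404 ' access.log | awk '{print $7}'
 $ grep ' 404 ' access.log | awk '{print $7}' | sort | uniq -c
 $ grep ' 404 ' access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head</code></pre>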
This [0] is the most complete post I've read on the topic.
Lays out all the relevant tools.
Spending some time going through each tool's documentation/options pays off tremendously.<p>[0]: <a href="https://www.ibm.com/developerworks/aix/library/au-unixtext/index.html" rel="nofollow">https://www.ibm.com/developerworks/aix/library/au-unixtext/i...</a>
The brilliant fun of working with the Unix CLI toolset is that there are millions of valid ways to solve a problem. I also thought of a “better” solution of my own that took an entirely different approach from most of the ones posted here. That’s not really the point.<p>What’s great about this article is that it follows the process of solving the problem step by step. I find that lots of programmers I work with struggle with CLI problem solving, which I find a little surprising. But I think it all depends on how you think about problems like this.<p>If you start from “how can I build a function to operate on this raw data?” or “what data structure would best express the relationship between these filenames?” then you will have a hard time. But if you think in terms of “how can I mutate this data to eliminate extraneous details?” and “what tools do I have handy that can solve problems on data like this given a bit of munging, and how can I accomplish that bit of munging?”, and if you can accept taking several baby steps of small operations on every line of the full dataset rather than building and manipulating abstract logical structures, then you’re well on your way to making efficient use of this remarkable toolset to solve ad hoc problems like this one in minutes instead of hours.
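To give those baby steps a concrete (hypothetical) flavor, whittling a raw line down to just the detail you care about might look like:<p><pre><code> $ cat results.txt
 run 0003: FAILED (timeout)
 run 0007: ok
 $ grep FAILED results.txt
 run 0003: FAILED (timeout)
 $ grep FAILED results.txt | cut -d' ' -f2
 0003:
 $ grep FAILED results.txt | cut -d' ' -f2 | tr -d ':'
 0003</code></pre>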
Removing leading zeroes doesn't require Python. One easy solution would be sed:<p><pre><code> $ echo -e '0001\n0010\n0002' | sed 's/^0*//'
1
10
2</code></pre>
A change in structure might be helpful:<p><pre><code> $ ls data
0001.csv 0002.csv 0003.csv 0004.csv ...
$ ls algorithm_a
0001.csv 0002.csv 0004.csv ...
$ diff -q algorithm_a data |grep ^Only |sed 's/.*: //g'
0003.csv ...</code></pre>
For learning to get things done with Unix, I recommend the two old books "The Unix Programming Environment" and "The AWK Programming Language". There are many resources for learning the various commands etc., but there is still no better place than those books to learn the "unix philosophy". This series is also good:<p><a href="https://sanctum.geek.nz/arabesque/series/unix-as-ide/" rel="nofollow">https://sanctum.geek.nz/arabesque/series/unix-as-ide/</a>
I think the best part about using Unix tools is that they force you to break down the problem into tiny steps.<p>You can see feedback every step of the way by removing and adding back piped commands, so you're never really dealing with more than one operation at a time, which makes debugging and making progress a lot easier than trying to fit everything together at once.
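For example (hypothetical data.csv), truncating the pipeline at any stage shows exactly what the next stage will receive:<p><pre><code> $ cut -d, -f2 data.csv | head -3          # check the extraction first
 $ cut -d, -f2 data.csv | sort | uniq -c   # then add a stage and look again</code></pre>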
I've often done this, usually not for a large dataset, but it's sometimes helpful to pipe text through Unix commands in Emacs. C-u M-| sort, for instance, will run the selection through sort and replace it in place.<p>If you're going the all-Python route, want to be able to run bash commands, and want something where you can feed the output of one step into the input of the next, I'd strongly recommend jupyter. (If you want to stay in a terminal, ipython is part of jupyter, heavily upgrades the built-in REPL, and does 90% of what I'm mentioning here.)<p>You can break out each step into its own cell and save variables (though cell 5 will be auto-saved as a variable named _5), but the nicest thing is that you can move cells around (check the keyboard shortcuts) and restart the entire kernel and rerun all your operations: essentially what you're getting with a long pipeline, only spread out over parts. And there are shortcuts like func? to pop up help on a function or func?? to see the source.<p>It's got some dependencies, so I'd recommend running it in a virtualenv via pipenv:<p><pre><code> pipenv install jupyter # setup new virtualenv and add package
pipenv run jupyter notebook
pipenv --rm # Blow away the virtualenv
</code></pre>
Also, look into pandas if you want to slurp a CSV and query it.
The problem with this is that there isn't a standard format forced on the args that follow the command name "cut".<p>What makes it worse is that there are seemingly standard patterns that get violated by other patterns. It's often based on when the utility was first authored and whatever ideas were floating around at the time. So sometimes characters can "clump" together behind a flag, under the assumption that multi-character flags will get two hyphens. Then some utilities or programs use a single hyphen for multi-character flags. Plus many other inconsistencies: if I learn the basic range syntax for cut, do I know the basic range syntax for imagemagick?<p>Those inconsistencies don't <i>technically</i> conflict, since each only exists in the context of a particular utility. But it's a real strain on sanity to see those inconsistencies sitting on either side of a pipe, especially when one of them is wrong. (Or even when it's a single command you need but you use the wrong flag syntax.) That all adds to the cognitive load and can easily make a dev tired before it's time to go to sleep.<p>Oh, and that language switch from bash to python is a huge risk. If you're scripting with Python on a daily basis it probably doesn't seem like it. But for someone reading along, that language boundary is huge. Because the user is no longer limited to runtime errors and finicky arg formatting errors, but also exposed to language errors. If the command line barfs up an exception or syntax error at that boundary I'd bet most users would just give up and quit reading the rest of the blog.<p>Edit: clarification
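To make the concern concrete, here are a few coexisting flag conventions, each valid for its own tool:<p><pre><code> tar -xzf archive.tar.gz    # short flags clump behind one hyphen
 find . -name '*.csv'       # multi-character flag with a single hyphen
 dd if=disk.img of=copy.img # no hyphens at all
 cut -c1-4 file.txt         # cut's own range syntax</code></pre>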
This was a nice read and a good introduction to text processing with unix commands.<p>I agree with the other user re python usage - that you may as well use it for the whole task if you're going to use it at all - but I don't think it's a major flaw. It worked for you right? I would suggest naming the python file a bit more descriptively though.<p>Interesting to read the other suggestions about dealing with this without python.
$ join -v 2 <(ls | grep _A | sort | cut -c-4) <(ls | grep -v _A | sort | cut -c-4)<p>The shortest one I could come up with, no need to use python.<p>`join -v 2` shows the entries in the second sorted stream that don't have match in the first sorted stream, the rest is self-explanatory I hope.<p>Edit:
$ join -v2 -t_ -j1 <(ls | grep _A | sort ) <(ls | grep -v _A | sort)<p>This is even shorter: it joins on the first field (-j1), with fields separated by '_' (-t_).
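A toy illustration of the -v semantics on two inline sorted streams:<p><pre><code> $ join -v 2 <(printf '0001\n0002\n') <(printf '0001\n0002\n0003\n')
 0003</code></pre>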
I thought this was a neat demo of building up a command with UNIX tools. The python inclusion was a bit odd, yes.<p>I learned about sys.stdin in Python and about cutting characters using cut's -c flag.
After moving back to working on a Windows machine the last several years and being “forced” into using PowerShell, I now find myself using it for these sorts of tasks on Linux.<p>I now use PowerShell for any tasks of equal or greater complexity than the article. It’s such a massive upgrade over struggling to recall the peculiar bash syntax every time and the benefits of piping typed objects around are vast.<p>As a nice bonus, all of my PowerShell scripts run cross-platform without issue.
All the pipes and non-builtin commands (especially python!) look like overkill to me, I must say.<p><pre><code> for set in *_data.csv ; do
   num=${set/_*/}          # strip from the first underscore: 0001_data.csv -> 0001
   success=${set/data/A}   # corresponding success file: 0001_data.csv -> 0001_A.csv
   if [ ! -e "$success" ] ; then echo "$num" ; fi
 done
</code></pre>
ETA: likely specific to bash, since I have no experience with other shells except for dalliances with ksh and csh in the mid-90s.
I usually do text processing in Bash, Notepad++ and Excel. Each has its own pros and cons, which is why I usually combine them.<p>Here are the tools I use in Bash:<p>grep,
tail,
head,
cat,
cut,
less,
awk,
sed,
sort,
uniq,
wc,
xargs,
watch
...
If you are using python in your pipeline, might as well go all in!<p><pre><code> from pathlib import Path
all_possible_filenames = {f'{i:04}_A.csv' for i in range(1, 501)}
cur_dir_filenames = {p.name for p in Path('.').iterdir()}
missing_filenames = all_possible_filenames - cur_dir_filenames
print(*sorted(missing_filenames), sep='\n')</code></pre>
The article solves the problem: for which numbers x between 1 and 500 is there no file x_A.csv? It looks like in this case it is equivalent to the easier problem: for which x_data.csv is there no corresponding x_A.csv?<p><pre><code> cd dataset-directory
comm -23 <(ls *_data.csv | sed s/data/A/) <(ls *_A.csv)</code></pre>
I got paid $175/hr as a data analyst contractor to basically run bash, grep, sed, awk, perl. The people that hired me weren't dumb, just non-programmers and became giddy as I explained regular expressions. The gig only lasted 3 months, but I taught myself out of a job: once they got the gist of it they didn't need me. Yay?
Nicely done using Unix utils. You can have a pure sed solution (save the `ls` invocation) that is much simpler, albeit obscure, that hinges on the fact that every number has a `data.csv` file.<p>Given a sorted list of these files (through `ls` or otherwise) the following sed code will print out the data files for which A did not succeed on them.<p><pre><code> /data/!N
/A/d
P;D
</code></pre>
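One way to run it (GNU sed; -e keeps the three commands separate):<p><pre><code> $ ls dataset-directory | sed -e '/data/!N' -e '/A/d' -e 'P;D'</code></pre>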
This works on the fact that there exists a data file for all successful and unsuccessful runs on data, so sed simply prints the files for which there does not exist an `A` counterpart.<p>If you want to only print out the numbers, you can add a substitution or two towards the end.<p><pre><code> /data/!N
/A/d
s/^0*\|_.*//g;P;D
</code></pre>
Edit: fixed the sed program
given the limited scope of files in the directory... not sure why it was necessary to use grep instead of the built-in glob?<p><pre><code> ls dataset-directory | egrep '\d\d\d\d_A.csv'</code></pre>
which FWIW wouldn't even work, on multiple levels: you need -1 on ls and no files end with A.csv<p><pre><code> vs
ls -1 dataset-directory/*_A?.csv
</code></pre>
ref: <a href="http://man7.org/linux/man-pages/man7/glob.7.html" rel="nofollow">http://man7.org/linux/man-pages/man7/glob.7.html</a><p>Update: apologies, apparently my client cached an older version of this page. at that time the files were named A1.csv and A2.csv
Instead of creating a Python script to convert the numbers to integers, you can use awk: "python3 parse.py" becomes "awk '{printf "%d\n", $0}'"
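For instance, on the same leading-zero input as the sed example above:<p><pre><code> $ echo -e '0001\n0010\n0002' | awk '{printf "%d\n", $0}'
 1
 10
 2</code></pre>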
If you don't mind "cd dataset-directory" beforehand, a shorter and possibly more correct version would be:<p><pre><code> comm -1 -3 <(ls *_A.csv | sed 's/_.*$//') <(seq -w 0500) | sed 's/^0*//'
</code></pre>
The OP's solution doesn't seem correct because of the different ordering of the two inputs of `comm': lexicographical (ls) and numeric (seq).
I learnt a lot from the book Data Science at the Command Line, now free and online at <a href="https://www.datascienceatthecommandline.com/" rel="nofollow">https://www.datascienceatthecommandline.com/</a>
Set operations are very useful. Here's a summary:<p><a href="https://www.pixelbeat.org/cmdline.html#sets" rel="nofollow">https://www.pixelbeat.org/cmdline.html#sets</a>
Not the most efficient solution but this is what springs to mind for me:<p><pre><code> seq 1000 | xargs printf '%04d_A.csv\n' | while read -r f; do test -f "$f" || echo "$f"; done</code></pre>
Use F# with a TypeProvider. Of course, I imagine it would take some work learning F# but once you learn it the sky is the limit in what you can do with this data.
More power to those who enjoy writing control flow in shell, but if I need anything beyond a single line I'm going with an interactive ipython session.
awk one-liner:<p><pre><code> ls | awk '{split($1,x,"_"); a[x[1]]+=1} END {for (i in a) {if (a[i] < 2) {print i}}}'</code></pre><p>It counts how many files share each numeric prefix and prints the prefixes that appear only once (a data file with no A file).
> I am starting to realize that the Unix command-line toolbox can fix absolutely any problem related to text wrangling.<p>Am I the only one who thought, "No shit, Sherlock"? This is a fundamental of UNIX that many people don't seem to grasp.
> I am starting to realize that the Unix command-line toolbox can fix absolutely any problem related to text wrangling.<p>How many problems related to text wrangling arise simply by working with Unix tools?<p>“This philosophical framework will help you solve problems internal to philosophy.”