Problem solving with Unix commands

356 points by v3gas over 6 years ago

39 comments

pdkl95 over 6 years ago
Gary Bernhardt[1] gave a great talk about practical problem solving with the unix shell: "The Unix Chainsaw"[2].

"Half-assed is OK when you only need half of an ass."

In the talk, he gives several demonstrations of a key aspect of *why* unix pipelines are so practically useful: you build them *interactively*. A complicated 4 line pipeline started as a single command that was gradually refined into something that actually solves a complicated problem. This talk demonstrates the part that isn't included in the usual tutorials or "cool 1-line command" lists: the cycle of "Try something. Hit up to get the command back. Make one iterative change and try again."

[1] You might know him from his other hilarious talks like "The Birth & Death of JavaScript" or "Wat".

[2] https://www.youtube.com/watch?v=sCZJblyT_XM
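The "try, recall, refine" cycle described above can be sketched roughly as follows. This is only an illustration, not something taken from the talk: the scratch directory and the `NNNN_data.csv` / `NNNN_A.csv` file names are assumptions modeled on the article's dataset.

```shell
# Hypothetical scratch data: three inputs, algorithm A succeeded on two of them.
dir=$(mktemp -d)
touch "$dir/0001_data.csv" "$dir/0001_A.csv" \
      "$dir/0002_data.csv" \
      "$dir/0003_data.csv" "$dir/0003_A.csv"

# Step 1: look at the raw listing.
ls "$dir"

# Step 2: recall the command and add one stage: keep only the numeric prefix.
ls "$dir" | cut -c1-4

# Step 3: recall again and add another stage: count files per prefix.
ls "$dir" | cut -c1-4 | sort | uniq -c

# Step 4: final refinement: a prefix that occurs only once has no _A file.
missing=$(ls "$dir" | cut -c1-4 | sort | uniq -u)
echo "$missing"

rm -rf "$dir"
```

Each step is the previous command plus one more stage, so a wrong turn is visible in the output straight away.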
fforflo over 6 years ago
This [0] is the most complete post I've read on the topic. It lays out all the relevant tools. Spending some time going through each tool's documentation/options pays off tremendously.

[0]: https://www.ibm.com/developerworks/aix/library/au-unixtext/index.html
skywhopper over 6 years ago
The brilliant fun of working with the Unix CLI toolset is that there are millions of valid ways to solve a problem. I also thought of a “better” solution of my own that took an entirely different approach than most of the ones posted here. That’s not really the point.

What’s great about this article is that it follows the process of solving the problem step by step. I find that lots of programmers I work with struggle with CLI problem solving, which I find a little surprising. But I think it all depends on how you think about problems like this.

If you start from “how can I build a function to operate on this raw data?” or “what data structure would best express the relationship between these filenames?” then you will have a hard time. But if you think in terms of “how can I mutate this data to eliminate extraneous details?” and “what tools do I have handy that can solve problems on data like this given a bit of mungeing, and how can I accomplish that bit of mungeing?” and if you can accept taking several baby steps of small operations on every line of the full dataset rather than building and manipulating abstract logical structures, then you’re well on your way to making efficient use of this remarkable toolset to solve ad hoc problems like this one in minutes instead of hours.
yuriko over 6 years ago
If you bother to write a python script to parse the integers, why not use python to solve the whole problem?
dnet over 6 years ago
Removing leading zeroes doesn't require Python. One easy solution would be sed:

    $ echo -e '0001\n0010\n0002' | sed 's/^0*//'
    1
    10
    2
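A small sketch of the same substitution, with one caveat spelled out. The sample values and the guarded variant are illustrative assumptions, not part of the comment above: a value made only of zeros would be reduced to an empty line by `s/^0*//`, which may or may not matter for a given dataset.

```shell
# Strip leading zeros from each line with sed. printf is used instead of
# `echo -e` because echo's escape handling differs between shells.
stripped=$(printf '0001\n0010\n0002\n' | sed 's/^0*//')
echo "$stripped"

# Possible guard for all-zero input: require that one character survives,
# so "0000" becomes "0" instead of an empty line.
guarded=$(printf '0000\n0500\n' | sed 's/^0*\(.\)/\1/')
echo "$guarded"
```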
boomlinde over 6 years ago
A change in structure might be helpful:

    $ ls data
    0001.csv 0002.csv 0003.csv 0004.csv ...
    $ ls algorithm_a
    0001.csv 0002.csv 0004.csv ...
    $ diff -q algorithm_a data | grep ^Only | sed 's/.*: //g'
    0003.csv
    ...
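A runnable sketch of the directory-per-stage layout suggested above, using throwaway directories. The directory contents are made up for illustration, and the grep pattern is narrowed to `Only in data` (a small departure from the comment) so that only files missing from `algorithm_a` are reported.

```shell
# Hypothetical layout: one directory per stage, identical file names in each.
root=$(mktemp -d)
mkdir "$root/data" "$root/algorithm_a"
touch "$root/data/0001.csv" "$root/data/0002.csv" "$root/data/0003.csv"
touch "$root/algorithm_a/0001.csv" "$root/algorithm_a/0002.csv"

# diff -q reports files present in only one directory as "Only in DIR: NAME".
# LC_ALL=C keeps that message untranslated so the grep pattern matches.
missing=$(cd "$root" && LC_ALL=C diff -q algorithm_a data \
  | grep '^Only in data' | sed 's/.*: //')
echo "$missing"

rm -rf "$root"
```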
stiff over 6 years ago
For learning to get things done with Unix, I recommend the two old books "Unix Programming Environment" and "The AWK Programming Language". There are many resources to learn the various commands etc., but there is still no better place than those books to learn the "unix philosophy". This series is also good:

https://sanctum.geek.nz/arabesque/series/unix-as-ide/
nickjj over 6 years ago
I think the best part about using Unix tools is it forces you to break down the problem into tiny steps.

You can see feedback every step of the way by removing and adding back new piped commands so you're never really dealing with more than 1 operation at a time, which makes debugging and making progress a lot easier than trying to fit everything together at once.
ben509 over 6 years ago
I've often done this, usually not for a large dataset, but it's sometimes helpful to pipe text through Unix commands in Emacs. C-u M-| sort, for instance, will run the selection through sort and replace it in place.

If you're going the all python route, and even want to be able to run bash commands, and want something where you can feed the output into the input, I'd strongly recommend jupyter. (If you want to stay in a terminal, ipython is part of jupyter and heavily upgrades the built-in REPL and does 90% of what I'm mentioning here.)

You can break out each step into its own cell, save variables (though cell 5 will be auto-saved as a variable named _5) but the nicest thing is you can move cells around (check the keyboard shortcuts) and restart the entire kernel and rerun all your operations, essentially what you're getting with a long pipeline, only spread out over parts. And there are shortcuts like func? to pop up help on a function or func?? to see the source.

It's got some dependencies, so I'd recommend running it in a virtualenv via pipenv:

    pipenv install jupyter        # setup new virtualenv and add package
    pipenv run jupyter notebook
    pipenv --rm                   # Blow away the virtualenv

Also, look into pandas if you want to slurp a CSV and query it.
jancsika over 6 years ago
The problem with this is that there isn't a standard format forced on the args that follow the command name "cut".

What makes it worse is that there are seemingly patterns of standard format that get violated by other patterns. It's often based on when the utility was first authored and whatever ideas were floating around at the time. So sometimes characters can "clump" together behind a flag, under the assumption that multi-character flags will get two hyphens. Then some utilities or programs use a single hyphen for multi-character flags. Plus many other inconsistencies -- if I learn the basic range syntax for cut do I know the basic range syntax for imagemagick?

Those inconsistencies don't *technically* conflict since each only exists in the context of a particular utility. But it's a real pain to sanity to see those inconsistencies sitting on either side of a pipe, especially when one of them is wrong. (Or even when it's a single command you need but you use the wrong flag syntax.) That all adds to the cognitive load and can easily make a dev tired before it's time to go to sleep.

Oh, and that language switch from bash to python is a huge risk. If you're scripting with Python on a daily basis it probably doesn't seem like it. But for someone reading along, that language boundary is huge. Because the user is no longer limited to runtime errors and finicky arg formatting errors, but also language errors. If the command line barfs up an exception or syntax error at that boundary I'd bet most users would just give up and quit reading the rest of the blog.

Edit: clarification
almostarockstar over 6 years ago
This was a nice read and a good introduction to text processing with unix commands.

I agree with the other user re python usage - that you may as well use it for the whole task if you're going to use it at all - but I don't think it's a major flaw. It worked for you, right? I would suggest naming the python file a bit more descriptively though.

Interesting to read the other suggestions about dealing with this without python.
maratc over 6 years ago
$ join -v 2 <(ls | grep _A | sort | cut -c-4) <(ls | grep -v _A | sort | cut -c-4)

The shortest one I could come up with, no need to use python.

`join -v 2` shows the entries in the second sorted stream that don't have a match in the first sorted stream; the rest is self-explanatory, I hope.

Edit: $ join -v2 -t_ -j1 <(ls | grep _A | sort ) <(ls | grep -v _A | sort)

Is even shorter, it takes the first field (-j1) where fields are separated by '_' (-t_)
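For readers who want to see the `join -v 2` behaviour in isolation, here is a minimal sketch. The dataset is invented for illustration, and temporary files stand in for the `<(...)` process substitution (an assumption made so that the sketch also runs in a plain POSIX `sh`).

```shell
dir=$(mktemp -d)   # hypothetical dataset directory
tmp=$(mktemp -d)   # scratch space for the two sorted lists
touch "$dir/0001_data.csv" "$dir/0001_A.csv" \
      "$dir/0002_data.csv" \
      "$dir/0003_data.csv" "$dir/0003_A.csv"

# First list: numbers that have an _A result. Second list: all other files.
ls "$dir" | grep _A    | sort | cut -c1-4 > "$tmp/done"
ls "$dir" | grep -v _A | sort | cut -c1-4 > "$tmp/all"

# -v 2 prints only the lines of the second file with no match in the first.
missing=$(join -v 2 "$tmp/done" "$tmp/all")
echo "$missing"

rm -rf "$dir" "$tmp"
```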
sagartewari01 over 6 years ago
My favourite one is 'pkill -9 java'. Fixes my laptop if it starts lagging.
samwhiteUK over 6 years ago
I thought this was a neat demo of building up a command with UNIX tools. The python inclusion was a bit odd, yes.

I learned about sys.stdin in Python and cutting characters using the -c flag.
jclay over 6 years ago
After moving back to working on a Windows machine the last several years and being “forced” into using PowerShell, I now find myself using it for these sorts of tasks on Linux.

I now use PowerShell for any tasks of equal or greater complexity than the article. It’s such a massive upgrade over struggling to recall the peculiar bash syntax every time and the benefits of piping typed objects around are vast.

As a nice bonus, all of my PowerShell scripts run cross-platform without issue.
darrenf over 6 years ago
All the pipes and non-builtin commands (especially python!) look like overkill to me, I must say.

    for set in *_data.csv ; do
      num=${set/_*/}
      success=${set/data/A}
      if [ ! -e $success ] ; then
        echo $num ;
      fi
    done

ETA: likely specific to bash, since I have no experience with other shells except for dalliances with ksh and csh in the mid-90s.
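The `${var/pattern/replacement}` form used above is indeed not part of POSIX sh. A rough portable equivalent, assuming the same `NNNN_data.csv` / `NNNN_A.csv` naming, can be written with the `%` and `%%` suffix-removal expansions. The scratch directory and its contents are invented for the example.

```shell
dir=$(mktemp -d)   # hypothetical dataset directory
touch "$dir/0001_data.csv" "$dir/0001_A.csv" \
      "$dir/0002_data.csv" \
      "$dir/0003_data.csv" "$dir/0003_A.csv"

missing=$(
  cd "$dir" || exit 1
  for set in *_data.csv; do
    num=${set%%_*}                   # drop everything from the first "_"
    success=${set%_data.csv}_A.csv   # name the result file should have
    [ -e "$success" ] || echo "$num"
  done
)
echo "$missing"

rm -rf "$dir"
```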
ciucanu over 6 years ago
I usually do text processing in Bash, Notepad++ and Excel. Each has its own pros and cons, that's why I usually combine them.

Here you have the tools I use in Bash:

grep, tail, head, cat, cut, less, awk, sed, sort, uniq, wc, xargs, watch ...
ortekk over 6 years ago
If you are using python in your pipeline, might as well go all in!

    from pathlib import Path

    all_possible_filenames = {f'{i:04}A.csv' for i in range(1, 10)}
    # Compare by name: iterdir() yields Path objects, not strings
    cur_dir_filenames = {p.name for p in Path('.').iterdir()}
    missing_filenames = all_possible_filenames - cur_dir_filenames
    print(*missing_filenames, sep='\n')
omaranto over 6 years ago
The article solves the problem: for which numbers x between 1 and 500 is there no file x_A.csv? It looks like in this case it is equivalent to the easier problem: for which x_data.csv is there no corresponding x_A.csv?

    cd dataset-directory
    comm -23 <(ls *_data.csv | sed s/data/A/) <(ls *_A.csv)
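The idea above (rename each data file to the result name it should have, then take the set difference) can be sketched as below. The file set is hypothetical, and temporary files replace the process substitution purely so the sketch does not depend on bash.

```shell
dir=$(mktemp -d)   # hypothetical dataset directory
tmp=$(mktemp -d)   # scratch space for the two lists
touch "$dir/0001_data.csv" "$dir/0001_A.csv" \
      "$dir/0002_data.csv" \
      "$dir/0003_data.csv" "$dir/0003_A.csv"

# Result names we expect (derived from the data files) vs. names that exist.
(cd "$dir" && ls *_data.csv | sed 's/data/A/' | sort) > "$tmp/expected"
(cd "$dir" && ls *_A.csv | sort) > "$tmp/actual"

# comm -23 suppresses columns 2 and 3, leaving lines unique to the first file.
missing=$(comm -23 "$tmp/expected" "$tmp/actual")
echo "$missing"

rm -rf "$dir" "$tmp"
```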
iheartpotatoes over 6 years ago
I got paid $175/hr as a data analyst contractor to basically run bash, grep, sed, awk, perl. The people that hired me weren't dumb, just non-programmers and became giddy as I explained regular expressions. The gig only lasted 3 months, but I taught myself out of a job: once they got the gist of it they didn't need me. Yay?
kritixilithos over 6 years ago
Nicely done using Unix utils. You can have a pure sed solution (save the `ls` invocation) that is much simpler, albeit obscure, that hinges on the fact that every number has a `data.csv` file.

Given a sorted list of these files (through `ls` or otherwise) the following sed code will print out the data files for which A did not succeed.

    /data/!N
    /A/d
    P;D

This works on the fact that there exists a data file for all successful and unsuccessful runs on data, so sed simply prints the files for which there does not exist an `A` counterpart.

If you want to only print out the numbers, you can add a substitution or two towards the end.

    /data/!N
    /A/d
    s/^0*\|_.*//g;P;D

Edit: fixed the sed program
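To see what the first sed program does, it may help to run it on a short, already sorted listing. The five file names below are invented for the example. A line without `data` (an `_A` file) pulls in the next line with `N`, and the pair is deleted; a `data` line that arrives on its own has no `_A` partner and is printed.

```shell
# Hypothetical sorted listing: 0002 has a data file but no _A result.
listing='0001_A.csv
0001_data.csv
0002_data.csv
0003_A.csv
0003_data.csv'

missing=$(printf '%s\n' "$listing" | sed -e '/data/!N' -e '/A/d' -e 'P;D')
echo "$missing"
```

This relies on each `_A` file sorting immediately before its `_data` file, which holds for the naming scheme assumed here.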
LogicX over 6 years ago
Given the limited scope of files in the directory... not sure why it was necessary to use grep, instead of the built in glob?

    ls dataset-directory | egrep '\d\d\d\d_A.csv'

which FWIW wouldn't even work, on multiple levels: you need -1 on ls and no files end with A.csv

vs

    ls -1 dataset-directory/*_A?.csv

ref: http://man7.org/linux/man-pages/man7/glob.7.html

Update: apologies, apparently my client cached an older version of this page. At that time the files were named A1.csv and A2.csv
inp over 6 years ago
Instead of creating a Python script to convert the numbers to integers, you can use awk: "python3 parse.py" becomes "awk '{printf "%d\n", $0}'"
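A quick sketch of that awk replacement on some made-up sample values. Because `%d` formats the line as a number, an all-zero value comes out as `0` rather than as an empty line.

```shell
# Convert zero-padded strings to plain integers with awk's printf.
nums=$(printf '0001\n0010\n0500\n0000\n' | awk '{ printf "%d\n", $0 }')
echo "$nums"
```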
mklm over 6 years ago
If you don't mind "cd dataset-directory" beforehand, a shorter and possibly more correct version would be:

    comm -1 -3 <(ls *_A.csv | sed 's/_.*$//') <(seq -w 0500) | sed 's/^0*//'

The OP's solution doesn't seem correct because of the different ordering of the two inputs of `comm`: lexicographical (ls) and numeric (seq).
wmu over 6 years ago
It would be easier to just use 'cat list_of_numbers | sort | uniq -u' to get the unique entries.
pletnes over 6 years ago
Useless use of seq spotted. Seq does not exist on many systems. Bash has {0001..0500} instead.

Nice writeup though.
adamchainz over 6 years ago
I learnt a lot from the book Data Science at the Command Line, now free and online at https://www.datascienceatthecommandline.com/
pixelbeat__ over 6 years ago
Set operations are very useful. Here's a summary:

https://www.pixelbeat.org/cmdline.html#sets
js2 over 6 years ago
Not the most efficient solution but this is what springs to mind for me:

    seq 1000 | xargs printf '%04d_A.csv\n' | while read -r f; do test -f $f || echo $f; done
jon49 about 6 years ago
Use F# with a TypeProvider. Of course, I imagine it would take some work learning F# but once you learn it the sky is the limit in what you can do with this data.
Dowwie over 6 years ago
More power to those who enjoy writing control flow in shell, but if I need anything beyond a single line I'm going with an interactive ipython session.
dahfizz over 6 years ago
You could use one sed command to replace your grep, cut, and python. It feels cheap to use python to massage data in a post about the Unix command line.
oh5nxo over 6 years ago
Is there a nice alternative for seq or jot? Something neater than a for-loop in awk?
redka over 6 years ago
ls | rb 'group_by { |x| x[/\d+/] }.select { |_, y| y.one? }.keys'

https://github.com/thisredone/rb
BentFranklin over 6 years ago
For heavier duty text processing, try

emacs -e myfuns.el

When it comes to mashing text, nothing beats emacs.
Upvoter33 over 6 years ago
awk one liner: ls | awk '{split($1,x,"_"); split(x[2],y,"."); a[x[1]]+=1} END {for (i in a) {if (a[i] < 2) {print i}}}'
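The same counting idea can be written a little more compactly by letting awk split on `_` itself. This is a sketch on an invented listing, not a drop-in replacement: it counts how many files share each numeric prefix and prints the prefixes seen fewer than twice.

```shell
# Hypothetical listing: 0002 has only a data file.
listing='0001_A.csv
0001_data.csv
0002_data.csv
0003_A.csv
0003_data.csv'

# -F_ makes "_" the field separator, so $1 is the numeric prefix.
missing=$(printf '%s\n' "$listing" |
  awk -F_ '{ count[$1]++ } END { for (n in count) if (count[n] < 2) print n }')
echo "$missing"
```

Note that `for (n in count)` visits keys in an unspecified order, so with several missing numbers the output may need a final `sort`.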
iheartpotatoes over 6 years ago
The people that created the command line weren't L33T H4XOR NOOBS. They were brilliant PhD scientists. Let's not confuse the two.
sureaboutthis over 6 years ago
> I am starting to realize that the Unix command-line toolbox can fix absolutely any problem related to text wrangling.

Am I the only one who thought, "No shit, Sherlock"? This is a fundamental of UNIX that many people don't seem to grasp.
SomethingOrNot over 6 years ago
> I am starting to realize that the Unix command-line toolbox can fix absolutely any problem related to text wrangling.

How many problems related to text wrangling arise simply by working with Unix tools?

“This philosophical framework will help you solve problems internal to philosophy.”