Gary Bernhardt[1] gave a great talk about practical problem solving with the unix shell: "The Unix Chainsaw"[2].<p>"Half-assed is OK when you only need half of an ass."<p>In the talk, he gives several demonstrations of a key aspect of <i>why</i> unix pipelines are so practically useful: you build them <i>interactively</i>. A complicated 4-line pipeline started as a single command that was gradually refined into something that actually solves a complicated problem. The talk demonstrates the part that isn't included in the usual tutorials or "cool 1-line command" lists: the cycle of "Try something. Hit up to get the command back. Make one iterative change and try again."<p>[1] You might know him from his other hilarious talks like "The Birth & Death of JavaScript" or "Wat".<p>[2] <a href="https://www.youtube.com/watch?v=sCZJblyT_XM" rel="nofollow">https://www.youtube.com/watch?v=sCZJblyT_XM</a>
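As a sketch of that cycle (with a hypothetical access.log in common log format, where field 7 is the request path), each step recalls the previous command and adds one stage:<p><pre><code> $ grep ' 404 ' access.log
 $ grep ' 404 ' access.log | awk '{print $7}'
 $ grep ' 404 ' access.log | awk '{print $7}' | sort | uniq -c
 $ grep ' 404 ' access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head</code></pre>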
This [0] is the most complete post I've read on the topic.
Lays out all the relevant tools.
Spending some time going through each tool's documentation/options pays off tremendously.<p>[0]: <a href="https://www.ibm.com/developerworks/aix/library/au-unixtext/index.html" rel="nofollow">https://www.ibm.com/developerworks/aix/library/au-unixtext/i...</a>
The brilliant fun of working with the Unix CLI toolset is that there are millions of valid ways to solve a problem. I also thought of a “better” solution of my own that took an entirely different approach from most of the ones posted here. That’s not really the point.<p>What’s great about this article is that it follows the process of solving the problem step by step. I find that lots of programmers I work with struggle with CLI problem solving, which I find a little surprising. But I think it all depends on how you think about problems like this.<p>If you start from “how can I build a function to operate on this raw data?” or “what data structure would best express the relationship between these filenames?” then you will have a hard time. But if you think in terms of “how can I mutate this data to eliminate extraneous details?” and “what tools do I have handy that can solve problems on data like this given a bit of munging, and how can I accomplish that bit of munging?”, and if you can accept taking several baby steps of small operations on every line of the full dataset rather than building and manipulating abstract logical structures, then you’re well on your way to making efficient use of this remarkable toolset to solve ad hoc problems like this one in minutes instead of hours.
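To give those baby steps a concrete (hypothetical) flavor, whittling a raw line down to just the detail you care about might look like:<p><pre><code> $ cat results.txt
 run 0003: FAILED (timeout)
 run 0007: ok
 $ grep FAILED results.txt
 run 0003: FAILED (timeout)
 $ grep FAILED results.txt | cut -d' ' -f2
 0003:
 $ grep FAILED results.txt | cut -d' ' -f2 | tr -d ':'
 0003</code></pre>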
Removing leading zeroes doesn't require Python. One easy solution would be sed:<p><pre><code> $ echo -e '0001\n0010\n0002' | sed 's/^0*//'
1
10
2</code></pre>
A change in structure might be helpful:<p><pre><code> $ ls data
0001.csv 0002.csv 0003.csv 0004.csv ...
$ ls algorithm_a
0001.csv 0002.csv 0004.csv ...
$ diff -q algorithm_a data |grep ^Only |sed 's/.*: //g'
0003.csv ...</code></pre>
For learning to get things done with Unix, I recommend the two old books "The Unix Programming Environment" and "The AWK Programming Language". There are many resources for learning the various commands etc., but there is still no better place than those books to learn the "unix philosophy". This series is also good:<p><a href="https://sanctum.geek.nz/arabesque/series/unix-as-ide/" rel="nofollow">https://sanctum.geek.nz/arabesque/series/unix-as-ide/</a>
I think the best part about using Unix tools is that they force you to break down the problem into tiny steps.<p>You can see feedback every step of the way by removing and adding back piped commands, so you're never really dealing with more than one operation at a time, which makes debugging and making progress a lot easier than trying to fit everything together at once.
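For example (hypothetical data.csv), truncating the pipeline at any stage shows exactly what the next stage will receive:<p><pre><code> $ cut -d, -f2 data.csv | head -3          # check the extraction first
 $ cut -d, -f2 data.csv | sort | uniq -c   # then add a stage and look again</code></pre>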
I've often done this, usually not for a large dataset, but it's sometimes helpful to pipe text through Unix commands in Emacs. C-u M-| sort, for instance, will run the selection through sort and replace it in place.<p>If you're going the all-Python route, want to be able to run bash commands, and want something where you can feed the output of one step into the input of the next, I'd strongly recommend jupyter. (If you want to stay in a terminal, ipython is part of jupyter, heavily upgrades the built-in REPL, and does 90% of what I'm mentioning here.)<p>You can break out each step into its own cell and save variables (though cell 5 will be auto-saved as a variable named _5), but the nicest thing is that you can move cells around (check the keyboard shortcuts) and restart the entire kernel and rerun all your operations: essentially what you're getting with a long pipeline, only spread out over parts. And there are shortcuts like func? to pop up help on a function or func?? to see the source.<p>It's got some dependencies, so I'd recommend running it in a virtualenv via pipenv:<p><pre><code> pipenv install jupyter # setup new virtualenv and add package
pipenv run jupyter notebook
pipenv --rm # Blow away the virtualenv
</code></pre>
Also, look into pandas if you want to slurp a CSV and query it.
The problem with this is that there isn't a standard format forced on the args that follow the command name "cut".<p>What makes it worse is that there are seemingly standard patterns that get violated by other patterns. It's often based on when the utility was first authored and whatever ideas were floating around at the time. So sometimes characters can "clump" together behind a flag, under the assumption that multi-character flags will get two hyphens. Then some utilities or programs use a single hyphen for multi-character flags. Plus many other inconsistencies: if I learn the basic range syntax for cut, do I know the basic range syntax for imagemagick?<p>Those inconsistencies don't <i>technically</i> conflict, since each only exists in the context of a particular utility. But it's a real strain on sanity to see those inconsistencies sitting on either side of a pipe, especially when one of them is wrong. (Or even when it's a single command you need but you use the wrong flag syntax.) That all adds to the cognitive load and can easily make a dev tired before it's time to go to sleep.<p>Oh, and that language switch from bash to python is a huge risk. If you're scripting with Python on a daily basis it probably doesn't seem like it. But for someone reading along, that language boundary is huge. Because the user is no longer limited to runtime errors and finicky arg formatting errors, but also exposed to language errors. If the command line barfs up an exception or syntax error at that boundary I'd bet most users would just give up and quit reading the rest of the blog.<p>Edit: clarification
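To make the concern concrete, here are a few coexisting flag conventions, each valid for its own tool:<p><pre><code> tar -xzf archive.tar.gz    # short flags clump behind one hyphen
 find . -name '*.csv'       # multi-character flag with a single hyphen
 dd if=disk.img of=copy.img # no hyphens at all
 cut -c1-4 file.txt         # cut's own range syntax</code></pre>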
This was a nice read and a good introduction to text processing with unix commands.<p>I agree with the other user re python usage - that you may as well use it for the whole task if you're going to use it at all - but I don't think it's a major flaw. It worked for you right? I would suggest naming the python file a bit more descriptively though.<p>Interesting to read the other suggestions about dealing with this without python.
$ join -v 2 <(ls | grep _A | sort | cut -c-4) <(ls | grep -v _A | sort | cut -c-4)<p>The shortest one I could come up with, no need to use python.<p>`join -v 2` shows the entries in the second sorted stream that don't have match in the first sorted stream, the rest is self-explanatory I hope.<p>Edit:
$ join -v2 -t_ -j1 <(ls | grep _A | sort ) <(ls | grep -v _A | sort)<p>This is even shorter: it joins on the first field (-j1), with fields separated by '_' (-t_).
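A toy illustration of the -v semantics on two inline sorted streams:<p><pre><code> $ join -v 2 <(printf '0001\n0002\n') <(printf '0001\n0002\n0003\n')
 0003</code></pre>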
I thought this was a neat demo of building up a command with UNIX tools. The python inclusion was a bit odd, yes.<p>I learned about sys.stdin in Python and about cutting characters using cut's -c flag.
After moving back to working on a Windows machine the last several years and being “forced” into using PowerShell, I now find myself using it for these sorts of tasks on Linux.<p>I now use PowerShell for any tasks of equal or greater complexity than the article. It’s such a massive upgrade over struggling to recall the peculiar bash syntax every time and the benefits of piping typed objects around are vast.<p>As a nice bonus, all of my PowerShell scripts run cross-platform without issue.
All the pipes and non-builtin commands (especially python!) look like overkill to me, I must say.<p><pre><code> for set in *_data.csv ; do
   num=${set/_*/}          # strip from the first underscore: 0001_data.csv -> 0001
   success=${set/data/A}   # corresponding success file: 0001_data.csv -> 0001_A.csv
   if [ ! -e "$success" ] ; then echo "$num" ; fi
 done
</code></pre>
ETA: likely specific to bash, since I have no experience with other shells except for dalliances with ksh and csh in the mid-90s.
I usually do text processing in Bash, Notepad++ and Excel. Each has its own pros and cons, which is why I usually combine them.<p>Here are the tools I use in Bash:<p>grep,
tail,
head,
cat,
cut,
less,
awk,
sed,
sort,
uniq,
wc,
xargs,
watch
...
If you are using python in your pipeline, might as well go all in!<p><pre><code> from pathlib import Path
all_possible_filenames = {f'{i:04}_A.csv' for i in range(1, 501)}
cur_dir_filenames = {p.name for p in Path('.').iterdir()}
missing_filenames = all_possible_filenames - cur_dir_filenames
print(*sorted(missing_filenames), sep='\n')</code></pre>
The article solves the problem: for which numbers x between 1 and 500 is there no file x_A.csv? It looks like in this case it is equivalent to the easier problem: for which x_data.csv is there no corresponding x_A.csv?<p><pre><code> cd dataset-directory
comm -23 <(ls *_data.csv | sed s/data/A/) <(ls *_A.csv)</code></pre>
I got paid $175/hr as a data analyst contractor to basically run bash, grep, sed, awk, perl. The people that hired me weren't dumb, just non-programmers and became giddy as I explained regular expressions. The gig only lasted 3 months, but I taught myself out of a job: once they got the gist of it they didn't need me. Yay?
Nicely done using Unix utils. You can have a pure sed solution (save the `ls` invocation) that is much simpler, albeit obscure, that hinges on the fact that every number has a `data.csv` file.<p>Given a sorted list of these files (through `ls` or otherwise) the following sed code will print out the data files for which A did not succeed on them.<p><pre><code> /data/!N
/A/d
P;D
</code></pre>
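One way to run it (GNU sed; -e keeps the three commands separate):<p><pre><code> $ ls dataset-directory | sed -e '/data/!N' -e '/A/d' -e 'P;D'</code></pre>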
This works on the fact that there exists a data file for all successful and unsuccessful runs on data, so sed simply prints the files for which there does not exist an `A` counterpart.<p>If you want to only print out the numbers, you can add a substitution or two towards the end.<p><pre><code> /data/!N
/A/d
s/^0*\|_.*//g;P;D
</code></pre>
Edit: fixed the sed program
given the limited scope of files in the directory... not sure why it was necessary to use grep instead of the built-in glob?<p><pre><code> ls dataset-directory | egrep '\d\d\d\d_A.csv'</code></pre>
which FWIW wouldn't even work, on multiple levels: you need -1 on ls and no files end with A.csv<p><pre><code> vs
ls -1 dataset-directory/*_A?.csv
</code></pre>
ref: <a href="http://man7.org/linux/man-pages/man7/glob.7.html" rel="nofollow">http://man7.org/linux/man-pages/man7/glob.7.html</a><p>Update: apologies, apparently my client cached an older version of this page. at that time the files were named A1.csv and A2.csv
Instead of creating a Python script to convert the numbers to integers, you can use awk: "python3 parse.py" becomes "awk '{printf "%d\n", $0}'"
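For instance, on the same leading-zero input as the sed example above:<p><pre><code> $ echo -e '0001\n0010\n0002' | awk '{printf "%d\n", $0}'
 1
 10
 2</code></pre>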
If you don't mind "cd dataset-directory" beforehand, a shorter and possibly more correct version would be:<p><pre><code> comm -1 -3 <(ls *_A.csv | sed 's/_.*$//') <(seq -w 0500) | sed 's/^0*//'
</code></pre>
The OP's solution doesn't seem correct because of the different ordering of the two inputs of `comm': lexicographical (ls) and numeric (seq).
I learnt a lot from the book Data Science at the Command Line, now free and online at <a href="https://www.datascienceatthecommandline.com/" rel="nofollow">https://www.datascienceatthecommandline.com/</a>
Set operations are very useful. Here's a summary:<p><a href="https://www.pixelbeat.org/cmdline.html#sets" rel="nofollow">https://www.pixelbeat.org/cmdline.html#sets</a>
Not the most efficient solution but this is what springs to mind for me:<p><pre><code> seq 1000 | xargs printf '%04d_A.csv\n' | while read -r f; do test -f "$f" || echo "$f"; done</code></pre>
Use F# with a TypeProvider. Of course, I imagine it would take some work learning F# but once you learn it the sky is the limit in what you can do with this data.
More power to those who enjoy writing control flow in shell, but if I need anything beyond a single line I'm going with an interactive ipython session.
awk one-liner:<p><pre><code> ls | awk '{split($1,x,"_"); a[x[1]]+=1} END {for (i in a) {if (a[i] < 2) {print i}}}'</code></pre><p>It counts how many files share each numeric prefix and prints the prefixes that appear only once (a data file with no A file).
> I am starting to realize that the Unix command-line toolbox can fix absolutely any problem related to text wrangling.<p>Am I the only one who thought, "No shit, Sherlock"? This is a fundamental of UNIX that many people don't seem to grasp.
> I am starting to realize that the Unix command-line toolbox can fix absolutely any problem related to text wrangling.<p>How many problems related to text wrangling arise simply by working with Unix tools?<p>“This philosophical framework will help you solve problems internal to philosophy.”