For me, the most surprising one was paste.<p>paste allowed me to interleave two streams or to split a single stream into two columns. I'd been writing custom scripting monstrosities before I discovered paste:<p><pre><code> $ paste <( echo -e 'foo\nbar' ) <( echo -e 'baz\nqux' )
foo baz
bar qux
$ echo -e 'foo\nbar\nbaz\nqux' | paste - -
foo bar
baz qux
</code></pre>
I wonder what other unix gems I've been missing...
> Hidden inside WWB (writer's workbench), Lorinda Cherry's Parts annotated English text with parts of speech, based on only a smidgen of English vocabulary, orthography, and grammar.<p>Writer's Workbench was indeed a marvel of 1970's limited-space engineering. You can see it for yourself [1]: the generic part-of-speech rules are in end.l, the exceptions in edict.c and ydict.c, and the part-of-speech disambiguator in pscan.c. Such compact, rule-based NLP has fallen out of favor these days but (shameless plug alert!) Writer's Workbench inspired my 2018 IOCCC entry that highlights passive constructions in English texts [2].<p>[1] <a href="https://github.com/dspinellis/unix-history-repo/tree/BSD-4_1_snap-Snapshot-Development/.ref-BSD-4/usr/src/cmd/diction" rel="nofollow">https://github.com/dspinellis/unix-history-repo/tree/BSD-4_1...</a><p>[2] <a href="https://ioccc.org/2018/ciura/hint.html" rel="nofollow">https://ioccc.org/2018/ciura/hint.html</a>
One of the useful applications of trigram-based analysis I have done is the following: for a large web-based application form where about 200,000 online applications were made, we had to filter out the dummy applications - often, people would try out the interface using "aaa" as a name, for example.<p>Since the names were mostly Indian, we did not even have a standard database of names to test against.<p>What we did was the following: go through the entire database of all applications, and build a trigram frequency table. Then, using that trigram table, do a second pass over the database of names to find names with anomalous trigrams - if the percentage of anomalous trigrams in a name was too high (if the name was long enough), or the absolute number of anomalous trigrams in the name was too high (if the name was short), we flagged the application and examined it manually. Using this alone, we were able to filter out a large number of dummy application forms.<p>Of course, it is not a comprehensive tool since what forms a valid name is very vague, but I think this kind of tool is useful and culture-neutral.
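A rough sketch of that kind of filter, assuming one name per line in a hypothetical names.txt; the rarity threshold and the flagging cutoffs are purely illustrative, not the ones actually used:<p><pre><code># pass 1: build a trigram frequency table over every submitted name
awk '{ s = tolower($0); gsub(/[^a-z]/, "", s)
       for (i = 1; i <= length(s) - 2; i++) freq[substr(s, i, 3)]++ }
     END { for (t in freq) print t, freq[t] }' names.txt > trigrams.txt

# pass 2: flag names with too many rare ("anomalous") trigrams
awk 'NR == FNR { freq[$1] = $2; next }            # load the trigram table
     { s = tolower($0); gsub(/[^a-z]/, "", s); n = 0; rare = 0
       for (i = 1; i <= length(s) - 2; i++) {
           n++
           if (freq[substr(s, i, 3)] < 5) rare++  # trigram seen fewer than 5 times overall
       }
       if (n > 0 && (rare / n > 0.5 || rare >= 3)) print "SUSPECT:", $0
     }' trigrams.txt names.txt
</code></pre>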
I didn't know about typo. One surprising unix program I discovered this year is cal (or ncal). Having a calendar in your terminal is sometimes useful, and I wish I had known earlier that I could type things like <i>ncal -w 2020</i>
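A few more variants worth knowing (flags as found in the BSD/util-linux versions):<p><pre><code>$ cal              # the current month
$ cal -3           # previous, current, and next month
$ ncal -w 2020     # all of 2020, with week numbers
</code></pre>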
And people say theoretical computer science isn’t useful in “the real world”…<p>I am curious about this one, though: has anyone used it?<p>> The syntax diagnostics from the compiler made by Sue Graham's group at Berkeley were the most helpful I have ever seen--and they were generated automatically. At a syntax error the compiler would suggest a token that could be inserted that would allow parsing to proceed further. No attempt was made to explain what was wrong.<p>On the surface it sounds a lot like it would produce error messages like “expected ‘;’” that most beginner programmers come to hate: was it any better than this, or was that the extent of its intelligence, with everything else at the time being even worse?
The author is THE Doug McIlroy. It's wonderful to learn that he's still around and spreading the good word.<p><a href="https://en.wikipedia.org/wiki/Douglas_McIlroy" rel="nofollow">https://en.wikipedia.org/wiki/Douglas_McIlroy</a>
« Typo was as surprising inside as it was outside. Its similarity measure was based on trigram frequencies, which it counted in a 26x26x26 array. The small memory, which had barely room enough for 1-byte counters, spurred a scheme for squeezing large numbers into small counters. To avoid overflow, counters were updated probabilistically to maintain an estimate of the logarithm of the count. »<p>This sounds like something from the same family as hyperloglog.<p>Wikipedia traces that back to the Flajolet–Martin algorithm in 1984. When would typo have been written?
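The counter trick described there is the same idea Robert Morris later published as approximate counting ("Counting Large Numbers of Events in Small Registers", CACM 1978), which predates Flajolet–Martin. A toy illustration in awk, keeping c ≈ log2(n) by incrementing with probability 1/2^c; with a single counter the estimate is quite noisy:<p><pre><code>$ seq 100000 |
  awk 'BEGIN { srand() }
       { if (rand() < 1 / 2^c) c++ }    # bump the counter with probability 1/2^c
       END { printf "lines: %d  counter: %d  estimate: %d\n", NR, c, 2^c - 1 }'
</code></pre>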
How about GNU parallel? <a href="https://www.gnu.org/software/parallel/" rel="nofollow">https://www.gnu.org/software/parallel/</a>
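A couple of typical invocations, for anyone who hasn't tried it (the file names are hypothetical):<p><pre><code>$ parallel gzip ::: *.log                     # compress each log file on its own core
$ cat urls.txt | parallel -j8 'curl -sO {}'   # fetch URLs, at most 8 downloads at a time
</code></pre>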
What about "comm" - compare two sorted files line by line.
You can easily get the lines that appear only in file 1, in both files, or only in file 2.<p>Super powerful, and it has saved me hours of work.
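For anyone who hasn't used it: comm expects sorted input, and -1/-2/-3 suppress the corresponding columns, so you combine them to keep just the column you want (a.txt and b.txt are placeholders):<p><pre><code>$ sort a.txt -o a.txt; sort b.txt -o b.txt   # comm requires sorted input
$ comm -23 a.txt b.txt                       # lines only in a.txt
$ comm -13 a.txt b.txt                       # lines only in b.txt
$ comm -12 a.txt b.txt                       # lines common to both
</code></pre>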
The fact that dc does (or at least tries to) guarantee error bounds on the <i>result</i> is news to me.<p>And if that does indeed work, that's pretty cool.
> <i>struct - Brenda Baker undertook her Fortran-to-Ratfor converter against the advice of her department head--me. I thought it would likely produce an ad hoc reordering of the original, freed of statement numbers, but otherwise no more readable than a properly indented Fortran program. Brenda proved me wrong. She discovered that every Fortran program has a canonically structured form. Programmers preferred the canonicalized form to what they had originally written.</i><p>We could've had prettier et al instead of style linters 40(+?) years ago. :(
I've written a few useful scripts that everyone should have.<p>histogram - simply counts each occurrence of a line and then outputs from highest to lowest. I've implemented this program in several different languages for learning purposes. There are practical tricks that one can apply, such as hashing any line longer than the hash itself.<p>unique - like uniq but doesn't need to have sorted input! Again, one can simply hash very long lines to save memory.<p>datetimes - looks for numbers that might be dates (seconds or milliseconds in certain reasonable ranges) and adds the human-readable version of each date as a comment at the end of the line it appears in. This is probably my most used script (I work with protocol buffers that often store dates as int64s).<p>human - reformats numbers into either powers of 2 or powers of 10. Inspired, obviously, by the -h and -H flags from df.<p>I'm sure I have a few more, but if I can't remember them off the top of my head, then they clearly aren't quite as generally useful.<p>Anyone else have some useful scripts like these?
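For the first two, stock tools get you most of the way there (minus the long-line hashing trick); a minimal sketch with a hypothetical input.txt:<p><pre><code># histogram: count each distinct line, most frequent first
$ sort input.txt | uniq -c | sort -rn

# unique: drop duplicate lines without sorting, preserving first-seen order
$ awk '!seen[$0]++' input.txt
</code></pre>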
<i>“To avoid overflow, counters were updated probabilistically to maintain an estimate of the logarithm of the count.”</i><p>Stuff like this really makes me love what the pioneers of CS did. Back then, they were counting every byte and every register, while nowadays programmers make things without considering the impact they will have on the hardware.
> The math library for Bob Morris's variable-precision desk calculator
used backward error analysis to determine the precision necessary at
each step to attain the user-specified precision of the result.<p>I wonder if compilers could do this today? If you can bound the values of floating-point operations, you might be able to replace them with fixed-point equivalents and get a big speedup. You might also be able to replace them with ints or smaller floats if you can detect that the result is rounded to an int.<p>CPUs could also do this, since they know (some of) the actual values at runtime, and could take shortcuts with floating-point calculations where full precision is not needed for the result.
What's surprising about eqn, dc, and egrep? I use the latter two all the time, and I used eqn (+troff/groff and even tbl and pic) in the 1990s for manuals and as late as the (early) 2000s to typeset math-heavy course material. Not nearly as feature-rich as TeX/LaTeX, but much more approachable for casual math, with DSLs for typesetting equations, tables, and diagrams/graphs. I was delighted to see that GNU had a full suite of roff/troff drop-in replacements (which I later learned were implemented by James Clark, of SGML and, recently, Ballerina fame).
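For anyone who has never seen it, eqn input reads almost like dictated math; a tiny end-to-end example (the file name is just for illustration):<p><pre><code>$ cat > quadratic.ms <<'EOF'
.EQ
x = {- b +- sqrt { b sup 2 - 4ac }} over { 2a }
.EN
EOF
$ groff -e -ms -Tpdf quadratic.ms > quadratic.pdf
</code></pre>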
I have found it useful to survey the existing unix utilities (maybe every few years). I'm no genius, but I find things I will use. One way, of course, is simply to review the names wherever your system stores manual pages, and read (or skim) the ones you don't recognize, trying out some things, or at least trying to remember where to look them up later when you're ready to use them. Another is to browse to <a href="https://man.openbsd.org/" rel="nofollow">https://man.openbsd.org/</a>, put a single period (".") in the search field, optionally choose a section (and/or another system; I'm not sure how far the coverage goes), and click the apropos button.
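The same survey works offline with man -k (apropos), since a period matches every page name:<p><pre><code>$ man -k . | less                 # every installed page with its one-line description
$ man -k . | grep '(1)' | less    # rough filter for section 1 (user commands)
</code></pre>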
Doug McIlroy is regularly active in the groff mailing list <a href="https://lists.gnu.org/archive/html/groff/" rel="nofollow">https://lists.gnu.org/archive/html/groff/</a>
Crabs seems like a really cool program.<p>Here is a paper from Bell Labs:<p><a href="http://lucacardelli.name/Papers/Crabs.pdf" rel="nofollow">http://lucacardelli.name/Papers/Crabs.pdf</a>
I didn't find egrep surprising - I use it quite often. The thing I didn't know about it was that it was Al Aho's creation. I only knew about him from awk.