Slightly off topic. The article doesn't call it out but there's a lovely assembly hack here. In:<p>bec 1f / branch if no error<p><pre><code> jsr r5,error / error in file name
<Input not found\n\0>; .even
sys exit
</code></pre>
jsr calls a subroutine, passing the return address in register 5. The routine error interprets the return address as a pointer to the string.<p>r5 is incremented in a loop, outputting one character at a time. When the NUL is found, it's time to return.<p>The instructions used to return from "error:" aren't shown, but there is a subtlety here, I think.<p>".even" after the string constant assures that the next instruction, "sys exit", to which "error:" is supposed to return, is aligned on an even address.<p>By implication, the return sequence in "error:" must just be sure to increment r5 if r5 is odd. I am guessing something like the pseudo-code (on a real PDP-11 the masking would be a bic, since there is no and instruction):<p>inc r5<p>bic $1, r5<p>rts r5
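The trick can be sketched in C (my own hypothetical analogue, not the actual V1 code, which is PDP-11 assembly): the "return address" actually points at an inline string, the routine prints it, and then resumes just past the string, rounded up to an even address.<p><pre><code>#include &lt;stdio.h&gt;
#include &lt;stdint.h&gt;

/* Hypothetical C analogue of the "error" routine: r5 points at an
 * inline NUL-terminated string; print it one character at a time,
 * then resume just past the string, aligned to an even address
 * (which .even guarantees is where "sys exit" lives). */
const char *error(const char *r5)
{
    while (*r5)                     /* emit one character at a time */
        putchar(*r5++);
    r5++;                           /* step past the NUL */
    /* round up to the next even address, as the guessed
     * "inc r5; bic $1,r5" sequence would: */
    uintptr_t p = (uintptr_t)r5;
    p = (p + 1) &amp; ~(uintptr_t)1;
    return (const char *)p;
}
</code></pre>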
After skimming through this, I navigated around Chris Siebenmann's site with the forward and back links, discovering something way more interesting than Unix strings and refreshingly relevant:<p>"How I do per-address blocklists with Exim"<p><a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/EximPerUserBlocklists" rel="nofollow">https://utcc.utoronto.ca/~cks/space/blog/sysadmin/EximPerUse...</a><p>I run Exim, and I'm also a huge believer in blocking spam at the SMTP level, and I also do some things globally that should perhaps be per-user. I'm eagerly interested in everything this fellow has to say.
I have seen it claimed that null-terminated strings were encouraged by the instruction sets of the time -- that some instruction sets make null-terminated sequences easier to handle than length-prefixed ones. The article's error-message-printing code snippet is a good example. Does anyone think there is any truth to this?
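One concrete illustration (my own sketch, not from the article): the classic K&amp;R copy loop maps almost one-to-one onto PDP-11 autoincrement addressing, with the NUL doubling as the loop condition because the moved byte sets the condition codes.<p><pre><code>/* The canonical C copy loop: one byte moved, both pointers bumped,
 * and the copied byte itself tested for zero -- on a PDP-11 this is
 * essentially "1: movb (r1)+,(r2)+ ; bne 1b".  A counted copy would
 * instead need a separate length register and a decrement/test. */
static void copy(char *dst, const char *src)
{
    while ((*dst++ = *src++) != '\0')
        ;
}
</code></pre>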
I always felt like NUL-termination, newline-separation, and (eventually) UTF-8 were all sort of complementary ideas: they all take as an axiom that strings are fundamentally streams, not random-access buffers; and they all separate the space of single-byte coding units, by simple one-bitwise-instruction-differentiable means, into multiple lexical token types.<p>Taking all three together, you end up with the conception of a "string" as a serialized bitstring encoding a sequence of five <i>lexical</i> types: a NUL type (like the EOF "character" in an STL stream), an ASCII control-code type (or a set of individual control codes as types, if you like), a set of UTF-8 "beginning of rune" types for each possible rune length, a "byte continuing rune" type, and an ASCII-printable type. (You then feed this stream into another lexer to put the rune-segment-tokens together into rune-tokens.)<p>In the end, it's not a surprise that all of these components were effectively from a single coherent design, thought up by Ken Thompson. It's a bit annoying that each part ended up introduced as part of a separate project, though: NULs with Unix, gets() with C, and runes with Plan 9.<p>One of the pleasant things about Go's string support, I think, is that it was an opportunity for Ken to express the entirety of his string semantics as a single ADT type. That part of the compiler is quite lovely.
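The "one-bitwise-instruction" classification above can be sketched like this (my own illustration; the enum names are mine):<p><pre><code>/* Classify a single byte of a NUL-terminated UTF-8 stream into the
 * lexical types described above, using only masks and compares. */
enum byte_class {
    BC_NUL,        /* 0x00: end of string             */
    BC_CONTROL,    /* 0x01..0x1f, 0x7f: ASCII control */
    BC_PRINTABLE,  /* 0x20..0x7e: printable ASCII     */
    BC_CONT,       /* 10xxxxxx: continues a rune      */
    BC_LEAD2,      /* 110xxxxx: starts a 2-byte rune  */
    BC_LEAD3,      /* 1110xxxx: starts a 3-byte rune  */
    BC_LEAD4,      /* 11110xxx: starts a 4-byte rune  */
    BC_INVALID     /* 0xf8..0xff never appear in UTF-8 */
};

static enum byte_class classify(unsigned char b)
{
    if (b == 0x00)              return BC_NUL;
    if (b &lt; 0x20 || b == 0x7f) return BC_CONTROL;
    if (b &lt; 0x80)              return BC_PRINTABLE;
    if ((b &amp; 0xc0) == 0x80)    return BC_CONT;
    if ((b &amp; 0xe0) == 0xc0)    return BC_LEAD2;
    if ((b &amp; 0xf0) == 0xe0)    return BC_LEAD3;
    if ((b &amp; 0xf8) == 0xf0)    return BC_LEAD4;
    return BC_INVALID;
}
</code></pre>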
How else would you implement them, seriously?<p>You have two choices, counted or terminated.<p><i>Counted</i> places a complexity burden at the lowest level of coding.<p>With <i>terminated</i> you still have the option of implementing strings with structs or arrays with counts or anything.<p>And people did, of course. Many, many different implementations of safe strings exist in C; the fact that none has won out <i>vindicates</i> the decision to use sentinel termination.
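A counted layer on top of terminated strings really is easy to build; a minimal sketch (the struct and names are mine):<p><pre><code>#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

/* A minimal counted string on top of C: the length is stored
 * alongside the bytes, and a terminating NUL is kept so it still
 * interoperates with the NUL-terminated world. */
struct cstr {
    size_t len;
    char  *data;
};

static struct cstr cstr_from(const char *s)
{
    struct cstr r;
    r.len  = strlen(s);
    r.data = malloc(r.len + 1);
    memcpy(r.data, s, r.len + 1);   /* copy the NUL too */
    return r;
}
</code></pre>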
The predecessor of Unix, Multics, was written in PL/I and was very innovative (modern OSes still borrow its "new ideas"): <a href="https://en.wikipedia.org/wiki/Multics" rel="nofollow">https://en.wikipedia.org/wiki/Multics</a>