Yeah, if we had only used strings marked with 2-byte integers, everybody would have been happy, because a 64 KB string is enough for everyone. (And let's be realistic: nobody sane would have chosen a 4-byte string length back in the early 70s.)

So, if we had gone down that path, what would we have? All the fun of having "legacy" APIs that seem to work but internally only accept strings up to 64 KB and mysteriously chop off the excess bytes when you least expect it. It's the Y2K problem all over again.

And just when you finally think you're done with it, memory is cheaper again, size_t is 64-bit, and someone invariably wants to store a binary blob >4 GB as a string. Fun times again.

Have we forgotten how much trouble we went through in the 90s to handle memory on the x86 "640K is enough for everybody" architecture?
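A contrived sketch of that failure mode (the struct and function here are hypothetical, purely for illustration):

    #include <stdint.h>
    #include <string.h>

    /* A hypothetical "legacy" string type with a 2-byte length field. */
    struct lstr {
        uint16_t len;            /* caps every string at 65,535 bytes */
        char     data[65535];
    };

    /* Silently mangles anything longer than 65,535 bytes. */
    void lstr_set(struct lstr *s, const char *src, size_t n)
    {
        s->len = (uint16_t)n;            /* n > 65535 quietly wraps here */
        memcpy(s->data, src, s->len);    /* excess bytes are chopped off */
    }

That cast is exactly the "mysteriously chop off excess bytes" behavior: no error, no warning, just a shorter string.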
Oh, the irony of history.

In the very week that str+len is being abused left and right, someone surfaces to the front page an article about how str+NUL is wrong and everyone should use str+len.
NUL-terminated strings were the right decision for C. They're certainly much simpler than length fields.

Consider using a length field. How big should that field be? If it's fixed-size, you introduce complications regarding how big a string you can represent, and differences in field sizes across architectures. If it's variable-sized (a la UTF-8), then you've added different complications: you would need library functions to read and write the length, to get access to the string contents, to calculate the amount of memory required to hold a string of a given size, etc. Very much not in the spirit of C.

Next, what endianness should that field have? NUL-terminated strings have no endianness issues: they can be trivially written to files, embedded in network packets, whatever. But with a length field, we either need to remember to marshal the string, or allow for the length field to not be in native byte order. Neither is a pleasant prospect, especially for a 1970s C programmer.

Also, consider C-style string parsing, e.g. strtok/strsep. These could not be implemented with length-field strings.

Explicit length is better when you have an enforced abstraction, like std::string, but at that point you're not writing in C. If you have to pick an exposed representation, NUL termination is much better than Pascal-style length fields.

So what was the "one-byte mistake"? The article says that it was saving a byte by using NUL termination instead of a two-byte length field. Had K&R not made that "mistake," we would be unable to make a string longer than 65,535 bytes, a far more serious limitation than anything NUL termination imposes!

K&R got it right.
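On the strtok point, a minimal sketch (not the libc implementation) of why in-place tokenization leans on NUL termination:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[] = "root:x:0:0";

        /* strtok overwrites each ':' with '\0' in place, so every token
           becomes a valid C string pointing into the original buffer:
           no allocation, no copying.  With a length-prefix representation
           there is nowhere to record each token's length without
           allocating headers or carrying separate (ptr, len) pairs. */
        for (char *tok = strtok(line, ":"); tok != NULL;
             tok = strtok(NULL, ":"))
            printf("token: %s\n", tok);
        return 0;
    }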
A *lot* of programming decisions were made to save a byte here and there. It's easy to point at them today and say they're "bad", but at the time they were the absolutely *correct* thing to do. It's hard to imagine now, but not saving that byte could mean your program wouldn't fit into RAM. Try telling your management in the 1960s that your program won't load because it's "properly coded" and see how far you get.

What we've failed to do is ever revisit those decisions and change them where we've identified problems. Yes, you can probably compile (with warnings) files from UNIX v7, but we pay for that compatibility. There's no question that designing, building, and maintaining a libc alternative is a colossal undertaking and not likely to happen on a whim. So here we are.
Well, strings without an explicit length field allow for things like strstr(3) or suffix/prefix parsing without the performance penalty of allocating and copying: any pointer into a NUL-terminated string is itself a valid string.
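A tiny example of that property:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *url = "https://example.com/path";

        /* strstr returns a pointer into the original buffer; the suffix
           it points at is already a complete C string, so there is no
           allocation or copying involved. */
        const char *sep = strstr(url, "://");
        if (sep != NULL)
            printf("after the scheme: %s\n", sep + 3);
        return 0;
    }

With an explicit length up front, taking a suffix would mean either copying the bytes or passing around a separate (pointer, length) pair.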
When I was at PARC, the Mesa guys (who had counted strings) did some analysis and (at least in those days) the counted strings ended up being, in aggregate, faster. I suspect the advantage would be even greater these days, since memory allocation was a bigger deal back then.

I wonder if you could do this compatibly in the compiler by adding another primitive type (counted string) which had the length in the bytes before the start of the null-terminated string. You'd need a new type because various routines in the standard library would have to invisibly have two versions for counted and non-counted strings (since if you incremented a string pointer, or used a function like strchr, you'd have to treat it as a regular char*). "Safe" code would use a different call (say, cstrchr) that returned an index instead of a char*. The compiler could optionally warn on unsafe "legacy" calls as it can with strcpy instead of strncpy.
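A rough sketch of that layout, with made-up names (roughly the trick later used by sds strings in Redis): the length lives just before the data, and the data stays null-terminated, so the char* remains usable by legacy code.

    #include <stdlib.h>
    #include <string.h>

    typedef char cstr;   /* hypothetical counted-string handle */

    cstr *cstr_new(const char *src)
    {
        size_t len = strlen(src);
        size_t *hdr = malloc(sizeof(size_t) + len + 1);
        if (hdr == NULL)
            return NULL;
        *hdr = len;                      /* length stored before the data */
        char *data = (char *)(hdr + 1);
        memcpy(data, src, len + 1);      /* copy including the NUL */
        return data;
    }

    size_t cstr_len(const cstr *s)       /* O(1), no strlen scan */
    {
        return ((const size_t *)s)[-1];
    }

    void cstr_free(cstr *s)
    {
        if (s != NULL)
            free((size_t *)s - 1);       /* allocation starts at the header */
    }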
It's all true, but then again, everything would be better if we could start from scratch today. Compromises made to tiptoe around technology limitations are what add complexity to most of today's software, but even tomorrow's software will be influenced by today's limitations. It's best not to dwell on the past.
This page won't load for me and neither will Google's web-cached version [1]. Does anyone have a version of this that I can see?

[1] http://webcache.googleusercontent.com/search?q=cache:http://queue.acm.org/detail.cfm?id=2010365&
Does anyone understand what the author meant by the following statement?

"${lang} is the language of the future"

This looks like a macro for substitution, but maybe it's some hip new term I've never encountered. An actual language, or just a placeholder for a language that hasn't been chosen yet?
Yeah, because strings with a length prefix/field are just as secure!

    200,"STR"

We know where that got us...

Programming 101, rules 1 & 2:

1 - never trust your inputs
2 - always check your invariants.
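A minimal sketch of what rule 1 looks like for a length-prefixed record (the wire format here, a 2-byte big-endian length followed by the payload, is made up for illustration):

    #include <stddef.h>
    #include <stdint.h>

    /* Returns the payload length, or -1 if the declared length does not
       fit in the bytes actually received.  Skipping this check is the
       essence of Heartbleed: believing the "200" when only the three
       bytes of "STR" are really there. */
    long read_record(const uint8_t *buf, size_t buf_len,
                     const uint8_t **payload)
    {
        if (buf_len < 2)
            return -1;                    /* not even a full header */
        size_t declared = ((size_t)buf[0] << 8) | buf[1];
        if (declared > buf_len - 2)
            return -1;                    /* the length field is lying */
        *payload = buf + 2;
        return (long)declared;
    }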
So to be "safe" and "secure" we can only have strings 256 characters long, or we need to waste a few bytes repeatedly for short strings. Sounds like the UTF-8 vs UTF-16/32 debate..