Yeah, if we had only used strings marked with 2-byte integers, everybody would have been happy, because a 64 KB string is enough for everyone. (And let's be realistic: nobody sane would have chosen a 4-byte string length back in the early 70s.)

So, if we had gone down that path, what would we have? All the fun of having "legacy" APIs that seem to work but internally only accept strings up to 64 KB and mysteriously chop off the excess bytes when you least expect it. It's the Y2K problem all over again.

And just when you finally think you're done with it, memory is cheaper again, size_t is 64-bit, and someone invariably wants to store a binary blob >4 GB as a string. Fun times again.

Have we forgotten how much trouble we went through in the 90s to handle memory on the x86 "640K is enough for everybody" architecture?
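A contrived sketch of that failure mode (the struct and function here are hypothetical, purely for illustration):

    #include <stdint.h>
    #include <string.h>

    /* A hypothetical "legacy" string type with a 2-byte length field. */
    struct lstr {
        uint16_t len;            /* caps every string at 65,535 bytes */
        char     data[65535];
    };

    /* Silently mangles anything longer than 65,535 bytes. */
    void lstr_set(struct lstr *s, const char *src, size_t n)
    {
        s->len = (uint16_t)n;            /* n > 65535 quietly wraps here */
        memcpy(s->data, src, s->len);    /* excess bytes are chopped off */
    }

That cast is exactly the "mysteriously chop off excess bytes" behavior: no error, no warning, just a shorter string.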
Oh, the irony of history.

In the very week that str+len is being abused left and right, someone surfaces to the front page an article about how str+NUL is wrong and everyone should use str+len.
NUL-terminated strings were the right decision for C. They're certainly much simpler than length fields.

Consider using a length field. How big should that field be? If it's fixed-size, you introduce complications regarding how big a string you can represent, and differences in field sizes across architectures. If it's variable-sized (a la UTF-8), then you've added different complications: you would need library functions to read and write the length, to get access to the string contents, to calculate the amount of memory required to hold a string of a given size, etc. Very much not in the spirit of C.

Next, what endianness should that field have? NUL-terminated strings have no endianness issues: they can be trivially written to files, embedded in network packets, whatever. But with a length field, we either need to remember to marshal the string, or allow for the length field to not be in native byte order. Neither is a pleasant prospect, especially for a 1970s C programmer.

Also, consider C-style string parsing, e.g. strtok/strsep. These could not be implemented with length-field strings.

Explicit length is better when you have an enforced abstraction, like std::string, but at that point you're not writing in C. If you have to pick an exposed representation, NUL termination is much better than Pascal-style length fields.

So what was the "one-byte mistake"? The article says that it was saving a byte by using NUL termination instead of a two-byte length field. Had K&R not made that "mistake," we would be unable to make a string longer than 65,535 bytes, a far more serious limitation than anything NUL termination imposes!

K&R got it right.
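On the strtok point, a minimal sketch (not the libc implementation) of why in-place tokenization leans on NUL termination:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[] = "root:x:0:0";

        /* strtok overwrites each ':' with '\0' in place, so every token
           becomes a valid C string pointing into the original buffer:
           no allocation, no copying.  With a length-prefix representation
           there is nowhere to record each token's length without
           allocating headers or carrying separate (ptr, len) pairs. */
        for (char *tok = strtok(line, ":"); tok != NULL;
             tok = strtok(NULL, ":"))
            printf("token: %s\n", tok);
        return 0;
    }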
A *lot* of programming decisions were made to save a byte here and there. It's easy to point at them today and say they're "bad", but at the time they were the absolutely *correct* thing to do. It's hard to imagine now, but not saving that byte could mean your program wouldn't fit into RAM. Try telling your management in the 1960s that your program won't load because it's "properly coded" and see how far you get.

What we've failed to do is ever revisit those decisions and change them where we've identified problems. Yes, you can probably compile (with warnings) files from UNIX v7, but we pay for that compatibility. There's no question that designing, building, and maintaining a libc alternative is a colossal undertaking and not likely to happen on a whim. So here we are.
Well, strings without an explicit length field allow for things like strstr(3) or suffix/prefix parsing without the performance penalty of allocating and copying: any pointer into a NUL-terminated string is itself a valid string.
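A tiny example of that property:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *url = "https://example.com/path";

        /* strstr returns a pointer into the original buffer; the suffix
           it points at is already a complete C string, so there is no
           allocation or copying involved. */
        const char *sep = strstr(url, "://");
        if (sep != NULL)
            printf("after the scheme: %s\n", sep + 3);
        return 0;
    }

With an explicit length up front, taking a suffix would mean either copying the bytes or passing around a separate (pointer, length) pair.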
When I was at PARC, the Mesa guys (who had counted strings) did some analysis and (at least in those days) the counted strings ended up being, in aggregate, faster. I suspect the advantage would be even greater these days, since memory allocation was a bigger deal back then.

I wonder if you could do this compatibly in the compiler by adding another primitive type (counted string) which had the length in the bytes before the start of the null-terminated string. You'd need a new type because various routines in the standard library would have to invisibly have two versions for counted and non-counted strings (since if you incremented a string pointer, or used a function like strchr, you'd have to treat it as a regular char*). "Safe" code would use a different call (say, cstrchr) that returned an index instead of a char*. The compiler could optionally warn on unsafe "legacy" calls as it can with strcpy instead of strncpy.
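A rough sketch of that layout, with made-up names (roughly the trick later used by sds strings in Redis): the length lives just before the data, and the data stays null-terminated, so the char* remains usable by legacy code.

    #include <stdlib.h>
    #include <string.h>

    typedef char cstr;   /* hypothetical counted-string handle */

    cstr *cstr_new(const char *src)
    {
        size_t len = strlen(src);
        size_t *hdr = malloc(sizeof(size_t) + len + 1);
        if (hdr == NULL)
            return NULL;
        *hdr = len;                      /* length stored before the data */
        char *data = (char *)(hdr + 1);
        memcpy(data, src, len + 1);      /* copy including the NUL */
        return data;
    }

    size_t cstr_len(const cstr *s)       /* O(1), no strlen scan */
    {
        return ((const size_t *)s)[-1];
    }

    void cstr_free(cstr *s)
    {
        if (s != NULL)
            free((size_t *)s - 1);       /* allocation starts at the header */
    }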
It's all true, but then again, everything would be better if we could start from scratch today. Compromises made to tiptoe around technology limitations are what add complexity to most of today's software, but even tomorrow's software will be influenced by today's limitations. It's best not to dwell on the past.
This page won't load for me and neither will Google's web-cached version [1]. Does anyone have a version of this that I can see?

[1] http://webcache.googleusercontent.com/search?q=cache:http://queue.acm.org/detail.cfm?id=2010365&
Does anyone understand what the author meant by the following statement?

"${lang} is the language of the future"

This looks like a macro for substitution, but maybe it's some hip new term I've never encountered. An actual language, or just a placeholder for a language that hasn't been chosen yet?
Yeah, because strings with a length prefix/field are just as secure!

    200,"STR"

We know where that got us...

Programming 101, rules 1 & 2:

1 - never trust your inputs
2 - always check your invariants.
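A minimal sketch of what rule 1 looks like for a length-prefixed record (the wire format here, a 2-byte big-endian length followed by the payload, is made up for illustration):

    #include <stddef.h>
    #include <stdint.h>

    /* Returns the payload length, or -1 if the declared length does not
       fit in the bytes actually received.  Skipping this check is the
       essence of Heartbleed: believing the "200" when only the three
       bytes of "STR" are really there. */
    long read_record(const uint8_t *buf, size_t buf_len,
                     const uint8_t **payload)
    {
        if (buf_len < 2)
            return -1;                    /* not even a full header */
        size_t declared = ((size_t)buf[0] << 8) | buf[1];
        if (declared > buf_len - 2)
            return -1;                    /* the length field is lying */
        *payload = buf + 2;
        return (long)declared;
    }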
So to be "safe" and "secure" we can only have strings 256 characters long, or we need to waste a few bytes repeatedly for short strings. Sounds like the UTF-8 vs UTF-16/32 debate..