We don't need a string type (2013)

28 points by grep_it, over 4 years ago

14 comments

hollasch, over 4 years ago

Curious. I have come to exactly the opposite conclusion: that we should drop the idea of a fixed-length character type, and instead _only_ have (Unicode) string types. Actually, I'd prefer something like `std::text` to finally be free of the baggage of "string". Operations on text should work on logical text concepts. For example, something like `someText.firstCharacter()` would have a return type of `text`, with logical length 1. Its _data_ length is variable, since a Unicode character is variable length. So many Unicode-containing string design problems arise because of the stubborn insistence on having an integral character type.

I should be able to extract UTF-8, UTF-16 or whatever encoding I want from a `text` value. Something like `c_str()` would be pretty important, but the semantics would be a design problem, not an encoding problem. Any Unicode-encoding string should be able to encode U+0000, so you'd need to figure out how to handle that from `c_str()` (perhaps a substitution ASCII character could be specified to encode embedded nulls).

Basically, users should definitely _not_ need to understand the deeper details of Unicode. They shouldn't need to understand and worry about different entities such as code units, code points, graphemes, and the like, though they should be able to extract such encodings on demand.
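A minimal sketch of the kind of opaque text value described above, written in Python rather than C++. The class `Text` and its methods `first_character` and `encode` are hypothetical names chosen for illustration, not an existing API. The segmentation here is deliberately simplified (a base code point plus any following combining marks); real grapheme segmentation follows Unicode UAX #29 and handles many more cases.

```python
import unicodedata


class Text:
    """Hypothetical opaque text value; length is counted in logical characters."""

    def __init__(self, s: str):
        self._s = s

    def _clusters(self):
        # Simplified segmentation (an assumption of this sketch): a base code
        # point followed by zero or more combining marks is one character.
        clusters = []
        for ch in self._s:
            if clusters and unicodedata.combining(ch):
                clusters[-1] += ch
            else:
                clusters.append(ch)
        return clusters

    def __len__(self):
        return len(self._clusters())

    def first_character(self) -> "Text":
        # Returns a Text, not a fixed-width character type.
        # Raises IndexError on empty text.
        return Text(self._clusters()[0])

    def encode(self, encoding: str) -> bytes:
        # Encodings are extracted on demand.
        return self._s.encode(encoding)


t = Text("e\u0301a")            # 'e' + COMBINING ACUTE ACCENT, then 'a'
first = t.first_character()
print(len(t), len(first))                  # logical lengths: 2 and 1
print(len(first.encode("utf-8")))          # data length in UTF-8: 3 bytes
print(len(first.encode("utf-16-le")))      # data length in UTF-16: 4 bytes
```

The logical length of `first` is 1 while its data length depends on the encoding requested, which is the distinction the comment draws.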


37ef_ced3, over 4 years ago

Go's immutable UTF-8 string type is one of the nice things about the language.

A Go string is almost exactly like this C struct:

<pre><code>#include &lt;stddef.h&gt;  /* ptrdiff_t */
#include &lt;stdint.h&gt;  /* uint8_t */

struct String {
    uint8_t* addr;  /* first byte of the UTF-8 data */
    ptrdiff_t len;  /* length in bytes */
};
</code></pre>

The language guarantees you can't modify the bytes in the memory range [addr, addr+len).

Go's garbage collection makes it simple and natural to have one string alias ("point into", "overlap") part of another string. This works because strings are immutable. Compare this to the nightmare in C++, where substrings require copying or explicit handling.

The rune (UTF-8) iterator and other facilities make Unicode handling natural in Go.

In summary, Go's string type is a huge win.
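The aliasing idea can be illustrated outside Go as well. The sketch below uses Python, with an immutable `bytes` object standing in for the string's backing store and a `memoryview` slice standing in for the (addr, len) pair. The helper `substring_view` is a hypothetical name, and this is an analogy, not a description of how Go implements strings.

```python
# Immutable UTF-8 bytes for "héllo wörld" (13 bytes; é and ö take 2 bytes each).
data = "h\u00e9llo w\u00f6rld".encode("utf-8")


def substring_view(buf: bytes, start: int, end: int) -> memoryview:
    """Return a zero-copy, read-only view of buf[start:end] (byte offsets)."""
    return memoryview(buf)[start:end]


word = substring_view(data, 7, 13)   # bytes of "wörld"

print(word.obj is data)              # the view shares the original buffer
print(word.readonly)                 # bytes is immutable, so the view is too
print(bytes(word).decode("utf-8"))   # copying happens only when asked for
```

Because the underlying bytes cannot change, any number of such views can overlap safely, and the garbage collector keeps the buffer alive while a view still refers to it.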


pca006132, over 4 years ago

I think the problem is that a lot of the time when we deal with strings, we are thinking about ASCII strings rather than other encodings like UTF-8. If we treat them as ASCII strings, an array of characters makes sense, but it is not that simple for other encodings.

One of the languages that considered the issue is Rust. In Rust, we don't really index into strings, but use iterators or other methods to do the operations required. https://doc.rust-lang.org/std/string/struct.String.html


DougBTX, over 4 years ago

The date should be (2013) not (2018), as that dates it before Rust 1.0 (which does have a UTF-8 string type) and before the Julia 1.0 release (which implements UTF-8 strings as arrays with irregularly spaced indexes, e.g., the valid indexes may be 1, 2, 4, 5 if the character at 2 takes up two bytes). Both would be interesting examples to compare against if this article were written today.
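The irregular indexes mentioned above fall out of the UTF-8 encoding itself. As a rough illustration in Python (not Julia), the hypothetical helper `valid_byte_indexes` below lists the 1-based byte positions at which a character starts, by skipping UTF-8 continuation bytes.

```python
def valid_byte_indexes(s: str) -> list:
    """Return the 1-based byte offsets where a character starts in UTF-8."""
    data = s.encode("utf-8")
    # Continuation bytes have the bit pattern 10xxxxxx; every other byte
    # begins a new character.
    return [i + 1 for i, b in enumerate(data) if b & 0xC0 != 0x80]


# "aébc": 'é' occupies bytes 2 and 3, so index 3 is not a valid start.
print(valid_byte_indexes("a\u00e9bc"))   # [1, 2, 4, 5]
```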


shadowgovt, over 4 years ago

I think the author started from an assertion ("This primary difference between a C++ ‘string’ and ‘vector’ is really just a historical oddity that many programs don’t even need anymore") that highlights an error in the C++ model of strings, not in the way we must think about strings.

Contrast NSString in Cocoa (https://developer.apple.com/documentation/foundation/nsstring). The Cocoa string is extremely opaque; it's basically an object. And under the hood, that opacity allows for piles of optimization that would be unsafe if the developer were allowed to treat the thing as just a vector of bytes or codepoints. Cocoa does all kinds of fanciness to the memory representation of the string (automatically building and cutting cords, "interning" short strings so that multiple copies of the string are just pointers to the same memory, caching some transforms under the assumption that if one is needed once, it's often needed again).

Taken this way, one can even start to talk about things like "Why does 'indexing' into a string always return a character, instead of, say, a word?" and other questions that are harder to get into if one assumes a string is just a 'vector of characters' or 'vector of bytes.'


ncmncm, over 4 years ago

The article is an argument against types, in general.

The point that characters can be stored in other containers is meaningless: the question is whether, conceptually, a specific sequence of character values distinct from another sequence has compile-time meaning. It does. Therefore, it needs a type.

Such a sequence has numerous special characteristics. In particular, the element at [i] often has an essential connection to the element at [i+1], such that swapping them could turn a valid string into an invalid one. In fact, that an invalid sequence is even possible is another such characteristic.
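The point about adjacent elements can be made concrete with UTF-8, taken here as one example of such a sequence. In this Python sketch, `is_valid_utf8` is a hypothetical helper; swapping the two bytes that encode a single character yields a sequence that no longer decodes.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Report whether data decodes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


original = "\u00e9".encode("utf-8")            # 'é' is b'\xc3\xa9'
swapped = bytes([original[1], original[0]])    # b'\xa9\xc3'

print(is_valid_utf8(original))   # True
print(is_valid_utf8(swapped))    # False: a continuation byte cannot come first
```

A plain vector of bytes would accept both orderings, which is the invariant a dedicated string type is able to carry.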


irogers, over 4 years ago

String should be an interface/protocol. When I log a message, I want to pass a string. If I have to append large strings for a log message, I don't want to run out of memory; I should be able to pass a rope/cord [1]. We've known how to abstract this forever and should work to optimize our compilers/runtimes accordingly. I'm not aware of a language that has got this right. For example, Java has the ugly CharSequence interface that nobody uses. StringProtocol in Swift (can I implement it?) makes you pay a character tax rather than just pass a string. Rust/C++ give various non-abstracted types.

[1] https://en.wikipedia.org/wiki/Rope_(data_structure)
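A rough sketch of the logging scenario, in Python. The names `Leaf`, `Concat`, `chunks`, and `log` are hypothetical; the point is only that a consumer written against a small interface (here, "something that yields chunks") can accept a rope without the pieces ever being joined into one large string.

```python
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Leaf:
    """A rope node holding one contiguous piece of text."""
    text: str

    def __len__(self):
        return len(self.text)

    def chunks(self):
        yield self.text


@dataclass(frozen=True)
class Concat:
    """A rope node joining two ropes without copying either."""
    left: object
    right: object

    def __len__(self):
        # Recomputed on each call for brevity; real ropes usually cache this.
        return len(self.left) + len(self.right)

    def chunks(self):
        yield from self.left.chunks()
        yield from self.right.chunks()


def log(message, sink):
    """Write any object exposing chunks() to sink, piece by piece."""
    for chunk in message.chunks():
        sink.write(chunk)
    sink.write("\n")


message = Concat(Concat(Leaf("user="), Leaf("alice")), Leaf(" logged in"))
out = io.StringIO()
log(message, out)
print(out.getvalue(), end="")   # user=alice logged in
```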


quelsolaar, over 4 years ago

I think that the problem with text is that the basic operation you want to do is insertion. The way memory works in computers makes that an inherently inefficient operation. I'm a bit fascinated by how bad computers are at text, given that text is what we use them for so much.

As a C programmer, I think that it's not really possible to implement an efficient general-purpose text processing library, because there is no good universal way to store text. So much depends on the pattern of the processing functions. If you want to avoid allocating new memory and moving a lot of text for each operation, the implementation needs to make speculative choices about how text can best be stored. How you store text depends so much on your access pattern. Do you need to be able to get to a line fast? Or know how long the text is? Or insert something? And if so, how much?

A C-style string would, for instance, be terrible for something like a text editor, because every key press would cause a complete copy of the document to be allocated and then copied over. So maybe a linked list? But you don't want just one character in each link, because that thrashes the cache, right? But then it's still slow to skip forward, so maybe an array of pointers to snippets? Or maybe a linked list of pointers to snippets? There are so many possibilities, and they all impact performance differently depending on what you do with the text.

When I see higher-level languages with nice, easy-to-use string functionality, I always consider the impossible choices that had to be made under the hood.
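One well-known storage choice for the text-editor case is a gap buffer, which is not mentioned in the comment and is swapped in here purely as an illustration of how storage follows the access pattern. The Python sketch below uses a hypothetical `GapBuffer` class: the free space is kept at the cursor, so repeated inserts at one position avoid copying the whole document, while moving the cursor far away costs a shift.

```python
class GapBuffer:
    """Text stored as [before-gap][gap][after-gap] in one list."""

    def __init__(self, text: str = "", capacity: int = 16):
        self._buf = list(text) + [None] * capacity
        self._gap_start = len(text)
        self._gap_end = len(self._buf)

    def __len__(self):
        return len(self._buf) - (self._gap_end - self._gap_start)

    def _move_gap(self, pos: int):
        # Shift characters across the gap until the gap starts at pos.
        while self._gap_start > pos:
            self._gap_start -= 1
            self._gap_end -= 1
            self._buf[self._gap_end] = self._buf[self._gap_start]
        while self._gap_start < pos:
            self._buf[self._gap_start] = self._buf[self._gap_end]
            self._gap_start += 1
            self._gap_end += 1

    def _grow(self, extra: int):
        # Rebuild the buffer with a fresh gap of `extra` slots.
        tail = self._buf[self._gap_end:]
        self._buf = self._buf[:self._gap_start] + [None] * extra + tail
        self._gap_end = self._gap_start + extra

    def insert(self, pos: int, s: str):
        if not 0 <= pos <= len(self):
            raise IndexError("insert position out of range")
        self._move_gap(pos)
        if self._gap_end - self._gap_start < len(s):
            self._grow(max(len(s), len(self._buf)))
        for ch in s:
            self._buf[self._gap_start] = ch
            self._gap_start += 1

    def __str__(self):
        return "".join(self._buf[:self._gap_start] + self._buf[self._gap_end:])


doc = GapBuffer("helo", capacity=2)
doc.insert(3, "l")          # cheap: only one character moves across the gap
doc.insert(5, " world")     # forces the buffer to grow
print(str(doc))             # hello world
```

The trade-off matches the comment: this layout is good for localized edits and poor for random inserts or fast line lookup, which other structures (ropes, piece tables) handle differently.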


giardini, over 4 years ago

Surprising to a Tcl programmer! 8-)) Because "Everything is a String": https://wiki.tcl-lang.org/page/everything+is+a+string and "Everything is a Symbol": https://wiki.tcl-lang.org/page/Everything+is+a+Symbol


BlueTemplar, over 4 years ago

The author has these follow-up blog posts:

2013: https://mortoray.com/2013/11/27/the-string-type-is-broken/

2014: https://mortoray.com/2014/03/17/strings-and-text-are-not-the-same/ (See also: https://thehardcorecoder.com/2014/04/15/data-text-and-strings-oh-my/)

2016: https://mortoray.com/2016/04/28/what-is-the-length-of-a-string-a-tricky-question/

dang, over 4 years ago

Discussed at the time: https://news.ycombinator.com/item?id=6204427

tyingq, over 4 years ago

I can't speak for C++, but for C, the recurring issue is that null-terminated strings have lots of utility routines that are handy for manipulating them. Without third-party libraries, plain length-header buffers don't. Hence things like Antirez's sds library, which by nature is a compromise. I get that you can't fundamentally change C now, but a buffer type with a rich manipulation library would have been nice.
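To make the difference between the two representations concrete, here is a small Python sketch. The 4-byte little-endian length prefix and the helpers `pack_buffer`, `unpack_buffer`, and `c_string_view` are assumptions for illustration only; this is not the layout sds actually uses.

```python
import struct


def pack_buffer(payload: bytes) -> bytes:
    """Prefix payload with its length as a 4-byte little-endian integer."""
    return struct.pack("<I", len(payload)) + payload


def unpack_buffer(buf: bytes) -> bytes:
    """Read the length header, then return exactly that many bytes."""
    (length,) = struct.unpack_from("<I", buf, 0)
    return buf[4:4 + length]


def c_string_view(raw: bytes) -> bytes:
    """What a strlen-based routine would see: everything before the first NUL."""
    return raw.split(b"\x00", 1)[0]


payload = b"ab\x00cd"                          # contains an embedded NUL
print(unpack_buffer(pack_buffer(payload)))     # all 5 bytes survive
print(c_string_view(payload))                  # truncated to b'ab'
```

The length-header form round-trips arbitrary bytes, but every routine has to be written to understand the header, which is the library gap the comment describes.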

BlueTemplar, over 4 years ago

Does anyone else think that we missed an opportunity to make text much simpler to deal with by not increasing the size of a byte from 8 to 32 bits when we moved from 32-bit to 64-bit word length CPUs?

I mean, isn't 7-bit ASCII text the reason why the byte length was standardized to the next power of two bits? (With e-mail still supporting non-padded 7-bit ASCII until recently for performance reasons.)

BlueTemplar, over 4 years ago

TL;DR: Characters and Strings considered harmful.

And he's right, they totally are! (Also, 'string' can mean an ordered sequence of similar objects of any kind, not just characters.)

But (as these discussions also mention) replacing them with much more clearly defined concepts like byte arrays, codepoints, glyphs, grapheme clusters and text fields is only the first step...

The big question (these days) is what to do with text, specifically the 'code' kind of text (either programming or markup; poor separation between 'plain' text and code keeps causing security issues).

To start with, even code needs formatting, specifically some way to signal a new line, or it will end up unreadable.

Then, code can't be just arbitrary Unicode text. Some limits have to apply, because Unicode can get verrrry 'fancy'! (Arbitrary Unicode is fine in text fields and comments embedded in code.)

So, I'm curious, is there any Unicode normalization specifically designed for code? (If not, why, and which is the closest one?)

I'm thinking of Python (3), which has what seems to be a somewhat arbitrary list of what can and what can't be used as a variable name. (And the language itself seemingly only uses ASCII, though this shouldn't be a restriction for programming/markup languages!)

Also, I hear that Julia goes much further than that (with even (La)TeX-like shortcuts for characters that might not be available on some keyboards). What kind of 'normalization' have they adopted?
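On the Python 3 part of the question: per PEP 3131, identifiers are built from the Unicode XID_Start/XID_Continue character classes (the UAX #31 identifier definition) and are converted to normalization form NFKC while parsing. The sketch below approximates that rule with the standard library; the helper `canonical_identifier` is a hypothetical name, and it only mimics the parser's behavior rather than calling into it.

```python
import unicodedata


def canonical_identifier(name: str) -> str:
    """Normalize a candidate identifier with NFKC and check that it is valid."""
    normalized = unicodedata.normalize("NFKC", name)
    if not normalized.isidentifier():
        raise ValueError(f"not a valid identifier: {name!r}")
    return normalized


# The 'fi' ligature (U+FB01) and fullwidth letters fold to plain ASCII.
print(canonical_identifier("\ufb01le"))                   # file
print(canonical_identifier("\uff53\uff45\uff4c\uff46"))   # self
```

So visually distinct spellings can name the same variable, which is one way a language limits how 'fancy' the Unicode in code can get.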
