Is this actually a bug? The default assumption in Java is that types are not thread-safe unless otherwise specified. Attempting to use types in a way that exceeds their documented thread safety has always been allowed to leave your program in an inconsistent state.
Calling this a "bug in java.lang.String" is silly. The same "bug" exists for all functions that take mutable objects. If you take a map and look up two different keys, yep, that's a "bug".<p>The bug is the other piece of code that introduces the data race in the first place. You can argue the case for languages like Rust with its borrow system, or others that use linear types or something along those lines, to eliminate the possibility of this happening, but it's quite misleading to say that the innocent user of a mutable object is the source of a bug. You may as well say there's a bug in `printf("Hello, World!\n");` in C because you could have another thread writing random values to random memory, running `while(1) { *((unsigned char*)(void*)rand()) = rand(); }`
This is exactly why Java needs frozen arrays [1].<p>The safe thing to do is freeze the array before doing anything with it. Then, you can rely on COW to copy the array if someone is modifying it concurrently with you reading it. In the general case, you'd have fast string creation, and in the tricky case you simply pay the clone cost as a penalty for being dumb.<p>[1] <a href="https://openjdk.org/jeps/8261007#:~:text=How%20do%20I%20use%20a%20frozen%20array%3F%201,an%20array%20is%20frozen%20or%20modifiable.%20More%20items" rel="nofollow noreferrer">https://openjdk.org/jeps/8261007#:~:text=How%20do%20I%20use%...</a>
That's a very interesting finding. Nowadays Java security is a joke, but back in the day it was a serious topic. Users could run downloaded applets in their browsers, so protecting the sandbox was important. It's very likely that these "corrupted" strings could have been used to break out of the sandbox, because the protection code definitely relied on strings being sane and correct.<p>I can't imagine this behaviour causing much of a problem with modern Java, since nobody runs untrusted code anyway. But good to know.
Every time, without fail, that somebody shows a bug in a piece of code we take for granted (in this case, the String class), the bug is related to concurrent modification.<p>Concurrency is so hard that even OpenJDK developers can't prevent these kinds of bugs.
This is the exact kind of bug that Rust solves with its borrowing system. The problem is that Java has no way to express the concept of "something that nothing else can modify while I'm looking at it".
Out of interest, how should this be handled?<p>Is this a bug in Java which should be fixed (looks like that to me)? My understanding was Java generally doesn't do "you did an undefined behaviour, so it's your fault", except for specifically marked very low-level interfaces.
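One conventional answer to the question above is a defensive copy: snapshot the caller's array first, then inspect only the private snapshot, so a concurrent writer cannot change the data between the check and the use. The sketch below is a toy class, not the JDK's String; the name `SnapshotText` and its methods are made up for illustration, and it assumes the cost of one extra array clone is acceptable.

```java
// Hypothetical illustration of the defensive-copy approach; not JDK code.
final class SnapshotText {
    private final char[] chars;    // private copy, never shared with the caller
    private final boolean latin1;  // derived from the copy, so it cannot go stale

    SnapshotText(char[] input) {
        // Snapshot first, inspect second: later writes to `input` are invisible here.
        char[] copy = input.clone();
        boolean fits = true;
        for (char c : copy) {
            if (c > 0xFF) {        // anything above 0xFF does not fit in one byte
                fits = false;
                break;
            }
        }
        this.chars = copy;
        this.latin1 = fits;
    }

    boolean isLatin1() { return latin1; }

    char charAt(int index) { return chars[index]; }

    int length() { return chars.length; }
}
```

Mutating the input array after construction does not change what the object reports, because the flag and the contents both come from the same private snapshot. The trade-off is the extra allocation on every construction.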
I just added solutions to the empty String challenge in the blog post.
This includes a very interesting find from Xavier Cooney, that causes the same problem without involving any concurrency. It instead makes StringBuilder misbehave by throwing an exception at an unexpected place: <a href="https://gist.github.com/XavierCooney/e9f6235f05479ac6bf962ca25e31d8d0" rel="nofollow noreferrer">https://gist.github.com/XavierCooney/e9f6235f05479ac6bf962ca...</a>
It is possible to fix this String constructor implementation without creating a defensive copy of the input array or having a TOCTOU vulnerability.<p><pre><code> // Change this implementation to a loop.
public String(char[] value) {
    while (true) {
        byte[] temp = StringUTF16.compress(value);
        if (temp != null) {
            this.value = temp;
            this.coder = LATIN1;
            break;
        }
        temp = StringUTF16.toBytes(value);
        if (temp != null) {
            this.value = temp;
            this.coder = UTF16;
            break;
        }
    }
}
// This implementation stays the same.
static byte[] StringUTF16.compress(char[] value) { ... }
// Change this contract and implementation so that it returns null
// if all characters are below 256, otherwise it returns byte[].
// The difference is that previously, this function would never return null.
// Now, we make sure that the function succeeds if and only if the
// char array *requires* UTF-16 as opposed to Latin-1.
static byte[] StringUTF16.toBytes(char[] value) { ... }</code></pre>
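The retry-loop idea in the comment above can be tried outside the JDK. This is only a sketch under stated assumptions: `compressOrNull` and `toBytesOrNull` are hypothetical stand-ins for the two helpers with the proposed contracts, the two-byte encoding is fixed to big-endian for simplicity, and the result is returned as a small holder object instead of being stored in String fields.

```java
// Hypothetical sketch of the retry loop; none of these names are JDK APIs.
final class RetryEncodingSketch {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    /** Encoded bytes plus the coder that describes them. */
    static final class Encoded {
        final byte[] bytes;
        final byte coder;

        Encoded(byte[] bytes, byte coder) {
            this.bytes = bytes;
            this.coder = coder;
        }
    }

    /** One byte per char, or null if any char read does not fit in Latin-1. */
    static byte[] compressOrNull(char[] value) {
        byte[] out = new byte[value.length];
        for (int i = 0; i < value.length; i++) {
            char c = value[i];       // each element is read exactly once
            if (c > 0xFF) {
                return null;
            }
            out[i] = (byte) c;
        }
        return out;
    }

    /** Two bytes per char, or null if every char read fits in Latin-1. */
    static byte[] toBytesOrNull(char[] value) {
        byte[] out = new byte[value.length * 2];
        boolean needsUtf16 = false;
        for (int i = 0; i < value.length; i++) {
            char c = value[i];       // each element is read exactly once
            if (c > 0xFF) {
                needsUtf16 = true;
            }
            out[2 * i] = (byte) (c >> 8);
            out[2 * i + 1] = (byte) c;
        }
        return needsUtf16 ? out : null;
    }

    /** Retries until one pass produces bytes that match its own coder. */
    static Encoded encode(char[] value) {
        while (true) {
            byte[] temp = compressOrNull(value);
            if (temp != null) {
                return new Encoded(temp, LATIN1);
            }
            temp = toBytesOrNull(value);
            if (temp != null) {
                return new Encoded(temp, UTF16);
            }
            // Both passes failed, so the array changed between them: try again.
        }
    }
}
```

The point of the design is that each helper decides success from the same characters it wrote into its output, so the coder always matches the bytes it is paired with. Without concurrent writers the loop finishes on its first iteration; with a writer racing against it, the loop may go around more than once.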
I enjoyed the article, but if I may express a peeve of mine... In the code listings, can we please not use a syntax coloring scheme that makes the comments nearly unreadable? Especially in blog posts like this, where the code deliberately contains numerous explanatory comments. Such low-contrast text slows down my tired old eyes.
> Why is "foo!".equals("foo⁉") false?<p>I don't really understand this question. They...look different? One is an exclamation mark, and the other is an exclamation mark/question mark combo?