> On August 3rd, the WebAssembly CG will poll on whether JavaScript string semantics/encoding are out of scope of the Interface Types proposal. This decision will likely be backed by Google (C++), Mozilla (Rust) and the Bytecode Alliance (WASI), who appear to have a common interest to exclusively promote C++, Rust respectively non-Web semantics and concepts in WebAssembly.<p>> If the poll passes, which is likely, AssemblyScript will be severely impacted as the tools it has developed must be deprecated due to unresolvable correctness and security problems the decision imposes upon languages utilizing JavaScript-like 16-bit string semantics and its users.<p>So, the problem is that AssemblyScript wants to keep using UTF-16? I'm not sure I understand.<p>Is AssemblyScript the thing that lets you hand-write WebAsm?
I'm not going to enter the discussion regarding UTF-8 vs WTF-16 for representing strings, as I lack the context to determine which one is the right approach if everything has to fit the same model. However, I think an approach that allows multiple serialization/deserialization mechanisms depending on the host/guest language seems like a nice way to move it forward.<p>If you want to chime in and retrieve more context, here are some relevant issues:<p>* <a href="https://github.com/WebAssembly/interface-types/issues/135" rel="nofollow">https://github.com/WebAssembly/interface-types/issues/135</a><p>* <a href="https://github.com/WebAssembly/interface-types/issues/136" rel="nofollow">https://github.com/WebAssembly/interface-types/issues/136</a><p>* <a href="https://github.com/WebAssembly/design/issues/1419" rel="nofollow">https://github.com/WebAssembly/design/issues/1419</a>
Can the authors expound on the reasons why they can't compile their language's string semantics into whatever representation will be used by WASI? Both C++ and Rust support numerous string representations, C++ even more so than Rust.
Full disclosure, I am an active participant in WebAssembly standardisation, my GitHub is here (<a href="https://github.com/conrad-watt" rel="nofollow">https://github.com/conrad-watt</a>). What follows is purely my personal opinion.<p>This announcement is deliberately phrased to scare people who do not have sufficient context. I don't know why some AssemblyScript maintainers have decided to act in this extreme way over what is quite a niche issue. The vote that this announcement is sounding the alarm over is _not_ a vote on whether UTF-16 should be supported.<p>There has been a longstanding debate as part of the Wasm interface types proposal regarding whether UTF-8 should be privileged as a canonical string representation. Recently, we have moved in the direction of supporting both UTF-8 and UTF-16, although a vote to confirm this is still pending (but I personally believe would pass uncontroversially).<p>However, JavaScript strings are not always well-formed UTF-16 - in particular some validation is deferred for performance reasons, meaning that strings can contain invalid code points called isolated surrogates. Again, the referenced vote is _not_ a vote on whether UTF-16 should be supported, but is in fact a vote on whether we should require that invalid code points be sanitised when strings are copied across component boundaries. Some AS maintainers have developed a strong opinion that such sanitisation would somehow be a webcompat/security hazard and have campaigned stridently against it. However, sanitising strings in this way is actually a recommended security practice (<a href="https://websec.github.io/unicode-security-guide/character-transformations/" rel="nofollow">https://websec.github.io/unicode-security-guide/character-tr...</a>), so they haven't gained the traction they were hoping for with their objections.<p>The announcement is worded to obscure this point - talking about "JavaScript-like 16-bit string semantics" (i.e. where isolated surrogates are not sanitised) as opposed to merely "UTF-16", which forbids isolated surrogates by definition, but inviting the conflation of the two.<p>AS does not need to radically alter its string representation - if we were to support UTF-16 with sanitisation, they could simply document that their potentially invalid UTF-16 strings will be sanitised when passed between components. Note that the component model is actually still being specified, so this design choice doesn't even affect any currently existing AS code. I interpret the announcement's threat of radical change as some maintainers holding AS hostage over the (again, very niche) string sanitisation issue, which is frankly pretty poor behaviour.
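For readers without the context, here is a rough sketch of what "sanitising isolated surrogates" could look like for a JavaScript-style 16-bit string. It is an illustration only: the function name `sanitizeLoneSurrogates` is hypothetical, and the choice to substitute U+FFFD (the Unicode replacement character) is an assumption about the sanitisation strategy, not something the comment above or the component model specification is quoted as mandating.

```typescript
// Hypothetical helper, not part of any WebAssembly or AssemblyScript API.
// Assumption: an isolated surrogate is replaced with U+FFFD.
function sanitizeLoneSurrogates(input: string): string {
  let out = "";
  for (let i = 0; i < input.length; i++) {
    const unit = input.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    const isLow = unit >= 0xdc00 && unit <= 0xdfff;

    // A high surrogate directly followed by a low surrogate is a valid pair:
    // copy both code units through unchanged.
    if (isHigh && i + 1 < input.length) {
      const next = input.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        out += input.charAt(i) + input.charAt(i + 1);
        i++; // skip the low half we just consumed
        continue;
      }
    }

    // Any surrogate that reaches this point is isolated, so replace it.
    out += isHigh || isLow ? "\uFFFD" : input.charAt(i);
  }
  return out;
}

// A well-formed pair survives; a lone high surrogate does not.
const wellFormed = sanitizeLoneSurrogates("ok \uD83D\uDE00");
const repaired = sanitizeLoneSurrogates("a\uD800b");
```

Under these assumptions, well-formed strings pass through unchanged, and only strings that were never valid UTF-16 are altered, which is the narrow case the vote is about.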
This is an unfortunate consequence of the poor choice of keeping UCS-2 alive as UTF-16 for way too long. The plug on 16-bit encodings should have been pulled a long time ago, but some people were and still are so focused on backwards compatibility that they didn't see they were just pushing the issue to another decade. UTF-8 has won, completely. UTF-16 is basically a zombie nobody wants anymore, kept artificially alive by big 90s frameworks' fear of clean breaks with the past.<p>We must get rid of legacy encodings no matter the cost. I'm tired of seeing Java and Qt apps wasting millions of CPU cycles mindlessly converting stuff back and forth from UTF-16. It's plain madness, and sometimes you just need the courage to destroy everything and start again.
This seems to be the discussion thread related to this.<p><a href="https://github.com/WebAssembly/interface-types/issues/13" rel="nofollow">https://github.com/WebAssembly/interface-types/issues/13</a>
This seems to also impact Java and TeaVM, see this post:<p><a href="https://groups.google.com/g/teavm/c/gpy0JoKYqbU" rel="nofollow">https://groups.google.com/g/teavm/c/gpy0JoKYqbU</a>
UTF-16 was always a mistake[1]. Good riddance. Time to get it out of the LSP specification[2] as well.<p>[1] <a href="http://utf8everywhere.org/" rel="nofollow">http://utf8everywhere.org/</a><p>[2] <a href="https://github.com/microsoft/language-server-protocol/issues/376" rel="nofollow">https://github.com/microsoft/language-server-protocol/issues...</a>