This is also a security vulnerability because MySQL/MariaDB silently truncate strings at the first invalid character. This can result in data manipulation attacks where a higher level layer validates the complete input as UTF-8 for insertion into the DB, but the database only stores half the string.<p><a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1253201" rel="nofollow">https://bugzilla.mozilla.org/show_bug.cgi?id=1253201</a>
I need to share something as literally today I fixed a bug in our project that's somewhat related.<p>MS SQL encodes everything with UTF-16. Emojis (code points) require up to 4 bytes. If it's a grapheme (emoji constructed from emojis), it will be even more.<p>We are using Django. If you check length of an emoji, it will give you `1` and Django assumes utf8 everywhere. If you try to save it to PostgreSQL with char field `max_length=1` it will work just fine, but on MS SQL it will fail as it requires 2 characters (4 bytes) in the DB!<p>I tried it with MS SQL 2017, 2019 with different collations on nvarchar and I'm pretty sure there's no way around it.<p>> Because in CHAR(n) and VARCHAR(n) or in NCHAR(n) and NVARCHAR(n), the n defines the byte storage size, not the number of characters that can be stored, it's important to determine the data type size you must convert to, in order to avoid data truncation.<p><a href="https://docs.microsoft.com/en-US/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-ver15#utf8" rel="nofollow">https://docs.microsoft.com/en-US/sql/relational-databases/co...</a>
Note:<p>> <i>The utf8mb3 character set is deprecated and you should expect it to be removed in a future MySQL release. Please use utf8mb4 instead. utf8 is currently an alias for utf8mb3, but it is now deprecated as such, and utf8 is expected subsequently to become a reference to utf8mb4. Beginning with MySQL 8.0.28, utf8mb3 is also displayed in place of utf8 in columns of Information Schema tables, and in the output of SQL SHOW statements.</i><p>* <a href="https://dev.mysql.com/doc/mysql-g11n-excerpt/8.0/en/charset-unicode-utf8.html" rel="nofollow">https://dev.mysql.com/doc/mysql-g11n-excerpt/8.0/en/charset-...</a>
Fun story: When I was working at one of the FAANG companies, I have placed the (Night with stars, it does not seem to render here [0]) character in my preferred name. Some time later, I was notified by HR that I must change my preferred name and I can only use UTF-8 characters with maximum length of 3 bytes. I was quite confused why such specific demand, I am pretty sure it was exactly this issue. I replaced it with ᗧ···ᗣ···<p>[0] <a href="https://emojipedia.org/night-with-stars/" rel="nofollow">https://emojipedia.org/night-with-stars/</a>
Is Medium secretly mining crypto in the background? I have no idea what it is but their site makes my CPU spike and my fans turn on.<p><a href="https://jdon.at/JTHj2G" rel="nofollow">https://jdon.at/JTHj2G</a><p>Update - here's a video, it's making hundreds of graphql requests a minute…<p><a href="https://jdon.at/z05ImC" rel="nofollow">https://jdon.at/z05ImC</a>
In the past, Unicode was assumed to be 64k of codepoints, so a 3-byte UTF-8 sequence was considered "long enough", especially since there were surrogate pairs for the rare cases where you have to encode higher code points.<p>Only "recently" have longer UTF-8 sequences (aka. emojis) become widespread enough that this became a problem.<p>Yes, it could have been avoided if they had allowed arbitrary-length UTF-8 sequences from the beginning, but I can see that they probably just wanted to optimize a bit.<p>What I don't understand is why they had to create a different encoding (the utf8mb4) instead of just extending the existing utf8 encoding, since 4-byte UTF-8 is obviously backward-compatible with 3-byte UTF-8... (unless they always used 3 bytes for every character, which would be stupid as UTF-8 has been explicitly designed for variable-length encodings)<p>Bonus: Many filesystems also do not allow 4+ byte UTF 8 code points in filenames. Test your company's file server to see if it allows you to save a file as "(some random emoji).doc". A few very expensive storage systems also have problems with that (and they have the same workaround: convert your filesystem to a different encoding, instead of simply extending the existing encoding to allow 4+ bytes)
And here's the commit that changed it: <a href="https://github.com/mysql/mysql-server/commit/43a506c0ced0e6ea101d3ab8b4b423ce3fa327d0" rel="nofollow">https://github.com/mysql/mysql-server/commit/43a506c0ced0e6e...</a>. It was originally set to max 6 bytes but for some reason someone dropped it to 3.<p>They also seem to be slowly phasing it out. Internally it's utf8mb3 and utf8 is just an alias. The idea is to eventually make utf8 an alias to utf8mb4.
That was set up when Microsoft and Java had standardized on UTF-16. So this can represent the UTF-16 subset of Unicode, which is Unicode Plane 0, the Basic Multilingual Plane (BMP). The higher-numbered "astral planes" of UTF-8 were rarely used. All modern languages with a significant user base are covered in Plane 0. Plane 1, the Supplementary Multilingual Plane, with Cretan Linear B, Egyptian hieroglyphics, and such, was seldom needed. Few people had fonts for those, anyway.<p>Because of the way UTF-8 is encoded, it takes 3 bytes to represent the UTF-16 set. That's because it only takes one byte for ASCII characters. Hence, 3-byte MySQL data.<p>Emoji, though, were put in Unicode Plane 1. That's where mass demand for the astral planes came from. More four byte UTF-8 characters started showing up in data.
This might be a tad unrelated to the original post, but the following article is one of my favorite primers on Character sets/Unicode :<p><a href="https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/" rel="nofollow">https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...</a>
A "hello devops" article from Jan 3, 2022 discussing this topic was posted by /u/flokoe a week ago:<p><a href="https://www.hellodevops.blog/posts/database-character-sets-and-collations-explained/" rel="nofollow">https://www.hellodevops.blog/posts/database-character-sets-a...</a><p>And the ensuing discussion:<p><a href="https://news.ycombinator.com/item?id=29793916" rel="nofollow">https://news.ycombinator.com/item?id=29793916</a>
The author mentioned MariaDB in the intro and the went on about MySQL, not sure which he actually used.<p>I think the latest recommendation for MySQL was utf8mb4_0900_ai_ci and utf8mb4_general_ci for MariaDB.<p>While I agree about the Postgres recommendation, it's a non-starter in the PHP world, unless doing something from scratch. Even then, the support for MySQL/MariaDB is much better.
Has anyone ever successfully converted a large legacy Apache/PHP/MySQL site to UTF-8? If so you deserve an award. utf8mb4 is just one issue and not the worst.
Wow! I was working on this issue in our DBMS product today!<p>Fun suggestion, try making a JSON string with a NULL character somewhere in the middle. It will be encoded as \u0000 and is a a valid UTF-8 code, but most C based systems will truncate the string by estimating its length via strlen.<p>Java community and some other software vendors designed the Modified UTF-8, which replaces the zero with a 2-byte code point. Sleek. Aside from the fact, that you are modifying the data that customer wants to stay consistent.<p>Postres explicitly bans such cases in the VARCHAR, not sure if it can fit in their JSON columns. Who tried?
The plain utf8 covers the "basic multilingual plane" (x0000-xFFFF), so it will get you very far, actually.<p>In our app, we finally went for utf8mb4 to allow people to enter emoji. As a side "bonus" you will also soon see some clever people entering their names to stand out, such as "𝙹𝚘𝚑𝚗". Note that this is not "John". It is actually a series of mathematical symbols:<p><pre><code> U+1D679 MATHEMATICAL MONOSPACE CAPITAL J
U+1D698 MATHEMATICAL MONOSPACE SMALL O
U+1D691 MATHEMATICAL MONOSPACE SMALL H
U+1D697 MATHEMATICAL MONOSPACE SMALL N</code></pre>
Of course, nobody "refused" to fix a "bug". Instead, a non-conformant behavior was already relied upon by legacy systems out in the wild and the "fix" was added in a backwards-compatible way.<p>Edit: Three bytes are enough to fit nearly any of the chars in use in any language, including Chinese and Japanese, so I can only assume someone "smart" in the MySQL dev team decided to "save space" (before emoji were a thing).
I enjoyed the article right up to the point where it concluded at the very end that the solution is to “switch to postgresql” without actually providing any rationale for that sudden and subjective ending.
MySQL is very similar to PHP in this regard. The good function name is actually broken so a new function with similar name is needed. This creates confusion for new people when adopting the technology.
So in mysql utf8 does not mean utf8. Reminds me of iso8601 in php, which does not mean iso8601.<p><a href="https://www.php.net/manual/en/class.datetimeinterface.php" rel="nofollow">https://www.php.net/manual/en/class.datetimeinterface.php</a>
Let's say for the sake of discussion that your MySQL db has a lot of tables encoded as "utf8." Are there any known drawbacks or gotchas to converting them en masse to "utf8mb4"? Is this a lengthy operation?
That's kind of funny - Oracle has a similar issue. If you want industry-standard UTF-8, you have to specify "AL32UTF8" as your encoding. "UTF8" is kind of crazy - it's this monstrous abomination called CESU-8 (<a href="https://en.wikipedia.org/wiki/CESU-8" rel="nofollow">https://en.wikipedia.org/wiki/CESU-8</a>), which isnt' UTF-8 at all - it's actually this weird "UTF-16 complete with surrogate pairs wrapped in a UTF-8 shell" thing.
Would the DBA's on this thread consider committing code to the MySQL [1] and Postgres [2] tuner scripts that give new DBA's all your learned advise and battle hardening experience? This thread appears to be such an example.<p>[1] - <a href="https://github.com/major/MySQLTuner-perl" rel="nofollow">https://github.com/major/MySQLTuner-perl</a><p>[2] - <a href="https://github.com/jfcoz/postgresqltuner" rel="nofollow">https://github.com/jfcoz/postgresqltuner</a>
Another interesting “issue” is that Case sensitive character sets may lead to unexpected results with aggregations (eg <a href="https://blog.mallya.dev/2021/07/25/mysql-cases-sensitivity/" rel="nofollow">https://blog.mallya.dev/2021/07/25/mysql-cases-sensitivity/</a>)<p>MySQL has many such issues which have turned me off from recommending it. It’s a shame because it’s a solid technology that mostly works and has a really long operational history.
We learned this a few years ago when clients starting using emojis when saving their content and it got truncated. Very quickly converted everything to utf8mb4!
WordPress has, of course, the lion's share of MySQL deployments by server count if not by row count. Yes, WordPress isn't the latest and greatest stuff. Yes, it's <bad thing> and <another bad thing>. But such is the curse of the customer base.<p>They upgraded everybody from utf8 to utf8mb4 (excluding some users of ancient versions of MySQL) with their version 4.2. It went live on April 23, 2015.<p>They announced the upgrade at the beginning of April 2015 here. <a href="https://make.wordpress.org/core/2015/04/02/the-utf8mb4-upgrade/" rel="nofollow">https://make.wordpress.org/core/2015/04/02/the-utf8mb4-upgra...</a> There's some interesting stuff in there about the practical difficulties of the upgrade. The biggest issue was, and still is, the need for prefix indexes.<p>With respect to Adam Hooper, his article was stale when it appeared just over a year <i>after</i> the WordPress schema upgrade. Here's right of course. But the biggest user base was well on their way to abandoning utfmb3 by that time.
Another fun fact is that string comparisons are case insensitive by default:<p><a href="https://dev.mysql.com/doc/refman/8.0/en/case-sensitivity.html" rel="nofollow">https://dev.mysql.com/doc/refman/8.0/en/case-sensitivity.htm...</a><p>This can definitely catch you by surprise and cause bugs that are only detected late.
Maybe this is a cynical take - but we used <a href="https://pgloader.io/" rel="nofollow">https://pgloader.io/</a> a few years ago to migrate to Postgres, and have never been happier. MySQL has a lot of stupid decisions like this.
> Back in 2002, MySQL gave users a speed boost if users could guarantee that every row in a table had the same number of bytes. To do that, users would declare text columns as “CHAR”<p>This database type has existed in many databases since at least the 1980s
Dear databases, please don't get hung up about string lengths when dealing with UTF8.<p>If I ask for a UTF8 string with a max-length of 100, please don't apply the worse case scenario and allocate space for 100 emojis. Please give me a box of 100 bytes and allow me to write any UTF-8 string that can fit into 100 bytes in there.<p>100 ASCII characters. 20 emojis. Any mixture of the two.<p>If I ask for UTF-8, it'll be because I'd like to make advantage of UTF-8 and I accept the costs. If that means I can't quickly jump to the character at index 84 in a string, no problem, I've accepted the trade-off.
No one talks about the downsides of changing.<p>Moving from 255 max characters to 191 or 192 max means a lot of your data needs to be moved into a text fields which means things like this field can't be a primary key.
According to the docs, MySQL is a bit more explicit and calls the old "utf8" "utf8mb3" nowadays, and notes that it may be removed at some point. Although "utf8" is still aliased to the footgun at the moment.<p><a href="https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8.html" rel="nofollow">https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8...</a>
Wow, never knew this, this is really bad, especially in this day and age where emojis are so prevalent.<p>MySQL should deprecate utf8 and give a warning if you try to use it.
Does this even fully fix the problem? It looks like utf8mb4 is limited to 4 byte sequences, but as far as I understand, utf-8 is variable width. Can utf8mb4 even encode the scottish flag <a href="https://www.iemoji.com/view/emoji/2476/flags/scotland" rel="nofollow">https://www.iemoji.com/view/emoji/2476/flags/scotland</a> ?
> Choose CHAR columns. (The CHAR format is a relic nowadays. Back then, MySQL was faster with CHAR columns. Ever since 2005, it’s not.)<p>Is this vs VARCHAR? To me this is fascinating if so, I'd love to see a link or benchmark. My understanding that reading a stripe off a drive, then chopping it up indiscriminately at known intervals is faster than conditionally branching for strings of other lengths
MySQL is an awful piece of software.<p>Try diving into the Time zone insanity, collations, and of course this UTF-8 gem that has lasted since at least 2011.
Best to be used with mysqli_real_escape_string ;-)
<a href="https://www.php.net/manual/en/mysqli.real-escape-string.php" rel="nofollow">https://www.php.net/manual/en/mysqli.real-escape-string.php</a>
> Database systems have subtle bugs and oddities, and you can avoid a lot of bugs by avoiding database systems.<p>This sounds like an offended developer. Avoiding database systems is not possible if you want to do a lot of programming tasks.
Also relevant about this: <a href="https://blog.koehntopp.info/2022/01/12/utf8mb4.html" rel="nofollow">https://blog.koehntopp.info/2022/01/12/utf8mb4.html</a>
a while ago, in my first engineering job, I was tracking down a strange bug related to chinese character sets.<p>Turns out some of the db tables were using utf8 and latin1; changing them to `utf8mb4` fixed it right up!<p><a href="https://josh.works/troubleshooting-chinese-character-sets-in-mysql" rel="nofollow">https://josh.works/troubleshooting-chinese-character-sets-in...</a><p>It was a fun bug to work on. I learned a lot about character encoding, and enjoyed bringing a refined "process" to the table.
> And developers who wanted correctness were wrong to use “utf8”, because it can’t store “<i>poo emoji</i>”.<p>This sentence deserves an award :)<p>Edit: TIL: HN doesn't support emojis.
I just went through the process of converting a 2TB MySQL database over the summer. Quite a few gotchas along the way that make the process a huge pain.
i'll just stick latin1 charset for now... its slower and increases the size of ur tables to use utf8, etc [0]<p>[0] <a href="https://stackoverflow.com/questions/12449336/utf-8-vs-latin1#12449477" rel="nofollow">https://stackoverflow.com/questions/12449336/utf-8-vs-latin1...</a>
It's because of things like that it's still hard for me to respect MySQL in 2022.<p>I use Postgres since 2008. It has never bitten me even once.
> Database systems have subtle bugs and oddities, and you can avoid a lot of bugs by avoiding database systems.<p>Wat? That is... not a great takeaway from this experience. "Oh, this library had bug reported in it! I know, that means I should just write my own, because clearly they're all buggy and my software is always perfect."
In MySQL, if you want to make sure that your Unicode fields are mangled, never use ‘utf8’, use ‘utf8mb3’. Since ‘utf8’ is an alias to ‘utf8mb3’, that alias might eventually be updated to point to ‘utf8mb4’ which won’t mangle your characters.<p>The only safe choice here is to explicitly use ‘utf8mb3’.
This is an old article and it was incorrect when written, and it's incorrect today. The author may possibly have learned since writing it that UTF-8, by design, encodes <i>up to 21 bits</i> of character code point in at most 4 bytes of data. It works and behaves as it should.