In MySQL, use “utf8mb4” instead of “utf8” (2016)

643 pointsby goranmoominover 3 years ago

63 comments

This is also a security vulnerability because MySQL/MariaDB silently truncate strings at the first invalid character. This can result in data manipulation attacks where a higher level layer validates the complete input as UTF-8 for insertion into the DB, but the database only stores half the string.<a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1253201" rel="nofollow">https://bugzilla.mozilla.org/show_bug.cgi?id=1253201</a>

评论 #29910398 未加载

评论 #29908976 未加载

评论 #29909953 未加载

评论 #29910175 未加载

评论 #29921053 未加载

评论 #29909571 未加载

tomwojcikover 3 years ago

I need to share something as literally today I fixed a bug in our project that's somewhat related.MS SQL encodes everything with UTF-16. Emojis (code points) require up to 4 bytes. If it's a grapheme (emoji constructed from emojis), it will be even more.We are using Django. If you check length of an emoji, it will give you `1` and Django assumes utf8 everywhere. If you try to save it to PostgreSQL with char field `max_length=1` it will work just fine, but on MS SQL it will fail as it requires 2 characters (4 bytes) in the DB!I tried it with MS SQL 2017, 2019 with different collations on nvarchar and I'm pretty sure there's no way around it.> Because in CHAR(n) and VARCHAR(n) or in NCHAR(n) and NVARCHAR(n), the n defines the byte storage size, not the number of characters that can be stored, it's important to determine the data type size you must convert to, in order to avoid data truncation.<a href="https://docs.microsoft.com/en-US/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-ver15#utf8" rel="nofollow">https://docs.microsoft.com/en-US/sql/relational-databases/co...</a>

评论 #29912440 未加载

评论 #29912755 未加载

评论 #29915913 未加载

throw0101aover 3 years ago

Note:> The utf8mb3 character set is deprecated and you should expect it to be removed in a future MySQL release. Please use utf8mb4 instead. utf8 is currently an alias for utf8mb3, but it is now deprecated as such, and utf8 is expected subsequently to become a reference to utf8mb4. Beginning with MySQL 8.0.28, utf8mb3 is also displayed in place of utf8 in columns of Information Schema tables, and in the output of SQL SHOW statements.* <a href="https://dev.mysql.com/doc/mysql-g11n-excerpt/8.0/en/charset-unicode-utf8.html" rel="nofollow">https://dev.mysql.com/doc/mysql-g11n-excerpt/8.0/en/charset-...</a>

评论 #29910978 未加载

评论 #29915837 未加载

评论 #29911523 未加载

iliketrainsover 3 years ago

Fun story: When I was working at one of the FAANG companies, I have placed the (Night with stars, it does not seem to render here [0]) character in my preferred name. Some time later, I was notified by HR that I must change my preferred name and I can only use UTF-8 characters with maximum length of 3 bytes. I was quite confused why such specific demand, I am pretty sure it was exactly this issue. I replaced it with ᗧ···ᗣ···[0] <a href="https://emojipedia.org/night-with-stars/" rel="nofollow">https://emojipedia.org/night-with-stars/</a>

评论 #29912113 未加载

donatjover 3 years ago

Is Medium secretly mining crypto in the background? I have no idea what it is but their site makes my CPU spike and my fans turn on.<a href="https://jdon.at/JTHj2G" rel="nofollow">https://jdon.at/JTHj2G</a>Update - here's a video, it's making hundreds of graphql requests a minute…<a href="https://jdon.at/z05ImC" rel="nofollow">https://jdon.at/z05ImC</a>

dark-starover 3 years ago

In the past, Unicode was assumed to be 64k of codepoints, so a 3-byte UTF-8 sequence was considered "long enough", especially since there were surrogate pairs for the rare cases where you have to encode higher code points.Only "recently" have longer UTF-8 sequences (aka. emojis) become widespread enough that this became a problem.Yes, it could have been avoided if they had allowed arbitrary-length UTF-8 sequences from the beginning, but I can see that they probably just wanted to optimize a bit.What I don't understand is why they had to create a different encoding (the utf8mb4) instead of just extending the existing utf8 encoding, since 4-byte UTF-8 is obviously backward-compatible with 3-byte UTF-8... (unless they always used 3 bytes for every character, which would be stupid as UTF-8 has been explicitly designed for variable-length encodings)Bonus: Many filesystems also do not allow 4+ byte UTF 8 code points in filenames. Test your company's file server to see if it allows you to save a file as "(some random emoji).doc". A few very expensive storage systems also have problems with that (and they have the same workaround: convert your filesystem to a different encoding, instead of simply extending the existing encoding to allow 4+ bytes)

评论 #29909848 未加载

评论 #29911261 未加载

评论 #29910131 未加载

评论 #29911076 未加载

评论 #29909803 未加载

评论 #29912589 未加载

mrcarruthersover 3 years ago

And here's the commit that changed it: <a href="https://github.com/mysql/mysql-server/commit/43a506c0ced0e6ea101d3ab8b4b423ce3fa327d0" rel="nofollow">https://github.com/mysql/mysql-server/commit/43a506c0ced0e6e...</a>. It was originally set to max 6 bytes but for some reason someone dropped it to 3.They also seem to be slowly phasing it out. Internally it's utf8mb3 and utf8 is just an alias. The idea is to eventually make utf8 an alias to utf8mb4.

评论 #29938291 未加载

Animatsover 3 years ago

That was set up when Microsoft and Java had standardized on UTF-16. So this can represent the UTF-16 subset of Unicode, which is Unicode Plane 0, the Basic Multilingual Plane (BMP). The higher-numbered "astral planes" of UTF-8 were rarely used. All modern languages with a significant user base are covered in Plane 0. Plane 1, the Supplementary Multilingual Plane, with Cretan Linear B, Egyptian hieroglyphics, and such, was seldom needed. Few people had fonts for those, anyway.Because of the way UTF-8 is encoded, it takes 3 bytes to represent the UTF-16 set. That's because it only takes one byte for ASCII characters. Hence, 3-byte MySQL data.Emoji, though, were put in Unicode Plane 1. That's where mass demand for the astral planes came from. More four byte UTF-8 characters started showing up in data.

评论 #29911202 未加载

评论 #29912022 未加载

评论 #29911035 未加载

karsinkkover 3 years ago

This might be a tad unrelated to the original post, but the following article is one of my favorite primers on Character sets/Unicode :<a href="https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/" rel="nofollow">https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...</a>

评论 #29913848 未加载

lsllcover 3 years ago

A "hello devops" article from Jan 3, 2022 discussing this topic was posted by /u/flokoe a week ago:<a href="https://www.hellodevops.blog/posts/database-character-sets-and-collations-explained/" rel="nofollow">https://www.hellodevops.blog/posts/database-character-sets-a...</a>And the ensuing discussion:<a href="https://news.ycombinator.com/item?id=29793916" rel="nofollow">https://news.ycombinator.com/item?id=29793916</a>

PrimeDirectiveover 3 years ago

The author mentioned MariaDB in the intro and the went on about MySQL, not sure which he actually used.I think the latest recommendation for MySQL was utf8mb4_0900_ai_ci and utf8mb4_general_ci for MariaDB.While I agree about the Postgres recommendation, it's a non-starter in the PHP world, unless doing something from scratch. Even then, the support for MySQL/MariaDB is much better.

评论 #29910836 未加载

mwattsunover 3 years ago

Has anyone ever successfully converted a large legacy Apache/PHP/MySQL site to UTF-8? If so you deserve an award. utf8mb4 is just one issue and not the worst.

评论 #29908821 未加载

评论 #29910914 未加载

评论 #29909729 未加载

评论 #29912839 未加载

评论 #29911327 未加载

评论 #29908694 未加载

评论 #29909532 未加载

ashvardanianover 3 years ago

Wow! I was working on this issue in our DBMS product today!Fun suggestion, try making a JSON string with a NULL character somewhere in the middle. It will be encoded as \u0000 and is a a valid UTF-8 code, but most C based systems will truncate the string by estimating its length via strlen.Java community and some other software vendors designed the Modified UTF-8, which replaces the zero with a 2-byte code point. Sleek. Aside from the fact, that you are modifying the data that customer wants to stay consistent.Postres explicitly bans such cases in the VARCHAR, not sure if it can fit in their JSON columns. Who tried?

评论 #29913868 未加载

clonover 3 years ago

The plain utf8 covers the "basic multilingual plane" (x0000-xFFFF), so it will get you very far, actually.In our app, we finally went for utf8mb4 to allow people to enter emoji. As a side "bonus" you will also soon see some clever people entering their names to stand out, such as "𝙹𝚘𝚑𝚗". Note that this is not "John". It is actually a series of mathematical symbols:<pre><code> U+1D679 MATHEMATICAL MONOSPACE CAPITAL J U+1D698 MATHEMATICAL MONOSPACE SMALL O U+1D691 MATHEMATICAL MONOSPACE SMALL H U+1D697 MATHEMATICAL MONOSPACE SMALL N</code></pre>

评论 #29909025 未加载

评论 #29920697 未加载

评论 #29909003 未加载

smarx007over 3 years ago

Of course, nobody "refused" to fix a "bug". Instead, a non-conformant behavior was already relied upon by legacy systems out in the wild and the "fix" was added in a backwards-compatible way.Edit: Three bytes are enough to fit nearly any of the chars in use in any language, including Chinese and Japanese, so I can only assume someone "smart" in the MySQL dev team decided to "save space" (before emoji were a thing).

评论 #29909426 未加载

评论 #29911611 未加载

评论 #29911544 未加载

评论 #29909176 未加载

评论 #29919902 未加载

评论 #29909085 未加载

urbandw311erover 3 years ago

I enjoyed the article right up to the point where it concluded at the very end that the solution is to “switch to postgresql” without actually providing any rationale for that sudden and subjective ending.

didipover 3 years ago

MySQL is very similar to PHP in this regard. The good function name is actually broken so a new function with similar name is needed. This creates confusion for new people when adopting the technology.

评论 #29911744 未加载

talos2110over 3 years ago

So in mysql utf8 does not mean utf8. Reminds me of iso8601 in php, which does not mean iso8601.<a href="https://www.php.net/manual/en/class.datetimeinterface.php" rel="nofollow">https://www.php.net/manual/en/class.datetimeinterface.php</a>

评论 #29915210 未加载

maxfurmanover 3 years ago

Let's say for the sake of discussion that your MySQL db has a lot of tables encoded as "utf8." Are there any known drawbacks or gotchas to converting them en masse to "utf8mb4"? Is this a lengthy operation?

评论 #29909392 未加载

评论 #29938415 未加载

评论 #29909554 未加载

评论 #29909079 未加载

CountSessineover 3 years ago

That's kind of funny - Oracle has a similar issue. If you want industry-standard UTF-8, you have to specify "AL32UTF8" as your encoding. "UTF8" is kind of crazy - it's this monstrous abomination called CESU-8 (<a href="https://en.wikipedia.org/wiki/CESU-8" rel="nofollow">https://en.wikipedia.org/wiki/CESU-8</a>), which isnt' UTF-8 at all - it's actually this weird "UTF-16 complete with surrogate pairs wrapped in a UTF-8 shell" thing.

评论 #29920022 未加载

bambaxover 3 years ago

> If you need a database, don’t use MySQL or MariaDB. Use PostgreSQL.Sure. But for many use cases, SQLite is enough.

评论 #29915022 未加载

LinuxBenderover 3 years ago

Would the DBA's on this thread consider committing code to the MySQL [1] and Postgres [2] tuner scripts that give new DBA's all your learned advise and battle hardening experience? This thread appears to be such an example.[1] - <a href="https://github.com/major/MySQLTuner-perl" rel="nofollow">https://github.com/major/MySQLTuner-perl</a>[2] - <a href="https://github.com/jfcoz/postgresqltuner" rel="nofollow">https://github.com/jfcoz/postgresqltuner</a>

pm90over 3 years ago

Another interesting “issue” is that Case sensitive character sets may lead to unexpected results with aggregations (eg <a href="https://blog.mallya.dev/2021/07/25/mysql-cases-sensitivity/" rel="nofollow">https://blog.mallya.dev/2021/07/25/mysql-cases-sensitivity/</a>)MySQL has many such issues which have turned me off from recommending it. It’s a shame because it’s a solid technology that mostly works and has a really long operational history.

webkikeover 3 years ago

Genuine question: why would anyone prefer MySQL over Postgres?

评论 #29910911 未加载

评论 #29911369 未加载

评论 #29914083 未加载

评论 #29909955 未加载

评论 #29917442 未加载

评论 #29909974 未加载

riteshpatelover 3 years ago

We learned this a few years ago when clients starting using emojis when saving their content and it got truncated. Very quickly converted everything to utf8mb4!

评论 #29914644 未加载

OliverJonesover 3 years ago

WordPress has, of course, the lion's share of MySQL deployments by server count if not by row count. Yes, WordPress isn't the latest and greatest stuff. Yes, it's <bad thing> and <another bad thing>. But such is the curse of the customer base.They upgraded everybody from utf8 to utf8mb4 (excluding some users of ancient versions of MySQL) with their version 4.2. It went live on April 23, 2015.They announced the upgrade at the beginning of April 2015 here. <a href="https://make.wordpress.org/core/2015/04/02/the-utf8mb4-upgrade/" rel="nofollow">https://make.wordpress.org/core/2015/04/02/the-utf8mb4-upgra...</a> There's some interesting stuff in there about the practical difficulties of the upgrade. The biggest issue was, and still is, the need for prefix indexes.With respect to Adam Hooper, his article was stale when it appeared just over a year after the WordPress schema upgrade. Here's right of course. But the biggest user base was well on their way to abandoning utfmb3 by that time.

dehrmannover 3 years ago

Memories... When I was at Amazon, my starter project was modifying Redshift to support four-byte UTF-8 characters, not just three-byte.

MailNerdover 3 years ago

Another fun fact is that string comparisons are case insensitive by default:<a href="https://dev.mysql.com/doc/refman/8.0/en/case-sensitivity.html" rel="nofollow">https://dev.mysql.com/doc/refman/8.0/en/case-sensitivity.htm...</a>This can definitely catch you by surprise and cause bugs that are only detected late.

fareeshover 3 years ago

I decided to build all my new projects on postgres a few years ago. I am often reminded of why that was a good decision.

tapoxiover 3 years ago

Maybe this is a cynical take - but we used <a href="https://pgloader.io/" rel="nofollow">https://pgloader.io/</a> a few years ago to migrate to Postgres, and have never been happier. MySQL has a lot of stupid decisions like this.

评论 #29910865 未加载

评论 #29909914 未加载

评论 #29908862 未加载

评论 #29911502 未加载

评论 #29910802 未加载

评论 #29910719 未加载

TedDoesntTalkover 3 years ago

> Back in 2002, MySQL gave users a speed boost if users could guarantee that every row in a table had the same number of bytes. To do that, users would declare text columns as “CHAR”This database type has existed in many databases since at least the 1980s

billpgover 3 years ago

Dear databases, please don't get hung up about string lengths when dealing with UTF8.If I ask for a UTF8 string with a max-length of 100, please don't apply the worse case scenario and allocate space for 100 emojis. Please give me a box of 100 bytes and allow me to write any UTF-8 string that can fit into 100 bytes in there.100 ASCII characters. 20 emojis. Any mixture of the two.If I ask for UTF-8, it'll be because I'd like to make advantage of UTF-8 and I accept the costs. If that means I can't quickly jump to the character at index 84 in a string, no problem, I've accepted the trade-off.

评论 #29912981 未加载

评论 #29910818 未加载

评论 #29910462 未加载

评论 #29911894 未加载

评论 #29912618 未加载

ipaddrover 3 years ago

No one talks about the downsides of changing.Moving from 255 max characters to 191 or 192 max means a lot of your data needs to be moved into a text fields which means things like this field can't be a primary key.

评论 #29911333 未加载

评论 #29914052 未加载

fredoraliveover 3 years ago

According to the docs, MySQL is a bit more explicit and calls the old "utf8" "utf8mb3" nowadays, and notes that it may be removed at some point. Although "utf8" is still aliased to the footgun at the moment.<a href="https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8.html" rel="nofollow">https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8...</a>

hn_throwaway_99over 3 years ago

Wow, never knew this, this is really bad, especially in this day and age where emojis are so prevalent.MySQL should deprecate utf8 and give a warning if you try to use it.

评论 #29909013 未加载

评论 #29908917 未加载

davidjfelixover 3 years ago

Does this even fully fix the problem? It looks like utf8mb4 is limited to 4 byte sequences, but as far as I understand, utf-8 is variable width. Can utf8mb4 even encode the scottish flag <a href="https://www.iemoji.com/view/emoji/2476/flags/scotland" rel="nofollow">https://www.iemoji.com/view/emoji/2476/flags/scotland</a> ?

评论 #29908868 未加载

评论 #29909115 未加载

评论 #29908936 未加载

评论 #29908885 未加载

评论 #29908954 未加载

评论 #29909285 未加载

评论 #29908983 未加载

exabrialover 3 years ago

> Choose CHAR columns. (The CHAR format is a relic nowadays. Back then, MySQL was faster with CHAR columns. Ever since 2005, it’s not.)Is this vs VARCHAR? To me this is fascinating if so, I'd love to see a link or benchmark. My understanding that reading a stripe off a drive, then chopping it up indiscriminately at known intervals is faster than conditionally branching for strings of other lengths

77pt77over 3 years ago

MySQL is an awful piece of software.Try diving into the Time zone insanity, collations, and of course this UTF-8 gem that has lasted since at least 2011.

lerosover 3 years ago

In addition to emojis, I've also had issues with certain characters copied from Microsoft Word being 4 bytes.

patrickctengover 3 years ago

I've had to "upgrade" a couple to utf8, and varchar(255) bites me all the time.

cryptosover 3 years ago

Best to be used with mysqli_real_escape_string ;-) <a href="https://www.php.net/manual/en/mysqli.real-escape-string.php" rel="nofollow">https://www.php.net/manual/en/mysqli.real-escape-string.php</a>

jamaicahestover 3 years ago

> Database systems have subtle bugs and oddities, and you can avoid a lot of bugs by avoiding database systems.This sounds like an offended developer. Avoiding database systems is not possible if you want to do a lot of programming tasks.

kolaenteover 3 years ago

Also relevant about this: <a href="https://blog.koehntopp.info/2022/01/12/utf8mb4.html" rel="nofollow">https://blog.koehntopp.info/2022/01/12/utf8mb4.html</a>

wonder_erover 3 years ago

a while ago, in my first engineering job, I was tracking down a strange bug related to chinese character sets.Turns out some of the db tables were using utf8 and latin1; changing them to `utf8mb4` fixed it right up!<a href="https://josh.works/troubleshooting-chinese-character-sets-in-mysql" rel="nofollow">https://josh.works/troubleshooting-chinese-character-sets-in...</a>It was a fun bug to work on. I learned a lot about character encoding, and enjoyed bringing a refined "process" to the table.

btschaeggover 3 years ago

> And developers who wanted correctness were wrong to use “utf8”, because it can’t store “poo emoji”.This sentence deserves an award :)Edit: TIL: HN doesn't support emojis.

brightballover 3 years ago

I just went through the process of converting a 2TB MySQL database over the summer. Quite a few gotchas along the way that make the process a huge pain.

oleg_antonyanover 3 years ago

Thanks to emoji (almost)everything now supports unicode

ninjuover 3 years ago

This HN posting should be marked with (2016)Is the issue still an issue?

评论 #29908980 未加载

评论 #29908993 未加载

评论 #29910214 未加载

Dylan16807over 3 years ago

Also make sure mysqldump is using utf8mb4 or you'll get character replacement in your backups. This often needs to be configured separately.

NelsonMinarover 3 years ago

In 2022 with a new release of MySQL or MariaDB, is it still possible to create a database with "utf8" encoding? Does it print a warning?

评论 #29919240 未加载

jijjiover 3 years ago

i'll just stick latin1 charset for now... its slower and increases the size of ur tables to use utf8, etc [0][0] <a href="https://stackoverflow.com/questions/12449336/utf-8-vs-latin1#12449477" rel="nofollow">https://stackoverflow.com/questions/12449336/utf-8-vs-latin1...</a>

DonHopkinsover 3 years ago

>Of course, they never advertised this (probably because the bug is so embarrassing).Not as embarrassing as being owned by Oracle.

cblconfederateover 3 years ago

will something terrible happen if i just convert my tables to utf8mb4? Will i have to upgrade each column?

pkruminsover 3 years ago

Does anyone know if utf8mb4 is variable length or does each character take full 4 bytes?

评论 #29911010 未加载

fake-nameover 3 years ago

In MySQL, use postgres instead.

ShaneMcGowanover 3 years ago

This one is very fun to debug in production when you're on MySQL 5.8

jmnicolasover 3 years ago

It's because of things like that it's still hard for me to respect MySQL in 2022.I use Postgres since 2008. It has never bitten me even once.

评论 #29911302 未加载

hashimotonomoraover 3 years ago

Just use ASCII.

kyralisover 3 years ago

> Database systems have subtle bugs and oddities, and you can avoid a lot of bugs by avoiding database systems.Wat? That is... not a great takeaway from this experience. "Oh, this library had bug reported in it! I know, that means I should just write my own, because clearly they're all buggy and my software is always perfect."

评论 #29910219 未加载

评论 #29908789 未加载

评论 #29908837 未加载

评论 #29908973 未加载

评论 #29908728 未加载

评论 #29909916 未加载

TedShillerover 3 years ago

use Postgres instead of MySQL

rubyist5evaover 3 years ago

god I hate mysql so much - long live postgres, one of the greatest pieces of software ever created

sshineover 3 years ago

In MySQL, if you want to make sure that your Unicode fields are mangled, never use ‘utf8’, use ‘utf8mb3’. Since ‘utf8’ is an alias to ‘utf8mb3’, that alias might eventually be updated to point to ‘utf8mb4’ which won’t mangle your characters.The only safe choice here is to explicitly use ‘utf8mb3’.

daneel_wover 3 years ago

This is an old article and it was incorrect when written, and it's incorrect today. The author may possibly have learned since writing it that UTF-8, by design, encodes up to 21 bits of character code point in at most 4 bytes of data. It works and behaves as it should.

评论 #29908911 未加载

评论 #29908892 未加载