I wonder how tokenization works for East Asian languages. There are obviously more characters than the token vocabulary size of typical current models.

So how do models like GPT answer in Chinese? Are they able to produce any Chinese character? From what I understand, they are not.

My second question would then be: which tokenization algorithms are used for Chinese and other East Asian languages, and what does that mean for the models? How do models that can learn proper Chinese (with complete character coverage) differ from models for languages with fewer characters?
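To make the first question concrete, here is a plain-Python sketch of the byte-level view that byte-level BPE tokenizers (the family GPT models are generally reported to use) operate on. It is not an actual tokenizer, just an illustration under the assumption of a byte fallback: every CJK character is 2-4 UTF-8 bytes, so a vocabulary that contains all 256 single-byte tokens can in principle represent any character, even one that never got its own merged token.

    # Sketch only: shows the UTF-8 bytes a byte-level BPE tokenizer would see.
    # Byte values come straight from Python's UTF-8 encoder; they are NOT
    # token IDs from any real model vocabulary.

    text = "语言模型"  # "language model"

    for ch in text:
        raw = ch.encode("utf-8")
        print(ch, "->", list(raw))  # e.g. 语 -> [232, 175, 173]

    # A trained BPE would then merge frequent byte sequences, so common
    # characters may end up as one learned token while rare ones stay as
    # 2-4 byte-level tokens; either way the character remains expressible.

Whether a given model actually emits rare characters reliably is a separate (training-data) question from whether the tokenizer can represent them, which is part of what I'm asking.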