I wonder how tokenization works for East Asian languages. There are obviously more characters than the token vocabulary size of typical current models.

So how do models like GPT answer in Chinese? Are they able to produce any Chinese character? From what I understand, they are not.

My second question would then be: which tokenization algorithms are used for Chinese and other East Asian languages, and what does that mean for the models? How do models that can learn proper Chinese (with complete character coverage) differ from models for languages with fewer characters?
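To make the first question concrete, here is a plain-Python sketch of the byte-level view that byte-level BPE tokenizers (the family GPT models are generally reported to use) operate on. It is not an actual tokenizer, just an illustration under the assumption of a byte fallback: every CJK character is 2-4 UTF-8 bytes, so a vocabulary that contains all 256 single-byte tokens can in principle represent any character, even one that never got its own merged token.

    # Sketch only: shows the UTF-8 bytes a byte-level BPE tokenizer would see.
    # Byte values come straight from Python's UTF-8 encoder; they are NOT
    # token IDs from any real model vocabulary.

    text = "语言模型"  # "language model"

    for ch in text:
        raw = ch.encode("utf-8")
        print(ch, "->", list(raw))  # e.g. 语 -> [232, 175, 173]

    # A trained BPE would then merge frequent byte sequences, so common
    # characters may end up as one learned token while rare ones stay as
    # 2-4 byte-level tokens; either way the character remains expressible.

Whether a given model actually emits rare characters reliably is a separate (training-data) question from whether the tokenizer can represent them, which is part of what I'm asking.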