
How the BPE tokenization algorithm used by large language models works

29 points by montebicyclelo almost 2 years ago

1 comment

weinzierl almost 2 years ago
I wonder how tokenization works for East Asian languages. There are obviously more characters than the token vocabulary size of typical current models.

So, how do models like GPT answer in Chinese? Are they able to produce any Chinese character? From what I understand, they are not.

My second question would then be: which tokenization algorithms are used for Chinese and other East Asian languages? What does that mean for the models? How do models that can learn proper Chinese (with complete tokenization) differ from models for languages with fewer characters?
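(For context on the question above: GPT-style tokenizers use byte-level BPE, whose base alphabet is the 256 possible byte values, so any Unicode character can always be spelled out as its UTF-8 bytes even when it has no dedicated token. The sketch below is illustrative only, not code from the article: the toy corpus, function names, and merge count are made up for the example.)

```python
# Minimal byte-level BPE sketch (illustrative, not the article's code).
# Shows why a fixed vocabulary can still cover Chinese text: frequent
# characters earn merged tokens, rare ones fall back to raw UTF-8 bytes.

from collections import Counter


def get_pair_counts(seqs):
    """Count adjacent symbol pairs across all byte sequences."""
    counts = Counter()
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    return counts


def merge_pair(seq, pair, new_symbol):
    """Replace every occurrence of `pair` in `seq` with `new_symbol`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(corpus, num_merges):
    """Learn `num_merges` merge rules over the UTF-8 bytes of `corpus`."""
    seqs = [list(text.encode("utf-8")) for text in corpus]
    merges = {}
    next_id = 256  # ids 0..255 are reserved for the raw bytes
    for _ in range(num_merges):
        counts = get_pair_counts(seqs)
        if not counts:
            break
        pair = counts.most_common(1)[0][0]
        merges[pair] = next_id
        seqs = [merge_pair(s, pair, next_id) for s in seqs]
        next_id += 1
    return merges


def encode(text, merges):
    """Greedily apply learned merges, in learned order, to a new string."""
    seq = list(text.encode("utf-8"))
    for pair, new_symbol in merges.items():
        seq = merge_pair(seq, pair, new_symbol)
    return seq


if __name__ == "__main__":
    # Toy corpus: the frequent character 你 gets merged tokens, while a
    # character unseen in training is still representable as raw bytes.
    corpus = ["你好", "你们好", "你在哪里"]
    merges = train_bpe(corpus, num_merges=5)
    print(encode("你好", merges))  # mostly merged ids >= 256
    print(encode("罕", merges))    # unseen character: its three UTF-8 bytes
```

The key point for the question is the last line: even a character the tokenizer never saw is encodable (as several byte tokens), so the model can in principle emit any Chinese character, just less efficiently than characters common enough to have earned dedicated merges.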