
Why is GPT-3 15.77x more expensive for certain languages?

117 points by rayshan, about 2 years ago

16 comments

lukeschlather, about 2 years ago
I would want to see some data on tokenization for some real-world examples. "Je voudrais une pizza" actually translates more directly to "I would like a pizza", which is 5 tokens. But also I think there's some danger here of cherry-picking examples. Spanish is a lot more dense than English or French and might tokenize better. (I see "quiero pizza" is 4 tokens, which seems like the right number of tokens to me - "quiero" actually contains "I want <present tense>".) You could argue it's 2 or 3 tokens, but 4 seems preferable.

For diacritics in French or Spanish, diacritics are logically characters. I can't think of an example where it's actually useful to split the letter into a different token, but I could see it happening and not being harmful. I do think it's possible French is just weird and just needs more tokens. When I think about how I process French, I probably do treat a pathological example like "Je l'ai aimé" as 3 tokens when I speak it out loud. But I can also see why you would tokenize it as 6 tokens; I'm not sure that's Anglocentrism so much as it's recognizing a complexity difference between French and English writing.

But all this is in contrast to how non-Roman characters are tokenized at the byte level. That just seems bad, and like it's definitely going to make things worse for non-Roman languages. There's no point in having tokens that split characters.
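For anyone who wants to reproduce counts like these, here is a minimal sketch using OpenAI's tiktoken package; the choice of the GPT-3-era "r50k_base" encoding and the Japanese phrase are assumptions for illustration, and exact counts vary by encoding:

```python
# A minimal sketch for checking token counts with the tiktoken package
# (pip install tiktoken). "r50k_base" roughly corresponds to the GPT-3-era
# tokenizer; newer encodings such as "cl100k_base" give different counts.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

phrases = [
    "I want a pizza",
    "I would like a pizza",
    "Je voudrais une pizza",
    "quiero pizza",
    "ピザが食べたい",  # illustrative non-Latin example to see byte-level splits
]

for phrase in phrases:
    token_ids = enc.encode(phrase)
    # Each token maps to a byte sequence; multi-byte characters can be split
    # across several tokens, which is the byte-level behaviour discussed above.
    pieces = [enc.decode_single_token_bytes(t) for t in token_ids]
    print(f"{phrase!r}: {len(token_ids)} tokens -> {pieces}")
```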
kouteiheika, about 2 years ago
Slightly offtopic, but:

> One of the models listed above called NLLB (No Language Left Behind) has been open sourced by Facebook allowing for translation for 200 languages.

It was not. The model's weights are under CC-BY-NC, which certainly motivates commercial entities to not leave those languages behind. /s
FredPret, about 2 years ago
What an interesting aspect I haven't considered before. All the AIs will be trained on the available media - most of which is English.

I sometimes wonder what it takes to unseat a lingua franca, but it looks like we won't see that soon. English is set to dominate for a long time.
galaxytachyon, about 2 years ago
So what I got from this is that GPT was trained on a dataset biased toward English content. Is that right?

I think even humans have to spend extra energy to speak a language they were not born with, no matter how fluent they are in that language. I don't know about natural multilinguals.
wolfium3, about 2 years ago
You can use their online tool to see how it tokenizes words: https://platform.openai.com/tokenizer
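The same check works offline with the tiktoken package; a small sketch (the model name passed to encoding_for_model is just one possible choice):

```python
# Offline equivalent of the web tokenizer: print each token and its text piece.
# encoding_for_model("text-davinci-003") picks a GPT-3-era encoding here;
# pass whichever model name you care about.
import tiktoken

enc = tiktoken.encoding_for_model("text-davinci-003")
for token_id in enc.encode("Je voudrais une pizza"):
    # Tokens covering part of a multi-byte character decode to "�".
    print(token_id, repr(enc.decode([token_id])))
```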
karmoka, about 2 years ago
"Je voudrais une pizza" is better translated as "I would like a pizza"; "I want a pizza" would be "je veux une pizza".
bob1029, about 2 years ago
If you think about this from a "language is computation" perspective, it starts to get even more interesting.

For example, what would the real-world performance of ChatGPT be if we had trained it predominantly on German or Korean text?

Is English actually the best language/structure for this system?
wordpad25, about 2 years ago
HUGE SALE! Save 93% OFF on GPT API by translating prompt into English first!!!
rubywilde, about 2 years ago
Actually, it is not true. Hilarious.

The author compares different encoders: Facebook's NLLB and GPT-2. Where did the title come from?

Another point is that OpenAI changed encoders for the chat models. Link: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb

Now English is less optimized for token usage and other languages are much more balanced. E.g. Ukrainian takes only twice as many tokens, where before it took 6 times more.
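A rough way to check that claim is to compare the two encoders directly; a sketch with tiktoken, where the Ukrainian sample sentence is illustrative rather than taken from the article:

```python
# Compare the older GPT-2 encoder with cl100k_base (used by the chat models).
import tiktoken

text_en = "I would like a pizza"
text_uk = "Я хотів би піцу"  # illustrative Ukrainian sample sentence

for name in ("gpt2", "cl100k_base"):
    enc = tiktoken.get_encoding(name)
    en, uk = len(enc.encode(text_en)), len(enc.encode(text_uk))
    print(f"{name}: EN={en} tokens, UK={uk} tokens, ratio={uk / en:.1f}")
```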
FrostKiwi, about 2 years ago
So glad someone took the time to put up some data about it. Since day one, the subpar results for Asian languages have stuck out to me. It's especially true for LLaMA-derived models, where the output is just abysmal. It's my own pet theory that bad tokenization is an important reason why they suck so much in the first place.

It's not just broken grammar, it's a surprising lack of creativity that English doesn't suffer from. ChatGPT in English -> DeepL and fixing the auto-translation gives vastly better results than prompting ChatGPT to respond in an Asian language.
mgaunard, about 2 years ago
So for Latin-script languages they tokenize per word, and somehow for Asian languages it's tokenizing per radical.

Of course you'd end up with a lot more tokens. Just tokenize by word regardless of language.
Imnimo, about 2 years ago
Setting aside the specific choice of tokenizer for GPT models, I'm curious how much difference in performance is made by the features of the human language used to represent the training data. Like if you kept the exact same training corpus and could wave a magic wand and translate it into any language, and could create a custom tokenization for each language, would some be more amenable than others to GPT-style language modeling?
startupsfail, about 2 years ago
I'm finding it amazing that the model comes localized and supports obscure languages and is available. Compare this to traditional software. Or even to web software. Does Google come localized to all of these languages, for example?

Yes, there is overhead from localization. So what, this overhead was always there for software.
jinushaun, about 2 years ago
The French example is strange and shows that the language model has an English bias.

    - "I want a pizza" = 4 tokens
    - "Je voudrais une pizza" = 7 tokens

Why is "want" only 1 token in English, but "voudrais" 4 tokens? Following the French example, would "wants" and "wanted" map to 1 or 2 tokens?
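One way to answer that question is to ask the tokenizer directly; a quick sketch, again assuming the GPT-3-era "r50k_base" encoding:

```python
# Check how morphological variants split. The leading space matters: GPT-style
# BPE tokens usually include it, so " want" is the mid-sentence form.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")  # assumed GPT-3-era encoding
for word in ("want", "wants", "wanted", "voudrais"):
    ids = enc.encode(" " + word)
    print(word, "->", len(ids), "token(s):", [enc.decode([i]) for i in ids])
```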
seba_dos1, about 2 years ago
tl;dr - because it operates on tokens, not words, and the set of tokens it uses is optimized for representing English text.
29athrowaway, about 2 years ago
It is not that tokenization is optimized for English, but rather the other way around, perhaps.

Take "lampara" or "pantalones" in Spanish, for example. English speakers were clever enough to shorten those words to "lamp" and "pants" respectively. And they have done this with many words.

Translate text into Spanish and you will see the text gets longer and there is more meaning encoded into words.

"La mesa" refers to a female table, although tables are not lifeforms and have no sex.

To me some languages impose a communication tax. It is taboo because people conflate language and culture and such.