If token limits and accuracy matter, English (and other natural languages) seems suboptimal.<p>Natural languages are a butchered product of history and easy verbal noises.<p>A new custom language seems inevitable: concise, unambiguous, built from custom words with explicit relations. Replace common sentences with short strings, e.g. "Once upon a time..." becomes "a1".<p>Most likely alphanumeric, to minimise tokens and gain an order-of-magnitude increase in effective context window.<p>Followed by translation back to {language}.<p>Is this possible? Is anyone working on it?<p>(here to be educated)
> Replacing common sentences with simple strings<p>This is what byte-pair encoding does. It doesn't go quite so far as to allocate a single token to "Once upon a time", because that string isn't actually <i>that</i> common, but in principle it could.<p>Trying to get humans to produce content directly in such a concise representation is a waste of time, since LLMs rely heavily on training from whatever content is already available on the internet, which drastically reduces the labor cost of acquiring training data.
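For illustration, here's a minimal character-level BPE sketch. The toy corpus and the merge count are arbitrary choices for the demo; real tokenizers operate on bytes with tens of thousands of learned merges, but the principle is the same: repeatedly fuse the most frequent adjacent pair, so common phrases collapse into single units.<p>

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most common adjacent token pair, or None if too short."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge(tokens, pair, new_token):
    """Replace every occurrence of `pair` with the merged `new_token`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Toy corpus: the first phrase is frequent, so BPE keeps merging
# it into ever-longer units; the rarer phrase stays fragmented.
corpus = "once upon a time " * 50 + "it was a dark night " * 3
tokens = list(corpus)  # start from individual characters

for _ in range(30):  # 30 merges, chosen arbitrarily for the demo
    pair = most_frequent_pair(tokens)
    if pair is None:
        break
    tokens = merge(tokens, pair, pair[0] + pair[1])

# The longest learned units come from the most frequent phrase.
print(sorted(set(tokens), key=len, reverse=True)[:3])
```

Run it and the longest surviving tokens are long chunks of "once upon a time ", while "it was a dark night " remains in small pieces: frequency, not human design, decides what gets a short code.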