A lot of it is that language models need a lot of training data to work.

If pronoun X were as commonly used as "he" or "she", then a language model would learn it as well as "he" or "she". If X is used 0.1% as often, the language model is going to have trouble.

You could make a synthetic data set where "he", "she", and X each occur about 1/3 of the time, and the model would handle X pretty well in certain ways, but it would make other mistakes because it thinks X occurs too often.

It's a basic problem of language models that they conflate syntax and semantics and thus get the meaning of things wrong. They come to conclusions like "Tyrone is a thug", "Fat people are transgender", etc. (When do "they" get to put A on their passport for Adipose?)
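The rebalancing idea above can be sketched in a few lines. This is a toy illustration, not anyone's actual pipeline: the templates, the pronoun "xe", and the usage rates are all made up. It oversamples sentences containing the rare pronoun until the three pronouns appear at roughly equal rates, which also demonstrates the catch: the model now sees X far more often than it occurs in real text.

```python
import random

# Hypothetical toy corpus: subject-position templates filled with pronouns.
# "xe" stands in for the rare pronoun X; all names and rates are invented.
templates = ["{} went home.", "{} said hello.", "{} likes coffee."]
pronouns = ["he"] * 499 + ["she"] * 499 + ["xe"] * 2  # X at ~0.2% of uses

random.seed(0)
corpus = [random.choice(templates).format(random.choice(pronouns))
          for _ in range(10_000)]

# Rebalance by oversampling: duplicate sentences containing the rare
# pronoun until "he", "she", and "xe" each appear about 1/3 of the time.
xe_sents = [s for s in corpus if "xe" in s.split()]
other    = [s for s in corpus if "xe" not in s.split()]
target   = len(other) // 2            # match the per-pronoun he/she count
balanced = other + random.choices(xe_sents, k=target)

def pronoun_freq(sents, p):
    """Fraction of sentences whose first token is the pronoun p."""
    return sum(p in s.split() for s in sents) / len(sents)
```

After rebalancing, `pronoun_freq(balanced, "xe")` is near 1/3, versus roughly 0.002 in the raw corpus. That inflated unigram frequency is exactly the distortion described above: the model trained on `balanced` would overpredict X in contexts where real usage is rare.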