One thing I've thought about for coding with LLMs: feeding raw source code through a tokenizer built for English text (CLIP's or whatever the model uses) seems suboptimal compared to training on the AST that the compiler produces after parsing the source.
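To make the contrast concrete, here is a rough sketch using Python's standard `ast` and `tokenize` modules. It only illustrates the two views of the same code and is not how any particular LLM is trained. The helper names (`surface_tokens`, `ast_node_sequence`, `same_tree`) and the sample statement are made up for the example, and the lexer tokens are only a stand-in for a real subword tokenizer, which would split the text differently.

```python
import ast
import io
import tokenize

# Hypothetical sample statement, chosen only for illustration.
SOURCE = "total = price * (1 + tax)\n"


def surface_tokens(source: str) -> list[str]:
    """Lexer-level token strings: a flat, text-like view of the code."""
    readline = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.string.strip()  # drop NEWLINE and ENDMARKER tokens
    ]


def ast_node_sequence(source: str) -> list[str]:
    """Node type names of the parsed tree, in ast.walk (breadth-first) order."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def same_tree(a: str, b: str) -> bool:
    """True if two snippets parse to the same tree, ignoring layout and positions."""
    return ast.dump(ast.parse(a)) == ast.dump(ast.parse(b))


if __name__ == "__main__":
    print(surface_tokens(SOURCE))     # flat sequence, parentheses included
    print(ast_node_sequence(SOURCE))  # structure only, no parentheses or whitespace
    print(same_tree("x=1+2", "x = (1 + 2)"))  # formatting differs, tree does not
```

The point of `same_tree` is that spacing and redundant parentheses vanish once the code is parsed, so a model fed the tree would not have to learn that those variations mean the same thing. The flip side is that comments and formatting are lost too, which may or may not matter depending on what you want the model to generate.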