I’ve been reading a lot about the cliff that AI frontier models face as training data sources dry up. I’ve seen synthetic data mentioned as an option but haven’t seen many details (maybe I haven’t looked hard enough).

I’m curious whether you could create an effectively unlimited supply of synthetic data, and improve coding/logic performance, by having an LLM generate code and then training on predicting (1) whether it compiles and (2) what outputs it would produce for an unlimited series of generated inputs.
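Roughly the data-generation loop I have in mind, as a minimal Python sketch (the snippet, inputs, and record format below are my own stand-ins, not anything from an existing pipeline):

    import json
    import subprocess
    import sys

    def label_snippet(code: str, inputs: list[str]) -> dict:
        """Produce training labels for one generated snippet:
        (1) does it compile, (2) what it prints for each generated input."""
        record = {"code": code, "compiles": True, "runs": []}
        try:
            compile(code, "<snippet>", "exec")  # syntax/compile check only
        except SyntaxError:
            record["compiles"] = False
            return record
        for stdin_text in inputs:
            # Run in a separate process so crashes and hangs stay contained.
            try:
                proc = subprocess.run(
                    [sys.executable, "-c", code],
                    input=stdin_text,
                    capture_output=True,
                    text=True,
                    timeout=5,
                )
                record["runs"].append(
                    {"input": stdin_text, "output": proc.stdout,
                     "exit_code": proc.returncode}
                )
            except subprocess.TimeoutExpired:
                record["runs"].append(
                    {"input": stdin_text, "output": None, "timed_out": True}
                )
        return record

    # Example: one hypothetical LLM-generated snippet and two generated inputs.
    snippet = "x = int(input())\nprint(x * x)"
    print(json.dumps(label_snippet(snippet, ["3\n", "7\n"]), indent=2))

Each record could then be turned into prediction targets ("does this compile?", "what does this print for input X?"); proper sandboxing, resource limits, and deduplication of the generated programs are glossed over here.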