Needs a (2023) tag. But definitely the release of ARC2 and image outputs from 4o got me thinking about the JEPA family too.<p>I don't know if it's right (and I'm sure JEPA has lots of performance issues) but seems good to have a fully latent space representation, ideally across all modalities, so that the concept "an apple a day keeps the doctor away" becoming image/audio/text is a choice of decoder rather than dedicated token ranges being chosen even before the actual creation process in the model begins.