Fascinating stuff. The idea that a single underlying model could generalize across both visual and language generation was the most interesting part of Lex Fridman's podcast with Ilya Sutskever in May: <a href="https://www.youtube.com/watch?v=13CZPWmke6A" rel="nofollow">https://www.youtube.com/watch?v=13CZPWmke6A</a>