I know it's not the main point of this, but... there are so many multimodal models now that take a frozen vision encoder and a language decoder and weld them together with a projection layer! I wanna grab the EVA02-CLIP-E image encoder and the Llama-2 33B model and do the same, I bet that'd be fun :D
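
For anyone who hasn't seen the "weld" spelled out, here's a rough toy sketch of the idea in plain Python so it runs with no dependencies. Everything in it is an assumption for illustration: the two `frozen_*` functions are hypothetical stubs that return random vectors instead of real encoder or LLM outputs, the dimensions are made up, and a real version would use something like PyTorch with the actual checkpoints. The only part you'd train is the projection.

```python
import random

random.seed(0)

VISION_DIM = 8   # toy stand-in for the image encoder's feature width (assumption)
LLM_DIM = 12     # toy stand-in for the language model's embedding width (assumption)
NUM_PATCHES = 4  # toy number of image patch tokens (assumption)


def frozen_vision_encoder(image):
    """Hypothetical stub: NUM_PATCHES random feature vectors of width VISION_DIM."""
    return [[random.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]


def frozen_text_embedding(token_ids):
    """Hypothetical stub: one random LLM_DIM-wide embedding per token id."""
    return [[random.gauss(0, 1) for _ in range(LLM_DIM)] for _ in token_ids]


class Projection:
    """The only trainable part: a linear map from VISION_DIM to LLM_DIM."""

    def __init__(self, in_dim, out_dim):
        # weight has shape (in_dim, out_dim); bias has shape (out_dim,)
        self.weight = [[random.gauss(0, 0.02) for _ in range(out_dim)] for _ in range(in_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, features):
        # Plain matrix multiply: each output entry is row . column + bias
        return [
            [sum(x * w for x, w in zip(row, col)) + b
             for col, b in zip(zip(*self.weight), self.bias)]
            for row in features
        ]


def build_decoder_input(image, token_ids, projection):
    # Project image features into the LLM's embedding space, then prepend
    # them to the text embeddings so the decoder sees one sequence.
    image_tokens = projection(frozen_vision_encoder(image))
    text_tokens = frozen_text_embedding(token_ids)
    return image_tokens + text_tokens


projection = Projection(VISION_DIM, LLM_DIM)
sequence = build_decoder_input(image=None, token_ids=[1, 2, 3], projection=projection)
print(len(sequence), len(sequence[0]))  # 7 12
```

So 4 image tokens plus 3 text tokens come out as 7 vectors, all in the language model's embedding width, which is what the decoder would consume.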