Seems like an improvement on the ALOHA approach? You still need to fine-tune it on roughly the same number of out-of-distribution (OOD) examples. Contrast this with Google's approach over 2023, which was to train large vision-language models with the goal of generalizing to OOD inputs.