Since llava is multimodal, I wonder if there's a chance here to strip out a bit of complexity. Specifically, instead of going through 3 embeddings (llava internal, text, mini-lm), could you use the hidden state from a non-final layer of llava directly as your vector? It would probably require a bit of fine-tuning, though.<p>For pure text, that's kind of how e5-mistral works: <a href="https://huggingface.co/intfloat/e5-mistral-7b-instruct" rel="nofollow">https://huggingface.co/intfloat/e5-mistral-7b-instruct</a>. Or yeah, just use clip like another commenter suggests...
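<p>To make the "use a hidden layer as your vector" idea concrete, here's a toy sketch of the pooling step in plain Python. It assumes you've already pulled per-token hidden states out of whichever layer you picked (in Hugging Face transformers you'd typically get these by passing output_hidden_states=True), and it uses last-token pooling, which is what the e5-mistral model card does for text. The function names and the tiny 2-d "hidden states" are made up for illustration; this isn't llava's actual API, and it says nothing about whether the vectors would be any good without fine-tuning.

```python
import math

def last_token_pool(hidden, mask):
    """Pick the hidden state of the last real (non-padding) token and L2-normalize it.

    hidden: list of per-token vectors for ONE sequence, shape [seq][dim]
    mask:   list of 0/1 flags, 1 = real token, 0 = padding
    Raises ValueError if the mask has no real tokens.
    """
    last = max(i for i, m in enumerate(mask) if m)
    vec = hidden[last]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # avoid dividing by zero
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity of two already-normalized vectors (just a dot product)."""
    return sum(x * y for x, y in zip(a, b))

# Toy stand-in for one layer's output: 3 token positions, 2 dims, last one is padding.
layer_output = [[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

embedding = last_token_pool(layer_output, attention_mask)
print(embedding)
```

The padding position is skipped, so the vector comes from the second token. With real model outputs you'd do the same thing per sequence on a tensor and then compare embeddings with cosine similarity.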