Mini-Gemini is a bit of a confusing name.<p>Reminds me of how DALL·E Mini came out three years ago and eventually had to rename itself to Craiyon: <a href="https://github.com/borisdayma/dalle-mini">https://github.com/borisdayma/dalle-mini</a>
The paper introduces Mini-Gemini, a framework for enhancing Vision Language Models (VLMs) and narrowing the performance gap with advanced models like GPT-4 and Gemini. It focuses on three things: higher-resolution visual tokens, high-quality datasets for better image comprehension, and a broader operational scope for VLMs. Mini-Gemini supports a range of large language models and reports leading performance on zero-shot benchmarks. The code and models are publicly available.
WTF is a "Multi-modality Vision Language Model"? Does it mean:<p>- a program where you give it a text description, and it outputs a picture<p>- a program where you give it a picture, and it outputs a text description<p>- both of the above<p>- something else<p>?