TechEcho
Mini-Gemini: Mining the Potential of Multi-Modality Vision Language Models

83 points by milliondreams about 1 year ago

7 comments

simonw about 1 year ago
Mini-Gemini is a bit of a confusing name.
Reminds me of how DALL·E Mini came out three years ago and eventually had to rename itself to Craiyon: https://github.com/borisdayma/dalle-mini
milliondreams about 1 year ago
Code and Models - https://github.com/dvlab-research/MiniGemini
milliondreams about 1 year ago
Project website - https://mini-gemini.github.io/
ilaksh about 1 year ago
Is this based on LLaVA 1.6? Not to be too lazy, but maybe someone could link to a comparison with that, if there is one?
mountainriver about 1 year ago
Excited to see how this does on OpenCompass!
milliondreams about 1 year ago
The paper introduces Mini-Gemini, a framework aimed at enhancing Vision Language Models (VLMs) to close the performance gap with advanced models like GPT-4 and Gemini. It focuses on improving visual token resolution, creating high-quality datasets for better image comprehension, and expanding VLMs' operational scope. Mini-Gemini supports a range of large language models and has shown superior performance in zero-shot benchmarks. The code and models are publicly available.
PontifexMinimus about 1 year ago
WTF is a "Multi-modality Vision Language Model"? Does it mean:
- a program where you give it a text description, and it outputs a picture
- a program where you give it a picture, and it outputs a text description
- both of the above
- something else
?