Interesting. I've also been running an IRC bot with multimodal capabilities for months now. It's not a real LMM but rather a combination of three models: it uses Llava for images and Whisper for audio. The pipeline is simple (rough sketch below): if the bot finds a URL that looks like an image, it feeds it to Llava (same with audio and Whisper). Llava's response is then injected back into the main LLM (a round robin of Solar 10.7B and Llama 13B), which produces the reply in the style of the bot's character (persona) and in the context of the conversation. I run it locally on my RTX 3060 using llama.cpp. Additionally, it can search Wikipedia and the news (provided by Yahoo RSS) and can open HTML pages (if it sees a URL that is neither an image nor audio).

Llava is a surprisingly good model for its size. However, I found that it often hallucinates "2 people in the background" for many images.

I made the bot just to explore how far I could go with local off-the-shelf LLMs; I never thought it could be useful for blind people, so that's interesting. One practical idea I had in mind was to hook it up to a webcam so that, for example, the bot could notify me when something interesting happens in front of my house. I guess it could also be useful for blind people if the camera were mounted on the body.
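For the curious, the dispatch logic is roughly this (a simplified Python sketch, not the actual code; the helper functions are placeholders for the real calls into llama.cpp, Whisper, and the page fetcher):

    import re

    URL_RE = re.compile(r"https?://\S+")
    IMAGE_EXT = (".jpg", ".jpeg", ".png", ".gif", ".webp")
    AUDIO_EXT = (".mp3", ".ogg", ".wav", ".flac")

    # Placeholder helpers: in the real bot these wrap llama.cpp (Llava),
    # Whisper, an HTML fetcher, and the persona prompt template.
    def caption_with_llava(url: str) -> str: ...
    def transcribe_with_whisper(url: str) -> str: ...
    def summarize_html(url: str) -> str: ...
    def build_persona_prompt(nick: str, message: str, notes: list[str]) -> str: ...
    def generate_with_main_llm(prompt: str) -> str: ...  # Solar 10.7B / Llama 13B round robin

    def describe_url(url: str) -> str:
        """Route a URL to the right model and return a text description."""
        path = url.lower().split("?", 1)[0]
        if path.endswith(IMAGE_EXT):
            return caption_with_llava(url)       # image -> Llava caption
        if path.endswith(AUDIO_EXT):
            return transcribe_with_whisper(url)  # audio -> Whisper transcript
        return summarize_html(url)               # anything else -> fetch the page

    def reply(nick: str, message: str) -> str:
        notes = [f"{u}: {describe_url(u)}" for u in URL_RE.findall(message)]
        # The descriptions are injected back into the persona prompt so the
        # main LLM answers in character and in the conversation's context.
        prompt = build_persona_prompt(nick, message, notes)
        return generate_with_main_llm(prompt)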
I'm also totally blind and, somewhat relatedly, I've built Gptcmd, a small console app to make GPT conversation and experimentation easier (see the readme for more on what it does, with an inline demo). Version 2.0 will get GPT vision (image) support:

https://github.com/codeofdusk/gptcmd
I had an interesting conversation the other day about how best to make ChatGPT-style "streaming" interfaces, where text updates as it streams in, accessible to screen readers.

It's not easy! https://fedi.simonwillison.net/@simon/111836275974119220
Hey! I don’t understand too much about AI/ML/LLMs (and now LMMs!), so I’m hoping someone could explain a little further for me.

What I gather is that this is an IRC bot/plugin/add-on that lets a user prompt an ‘LMM’, which is essentially an LLM with multiple input/output capabilities (text, audio, images, etc.), which on the surface sounds awesome.

How does an LMM benefit blind users over an LLM with voice capability? Is the addition of image/video just for accessibility to non-blind people?

What’s the difference between this and integrating an LLM with voice/image/video capability?

Is there any reason this has been made over other available uncensored/free/local LLMs (aside from this being an LMM)?

Thanks in advance.
Since there's no way to truly objectively tell whether LLM output is correct, this seems like it would have its limits, even if the output seems subjectively good. But I have that problem with all of the LLM stuff, I guess.