I love this research direction! Multimodal is the future, and the possibilities of gluing together pretrained models are underexplored. As tinkerers, it's something we can do at home that doesn't require a datacenter full of H100s or a terabyte-scale dataset.

Crazy that you were able to trace your issues to bad RAM! I probably would have torn all my hair out long before suspecting the hardware.

I imagine Whisper-based embeddings wouldn't be great for analyzing music, but they should be excellent for letting LLMs understand speech. Hooking Whisper up to an LLM through its text output is already trivial, but I think feeding in the embeddings instead (or in addition) would let the LLM pick up much more about the speech itself: cadence, tone, accent, and so on. I think something like this will be necessary for speech agents in the medium term; it should let an LLM respond much more naturally to speech input than just handing it the output of a speech-to-text system. Maybe it could be done on the output side too, hooking into the internals of a text-to-speech system for an end-to-end audio-to-audio chatbot! (Rough sketch of the gluing idea at the end of this comment.)

Do you have a Twitter account or some other way to follow your progress?
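For anyone curious what that kind of gluing could look like, here's a minimal sketch, assuming Hugging Face Transformers, a small Whisper checkpoint, and a hypothetical (untrained) linear adapter that projects Whisper encoder states into the LLM's embedding space. The model names and the `respond` helper are just illustrative choices, not anything from the original post:

```python
import torch
from transformers import (
    WhisperModel,
    WhisperFeatureExtractor,
    AutoModelForCausalLM,
    AutoTokenizer,
)

# Hypothetical model choices -- use whatever you're experimenting with.
WHISPER_ID = "openai/whisper-small"
LLM_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

whisper = WhisperModel.from_pretrained(WHISPER_ID)
feature_extractor = WhisperFeatureExtractor.from_pretrained(WHISPER_ID)
llm = AutoModelForCausalLM.from_pretrained(LLM_ID)
tokenizer = AutoTokenizer.from_pretrained(LLM_ID)

# The "glue": a small adapter mapping Whisper encoder states (d_model)
# into the LLM's token embedding space (hidden_size). Untrained here.
adapter = torch.nn.Linear(whisper.config.d_model, llm.config.hidden_size)

def speech_to_prefix(waveform, sampling_rate=16000):
    """Run audio through Whisper's encoder and project into LLM embedding space."""
    feats = feature_extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        # Whisper pads audio to 30 s, so this is ~1500 frames of (1, frames, d_model).
        enc = whisper.encoder(feats.input_features).last_hidden_state
    return adapter(enc)  # (1, frames, hidden_size)

def respond(waveform, prompt="Describe how the speaker sounds:"):
    """Prepend the projected audio 'tokens' to a text prompt and generate."""
    audio_embeds = speech_to_prefix(waveform)
    text_ids = tokenizer(prompt, return_tensors="pt").input_ids
    text_embeds = llm.get_input_embeddings()(text_ids)
    inputs_embeds = torch.cat([audio_embeds, text_embeds], dim=1)
    out = llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```

With the adapter untrained the output will be gibberish, of course; the point is just the plumbing. The usual first step is to train only the adapter on paired audio/text while both pretrained models stay frozen, roughly the LLaVA recipe applied to audio, before touching the LLM itself.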