Listening with LLM

132 points by ppymou over 1 year ago

3 comments

refulgentis over 1 year ago
If author is around: amazing work!!! Multimodal from scratch :)

I'm curious if you have the test clip you use, I got to the end and was like "wait....is that a good result? The words are completely different!"

Then I re-read a couple times scanning carefully for references to what the audio is.

This quote[^1] makes me think the sample is music, as that would explain why the end result is good -- it's trying to describe a sound file of just music, not a sound file that is a spoken word version of the "ground truth":

[^1] "For dataset, I chose MusicCaps. I did not see any convenient links to download processed/segmented audio files, so I wrote a small script to download the Youtube videos."
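For context, a minimal sketch of what such a MusicCaps download script might look like, using yt-dlp and ffmpeg. The CSV filename and column names (ytid, start_s, end_s) are assumptions based on the public MusicCaps metadata, not the author's actual script:

```python
# Hypothetical sketch: fetch MusicCaps clips with yt-dlp, then trim with ffmpeg.
# Assumes a musiccaps.csv with columns ytid, start_s, end_s.
import csv
import subprocess
from pathlib import Path

OUT_DIR = Path("musiccaps_audio")
OUT_DIR.mkdir(exist_ok=True)

def fetch_clip(ytid: str, start_s: float, end_s: float) -> None:
    full = OUT_DIR / f"{ytid}_full.wav"
    clip = OUT_DIR / f"{ytid}.wav"
    # Download the audio track and convert it to WAV.
    subprocess.run(
        ["yt-dlp", "-x", "--audio-format", "wav",
         "-o", str(OUT_DIR / f"{ytid}_full.%(ext)s"),
         f"https://www.youtube.com/watch?v={ytid}"],
        check=True,
    )
    # Cut out the labeled segment (seek/stop as output options for accuracy).
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(full),
         "-ss", str(start_s), "-to", str(end_s), str(clip)],
        check=True,
    )
    full.unlink()

with open("musiccaps.csv", newline="") as f:
    for row in csv.DictReader(f):
        fetch_clip(row["ytid"], float(row["start_s"]), float(row["end_s"]))
```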
modeless over 1 year ago
I love this research direction! Multimodal is the future and the possibilities of gluing together pretrained models are underexplored. As tinkerers it's something we can do at home that doesn't require a datacenter full of H100s or a terabyte dataset.

Crazy that you were able to trace your issues to bad RAM! I probably would have torn all my hair out long before suspecting bad RAM.

I imagine that Whisper-based embeddings wouldn't be great for analyzing music, but they should be excellent for allowing LLMs to understand speech. Although it might seem trivial to hook up Whisper to LLMs already using text, I think using embeddings instead (or in addition) would allow the LLM to understand much more about speech: cadence, tone, accent, etc. I think something like this will be necessary for speech agents in the medium term. It should allow an LLM to respond much more naturally to speech input, vs. just giving it the text output of a speech-to-text system. Maybe it could be done on the output side too, hooking it up to the internals of a text-to-speech system for an end-to-end audio-to-audio chatbot!

Do you have a Twitter account or some other way to follow your progress?
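A minimal sketch of the "glue pretrained models together" idea described above: project Whisper encoder states into an LLM's embedding space and prepend them to the text prompt. The model names, shapes, and the single linear projector are illustrative assumptions, not the author's setup:

```python
# Hypothetical sketch: condition an LLM on Whisper encoder embeddings via a
# small trainable projection, rather than on transcribed text.
import torch
import torch.nn as nn
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          WhisperFeatureExtractor, WhisperModel)

audio_encoder = WhisperModel.from_pretrained("openai/whisper-small").encoder
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
llm = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Trainable bridge from Whisper's hidden size to the LLM's embedding size.
projector = nn.Linear(audio_encoder.config.d_model, llm.config.hidden_size)

def audio_conditioned_logits(waveform, prompt: str):
    # waveform: 1-D float array sampled at 16 kHz.
    feats = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        audio_states = audio_encoder(feats.input_features).last_hidden_state  # (1, T, d_model)
    audio_embeds = projector(audio_states)                                     # (1, T, hidden)

    token_ids = tokenizer(prompt, return_tensors="pt").input_ids
    text_embeds = llm.get_input_embeddings()(token_ids)                        # (1, L, hidden)

    # Prepend audio "tokens" to the prompt and run the LLM on raw embeddings,
    # so it sees the audio representation directly instead of a transcript.
    inputs_embeds = torch.cat([audio_embeds, text_embeds], dim=1)
    return llm(inputs_embeds=inputs_embeds).logits
```

In a training loop, only the projector (and optionally the LLM) would be updated while the Whisper encoder stays frozen.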
asymmetric over 1 year ago
Very OT, but I love the style of your resume. Is the source available somewhere?