Listening with LLM

132 points by ppymou, over 1 year ago

3 comments

refulgentis, over 1 year ago

If author is around: amazing work!!! Multimodal from scratch :)

I'm curious if you have the test clip you use, I got to the end and was like "wait....is that a good result! The words are completely different!"

Then I re-read a couple times scanning carefully for references to what the audio is.

This quote[^1] makes me think the sample is music, as that would explain why the end result is good -- it's trying to describe a sound file of just music, not a sound file that is a spoken word version of the "ground truth":

[^1] "For dataset, I chose MusicCaps. I did not see any convenient links to download processed/segmented audio files, so I wrote a small script to download the Youtube videos."
modeless, over 1 year ago

I love this research direction! Multimodal is the future and the possibilities of gluing together pretrained models are underexplored. As tinkerers it's something we can do at home that doesn't require a datacenter full of H100s or a terabyte dataset.

Crazy that you were able to trace your issues to bad RAM! I probably would have torn all my hair out long before suspecting bad RAM.

I imagine that Whisper-based embeddings wouldn't be great for analyzing music but they should be excellent for allowing LLMs to understand speech. Although it might seem trivial to hook up Whisper to LLMs already using text, I think using embeddings instead (or in addition) would allow the LLM to understand much more about speech. Cadence, tone, accent, etc. I think something like this will be necessary for speech agents in the medium term. It should allow an LLM to respond much more naturally to speech input, vs. just giving it the text output of a speech-to-text system. Maybe it could be done on the output side too, hooking it up to the internals of a text-to-speech system for an end-to-end audio-to-audio chatbot!

Do you have a Twitter account or some other way to follow your progress?
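To make the "embeddings instead of text" idea concrete, here is a minimal sketch of one common way to glue the two models together. It is an illustration under stated assumptions, not necessarily what the post's author did. Audio-encoder frames are passed through a small linear adapter so they match the LLM's embedding width, then spliced between the embedded prompt tokens. The names `make_projection`, `project`, and `build_input` and the toy dimensions are all hypothetical. A real setup would use the actual encoder and LLM hidden sizes and train the adapter weights.

```python
import random

def make_projection(audio_dim, llm_dim, seed=0):
    """Hypothetical adapter: a randomly initialised audio_dim x llm_dim matrix.
    In practice these weights would be learned, not left random."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(llm_dim)]
            for _ in range(audio_dim)]

def project(frames, weights):
    """Map each audio frame (length audio_dim) to an LLM-sized vector."""
    llm_dim = len(weights[0])
    return [
        [sum(x * weights[i][j] for i, x in enumerate(frame))
         for j in range(llm_dim)]
        for frame in frames
    ]

def build_input(prefix_embeds, audio_embeds, suffix_embeds):
    """Splice the projected audio between embedded prompt tokens, so the
    LLM sees one sequence: [prompt prefix] + [audio] + [prompt suffix]."""
    return prefix_embeds + audio_embeds + suffix_embeds

# Toy sizes, assumed for illustration only.
AUDIO_DIM, LLM_DIM = 3, 4
adapter = make_projection(AUDIO_DIM, LLM_DIM)
audio_frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # stand-in encoder output
audio_embeds = project(audio_frames, adapter)
sequence = build_input([[0.0] * LLM_DIM], audio_embeds, [[0.0] * LLM_DIM] * 2)
```

Because the LLM receives continuous vectors rather than a transcript, information that a transcript discards (cadence, tone, accent) can in principle survive, provided the encoder captured it in the first place.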
asymmetric, over 1 year ago
Very OT, but I love the style of your resume. Is the source available somewhere?