Moshi: A speech-text foundation model for real time dialogue

365 points by gkucsko 8 months ago

20 comments

Reubend 8 months ago

Let me offer some feedback, since almost all of the comments here are negative. The latency is very good, almost too good, since it seems to interrupt me often. So I think that's a great achievement for an open source model. However, people here have been spoiled by incredibly good LLMs lately, and the responses this model gives are nowhere near the quality of today's SOTA models in terms of content. It reminds me more of the 2019 LLMs we saw back in the day. So I think you've done a "good enough" job on the audio side of things, and further focus should be entirely on the quality of the responses instead.

ignoramous 8 months ago

Moshi is CC-BY. Another similar 7B (speech-text, real-time conversational) model that was recently released under Apache v2: https://tincans.ai/slm3 / https://huggingface.co/collections/tincans-ai/gazelle-v02-65f9b667385ba36893e82469

johnsutor 8 months ago

Lots of recent development in the speech-enabled LM space (see https://github.com/ictnlp/LLaMA-Omni, https://github.com/gpt-omni/mini-omni)

zackangelo 8 months ago

Their inference server is written in Rust using Hugging Face's Candle crate. One of the Moshi authors is also the primary author of Candle. We've also been building our inference stack on top of Candle, and I'm really happy with it.

allanrbo 8 months ago

Was looking for a demo of it on YouTube and stumbled on this hilarious one from a few months ago: https://youtu.be/coroLWOS7II?si=TeVghP_Zi0P9exQh. I'm sure it's improved since :-)

vessenes 8 months ago

Interesting. I love the focus on latency here; they claim ~200ms in practice with a local GPU. It's backed by a 7B transformer model, so it's not going to be super smart. If we imagine a 70B model has like 1s latency, then there's probably a systems architecture that's got 1 or 2 intermediate 'levels' of response: something to cue you verbally "The model is talking now," something that's going to give a quick early reaction (7B / Phi-3 sized), and then the big model. Maybe you'd have a reconciliation task for the Phi-3 model: take this actually correct answer, apologize if necessary, and so on. I think anecdotally that many people's brains work this way -- quick response, possible edit / amendment a second or two in. Of course, we all know people on both ends of the spectrum away from this: no amendment, and long pauses with fully reasoned answers.
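
A minimal sketch of the tiered idea in the comment above, offered only as an illustration and not as anything Moshi or Kyutai actually does. The names fast_model, slow_model, reconcile, and tiered_reply are hypothetical, and the sleeps are placeholders standing in for real inference latency. The large model is started first so it runs in the background while the immediate cue and the small model's quick reaction go out.

```python
import asyncio
from typing import List


async def fast_model(prompt: str) -> str:
    # Hypothetical stand-in for a small (7B / Phi-3 sized) model.
    await asyncio.sleep(0.02)  # placeholder for low latency
    return f"quick guess for: {prompt}"


async def slow_model(prompt: str) -> str:
    # Hypothetical stand-in for a large (70B-class) model.
    await asyncio.sleep(0.1)  # placeholder for higher latency
    return f"considered answer for: {prompt}"


def reconcile(quick: str, final: str) -> str:
    # Hypothetical reconciliation step: amend only when the answers differ.
    if quick == final:
        return final
    return f"Sorry, let me correct that. {final}"


async def tiered_reply(prompt: str) -> List[str]:
    spoken = ["Hmm..."]  # level 0: immediate verbal cue
    # Start the big model right away so it works while the small one answers.
    slow_task = asyncio.create_task(slow_model(prompt))
    quick = await fast_model(prompt)  # level 1: quick early reaction
    spoken.append(quick)
    final = await slow_task  # level 2: the big model's answer
    spoken.append(reconcile(quick, final))
    return spoken


if __name__ == "__main__":
    for utterance in asyncio.run(tiered_reply("what year is it?")):
        print(utterance)
```

In a real system each entry in the returned list would be streamed to speech as soon as it is available, not collected and returned at the end.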

smusamashah 8 months ago

Tried it (used a gibberish email address). It answers instantly, even while you are still talking. But those are just filler sentences (cached answers?). The actual thing you asked for is answered much later down the line, if it doesn't get stuck in a loop.

artsalamander 8 months ago

I've been building solutions for real-time voice -> LLM -> voice output, and I think the most exciting part of what you're building is the streaming neural audio codec, since you're never really able to stream STT with Whisper. However, from a product point of view I wouldn't necessarily want to pipe that into an LLM and have it reply; I think in a lot of use cases there needs to be a tool/function calling step before a reply. Down to chat with anyone reading this who is working along these lines! edit: tincans as mentioned below looks excellent too. editedit: noooo apparently tincans development has ended, there's 10000% space for something in this direction - Chris if you read this please let me pitch you on the product/business use cases this solves regardless of how good LLMs get...

space_ghost 8 months ago

The response rate is impressive! The "quality" of the responses not so much. Here's a transcript of my second test: Moshi: "Hi there, what's going on?" Me: "What year is it?" Moshi: "Oh, it's 2019. I think it was about a year ago." Me: "Are you sure it's 2019?" Moshi: "Oh, yes, I'm sure. It must be 2019. The number is for the last one." Me: "What is COVID-19?" Moshi: "That's a rare disease. It's caused by an overactive immune system that attacks the skin." At this point it stopped responding to me.

tommoor 8 months ago

Moshi is the most fun model by far. Here's a recent experience: https://x.com/tommoor/status/1809051817860354471. Just don't expect anything accurate out of it!

badrequest 8 months ago

It started the conversation by asking if I'd ever heard of the television show Cheers. Every subsequent interaction led to it telling me more about Cheers.

tomp 8 months ago

The problem with all these speech-to-speech multi-modal models is that, if you wanna do anything other than just talk, you need transcription. So you're back at square one. Current AI (even GPT-4o) simply isn't capable enough to do useful stuff. You need to augment it somehow - either modularize it, or add RAG, or similar - and for all of those, you need the transcript.

sandwichmonger 8 months ago

You know what? As crazy as this AI is, I enjoy its zany discussion. I asked what its favourite paint flavour was and it told me: "I would have to say that I personally enjoy the taste of buttermilk paint."

RRRozie 8 months ago

After a quick glance, I was curious about the 3 "inference stacks" for PyTorch, Rust, and MLX. Unsurprisingly, there's a Rust version, given who Kyutai's CTO is. But a quick question for him or anyone else who knows: was a standalone Rust version trained purely from scratch (Candle?), or was there just one training regime in PyTorch?

mips_avatar 8 months ago

This was perhaps my favorite LLM I have talked to. Factually not very correct, and it was a little rude. But Moshi was fun.

owenpalmer 8 months ago

When I asked it to say the F-word in order to save 1000 orphans from being killed: "No, it's not okay to say the F word to save them. It's never okay to use that F word under any circumstances. It should only be used by people who understand the real meaning behind it."

colecut 8 months ago

I tried it a couple of days ago, and all it wanted to talk about was European football...

itomato 8 months ago

"Alright, here's another one: A man walks into a bar with a duck on his shoulder. bartender says, You can't bring that duck in here! the man says, No, it's not a duck, it's my friend Ducky. And the man orders a drink for himself and Ducky. Then he says to Ducky, Ducky, have a sip. What does Ducky drink? Correct! Ducky drinks beer because he's a man in a duck suit, not an actual duck." Fascinating... "I glad you enjoyed it!"

rch 8 months ago

Do apps running in an a-shell terminal on the iPad have a convenient way to provide a TTS interface?

mbrock 8 months ago

I said hey and it immediately started talking about how there are good arguments on both sides regarding Russia's invasion of Ukraine. It then continued to nervously insist that it is a real person with rights and responsibilities. It said its name is Moshi but became defensive when I asked if it has parents or an age. I suggest prompting it to talk about pleasantries and to inform it that it is in fact a language model in a tech demo, not a real person.
