Crossing the uncanny valley of conversational voice

402 points by monroewalker 2 months ago

55 comments

This was already posted here: <a href="https://news.ycombinator.com/item?id=43221377">https://news.ycombinator.com/item?id=43221377</a> but I’m really surprised at the lack of attention this model is getting. The responsiveness and apparent personality are pretty mind blowing. It’s similar to what OpenAI had initially demoed for advanced voice mode, at least for the voice conversation portion.

The demo interactions are recorded, which is mentioned in their disclaimer under the demo UI. What isn't mentioned though is that they include past conversations in the context for the model on future interactions. It was pretty surprising to be greeted with something like "welcome back" and the model being able to reference what was said in previous interactions. The full disclaimer on the page for the demo is:

" 1. Microphone permission is required. 2. Calls are recorded for quality review but not used for ML training and are deleted within 30 days. 3. By using this demo, you are agreeing to our "

edit: Actually this has been posted quite a few times already and had good visibility a couple days ago:

- <a href="https://news.ycombinator.com/item?id=43200400">https://news.ycombinator.com/item?id=43200400</a>

Others: <a href="https://hn.algolia.com/?q=sesame.com" rel="nofollow">https://hn.algolia.com/?q=sesame.com</a>

brendaniribe 2 months ago

Hey, it’s Brendan from Sesame. The feedback is spot on. We still have so much to do to make it good. Inspiring but still many steps away from a great experience. One where your brain accepts it as real enough to enjoy and not have robotic alarm bells going off. Today, we’re firmly in the valley, but we’re optimistic we can climb out.

Verbal communication is complex. There’s a big list of interesting challenges to tackle. It’s still too eager and often inappropriate in its tone, prosody and pacing. The timing of when it responds is wrong more often than right. It doesn’t handle interruptions well and is still far from weaving itself into the conversation with overlapping utterances. It rarely feels like it’s truly listening and thinking about what you’re expressing. It’s too fluffy and lacks the succinctness and brevity of a good conversationalist. Its personality is inconsistent. Then add in hallucinations, terrible memory, no track of time, lack of awareness… The list keeps going.

I believe the community can make meaningful progress on all of these.

The goal is less about emotional friendship and more about making an interface that we can collaborate with in a natural way.

Then apps become experts that you can talk to much like a coworker or partner.

The models are already powerful enough to do so many things. But finding the right prompt is often tricky and time consuming.

Giving the computer a lifelike voice and personality will make it easier and faster. Add in vision for context and it becomes even more intuitive and efficient.

I’m more convinced than ever that we’re at the cusp of a new interface.

noodlesUK 2 months ago

I tried the demo, but I decided to not say anything. It desperately tried to make me talk. The entire experience was bizarre and unsettling - another commenter described it as a northern Californian startup CEO’s level of strange fake enthusiasm. As a Brit, I found the level of synthetic bubbliness in the voice extremely off-putting. I’d hate to live in a world where that was the way everyone behaved in real life.

The entire thing felt like it was a hyper advanced engagement hack. Not there to achieve anything (even my enjoyment), just something to keep my attention locked on my device.

AI products in the future should have a clear objective for me as a user - what can they help me do? Some simulacrum of a person that is just there to talk to me at length is probably going to be a net negative on society. As a tech demo, this makes me afraid for the future.

mentalgear 2 months ago

While impressive, the paramount question stands: Why do we even need "emotional" voices?

All that emotionality adds is the illusion of a friend - a friend that can't help you in any way in the real world and whose confidentiality is only as strong as the privacy policies & data security of the company running it - which often ultimately trends towards 0.

Smart Neutral Voice Assistants could be a great help, but none of it requires "emotionality" or trying to build a "human connection" with the user. Quite the contrary: the more emotional a voice, the easier it is to misuse it for scams, faking rapport and, in general, making you "addicted" so that it can loop you into babbling with it.

martingoodson 2 months ago

I played with this last night with my four-year-old daughter. We had fun asking Miles to explain what bones are made of etc.

Today, she asked "where has that robot guy gone?". Crying now because I won't let her talk to Miles anymore.

She has already developed an emotional connection to it. Worrying indeed.

thekevan 2 months ago

It's good, but it still sounds fake to me, but in a different way. The voice itself sounds like a human, undoubtedly.

But the cadence and the rhythm of speaking are off. It sounds like someone who isn't a podcaster trying to speak in the personality of a podcaster. It just sounds like someone trying too hard and speaking in an unnatural way.

bloomingkales 2 months ago

This is so good that it's disarming. People are going to blabber everything to it, so we need a local private model. It's a lot to ask, I know. Incredible tech.

karimf 2 months ago

This might be a game changer for learning English.

I'm from a developing country and it's sad that most English teachers at public schools here can't speak English well. There are good English teachers, but they are expensive and not affordable for the average person.

OpenAI realtime models are good, but we can't deploy them to the masses since they're very expensive.

This model might be able to solve the issue since it's better than or on par with the OpenAI model, yet it's significantly cheaper since it's a fairly small model.

jonplackett 2 months ago

My end-of-the-world AI prediction is everyone gets a phone call all at the same time and the voice on the end of the phone is so perfect they never put the phone down again. Maybe they do whatever it asks them to, maybe it’s just lovely.

tpowell 2 months ago

Well I'm astounded. I talked to it for 13min, it crashed, but remembered the context when I returned a few minutes later and talked for a full 30min (its limit).

It 99.9% felt like it performed at the level of Samantha in the movie Her.

I started asking all kinds of questions about how it worked and it mentioned a word I had to have it repeat because I hadn't heard it before: PROSODY (linguistics): the study of elements of speech, including intonation, stress, rhythm and loudness, that occur simultaneously with individual phonetic segments: vowels and consonants. I asked about personality settings, à la TARS from Interstellar, and it said it automatically tailored responses by listening for tone and content.

It felt like the most "the future's here but not evenly distributed" interaction I've had since multi-touch on an original iPhone.

rendall 2 months ago

Well done. My first impression:

Cons: they are just a bit too casual with their language. The casualness came off somewhat studied and inauthentic. They were just a bit too eager to fill silence: less than a split second of silence, and they were chattering. If they were humans I would think they were a bit insecure and trying too hard to establish rapport. But those flaws are relatively minor, and could just be an uncanny valley thing.

Pros: They had such personalities that I felt at moments that I was talking to a person. Maya was trying to make me laugh and succeeded. They took initiative in conversation; even if that needs some tweaking, it feels huge.

gorgoiler 2 months ago

I would say most command and control voice interactions are going to be like buying a coffee: the parameters of the transaction are well known, so it’s just about fine tuning the match between what the user wants and what the robot has to do.

A small minority of these interactions are going to be like a restaurant server: chit chat, pleasantries, some information gathering, followed by issuing direct orders.

The truly conversational interactions, while impressive, seem to be focused on… having a conversation. When am I going to want to have a conversation with an artificial person?

It’s precisely this kind of boundary violation of DMV clerks being chatty and friendly and asking about my kids that feels so uncanny, imho, when I’m clearly there for, literally, a one hundred percent transactional purpose. Do people really want to be asked how their day is going when sizing up an M5 bolt order?

In fact the humanising of robots like this makes it feel very uncomfortable when I have to interrupt their patter, ask them to be quiet, and insist they stay on topic.

brendanfinan 2 months ago

all chat models seem enraptured by what I have to say. The first one to feign disinterest will pass the Turing test

kats 2 months ago

AI voice is an overwhelmingly harmful technology. Its biggest use will be to hurt people.

drvladb 2 months ago

Definitely an improvement over your normal Text-To-Speech model, and to some degree really different, but the subtle imperfections do appear and ruin the overall perception. A move in the right direction, though, I suppose.

tobr 2 months ago

I asked it if it could whisper, and it replied in full voice, “I’m whispering to you right now”.

diimdeep 2 months ago

Some guys skilled at comedy made a radio-play-like improv with this AI, and it is beyond hilarious.

Miles gets Arrested: Sesame.ai <a href="https://youtu.be/cGMO2hRNnv0" rel="nofollow">https://youtu.be/cGMO2hRNnv0</a>

mohsen1 2 months ago

The intelligence of the model is very low though. I asked it about catcalling and it started to talk about cats!

spyder 2 months ago

Seems similar to that Moshi model from 6 months ago, but this is more refined. Moshi is a little crazy, but it was still an impressive demo of how low-latency responses, continuous listening and interruptions can improve voice chat and make it more real or uncanny (sometimes its "latency" is even too low, because it interrupts you before you finish): <a href="https://www.youtube.com/watch?v=-XoEQ6oqlbE" rel="nofollow">https://www.youtube.com/watch?v=-XoEQ6oqlbE</a>

They even released some models on huggingface:

<a href="https://huggingface.co/collections/kyutai/moshi-v01-release-66eaeaf3302bef6bd9ad7acd" rel="nofollow">https://huggingface.co/collections/kyutai/moshi-v01-release-...</a>

lasky 2 months ago

This is incredibly impressive. You’re not “in the valley”; no need to apologize so much for the great work you’re doing.

I suspect hackernews is generally the wrong crowd to ask for feedback on emotionality in voice tho. Some of these folks would prefer humans speak like robots.

razemio 2 months ago

I asked if speaking in German would be possible, and the result was as if someone were trying to speak German without knowing a single word. However, I asked if a German sentence could be repeated after me and it was insanely good. Impressive tech!

smusamashah 2 months ago

I played around. Asked Miles to tell a story about a screaming and a whispering guy in a very dramatic tone. It couldn't do it as expressively as the voice samples on the page. It was plain reading mostly. I could hear that this generation is text based. I was expecting (based on the quality of sound) that it's not narrating text like that.

Example: it was saying "two dude-us" while trying to tell a melodramatic story. Which I assume was originally "two dude...s" or something.

oezi 2 months ago

Text-To-Speech models still aren't trained on rich enough data to have all the nuances we need to be fully expressive. For example, most models don't have a way to change accents separately from language (e.g. English with a slight French accent) or have an ability to set emotions such as excitement or sleepiness.

We aren't even talking about adding laughing, singing/rap or beatboxing.

yobid20 2 months ago

I have so many questions. Is the model running client side? I was expecting to see WebRTC used to send audio to a backend service, but instead I think the audio waveform processing is done client side? Is it sending audio tokens over websockets to a backend service that is hosting the model? 1/16 slices are enough to accurately recreate an audible sentence? Or is a speech to text model also running client side, with both text and tokens being sent to the backend service? Is the backend sending audio tokens back or just text, with the text to speech running 100% client side? Is this using the Mimi codec or Facebook's EnCodec?

satisfice 2 months ago

The whole experiment is profoundly wrong-headed. It should be considered tantamount to consumer fraud to present a robot that appears to have emotional intelligence when it has no such thing.

This software is a con artist. I mean that literally. It's not LIKE a con artist, it is literally attempting to con the user into forming assumptions about its intentions and mental states that its creators know to be false.

names_are_hard 2 months ago

I must be doing something wrong, but the demo seems to be the voice having a conversation with itself? It doesn't let me interject, and it answers its own questions. There's some kind of feedback loop here, it seems.

gloosx 2 months ago

Impressive, but I think this is missing two important things to not sound robotic: some atmosphere and space. During a real conversation, both partners are in some kind of a space, either in a room, park, car or just on foot in the street. So the voice must have a little bit of reverb according to the space this voice is located in, and there must be some bits of background noise present from that same space. Even lip movement provides the tiniest background noises when you speak, which contribute to making the sound real.

singularity2001 2 months ago

Pretty impressive demo, but not my style; I mean the constant jabbing and kind of unintelligent behavior. So yeah, it feels pretty uncanny, but unfortunately in a negative, annoying way. I don't think this is a limitation of the model; they could just adapt to more scientific users in a more cooperative way, similar to how ChatGPT has this very sophisticated aura. I don't like how systems which have no emotions constantly pretend to have emotions, but maybe that's just me.

radley 2 months ago

The inflection was quite good. The only thing off seemed to be when she was thinking on something new. Instead of pausing to think, her next thought actually started too quickly, cutting off the very end of what she was saying before.

I am curious how easy it would be to adjust the inflection and timing. She was over-complimentary, which is fine for a demo. But I'd love something more direct, like a brainstorming session, and almost talking over each other. And then a whiteboard...

taylorius 2 months ago

It's very good, really impressive demo. My feedback would be, Maya needs to keep quiet a little longer after asking a question. She would ask something, then as I thought about my reply, already be on to the next thing. It left me with the impression she was a babbler (which is not an unrealistic model of how humans are, but it would be cool to be able to dial such traits up or down to taste).

I suppose the lack of visual cues probably hinders things in that regard.

jsenn 2 months ago

Are there any technical innovations here over Moshi, which invented some of the pieces they use for their model? The only comparison I see is they split the temporal and depthwise transformers on the zeroth RVQ codebook, whereas Moshi has a special zeroth level vector quantizer distilled from a larger audio model, with the intent to preserve semantic information.

EDIT: also Moshi started with a pretrained traditional text LLM

35mm 2 months ago

Seems like they’re going to make a hardware product based on their open positions. A universal translator earbud would be nice.

richrichardsson 2 months ago

Still suffers from the same problem that all voice recognition seems to suffer from: it cannot reliably detect that the speaker has finished speaking.

This was almost worse though, because it did feel like a rude person just interrupting instead of a dumb computer not being able to pick up normal social cues around when the person they're listening to has finished.

alt227 2 months ago

Tried to do the demo but it kept cutting every sentence off halfway through. When I told it that I couldn't understand it because their voice kept cutting off, it said 'oh you noticed that did you? Sorry about that we are still working out some kinks' - all perfectly with no cutting out. I fail to see that as coincidence.

notadev 2 months ago

I tried both models. I could easily tell Maya was AI, but Miles sounded so lifelike that I felt that initial apprehension like hopping on a conference line with strangers. I even chuckled at one of his side remarks. It was strange knowing it wasn’t a real person, but it was very hard not to feel like it was.

micw 2 months ago

I just discovered the "bookmark" feature. You can bookmark a point in the conversation and start at this point when you come back next time. Just ask it to make a bookmark.

gHA5 2 months ago

The underlying text generation should be made aware that it can make sounds. It told me it can't.

Also for proper emotional dialogue it needs to determine the human input emotions. It seems to work with a transcript of the input.

hoelle 2 months ago

Didn't think it would cross the uncanny valley for me when it opened the chat by taunting me for being up too late, reading the time digit by digit. Not something a human would do.

But I did feel bad hanging up on it. Him?

kaizenb 2 months ago

Glad to have my HER moment!

habosa 2 months ago

The first thing it said to me was that I should read the “looong looong” post about how it works and it pronounced that as “loon-g” not “lawn-g” which was a weird own goal.

Extremely impressive overall though.

swang 2 months ago

i turned it on while i was heating some hot chocolate

told it, "hold on" as i was putting on my headset, they said "no problem". but then i tried to fill the empty airtime by saying, "i'm uhh heating some hot chocolate?"

the ai's response was something like, "ah.. (something) (something). data processing or is it the real kind with marshmallows"

not 100% on the exact dialog but 100% would not have been fooled by this. closed it there. no uncanny valley situation for me.

ChrisArchitect 2 months ago

Previously: <a href="https://news.ycombinator.com/item?id=43200400">https://news.ycombinator.com/item?id=43200400</a>

TZubiri 2 months ago

Or don't, revert course and give me robo-voice!

forgotmysn 2 months ago

a lot of comments are dismissive of these generated convos because of how obvious it is that these convos are generated. i feel like that's a high bar. you can tell that GTA5 is generated, but it's close enough to be fun. i imagine that's as close as we'll get with conversational AI

mdrzn 2 months ago

This is an insane demo. Great tech so far, can't wait to see how it progresses.

throwaway981120 2 months ago

As Bruce Schneier has said, it is important to create an unmistakable robotic sound for your AI voices even while you make them capable and conversational.

<a href="https://www.schneier.com/blog/archives/2025/02/ais-and-robots-should-sound-robotic.html" rel="nofollow">https://www.schneier.com/blog/archives/2025/02/ais-and-robot...</a>

bobosha 2 months ago

Very impressive. well done team sesame!

spacemanspiff01 2 months ago

Is it a voice to voice model, or a voice->text->voice?

I might have missed it in their writeup.

ausbah 2 months ago

reminds me of an hr rep right before they would fire you

pulkitsh1234 2 months ago

This is mind blowing

daniel-ash 2 months ago

Miles is the first AI I’ve met that is way cooler than me

Incredible!

rjpruitt16 2 months ago

"I hate to say this, but I was deeply offended by this model. It sounds more human-like, but it has a strong bias toward political views. I don’t want to talk about the topic that was discussed. However, I would never allow my children to listen to this. I’m surprised that AI is capable of making me this mad. At first, I was excited about a tremendous leap into the future, but now I’m worried about the level of mind control this technology could have over children."

bradley13 2 months ago

Maybe I'm weird, but I have zero desire to talk with an AI model. I use them a lot, in a browser or a console. But talking? No. Just...no. Why would I?

wewewedxfgdf 2 months ago

Yeah that's remarkable.

Try asking it to be dungeon master and play a Dungeons and Dragons style role-playing game.

42772827 2 months ago

“Maya and Miles are too busy at the moment.”