Emotionally Expressive Text to Speech

261 points by interweb about 5 years ago

28 comments

crazygringo about 5 years ago

This is fascinating.

But I'm very curious what the emotional "parameters" are. There are literally at least a thousand different ways of saying "I love you" (serious-romantic, throwaway to a buddy, reassuring a scared child, sarcastic, choking up, full of gratitude, irritated, self-questioning, dismissive, etc. ad infinitum). Anyone who's worked as an actor and done script analysis knows there are hundreds of variables that go into a line reading. Just three words, by themselves, can communicate roughly an entire paragraph's worth of meaning solely by the exact way they're said -- which is one of the things that makes acting, and directing actors, such a rewarding challenge.

Obviously it's far too complex to infer from text alone. So I'm curious how the team has simplified it. What are the emotional dimensions that you can specify? And how did they choose those dimensions over others? Are they geared towards the kind of "everyday" expression in a normal conversation between friends, or towards the more "dramatic" or "high comedy" of intense situations that much of film and TV lean towards?

sonantic about 5 years ago

Hey HN - Zeena Qureshi (Co-Founder and CEO at Sonantic) here.

Thanks for your thoughts and feedback thus far! I'd be happy to answer questions (within reason) about our latest cry demo / emotional TTS! Feel free to fire away on this thread.

spaceprison about 5 years ago

My daughter is dyslexic and would love to play things like Stardew Valley, Pokemon or even Animal Crossing, but being text-only makes them such a slog for her.

The same goes for subtitles; she'd be perfectly fine with a robot voice for the actors if it sounded real enough, like this.

Game changer.

ArneVogel about 5 years ago

This site has to have one of the worst cookie choice decision popups: https://imgur.com/a/YLsGadP

vessenes about 5 years ago

Hi Zeena, I love this! I just filled out your form.

I was just mucking around with Nvidia's latest, called Flowtron, and I know from that experience there's a significant amount of work between getting a tech demo out and launching a usable product, whether API-based or with some visual workflow like your video shows.

One thing I think worth considering on the commercialization front is whether the core offering is the workflow niceties around your engine, the engine-as-API, or both. I'm just a random person on the internet, so take these thoughts with a large grain of salt, but thinking about it, it seems to me that prioritizing integration with, say, Unity, Unreal Engine, video compositing tools, or blog posting tools are all interesting and viable market paths. The underlying networks are going to keep improving for some time, so you're really trying to buy some long-term customers.

Some stuff that's obvious, but I can't resist:

I could off the top of my head imagine using this for massively reducing the cost to develop games, for script writers pulling comps together, for myself to create audio versions of my own writing, for better IoT applications inside the home... I'd really love to be able to play with this.

There still isn't a truly non-annoying virtual assistant voice; when the first Tacotron paper came out, I was hopeful I would see more prosody embedded in assistants by now, but the longer we live with Siri and Google, the more sensitive I think we are to their shortcomings. I have a preference for passive / ambient communication and updates, so I would place a really high value on something that could politely interrupt or say hello with information.

At any rate, congratulations, this is cool. :)

diminish about 5 years ago

Impressive next step for text-to-speech. Wish there were some simple real demos. I also work on the same thing using DL, and hope to open source the "emotional part" of it.

We can soon create emotionally expressive YouTube videos with synthetic actors.

yc-kraln about 5 years ago

I have a comment and a question:

The comment: I noticed that your demo video also had "emotional" video layered on top of the dialogue. This could be considered manipulative; perhaps consider sharing a naked version so we could attempt to interpret the emotion based solely on the text to speech engine.

The question: You mention you met at EF. I was wondering if, beyond bringing you together, you found EF to be worth the cost of admission?

microtherion about 5 years ago

The prosody sounds nice. But two of the longer samples have a lot of vocal fry, and the third sounds like the voice has a stuffy nose and/or a slight lisp. I wonder whether those mannerisms were chosen to camouflage artifacts inherent in their current implementation.

schoolornot about 5 years ago

Between this and Lyrebird there seem to be a high number of cutting-edge TTS solutions being worked on in the private sector. Does anyone know why there hasn't been much advancement with the FOSS libraries?

jariel about 5 years ago

I'd recommend editing the video down to 43-60 seconds.

It would be nice to try it with actual text inputs right on the page; that this doesn't exist is a tiny flag.

It's a great choice to work with voice actors: because there isn't any 'pure' TTS that's good enough in the most general sense, having the actual voice actor as a working basis will help.

Perhaps small game houses can just use something off the shelf, while big houses can use a customized voice and then not worry if they have to make tweaks or changes, since they don't have to do a whole production.

microcolonel about 5 years ago

Very cool demo, but the quality of the vocoding is not state of the art, and it's audibly artificial, which is probably why you covered it up with the obnoxiously loud music.

Next time be honest about what you have when presenting it; every human with functioning ears is attuned to the sound of speech. This sort of technology would be amazing for narrative video games even with the less than perfect vocoding.

amelius about 5 years ago

Sounds nice but difficult to judge with the background music.

voiper1 about 5 years ago

Is there any pay-to-use or open source voice for Hebrew?

Amazon Polly's English voice, Matthew, is pretty nice. But they don't have Hebrew. Google doesn't have Hebrew either. Bing has some attribution requirement that I haven't fully investigated.

DenisM about 5 years ago

This is very impressive.

I wonder if attaching this to a modern-day ELIZA would improve the Turing test scores? Emotional load can reduce the requirement for semantic coherence.

tomByrer about 5 years ago

@sonantic Seems you don't do real-time yet?

If so, have plans for a Web Speech API plugin? I'm about to release a reader demo based around it. https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API

aasasd about 5 years ago

As a non-native-speaker, I understood exactly four words from the monologue in the vid. Which might be on par for some movies, often having actors whisper and breathy-voice through the whole thing (ahem House of Cards cough). However, for actual TTS like webpages and audiobooks, the ‘Dina’ voice works much better.

wishinghand about 5 years ago

Hey Zeena, will there be options to make the voices more unreal? The use case I imagine is for a character with a damaged vocoder or a broken speaker. Other glitchy affectations could be useful too.

diskmuncher about 5 years ago

History has shown us that technological advancement of this kind will be adopted first by ...

Obvious application: H-anime. Reduced parameters for the "emotion" as well.

hyperpallium about 5 years ago

The video: https://youtube.com/watch?v=zwYiDraKtSA

sarabande about 5 years ago

If this could generate well-done audiobooks instantly from a text, that would be fantastic. All e-books could have an audiobook version overnight.

Animats about 5 years ago

Can't hear the voices over the music.

moron4hire about 5 years ago

Any plans to support languages other than English? This would be huge in the foreign language instruction field.

blattimwind about 5 years ago

I could see this being used for RPG games to fix the choice deficiency that has been caused by going for fully voiced dialogue. Also, making Hitler read copypastas even more convincingly.

dequalant about 5 years ago

This is amazing! I've been waiting a long time for something like this to come along. Finally someone did it!

terrycody about 5 years ago

Filled out the form. Really cool.

I want to know the price and when we can use it in production.

cemregr about 5 years ago

Is there an actual demo?

dejongh about 5 years ago

Both cool and creepy!

maxdo about 5 years ago

Wow sounds very real