I've actually been tossing around the idea of creating a program like this, although for a specific use case.<p>In Bethesda games (Oblivion, Skyrim, Fallout) there are large modding communities adding new quests, areas and plot lines. But one technical and financial challenge for them has always been voice acting. Not only do they have to worry about voice acting for potential new characters, but they have no means of adding new dialogue for existing characters.<p>In Fallout 4, for example, the protagonist is fully voice acted. That means a distinct difference between the way the main game feels and any modding efforts made by the community (barring actually re-hiring the original voice actor for new lines).<p>I'm envisioning having this tool train on the voice lines already provided in the game (depending on the character in question, that's quite a bit), and then letting mod authors input new dialogue lines to be spit out in something close to the actor's voice.<p>Lots of problems with the approach of course, not to mention the fact that these are actors and not just voices (there would probably be a significant amount of emotion lost). But it would give the modding community such a powerful tool to add new plots for existing characters.
I wonder how much better the rendering would be if the audio track were much longer and the software had more to learn from. I don't mean more <i>words</i> for a 1-to-1 match, since it's clearly beyond that (it pronounces words that it didn't see), but voice features that weren't in that short track.<p>Hypothetical question: say it had access to all the episodes of Key & Peele. Would the rendering be better, to the point where you could basically generate an audio track from a script, with intonation and all?<p>It would be interesting if they offered "voice packages", either online or offline, so you could just pass text through it and the output would be a Morgan Freeman narration. You'd have a shop for "Cords", the same as iTunes for songs & apps. Maybe game developers will find that interesting, too: having access to way more voices than they'd have in real life, and on a budget.<p>Someone could also save their voice for posterity. Many people listen to recordings of loved ones who passed away to remember them. Saving the voice for new content would be something to think about.
1) It was mentioned near the end of the video that it actually required around 20 minutes of audio to start synthesis. Not quite as magic as it first seemed. Still cool.<p>2) The intonation always matched the initial sample. Give us some filters like "vocal fry", "perplexed", "angry", "wonder" etc. and then we'll really have something here.
Sounds a lot like this: <a href="https://youtu.be/xzL-pxcpo-E?t=933" rel="nofollow">https://youtu.be/xzL-pxcpo-E?t=933</a><p>IRCAM has been doing some really cool stuff in this area for a long time. Check out their pages on corpus-based synthesis for example: <a href="http://imtr.ircam.fr/imtr/Corpus_Based_Synthesis" rel="nofollow">http://imtr.ircam.fr/imtr/Corpus_Based_Synthesis</a>
This sounds waaay better than the Donald Trump text to speech system I've been working on: <a href="http://jungle.horse" rel="nofollow">http://jungle.horse</a><p>I wish I could chat with their engineering team. I'd love to learn the mathematics and tech. (A lot of it might be patented?)<p>Is there an equivalent of SIGGRAPH for audio?
If Photoshop gave these results, a lot of industries would go belly up.<p>Nice marketing line, but it's speech recognition that sets the begin/end frames in the sample.<p>I was expecting either "painting" away defects or actually reconstructing a real TTS from a small sample.
Copied from my comment on an earlier submission on this:<p>I don't see how the watermarking they talk about is going to succeed in preventing forgeries.<p>If they're planning to watermark unedited recordings, you have a huge false positive problem because there are billions of hours of legitimate but unwatermarked audio recordings, and there will probably continue to be. You can also get false negatives by tampering with a watermark-capable device to get it to watermark something that wasn't recorded from analog. Or you can rerecord edited audio from an analog source and simply claim that your "genuine" recording is slightly noisy.<p>If they're planning to watermark edited recordings, someone else can implement the same kind of technology but without the watermarking.
"Photoshop for audio," seems so obvious, I'm surprised we haven't seen this before. (After all, the underlying technology has been around for a while now.)
If Adobe has this working in a demo, rest assured some "security service" developed such a thing 10 years ago. Then you can go back and ask yourselves why Osama has been reported dead since as early as 2001, why the CIA released videos in which he always looked different, and why his body was quickly disposed of at an unknown location.<p>Go back to sleep, now. Everything's alright. Great new tech. Will help catch terrorists hiding beneath your bed.