The music examples are utterly fascinating. It sounds insanely natural.<p>The only thing I can hear that sounds unnatural is the way the reverberation in the room (the "echo") immediately gets quieter when the raw piano sound itself gets quieter. In a real room, if you produce a loud sound and immediately after a soft sound, the reverberation of the loud sound remains. But since this network only models "the sound right now", the volume of the reverberation follows the volume of the piano sound.<p>To my ears, this is most noticeable in the last example, which starts out loud and gradually becomes softer. It sounds a bit like they are cross-fading between multiple recordings.<p>Regardless, the piano sounds completely natural to me; I don't hear any artifacts or sounds that a real piano wouldn't make. Amazing!<p>There are also fragments that sound inspiring and very musical to my ears, such as the melody and chord progression after 00:08 in the first example.
This can be used to implement seamless voice performance transfer from one speaker to another:<p>1. Train a WaveNet on the source speaker.<p>2. Train a second WaveNet on the target speaker. Or, for something totally new, train a WaveNet on a bunch of different speakers until you get one you like. This becomes the <i>target WaveNet</i>.<p>3. Record raw audio from the source speaker.<p>Fun fact: many algorithmic processes that "render" something from a set of inputs can, in principle, be "run in reverse" to recover inputs that would produce the rendered output. In this case, we now have raw audio from the source speaker that could, in principle, have been rendered by the source speaker's WaveNet, and we want to recover the inputs that would have rendered it, had we done so.<p>To do that, you typically convert the numbers in the forward renderer into dual numbers and use automatic differentiation, or you treat the renderer as a black box and optimize the inputs directly (in this case, phonemes and the like).<p>4. Recover the inputs. (This is computationally expensive but not difficult in practice, especially if WaveNet's generation algorithm is implemented in C++ and you have a good black-box optimizer to apply to the inputs, of which there are many freely available options. A toy sketch follows below.)<p>5. Take the recovered WaveNet inputs, feed them into the target speaker's WaveNet, and record the resulting audio.<p>Result: the resulting raw audio will have the same overall performance and speech content as the source speaker, but rendered completely naturally in the target speaker's voice.
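To make step 4 concrete, here is a minimal, runnable sketch under toy assumptions: the "renderer" below is just a sum-of-sinusoids stand-in for a hypothetical deterministic WaveNet generation pass, not the real model. The only point is to show input recovery with a derivative-free ("black-box") optimizer.<p><pre><code>
# Toy sketch of step 4. render() is a made-up stand-in for a deterministic
# WaveNet generation pass; the inputs here are just [amplitude, frequency].
import numpy as np
from scipy.optimize import minimize

SR = 16000
T = np.arange(0, 0.02, 1.0 / SR)             # 20 ms of audio

def render(params):
    # params stand in for phoneme/conditioning inputs.
    amp, freq = params
    return amp * np.sin(2 * np.pi * freq * T)

source_audio = render([0.8, 220.0])          # "recorded" audio from the source speaker

def loss(params):
    return np.mean((render(params) - source_audio) ** 2)

# Nelder-Mead is derivative-free, i.e. a black-box optimizer: no autodiff needed.
recovered = minimize(loss, x0=[0.5, 200.0], method="Nelder-Mead").x
print(recovered)                             # roughly [0.8, 220.0]
# Step 5 would then feed the recovered inputs into the target speaker's model.
</code></pre>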
Wonder if there are any implications here for breaking (MitM) the ZRTP protocol.<p><a href="https://en.wikipedia.org/wiki/ZRTP" rel="nofollow">https://en.wikipedia.org/wiki/ZRTP</a><p>At some point during authentication, both parties verify a short message by reading it to each other.<p>However, the NSA already tried to MitM that about 10 years ago using voice synthesis. It was deemed inadequate at the time. I wonder if TTS improvements like these change that game and make it a more plausible scenario.
The samples sound amazing. These causal convolutions look like a great idea; I'll have to re-read the paper a few times. All the previous generative audio from raw samples I've heard (using LSTMs) has been super noisy. These are crystal clear.<p>Dilated convolutions are already implemented in TF; I look forward to someone implementing this paper and publishing the code.
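For anyone wondering what "dilated causal" means mechanically, here is my reading of it as a toy NumPy sketch (not the authors' code): each output sample depends only on the current and past inputs, and the dilation spaces the filter taps apart so stacking layers grows the receptive field exponentially.<p><pre><code>
# Toy dilated causal convolution. Causality: output[t] only sees input up to t.
# Dilation: filter taps are spaced `dilation` samples apart.
import numpy as np

def dilated_causal_conv1d(x, weights, dilation):
    filter_width = len(weights)
    pad = (filter_width - 1) * dilation
    # Left-pad with zeros so no output sample ever sees the future.
    x_padded = np.concatenate([np.zeros(pad), x])
    out = np.zeros_like(x)
    for t in range(len(x)):
        # weights[0] multiplies the oldest tap, weights[-1] the current sample.
        taps = x_padded[t : t + pad + 1 : dilation]
        out[t] = np.dot(taps, weights)
    return out

# Stacking layers with dilations 1, 2, 4, 8 grows the receptive field exponentially.
h = np.random.randn(64)
for d in (1, 2, 4, 8):
    h = dilated_causal_conv1d(h, np.array([0.5, 0.5]), dilation=d)
</code></pre>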
What's really intriguing is the part of the article where they explain the "babbling" of WaveNet when they train the network without the text input.<p>That sounds just like a small kid imitating a foreign (or their own) language. My kids are growing up bilingual, and I heard them attempt something similar when they were really small. I guess it's like listening in on their neural network modelling the sounds of the new language.
So when I get the AI from one place, train it with the voices of hundreds of people from dozens of other sources, and then have it read a book from Project Gutenberg to an mp3... who owns the mechanical rights to that recording?
Any suggestions on where to start learning how to implement this? I understand some of the high level concepts (and took an intro AI class years ago - probably not terribly useful), but some of them are very much over my head (e.g. 2.2 Softmax Distributions and 2.3 Gated Activation Units) and some parts of the paper feel somewhat hand-wavy (2.6 Context Stacks). Any pointers would be useful as I attempt to understand it. (EDIT: section numbers refer to their paper)
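For what it's worth, the two sections you mention reduce to a few lines once written out. Here is my NumPy reading of the equations in sections 2.2 and 2.3 of the paper (not any official code): µ-law companding plus quantization so a 256-way softmax can model each sample, and the gated activation z = tanh(W_f ∗ x) ⊙ σ(W_g ∗ x).<p><pre><code>
# My reading of sections 2.2 and 2.3 of the paper, as plain NumPy.
import numpy as np

def mu_law_encode(x, mu=255):
    # Section 2.2: compand the waveform in [-1, 1] non-linearly, then quantize
    # to mu + 1 = 256 levels, so a 256-way softmax can model each sample.
    companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.floor((companded + 1) / 2 * mu + 0.5).astype(np.int32)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_activation(filter_out, gate_out):
    # Section 2.3: z = tanh(W_f * x) elementwise-times sigmoid(W_g * x), where
    # filter_out and gate_out are two convolution outputs of the same layer.
    return np.tanh(filter_out) * sigmoid(gate_out)
</code></pre>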
Is it possible to use the "deep dream" methods with a network trained for audio such as this? I wonder what that would sound like, e.g., beginning with a speech signal and enhancing with a network trained for music or vice versa.
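One way to try it would be to take gradient-ascent steps on the waveform itself so that some internal activation of a network trained on music gets larger, which is essentially what deep dream does for images. A toy sketch of that loop, with a single made-up filter standing in for a trained network and the gradient worked out by hand rather than with autodiff:<p><pre><code>
# Toy "deep dream for audio" loop. The one-layer "network" is a random filter,
# not a trained model; a real experiment would maximize activations of a
# network trained on music, starting from a speech waveform.
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(16)              # pretend this filter was learned on music
x = rng.standard_normal(4000) * 0.1      # the starting "speech" signal (placeholder)

def activation(x):
    # Mean filter response across time.
    return np.mean(np.convolve(x, w, mode="valid"))

def activation_grad(x):
    # Gradient of the mean valid convolution with respect to each input sample.
    n_out = len(x) - len(w) + 1
    grad = np.zeros_like(x)
    for t in range(n_out):
        grad[t : t + len(w)] += w[::-1] / n_out
    return grad

print(activation(x))
for _ in range(100):                      # gradient ascent on the waveform itself
    x += 0.1 * activation_grad(x)
print(activation(x))                      # the activation has grown
</code></pre>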
How much data does a model take up? I wonder if this would work for compression? Train a model on a corpus of audio, then store the audio as text that turns back into a close approximation of that audio. (Optionally store deltas for egregious differences.)
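Roughly what that scheme might look like, with a sinusoid generator standing in for a hypothetical trained model; the names and the threshold below are made up for illustration:<p><pre><code>
# Back-of-the-envelope sketch: store the conditioning (here a tiny parameter
# vector) plus a sparse residual wherever the reconstruction is noticeably off.
import numpy as np

SR = 16000

def model(params):
    amp, freq, seconds = params
    t = np.arange(0, seconds, 1.0 / SR)
    return amp * np.sin(2 * np.pi * freq * t)

original = model([0.8, 220.0, 0.5]) + 0.01 * np.random.randn(SR // 2)

def encode(audio, params, threshold=0.02):
    residual = audio - model(params)
    keep = np.abs(residual) > threshold      # only the "egregious" differences
    return params, np.flatnonzero(keep), residual[keep]

def decode(params, idx, values):
    audio = model(params)
    audio[idx] += values
    return audio

params, idx, values = encode(original, [0.8, 220.0, 0.5])
reconstruction = decode(params, idx, values)
</code></pre>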
Wow! I'd been playing around with machine learning and audio, and this blows even my hilariously far-future fantasies of speech generation out of the water. I guess when you're DeepMind, you have both the brainpower and the resources to tackle sound right at the waveform level and rely on your increasingly magical-seeming NNs to rebuild everything else you need. Really amazing stuff.
I'm guessing DeepMind has already done this (or is already doing so), but conditioning on video is the obvious next step. It would be incredibly interesting to see how accurate it can get at generating the audio for a movie. Though I imagine that for really great results they'll need to mix in an adversarial network.
Wow. I badly want to try this out with music, but I've taken little more than baby steps with neural networks in the past: am I stuck waiting for someone else to reimplement the stuff in the paper?<p>IIRC someone published an OSS implementation of the deep dreaming image synthesis paper fairly quickly...
And to think of all those Hollywood SF movies where the robot could reason and act quite well but spoke in a tinny voice. How wrong they got it. We can simulate high-quality voices, but we can't have our reasoning, walking robots.
This is amazing. And it's not even a GAN. Presumably a GAN version of this would be even more natural — or maybe they tried that and it didn't work so they didn't put it in the paper?<p>Definitely the death knell for biometric word lists.
I hope this shows up as a TTS option for VoiceDream (<a href="http://www.voicedream.com/" rel="nofollow">http://www.voicedream.com/</a>) soon! With the best voices they have to offer (currently, the ones from Ivona), I can suffer through a book if the subject is really interesting, but the way the samples sounded here, the WaveNet TTS could be quite pleasant to listen to.
I wonder how a hybrid model would sound, where the net generates parameters for a parametric synthesis algorithm (or a common speech codec) instead of samples, to reduce CPU costs.
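Something like this, perhaps, where a hypothetical network would emit a small parameter frame every 10 ms and a cheap deterministic synthesizer fills in the samples; all names and numbers below are assumptions, not from the paper:<p><pre><code>
# Rough sketch of the hybrid idea: instead of predicting one of 256 sample
# values 16000 times per second, a (hypothetical) network emits pitch and gain
# once per 10 ms frame, and a trivial oscillator renders the samples.
import numpy as np

SR = 16000
HOP = 160                                   # 10 ms frames

def render_frame(freq, gain, phase):
    t = np.arange(HOP) / SR
    samples = gain * np.sin(2 * np.pi * freq * t + phase)
    return samples, phase + 2 * np.pi * freq * HOP / SR

def render(frames):
    out, phase = [], 0.0
    for freq, gain in frames:                # these would come from the network
        samples, phase = render_frame(freq, gain, phase)
        out.append(samples)
    return np.concatenate(out)

# 100 parameter frames describe 1 second of audio instead of 16000 sample predictions.
audio = render([(220.0 + i, 0.5) for i in range(100)])
</code></pre>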
I suppose it's impressive in a way, but when I looked into "smoothing out" text-to-speech audio a few years ago, it seemed fairly straightforward. I was left wondering why it hadn't been done already, but alas, most engineers at these companies are either politicking know-nothing idiots or are constantly being roadblocked, preventing them from making any real advancements.