The music examples are utterly fascinating. It sounds insanely natural.<p>The only thing I can hear that sounds unnatural is the way the reverberation in the room (the "echo") immediately gets quieter when the raw piano sound itself gets quieter. In a real room, if you produce a loud sound and immediately after a soft sound, the reverberation of the loud sound remains. But since this network only models "the sound right now", the volume of the reverberation follows the volume of the piano sound.<p>To my ears, this is most noticeable in the last example, which starts out loud and gradually becomes softer. It sounds a bit like they are cross-fading between multiple recordings.<p>Regardless, the piano sounds completely natural to me; I don't hear any artifacts or sounds that a real piano wouldn't make. Amazing!<p>There are also fragments that sound inspiring and very musical to my ears, such as the melody and chord progression after 00:08 in the first example.
This can be used to implement seamless voice performance transfer from one speaker to another:<p>1. Train a WaveNet on the source speaker.<p>2. Train a second WaveNet on the target speaker. Or, for something totally new, train a WaveNet on a bunch of different speakers until you get one you like. This becomes the <i>target WaveNet</i>.<p>3. Record raw audio from the source speaker.<p>Fun fact: many algorithmic processes that "render" something from a set of inputs can, in principle, be "run in reverse" to recover inputs that would produce the rendered output. In this case, we now have raw audio from the source speaker that could, in principle, have been rendered by the source speaker's WaveNet, and we want to recover the inputs that would have rendered it, had we done so.<p>To do that, you typically convert the numbers in the forward renderer into dual numbers and use automatic differentiation, or you treat the renderer as a black box and optimize the inputs directly (in this case, phonemes and the like).<p>4. Recover the inputs. (This is computationally expensive but not difficult in practice, especially if WaveNet's generation algorithm is implemented in C++ and you have a good black-box optimizer to apply to the inputs, of which there are many freely available options. A toy sketch follows below.)<p>5. Take the recovered WaveNet inputs, feed them into the target speaker's WaveNet, and record the resulting audio.<p>Result: the resulting raw audio will have the same overall performance and speech content as the source speaker, but rendered completely naturally in the target speaker's voice.
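To make step 4 concrete, here is a minimal, runnable sketch under toy assumptions: the "renderer" below is just a sum-of-sinusoids stand-in for a hypothetical deterministic WaveNet generation pass, not the real model. The only point is to show input recovery with a derivative-free ("black-box") optimizer.<p><pre><code>
# Toy sketch of step 4. render() is a made-up stand-in for a deterministic
# WaveNet generation pass; the inputs here are just [amplitude, frequency].
import numpy as np
from scipy.optimize import minimize

SR = 16000
T = np.arange(0, 0.02, 1.0 / SR)             # 20 ms of audio

def render(params):
    # params stand in for phoneme/conditioning inputs.
    amp, freq = params
    return amp * np.sin(2 * np.pi * freq * T)

source_audio = render([0.8, 220.0])          # "recorded" audio from the source speaker

def loss(params):
    return np.mean((render(params) - source_audio) ** 2)

# Nelder-Mead is derivative-free, i.e. a black-box optimizer: no autodiff needed.
recovered = minimize(loss, x0=[0.5, 200.0], method="Nelder-Mead").x
print(recovered)                             # roughly [0.8, 220.0]
# Step 5 would then feed the recovered inputs into the target speaker's model.
</code></pre>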
Wonder if there are any implications here for breaking (MitM) the ZRTP protocol.<p><a href="https://en.wikipedia.org/wiki/ZRTP" rel="nofollow">https://en.wikipedia.org/wiki/ZRTP</a><p>At some point during authentication, both parties verify a short message by reading it to each other.<p>However, the NSA already tried to MitM that about 10 years ago using voice synthesis. It was deemed inadequate at the time. I wonder if TTS improvements like these change that game and make it a more plausible scenario.
The samples sound amazing. These causal convolutions look like a great idea; I'll have to re-read the paper a few times. All the previous generative audio from raw samples I've heard (using LSTMs) has been super noisy. These are crystal clear.<p>Dilated convolutions are already implemented in TF; I look forward to someone implementing this paper and publishing the code.
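For anyone wondering what "dilated causal" means mechanically, here is my reading of it as a toy NumPy sketch (not the authors' code): each output sample depends only on the current and past inputs, and the dilation spaces the filter taps apart so stacking layers grows the receptive field exponentially.<p><pre><code>
# Toy dilated causal convolution. Causality: output[t] only sees input up to t.
# Dilation: filter taps are spaced `dilation` samples apart.
import numpy as np

def dilated_causal_conv1d(x, weights, dilation):
    filter_width = len(weights)
    pad = (filter_width - 1) * dilation
    # Left-pad with zeros so no output sample ever sees the future.
    x_padded = np.concatenate([np.zeros(pad), x])
    out = np.zeros_like(x)
    for t in range(len(x)):
        # weights[0] multiplies the oldest tap, weights[-1] the current sample.
        taps = x_padded[t : t + pad + 1 : dilation]
        out[t] = np.dot(taps, weights)
    return out

# Stacking layers with dilations 1, 2, 4, 8 grows the receptive field exponentially.
h = np.random.randn(64)
for d in (1, 2, 4, 8):
    h = dilated_causal_conv1d(h, np.array([0.5, 0.5]), dilation=d)
</code></pre>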
What's really intriguing is the part of the article where they explain the "babbling" of WaveNet when they train the network without the text input.<p>That sounds just like a small kid imitating a foreign (or their own) language. My kids are growing up bilingual, and I heard them attempt something similar when they were really small. I guess it's like listening in on their neural network modelling the sounds of the new language.
So when I get the AI from one place, train it with the voices of hundreds of people from dozens of other sources, and then have it read a book from Project Gutenberg to an mp3... who owns the mechanical rights to that recording?
Any suggestions on where to start learning how to implement this? I understand some of the high level concepts (and took an intro AI class years ago - probably not terribly useful), but some of them are very much over my head (e.g. 2.2 Softmax Distributions and 2.3 Gated Activation Units) and some parts of the paper feel somewhat hand-wavy (2.6 Context Stacks). Any pointers would be useful as I attempt to understand it. (EDIT: section numbers refer to their paper)
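For what it's worth, the two sections you mention reduce to a few lines once written out. Here is my NumPy reading of the equations in sections 2.2 and 2.3 of the paper (not any official code): µ-law companding plus quantization so a 256-way softmax can model each sample, and the gated activation z = tanh(W_f ∗ x) ⊙ σ(W_g ∗ x).<p><pre><code>
# My reading of sections 2.2 and 2.3 of the paper, as plain NumPy.
import numpy as np

def mu_law_encode(x, mu=255):
    # Section 2.2: compand the waveform in [-1, 1] non-linearly, then quantize
    # to mu + 1 = 256 levels, so a 256-way softmax can model each sample.
    companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.floor((companded + 1) / 2 * mu + 0.5).astype(np.int32)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_activation(filter_out, gate_out):
    # Section 2.3: z = tanh(W_f * x) elementwise-times sigmoid(W_g * x), where
    # filter_out and gate_out are two convolution outputs of the same layer.
    return np.tanh(filter_out) * sigmoid(gate_out)
</code></pre>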
Is it possible to use the "deep dream" methods with a network trained for audio such as this? I wonder what that would sound like, e.g., beginning with a speech signal and enhancing with a network trained for music or vice versa.
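One way to try it would be to take gradient-ascent steps on the waveform itself so that some internal activation of a network trained on music gets larger, which is essentially what deep dream does for images. A toy sketch of that loop, with a single made-up filter standing in for a trained network and the gradient worked out by hand rather than with autodiff:<p><pre><code>
# Toy "deep dream for audio" loop. The one-layer "network" is a random filter,
# not a trained model; a real experiment would maximize activations of a
# network trained on music, starting from a speech waveform.
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(16)              # pretend this filter was learned on music
x = rng.standard_normal(4000) * 0.1      # the starting "speech" signal (placeholder)

def activation(x):
    # Mean filter response across time.
    return np.mean(np.convolve(x, w, mode="valid"))

def activation_grad(x):
    # Gradient of the mean valid convolution with respect to each input sample.
    n_out = len(x) - len(w) + 1
    grad = np.zeros_like(x)
    for t in range(n_out):
        grad[t : t + len(w)] += w[::-1] / n_out
    return grad

print(activation(x))
for _ in range(100):                      # gradient ascent on the waveform itself
    x += 0.1 * activation_grad(x)
print(activation(x))                      # the activation has grown
</code></pre>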
How much data does a model take up? I wonder if this would work for compression? Train a model on a corpus of audio, then store the audio as text that turns back into a close approximation of that audio. (Optionally store deltas for egregious differences.)
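Roughly what that scheme might look like, with a sinusoid generator standing in for a hypothetical trained model; the names and the threshold below are made up for illustration:<p><pre><code>
# Back-of-the-envelope sketch: store the conditioning (here a tiny parameter
# vector) plus a sparse residual wherever the reconstruction is noticeably off.
import numpy as np

SR = 16000

def model(params):
    amp, freq, seconds = params
    t = np.arange(0, seconds, 1.0 / SR)
    return amp * np.sin(2 * np.pi * freq * t)

original = model([0.8, 220.0, 0.5]) + 0.01 * np.random.randn(SR // 2)

def encode(audio, params, threshold=0.02):
    residual = audio - model(params)
    keep = np.abs(residual) > threshold      # only the "egregious" differences
    return params, np.flatnonzero(keep), residual[keep]

def decode(params, idx, values):
    audio = model(params)
    audio[idx] += values
    return audio

params, idx, values = encode(original, [0.8, 220.0, 0.5])
reconstruction = decode(params, idx, values)
</code></pre>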
Wow! I'd been playing around with machine learning and audio, and this blows even my hilariously far-future fantasies of speech generation out of the water. I guess when you're DeepMind, you have both the brainpower and the resources to tackle sound right at the waveform level and rely on your increasingly magical-seeming NNs to rebuild everything else you need. Really amazing stuff.
I'm guessing DeepMind has already done this (or is already doing so), but conditioning on video is the obvious next step. It would be incredibly interesting to see how accurate it can get at generating the audio for a movie. Though I imagine that for really great results they'll need to mix in an adversarial network.
Wow. I badly want to try this out with music, but I've taken little more than baby steps with neural networks in the past: am I stuck waiting for someone else to reimplement the stuff in the paper?<p>IIRC someone published an OSS implementation of the deep dreaming image synthesis paper fairly quickly...
And to think of all those Hollywood SF movies where the robot could reason and act quite well but spoke in a tinny voice. How wrong they got it. We can simulate high-quality voices, but we can't have our reasoning, walking robots.
This is amazing. And it's not even a GAN. Presumably a GAN version of this would be even more natural — or maybe they tried that and it didn't work so they didn't put it in the paper?<p>Definitely the death knell for biometric word lists.
I hope this shows up as a TTS option for VoiceDream (<a href="http://www.voicedream.com/" rel="nofollow">http://www.voicedream.com/</a>) soon! With the best voices they have to offer (currently, the ones from Ivona), I can suffer through a book if the subject is really interesting, but the way the samples sounded here, the WaveNet TTS could be quite pleasant to listen to.
I wonder how a hybrid model would sound, where the net generates parameters for a parametric synthesis algorithm (or a common speech codec) instead of samples, to reduce CPU costs.
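Something like this, perhaps, where a hypothetical network would emit a small parameter frame every 10 ms and a cheap deterministic synthesizer fills in the samples; all names and numbers below are assumptions, not from the paper:<p><pre><code>
# Rough sketch of the hybrid idea: instead of predicting one of 256 sample
# values 16000 times per second, a (hypothetical) network emits pitch and gain
# once per 10 ms frame, and a trivial oscillator renders the samples.
import numpy as np

SR = 16000
HOP = 160                                   # 10 ms frames

def render_frame(freq, gain, phase):
    t = np.arange(HOP) / SR
    samples = gain * np.sin(2 * np.pi * freq * t + phase)
    return samples, phase + 2 * np.pi * freq * HOP / SR

def render(frames):
    out, phase = [], 0.0
    for freq, gain in frames:                # these would come from the network
        samples, phase = render_frame(freq, gain, phase)
        out.append(samples)
    return np.concatenate(out)

# 100 parameter frames describe 1 second of audio instead of 16000 sample predictions.
audio = render([(220.0 + i, 0.5) for i in range(100)])
</code></pre>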
I suppose it's impressive in a way, but when I looked into "smoothing out" text-to-speech audio a few years ago, it seemed fairly straightforward. I was left wondering why it hadn't been done already, but alas, most engineers at these companies are either politicking know-nothing idiots or are constantly being roadblocked, preventing them from making any real advancements.