This is an awesome project, but it seems it was done without reference to the academic literature on source separation. In fact, people have been doing audio source separation with neural networks for years.

For instance, Eric Humphrey of Spotify's music understanding group describes using a U-Net architecture here: https://medium.com/this-week-in-machine-learning-ai/separating-vocals-in-recorded-music-at-spotify-with-eric-humphrey-51c2f85d1451 (paper at http://openaccess.city.ac.uk/19289/1/7bb8d1600fba70dd79408775cd0c37a4ff62.pdf).

They compare their performance to the widely cited state-of-the-art Chimera model (Luo 2017): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5791533/#R24, with examples at http://naplab.ee.columbia.edu/ivs.html. Judging from the examples, there's significantly less distortion than in OP's results.

Not to discourage OP from doing first-principles research at all! But it's often useful to engage with the larger community and know what has succeeded and failed in the past. This is a problem domain where progress could change the entire creative landscape around derivative works ("mashups" and the like), and interested researchers could do well to look towards collaboration rather than reinventing each other's wheels.

EDIT: The SANE conference has talks by Humphrey and many others available online: https://www.youtube.com/channel/UCsdxfneC1EdPorDUq9_XUJA/videos
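If anyone wants a feel for what the U-Net approach boils down to, here's a minimal sketch (my own toy PyTorch version, not the paper's actual model): the network takes a magnitude spectrogram of the mix and predicts a soft mask for the vocals, with a skip connection carrying fine detail from encoder to decoder.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU())
        self.dec1 = nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1)  # 32 = 16 upsampled + 16 skip

    def forward(self, mix_spec):                  # (batch, 1, freq, time)
        e1 = self.enc1(mix_spec)
        e2 = self.enc2(e1)
        d2 = self.dec2(e2)
        mask = torch.sigmoid(self.dec1(torch.cat([d2, e1], dim=1)))  # soft mask in [0, 1]
        return mask * mix_spec                    # estimated vocal spectrogram

x = torch.rand(1, 1, 512, 128)                    # fake magnitude spectrogram
vocals = TinyUNet()(x)
print(vocals.shape)                               # torch.Size([1, 1, 512, 128])
```

The real model is much deeper and trained on large collections of (mix, vocal) spectrogram pairs, but the mask-and-multiply structure is the whole trick.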
What motivates people to invent phrases like "perceptual binarization" when googling "audio binary mask" literally gives you citations from a field that has been doing this for years?

For example, "Musical sound separation based on binary time-frequency masking" (2009). Or more recent work using deep learning. Also, the field generally prefers ratio masks because they lead to better-sounding output.
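For reference, the two standard masks are one-liners once you have the isolated sources' magnitude spectrograms (toy sketch, with random arrays standing in for real spectrograms):

```python
import numpy as np

def ideal_binary_mask(vocal_mag, accomp_mag):
    # 1 where the vocal dominates the time-frequency bin, 0 elsewhere
    return (vocal_mag > accomp_mag).astype(float)

def ideal_ratio_mask(vocal_mag, accomp_mag, eps=1e-8):
    # soft mask in [0, 1]; tends to sound less distorted than the binary mask
    return vocal_mag / (vocal_mag + accomp_mag + eps)

# toy example: random magnitude spectrograms (freq bins x frames)
vocal = np.abs(np.random.randn(513, 100))
accomp = np.abs(np.random.randn(513, 100))
mix = vocal + accomp
est_vocal = ideal_ratio_mask(vocal, accomp) * mix   # apply mask to the mixture
```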
Hello, a little self-promotion: you can see our experiment with deep neural networks doing real-time audio processing in the browser, using TensorFlow.js: http://gistnoesis.github.io/

If you want to see how it's done, it's shared source: https://github.com/GistNoesis/Wisteria/

Thanks
Does anyone know if this is related to the new iZotope RX 7 vocal isolation & stemming tools? It seems to be talking about something similar, especially when it mentions using the same technique to split a song into instrument stems.

(Or to put it another way: there is commercial music software, released in the last year, that lets you do this yourself now.)

https://www.youtube.com/watch?v=kEauVQv2Quc

https://www.izotope.com/en/products/repair-and-edit/rx/music.html
I used to work in an audio processing research center back in 2003, and colleagues next to me were able to isolate each instrument in a stereo mix live, using the fact that the instruments were "placed" at different spots in the stereo plane.

Don't ask me how they did that; it was close to magic to me at the time, but I'm sure it wasn't neural networks. It probably involved convolution, though, as that is the main tool for producing audio filters.

If anyone has more info on the fundamental differences between the neural network approach and the "traditional" one, I'd be thankful.
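I can only guess at what they did, but the classic non-NN trick is to estimate each time-frequency bin's stereo position from the left/right level ratio and keep only the bins near a chosen pan position. A rough sketch of that idea (my guess, not their actual method):

```python
import numpy as np
from scipy.signal import stft, istft

def extract_pan(left, right, fs, target_ratio=1.0, tolerance=0.2):
    _, _, L = stft(left, fs, nperseg=2048)
    _, _, R = stft(right, fs, nperseg=2048)
    ratio = np.abs(L) / (np.abs(R) + 1e-8)            # > 1 means panned toward the left
    mask = (np.abs(ratio - target_ratio) < tolerance).astype(float)
    _, out_l = istft(L * mask, fs, nperseg=2048)
    _, out_r = istft(R * mask, fs, nperseg=2048)
    return out_l, out_r

# toy stereo signal: one source panned center, one panned hard left
fs = 16000
t = np.arange(fs) / fs
center = np.sin(2 * np.pi * 440 * t)
side = np.sin(2 * np.pi * 220 * t)
left, right = center + side, center.copy()
iso_l, iso_r = extract_pan(left, right, fs)           # keeps only the centered source
```

The neural-network approach learns what a voice or an instrument sounds like from data, so it can separate sources even in mono or when they sit at the same stereo position, which is exactly where the pan-based trick falls apart.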
Trivia: Avery Wang, the guy who invented the Shazam algorithm and was their CTO, did his PhD thesis on this topic:

https://ccrma.stanford.edu/~jos/EE201/More_Recent_PhD_EEs_CCRMA.html

"Instantaneous and Frequency-Warped Signal Processing Techniques for Auditory Source Separation" (1994)
Please stop publishing on Medium. I'm getting the error "You read a lot. We like that. You’ve reached the end of your free member preview for this month. Become a member now for $5/month to read this story."

Not gonna do that.
There is a lagged autoregressive technique used in forensic analysis that allows 3D reconstruction from 1D (mic) sound.

A CNN should be able to back that out too, and do other things like regenerate a 3D space. The right high-fidelity acoustic tracks could carry enough spatial information to reconstruct a stage and a performance. It would be neat/beautiful/(possibly very powerful) to back video out of audio in that way.
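I don't know the exact forensic technique, but the basic building block is presumably that each reflection shows up as a delayed copy of the direct sound, and those delays pin down path-length differences. A toy sketch of recovering one such delay from a single mono track:

```python
import numpy as np

fs = 48000
n = fs // 4                                       # 0.25 s of audio
rng = np.random.default_rng(0)
direct = rng.standard_normal(n)                   # "direct" sound, modelled as noise
delay = int(0.010 * fs)                           # reflection arrives 10 ms later
mic = direct.copy()
mic[delay:] += 0.5 * direct[:-delay]              # add one attenuated echo

ac = np.correlate(mic, mic, mode='full')[n - 1:]  # autocorrelation, lags 0..n-1
lag = np.argmax(ac[100:]) + 100                   # skip the zero-lag peak
print(lag / fs * 343.0, "metres of extra path length")   # ~3.4 m
```

Recovering actual 3D geometry needs many such delays (and a room model), which is presumably where the lagged autoregressive machinery comes in.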
The presentation of this project alone is a visual tour de force, to say nothing of the technical quality. Beautiful and easily digestible post. As with any interesting, non-toy applied ML problem, the dataset generation is really where the innovation is; it gets a neat little graphic at the end. As for how the author characterizes the problem, I think the term he's looking for is "semantic segmentation": he's trying to classify each pixel of the spectrogram as vocal/non-vocal. I'd be curious whether he could drop the dataset into pix2pix-style networks and achieve the same results.
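The segmentation framing also makes the training objective obvious: a per-bin cross-entropy between the predicted mask and a ground-truth vocal/non-vocal mask. A sketch, with random tensors standing in for real data:

```python
import torch
import torch.nn.functional as F

pred_mask = torch.rand(1, 1, 512, 128)                    # network output in [0, 1]
true_mask = (torch.rand(1, 1, 512, 128) > 0.5).float()    # 1 = vocal-dominated bin
loss = F.binary_cross_entropy(pred_mask, true_mask)       # per-pixel segmentation loss
```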
Question: has any progress been made in removing reverb?

There are many, many historical recordings (and modern ones made in less-than-ideal circumstances) that suffer badly from reverb. It seems like a valuable use case that ought to be within reach today.
Just wanted to mention that there are some folks doing real-time source separation (not sure exactly how they've implemented it) with a DNN, for reduction of background noise in, e.g., Skype conversations.

I'm not involved with them in any way, but I've been amazed by its ability to cancel out coffee-shop-style noise.

Check out https://krisp.ai/technology/ (Mac/Windows). I wish they had Linux support!

Edit: Appears they don't have Windows support yet.
Clicked into the article because I was curious how the training set was created. Using the a cappella versions is an amazing idea! I wish the article had gone more in-depth on this part.
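The article doesn't spell it out, but presumably the pairs look something like this: the mix's magnitude spectrogram as the input, and a mask derived from the aligned a cappella as the target. A guess at what that could look like:

```python
import numpy as np
from scipy.signal import stft

def make_training_pair(mix, acapella, fs):
    _, _, M = stft(mix, fs, nperseg=1024)
    _, _, V = stft(acapella, fs, nperseg=1024)
    # how much of each time-frequency bin's energy belongs to the vocal
    target_mask = np.clip(np.abs(V) / (np.abs(M) + 1e-8), 0.0, 1.0)
    return np.abs(M), target_mask                 # (network input, training target)

# toy stand-ins for a real mix and its aligned a cappella
fs = 22050
t = np.arange(fs) / fs
vocal = np.sin(2 * np.pi * 440 * t)
backing = 0.5 * np.sin(2 * np.pi * 110 * t)
x, y = make_training_pair(vocal + backing, vocal, fs)
```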
Question: currently, building earphones with great active noise cancellation is a secret kept within a few companies.

This means they're expensive ($300 headphones from Bose, etc.).

Do neural networks make this simpler?

And do you think they can be applied cheaply enough, say for $99 headphones?

I assume this would sell really well, and in time justify creating a dedicated chip.
Soon enough there will be an AI filter that can take any old hacky, coughing, wheezing singer running around on stage, singing out of tune, and turn it into virtuoso chops. Maybe even derived from their own voice.

Which will give entirely new meaning to 'lip synching'.
A fun thing to do with this would be to slurp the lyrics from one song, the beats from another, some other stream from a third, and remix the "threads" together into something new.

Basically a giant equalizer that allows you to dim or brighten each channel from multiple sources.
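Once you have the stems, the "giant equalizer" part is just per-stem gains and a sum. A minimal sketch (assuming each stem is a mono numpy array at the same sample rate):

```python
import numpy as np

def remix(stems, gains):
    # stems: dict of name -> mono waveform; gains: dict of name -> 0.0 (dim) .. 1.0+ (brighten)
    n = min(len(s) for s in stems.values())        # trim to the shortest stem
    return sum(gains.get(name, 1.0) * stem[:n] for name, stem in stems.items())

# e.g. vocals from song A, drums from song B, bass from song C (silence as placeholders)
mix = remix(
    {"vocals_a": np.zeros(44100), "drums_b": np.zeros(44100), "bass_c": np.zeros(44100)},
    {"vocals_a": 1.0, "drums_b": 0.8, "bass_c": 0.3},
)
```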
This project I've found very useful if you want access to something like what the article describes:

http://isse.sourceforge.net
I'd like to try using this kind of thing to build an automated Beat Saber mapper. The ability to orchestrate the beats very specifically would make for excellent mappings.

Alas, so many projects, too little time!
Sounds pretty good, but it exhibits the same artifacts/phasing I've heard with other source separation approaches. Good for forensics etc., but I wouldn't use this for music production.
There was a similar demo (I think from Google) here on HN sometime last year that was far more impressive. I can't seem to find it though. Anybody know what it was?