This is an awesome project, but it seems it was done without reference to the academic literature on source separation. In fact, people have been doing audio source separation with neural networks for years.

For instance, Eric Humphrey of Spotify's music understanding group describes using a U-Net architecture here: https://medium.com/this-week-in-machine-learning-ai/separating-vocals-in-recorded-music-at-spotify-with-eric-humphrey-51c2f85d1451 (paper at http://openaccess.city.ac.uk/19289/1/7bb8d1600fba70dd79408775cd0c37a4ff62.pdf).

They compare their performance to the widely cited state-of-the-art Chimera model (Luo 2017): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5791533/#R24, with examples at http://naplab.ee.columbia.edu/ivs.html. Judging from the examples, there's significantly less distortion than in OP's results.

Not to discourage OP from doing first-principles research at all! But it's often useful to engage with the larger community and know what has succeeded and failed in the past. This is a problem domain where progress could change the entire creative landscape around derivative works ("mashups" and the like), and interested researchers could do well to look towards collaboration rather than reinventing each other's wheels.

EDIT: The SANE conference has talks by Humphrey and many others available online: https://www.youtube.com/channel/UCsdxfneC1EdPorDUq9_XUJA/videos
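If anyone wants a feel for what the U-Net approach boils down to, here's a minimal sketch (my own toy PyTorch version, not the paper's actual model): the network takes a magnitude spectrogram of the mix and predicts a soft mask for the vocals, with a skip connection carrying fine detail from encoder to decoder.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU())
        self.dec1 = nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1)  # 32 = 16 upsampled + 16 skip

    def forward(self, mix_spec):                  # (batch, 1, freq, time)
        e1 = self.enc1(mix_spec)
        e2 = self.enc2(e1)
        d2 = self.dec2(e2)
        mask = torch.sigmoid(self.dec1(torch.cat([d2, e1], dim=1)))  # soft mask in [0, 1]
        return mask * mix_spec                    # estimated vocal spectrogram

x = torch.rand(1, 1, 512, 128)                    # fake magnitude spectrogram
vocals = TinyUNet()(x)
print(vocals.shape)                               # torch.Size([1, 1, 512, 128])
```

The real model is much deeper and trained on large collections of (mix, vocal) spectrogram pairs, but the mask-and-multiply structure is the whole trick.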
What motivates people to invent phrases like "perceptual binarization" when googling "audio binary mask" literally gives you citations from a field that has been doing this for years?

For example, "Musical sound separation based on binary time-frequency masking" (2009). Or more recent work using deep learning. Also, the field generally prefers ratio masks because they lead to better-sounding output.
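For reference, the two standard masks are one-liners once you have the isolated sources' magnitude spectrograms (toy sketch, with random arrays standing in for real spectrograms):

```python
import numpy as np

def ideal_binary_mask(vocal_mag, accomp_mag):
    # 1 where the vocal dominates the time-frequency bin, 0 elsewhere
    return (vocal_mag > accomp_mag).astype(float)

def ideal_ratio_mask(vocal_mag, accomp_mag, eps=1e-8):
    # soft mask in [0, 1]; tends to sound less distorted than the binary mask
    return vocal_mag / (vocal_mag + accomp_mag + eps)

# toy example: random magnitude spectrograms (freq bins x frames)
vocal = np.abs(np.random.randn(513, 100))
accomp = np.abs(np.random.randn(513, 100))
mix = vocal + accomp
est_vocal = ideal_ratio_mask(vocal, accomp) * mix   # apply mask to the mixture
```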
Hello, a little self-promotion: you can see our experiment with deep neural networks doing real-time audio processing in the browser, using TensorFlow.js: http://gistnoesis.github.io/

If you want to see how it's done, it's shared source: https://github.com/GistNoesis/Wisteria/

Thanks
Does anyone know if this is related to the new iZotope RX 7 vocal isolation & stemming tools? It seems to be talking about something similar, especially when it mentions using the same technique to split a song into instrument stems.

(Or to put it another way: there is commercial music software, released in the last year, that lets you do this yourself now.)

https://www.youtube.com/watch?v=kEauVQv2Quc

https://www.izotope.com/en/products/repair-and-edit/rx/music.html
I used to work in an audio processing research center back in 2003, and colleagues next to me were able to isolate each instrument in a stereo mix live, using the fact that the instruments were "placed" at different spots in the stereo plane.

Don't ask me how they did that; it was close to magic to me at the time, but I'm sure it wasn't neural networks. It probably involved convolution, though, as that is the main tool for producing audio filters.

If anyone has more info on the fundamental differences between the neural network approach and the "traditional" one, I'd be thankful.
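I can only guess at what they did, but the classic non-NN trick is to estimate each time-frequency bin's stereo position from the left/right level ratio and keep only the bins near a chosen pan position. A rough sketch of that idea (my guess, not their actual method):

```python
import numpy as np
from scipy.signal import stft, istft

def extract_pan(left, right, fs, target_ratio=1.0, tolerance=0.2):
    _, _, L = stft(left, fs, nperseg=2048)
    _, _, R = stft(right, fs, nperseg=2048)
    ratio = np.abs(L) / (np.abs(R) + 1e-8)            # > 1 means panned toward the left
    mask = (np.abs(ratio - target_ratio) < tolerance).astype(float)
    _, out_l = istft(L * mask, fs, nperseg=2048)
    _, out_r = istft(R * mask, fs, nperseg=2048)
    return out_l, out_r

# toy stereo signal: one source panned center, one panned hard left
fs = 16000
t = np.arange(fs) / fs
center = np.sin(2 * np.pi * 440 * t)
side = np.sin(2 * np.pi * 220 * t)
left, right = center + side, center.copy()
iso_l, iso_r = extract_pan(left, right, fs)           # keeps only the centered source
```

The neural-network approach learns what a voice or an instrument sounds like from data, so it can separate sources even in mono or when they sit at the same stereo position, which is exactly where the pan-based trick falls apart.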
Trivia: Avery Wang, the guy who invented the Shazam algorithm and was their CTO, did his PhD thesis on this topic:

https://ccrma.stanford.edu/~jos/EE201/More_Recent_PhD_EEs_CCRMA.html

"Instantaneous and Frequency-Warped Signal Processing Techniques for Auditory Source Separation" (1994)
Please stop publishing on Medium. I'm getting the error "You read a lot. We like that. You’ve reached the end of your free member preview for this month. Become a member now for $5/month to read this story."

Not gonna do that.
There is a lagged autoregressive technique used in forensic analysis that allows 3D reconstruction from 1D (mic) sound.

A CNN should be able to back that out too, and do other things like regenerate a 3D space. The right high-fidelity acoustic tracks could carry enough spatial information to reconstruct a stage and a performance. It would be neat/beautiful/(possibly very powerful) to back video out of audio in that way.
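I don't know the exact forensic technique, but the basic building block is presumably that each reflection shows up as a delayed copy of the direct sound, and those delays pin down path-length differences. A toy sketch of recovering one such delay from a single mono track:

```python
import numpy as np

fs = 48000
n = fs // 4                                       # 0.25 s of audio
rng = np.random.default_rng(0)
direct = rng.standard_normal(n)                   # "direct" sound, modelled as noise
delay = int(0.010 * fs)                           # reflection arrives 10 ms later
mic = direct.copy()
mic[delay:] += 0.5 * direct[:-delay]              # add one attenuated echo

ac = np.correlate(mic, mic, mode='full')[n - 1:]  # autocorrelation, lags 0..n-1
lag = np.argmax(ac[100:]) + 100                   # skip the zero-lag peak
print(lag / fs * 343.0, "metres of extra path length")   # ~3.4 m
```

Recovering actual 3D geometry needs many such delays (and a room model), which is presumably where the lagged autoregressive machinery comes in.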
The presentation of this project alone is a visual tour de force, to say nothing of the technical quality. Beautiful and easily digestible post. As with any interesting, non-toy applied ML problem, the dataset generation is really where the innovation is; it gets a neat little graphic at the end. As for how the author characterizes the problem, I think the term he's looking for is "semantic segmentation": he's trying to classify each pixel of the spectrogram as vocal/non-vocal. I'd be curious whether he could drop the dataset into pix2pix-style networks and achieve the same results.
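The segmentation framing also makes the training objective obvious: a per-bin cross-entropy between the predicted mask and a ground-truth vocal/non-vocal mask. A sketch, with random tensors standing in for real data:

```python
import torch
import torch.nn.functional as F

pred_mask = torch.rand(1, 1, 512, 128)                    # network output in [0, 1]
true_mask = (torch.rand(1, 1, 512, 128) > 0.5).float()    # 1 = vocal-dominated bin
loss = F.binary_cross_entropy(pred_mask, true_mask)       # per-pixel segmentation loss
```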
Question: has any progress been made in removing reverb?

There are many, many historical recordings (and modern ones made in less-than-ideal circumstances) that suffer badly from reverb. It seems like a valuable use case that ought to be within reach today.
Just wanted to mention that there are some folks doing real-time source separation (not sure exactly how they've implemented it) with a DNN, for reduction of background noise in, e.g., Skype conversations.

I'm not involved with them in any way, but I've been amazed by its ability to cancel out coffee-shop-style noise.

Check out https://krisp.ai/technology/ (Mac/Windows). I wish they had Linux support!

Edit: Appears they don't have Windows support yet.
Clicked into the article because I was curious how the training set was created. Using the a cappella versions is an amazing idea! I wish the article had gone more in-depth on this part.
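The article doesn't spell it out, but presumably the pairs look something like this: the mix's magnitude spectrogram as the input, and a mask derived from the aligned a cappella as the target. A guess at what that could look like:

```python
import numpy as np
from scipy.signal import stft

def make_training_pair(mix, acapella, fs):
    _, _, M = stft(mix, fs, nperseg=1024)
    _, _, V = stft(acapella, fs, nperseg=1024)
    # how much of each time-frequency bin's energy belongs to the vocal
    target_mask = np.clip(np.abs(V) / (np.abs(M) + 1e-8), 0.0, 1.0)
    return np.abs(M), target_mask                 # (network input, training target)

# toy stand-ins for a real mix and its aligned a cappella
fs = 22050
t = np.arange(fs) / fs
vocal = np.sin(2 * np.pi * 440 * t)
backing = 0.5 * np.sin(2 * np.pi * 110 * t)
x, y = make_training_pair(vocal + backing, vocal, fs)
```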
Question: currently, building earphones with great active noise cancellation is a secret kept within a few companies.

This means they're expensive ($300 headphones from Bose, etc.).

Do neural networks make this simpler?

And do you think they can be applied cheaply enough, say for $99 headphones?

I assume this would sell really well, and in time justify creating a dedicated chip.
Soon enough there will be an AI filter that can take any old hacky, coughing, wheezing singer running around on stage, singing out of tune, and turn it into virtuoso chops. Maybe even derived from their own voice.

Which will give entirely new meaning to 'lip synching'.
A fun thing to do with this would be to slurp the lyrics from one song, the beats from another, some other stream from a third, and remix the "threads" together into something new.

Basically a giant equalizer that allows you to dim or brighten each channel from multiple sources.
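Once you have the stems, the "giant equalizer" part is just per-stem gains and a sum. A minimal sketch (assuming each stem is a mono numpy array at the same sample rate):

```python
import numpy as np

def remix(stems, gains):
    # stems: dict of name -> mono waveform; gains: dict of name -> 0.0 (dim) .. 1.0+ (brighten)
    n = min(len(s) for s in stems.values())        # trim to the shortest stem
    return sum(gains.get(name, 1.0) * stem[:n] for name, stem in stems.items())

# e.g. vocals from song A, drums from song B, bass from song C (silence as placeholders)
mix = remix(
    {"vocals_a": np.zeros(44100), "drums_b": np.zeros(44100), "bass_c": np.zeros(44100)},
    {"vocals_a": 1.0, "drums_b": 0.8, "bass_c": 0.3},
)
```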
This project I've found very useful if you want access to something like what the article describes:

http://isse.sourceforge.net
I'd like to try using this kind of thing to build an automated Beat Saber mapper. The ability to orchestrate the beats very specifically would make for excellent mappings.

Alas, so many projects, too little time!
Sounds pretty good, but it exhibits the same artifacts/phasing I've heard with other source separation approaches. Good for forensics etc., but I wouldn't use this for music production.
There was a similar demo (I think from Google) here on HN sometime last year that was far more impressive. I can't seem to find it though. Anybody know what it was?