
Finding the genre of a song with Deep Learning

107 points by Despoisj over 8 years ago

11 comments

iverjo over 8 years ago

To the author: Have you tried to use a logarithmic frequency scale in the spectrogram? [1] That representation is closer to the way humans perceive sound, and gives you finer resolution in the lower frequencies. [2] If you want to make your representation even closer to human perception, take a look at Google's CARFAC research. [3] Basically, they model the ear. I've prepared a Python utility for converting sound to a Neural Activity Pattern (resembles a spectrogram when you plot it) here: https://github.com/iver56/carfac/tree/master/util

[1] https://sourceforge.net/p/sox/feature-requests/176/

[2] https://en.wikipedia.org/wiki/Mel_scale

[3] http://research.google.com/pubs/pub37215.html
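The mel scale from [2] is easy to sketch in plain NumPy. The 11.025 kHz upper edge and 128-band count below are illustrative assumptions, not values from the article:

```python
import numpy as np

def hz_to_mel(f):
    """Hz -> mel (O'Shaughnessy formula): finer resolution at low frequencies."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse mapping, mel -> Hz."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

# 128 bands equally spaced on the mel scale between 0 and 11.025 kHz:
mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(11025.0), 129)
hz_edges = mel_to_hz(mel_edges)

# Low bands are narrow (tens of Hz), high bands are wide (hundreds of Hz),
# which is what makes this representation closer to human hearing.
print(np.diff(hz_edges)[0], np.diff(hz_edges)[-1])
```

Mapping a linear-frequency spectrogram through these band edges before training would concentrate the network's capacity on the perceptually important low end.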
nkurz over 8 years ago

Wow, I find it incredible that this works. As I understand it, the approach is to do a Fourier transform on a couple of seconds of the song to create a 128x128 pixel spectrogram. Each horizontal pixel represents a 20 ms slice in time, and each vertical pixel represents 1/128 of the frequency domain.

Then, treating these spectrograms as images, train a neural net to classify them using pre-labelled samples. Then take samples from the unknown songs and let it classify them. I find it incredible that 2.5 seconds of sound represented as a tiny picture captures enough information for reliable classification, but apparently it does!
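The pipeline described above can be sketched with plain NumPy. The 12800 Hz sample rate and Hanning window are assumptions chosen so that a 20 ms slice is exactly 256 samples; the article's actual parameters may differ:

```python
import numpy as np

def spectrogram_128(signal, sr=12800):
    """Turn ~2.56 s of mono audio into a 128x128 log-magnitude spectrogram.

    Each of the 128 columns covers one 20 ms window (sr * 0.02 samples);
    each of the 128 rows is one FFT frequency bin.
    """
    win = int(sr * 0.02)              # 256 samples per 20 ms slice
    cols = []
    for i in range(128):
        frame = signal[i * win:(i + 1) * win]
        spec = np.abs(np.fft.rfft(frame * np.hanning(win)))[:128]
        cols.append(np.log1p(spec))   # log compression, image-like range
    return np.stack(cols, axis=1)     # shape (128 freq bins, 128 time slices)

# A pure 440 Hz sine should light up a single frequency row.
t = np.arange(12800 * 3) / 12800
img = spectrogram_128(np.sin(2 * np.pi * 440 * t))
print(img.shape)  # (128, 128)
```

The resulting array is exactly the kind of "tiny picture" a 2-D CNN can be trained on.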
chestervonwinch over 8 years ago

1. I wonder how the continuous wavelet transform would compare to the windowed Fourier transform used here. See [1] for a Python implementation, for example.

2. The size of the frequency-analysis blocks seems arbitrary. I wonder if there is a "natural" block size based on a song's tempo, say 1 bar. This would of course require a priori tempo knowledge or a run-time estimate.

[1]: https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.signal.cwt.html
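For contrast with the windowed Fourier transform, here is a minimal CWT sketch. It hand-rolls the Ricker ("Mexican hat") wavelet that `scipy.signal.cwt` uses by default, so it does not depend on any particular SciPy version; the wavelet length and width range are arbitrary choices for illustration:

```python
import numpy as np

def ricker(points, a):
    """Ricker ("Mexican hat") wavelet of width a, sampled at `points` points."""
    t = np.arange(points) - (points - 1) / 2.0
    norm = 2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25)
    return norm * (1.0 - (t / a) ** 2) * np.exp(-(t ** 2) / (2.0 * a ** 2))

def cwt(signal, widths, points=100):
    """Continuous wavelet transform: one convolution per wavelet width.

    Narrow widths give fine time resolution (high frequencies); wide widths
    give fine frequency resolution (low frequencies) -- unlike the fixed
    trade-off of a windowed FFT.
    """
    return np.stack([
        np.convolve(signal, ricker(min(points, len(signal)), w), mode="same")
        for w in widths
    ])

sig = np.sin(2 * np.pi * np.arange(512) / 32.0)  # period-32 oscillation
scalogram = cwt(sig, widths=np.arange(1, 31))
print(scalogram.shape)  # (30, 512)
```

The scalogram is a 2-D array just like the spectrogram, so it could be dropped into the same CNN pipeline for a direct comparison.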
maxerickson over 8 years ago

See also http://everynoise.com/ which is a view into how Spotify classifies music.

The creator wrote about it here:

http://blog.echonest.com/post/52385283599/how-we-understand-music-genres

and writes a lot about it on their blog:

http://www.furia.com/page.cgi?terms=noise&type=search

Of course those are going in the other direction, not generating the classification from the data, but it's probably one of the best data sets as far as classifying existing music.
jschmitz28 over 8 years ago

Unless I'm misunderstanding the validation set, I'm skeptical of the ability of this classifier to tag unlabeled tracks, given that it is only being trained and tested on tracks which are already known to belong to one of the few trained genres. I'd be curious to see the performance if you were to additionally test on tracks which are not any of (Hardcore, Dubstep, Electro, Classical, Soundtrack and Rap), with a correct prediction being no tag.
iverjo over 8 years ago

Nice approach, and well explained! By the way, Niland is a startup that also does music labeling with the help of deep learning.

Demo available here: http://demo.niland.io/

For example, it can output Drum Machine: 87%, House: 88%, Female Voice: 55%, Groovy: 93%
GFK_of_xmaspast over 8 years ago

See also Bob Sturm's work on genre classification: http://link.springer.com/article/10.1007/s10844-013-0250-y
tunesmith over 8 years ago

That's pretty cool, I'd like to use something like this to tell me what genre my own songs are. It's annoying to write a song, then upload it to some service or another, and have no idea what genre to pick. :-) My stuff is somewhere in the jazz-influenced singer-songwriter American piano-pop realm, which is a combination that works for me, but it generally feels like I'm selling the song short if I have to pick only one.
return0 over 8 years ago

Good luck convincing musicians that "THAT's your genre"
dkarapetyan over 8 years ago

Hmm, convolution is a perfectly good operation to run on waveforms as well. In fact, the Wikipedia article (https://en.wikipedia.org/wiki/Convolution) shows the operation on functions, which would correspond to time-domain waveforms. What is the point of converting everything to pictures and then using 2D convolutions when that step could have been skipped entirely?

Converting to pictures is unnecessary. It makes the processing harder. The pooling should just happen on segments of the waveform instead of the Fourier-transform (frequency-domain) picture spectrograms.
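The time-domain alternative suggested here can be sketched directly: a 1-D convolution plus pooling on the raw waveform, with no spectrogram image involved. The tiny `[1, -1]` filter below is an arbitrary stand-in for a kernel a network would learn:

```python
import numpy as np

def conv1d(x, kernel):
    """Valid 1-D convolution of a waveform with a filter."""
    return np.convolve(x, kernel[::-1], mode="valid")

def max_pool1d(x, size):
    """Non-overlapping max pooling over waveform segments."""
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)

# A toy feature extractor applied straight to the time-domain signal:
# convolve, rectify, then pool -- the 1-D analogue of a conv + pool layer.
wave = np.sin(2 * np.pi * np.arange(1024) / 64.0)
feat = max_pool1d(np.abs(conv1d(wave, np.array([1.0, -1.0]))), size=16)
print(feat.shape)  # (63,)
```

Whether learned 1-D filters on raw audio match the spectrogram-image approach is an empirical question; the point is only that the convolution itself needs no 2-D detour.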
jtmarmon over 8 years ago

I'm not super familiar with deep learning, so forgive me if I'm missing some nuance, but what's the purpose of writing/reading to/from images? It seems like it would add a ton of processing time. Could the CNN not just read from a 50-item array of tuples representing the data from the 20 ms slice?