Great! Refreshing to see an ML post using some well-understood methods instead of throwing a random neural net from Kaggle at the problem...<p>Tangential:<p>> <i>Is a given audio file a sample of a kick drum, snare drum, hi-hat, other percussion, or something else? (...) Humans have no trouble classifying these two sounds, as we’ve likely heard them tens of thousands of times before.</i><p>Are people taught that in schools or something? Because I personally can't classify those sounds, don't know these names, and I'm not sure how I was supposed to learn them, other than by playing in a band.
Had an idea to do this a couple of months ago, but haven't got around to implementing it yet. I'm curious: did you consider using standard image processing techniques on spectrograms as an alternative to decision trees? I know that's how iZotope does their Neutron instrument detection, but I'm not sure how it would compare performance-wise. Also, have you tried classifying percussive sounds that aren't actual drums? I'd love to see how it categorizes various stuff.
Surprised there's no discussion of FFT, power spectra, etc. Would like to see someone with an electrical engineering/signal processing background work on this problem.
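For anyone wondering what the FFT / power spectrum route might look like, here is a minimal sketch using NumPy. It is not the article's method. The two sine tones are made-up stand-ins (not real drum samples), and `power_spectrum` and `spectral_centroid` are hypothetical helper names chosen for illustration.

```python
import numpy as np

def power_spectrum(signal, sample_rate):
    """Return (frequencies in Hz, power) for a real-valued 1-D signal."""
    # Hann window tapers the edges to reduce spectral leakage.
    windowed = signal * np.hanning(len(signal))
    spectrum = np.fft.rfft(windowed)
    power = np.abs(spectrum) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return freqs, power

def spectral_centroid(freqs, power):
    """Power-weighted mean frequency: a rough 'brightness' measure."""
    return float(np.sum(freqs * power) / np.sum(power))

# Assumption: synthetic tones standing in for a low (kick-like) and a
# high (hi-hat-like) sound. Real percussion is far noisier than this.
sample_rate = 44100
t = np.arange(4096) / sample_rate
low_tone = np.sin(2 * np.pi * 60 * t)
high_tone = np.sin(2 * np.pi * 8000 * t)

low_freqs, low_power = power_spectrum(low_tone, sample_rate)
high_freqs, high_power = power_spectrum(high_tone, sample_rate)

low_centroid = spectral_centroid(low_freqs, low_power)
high_centroid = spectral_centroid(high_freqs, high_power)
low_peak = float(low_freqs[np.argmax(low_power)])
```

A feature like the spectral centroid could then be fed to whatever classifier you like, decision trees included.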
Could I use something like this to identify which of two or three people is speaking in an audio clip? Assume I can label several samples of each person's speech, then present an unlabeled sample for classification.
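The setup described above (a few labeled clips per speaker, then one unlabeled clip) can be sketched as a nearest-centroid classifier. This is only a toy under heavy assumptions: the "voices" are synthetic harmonic tones, the speaker names and all function names are hypothetical, and the band-energy features are a crude stand-in. Real speaker identification typically uses MFCCs or learned embeddings.

```python
import numpy as np

def band_features(signal, n_bands=16):
    """Log energy in n_bands equal-width frequency bands."""
    power = np.abs(np.fft.rfft(signal * np.hanning(len(signal)))) ** 2
    bands = np.array_split(power, n_bands)
    return np.log(np.array([b.sum() for b in bands]) + 1e-12)

def fit_centroids(labeled):
    """labeled: dict mapping speaker name -> list of 1-D sample arrays."""
    return {
        name: np.mean([band_features(clip) for clip in clips], axis=0)
        for name, clips in labeled.items()
    }

def classify(clip, centroids):
    """Return the speaker whose mean feature vector is closest to the clip's."""
    feats = band_features(clip)
    return min(centroids, key=lambda name: np.linalg.norm(feats - centroids[name]))

# Assumption: synthetic "voices" built from a pitch plus four harmonics and noise.
rng = np.random.default_rng(0)
sample_rate = 16000
t = np.arange(8000) / sample_rate

def fake_voice(pitch_hz):
    tone = sum(np.sin(2 * np.pi * pitch_hz * k * t) / k for k in range(1, 6))
    return tone + 0.05 * rng.standard_normal(len(t))

labeled = {
    "alice": [fake_voice(220) for _ in range(3)],  # hypothetical speaker
    "bob": [fake_voice(110) for _ in range(3)],    # hypothetical speaker
}
centroids = fit_centroids(labeled)
predicted = classify(fake_voice(220), centroids)
```

With real recordings you would swap `fake_voice` for loaded audio and probably need far more than a handful of samples per person.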