Props for mentioning BirdNET as a potentially more accessible starting point for less technical folks.

There are a couple of relative advantages to your approach that I think are notable, though:

The squeezed wav2vec2 (SEW) architecture uses Transformer layers and operates directly on the raw time-series input, whereas BirdNET first converts the audio to a spectrogram and then runs it through 2D convolution layers (a ResNet-like backbone).

That extra input representation in BirdNET suggests SEW will be much more computationally efficient for a given audio classification task (all else held equal).

Plus, simply taking a pre-trained SEW model and training a linear classifier on its embeddings would almost certainly produce a strong baseline - no GPU necessary for that (rough sketch below).

P.S. Minor typo - precision and recall are swapped here:

> "precision" (how many of the animal calls it notices) and its "recall" (the rate at which it makes accurate predictions).
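To make the baseline idea concrete, here's a rough sketch of frozen SEW embeddings plus a linear classifier. It assumes the Hugging Face transformers checkpoint asapp/sew-tiny-100k and 16 kHz mono clips; the `clips` and `labels` variables stand in for your own labelled data.

    # Frozen SEW embeddings + linear classifier: a CPU-friendly baseline.
    # Assumes `clips` is a list of 16 kHz mono float arrays and `labels` the matching species ids.
    import numpy as np
    import torch
    from transformers import AutoFeatureExtractor, SEWModel
    from sklearn.linear_model import LogisticRegression

    extractor = AutoFeatureExtractor.from_pretrained("asapp/sew-tiny-100k")
    model = SEWModel.from_pretrained("asapp/sew-tiny-100k").eval()

    def embed(waveform):
        # Mean-pool the final hidden states into one fixed-size vector per clip.
        inputs = extractor(waveform, sampling_rate=16_000, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state  # (1, frames, dim)
        return hidden.mean(dim=1).squeeze(0).numpy()

    X = np.stack([embed(clip) for clip in clips])
    clf = LogisticRegression(max_iter=1000).fit(X, labels)

Once the embeddings are cached to disk, swapping the logistic regression for any other scikit-learn classifier is trivial.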
For those who want to try a real-world version of this, the BirdCLEF 2024 competition is currently running:
<a href="https://www.kaggle.com/competitions/birdclef-2024" rel="nofollow">https://www.kaggle.com/competitions/birdclef-2024</a><p>Lots more I can say here - I've been working on problems in bioacoustics for years - but for now will just leave a link to some work on using bird song embeddings from last year.<p><a href="https://www.nature.com/articles/s41598-023-49989-z" rel="nofollow">https://www.nature.com/articles/s41598-023-49989-z</a>
For a well-polished, ready-to-use version, I recommend the Merlin Bird ID app from Cornell: https://merlin.allaboutbirds.org/
I record a few hours in the morning when the birds are noisy and then run it through BirdNET. It integrates with Audacity, so you can just pull up a WAV with its annotations and listen through for the various species. I could imagine automating this and using it to track migration patterns, etc., around your house (rough sketch below).
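For the automation part, something like the sketch below could run on a cron job after each morning's recording session. It uses the third-party birdnetlib wrapper around BirdNET-Analyzer - that choice and the exact field names are assumptions on my part, so check them against the library's docs.

    # Batch a morning's recordings through BirdNET and append detections to a CSV,
    # so species counts can be tracked across weeks and seasons.
    # Assumes `pip install birdnetlib`; detection keys below may differ from the real API.
    import csv
    from datetime import date
    from pathlib import Path

    from birdnetlib import Recording
    from birdnetlib.analyzer import Analyzer

    analyzer = Analyzer()  # loads the bundled BirdNET model once

    with open("detections.csv", "a", newline="") as f:
        writer = csv.writer(f)
        for wav in sorted(Path("morning_recordings").glob("*.wav")):
            rec = Recording(
                analyzer,
                str(wav),
                lat=47.6,           # example coordinates - use your own; location and
                lon=-122.3,         # date help BirdNET narrow the candidate species list
                date=date.today(),
                min_conf=0.5,
            )
            rec.analyze()
            for d in rec.detections:
                writer.writerow([date.today().isoformat(), wav.name,
                                 d["common_name"], d["confidence"],
                                 d["start_time"], d["end_time"]])

From there, plotting first-detection dates per species over the year gets you a crude local migration tracker.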