I was struck by the comparison between audio spectra and image spectra. Image spectra follow a strong power law, but audio spectra have more power in the middle bands. Why? One part of the issue is that the visual spectrum is very narrow (only about a factor of 2 in frequency from red to blue) compared to audio (3 orders of magnitude, from 20Hz to 20kHz).<p>But another issue not mentioned in the article is that in images we can zoom in and out arbitrarily, so the real-world width of a pixel can change: it might be 1mm in one image, or 1cm in another, or 1m or 1km. Whereas in audio, the “width of a pixel” (the time between two audio samples) is a fixed amount of time, usually 1/44.1kHz. Even if a recording is at a different sample rate, we would convert all the audio to the <i>same</i> sample rate before training an NN. The equivalent of this for images would be rescaling all images to a common physical scale, so that a picture of a cat is, say, 100x100 pixels, while a picture of a tiger is 300x300.<p>Which, come to think of it, would potentially be an interesting thing to do.
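<p>For what it's worth, here is a rough sketch of what that rescaling could look like. It only computes target sizes, not the actual resampling. The function name, the 200 pixels-per-metre scale, and the 0.5m and 1.5m object widths are all made up for illustration, and it assumes each image is a tight crop of its subject.

```python
def rescale_to_common_scale(width_px, height_px, object_width_m, px_per_metre=200.0):
    """Return (new_width, new_height) so that one pixel always covers
    1 / px_per_metre metres of the real world.

    Assumes the image is a tight crop whose full width spans the object
    (a simplifying assumption, not something from the article).
    """
    if min(width_px, height_px, object_width_m, px_per_metre) <= 0:
        raise ValueError("all arguments must be positive")
    # Width in pixels that the object should occupy at the common scale.
    new_width = max(1, round(object_width_m * px_per_metre))
    # Apply the same scale factor to the height to keep the aspect ratio.
    scale = new_width / width_px
    new_height = max(1, round(height_px * scale))
    return new_width, new_height


# Hypothetical inputs: two 800x800 photos, one of a ~0.5m cat, one of a ~1.5m tiger.
cat = rescale_to_common_scale(800, 800, object_width_m=0.5)
tiger = rescale_to_common_scale(800, 800, object_width_m=1.5)
print(cat, tiger)
```

With those made-up numbers the cat ends up at 100x100 and the tiger at 300x300, i.e. the image analogue of resampling every clip to one sample rate.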