
Diffusion Is Spectral Autoregression

223 points · by ackbar03 · 8 months ago

13 comments

HarHarVeryFunny · 8 months ago

The high and low frequency components of speech are produced and perceived in different ways.

The lower frequencies (roughly below 4 kHz) are created by the vocal cords opening and closing at the fundamental frequency, plus harmonics of that fundamental (e.g. 100 Hz with 200/300/400 Hz etc. harmonics), with this frequency spectrum then being modulated by the resonances of the vocal tract, which change during pronunciation. What we perceive as speech is primarily the *changes* to these resonances (aka formants) due to articulation/pronunciation.

The higher frequencies present in speech mostly come from "white noise" created by the turbulence of forcing air out through closed teeth etc. (e.g. the "S" sound), and our perception of these "fricative" speech sounds is based on the onset/offset of energy in these higher 4-8 kHz frequencies. Frequencies above 8 kHz are not very perceptually relevant, and may be filtered out (e.g. they are not present in analog telephone speech).
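A small numpy sketch of the split described in this comment (the signals and band edges are toy stand-ins, not a model of real speech): a harmonic "voiced" signal keeps all its energy below 4 kHz, while band-limited noise stands in for a fricative.

```python
import numpy as np

fs = 16_000                      # sample rate, Hz
t = np.arange(fs) / fs           # one second of samples

# Voiced sketch: a 100 Hz fundamental plus decaying harmonics, all of
# which sit below 4 kHz here (harmonics up to 39 * 100 Hz).
f0 = 100
voiced = sum(np.sin(2 * np.pi * f0 * k * t) / k for k in range(1, 40))

# Fricative sketch: white noise band-limited to roughly 4-8 kHz.
rng = np.random.default_rng(0)
spec = np.fft.rfft(rng.standard_normal(fs))
freqs = np.fft.rfftfreq(fs, 1 / fs)
spec[(freqs < 4000) | (freqs > 8000)] = 0
fricative = np.fft.irfft(spec, fs)

def band_energy(x, lo, hi):
    """Total spectral power of x between lo and hi Hz."""
    X = np.abs(np.fft.rfft(x)) ** 2
    f = np.fft.rfftfreq(len(x), 1 / fs)
    return X[(f >= lo) & (f < hi)].sum()
```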
thho23i4234343 · 8 months ago

I don't mean to be mean, but: what is surprising about any of this?

Joseph Fourier's solution to the heat equation (linear diffusion) was in fact the origin of the Fourier transform. The high-frequency coefficients decay (as exp(-k²t), IIRC) there; the reverse direction is also known to be unstable (numerically, and it is singular from the equilibrium).

Moreover, the reformulation doesn't immediately reveal some computational speedup, or a better alternative formulation (which is usually a measure of how valuable it is epistemically).

(Edit: note that the heat equation is more akin to the Fokker-Planck equation, not actual diffusion as an SDE as is used in diffusion models.)
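The per-mode decay is easy to verify numerically. A sketch of the standard spectral solution of the 1-D periodic heat equation, where each Fourier mode k decays by exactly exp(-k²t):

```python
import numpy as np

# Heat equation u_t = u_xx on a periodic domain, solved spectrally: each
# Fourier mode k evolves independently as exp(-k^2 t), so high frequencies
# die off first. Running this backwards would amplify them instead, which
# is the numerical instability mentioned above.
N = 256
x = np.linspace(0, 2 * np.pi, N, endpoint=False)
u0 = np.sin(x) + np.sin(20 * x)       # a low (k=1) and a high (k=20) mode

k = np.fft.fftfreq(N, d=1 / N)        # integer wavenumbers for this grid
t = 0.05
u = np.fft.ifft(np.fft.fft(u0) * np.exp(-(k ** 2) * t)).real

def mode_amp(signal, m):
    """Amplitude of Fourier mode m (0.5 for a unit sine)."""
    return np.abs(np.fft.fft(signal)[m]) / len(signal)

# k=1 barely decays (exp(-0.05) ~ 0.95); k=20 is crushed (exp(-20) ~ 2e-9)
```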
nyanpasu64 · 8 months ago

> I won't speculate about why images exhibit this behaviour and sound seemingly doesn't, but it is certainly interesting (feel free to speculate away in the comments!).

Images have a large near-DC component (solid colors) and useful time-domain properties, while human hearing starts at ~20 Hz and the frequencies needed to understand speech range from 300 Hz to 4 kHz (spitballing based on the bandwidth of analog phones).

What would happen if you built a diffusion model using pink noise to corrupt all coefficients simultaneously? Alternatively, what if you used something other than noise (like a direct blur) for the model to reverse?
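A hedged sketch of what "corrupt with pink noise" could look like (the 1/f shaping and all names here are illustrative, not from the article): shape white Gaussian noise so its amplitude spectrum falls as 1/f, then add it to the data.

```python
import numpy as np

def pink_noise(n, rng):
    """1/f-shaped noise: divide a white spectrum by frequency (sketch)."""
    white = np.fft.rfft(rng.standard_normal(n))
    f = np.fft.rfftfreq(n)
    f[0] = f[1]                       # avoid dividing by zero at DC
    return np.fft.irfft(white / f, n)

rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 8 * np.pi, 4096))
noise = pink_noise(4096, rng)
corrupted = signal + 0.1 * noise / noise.std()   # one toy corruption step
```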
magicalhippo · 8 months ago

Not my area, enjoyed the read. It reminded me of how you can decode a scaled-down version of a JPEG image by simply ignoring the higher-order DCT coefficients.

As such, it seems the statement is that stable diffusion is like an autoregressive model which predicts the next set of higher-order FT coefficients from the lower-order ones.

Seems like this is something one could do with a "regular" autoregressive model; has this been tried? Seems obvious so I assume so, but curious how it compares.
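The JPEG trick this comment mentions can be sketched in a few lines (assumes SciPy for the 2-D DCT; the 2x2 cutoff is an arbitrary choice for illustration): keep only the low-order coefficients of an 8x8 block and the inverse transform gives a smooth, low-pass version, while the full coefficient set is lossless.

```python
import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(0)
block = rng.uniform(0, 255, size=(8, 8))   # stand-in for one JPEG block

coeffs = dctn(block, norm="ortho")
low = coeffs.copy()
low[2:, :] = 0                        # keep only the 2x2 low-order corner
low[:, 2:] = 0
approx = idctn(low, norm="ortho")     # blurry "scaled-down" reconstruction
exact = idctn(coeffs, norm="ortho")   # full reconstruction is exact
```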
andersbthuesen · 8 months ago
This post reminded me of a conversation I had with my cousins about language and learning. It’s interesting how (most?) languages seem inherently sequential, while ideas and knowledge tend to have a more hierarchical structure, with a “base frequency” communicating the basic idea and higher frequency overtones adding the nuances. I wonder what implications this might have in teaching current LLMs to reason?
WithinReason · 8 months ago

To me this means that you could significantly speed up image generation by using a lower resolution at the beginning of the generation process and gradually transitioning to higher resolutions. This would also help with the attention mechanism not getting overwhelmed when generating a high-resolution image from scratch.

Also, you should probably enforce some kind of frequency cutoff later, when you're generating the high frequencies, so that you don't destroy low-frequency details later in the process.
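A one-dimensional toy version of this coarse-to-fine idea (purely illustrative, not from the article): an early stage restricted to frequencies below a cutoff produces exactly the low coefficients that later high-frequency stages should leave untouched.

```python
import numpy as np

def low_pass(x, keep):
    """Zero all frequency bins at or above `keep` (illustrative cutoff)."""
    X = np.fft.rfft(x)
    X[keep:] = 0
    return np.fft.irfft(X, len(x))

rng = np.random.default_rng(1)
x = rng.standard_normal(256)          # stand-in for a fully generated signal
coarse = low_pass(x, 16)              # what an early low-res stage would see

# The low bins of `coarse` match those of `x` exactly, so later stages only
# need to fill in frequencies above the cutoff.
```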
nowayno583 · 8 months ago

Intuitively, audio is way more sensitive to phase and persistence because of the time domain. So maybe audio models look more like video models than image models?

I'm not really sure how current video-generating models work, but maybe we could get some insight into them by looking at how current audio models work?

I think we are looking at an autoregression of autoregressions of sorts, where each PSD + phase is used to output the next, right? Probably with different-sized windows of persistence as "tokens". But I'm way out of my depth here!
shaunregenbaum · 8 months ago
This was a fascinating read. I wonder if anyone has done an analysis on the FT structures of various types of data from molecular structures to time series data. Are all domains different, or do they share patterns?
jmmcd · 8 months ago

I was struck by the comparison between audio spectra and image spectra. Image spectra have a strong power-law effect, but audio spectra have more power in the middle bands. Why? One part of the issue is that the visual spectrum is very narrow (just one order of magnitude from red to blue) compared to audio (four orders of magnitude from 20 Hz to 20 kHz).

But another issue not mentioned in the article is that in images we can zoom in/out arbitrarily. So the width of a pixel can change: it might be 1 mm in one image, or 1 cm in another, or 1 m or 1 km. Whereas in audio, the "width of a pixel" (the time between two audio samples) is a fixed amount of time (usually 1/44.1 kHz), and even if a recording is at a different sample rate, we would convert all audio to the *same* sample rate before training an NN. The equivalent of this for images would be rescaling all images so that a picture of a cat is, say, 100x100 pixels, while a picture of a tiger is 300x300.

Which, come to think of it, would potentially be an interesting thing to do.
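For reference, the RAPSD that this comparison rests on can be computed in a few lines (a minimal sketch; real measurements would window the data and average over many images): average the 2-D power spectrum over rings of equal spatial frequency.

```python
import numpy as np

def rapsd(img):
    """Radially averaged power spectral density of a 2-D array (sketch)."""
    psd = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = img.shape
    y, x = np.indices((h, w))
    r = np.hypot(y - h // 2, x - w // 2).astype(int)   # ring index per pixel
    return np.bincount(r.ravel(), weights=psd.ravel()) / np.bincount(r.ravel())

rng = np.random.default_rng(0)
curve = rapsd(rng.standard_normal((64, 64)))   # white noise: roughly flat
```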
theptip · 8 months ago

> The RAPSD of Gaussian noise is also a straight line on a log-log plot; but a horizontal one, rather than one that slopes down. This reflects the fact that Gaussian noise contains all frequencies in equal measure

Huh. Does this mean that pink noise would be a better prior for diffusion models than Gaussian noise, as your denoiser doesn't need to learn to adjust the overall distribution? Or is this shift in practice not a hard thing to learn on the scale of a training run?
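The quoted claim is easy to check numerically (a sketch; the 1/f construction of the "pink" prior is an assumption of mine, not from the post): fit the log-log slope of the power spectrum, which comes out near 0 for white Gaussian noise and near -2 (in power) for 1/f-shaped noise.

```python
import numpy as np

def spectral_slope(x):
    """Least-squares slope of log power vs. log frequency (DC excluded)."""
    X = np.abs(np.fft.rfft(x)) ** 2
    f = np.fft.rfftfreq(len(x))
    return np.polyfit(np.log(f[1:]), np.log(X[1:]), 1)[0]

rng = np.random.default_rng(0)
n = 1 << 16
white = rng.standard_normal(n)        # flat spectrum: slope ~ 0

# Pink-ish noise: divide a white amplitude spectrum by frequency
W = np.fft.rfft(rng.standard_normal(n))
f = np.fft.rfftfreq(n)
f[0] = f[1]
pink = np.fft.irfft(W / f, n)         # power ~ 1/f^2: slope ~ -2
```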
catgary · 8 months ago

I feel like Song et al. characterized diffusion models as SDEs pretty unambiguously, and it connects to optimal transport in a pretty unambiguous manner. I understand the desire to give different perspectives, but once you start using multiple hedge words/qualifiers like:

> basically an approximate version of the Fourier transform!

you should take a step back and ask, "am I actually muddying the water right now?"
slashdave · 8 months ago
This has little to do with diffusion. The aspects described relate to images (and sound) and are true for VAE models, for example. I mean, what else is a UNet?
theo1996 · 8 months ago

Well, yes: econometrics and time-series analysis had already described all the methods and functions for "AI", but marketing idiots decided to create new names for 30-year-old knowledge.