Are y'all running the diffusion in PyTorch eager mode?

AITemplate, stable-fast, or even torch.compile could get that down to ~60 ms, I bet, though I'm not sure the example implementations would work on a non-SD architecture.
I worked on the GPU/infra side of this, so feel free to AMA.
Ultimately the LCM is just an SD UNet trained with a new objective, so a lot of SD optimizations transfer directly to LCMs.
I'll write an essay about it later, but the servers are exploding even with just experimental alpha users.

I'm excited about what H200s or B100s will mean for technology like this.