I've seen some models like MuseGAN, but nothing yet with the same level of performance and generality as Stable Diffusion (granted, they're different applications). Does such a model exist?
I'm looking forward to such a model being announced too; I'm sure people are already working on it.<p>I did think, though, that a prerequisite for such a model would be a system that could separate a track into its component instruments (and reverse-engineer all the audio mixing that went into the final product) in order to reduce the dimensionality of the input to the learning model.<p>There's been some progress on that front, but not enough to produce a perfect transcription, and I'm not even sure a transcription to sheet music would be the ideal data representation for an AI to truly understand what makes a good piece of music anyway.
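To make the "separate a track into its component instruments" idea concrete: real stem-separation systems (e.g. Demucs, Spleeter) use learned models, but the underlying idea of splitting a mix into components can be sketched as a toy frequency-domain masking example. Everything here (the two sine-tone "instruments", the 400 Hz cutoff) is illustrative, not a real separator:

```python
import numpy as np

# Two "instruments" as pure sine tones at different frequencies,
# mixed into one signal, then recovered with frequency-domain masks.
sr = 8000                              # sample rate (Hz), illustrative
t = np.arange(sr) / sr                 # 1 second of audio
bass = np.sin(2 * np.pi * 110 * t)     # "bass" stem at 110 Hz
lead = np.sin(2 * np.pi * 880 * t)     # "lead" stem at 880 Hz
mix = bass + lead

spectrum = np.fft.rfft(mix)
freqs = np.fft.rfftfreq(len(mix), d=1 / sr)

# Binary masks: everything below 400 Hz goes to the bass stem,
# the rest to the lead stem.
bass_est = np.fft.irfft(np.where(freqs < 400, spectrum, 0))
lead_est = np.fft.irfft(np.where(freqs >= 400, spectrum, 0))

# With non-overlapping tones the stems are recovered almost exactly.
print(np.allclose(bass_est, bass, atol=1e-6))
print(np.allclose(lead_est, lead, atol=1e-6))
```

Real music has instruments whose spectra overlap heavily, which is why learned separators are needed; the payoff is the same, though: each recovered stem is a far lower-dimensional, more structured input than the raw mix.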
Mostly private IP still, but this is bound to come out at some point. There are some folks on Fiverr who've clearly figured out some kind of working solution.