> <i>These outputs are from text-conditioned latent audio diffusion models on raw audio, no spectrograms. We hope to get models like this released within the next few months! </i> [1]<p>It is difficult to judge without knowing the "raw materials" for each generation (prompt/seeds…). Audio quality is understandably better than Riffusion's. What surprised me most in Riffusion were those coherent, intricate (and musical!) melody lines. Here I'm (now) hearing mostly EDM/IDM(ish) music, and it is the tonal textures (and beats) that are surprisingly coherent.<p>I suppose the samples are highly curated. Still, it is promising.<p>[1] <a href="https://twitter.com/harmonai_org/status/1604903187978547201" rel="nofollow">https://twitter.com/harmonai_org/status/1604903187978547201</a>