Other author here! This got posted a little earlier than we intended so we didn't have our GPUs scaled up yet. Please hang on and try throughout the day!<p>Meanwhile, please read our about page <a href="http://riffusion.com/about" rel="nofollow">http://riffusion.com/about</a><p>It’s all open source and the code lives at <a href="https://github.com/hmartiro/riffusion-app" rel="nofollow">https://github.com/hmartiro/riffusion-app</a> --> if you have a GPU you can run it yourself<p>This has been our hobby project for the past few months. Seeing the incredible results of stable diffusion, we were curious if we could fine tune the model to output spectrograms and then convert to audio clips. The answer to that was a resounding yes, and we became addicted to generating music from text prompts. There are existing works for generating audio or MIDI from text, but none as simple or general as fine tuning the image-based model. Taking it a step further, we made an interactive experience for generating looping audio from text prompts in real time. To do this we built a web app where you type in prompts like a jukebox, and audio clips are generated on the fly. To make the audio loop and transition smoothly, we implemented a pipeline that does img2img conditioning combined with latent space interpolation.
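For readers curious what "img2img conditioning combined with latent space interpolation" looks like in practice, here is a minimal sketch of the latent-interpolation half using a diffusers-style pipeline. It is not the authors' exact code (their pipeline also conditions on the previous clip via img2img), and the checkpoint name, prompt, and seeds are illustrative assumptions.

```python
# Minimal sketch of latent-space interpolation between two seeds, assuming a
# diffusers StableDiffusionPipeline and the checkpoint name below. The real
# riffusion pipeline additionally does img2img conditioning on the previous
# clip so consecutive clips loop smoothly.
import torch
from diffusers import StableDiffusionPipeline

def slerp(t, a, b):
    """Spherical interpolation between two latent tensors."""
    a_n, b_n = a / a.norm(), b / b.norm()
    omega = torch.acos((a_n * b_n).sum().clamp(-1, 1))
    return (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)

pipe = StableDiffusionPipeline.from_pretrained(
    "riffusion/riffusion-model-v1", torch_dtype=torch.float16
).to("cuda")

shape = (1, pipe.unet.config.in_channels, 64, 64)  # latent shape for 512x512 output
lat_a = torch.randn(shape, generator=torch.Generator("cuda").manual_seed(42),
                    device="cuda", dtype=torch.float16)
lat_b = torch.randn(shape, generator=torch.Generator("cuda").manual_seed(1337),
                    device="cuda", dtype=torch.float16)

# Walking between the two seeds yields spectrograms that morph gradually,
# which is what makes the clip-to-clip transitions sound smooth.
for i, t in enumerate(torch.linspace(0, 1, 10)):
    image = pipe("funk bassline with a jazzy saxophone solo",
                 latents=slerp(float(t), lat_a, lat_b)).images[0]
    image.save(f"spectrogram_{i:02d}.png")
```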
This really is unreasonably effective. Spectrograms are a lot less forgiving of minor errors than a painting. Move a brush stroke up or down a few pixels, you probably won't notice. Move a spectral element up or down a bit and you have a completely different sound. I don't understand how this can possibly be precise enough to generate anything close to a cohesive output.<p>Absolutely blows my mind.
This is a genius idea. Using an already-existing and well-performing image model, and just encoding input/output as a spectrogram... It's elegant, it's obvious in retrospect, it's just pure genius.<p>I can't wait to hear some serious AI music-making a few years from now.
Some of this is really cool! The 20 step interpolations are very special, because they're concepts that are distinct and novel.<p>It absolutely sucks at cymbals, though. Everything sounds like realaudio :) composition's lacking, too. It's loop-y.<p>Set this up to make AI dubtechno or trip-hop. It likes bass and indistinctness and hypnotic repetitiveness. Might also be good at weird atonal stuff, because it doesn't inherently have any notion of what a key or mode is?<p>As a human musician and producer I'm super interested in the kinds of clarity and sonority we used to get out of classic albums (which the industry has kinda drifted away from for decades) so the way for this to take over for ME would involve a hell of a lot more resolution of the FFT imagery, especially in the highs, plus a further layer of AI-ification that decides what distinct parts the song has (a layer that controls abrupt switches of prompt)<p>It could probably do bad modern production fairly well even now :) exaggeration, but not much: when stuff is really overproduced it starts to get way more indistinct, and this can do indistinct. It's realaudio grade, it needs to be more like 128kbps mp3 grade.
This is huge.<p>This shows me that Stable Diffusion can create anything that meets the following conditions:<p>1. Can be represented as a static item in two dimensions (their weaving together notwithstanding, it is still piece-by-piece statically built)<p>2. Acceptable with a certain amount of lossiness on the encoding/decoding<p>3. Can be presented through a medium that at some point in creation is digitally encoded somewhere.<p>This presents a lot of very interesting changes for the near term. ID.me and similar security approaches are basically dead. Chain of custody proof will become more and more important.<p>Can stable diffusion work across more than two dimensions?
I think there has to be a better way to make long songs...<p>For example, you could take half the previous spectrogram, shift it to the left, and then use the inpainting algorithm to make the next bit... Do that repeatedly, while smoothly adjusting the prompt, and I think you'd get pretty good results.<p>And you could improve on this even more by having a non-linear time scale in the spectrograms. Have 75% of the image be linear, but the remaining 25% represent an exponentially downsampled version of history. That way, the model has access to what was happening seconds, minutes, and hours ago (although less detail for longer time periods ago).
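A rough sketch of the shift-and-inpaint idea described above, assuming the spectrogram model could be loaded into a diffusers inpainting pipeline; the checkpoint below is a placeholder (the stock inpainting model is not trained on spectrograms), and the prompt and loop length are arbitrary.

```python
# Sketch of the shift-and-inpaint loop (an idea, not how riffusion works today):
# keep the left half of the previous spectrogram as context and have an
# inpainting model fill in the right half, over and over.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",  # placeholder; a spectrogram-tuned
    torch_dtype=torch.float16,               # inpainting checkpoint would be needed
).to("cuda")

W, H = 512, 512
mask = Image.new("L", (W, H), 0)
mask.paste(255, (W // 2, 0, W, H))  # white = right half, the region to regenerate

spectrogram = Image.open("seed_spectrogram.png").convert("RGB").resize((W, H))
clips = [spectrogram]

for step in range(8):
    # Shift left by half a window: the old right half becomes the new left half.
    shifted = Image.new("RGB", (W, H))
    shifted.paste(spectrogram.crop((W // 2, 0, W, H)), (0, 0))

    spectrogram = pipe(
        prompt="lo-fi hip hop beat",  # vary this slowly to steer the piece
        image=shifted,
        mask_image=mask,
    ).images[0]
    clips.append(spectrogram)

# Decoding each new right half to audio and concatenating gives a long track.
```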
I bet a cool riff on this would be to simply sample an ambient microphone in the workplace and use that to generate and slowly introduce matching background music that fits the current tenor of the environment. Done slowly and subtly enough I'd bet the listener may not even be entirely aware it's happening.<p>If we could measure certain kinds of productivity it might even be useful as a way to "extend" certain highly productive ambient environments a la "music for coding".
Producing images of spectrograms is a genius idea. Great implementation!<p>A couple of ideas that come to mind:<p>- I wonder if you could separate the audio tracks of each instrument, generate separately, and then combine them. This could give more control over the generation. Alignment might be tough, though.<p>- If you could at least separate vocals and instrumentals, you could train a separate model for vocals (LLM for text, then text to speech, maybe). The current implementation doesn't seem to handle vocals as well as TTS models.
This opens up ideas. One thing people have tried to do with stable diffusion is create animations. Of course, they all come out pretty janky and gross, you can't get the animation smooth.<p>But what if a model was trained not on single images, but animated sequential frames, in sets, laid out on a single visual plane. So a panel might show a short sequence of a disney princess expressing a particular emotion as 16 individual frames collected as a single image. One might then be able to generate a clean animated sequence of a previously unimagined disney princess expressing any emotion the model has been trained on. Of course, with big enough models one could (if they can get it working) produce text-prompted animations across a wide variety of subjects and styles.
The vocals in these tracks are so interesting. They sound like vocals, with the right tone, phonemes, and structure for the different styles and languages but no meaning.<p>Reminds me of the soundtrack to Nier Automata which did a similar thing: <a href="https://youtu.be/8jpJM6nc6fE" rel="nofollow">https://youtu.be/8jpJM6nc6fE</a>
Another related audio diffusion model (but without text prompting) here:
<a href="https://github.com/teticio/audio-diffusion" rel="nofollow">https://github.com/teticio/audio-diffusion</a>
Earlier this year it was graphic designers, last month software engineers, and now musicians are also feeling the effects.<p>Who will AI send looking for a new job next?
This looks great and the idea is amazing. I tried with the prompt: "speed metal" and "speed metal with guitar riffs" and got some smooth rock-ballad type music. I guess there was no heavy metal in the learning samples haha.<p>Great work!
Fun! I tried something similar with DCGAN when it first came out, but that didn't exactly make nice noises. The conversion to and from Mel spectrograms was lossy (to put it mildly), and DCGAN, while impressive in its day, is nothing like the stuff we have today.<p>Interesting that it gets such good results from just fine tuning the regular SD model. I assume most of the images it's trained on are useless for learning how to generate Mel spectrograms from text, so a model trained from scratch could potentially do even better.<p>There's still the issue of reconstructing sound from the spectrograms. I bet it's responsible for the somewhat tinny sound we get from this otherwise very cool demo.
Interesting. I experimented a bit with the approach of using diffusion on whole audio files, but I ultimately discarded it in favor of generating various elements of music separately. I'm happy with the results of my project of composing melodies (<a href="https://www.youtube.com/playlist?list=PLoCzMRqh5SkFPG0-RIAR8jYRaICWubUdx" rel="nofollow">https://www.youtube.com/playlist?list=PLoCzMRqh5SkFPG0-RIAR8...</a>) and I still think this is the way to go, but that was before Stable Diffusion came out. These are interesting results though, maybe it can lead to something more.
It may be clearer to those of you who are smarter than me, but I guess I've only recently begun to appreciate what these experiments show--that AI graphical art, literature, music and the like will not succeed in lowering the barriers to humans making things via machines but in training humans to respond to art that is generated by machines. Art will not be challenging but designed by the algorithm to get us to like it. Since such art can be generated for essentially no cost, it will follow a simple popularity model, and will soon suck like your Netflix feed.
I’d been wondering (naively) if we’d reached the point where we won’t see any new kinds of music now that electronic synthesis allows us to make any possible sound. Changes in musical styles throughout history tend to have been brought about by people embracing new instruments or technology.<p>This is the most exciting thing I’ve seen in ages as it shows we may be on the verge of the next wave of new technology in music that will allow all sorts of weird and wonderful new styles to emerge. I can’t wait to see what these tools can do in the hands of artists as they become more mainstream.
For the 30th anniversary? <a href="https://warp.net/gb/artificial-intelligence" rel="nofollow">https://warp.net/gb/artificial-intelligence</a>
Wow those examples are shockingly good. It's funny that the lyrics are garbled analogously to text in stable diffusion images.<p>The audio quality is surprisingly good, but does sound like it's being played through an above-average quality phone line. I bet you could tack on an audio-upres model afterwards. Could train it by turning music into comparable-resolution spectrograms.
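One way the suggested upres training data could be prepared, as a sketch: degrade the audio to roughly phone-line bandwidth and pair its spectrogram with the full-quality one. The degradation method and parameters are assumptions, not part of riffusion.

```python
# Sketch of building (degraded, clean) spectrogram pairs for an upres model:
# band-limit the audio to mimic the "phone line" quality, then pair its
# spectrogram with the full-quality one. Parameters are assumptions.
import numpy as np
import librosa

def mel_db(y, sr):
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512)
    return librosa.power_to_db(S, ref=np.max)

y, sr = librosa.load("track.wav", sr=44100)
y_low = librosa.resample(y, orig_sr=sr, target_sr=8000)      # throw away the highs
y_low = librosa.resample(y_low, orig_sr=8000, target_sr=sr)  # back to 44.1 kHz

clean = mel_db(y, sr)         # target spectrogram
degraded = mel_db(y_low, sr)  # input resembling the current output quality
```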
How come the Stable Diffusion model helps here? Does the fact that it knows what an astronaut on a horse looks like have an effect on the audio? Would starting the training from an empty model work too?
The results of this are similar to my nitpicks of AI generated images (well, duh!). There's definitely something recognizable there, but something's just not quite right about it.<p>I'm quite impressed that there was enough training data within SD to know what a spectrogram looks like for the different sounds.
I just wanted to say you guys did an amazing job packaging this up. I managed to get a local instance up and running against my local GPU in less than 10 minutes.
This is so good that I wondered if it's fake. Really impressive results from generated spectrograms! Also really interesting that it's not exactly trained on the audio files themselves - wonder if the usual copyright-based objections would even apply here.
@haykmartiros, @seth_, thank you for open sourcing this!<p>Played a bit with the very impressive demos, now waiting in queue for my very own riff to get generated.<p>Great as this is, I'm imagining what it could do for song crossfades (actual mixing instead of plain crossfade even with beat matching).
Plug this into a video game and you could have GTA 6 where the NPCs have full dialogue with the appearance of sentience, concerts where imaginary bands play their imaginary catalogue live to you and all kinds of other dynamically generated content.
Is there a different mapping of FFT information to a two dimensional image that would make harmonic relationships more obvious?<p>I.e., use a polar coordinate system where angle 12 o'clock is 440 Hz, and the 12 chromatic notes would be mapped to the angle of the hours. Maybe red pixel intensity is bit mapped to octave, i.e. first, third and eighth octave: 0b10100001.<p>Time would be represented by radius. Unfortunately the space wouldn't wrap nicely like if there was a native image format for representing donuts.
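A quick numpy sketch of the clock-face layout proposed above (angle = pitch class with A at 12 o'clock, radius = time, magnitude accumulated per pixel). The octave-to-red-channel bit mapping is left out, and all sizes and STFT parameters are arbitrary choices.

```python
# Rough numpy sketch of the clock-face mapping: angle = pitch class (A at 12
# o'clock), radius = time, and STFT magnitude accumulated into each pixel.
import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=22050)
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))  # (freq_bins, frames)
freqs = librosa.fft_frequencies(sr=sr, n_fft=2048)

size = 512
img = np.zeros((size, size), dtype=np.float32)
cx = cy = size // 2
n_frames = S.shape[1]

for t in range(n_frames):
    radius = (t + 1) / n_frames * (size // 2 - 1)      # time grows outward
    for f in range(1, len(freqs)):                     # skip the DC bin
        midi = librosa.hz_to_midi(freqs[f])            # 69 corresponds to A4 = 440 Hz
        angle = 2 * np.pi * ((midi - 69) % 12) / 12 - np.pi / 2  # A at 12 o'clock
        x = int(cx + radius * np.cos(angle))
        y_px = int(cy + radius * np.sin(angle))
        img[y_px, x] += S[f, t]                        # octave information is discarded here
```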
This is genuinely amazing. Like with any AI there are areas it's better at than others. I hope it doesn't go unnoticed just because people try typing "death metal" and are not happy with the results. This one seems to excel at 70-130BPM lo-fi vaporwave/washed out electronic beats. Think Ninja Tune or the modern lo-fi beats to study to. Some of this stuff genuinely sounds like something I'd encounter on Bandcamp or Soundcloud.<p>I think I'm beginning to crack the code with this one, here's my attempt at something like a "DJ set" with this. My goal was to have the smoothest transitions possible and direct the vibe just like I would doing a regular DJ set.
<a href="https://www.youtube.com/watch?v=BUBaHhDxkIc">https://www.youtube.com/watch?v=BUBaHhDxkIc</a><p>I wonder if this could be the future of DJing or perhaps beginning of a new trend in live music making kind of like Sonic Pi. Instead of mixing tracks together, the DJ comes up with prompts on the spot and tweaks AI parameters to achieve the desired direction. Pretty awesome.
Congratulations this is an amazing application of technology and truly innovative. This could be leveraged by a wide range of applications that I hope you'll capitalize on.
Incredible stuff, Seth & Hayk. I've been thinking nonstop about new things to build using Stable Diffusion and this is the first one that really took my breath away.
This is amazing! Would it be possible to use it to interpolate between two existing songs (i.e. generate spectrograms from audio and transition between them)?
This is so completely wild. Love the novelty and inventiveness.<p>Could anyone help me understand whether using SVG instead of bitmap image would be possible? I realize that probably wouldn't be taking advantage of the current diffusion part of Stable-Diffusion, but my intuition is maybe it would be less noisy or offer a cleaner/more compressible path to parsing transitions in the latent space.<p>Great idea? Off base entirely? Would love some insight either way :D
A musician friend of mine told me that this is (I freely translate) a perversion: building in the frequency domain and getting time back. Don't shoot the messenger.<p>Personally I like the results. I'm totally untrained and couldn't hear any of the issues many comments are pointing out.<p>I guess that all of lounge/elevator music and probably most ad jingles will be automated soon, if automation costs less than human authors.
I find it really cool that the "uncanny valley" that's audible on nearly every sample is exactly how I would imagine the visual artifacts that crop up in most generated art would sound. Not really surprising I guess, but still cool that there's such a direct correlation between completely different mediums!
Things similar to the “interpolation” part (not the generative part) are already used extensively especially for game and movie sound design. Kyma [1] is the absolute leader (it requires expensive hardware though). IMO later iterations on this approach may lead to similar or better results.<p>FYI, other apps that use more classic but still complex Spectral/Granular algos :<p><a href="https://www.thecargocult.nz/products/envy" rel="nofollow">https://www.thecargocult.nz/products/envy</a><p><a href="https://transformizer.com/products/" rel="nofollow">https://transformizer.com/products/</a><p><a href="https://www.zynaptiq.com/morph/" rel="nofollow">https://www.zynaptiq.com/morph/</a><p>[1] <a href="https://kyma.symbolicsound.com/" rel="nofollow">https://kyma.symbolicsound.com/</a>
If copyright laws don't catch up, the sampling industry is cooked.<p>Made this: <a href="https://soundcloud.com/obnmusic/ai-sampling-riffusion-waves-lapping-on-a-shore-to-nowhere" rel="nofollow">https://soundcloud.com/obnmusic/ai-sampling-riffusion-waves-...</a>
great stuff. while it comes with the usual smeary iFFT artifacts that AI-generated sound tends to have, the results are surprisingly good.
i especially love the nonsense vocals it generates in the last example, which remind me of what singing along to foreign songs felt like in my childhood.
I was thinking about this - what if someone trained a stable diffusion type model on all of the worlds commercial music? This model would probably produce quite amazing music given enough prompting and I'm wondering if the music industry would be allowed to claim copyright on works created with such a model. Would it be illegal or is this just like a musician picking up ideas from hearing the world of music? Is it really right to make learning a crime, even if machines are doing it? I'm conflicted after finding out that for sync licensing the music industry want a percentage of revenue based on your subscription fees, sometimes as high as 15%-20%! I'm surprised such a huge fee isn't considered some kind of protection racket.
<a href="https://soundcloud.com/toastednz/stablediffusiontoddedwards?si=e7018bdda1014b8084d47d0ade6bf1ee&utm_source=clipboard&utm_medium=text&utm_campaign=social_sharing" rel="nofollow">https://soundcloud.com/toastednz/stablediffusiontoddedwards?...</a><p>40 sec clip of uk garage/todd edwards style track made with riffusion -> serato studio with todd beats added
>> Prompt - When providing prompts, get creative! Try your favorite artists, instruments like saxophone or violin, modifiers like arabic or jamaican, genres like jazz or rock, sounds like church bells or rain, or any combination. Many words that are not present in the training data still work because the text encoder can associate words with similar semantics. The closer a prompt is in spirit to the seed image and BPM, the better the results. For example, a prompt for a genre that is much faster BPM than the seed image will result in poor, generic audio.<p>(1) Is there a corpus the keywords were collected from?<p>(2) Is it possible to model the proximity of the image to keywords and sets of keywords?
I can’t help but see parallels to synesthesia. It’s amazing how capable these models are at encoding arbitrary domain knowledge as long as you can represent it visually w/ reasonable noise margins.
As a musician, I'll start worrying once an AI can write at the level of sophistication of a Bill Withers song:<p><a href="https://www.youtube.com/watch?v=nUXgJkkgWCg">https://www.youtube.com/watch?v=nUXgJkkgWCg</a><p>Not simply SOUND like a Bill Withers song, but to have the same depth of meaning and feeling.<p>At that point, even if we lose we win because we'll all be drowning in amazing music. Then we'll have a different class of problem to contend with.
This is awesome! It would be interesting to generate individual stems for each instrument, or even MIDI notes to be rendered by a DAW and VST plugins. It's unfortunate that most musicians don't release the source files for their songs so it's hard to get enough training data. There's a lot of MIDI files out there but they don't usually have information about effects, EQ, mastering, etc.
Pretty nice, I was just talking to a friend about needing a music version of chatgpt, so thank you for this.<p>Wondering if it would be possible to create a version of this that you can point at a person's SoundCloud and have it emulate their style / create more music in the style of the original artist. I have a couple albums worth of downtempo electronic music I would love to point something like this at and see what it comes up with.
Could this concept be inverted, to identify music from a text prompt, as in I want a particular vibe and it can go tell me what music fits that description? I've always thought the ability to find music you like was very lacking; it shouldn't be bounded by genre. Instead it's usually rhythmic and melodic structures that appeal to you, regardless of what type of music it might be.
Really great! I've been using diffusion as well to create sample libraries. My angle is to train models strictly on chord progression annotated data as opposed to the human descriptions so they can be integrated into a DAW plugin. Check it out: <a href="https://signalsandsorcery.org/" rel="nofollow">https://signalsandsorcery.org/</a>
You can train/finetune a Stable Diffusion model at an arbitrary aspect ratio/resolution, and then the model starts creating coherent images. It would be cool to try finetuning/training this model on entire songs by extending the time dimension (the attention layer at the usual 64x64 resolution should also be removed, or it would eat too much memory).
Amazing project! Here is a demo including an input for negative prompt. It's impressive how it works. You can try:<p>prompt: relaxing jazz melody bass music
negative_prompt: piano music<p><a href="https://huggingface.co/spaces/radames/spectrogram-to-music" rel="nofollow">https://huggingface.co/spaces/radames/spectrogram-to-music</a>
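For reference, the negative prompt in the linked space corresponds roughly to the standard diffusers parameter of the same name. A sketch, assuming the fine-tuned checkpoint is available on the Hugging Face Hub as "riffusion/riffusion-model-v1":

```python
# Sketch of the same prompt/negative-prompt pair via the standard diffusers
# parameters, assuming the checkpoint name below is where the model is published.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "riffusion/riffusion-model-v1", torch_dtype=torch.float16
).to("cuda")

spectrogram = pipe(
    prompt="relaxing jazz melody bass music",
    negative_prompt="piano music",   # guidance is steered away from this
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
spectrogram.save("riff.png")
```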
I was confused because I must have missed that the working web app is at <a href="https://www.riffusion.com/" rel="nofollow">https://www.riffusion.com/</a>. Go to <a href="https://www.riffusion.com/" rel="nofollow">https://www.riffusion.com/</a> and press the play button to see it in action!
I found this awesome podcast that goes into several AI & music related topics <a href="https://open.spotify.com/show/2wwpj4AacVoL4hmxdsNLIo?si=IAaJ6A3yQ1GT5na4iSZnRA" rel="nofollow">https://open.spotify.com/show/2wwpj4AacVoL4hmxdsNLIo?si=IAaJ...</a><p>They even talk specifically about applying stable diffusion and spectrograms.
Awesome work.<p>Would you be willing to share details about the fine-tuning procedure, such as the initialization, learning rate schedule, batch size, etc.? I'd love to learn more.<p>Background: I've been playing around with generating image sequences from sliding windows of audio. The idea roughly works, but the model training gets stuck due to the difficulty of the task.
"<a href="https://en.wikipedia.org/wiki/Spectrogram" rel="nofollow">https://en.wikipedia.org/wiki/Spectrogram</a> - can we already do sound via image? probably soon if not already"<p>Me in the Stable Diffusion discord, 10/24/2022<p>The ppl saying this was a genius idea should go check out my other ideas
I wonder if subliminal messaging will somehow make a comeback once we have ai generated audio and video. Something like we type "Fun topic" and those controlling the servers will inject "and praise to our empire/government/tyrant" to the suggestion or something like that.
If such unreasonably good music can be created based on information encoded in an image, I'm wondering what other things we could do with this flow:<p>1) Write text to describe the problem
2) Generate an image Y that encodes that information
3) Parse that image Y to do X<p>Example:
Y = blueprint, X = Constructing a building with that blueprint
If it can do music, can we train better models for different kinds of music? Or do different models for different instruments make more sense? For different instruments we could get better resolution by making the spectrogram represent different frequency ranges. This is terribly exciting, what a time to be alive.
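A small sketch of the narrower-frequency-range idea: keep only the band an instrument occupies so the image's rows cover it in finer detail. The band edges and STFT settings are illustrative assumptions.

```python
# Sketch of devoting the image's rows to a narrower band per instrument:
# keep only the bass range so the same number of pixels covers finer detail.
import numpy as np
import librosa

y, sr = librosa.load("bass_stem.wav", sr=44100)
S = np.abs(librosa.stft(y, n_fft=8192, hop_length=512))
freqs = librosa.fft_frequencies(sr=sr, n_fft=8192)

band = (freqs >= 30) & (freqs <= 800)   # roughly where a bass lives (assumed)
S_bass = S[band, :]                     # resize/stack these rows into the 512-row image
```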
Multiple folks have asked here and in other forums but I'm going to reiterate, what data set of paired music-captions was this trained on? It seems strange to put up a splashy demo and repo with model checkpoints but not explain where the model came from... is there something fishy going on?
Today's music generation is putting my Pop Ballad Generator to shame: <a href="http://jsat.io/blog/2015/03/26/pop-ballad-generator/" rel="nofollow">http://jsat.io/blog/2015/03/26/pop-ballad-generator/</a>
I feel like the next step here is to get a GPT3 like model to parse everything ever written about every piece of music which is on the internet (and in every pdf on libgen and scihub) and link them to spectrograms of that music<p>and then things are going to get wild<p>I am so blessed to live in this era :)
I know it sounds like I am going to be sarcastic, but I mean all of this in earnest and with good intention. Everything this generates is somehow worse than the thing it generated before it. Like the uncanny valley of audio had never been traversed in such high fidelity. Great work!
Anyone interested in joining an unofficial Riffusion Discord, let's organize here: <a href="https://discord.gg/DevkvXMJaa" rel="nofollow">https://discord.gg/DevkvXMJaa</a><p>Would be nice to have a channel where people can share Riffs they come up with.
Anyone know how I could try using this with Elixir Livebook and <a href="https://github.com/elixir-nx/bumblebee">https://github.com/elixir-nx/bumblebee</a>?<p>I'm new, but this is something that would get me going.
I'm a short fiction writer. Do you think I could get one of these new models to write a good story?<p>I'd want to train it to include foreshadowing, suspense, relatable characters and perhaps a twist ending that cleverly references the beginning.
I read the article:<p>"If you have a GPU powerful enough to generate stable diffusion results in under five seconds, you can run the experience locally using our test flask server."<p>Curious what sort of GPU the author was using or what some of the min requirements might be?
> <a href="https://www.riffusion.com/?&prompt=punk+rock+in+11/8" rel="nofollow">https://www.riffusion.com/?&prompt=punk+rock+in+11/8</a><p>Tried getting something in an odd time signature, but it still comes out 4/4.
Very cool, but the music still has a very "rough", almost abrasive tinge to it. My hunch is that it has to do with the phase estimates being off.<p>Who's going to be first to take this approach and use it to generate human speech instead?
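Context on the phase hunch: the generated image stores only magnitudes, so the phase has to be estimated when converting back to audio, typically with a Griffin-Lim style iteration. Below is a sketch of that step; the pixel-to-dB mapping and STFT parameters are assumptions, not riffusion's actual settings.

```python
# Sketch of going from a generated spectrogram image back to audio. Only
# magnitudes are stored in the image, so Griffin-Lim has to guess the phase,
# which is a plausible source of the rough tinge. Pixel-to-dB mapping and STFT
# parameters are assumptions.
import numpy as np
from PIL import Image
import librosa
import soundfile as sf

img = np.asarray(Image.open("generated_spectrogram.png").convert("L"), dtype=np.float32)
img = img[::-1, :]                      # flip so row 0 is the lowest frequency

S_db = img / 255.0 * 80.0 - 80.0        # assumed mapping: [0, 255] -> [-80, 0] dB
S_power = librosa.db_to_power(S_db)

audio = librosa.feature.inverse.mel_to_audio(
    S_power, sr=44100, n_fft=8192, hop_length=512, n_iter=64  # Griffin-Lim iterations
)
sf.write("reconstructed.wav", audio, 44100)
```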
This is just insane. Sooooo incredible. You don't really realize how far things have come until it hits a domain you're extremely familiar with. Spent 8-9 years in music production and the transition stuff blew me away.
"Uh oh! Servers are behind, scaling up..." - havent' been able to get past that yet. Anyone getting new output?<p>This is already better than most techno. I can see DJs using this, typing away.
Do you guys think AI creative tools will completely subsume the possibility space of human made music? Or does it open up a new dimension of possibilities orthogonally to it? Hard for me to imagine how AI would be able to create something as unique and human as D'Angelo's Voodoo (esp. before he existed) but maybe it could (eventually).<p>If I understand these AI algorithms at a high level, they're essentially finding patterns in things that already exist and replicating them with some variation quite well. But a good song is perfect/precise in each moment in time. Maybe we'll only ever be able to get asymptotically closer but never _quite_ there to something as perfectly crafted as a human could make? Maybe there will always be a frontier space only humans can explore?
Can anyone confirm/deny my theory that AI audio generation has been lagging behind progress in image generation because it’s way easier to get a billion labeled images than a billion labeled audio clips?
Also check the similar work on arxiv:<p>Multi-instrument Music Synthesis with Spectrogram Diffusion:<p><a href="https://arxiv.org/abs/2206.05408" rel="nofollow">https://arxiv.org/abs/2206.05408</a>
I propose that while you are GPU limited, you make these changes:<p>* Don't do the alpha fade to the next prompt - just jump straight to alpha=1.0.<p>* Pause the playback if the server hasn't responded in time, rather than looping.
Was just watching an interview of Billy Corgan (smashing pumpkins) on Rick Beato’s YouTube[1] last night where billy was lamenting the inevitable future where the “psychopaths” in the music biz will use ai and auto tune to churn out three chord non-music mumble rap for the youth of tomorrow, or something to that effect. It was funny because it’s the sad truth. It’s already here but new tech will allow them to cut costs even more, and increase their margins. No need for musicians. Really cool on one hand, in the same way fentanyl is cool — or the cotton gin, but a bit depressing on the other, if you care about musicians. I and a few others will always pay to go the symphony, so good players will find a way get paid, but this is what kids will listen to, because of the profit margin alone.<p>[1]
<a href="https://m.youtube.com/watch?v=nAfkxHcqWKI" rel="nofollow">https://m.youtube.com/watch?v=nAfkxHcqWKI</a>
Very cool! I was wondering why there weren't any music diffusion apps out there; it seems more useful because music has stricter copyright and all content creators need some background music ...
This is a brilliant idea.<p>Also, spectrograms will never generate plausible high quality audio. (I think)<p>So I think the next move is to map the generated audio back over to synthesizers and samples via MIDI …
This happened earlier than I expected, and using a much different technique than I expected.<p>Bracing myself for when major record labels enter the copyright brawl that diffusion technology is sparking.
I wonder if this would be applicable to video game music. You could make stuff that's less repetitive but that still smoothly transitions to specific cues on in-game events.
Coming at this from a layman's perspective, would it be possible to generate a different sort of spectrogram that's designed for SD to iterate upon even more easily?
I got an actual `HTTP 402: PAYMENT_REQUIRED` response (never seen one of those in the wild, according to Mozilla it is experimental). Someone's credit card stopped scaling?
I’m curious about the limitations of using spectrograms for transient-heavy sounds like drums.<p>It seems like you’d need very high resolution spectrograms to get a consistently snappy drum sound.
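A quick back-of-envelope on that, with assumed (not riffusion's actual) STFT settings:

```python
# Back-of-envelope for transient resolution (assumed, not riffusion's settings).
sr = 44100          # samples per second
hop_length = 512    # samples between spectrogram columns
n_fft = 2048        # analysis window length

time_per_column_ms = hop_length / sr * 1000   # ~11.6 ms per pixel column
window_ms = n_fft / sr * 1000                 # ~46 ms smeared into each column

print(f"{time_per_column_ms:.1f} ms per column, ~{window_ms:.0f} ms analysis window")
# A snare hit lasting a few milliseconds gets smeared across an entire ~46 ms
# window, which is one reason drums come out soft; shortening the window
# sharpens time but blurs frequency (the usual STFT trade-off).
```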
I prefer to make my spectrograms by hand. <a href="https://youtu.be/HT0HH_fc4ZU" rel="nofollow">https://youtu.be/HT0HH_fc4ZU</a>
This is what I've been talking about all year. It is such a relief to see it actually happen.<p>In summary: The search for AGI is dead. Intelligence was here and more general than we realized this whole time. Humans are not special as far as intelligence goes. Just look how often people predict that an AI cannot X or Y or Z. And then when an AI does one of those things they say, "well it cannot A or B or C".<p>What is next: This trend is going to accelerate as people realize that AI's power isn't in replacing human tasks with AI agents, but letting the AI operate in latent spaces and domains that we never even thought about trying.
This is really cool but can someone tell me why we are automating art? Who asked for this? The future seems depressing when I look at all this AI generated art.
So this is slightly bending my mind again. Somehow image generators were more comprehensible compared to getting coherent music out. This is incredible.
Seems to be victim of its own success:<p>- No new result available, looping previous clip<p>- Uh oh! Servers are behind, scaling up<p>I hope Vercel people can give you some free credits to scale it up.
impressive stuff. reminds me of when ppl started using image classifier networks on spectrograms in order to classify audio. i would not have thought to apply a similar concept for generative models, but it seems obvious in hindsight.
it seems that SD does cover everything in terms of generative ai. Speaking of music, very interesting paper and demo. Just wondering in terms of license and commercialization, what kind of mess are we expecting here?
the problem is it <i>sounds</i> awful, like a 64kbps MP3 or worse<p>Perhaps AI can be trained to create music in different ways than generating spectrograms and converting them to audio?