TechEcho

Riffusion – Stable Diffusion fine-tuned to generate music

2421 points by MitPitt, over 2 years ago

142 comments

haykmartiros over 2 years ago

Other author here! This got posted a little earlier than we intended, so we didn't have our GPUs scaled up yet. Please hang on and try throughout the day!

Meanwhile, please read our about page: http://riffusion.com/about

It's all open source and the code lives at https://github.com/hmartiro/riffusion-app (if you have a GPU you can run it yourself).

This has been our hobby project for the past few months. Seeing the incredible results of stable diffusion, we were curious if we could fine-tune the model to output spectrograms and then convert them to audio clips. The answer to that was a resounding yes, and we became addicted to generating music from text prompts. There are existing works for generating audio or MIDI from text, but none as simple or general as fine-tuning the image-based model. Taking it a step further, we made an interactive experience for generating looping audio from text prompts in real time. To do this we built a web app where you type in prompts like a jukebox, and audio clips are generated on the fly. To make the audio loop and transition smoothly, we implemented a pipeline that does img2img conditioning combined with latent space interpolation.
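The latent space interpolation mentioned here is commonly implemented as spherical linear interpolation (slerp) between the latent noise vectors of two generations. A minimal numpy sketch of the idea (illustrative only, not the project's actual code):

```python
import numpy as np

def slerp(v0, v1, t):
    """Spherical linear interpolation between two latent vectors.

    t=0 returns v0, t=1 returns v1; intermediate t values travel along
    the great-circle arc between them, which tends to keep the latents
    on the distribution the diffusion model expects.
    """
    v0n = v0 / np.linalg.norm(v0)
    v1n = v1 / np.linalg.norm(v1)
    dot = np.clip(np.dot(v0n, v1n), -1.0, 1.0)
    theta = np.arccos(dot)  # angle between the two latents
    if np.isclose(theta, 0.0):
        return (1 - t) * v0 + t * v1  # nearly parallel: plain lerp is fine
    return (np.sin((1 - t) * theta) * v0 + np.sin(t * theta) * v1) / np.sin(theta)
```

Generating a clip per interpolation step and concatenating the decoded audio is what produces the smooth prompt-to-prompt transitions.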
valdiorn over 2 years ago

This really is unreasonably effective. Spectrograms are a lot less forgiving of minor errors than a painting. Move a brush stroke up or down a few pixels, you probably won't notice. Move a spectral element up or down a bit and you have a completely different sound. I don't understand how this can possibly be precise enough to generate anything close to a cohesive output.

Absolutely blows my mind.
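To get a feel for how unforgiving the encoding is, consider the frequency resolution of a typical linear spectrogram. The parameters below are assumed for illustration (Riffusion's actual settings may differ):

```python
import numpy as np

# How much does moving a spectrogram pixel one row change the sound?
sr = 44100    # sample rate in Hz (assumed)
n_fft = 2048  # FFT window size (assumed)

# Frequency spacing between adjacent image rows.
hz_per_bin = sr / n_fft
print(f"{hz_per_bin:.1f} Hz per pixel row")  # 21.5 Hz per pixel row

# Near A4 (440 Hz), being one row off is roughly this much pitch error:
cents = 1200 * np.log2((440 + hz_per_bin) / 440)
print(f"about {cents:.0f} cents")  # well over a quarter tone (50 cents)
```

So a single-pixel error around middle-of-the-keyboard frequencies is nearly a semitone of detuning, which is why the model's precision here is surprising.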
bheadmaster over 2 years ago

This is a genius idea. Using an already-existing and well-performing image model, and just encoding input/output as a spectrogram... It's elegant, it's obvious in retrospect, it's just pure genius.

I can't wait to hear some serious AI music-making a few years from now.
Applejinx over 2 years ago

Some of this is really cool! The 20-step interpolations are very special, because they're concepts that are distinct and novel.

It absolutely sucks at cymbals, though. Everything sounds like RealAudio :) Composition's lacking, too. It's loop-y.

Set this up to make AI dubtechno or trip-hop. It likes bass and indistinctness and hypnotic repetitiveness. Might also be good at weird atonal stuff, because it doesn't inherently have any notion of what a key or mode is?

As a human musician and producer I'm super interested in the kinds of clarity and sonority we used to get out of classic albums (which the industry has kinda drifted away from for decades), so the way for this to take over for ME would involve a hell of a lot more resolution of the FFT imagery, especially in the highs, plus some way to also do another AI-ification of what different parts of the song exist (like a further layer, but one that controls abrupt switches of prompt).

It could probably do bad modern production fairly well even now :) Exaggeration, but not much; when stuff is really overproduced it starts to get way more indistinct, and this can do indistinct. It's RealAudio grade; it needs to be more like 128kbps MP3 grade.
tomrod over 2 years ago

This is huge.

This shows me that Stable Diffusion can create anything with the following conditions:

1. Can be represented as a static item in two dimensions (their weaving together notwithstanding, it is still piece-by-piece statically built)

2. Acceptable with a certain amount of lossiness in the encoding/decoding

3. Can be presented through a medium that at some point in creation is digitally encoded somewhere.

This presents a lot of very interesting changes for the near term. ID.me and similar security approaches are basically dead. Chain-of-custody proof will become more and more important.

Can stable diffusion work across more than two dimensions?
londons_explore over 2 years ago

I think there has to be a better way to make long songs...

For example, you could take half the previous spectrogram, shift it to the left, and then use the inpainting algorithm to make the next bit... Do that repeatedly, while smoothly adjusting the prompt, and I think you'd get pretty good results.

And you could improve on this even more by having a non-linear time scale in the spectrograms. Have 75% of the image be linear, but the remaining 25% represent an exponentially downsampled version of history. That way, the model has access to what was happening seconds, minutes, and hours ago (although with less detail for longer time periods).
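The shift-and-inpaint loop described above can be sketched in a few lines. Here `inpaint` stands in for any image-inpainting function (its signature is hypothetical), and the spectrogram is treated as a plain (frequency, time) array:

```python
import numpy as np

def extend_song(spec, inpaint, n_steps, width=512):
    """Sketch of the comment's idea: repeatedly shift the spectrogram
    left by half its width, mask the now-stale right half, and ask an
    inpainting model to fill it in, accumulating the new halves.

    `inpaint(image, mask)` is a hypothetical callable that returns an
    image of the same shape with the masked region regenerated.
    """
    song = [spec]
    for _ in range(n_steps):
        shifted = np.roll(spec, -width // 2, axis=1)  # slide history left
        mask = np.zeros_like(spec, dtype=bool)
        mask[:, width // 2 :] = True                  # right half to regenerate
        spec = inpaint(shifted, mask)                 # model fills the future
        song.append(spec[:, width // 2 :])            # keep only the new audio
    return np.concatenate(song, axis=1)
```

Each iteration conditions on the previous half-window, so adjacent chunks share context and should join smoothly.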
seth_ over 2 years ago
Authors here: Fun to wake up to this surprise! We are rushing to add GPUs so you can all experience the app in real-time. Will update asap
bane over 2 years ago

I bet a cool riff on this would be to simply sample an ambient microphone in the workplace and use that to generate and slowly introduce matching background music that fits the current tenor of the environment. Done slowly and subtly enough, I'd bet the listener may not even be entirely aware it's happening.

If we could measure certain kinds of productivity, it might even be useful as a way to "extend" certain highly productive ambient environments a la "music for coding".
vikp over 2 years ago

Producing images of spectrograms is a genius idea. Great implementation!

A couple of ideas that come to mind:

- I wonder if you could separate the audio tracks of each instrument, generate separately, and then combine them. This could give more control over the generation. Alignment might be tough, though.

- If you could at least separate vocals and instrumentals, you could train a separate model for vocals (LLM for text, then text to speech, maybe). The current implementation doesn't seem to handle vocals as well as TTS models.
Fricken over 2 years ago

This opens up ideas. One thing people have tried to do with stable diffusion is create animations. Of course, they all come out pretty janky and gross; you can't get the animation smooth.

But what if a model was trained not on single images, but on animated sequential frames, in sets, laid out on a single visual plane? So a panel might show a short sequence of a Disney princess expressing a particular emotion as 16 individual frames collected as a single image. One might then be able to generate a clean animated sequence of a previously unimagined Disney princess expressing any emotion the model has been trained on. Of course, with big enough models one could (if they can get it working) produce text-prompted animations across a wide variety of subjects and styles.
quux over 2 years ago

The vocals in these tracks are so interesting. They sound like vocals, with the right tone, phonemes, and structure for the different styles and languages, but no meaning.

Reminds me of the soundtrack to Nier Automata, which did a similar thing: https://youtu.be/8jpJM6nc6fE
spyder over 2 years ago

Another related audio diffusion model (but without text prompting) here: https://github.com/teticio/audio-diffusion
michpoch over 2 years ago

Earlier this year it was graphic designers, last month it was software engineers, and now musicians are also feeling the effects.

Who else will AI send looking for a new job?
xtracto over 2 years ago

This looks great and the idea is amazing. I tried the prompts "speed metal" and "speed metal with guitar riffs" and got some smooth rock-ballad type music. I guess there was no heavy metal in the training samples, haha.

Great work!
vintermann over 2 years ago

Fun! I tried something similar with DCGAN when it first came out, but that didn't exactly make nice noises. The conversion to and from Mel spectrograms was lossy (to put it mildly), and DCGAN, while impressive in its day, is nothing like the stuff we have today.

Interesting that it gets such good results with just fine-tuning the regular SD model. I assume most of the images it's trained on are useless for learning how to generate Mel spectrograms from text, so a model trained from scratch could potentially do even better.

There's still the issue of reconstructing sound from the spectrograms. I bet it's responsible for the somewhat tinny sound we get from this otherwise very cool demo.
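Reconstructing audio from a magnitude-only spectrogram means estimating the missing phase, which is typically done with the Griffin-Lim algorithm: alternate between time and frequency domains while clamping the magnitudes. A minimal numpy/scipy sketch (parameters and iteration count are illustrative, not any project's actual settings):

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=32, nperseg=256):
    """Estimate phase for an STFT magnitude `mag` (freq x frames)."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        _, x = istft(mag * phase, nperseg=nperseg)      # to time domain
        _, _, S = stft(x, nperseg=nperseg)              # back to freq domain
        T = min(S.shape[1], mag.shape[1])               # guard frame-count drift
        phase = np.ones_like(phase)
        phase[:, :T] = np.exp(1j * np.angle(S[:, :T]))  # keep phase, drop magnitude
    _, x = istft(mag * phase, nperseg=nperseg)
    return x
```

The phase error this leaves behind is one plausible source of the "tinny" quality the comment describes.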
zone411 over 2 years ago

Interesting. I experimented a bit with the approach of using diffusion on whole audio files, but I ultimately discarded it in favor of generating various elements of music separately. I'm happy with the results of my project of composing melodies (https://www.youtube.com/playlist?list=PLoCzMRqh5SkFPG0-RIAR8jYRaICWubUdx) and I still think this is the way to go, but that was before Stable Diffusion came out. These are interesting results though; maybe it can lead to something more.
billiam over 2 years ago

It may be clearer to those of you who are smarter than me, but I guess I've only recently begun to appreciate what these experiments show: that AI graphical art, literature, music and the like will not succeed in lowering the barriers to humans making things via machines, but in training humans to respond to art that is generated by machines. Art will not be challenging, but designed by the algorithm to get us to like it. Since such art can be generated for essentially no cost, it will follow a simple popularity model, and will soon suck like your Netflix feed.
orobinson over 2 years ago

I'd been wondering (naively) if we'd reached the point where we won't see any new kinds of music, now that electronic synthesis allows us to make any possible sound. Changes in musical styles throughout history tend to have been brought about by people embracing new instruments or technology.

This is the most exciting thing I've seen in ages, as it shows we may be on the verge of the next wave of new technology in music that will allow all sorts of weird and wonderful new styles to emerge. I can't wait to see what these tools can do in the hands of artists as they become more mainstream.
superb-owl over 2 years ago
The interpolation from keyboard typing to jazz is incredible. This is what AI art should be.
talhof8 over 2 years ago

Really cool. Can't get this to work on the homepage though.

Might be a traffic thing?

Edit: Works now. A bit laggy, but it works. Brilliant!
ZiiS over 2 years ago

For the 30th anniversary? https://warp.net/gb/artificial-intelligence
mastax over 2 years ago

Wow, those examples are shockingly good. It's funny that the lyrics are garbled analogously to text in stable diffusion images.

The audio quality is surprisingly good, but does sound like it's being played through an above-average quality phone line. I bet you could tack on an audio-upres model afterwards. Could train it by turning music into comparable-resolution spectrograms.
Aardwolf over 2 years ago

How come the stable diffusion model helps here? Does the fact that it knows what an astronaut on a horse looks like have an effect on the audio? Would starting the training from an empty model work too?
flaviuspopan over 2 years ago

I'm floored, the typing-to-jazz demo is WILD! Please keep pushing this space, you've got something real special here.
dylan604 over 2 years ago

The results of this are similar to my nitpicks of AI-generated images (well, duh!). There's definitely something recognizable there, but something's just not quite right about it.

I'm quite impressed that there was enough training data within SD to know what a spectrograph looks like for the different sounds.
lftl over 2 years ago
I just wanted to say you guys did an amazing job packaging this up. I managed to get a local instance up and running against my local GPU in less than 10 minutes.
gedy over 2 years ago

This is so good that I wondered if it's fake. Really impressive results from generated spectrographs! Also really interesting that it's not exactly trained on the audio files themselves. I wonder if the usual copyright-based objections would even apply here.
senko over 2 years ago

@haykmartiros, @seth_, thank you for open sourcing this!

Played a bit with the very impressive demos; now waiting in queue for my very own riff to get generated.

Great as this is, I'm imagining what it could do for song crossfades (actual mixing instead of a plain crossfade, even with beat matching).
knicholes over 2 years ago

Does anyone have any good guides/tutorials for how to fine-tune Stable Diffusion? I'm not talking about textual inversion or dreambooth.
MagicMoonlight over 2 years ago
Plug this into a video game and you could have GTA 6 where the NPCs have full dialogue with the appearance of sentience, concerts where imaginary bands play their imaginary catalogue live to you and all kinds of other dynamically generated content.
sagebird over 2 years ago

Is there a different mapping of FFT information to a two-dimensional image that would make harmonic relationships more obvious?

I.e., use a polar coordinate system where the 12 o'clock angle is 440 Hz, and the 12 chromatic notes are mapped to the angles of the hours. Maybe red pixel intensity is bit-mapped to octave, i.e. first, third and eighth octave: 0b10100001.

Time would be represented by radius. Unfortunately the space wouldn't wrap nicely, unlike if there were a native image format for representing donuts.
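The proposed clock-face mapping of pitch class to angle can be written down directly. A small sketch following the comment's analogy (440 Hz at 12 o'clock, one "hour" of 30 degrees per semitone; the function name is illustrative):

```python
import numpy as np

def chroma_angle(freq_hz, ref=440.0):
    """Map a frequency to a clock-face angle in degrees.

    The reference pitch sits at 0 degrees (12 o'clock) and each
    chromatic semitone advances 30 degrees; octave information is
    discarded, as in the comment's proposal, so 440 Hz and 880 Hz
    land on the same angle.
    """
    semitones = 12 * np.log2(freq_hz / ref)  # signed distance from ref in semitones
    return (semitones % 12) * 30.0           # wrap to one octave, scale to degrees

print(chroma_angle(440.0))   # 0.0 -> 12 o'clock
print(chroma_angle(880.0))   # 0.0 -> same pitch class, one octave up
print(chroma_angle(466.16))  # ~30 -> one semitone up (A#4)
```

One appeal of this layout is that octaves collapse to the same angle, so harmonically related content becomes spatially aligned.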
sansieation over 2 years ago

This is genuinely amazing. Like with any AI, there are areas it's better at than others. I hope it doesn't go unnoticed just because people try typing "death metal" and are not happy with the results. This one seems to excel at 70-130 BPM lo-fi vaporwave/washed-out electronic beats. Think Ninja Tune or the modern lo-fi beats to study to. Some of this stuff genuinely sounds like something I'd encounter on Bandcamp or Soundcloud.

I think I'm beginning to crack the code with this one; here's my attempt at something like a "DJ set" with it. My goal was to have the smoothest transitions possible and direct the vibe just like I would doing a regular DJ set: https://www.youtube.com/watch?v=BUBaHhDxkIc

I wonder if this could be the future of DJing, or perhaps the beginning of a new trend in live music making, kind of like Sonic Pi. Instead of mixing tracks together, the DJ comes up with prompts on the spot and tweaks AI parameters to achieve the desired direction. Pretty awesome.
adzm over 2 years ago

Really fascinating. I'd be interested to know more about how it was trained, with what data exactly.
logn over 2 years ago
Congratulations this is an amazing application of technology and truly innovative. This could be leveraged by a wide range of applications that I hope you&#x27;ll capitalize on.
joenot443 over 2 years ago

Incredible stuff, Seth & Hayk. I've been thinking nonstop about new things to build using Stable Diffusion and this is the first one that really took my breath away.
pea over 2 years ago

This is amazing! Would it be possible to use it to interpolate between two existing songs (i.e. generate spectrograms from audio and transition between them)?
ricopags over 2 years ago

This is so completely wild. Love the novelty and inventiveness.

Could anyone help me understand whether using SVG instead of a bitmap image would be possible? I realize that probably wouldn't be taking advantage of the diffusion part of Stable Diffusion, but my intuition is maybe it would be less noisy, or offer a cleaner/more compressible path to parsing transitions in the latent space.

Great idea? Off base entirely? Would love some insight either way :D
pmontra over 2 years ago

A musician friend of mine told me that this is (I freely translate) a perversion: building in frequency and returning time. Don't shoot the messenger.

Personally I like the results. I'm totally untrained and couldn't hear any of the issues many comments are pointing out.

I guess that all lounge/elevator music and probably most ad jingles will be automated soon, if automation costs less than human authors.
jansan over 2 years ago

Very impressive. I am quite confident that next year's number one Christmas hit will start like "church bells to electronic beats".
Pepe1vo over 2 years ago

I find it really cool that the "uncanny valley" that's audible on nearly every sample is exactly how I would imagine the visual artifacts that crop up in most generated art would sound. Not really surprising, I guess, but still cool that there's such a direct correlation between completely different mediums!
Abecid over 2 years ago

This is one of the most ingenious things I've seen in my life.
CrypticShift over 2 years ago

Things similar to the "interpolation" part (not the generative part) are already used extensively, especially for game and movie sound design. Kyma [1] is the absolute leader (it requires expensive hardware, though). IMO, later iterations of this approach may lead to similar or better results.

FYI, other apps that use more classic but still complex spectral/granular algos:

https://www.thecargocult.nz/products/envy

https://transformizer.com/products/

https://www.zynaptiq.com/morph/

[1] https://kyma.symbolicsound.com/
owlbynight over 2 years ago

If copyright laws don't catch up, the sampling industry is cooked.

Made this: https://soundcloud.com/obnmusic/ai-sampling-riffusion-waves-lapping-on-a-shore-to-nowhere
rbn3 over 2 years ago

Great stuff. While it comes with the usual smeary iFFT artifacts that AI-generated sound tends to have, the results are surprisingly good. I especially love the nonsense vocals it generates in the last example, which remind me of what singing along to foreign songs felt like in my childhood.
andy_ppp over 2 years ago

I was thinking about this: what if someone trained a stable diffusion type model on all of the world's commercial music? This model would probably produce quite amazing music given enough prompting, and I'm wondering if the music industry would be allowed to claim copyright on works created with such a model. Would it be illegal, or is this just like a musician picking up ideas from hearing the world of music? Is it really right to make learning a crime, even if machines are doing it? I'm conflicted after finding out that for sync licensing the music industry wants a percentage of revenue based on your subscription fees, sometimes as high as 15%-20%! I'm surprised such a huge fee isn't considered some kind of protection racket.
toasternz over 2 years ago

https://soundcloud.com/toastednz/stablediffusiontoddedwards?si=e7018bdda1014b8084d47d0ade6bf1ee&utm_source=clipboard&utm_medium=text&utm_campaign=social_sharing

40-second clip of a UK garage / Todd Edwards style track made with Riffusion -> Serato Studio with Todd beats added.
O__________O over 2 years ago

>> Prompt - When providing prompts, get creative! Try your favorite artists, instruments like saxophone or violin, modifiers like arabic or jamaican, genres like jazz or rock, sounds like church bells or rain, or any combination. Many words that are not present in the training data still work because the text encoder can associate words with similar semantics. The closer a prompt is in spirit to the seed image and BPM, the better the results. For example, a prompt for a genre that is much faster BPM than the seed image will result in poor, generic audio.

(1) Is there a corpus the keywords were collected from?

(2) Is it possible to model the proximity of the image to keywords and sets of keywords?
vermilingua over 2 years ago

Finally, post-avant jazzcore [1] and progressive dreamfunk [2] made real.

[1] https://www.riffusion.com/?&prompt=Post-avant+jazzcore&seed=8730&denoising=0.75&seedImageId=og_beat

[2] https://www.riffusion.com/?&prompt=Progressive+dreamfunk&seed=3&denoising=0.75&seedImageId=og_beat
r3trohack3r over 2 years ago

I can't help but see parallels to synesthesia. It's amazing how capable these models are at encoding arbitrary domain knowledge as long as you can represent it visually w/ reasonable noise margins.
m3kw9 over 2 years ago
They’ve got a looooong way to go man
Slow_Hand over 2 years ago

As a musician, I'll start worrying once an AI can write at the level of sophistication of a Bill Withers song:

https://www.youtube.com/watch?v=nUXgJkkgWCg

Not simply SOUND like a Bill Withers song, but have the same depth of meaning and feeling.

At that point, even if we lose we win, because we'll all be drowning in amazing music. Then we'll have a different class of problem to contend with.
minaguib over 2 years ago
Absolutely incredible - from idea to implementation to output.
lachlan_gray over 2 years ago
Wow, diffusion could be a game changer for audio restoration.
nathan_f77 over 2 years ago

This is awesome! It would be interesting to generate individual stems for each instrument, or even MIDI notes to be rendered by a DAW and VST plugins. It's unfortunate that most musicians don't release the source files for their songs, so it's hard to get enough training data. There are a lot of MIDI files out there, but they don't usually have information about effects, EQ, mastering, etc.
evo_9 over 2 years ago

Pretty nice. I was just talking to a friend about needing a music version of ChatGPT, so thank you for this.

Wondering if it would be possible to create a version of this that you can point at a person's SoundCloud and have it emulate their style / create more music in the style of the original artist. I have a couple of albums' worth of downtempo electronic music I would love to point something like this at and see what it comes up with.
primarch over 2 years ago

Could this concept be inverted, to identify music from a text prompt? As in: I want a particular vibe, and it can go tell me what music fits that description. I've always thought the ability to find music you like was very lacking. It should not be bounded by genre; instead it's usually based on rhythmic and melodic structures that appeal to you, regardless of what type of music it might be.
hoschicz over 2 years ago

What did you use as training data?
isoprophlex over 2 years ago

I wonder how they got their training data! The spectrogram trick is genius, but not much use without high-quality, diverse data to train on.
stevehiehn over 2 years ago

Really great! I've been using diffusion as well to create sample libraries. My angle is to train models strictly on chord-progression-annotated data as opposed to human descriptions, so they can be integrated into a DAW plugin. Check it out: https://signalsandsorcery.org/
GaggiX over 2 years ago

You can train/finetune a Stable Diffusion model on an arbitrary aspect ratio/resolution and the model starts creating coherent images. It would be cool to try finetuning/training this model on entire songs by extending the time dimension (though the attention layer at the usual 64x64 resolution should be removed, or it would eat too much memory).
r-k-jo over 2 years ago

Amazing project! Here is a demo including an input for a negative prompt. It's impressive how it works. You can try:

prompt: relaxing jazz melody bass music
negative_prompt: piano music

https://huggingface.co/spaces/radames/spectrogram-to-music
ElijahLynn over 2 years ago

I was confused because I must not have read carefully that the working webapp is at https://www.riffusion.com/. Go to https://www.riffusion.com/ and press the play button to see it in action!
fernandohur over 2 years ago

I found this awesome podcast that goes into several AI & music related topics: https://open.spotify.com/show/2wwpj4AacVoL4hmxdsNLIo?si=IAaJ6A3yQ1GT5na4iSZnRA

They even talk specifically about applying stable diffusion and spectrograms.
leod over 2 years ago

Awesome work.

Would you be willing to share details about the fine-tuning procedure, such as the initialization, learning rate schedule, batch size, etc.? I'd love to learn more.

Background: I've been playing around with generating image sequences from sliding windows of audio. The idea roughly works, but the model training gets stuck due to the difficulty of the task.
EZ-Cheeze over 2 years ago

"https://en.wikipedia.org/wiki/Spectrogram - can we already do sound via image? probably soon if not already"

Me in the Stable Diffusion discord, 10/24/2022

The ppl saying this was a genius idea should go check out my other ideas
LonelyWolfe over 2 years ago

I wonder if subliminal messaging will somehow make a comeback once we have AI-generated audio and video. Something like: we type "Fun topic" and those controlling the servers inject "and praise to our empire/government/tyrant" into the suggestion, or something like that.
raajg over 2 years ago

If such unreasonably good music can be created based on information encoded in an image, I'm wondering what other things we can do with this flow:

1) Write text to describe the problem
2) Generate an image Y that encodes that information
3) Parse that image Y to do X

Example: Y = blueprint, X = constructing a building with that blueprint
newswasboring over 2 years ago

If it can do music, can we train better models for different kinds of music? Or do different models for different instruments make more sense? For different instruments we could get better resolution by making the spectrogram represent different frequency ranges. This is terribly exciting; what a time to be alive.
fowlkes over 2 years ago

Multiple folks have asked here and in other forums, but I'm going to reiterate: what dataset of paired music-captions was this trained on? It seems strange to put up a splashy demo and repo with model checkpoints but not explain where the model came from... is there something fishy going on?
jsat over 2 years ago

Today's music generation is putting my Pop Ballad Generator to shame: http://jsat.io/blog/2015/03/26/pop-ballad-generator/
blaaaaa99a over 2 years ago

I feel like the next step here is to get a GPT-3-like model to parse everything ever written about every piece of music on the internet (and in every PDF on libgen and scihub) and link them to spectrograms of that music.

And then things are going to get wild.

I am so blessed to live in this era :)
birdyrooster over 2 years ago

I know it sounds like I am going to be sarcastic, but I mean all of this in earnest and with good intention. Everything this generates is somehow worse than the thing it generated before it. Like the uncanny valley of audio had never been traversed in such high fidelity. Great work!
bulbosaur123 over 2 years ago

Anyone interested in joining an unofficial Riffusion Discord, let's organize here: https://discord.gg/DevkvXMJaa

Would be nice to have a channel where people can share riffs they come up with.
thedangler over 2 years ago

Anyone know how I could try using this with an Elixir Livebook and https://github.com/elixir-nx/bumblebee?

I'm new, but this is something that would get me going.
bogwogover 2 years ago
Damn this is insane. I wonder what other things can be encoded as images and generated with SD?
johndhiover 2 years ago
I'm a short fiction writer. Do you think I could get one of these new models to write a good story?

I'd want to train it to include foreshadowing, suspense, relatable characters, and perhaps a twist ending that cleverly references the beginning.
TuringNYCover 2 years ago
I read the article:

"If you have a GPU powerful enough to generate stable diffusion results in under five seconds, you can run the experience locally using our test flask server."

Curious what sort of GPU the author was using, or what some of the minimum requirements might be?
soperjover 2 years ago
> https://www.riffusion.com/?&prompt=punk+rock+in+11/8

Tried getting something in an odd time signature, but it still came out 4/4.
woeiruaover 2 years ago
Very cool, but the music still has a very "rough", almost abrasive tinge to it. My hunch is that it has to do with the phase estimates being off.

Who's going to be first to take this approach and use it to generate human speech instead?
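For context on the phase issue raised above: a magnitude spectrogram discards phase, so reconstructing audio requires estimating it. A common estimator is the Griffin-Lim algorithm (whether Riffusion uses exactly this isn't stated in the thread); below is a minimal numpy/scipy sketch, with the FFT size, iteration count, and test tone all chosen purely for illustration.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_fft=256, n_iter=32, seed=0):
    """Estimate a waveform from a magnitude-only spectrogram.

    Classic Griffin-Lim: start from random phase, then alternate
    inverse/forward STFTs, keeping the known magnitudes each round.
    """
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        _, x = istft(mag * phase, nperseg=n_fft)   # candidate waveform
        _, _, Z = stft(x, nperseg=n_fft)           # re-analyze it
        Zc = np.zeros_like(mag, dtype=complex)
        m = min(mag.shape[1], Z.shape[1])
        Zc[:, :m] = Z[:, :m]                       # align frame counts
        phase = np.exp(1j * np.angle(Zc))          # keep only the new phase
    _, x = istft(mag * phase, nperseg=n_fft)
    return x

# Round-trip a 440 Hz tone through magnitude-only reconstruction
fs = 8000
t = np.arange(fs) / fs
_, _, Z = stft(np.sin(2 * np.pi * 440 * t), nperseg=256)
y = griffin_lim(np.abs(Z))
```

The "abrasive" quality the commenter hears is consistent with this kind of iterative phase estimate: it converges to something plausible, not to the original phase.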
wmwmwmover 2 years ago
This is amazing, and scary (as a musician) but also reliably kills firefox on iOS!
motoxproover 2 years ago
This is just insane. Sooooo incredible. You don't really realize how far things have come until it hits a domain you're extremely familiar with. I spent 8-9 years in music production, and the transition stuff blew me away.
Animatsover 2 years ago
"Uh oh! Servers are behind, scaling up..." - haven't been able to get past that yet. Anyone getting new output?

This is already better than most techno. I can see DJs using this, typing away.
slenocchioover 2 years ago
Do you guys think AI creative tools will completely subsume the possibility space of human-made music? Or do they open up a new dimension of possibilities orthogonal to it? It's hard for me to imagine how AI would be able to create something as unique and human as D'Angelo's Voodoo (especially before he existed), but maybe it could (eventually).

If I understand these AI algorithms at a high level, they're essentially finding patterns in things that already exist and replicating them, with some variation, quite well. But a good song is perfect and precise in each moment in time. Maybe we'll only ever be able to get asymptotically closer but never _quite_ there to something as perfectly crafted as a human could make? Maybe there will always be a frontier space only humans can explore?
corysamaover 2 years ago
Can anyone confirm/deny my theory that AI audio generation has been lagging behind progress in image generation because it's way easier to get a billion labeled images than a billion labeled audio clips?
xor99over 2 years ago
Also check out the similar work on arXiv:

Multi-instrument Music Synthesis with Spectrogram Diffusion: https://arxiv.org/abs/2206.05408
londons_exploreover 2 years ago
I propose that while you are GPU limited, you make these changes:

* Don't do the alpha fade to the next prompt - just jump straight to alpha=1.0.

* Pause the playback if the server hasn't responded in time, rather than looping.
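The alpha fade under discussion can be pictured as a plain linear crossfade between the outgoing and incoming clips; the sketch below is a simplified audio-domain stand-in (the app itself interpolates in latent space), with the function name and overlap length made up for illustration.

```python
import numpy as np

def crossfade(a, b, overlap):
    """Fade clip `a` out and clip `b` in, linearly, over `overlap` samples."""
    ramp = np.linspace(0.0, 1.0, overlap)
    mixed = a[-overlap:] * (1.0 - ramp) + b[:overlap] * ramp
    return np.concatenate([a[:-overlap], mixed, b[overlap:]])

# Fade a constant-1 clip into a constant-0 clip over 200 samples
out = crossfade(np.ones(1000), np.zeros(1000), 200)
```

Jumping straight to alpha=1.0, as proposed, would simply drop the ramp and concatenate the clips, saving the GPU work spent rendering intermediate blends.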
2devnullover 2 years ago
Was just watching an interview with Billy Corgan (Smashing Pumpkins) on Rick Beato's YouTube [1] last night, where Billy was lamenting the inevitable future in which the "psychopaths" in the music biz will use AI and Auto-Tune to churn out three-chord non-music mumble rap for the youth of tomorrow, or something to that effect. It was funny because it's the sad truth. It's already here, but new tech will allow them to cut costs even more and increase their margins. No need for musicians. Really cool on one hand, in the same way fentanyl is cool, or the cotton gin; a bit depressing on the other, if you care about musicians. I and a few others will always pay to go to the symphony, so good players will find a way to get paid, but this is what kids will listen to, because of the profit margin alone.

[1] https://m.youtube.com/watch?v=nAfkxHcqWKI
nathiasover 2 years ago
Very cool! I was wondering why there weren't any music diffusion apps out there. It seems especially useful because music has stricter copyright, and all content creators need some background music...
whiddershinsover 2 years ago
This is a brilliant idea.

Also, spectrograms will never generate plausible high-quality audio. (I think)

So I think the next move is to map the generated audio back over to synthesizers and samples via MIDI...
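A naive first step toward the audio-to-MIDI mapping suggested above, for monophonic material only, is picking the strongest FFT bin per frame and converting it to a MIDI note number. This is a toy sketch, not a real transcription method (which would need onset detection, polyphony, and robustness to overtones); the function name and parameters are illustrative.

```python
import numpy as np

def frame_to_midi(frame, fs):
    """Crude monophonic pitch estimate: strongest FFT bin -> MIDI note."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    spectrum[0] = 0.0                                    # ignore the DC bin
    freq = np.argmax(spectrum) * fs / len(frame)         # bin index -> Hz
    return int(round(69 + 12 * np.log2(freq / 440.0)))   # Hz -> MIDI (A4 = 69)

# A 440 Hz sine in a 4096-sample frame should map to MIDI note 69 (A4)
fs, n = 44100, 4096
tone = np.sin(2 * np.pi * 440 * np.arange(n) / fs)
note = frame_to_midi(tone, fs)
```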
ubjover 2 years ago
This happened earlier than I expected, and using a much different technique than I expected.

Bracing myself for when major record labels enter the copyright brawl that diffusion technology is sparking.
bawolffover 2 years ago
I wonder if this would be applicable to video game music. It could make stuff that's less repetitive but also transitions smoothly to specific cues on in-game events.
esotericseanover 2 years ago
Coming at this from a layman's perspective: would it be possible to generate a different sort of spectrogram that's designed for SD to iterate upon even more easily?
TechTechTechover 2 years ago
I got an actual `HTTP 402: PAYMENT_REQUIRED` response (never seen one of those in the wild; according to Mozilla it is experimental). Someone's credit card stopped scaling?
serverholicover 2 years ago
I'm curious about the limitations of spectrograms with transient-heavy sounds like drums.

It seems like you'd need very high resolution spectrograms to get a consistently snappy drum sound.
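The tradeoff behind this concern: an STFT with a long analysis window gives fine frequency resolution but smears transients across time, while a short window does the opposite. That's why a snappy drum hit is hard to capture at any single resolution. A quick illustration with scipy (sample rate and window sizes are arbitrary):

```python
import numpy as np
from scipy.signal import stft

fs = 22050
click = np.zeros(fs)
click[fs // 2] = 1.0                          # a single-sample transient, like a drum hit

_, _, Z_short = stft(click, fs, nperseg=64)   # short window
_, _, Z_long = stft(click, fs, nperseg=1024)  # long window

# Short windows: many time frames (sharp timing), few frequency bins.
# Long windows: few time frames (smeared timing), many frequency bins.
```

A fixed-resolution spectrogram image, as used here, has to commit to one point on this tradeoff for the whole mix.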
nixpulvisover 2 years ago
Sounds a bit "clowny" to me, for lack of a better word.
_spduchampover 2 years ago
I prefer to make my spectrograms by hand: https://youtu.be/HT0HH_fc4ZU
gitfan86over 2 years ago
This is what I've been talking about all year. It is such a relief to see it actually happen.

In summary: the search for AGI is dead. Intelligence was here, and more general than we realized, this whole time. Humans are not special as far as intelligence goes. Just look how often people predict that an AI cannot do X or Y or Z. And then when an AI does one of those things they say, "well, it cannot do A or B or C".

What is next: this trend is going to accelerate as people realize that AI's power isn't in replacing human tasks with AI agents, but in letting the AI operate in latent spaces and domains that we never even thought about trying.
quirkotover 2 years ago
Anyone out there who speaks Arabic: can you let us know if the Arabic Gospel clip contains words? To a speaker, do the sounds even sound Arabic?
nonimaover 2 years ago
This is really cool, but can someone tell me why we are automating art? Who asked for this? The future seems depressing when I look at all this AI-generated art.
xcambarover 2 years ago
I will try it, but the name alone deserves praise.
XorNotover 2 years ago
So this is slightly bending my mind again. Somehow image generators were more comprehensible compared to getting coherent music out. This is incredible.
slissover 2 years ago
Such a creative application of SD to spectrograms!

...now do stock charts
2dvisioover 2 years ago
Very interesting idea :) Unfortunately it breaks when I enter Tarantella or Taranta. Needs more training samples from the south of Italy :)
NoPicklezover 2 years ago
This, ChatGPT, and the AI image generators. We're now at a very interesting time where average Joes get to start using incredible tools.
phneutral26over 2 years ago
Right now it still seems to lack the horsepower for this many users. Hope it gets into a better state soon, but I am bookmarking this right now!
Raed667over 2 years ago
Seems to be a victim of its own success:

- No new result available, looping previous clip

- Uh oh! Servers are behind, scaling up

I hope the Vercel people can give you some free credits to scale it up.
sampoover 2 years ago
GPT-3 has 175 billion parameters (says Wikipedia). What is the size of the neural network used in this riffusion project?
Brogeover 2 years ago
I wonder if it's possible to fine-tune an image upscaling model on spectrograms in order to clean up the sound?
nailloover 2 years ago
It would be interesting to see if this can be used for longer tracks by inpainting the right half of the spectrogram.
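That idea can be sketched as a sliding window: keep the previous clip's right half as fixed context, mask the rest, and let the model fill it in. Here `inpaint_fn` is a hypothetical stand-in for a diffusion inpainting call; it is not Riffusion's actual API, and every name below is illustrative.

```python
import numpy as np

def outpaint_right(first, n_steps, inpaint_fn):
    """Extend a spectrogram rightward one half-window at a time.

    `inpaint_fn` (hypothetical) takes a window whose left half is fixed
    context and whose right half is zeroed, and returns it completed.
    """
    half = first.shape[1] // 2
    pieces, window = [first], first
    for _ in range(n_steps):
        masked = np.zeros_like(window)
        masked[:, :half] = window[:, half:]   # previous right half becomes context
        window = inpaint_fn(masked)           # model fills the zeroed right half
        pieces.append(window[:, half:])       # keep only the newly generated material
    return np.concatenate(pieces, axis=1)

# Shape check with a trivial stand-in "model" that just shifts values
spec = np.random.default_rng(0).random((4, 8))
out = outpaint_right(spec, 3, lambda w: w + 1.0)
```

Each step grows the track by half a window, so continuity comes from the shared context region rather than from any global structure.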
MarcelOlszover 2 years ago
Now if only it could generate accurate sheet music, then you'd really have something. Incredible examples.
ElijahLynnover 2 years ago
Wow, I just learned so much about spectrograms; I had no idea one could be reversed back into audio waves!
fillipvtover 2 years ago
Now the next step is to use Stable Diffusion to create chemical components via molecular graphs
realmodover 2 years ago
Wow, absolutely fascinating. AI will continue to revolutionize our current approaches.
dreilideover 2 years ago
Impressive stuff. Reminds me of when people started using image classifier networks on spectrograms in order to classify audio. I would not have thought to apply a similar concept to generative models, but it seems obvious in hindsight.
winReInstallover 2 years ago
Can't wait to see this in karaoke: you just sing the lyrics and it jams along with music.
rmetzlerover 2 years ago
"Jamaican rap" - the genre (e.g. Sean Paul) is usually called Dancehall.
fritzschopenover 2 years ago
It seems that SD covers everything in terms of generative AI. Speaking of music: very interesting paper and demo. Just wondering, in terms of licensing and commercialization, what kind of mess are we expecting here?
epigramxover 2 years ago
I wasn't expecting to see the uncanny valley translated to music today.
mensetmanusmanover 2 years ago
This works because songs are images in time. FFT analysis does not care.
bufferoverflowover 2 years ago
A network trained on spectrograms only should do much better.
farminover 2 years ago
That church bell one is amazing. Very creative transition.
throwaway743over 2 years ago
Somehow it made Wesley Willis sound even better.
fireover 2 years ago
Unfortunately I put in "sonic the hedgehog" and the result was...

...nothing? Like, there's no playback at all. Is that expected?
bluebitover 2 years ago
And we broke it.
aquanextover 2 years ago
Someone please train it on John Coltrane.
lucidrainsover 2 years ago
Personalized RL agents that find aesthetic trajectories through the music latent space... soon, I hope :D
needzover 2 years ago
This website crashes Firefox on iOS
NHQover 2 years ago
In the end there was the word.
wwarnerover 2 years ago
BOOM! Yes!
Scaevolusover 2 years ago
Did you fine-tune the VAE?
xor99over 2 years ago
New era of library music.
up2isomorphismover 2 years ago
This is horrible music, but of course there is nothing to be ashamed about.
ngcc_hkover 2 years ago
Not generate … steal.
kingcaiover 2 years ago
Absolutely brilliant!
Moosdijkover 2 years ago
Wow this is awesome!
gardenhedgeover 2 years ago
Impressive. And this is a hobby project... amazing.
451movover 2 years ago
Why not use an image of the waveform as input?
PcChipover 2 years ago
The problem is it _sounds_ awful, like a 64kbps MP3 or worse.

Perhaps AI can be trained to create music in different ways than generating spectrograms and converting them to audio?
simsspoonsover 2 years ago
this is just great
swfsqlover 2 years ago
snake jazz
mixedenover 2 years ago
wow!
rsliceover 2 years ago
deleted