Try “Riffusion”, an artificial intelligence model that composes music by visualizing it

AI-generated music is already an innovative enough concept, but Riffusion takes it to another level with a clever, weird approach that produces weird and compelling music using not audio but images of audio.

Sounds strange, is strange. But if it works, it works. And it does work! Kind of.

Diffusion is a machine learning technique for generating images that has supercharged the AI world over the past year. DALL-E 2 and Stable Diffusion are the two best-known models, and they work by gradually replacing visual noise with what the AI thinks the prompt ought to look like.

The method has proved powerful in many contexts and is very amenable to fine-tuning, where you give the mostly trained model a lot of a particular type of content so that it specializes in producing more examples of that content. For example, you could fine-tune it on watercolors or on photos of cars, and it would become more capable of reproducing either of those things.

What Seth Forsgren and Hayk Martiros did for their hobby project, Riffusion, was fine-tune Stable Diffusion on spectrograms.

“Hayk and I play in a little band together, and we started the project simply because we love music and we didn’t know if it was even possible for Stable Diffusion to generate a spectrogram image with enough precision to convert it into audio,” Forsgren told TechCrunch. “At each step along the way, we were more and more impressed by what was possible, one idea leading to the next.”

What are spectrograms, you ask? They are visual representations of audio that show the amplitude of different frequencies over time. You have probably seen waveforms, which show volume over time and make audio look like a series of hills and valleys; imagine that instead of just the overall volume, it showed the volume of each frequency, from the low end to the high end.
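To make the idea concrete, here is a minimal sketch of how a spectrogram can be computed, using only the Python standard library. It is not the Riffusion code: the function name `spectrogram`, the frame size, the hop size, and the toy 1,000 Hz tone are all assumptions chosen for illustration. Each frame of audio is windowed and run through a naive Fourier transform, and the result is one column of per-frequency volumes.

```python
import cmath
import math

def spectrogram(samples, frame_size=64, hop=32):
    """Return a list of frames; each frame lists the magnitude per frequency bin."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # Hann window to reduce spectral leakage at the frame edges.
        windowed = [
            s * 0.5 * (1 - math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, s in enumerate(frame)
        ]
        # Naive DFT, keeping only the non-negative frequencies.
        mags = []
        for k in range(frame_size // 2 + 1):
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            mags.append(abs(acc))
        frames.append(mags)
    return frames

# Toy signal (illustrative values): a pure 1,000 Hz tone sampled at 8,000 Hz.
SAMPLE_RATE = 8000
TONE_HZ = 1000
signal = [math.sin(2 * math.pi * TONE_HZ * t / SAMPLE_RATE) for t in range(512)]

spec = spectrogram(signal)
# Each bin spans SAMPLE_RATE / frame_size = 125 Hz, so the tone lands in bin 8.
loudest_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), "frames,", len(spec[0]), "bins, loudest bin:", loudest_bin)
```

Drawn as an image, each frame becomes one column of pixels and each bin one row, with brightness standing in for volume. A real pipeline would use a fast Fourier transform rather than this slow double loop.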

Here is a spectrogram I made from part of a song (“Marconi’s Radio” by Secret Machines, if you were wondering):

Image credits: Devin Coldewey

You can see how it gets louder at all frequencies as the song builds, and you can even pick out individual notes and instruments if you know what to look for. The process is not perfect or lossless by any means, but it is an accurate and systematic representation of the sound. And you can convert it back into sound by running the same process in reverse.
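The round trip can be sketched in a few lines of standard-library Python. This is an illustration under simplifying assumptions, not the project's actual pipeline: the frames do not overlap, no window is applied, and the helper names `dft`, `idft`, and `max_error` are made up for the example. It contrasts reversing the transform with the phase intact against reversing it from magnitudes alone, which is all a picture of a spectrogram stores.

```python
import cmath
import math

def dft(frame):
    """Forward transform: one complex value (magnitude and phase) per frequency."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        for k in range(n)
    ]

def idft(bins):
    """Inverse transform: rebuild the real-valued samples from frequency bins."""
    n = len(bins)
    return [
        sum(b * cmath.exp(2j * math.pi * k * i / n) for k, b in enumerate(bins)).real / n
        for i in range(n)
    ]

def max_error(a, b):
    """Largest absolute difference between two equally long sample lists."""
    return max(abs(x - y) for x, y in zip(a, b))

FRAME = 32  # illustrative frame length
signal = [math.sin(0.3 * t) + 0.5 * math.sin(1.1 * t) for t in range(4 * FRAME)]
frames = [signal[i:i + FRAME] for i in range(0, len(signal), FRAME)]
spectra = [dft(f) for f in frames]

# Reverse the process with the phase intact.
rebuilt = [s for bins in spectra for s in idft(bins)]
# Reverse it from magnitudes only, with every phase set to zero.
magnitude_only = [s for bins in spectra for s in idft([abs(b) for b in bins])]

err_full = max_error(signal, rebuilt)
err_mag = max_error(signal, magnitude_only)
print("error with phase kept:", err_full)
print("error with phase discarded:", err_mag)
```

With the phase kept, the error is down at floating-point rounding level; with the phase thrown away, the rebuilt samples no longer match. That is one reason the conversion is not lossless: turning a spectrogram image back into audio means estimating the missing phase, typically with an iterative method such as Griffin-Lim.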

Forsgren and Martiros made spectrograms of a range of music and tagged the resulting images with the relevant terms, like “blues guitar,” “jazz piano,” “afrobeat,” things like that. Feeding the model this collection gave it a good idea of what certain sounds “look like” and how it might recreate or combine them.

Here’s what the diffusion process looks like if you sample it while it is refining the image:

Image credits: Seth Forsgren / Hayk Martiros

Indeed, the model proved capable of producing spectrograms that, when converted to sound, are a pretty good match for prompts such as “funky piano,” “jazzy saxophone,” and so on. Here is an example:

Image credits: Seth Forsgren / Hayk Martiros

But of course a square spectrogram (512 x 512 pixels, a standard Stable Diffusion resolution) represents only a short clip; a three-minute song would be a much wider rectangle. Nobody wants to listen to music five seconds at a time, but the limitations of the system they built mean they can’t just create a spectrogram 512 pixels tall and 10,000 wide.

After trying a few things, they took advantage of the fundamental structure of large models such as Stable Diffusion, which have a great deal of “latent space.” This is a bit like the no-man’s-land between more well-defined nodes. If you have one area of the model that represents cats and another that represents dogs, what’s “in between” them is latent space that, if you just told the AI to draw it, would come out as some kind of dogcat or catdog, even though there is no such thing.

By the way, the latent space stuff gets a lot weirder than that:

However, there are no creepy nightmare worlds in the Riffusion project. Instead, they found that if you have two prompts, like “church bells” and “electronic tones,” you can step from one to the other a little at a time, and it fades gradually and surprisingly naturally from one to the other, even staying on the beat:

It sounds weird and interesting, though it’s clearly not complex or high fidelity. Remember, they weren’t even sure diffusion models could do this at all, so the ease with which this one turns bells into beats or typewriter taps into piano and bass is pretty remarkable.

Producing longer-form clips is possible but still theoretical:
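One common way to step between two points in a latent space is spherical linear interpolation, or slerp. The sketch below is an assumption about how such a walk could look, not a description of Riffusion's implementation: the three-element vectors `prompt_a` and `prompt_b` are hypothetical stand-ins for the much larger vectors a real model would use, and the function name `slerp` is simply the conventional one.

```python
import math

def slerp(a, b, t):
    """Interpolate along the arc from vector a (t=0) to vector b (t=1)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding drift
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if abs(sin_omega) < 1e-6:
        # Vectors are (anti)parallel, so the arc is undefined: use a straight line.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    weight_a = math.sin((1 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]

# Hypothetical stand-ins for the latent vectors of two prompts.
prompt_a = [1.0, 0.0, 0.0]
prompt_b = [0.0, 1.0, 0.0]

steps = 5
path = [slerp(prompt_a, prompt_b, i / (steps - 1)) for i in range(steps)]
for point in path:
    print([round(v, 3) for v in point])
```

Each point on the path would be decoded into its own short spectrogram, so playing the clips in order moves from the first prompt to the second a little at a time. Unlike a straight-line blend, slerp keeps the intermediate vectors at the same length as the endpoints when those have equal length, which is why it is often preferred for this kind of walk.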

“We weren’t really trying to create a classic 3-minute song with repeating choruses and verses,” Forsgren said. “I think this can be done with some clever tricks like building a higher level model of the song structure, and then using the lower level model for the individual clips. Alternatively, you could deeply train our model using much higher resolution images of complete songs.”

Where does it go from here? Other groups are trying to create AI-generated music in various ways, from using speech synthesis models to specially trained audio ones like Dance Diffusion.

Riffusion is more of a “wow, look at that” demo than any kind of grand plan to reinvent music, and Forsgren said he and Martiros were just happy to see people engaging with their work, enjoying it and iterating on it:

“There are so many directions we can take from here, and we’re excited to keep learning along the way. It’s been fun to see other people already building their own ideas on top of our code this morning, too. One of the amazing things about the Stable Diffusion community is how fast people are to build on top of things in directions the original authors couldn’t predict.”

You can test it out in a live demo, but you might have to wait a bit for your clip to render; this got a bit more attention than the creators expected. The code is all available via the About page, so feel free to run your own as well, if you have the chips for it.
