Riffusion’s AI generates music from text prompts using visual sonograms

An AI-generated image of musical notes exploding from a computer screen. Credit: Ars Technica

On Thursday, a pair of tech hobbyists released Riffusion, an AI model that generates music from text prompts by creating a visual representation of sound and converting it into audio for playback. It uses a fine-tuned version of the Stable Diffusion 1.5 image synthesis model, applying visual latent diffusion to sound processing in a novel way.

Created as a hobby project by Seth Forsgren and Hayk Martiros, Riffusion works by generating sonograms (spectrograms), which store audio in a two-dimensional image. In a sonogram, the X axis represents time (the order in which the frequencies are played, from left to right), and the Y axis represents the frequency of the sounds. Meanwhile, the color of each pixel in the image represents the amplitude of the sound at that particular moment in time.
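To make that encoding concrete, here is a minimal Python sketch, assuming Torchaudio is installed and a short audio file named clip.wav is available (both assumptions, not part of Riffusion's own code), that computes such a spectrogram: each column is a moment in time, each row a frequency band, and each value the strength of that frequency at that moment.

    import torchaudio

    # Load a short audio clip (the path is a placeholder for illustration).
    waveform, sample_rate = torchaudio.load("clip.wav")

    # A sonogram/spectrogram: X axis = time frames, Y axis = frequency bins,
    # and each value = amplitude of that frequency at that moment.
    spectrogram = torchaudio.transforms.Spectrogram(
        n_fft=1024,      # FFT size; image height becomes n_fft // 2 + 1 rows
        hop_length=256,  # step between frames; controls image width per second
        power=2.0,       # power spectrogram (magnitude squared)
    )(waveform)

    print(spectrogram.shape)  # (channels, freq_bins, time_frames): a 2D image per channel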

Because a sonogram is a type of image, Stable Diffusion can process it. Forsgren and Martiros trained a custom Stable Diffusion model with example sonograms linked to descriptions of the sounds or musical genres they represent. With that knowledge, Riffusion can generate new music on the fly based on text prompts describing the type of music or sound you want to hear, such as “jazz,” “rock,” or even typing on a keyboard.
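In practice, that generation step looks much like an ordinary Stable Diffusion text-to-image call. The sketch below uses the Hugging Face diffusers library; the checkpoint name riffusion/riffusion-model-v1, the prompt, and the file name are assumptions for illustration, and the output is a spectrogram image rather than audio.

    import torch
    from diffusers import StableDiffusionPipeline

    # Assumed checkpoint id for the fine-tuned spectrogram model.
    pipe = StableDiffusionPipeline.from_pretrained(
        "riffusion/riffusion-model-v1",
        torch_dtype=torch.float16,
    ).to("cuda")

    # The prompt describes the sound; the result is an image of a spectrogram,
    # not audio -- a separate step converts it to a waveform.
    image = pipe("smooth jazz saxophone solo", num_inference_steps=50).images[0]
    image.save("spectrogram.png")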

After the sonogram image is generated, Riffusion uses Torchaudio to convert the sonogram into a waveform and play it back as audio.
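Here is a minimal sketch of that conversion, assuming a grayscale spectrogram image whose pixel intensity encodes magnitude; Torchaudio's Griffin-Lim transform estimates the missing phase so a waveform can be reconstructed. The parameter values and the 44.1 kHz sample rate are illustrative defaults, not Riffusion's exact settings.

    import numpy as np
    import torch
    import torchaudio
    from PIL import Image

    # Treat pixel intensities as spectral magnitudes (rows = frequencies, columns = time).
    img = np.asarray(Image.open("spectrogram.png").convert("L"), dtype=np.float32)
    magnitudes = torch.from_numpy(img / 255.0)

    # Griffin-Lim iteratively estimates phase from magnitudes to recover a waveform.
    griffin_lim = torchaudio.transforms.GriffinLim(
        n_fft=(magnitudes.shape[0] - 1) * 2,  # FFT size implied by the image height
        hop_length=256,
        power=1.0,  # interpret the input as magnitude, not power
    )
    waveform = griffin_lim(magnitudes)

    torchaudio.save("output.wav", waveform.unsqueeze(0), sample_rate=44100)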

A sonogram represents time, frequency, and amplitude in a two-dimensional image.

“This is the v1.5 Stable Diffusion model with no modifications, just fine-tuned on images of spectrograms paired with text,” Riffusion’s creators wrote on its explanation page. “It can generate infinite variations of a prompt by varying the seed. All the same web UIs and techniques like img2img, inpainting, negative prompts, and interpolation work out of the box.”
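As a rough illustration of the vary-the-seed part, this sketch (again assuming the riffusion/riffusion-model-v1 checkpoint id and an example prompt) fixes the prompt and changes only the random generator seed, producing a different spectrogram, and therefore a different clip, each time.

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "riffusion/riffusion-model-v1",
        torch_dtype=torch.float16,
    ).to("cuda")

    prompt = "lo-fi hip hop beat"  # example prompt, not from the article
    for seed in (1, 2, 3):
        generator = torch.Generator(device="cuda").manual_seed(seed)
        image = pipe(prompt, generator=generator).images[0]
        image.save(f"spectrogram_seed_{seed}.png")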

Visitors to the Riffusion website can experiment with the AI model thanks to an interactive web app that generates interpolated sonograms (smoothly stitched together for uninterrupted playback) in real time while visualizing the spectrogram continuously on the left side of the page.
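One plausible way to achieve that smooth stitching, sketched here under the assumption that neighboring clips are blended in the diffusion model's latent noise space, is spherical linear interpolation (slerp) between two seeds' starting latents; each intermediate latent is then denoised with the same prompt to produce a transition frame.

    import torch

    def slerp(t, a, b):
        # Spherical linear interpolation between two latent tensors.
        a_n, b_n = a / a.norm(), b / b.norm()
        omega = torch.acos((a_n * b_n).sum().clamp(-1.0, 1.0))
        if omega.abs() < 1e-6:
            return (1.0 - t) * a + t * b  # nearly parallel: plain lerp is fine
        return (torch.sin((1.0 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)

    shape = (1, 4, 64, 64)  # latent shape for a 512x512 Stable Diffusion image
    lat_a = torch.randn(shape, generator=torch.Generator().manual_seed(1))
    lat_b = torch.randn(shape, generator=torch.Generator().manual_seed(2))

    # Five transition latents from seed A to seed B, each to be denoised with the same prompt.
    steps = [slerp(i / 4, lat_a, lat_b) for i in range(5)]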

Screenshot of the Riffusion website, which lets you type prompts and hear the resulting audio.

It can also blend styles. For example, typing in “Tropical Soft Jazz” brings in elements of different genres for a novel result, encouraging experimentation by mixing styles.

Of course, Riffusion isn’t the first AI-powered music generator. Earlier this year, Harmonai released Dance Diffusion, an AI-powered generative music model. OpenAI’s Jukebox, announced in 2020, also generates new music with a neural network. And websites like Soundraw generate music non-stop on the fly.

Compared to those more streamlined AI music efforts, Riffusion feels more like the hobby project it is. The music it generates ranges from interesting to unintelligible, but it remains a notable application of latent diffusion that manipulates sound in a visual space.

The Riffusion sample code and checkpoint are available on GitHub.
