Javier Nistal

Sony CSL – Paris

Javier Nistal is an Associate Researcher with the Music Team at Sony Computer Science Laboratories in Paris. He studied Telecommunications Engineering at Universidad Politecnica de Madrid and received a Master’s in Sound and Music Computing from Universitat Pompeu Fabra. He completed his doctoral studies at Telecom Paris in collaboration with Sony CSL, where he researched Generative Adversarial Networks for musical audio synthesis.
In the music tech industry, Javier has worked on diverse projects involving machine learning (ML) and music, including recommendation systems, instrument recognition, and automatic mixing. He contributed to the development of the Midas Heritage D, the first ML-driven audio mixing console, and created DrumGAN, the first ML-powered sound synthesizer to hit the market.
Javier’s current research interest lies at the intersection of music production and deep learning. He is dedicated to devising generative models for music co-creation, aiming to enhance artistic creativity and enable musicians to explore new realms of musical expression.

Diff-A-Riff: Musical Accompaniment Co-creation with Text-driven Latent Diffusion Models

Recent advancements in deep generative models have transformed our capacity for digital creativity, offering sophisticated tools for generating visual and auditory content with levels of control that were previously unimaginable. Projects like DALL-E and Midjourney have set new benchmarks in visual media by turning text prompts into detailed images, and the music industry has seen parallel innovations such as MusicLM and Stable Audio, which convert descriptive text into complete musical compositions. These milestones have significantly expanded the creative horizons of artists and creators.

However, the practical application of these models in music creation faces notable obstacles. Their high computational demands exceed the capacity of standard personal computers and challenge even advanced GPUs, limiting accessibility for everyday creators. In addition, the audio quality often falls short of the professional standards required for music production, primarily because of insufficient sample rates. Furthermore, while text has proven an effective medium for steering visual content generation, its potential to guide music generation in an equally intuitive way has yet to be fully realized.

In this seminar, I will introduce “Diff-A-Riff,” a Latent Diffusion Model designed for creating instrumental accompaniments. “Diff-A-Riff” responds to complex musical contexts and offers controllability through text prompts and audio examples, effectively closing the gap between high-level conceptualization and musical realization. The model operates at a 48 kHz sample rate and achieves fast inference thanks to its efficient, compressed representation space: 10 seconds of audio are generated in roughly 0.5 seconds on a standard GPU and 9 seconds on a CPU. This performance makes high-quality, AI-assisted music creation accessible to those with conventional computing setups.
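The speed figures above reflect a general property of latent diffusion models: the expensive iterative denoising loop runs over a small compressed latent, and a single decoder pass expands the result to full-rate audio. The toy sketch below illustrates only that structure; every name, shape, step count, and the trivial update rule is invented for illustration and is not Diff-A-Riff’s actual architecture.

```python
import numpy as np

LATENT_LEN = 480          # compressed latent length (assumed, for illustration)
AUDIO_RATE = 48_000       # Diff-A-Riff's output sample rate
SECONDS = 10

def toy_denoiser(z: np.ndarray, t: float) -> np.ndarray:
    """Stand-in for a trained noise-prediction network (hypothetical)."""
    return z * 0.1  # pretend estimate of the noise present in z

def toy_decoder(z: np.ndarray) -> np.ndarray:
    """Stand-in for the autoencoder decoder: latent -> waveform."""
    # Upsample the latent to the full audio length (480 * 1000 samples).
    return np.repeat(z, (AUDIO_RATE * SECONDS) // LATENT_LEN)

rng = np.random.default_rng(0)
z = rng.standard_normal(LATENT_LEN)      # start from pure noise in latent space
for t in np.linspace(1.0, 0.0, 20):      # iterative denoising: cheap, small tensor
    z = z - toy_denoiser(z, t)           # simplistic update rule, not a real sampler

audio = toy_decoder(z)                   # one decoder pass: latent -> 10 s @ 48 kHz
print(audio.shape)                       # -> (480000,)
```

The key point is that the loop touches only 480 values per step here, while the 480,000-sample waveform is produced once at the end; real latent diffusion systems exploit the same asymmetry.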
“Diff-A-Riff” introduces a novel perspective on AI-assisted music creation, prioritizing user-friendliness, high-quality output, and expressive interaction, paving the way for a new era of collaborative creativity between humans and machines.