Stefan Lattner

Sony CSL – Paris Music Team

Stefan Lattner is a research leader of the music team at Sony CSL Paris, where he focuses on generative AI for music production, music information retrieval, and computational music perception. He earned his PhD in 2019 from Johannes Kepler University (JKU) in Linz, Austria, after conducting research at the Austrian Research Institute for Artificial Intelligence in Vienna and the Institute of Computational Perception in Linz. His doctoral work centered on modeling musical structure, including transformation learning and computational relative pitch perception. His current interests include human-computer interaction in music creation, live staging, and information theory in music. He specializes in generative sequence models, computational short-term memories, (self-supervised) representation learning, and musical audio generation.

A New Era in Music Creation: Generative AI at Sony CSL - Paris

In this talk, I will present recent work by the music team at Sony CSL Paris on music generation, representation learning, and information theory. I will cover recent developments in Diff-A-Riff, a latent diffusion model for music accompaniment generation designed specifically for music production use cases. It produces high-quality 48 kHz audio stems that fit a given musical context and can be controlled by audio references or text input. Diff-A-Riff is relatively lightweight because it builds on Music2Latent, a consistency autoencoder that achieves a 64x compression ratio for musical audio; its generative decoder produces high-quality reconstructions by filling in information lost during compression. A second generation of Music2Latent has since been developed, and I will explain its novelties. The Music2Latent representations are also used by an autoregressive transformer that generates continuous-valued sequences, based on an exciting new method that I will also cover. Furthermore, I will present new developments in Stem-JEPA, a self-supervised learning model trained to assess the compatibility between stems and mixes, which could extend the range of data usable for training Diff-A-Riff. Finally, I will introduce a method for controlling the information content of generated symbolic music using beam search, applicable to any autoregressive language model, as well as a method for estimating surprisal in musical audio.
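To make the beam-search idea concrete, here is a minimal, hypothetical sketch, not the method presented in the talk: it assumes a generic autoregressive model exposing next-token log-probabilities (the `model.next_token_logprobs` interface is invented for illustration) and keeps the beams whose average per-token surprisal stays closest to a user-chosen target, in bits.

```python
import heapq
import math

def information_controlled_beam_search(model, prompt, target_bits,
                                       steps, beam_width=8, top_k=32):
    """Hypothetical sketch: steer any autoregressive language model toward
    a target average information content (bits per generated token).

    `model.next_token_logprobs(seq)` is an assumed interface returning a
    dict {token: log_prob} over next tokens; it is not a real library call.
    """
    # Each beam is (token sequence, cumulative surprisal of generated tokens in bits).
    beams = [(list(prompt), 0.0)]
    for step in range(1, steps + 1):
        candidates = []
        for seq, info_bits in beams:
            logprobs = model.next_token_logprobs(seq)
            # Expand only the top-k most probable continuations per beam.
            for tok, lp in heapq.nlargest(top_k, logprobs.items(),
                                          key=lambda kv: kv[1]):
                tok_bits = -lp / math.log(2)  # surprisal of this token, in bits
                candidates.append((seq + [tok], info_bits + tok_bits))
        # Keep the beams whose average surprisal is closest to the target.
        beams = sorted(candidates,
                       key=lambda c: abs(c[1] / step - target_bits))[:beam_width]
    return beams[0][0]
```

Ranking candidates by distance to the target rather than by likelihood alone is what steers the information content of the output; standard beam search is recovered by scoring on cumulative log-probability instead.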