Alexandre Défossez


Alexandre is part of the founding research team at Kyutai, a leading non-profit research lab in Paris. Before that, he was a research scientist for three years at Meta AI Research, where he led in particular the development of the AudioCraft framework (EnCodec, AudioGen, MusicGen). Alexandre completed his CIFRE PhD at Facebook AI Research and INRIA Paris, working in particular on music source separation (Demucs).

Recent advances in text-to-audio generation

The last year has been rich in groundbreaking releases in the field of text-to-audio and music generation. While Jukebox by Dhariwal et al. (2020) opened the way three years ago, it suffered from impractically slow generation and training complexity. The development of efficient discrete neural audio codecs such as SoundStream (Zeghidour et al. 2021) and EnCodec (Défossez et al. 2022) made it straightforward to apply auto-regressive Transformer-based modeling to audio. Nonetheless, audio representations have their own specificities compared with NLP tokens, and it takes special care to obtain a model that is stable and fast, and that generates high-quality outputs. In this talk we will present the recent evolutions in auto-regressive audio modeling, in particular the AudioGen (Kreuk et al. 2022) and MusicGen (Copet et al. 2023) models.
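One concrete example of the "special care" needed for codec tokens: codecs like EnCodec emit several parallel codebook streams per timestep, and MusicGen handles them with a codebook delay (interleaving) pattern, shifting codebook k by k steps so a single Transformer can predict one token per codebook at each step. Below is a minimal pure-Python sketch of that interleaving; the function names and the `PAD` placeholder are illustrative, not the actual AudioCraft implementation.

```python
PAD = -1  # illustrative placeholder for positions created by the shift


def apply_delay(codes):
    """Shift codebook k right by k steps (MusicGen-style delay pattern).

    codes: list of K lists, each of length T (one token stream per codebook).
    Returns K lists of length T + K - 1, padded with PAD where shifted.
    """
    K = len(codes)
    return [[PAD] * k + list(stream) + [PAD] * (K - 1 - k)
            for k, stream in enumerate(codes)]


def undo_delay(delayed):
    """Invert apply_delay, recovering the original aligned token grid."""
    K = len(delayed)
    T = len(delayed[0]) - (K - 1)
    return [delayed[k][k:k + T] for k in range(K)]


# Example: 3 codebooks, 4 timesteps.
codes = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
delayed = apply_delay(codes)
# delayed[1] == [-1, 5, 6, 7, 8, -1]
assert undo_delay(delayed) == codes
```

The delayed grid lets the model generate all K codebooks autoregressively in T + K - 1 steps instead of modeling the joint distribution over K tokens at once, which is one of the trade-offs discussed for MusicGen.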