The last year has been rich in groundbreaking releases in the field of text-to-audio and music generation. While Jukebox by Dhariwal et al. (2020) paved the way three years ago, it suffered from impractically slow generation and training complexity. The development of efficient discrete neural audio codecs such as SoundStream (Zeghidour et al. 2021) and EnCodec (Défossez et al. 2022) made it straightforward to apply auto-regressive Transformer-based modeling to audio. Nonetheless, audio representations have their own specificities compared with NLP tokens, and it takes special care to obtain a model that is stable, fast, and capable of generating high-quality outputs. In this talk we will present the recent evolutions in auto-regressive audio modeling, in particular the AudioGen (Kreuk et al. 2022) and MusicGen (Copet et al. 2023) models.
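To make the codec-token view concrete, the sketch below shows in plain PyTorch how such discrete codes can be modeled auto-regressively. It is an illustrative sketch, not the actual AudioGen/MusicGen code: the module sizes, the PAD id, and the CodecLM class are invented for the example, while the "delay" interleaving of the K parallel codebooks follows the pattern described in the MusicGen paper.

```python
# A minimal sketch, not the AudioGen/MusicGen implementation: model sizes,
# the PAD id, and the CodecLM class below are illustrative assumptions.
import torch
import torch.nn as nn

K, V, D, T = 4, 1024, 256, 100   # codebooks, codebook size, model dim, frames
PAD = V                          # extra token id padding the delayed positions

def delay_pattern(codes: torch.Tensor) -> torch.Tensor:
    """Shift codebook k right by k steps ("delay" interleaving), so one
    causal Transformer can emit all K tokens of a frame across K
    consecutive steps. codes: (B, K, T) -> (B, K, T + K - 1)."""
    B, K_, T_ = codes.shape
    out = torch.full((B, K_, T_ + K_ - 1), PAD, dtype=codes.dtype)
    for k in range(K_):
        out[:, k, k:k + T_] = codes[:, k]
    return out

class CodecLM(nn.Module):
    """Tiny decoder-only Transformer over summed codebook embeddings."""
    def __init__(self):
        super().__init__()
        self.emb = nn.ModuleList(nn.Embedding(V + 1, D) for _ in range(K))
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.heads = nn.ModuleList(nn.Linear(D, V) for _ in range(K))

    def forward(self, codes: torch.Tensor) -> torch.Tensor:
        # Sum the K codebook embeddings into one sequence of frame vectors.
        x = sum(emb(codes[:, k]) for k, emb in enumerate(self.emb))
        # Causal mask: each position attends only to itself and the past.
        t = x.size(1)
        mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        h = self.backbone(x, mask=mask)
        # One classification head per codebook: logits of shape (B, K, T', V).
        return torch.stack([head(h) for head in self.heads], dim=1)

codes = torch.randint(0, V, (2, K, T))    # stand-in for EnCodec-style tokens
logits = CodecLM()(delay_pattern(codes))  # shape (2, 4, 103, 1024)
```

The interleaving lets a single Transformer stage handle all K codebooks without running K separate passes per audio frame, one way to keep generation fast while preserving output quality.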