Sony CSL


Learning Disentangled Representations of Timbre and Pitch for Musical Instrument Sounds Using Gaussian Mixture Variational Autoencoders

In this talk, I will share our empirical results on learning disentangled representations of musical instrument sounds using Gaussian mixture variational autoencoders (GMVAEs). Specifically, we disentangle note timbre from pitch by learning separate neural network encoders, one for a latent timbre variable and one for a latent pitch variable. The distributions of the two latent variables are regularized by distinct Gaussian mixture distributions. A neural network decoder takes the concatenation of the timbre and pitch variables as input and synthesizes sounds with the desired timbre and pitch. We evaluate the disentanglement both qualitatively and quantitatively, and the results demonstrate the model’s applicability to controllable sound synthesis and many-to-many timbre transfer.
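
The data flow described above (two encoders, a Gaussian mixture density on each latent, and a decoder fed the concatenated latents) can be illustrated with a minimal, untrained sketch in plain Python. Everything concrete here is an assumption made for illustration and is not taken from the talk: the single linear layers standing in for the neural networks, the feature and latent sizes, the mixture parameters, and all function and variable names. It shows only how timbre transfer would recombine latents, not how the model is trained.

```python
import math
import random


def init_linear(rng, n_out, n_in):
    """Random weights and zero biases for a toy linear layer (stand-in for a network)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(layer, x):
    """Apply a (weights, bias) layer to the vector x."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


class Encoder:
    """Maps an input feature vector to the mean and log-variance of a diagonal Gaussian."""

    def __init__(self, rng, n_in, n_latent):
        self.mean_layer = init_linear(rng, n_latent, n_in)
        self.logvar_layer = init_linear(rng, n_latent, n_in)

    def __call__(self, x):
        return linear(self.mean_layer, x), linear(self.logvar_layer, x)


def sample(rng, mean, logvar):
    """Reparameterized sample: z = mean + std * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, logvar)]


def mixture_log_density(z, means, logvars, weights):
    """Log density of z under a diagonal-covariance Gaussian mixture (log-sum-exp)."""
    log_terms = []
    for mean, logvar, weight in zip(means, logvars, weights):
        log_pdf = sum(
            -0.5 * (math.log(2.0 * math.pi) + lv + (zi - m) ** 2 / math.exp(lv))
            for zi, m, lv in zip(z, mean, logvar)
        )
        log_terms.append(math.log(weight) + log_pdf)
    largest = max(log_terms)
    return largest + math.log(sum(math.exp(t - largest) for t in log_terms))


def decode(decoder_layer, z_timbre, z_pitch):
    """The decoder input is the concatenation of the timbre and pitch latents."""
    return linear(decoder_layer, z_timbre + z_pitch)


# Hypothetical sizes: 8 input features, 3 timbre dimensions, 2 pitch dimensions.
N_FEATURES, N_TIMBRE, N_PITCH = 8, 3, 2
rng = random.Random(0)
timbre_encoder = Encoder(rng, N_FEATURES, N_TIMBRE)
pitch_encoder = Encoder(rng, N_FEATURES, N_PITCH)
decoder_layer = init_linear(rng, N_FEATURES, N_TIMBRE + N_PITCH)

# Two placeholder "sounds" (random feature vectors, not real audio).
sound_a = [rng.random() for _ in range(N_FEATURES)]
sound_b = [rng.random() for _ in range(N_FEATURES)]

# Timbre transfer: keep the pitch latent of sound A, take the timbre latent of sound B.
z_pitch_a = sample(rng, *pitch_encoder(sound_a))
z_timbre_b = sample(rng, *timbre_encoder(sound_b))
transferred = decode(decoder_layer, z_timbre_b, z_pitch_a)

# Log density of the pitch latent under a made-up two-component mixture.
pitch_log_prior = mixture_log_density(
    z_pitch_a,
    means=[[-1.0, 0.0], [1.0, 0.0]],
    logvars=[[0.0, 0.0], [0.0, 0.0]],
    weights=[0.5, 0.5],
)
```

In a trained model, a term like `pitch_log_prior` would enter the objective as part of the regularization on each latent, and swapping which sound supplies `z_timbre_b` is what would make the transfer many-to-many.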