Towards transferable speech emotion representation: on loss functions for cross-lingual latent representations
Publication: Contribution to book/anthology/report › Article in proceedings › Research › peer-reviewed
Standard
Towards transferable speech emotion representation: on loss functions for cross-lingual latent representations. / Das, Sneha; Lønfeldt, Nicole Nadine; Pagsberg, Anne Katrine; Clemmensen, Line H.
ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2022. pp. 6452-6456 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, Vol. 2022-May).
RIS
TY - GEN
T1 - Towards transferable speech emotion representation
T2 - 47th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022
AU - Das, Sneha
AU - Lønfeldt, Nicole Nadine
AU - Pagsberg, Anne Katrine
AU - Clemmensen, Line H.
N1 - Publisher Copyright: © 2022 IEEE
PY - 2022
Y1 - 2022
N2 - In recent years, speech emotion recognition (SER) has been used in wide-ranging applications, from healthcare to the commercial sector. In addition to signal processing approaches, methods for SER now also use deep learning techniques, which provide transfer learning possibilities. However, generalizing over languages, corpora, and recording conditions is still an open challenge. In this work, we address this gap by exploring loss functions that aid in transferability, specifically to non-tonal languages. We propose a variational autoencoder (VAE) with KL annealing and a semi-supervised VAE to obtain more consistent latent embedding distributions across data sets. To ensure transferability, the distribution of the latent embedding should be similar across non-tonal languages (data sets). We start by presenting a low-complexity SER based on a denoising autoencoder (DAE), which achieves an unweighted classification accuracy of 52.09% for four-class emotion classification. This performance is comparable to that of similar baseline methods. Following this, we employ a VAE, the semi-supervised VAE, and the VAE with KL annealing to obtain a more regularized latent space. We show that while the DAE has the highest classification accuracy among the methods, the semi-supervised VAE has a comparable classification accuracy and a more consistent latent embedding distribution over data sets.
KW - cross-lingual
KW - latent representation
KW - loss functions
KW - speech emotion recognition (SER)
KW - transfer learning
U2 - 10.1109/ICASSP43922.2022.9746450
DO - 10.1109/ICASSP43922.2022.9746450
M3 - Article in proceedings
AN - SCOPUS:85131228396
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 6452
EP - 6456
BT - ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing
PB - IEEE
Y2 - 23 May 2022 through 27 May 2022
ER -
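The abstract above refers to a VAE trained with KL annealing, where the weight on the KL term is ramped up during training so the model first learns to reconstruct before the latent space is regularized toward the prior. A minimal sketch of that loss structure is below; the linear schedule, the `anneal_steps` parameter, and the diagonal-Gaussian posterior are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of a KL-annealed VAE loss. Schedule and parameters are
# illustrative assumptions, not the configuration from the paper.
import math


def kl_divergence(mu, logvar):
    """KL(q(z|x) || N(0, I)) for a diagonal-Gaussian posterior,
    summed over latent dimensions."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar)
    )


def kl_weight(step, anneal_steps=1000):
    """Linear annealing: ramp beta from 0 to 1 over anneal_steps, so
    reconstruction dominates early training and the KL regularizer
    takes full effect later."""
    return min(1.0, step / anneal_steps)


def vae_loss(recon_error, mu, logvar, step):
    """Total loss = reconstruction term + beta(step) * KL term."""
    return recon_error + kl_weight(step) * kl_divergence(mu, logvar)
```

At step 0 the loss reduces to the pure reconstruction error (the DAE-like regime the abstract compares against); once `beta` reaches 1 it is the standard VAE objective.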
ID: 324664969