On scaling contrastive representations for low-resource speech recognition
Publication: Contribution to book/anthology/report › Article in proceedings › Research › peer-reviewed
On scaling contrastive representations for low-resource speech recognition. / Borgholt, Lasse; Tax, Tycho M.S.; Havtorn, Jakob D.; Maaløe, Lars; Igel, Christian.
ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Vol. 2021-June, IEEE, 2021, pp. 3885-3889.
RIS
TY - GEN
T1 - On scaling contrastive representations for low-resource speech recognition
AU - Borgholt, Lasse
AU - Tax, Tycho M.S.
AU - Havtorn, Jakob D.
AU - Maaløe, Lars
AU - Igel, Christian
N1 - Publisher Copyright: © 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - Recent advances in self-supervised learning through contrastive training have shown that it is possible to learn a competitive speech recognition system with as little as 10 minutes of labeled data. However, these systems are computationally expensive since they require pre-training followed by fine-tuning in a large parameter space. We explore the performance of such systems without fine-tuning by training a state-of-the-art speech recognizer on the fixed representations from the computationally demanding wav2vec 2.0 framework. We find performance to decrease without fine-tuning and, in the extreme low-resource setting, wav2vec 2.0 is inferior to its predecessor. In addition, we find that wav2vec 2.0 representations live in a low-dimensional subspace and that decorrelating the features of the representations can stabilize training of the automatic speech recognizer. Finally, we propose a bidirectional extension to the original wav2vec framework that consistently improves performance.
KW - Automatic speech recognition
KW - Representation learning
KW - Self-supervised learning
KW - Semi-supervised learning
KW - Unsupervised learning
UR - http://www.scopus.com/inward/record.url?scp=85115116838&partnerID=8YFLogxK
U2 - 10.1109/ICASSP39728.2021.9414310
DO - 10.1109/ICASSP39728.2021.9414310
M3 - Article in proceedings
AN - SCOPUS:85115116838
VL - 2021-June
SP - 3885
EP - 3889
BT - ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
PB - IEEE
T2 - 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2021
Y2 - 6 June 2021 through 11 June 2021
ER -