On scaling contrastive representations for low-resource speech recognition

Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed

Lasse Borgholt
Tycho M.S. Tax
Jakob D. Havtorn
Lars Maaløe
Christian Igel

Recent advances in self-supervised learning through contrastive training have shown that it is possible to learn a competitive speech recognition system with as little as 10 minutes of labeled data. However, these systems are computationally expensive since they require pre-training followed by fine-tuning in a large parameter space. We explore the performance of such systems without fine-tuning by training a state-of-the-art speech recognizer on the fixed representations from the computationally demanding wav2vec 2.0 framework. We find performance to decrease without fine-tuning and, in the extreme low-resource setting, wav2vec 2.0 is inferior to its predecessor. In addition, we find that wav2vec 2.0 representations live in a low-dimensional subspace and that decorrelating the features of the representations can stabilize training of the automatic speech recognizer. Finally, we propose a bidirectional extension to the original wav2vec framework that consistently improves performance.

Original language	English
Title	ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Volume	2021-June
Publisher	IEEE
Publication date	2021
Pages	3885-3889
DOI	https://doi.org/10.1109/ICASSP39728.2021.9414310
Status	Published - 2021
Event	2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2021 - Virtual, Toronto, Canada. Duration: 6 Jun 2021 → 11 Jun 2021

Conference

Conference	2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2021
Country	Canada
City	Virtual, Toronto
Period	06/06/2021 → 11/06/2021
Sponsor	The Institute of Electrical and Electronics Engineers Signal Processing Society

Bibliographical note

ID: 282683659