On scaling contrastive representations for low-resource speech recognition
Publication: Contribution to book/anthology/report › Article in proceedings › Research › peer-reviewed
On scaling contrastive representations for low-resource speech recognition. / Borgholt, Lasse; Tax, Tycho M.S.; Havtorn, Jakob D.; Maaløe, Lars; Igel, Christian.
ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Vol. 2021-June, IEEE, 2021, pp. 3885-3889.
RIS
TY - GEN
T1 - On scaling contrastive representations for low-resource speech recognition
AU - Borgholt, Lasse
AU - Tax, Tycho M.S.
AU - Havtorn, Jakob D.
AU - Maaløe, Lars
AU - Igel, Christian
N1 - Publisher Copyright: © 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - Recent advances in self-supervised learning through contrastive training have shown that it is possible to learn a competitive speech recognition system with as little as 10 minutes of labeled data. However, these systems are computationally expensive since they require pre-training followed by fine-tuning in a large parameter space. We explore the performance of such systems without fine-tuning by training a state-of-the-art speech recognizer on the fixed representations from the computationally demanding wav2vec 2.0 framework. We find performance to decrease without fine-tuning and, in the extreme low-resource setting, wav2vec 2.0 is inferior to its predecessor. In addition, we find that wav2vec 2.0 representations live in a low-dimensional subspace and that decorrelating the features of the representations can stabilize training of the automatic speech recognizer. Finally, we propose a bidirectional extension to the original wav2vec framework that consistently improves performance.
KW - Automatic speech recognition
KW - Representation learning
KW - Self-supervised learning
KW - Semi-supervised learning
KW - Unsupervised learning
UR - http://www.scopus.com/inward/record.url?scp=85115116838&partnerID=8YFLogxK
U2 - 10.1109/ICASSP39728.2021.9414310
DO - 10.1109/ICASSP39728.2021.9414310
M3 - Article in proceedings
AN - SCOPUS:85115116838
VL - 2021-June
SP - 3885
EP - 3889
BT - ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
PB - IEEE
T2 - 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2021
Y2 - 6 June 2021 through 11 June 2021
ER -