Self-Supervised Speech Representation Learning: A Review

Publication: Contribution to journal › Review › Research › peer-reviewed

Standard

Self-Supervised Speech Representation Learning: A Review. / Mohamed, Abdelrahman; Lee, Hung yi; Borgholt, Lasse; Havtorn, Jakob D.; Edin, Joakim; Igel, Christian; Kirchhoff, Katrin; Li, Shang Wen; Livescu, Karen; Maaloe, Lars; Sainath, Tara N.; Watanabe, Shinji.

In: IEEE Journal on Selected Topics in Signal Processing, Vol. 16, No. 6, 2022, pp. 1179-1210.

Harvard

Mohamed, A, Lee, HY, Borgholt, L, Havtorn, JD, Edin, J, Igel, C, Kirchhoff, K, Li, SW, Livescu, K, Maaloe, L, Sainath, TN & Watanabe, S 2022, 'Self-Supervised Speech Representation Learning: A Review', IEEE Journal on Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1179-1210. https://doi.org/10.1109/JSTSP.2022.3207050

APA

Mohamed, A., Lee, H. Y., Borgholt, L., Havtorn, J. D., Edin, J., Igel, C., Kirchhoff, K., Li, S. W., Livescu, K., Maaloe, L., Sainath, T. N., & Watanabe, S. (2022). Self-Supervised Speech Representation Learning: A Review. IEEE Journal on Selected Topics in Signal Processing, 16(6), 1179-1210. https://doi.org/10.1109/JSTSP.2022.3207050

Vancouver

Mohamed A, Lee HY, Borgholt L, Havtorn JD, Edin J, Igel C, et al. Self-Supervised Speech Representation Learning: A Review. IEEE Journal on Selected Topics in Signal Processing. 2022;16(6):1179-1210. https://doi.org/10.1109/JSTSP.2022.3207050

Author

Mohamed, Abdelrahman; Lee, Hung yi; Borgholt, Lasse; Havtorn, Jakob D.; Edin, Joakim; Igel, Christian; Kirchhoff, Katrin; Li, Shang Wen; Livescu, Karen; Maaloe, Lars; Sainath, Tara N.; Watanabe, Shinji. / Self-Supervised Speech Representation Learning: A Review. In: IEEE Journal on Selected Topics in Signal Processing. 2022; Vol. 16, No. 6, pp. 1179-1210.

Bibtex

@article{9cee81bbfc6e4057aae54ddfe0012d34,
title = "Self-Supervised Speech Representation Learning: A Review",
abstract = "Although supervised deep learning has revolutionized speech and audio processing, it has necessitated the building of specialist models for individual tasks and application scenarios. It is likewise difficult to apply this to dialects and languages for which only limited labeled data is available. Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains. Such methods have shown success in natural language processing and computer vision domains, achieving new levels of performance while reducing the number of labels required for many downstream scenarios. Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods. Other approaches rely on multi-modal data for pre-training, mixing text or visual data streams with speech. Although self-supervised speech representation is still a nascent research area, it is closely related to acoustic word embedding and learning with zero lexical resources, both of which have seen active research for many years. This review presents approaches for self-supervised speech representation learning and their connection to other research areas. Since many current methods focus solely on automatic speech recognition as a downstream task, we review recent efforts on benchmarking learned representations to extend the application beyond speech recognition.",
keywords = "Data models, Hidden Markov models, Representation learning, Self-supervised learning, Speech processing, speech representations, Task analysis, Training",
author = "Abdelrahman Mohamed and Lee, {Hung yi} and Lasse Borgholt and Havtorn, {Jakob D.} and Joakim Edin and Christian Igel and Katrin Kirchhoff and Li, {Shang Wen} and Karen Livescu and Lars Maaloe and Sainath, {Tara N.} and Shinji Watanabe",
note = "Publisher Copyright: IEEE",
year = "2022",
doi = "10.1109/JSTSP.2022.3207050",
language = "English",
volume = "16",
pages = "1179--1210",
journal = "IEEE Journal on Selected Topics in Signal Processing",
issn = "1932-4553",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "6",
}

RIS

TY - JOUR

T1 - Self-Supervised Speech Representation Learning

T2 - A Review

AU - Mohamed, Abdelrahman

AU - Lee, Hung yi

AU - Borgholt, Lasse

AU - Havtorn, Jakob D.

AU - Edin, Joakim

AU - Igel, Christian

AU - Kirchhoff, Katrin

AU - Li, Shang Wen

AU - Livescu, Karen

AU - Maaloe, Lars

AU - Sainath, Tara N.

AU - Watanabe, Shinji

N1 - Publisher Copyright: IEEE

PY - 2022

Y1 - 2022

N2 - Although supervised deep learning has revolutionized speech and audio processing, it has necessitated the building of specialist models for individual tasks and application scenarios. It is likewise difficult to apply this to dialects and languages for which only limited labeled data is available. Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains. Such methods have shown success in natural language processing and computer vision domains, achieving new levels of performance while reducing the number of labels required for many downstream scenarios. Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods. Other approaches rely on multi-modal data for pre-training, mixing text or visual data streams with speech. Although self-supervised speech representation is still a nascent research area, it is closely related to acoustic word embedding and learning with zero lexical resources, both of which have seen active research for many years. This review presents approaches for self-supervised speech representation learning and their connection to other research areas. Since many current methods focus solely on automatic speech recognition as a downstream task, we review recent efforts on benchmarking learned representations to extend the application beyond speech recognition.

AB - Although supervised deep learning has revolutionized speech and audio processing, it has necessitated the building of specialist models for individual tasks and application scenarios. It is likewise difficult to apply this to dialects and languages for which only limited labeled data is available. Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains. Such methods have shown success in natural language processing and computer vision domains, achieving new levels of performance while reducing the number of labels required for many downstream scenarios. Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods. Other approaches rely on multi-modal data for pre-training, mixing text or visual data streams with speech. Although self-supervised speech representation is still a nascent research area, it is closely related to acoustic word embedding and learning with zero lexical resources, both of which have seen active research for many years. This review presents approaches for self-supervised speech representation learning and their connection to other research areas. Since many current methods focus solely on automatic speech recognition as a downstream task, we review recent efforts on benchmarking learned representations to extend the application beyond speech recognition.

KW - Data models

KW - Hidden Markov models

KW - Representation learning

KW - Self-supervised learning

KW - Speech processing

KW - speech representations

KW - Task analysis

KW - Training

U2 - 10.1109/JSTSP.2022.3207050

DO - 10.1109/JSTSP.2022.3207050

M3 - Review

AN - SCOPUS:85139425711

VL - 16

SP - 1179

EP - 1210

JO - IEEE Journal on Selected Topics in Signal Processing

JF - IEEE Journal on Selected Topics in Signal Processing

SN - 1932-4553

IS - 6

ER -