The Hidden Folk: Linguistic Properties Encoded in Multilingual Contextual Character Representations
Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed
Standard
The Hidden Folk : Linguistic Properties Encoded in Multilingual Contextual Character Representations. / Agirrezabal, Manex; Boldsen, Sidsel; Hollenstein, Nora.
Proceedings of the Workshop on Computation and Written Language (CAWL 2023). Toronto : Association for Computational Linguistics, 2023. pp. 6-13.
RIS
TY - GEN
T1 - The Hidden Folk
T2 - Linguistic Properties Encoded in Multilingual Contextual Character Representations
AU - Agirrezabal, Manex
AU - Boldsen, Sidsel
AU - Hollenstein, Nora
PY - 2023
Y1 - 2023
N2 - To gain a better understanding of the linguistic information encoded in character-based language models, we probe the multilingual contextual CANINE model. We design a range of phonetic probing tasks in six Nordic languages, including Faroese as an additional zero-shot instance. We observe that some phonetic information is indeed encoded in the character representations, as consonants and vowels can be well distinguished using a linear classifier. Furthermore, results for the Danish and Norwegian languages seem to be worse for the consonant/vowel distinction in comparison to other languages. The information encoded in these representations can also be learned in a zero-shot scenario, as Faroese shows reasonably good performance in the same vowel/consonant distinction task.
AB - To gain a better understanding of the linguistic information encoded in character-based language models, we probe the multilingual contextual CANINE model. We design a range of phonetic probing tasks in six Nordic languages, including Faroese as an additional zero-shot instance. We observe that some phonetic information is indeed encoded in the character representations, as consonants and vowels can be well distinguished using a linear classifier. Furthermore, results for the Danish and Norwegian languages seem to be worse for the consonant/vowel distinction in comparison to other languages. The information encoded in these representations can also be learned in a zero-shot scenario, as Faroese shows reasonably good performance in the same vowel/consonant distinction task.
U2 - 10.18653/v1/2023.cawl-1.2
DO - 10.18653/v1/2023.cawl-1.2
M3 - Article in proceedings
SP - 6
EP - 13
BT - Proceedings of the Workshop on Computation and Written Language (CAWL 2023)
PB - Association for Computational Linguistics
CY - Toronto
ER -
ID: 374969148