The Hidden Folk: Linguistic Properties Encoded in Multilingual Contextual Character Representations
Publication: Contribution to book/anthology/report › Article in proceedings › Research › peer-reviewed
Documents
- Full text
Final published version, 247 KB, PDF document
To gain a better understanding of the linguistic information encoded in character-based language models, we probe the multilingual contextual CANINE model. We design a range of phonetic probing tasks in six Nordic languages, including Faroese as an additional zero-shot instance. We observe that some phonetic information is indeed encoded in the character representations, as consonants and vowels can be well distinguished using a linear classifier. Furthermore, results for Danish and Norwegian seem to be worse on the consonant/vowel distinction than those for the other languages. The information encoded in these representations can also be learned in a zero-shot scenario, as Faroese shows reasonably good performance on the same vowel/consonant distinction task.
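The linear probing idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' setup: the vectors below are synthetic Gaussian clusters standing in for contextual character representations, and the names `make_synthetic_embeddings`, `train_linear_probe`, and `accuracy` are hypothetical helpers introduced only for this illustration. The point is simply that a linear probe is a single-layer classifier trained on frozen vectors, so its held-out accuracy indicates how linearly recoverable a property (here, vowel vs. consonant) is.

```python
import math
import random


def make_synthetic_embeddings(n_per_class=200, dim=8, seed=0):
    """Return (vector, label) pairs drawn from two Gaussian clusters.

    ASSUMPTION: these vectors are synthetic stand-ins for character
    representations; label 1 = "vowel", label 0 = "consonant".
    """
    rng = random.Random(seed)
    data = []
    for label, centre in ((0, -1.0), (1, 1.0)):
        for _ in range(n_per_class):
            vec = [rng.gauss(centre, 1.0) for _ in range(dim)]
            data.append((vec, label))
    rng.shuffle(data)
    return data


def sigmoid(z):
    # Clamp the logit so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(data, epochs=20, lr=0.1):
    """Fit logistic regression (a linear probe) with plain SGD."""
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for vec, label in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, vec)) + b)
            err = p - label  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, vec)]
            b -= lr * err
    return w, b


def accuracy(w, b, data):
    """Fraction of examples whose predicted class matches the label."""
    correct = 0
    for vec, label in data:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, vec)) + b > 0 else 0
        correct += pred == label
    return correct / len(data)


data = make_synthetic_embeddings()
split = int(0.8 * len(data))
train, test = data[:split], data[split:]
w, b = train_linear_probe(train)
test_accuracy = accuracy(w, b, test)
print(f"probe accuracy on held-out synthetic data: {test_accuracy:.2f}")
```

In an actual probing study the synthetic vectors would be replaced by representations extracted from the frozen model, and a zero-shot evaluation would score the trained probe on a language not seen while fitting it.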
Original language | English |
---|---|
Title | Proceedings of the Workshop on Computation and Written Language (CAWL 2023) |
Place of publication | Toronto |
Publisher | Association for Computational Linguistics |
Publication date | 2023 |
Pages | 6-13 |
DOI | |
Status | Published - 2023 |
ID: 374969148