MULTIFIN: A Dataset for Multilingual Financial NLP
Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed
Documents
- Fulltext
Publisher's published version, 395 KB, PDF document
Financial information is generated and distributed across the world, resulting in a vast amount of domain-specific multilingual data. Multilingual models adapted to the financial domain would ease deployment when an organization needs to work with multiple languages on a regular basis. For the development and evaluation of such models, there is a need for multilingual financial language processing datasets. We describe MULTIFIN, a publicly available financial dataset consisting of real-world article headlines covering 15 languages across different writing systems and language families. The dataset has a hierarchical label structure providing two classification tasks: multi-label and multi-class. We develop our annotation schema based on a real-world application and annotate our dataset using both ‘label by native-speaker’ and ‘translate-then-label’ approaches. The evaluation of several popular multilingual models, e.g., mBERT, XLM-R, and mT5, shows that although decent accuracy can be achieved in high-resource languages, there is substantial room for improvement in low-resource languages.
Original language | English |
---|---|
Title | EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL 2023 |
Publisher | Association for Computational Linguistics (ACL) |
Publication date | 2023 |
Pages | 864-879 |
ISBN (Electronic) | 9781959429470 |
Status | Published - 2023 |
Event | 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023 - Findings of EACL 2023 - Dubrovnik, Croatia; Duration: 2 May 2023 → 6 May 2023 |
Conference
Conference | 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023 - Findings of EACL 2023 |
---|---|
Country | Croatia |
City | Dubrovnik |
Period | 02/05/2023 → 06/05/2023 |
Sponsor | Adobe, Babelscape, Bloomberg Engineering, Duolingo, Liveperson |
Bibliographical note
Funding Information:
We thank PwC for providing the data and thank Lars Silberg Hansen for his support and valuable contribution to the creation of this dataset.
Publisher Copyright:
© 2023 Association for Computational Linguistics.
Links
- https://aclanthology.org/2023.findings-eacl.66
Publisher's published version
ID: 355143987