Enhancing CLARIN-DK Resources While Building the Danish ParlaMint Corpus

Publication: Contribution to book/anthology/report, conference contribution in proceedings, research, peer-reviewed

In this paper we describe the Danish CLARIN resources, corpora, tools and workflow that we used and enhanced in order to build the Danish ParlaMint corpus as part of the CLARIN-funded ParlaMint project. More specifically, the article accounts for the manual and automatic processes involved in preparing the Danish parliamentary speeches, with a focus on the CLARIN-DK tools and Text Tonsorium workflow management. The tools annotated the speeches with metadata and linguistic information in compliance with the common ParlaMint TEI P5 format. As a spin-off of the project, the CLARIN-DK sentence tokenizer and the CST Named Entity Recognizer were improved. These tools, together with the CST-lemmatiser, the Danish UD-Pipe software and several data transformation utilities, produced all the linguistic annotations in the correct format. We conclude the paper with a report on a pilot evaluation of the quality of some of the linguistic annotations in the Danish ParlaMint corpus.
Title: CLARIN Annual Conference 2021 Proceedings
Status: Published - 2021

ID: 279626708