S1000: a better taxonomic name corpus for biomedical information extraction

Publication: Contribution to journal › Journal article › Research › peer-reviewed

Documents

Full text
Publisher's published version, 772 KB, PDF document

Jouni Luoma
Katerina Nastou
Tomoko Ohta
Harttu Toivonen
Evangelos Pafilis
Lars Juhl Jensen
Sampo Pyysalo

Motivation: The recognition of mentions of species names in text is a critically important task for biomedical text mining. While deep learning-based methods have made great advances in many named entity recognition tasks, results for species name recognition remain poor. We hypothesize that this is primarily due to the lack of appropriate corpora.
Results: We introduce the S1000 corpus, a comprehensive manual re-annotation and extension of the S800 corpus. We demonstrate that S1000 makes highly accurate recognition of species names possible (F-score = 93.1%), both for deep learning and dictionary-based methods.
Availability and implementation: All resources introduced in this study are available under open licenses from https://jensenlab.org/resources/s1000/. The webpage contains links to a Zenodo project and three GitHub repositories associated with the study.

Original language	English
Article number	btad369
Journal	Bioinformatics
Volume	39
Issue number	6
Number of pages	8
ISSN	1367-4803
DOI	https://doi.org/10.1093/bioinformatics/btad369
Status	Published - 2023

Bibliographical note

Funding Information:
This work was supported by Novo Nordisk Foundation [grant number NNF14CC0001]; and by the Academy of Finland [grant number 332844]. K.N. has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Sklodowska-Curie [grant number 101023676].

Publisher Copyright:
© 2023 The Author(s).

ID: 360982850