Unsupervised Evaluation for Question Answering with Transformers

Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed

Standard

Unsupervised Evaluation for Question Answering with Transformers. / Muttenthaler, Lukas; Augenstein, Isabelle; Bjerva, Johannes.

Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, 2020. pp. 83-90.


Harvard

Muttenthaler, L, Augenstein, I & Bjerva, J 2020, Unsupervised Evaluation for Question Answering with Transformers. in Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, pp. 83-90, The 2020 Conference on Empirical Methods in Natural Language Processing, 16/11/2020. https://doi.org/10.18653/v1/2020.blackboxnlp-1.8

APA

Muttenthaler, L., Augenstein, I., & Bjerva, J. (2020). Unsupervised Evaluation for Question Answering with Transformers. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (pp. 83-90). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.blackboxnlp-1.8

Vancouver

Muttenthaler L, Augenstein I, Bjerva J. Unsupervised Evaluation for Question Answering with Transformers. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics. 2020. p. 83-90 https://doi.org/10.18653/v1/2020.blackboxnlp-1.8

Author

Muttenthaler, Lukas ; Augenstein, Isabelle ; Bjerva, Johannes. / Unsupervised Evaluation for Question Answering with Transformers. Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, 2020. pp. 83-90

Bibtex

@inproceedings{0814b8ae81b94c41a9ccb411626a6cfc,
title = "Unsupervised Evaluation for Question Answering with Transformers",
abstract = "It is challenging to automatically evaluate the answer of a QA model at inference time. Although many models provide confidence scores, and simple heuristics can go a long way towards indicating answer correctness, such measures are heavily dataset-dependent and are unlikely to generalise. In this work, we begin by investigating the hidden representations of questions, answers, and contexts in transformer-based QA architectures. We observe a consistent pattern in the answer representations, which we show can be used to automatically evaluate whether or not a predicted answer span is correct. Our method does not require any labelled data and outperforms strong heuristic baselines, across 2 datasets and 7 domains. We are able to predict whether or not a model{\textquoteright}s answer is correct with 91.37% accuracy on SQuAD, and 80.7% accuracy on SubjQA. We expect that this method will have broad applications, e.g., in semi-automatic development of QA datasets.",
author = "Lukas Muttenthaler and Isabelle Augenstein and Johannes Bjerva",
year = "2020",
doi = "10.18653/v1/2020.blackboxnlp-1.8",
language = "English",
pages = "83--90",
booktitle = "Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP",
publisher = "Association for Computational Linguistics",
note = "The 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020 ; Conference date: 16-11-2020 Through 20-11-2020",
url = "http://2020.emnlp.org",

}
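The abstract describes probing the hidden representations of questions, answers, and contexts in a transformer QA model to judge answer correctness. The following is a minimal, illustrative sketch of that kind of probing, assuming a HuggingFace extractive QA checkpoint (the model name, the mean-pooling, and the cosine-similarity heuristic are assumptions for demonstration, not the authors' exact procedure):

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Assumed checkpoint; any extractive QA model works the same way.
model_name = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

question = "Who wrote the paper?"
context = "The paper was written by Muttenthaler, Augenstein, and Bjerva."

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Predicted answer span from the start/end logits.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
answer = tokenizer.decode(inputs["input_ids"][0, start : end + 1])

# Last-layer hidden states: [seq_len, hidden_dim] after dropping the batch dim.
hidden = outputs.hidden_states[-1][0]

# Mean-pool the question tokens (everything between [CLS] and the first
# [SEP]) and the predicted answer-span tokens, then compare them.
sep_pos = int((inputs["input_ids"][0] == tokenizer.sep_token_id).nonzero()[0])
question_vec = hidden[1:sep_pos].mean(dim=0)
answer_vec = hidden[start : end + 1].mean(dim=0)

similarity = torch.cosine_similarity(question_vec, answer_vec, dim=0)
print(f"answer: {answer!r}, question-answer cosine similarity: {similarity:.3f}")

In the paper, a pattern in such representations is thresholded to predict whether the span is correct without labelled data; the single similarity score above only shows where those representations come from.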

RIS

TY - GEN

T1 - Unsupervised Evaluation for Question Answering with Transformers

AU - Muttenthaler, Lukas

AU - Augenstein, Isabelle

AU - Bjerva, Johannes

PY - 2020

Y1 - 2020

N2 - It is challenging to automatically evaluate the answer of a QA model at inference time. Although many models provide confidence scores, and simple heuristics can go a long way towards indicating answer correctness, such measures are heavily dataset-dependent and are unlikely to generalise. In this work, we begin by investigating the hidden representations of questions, answers, and contexts in transformer-based QA architectures. We observe a consistent pattern in the answer representations, which we show can be used to automatically evaluate whether or not a predicted answer span is correct. Our method does not require any labelled data and outperforms strong heuristic baselines, across 2 datasets and 7 domains. We are able to predict whether or not a model’s answer is correct with 91.37% accuracy on SQuAD, and 80.7% accuracy on SubjQA. We expect that this method will have broad applications, e.g., in semi-automatic development of QA datasets.

AB - It is challenging to automatically evaluate the answer of a QA model at inference time. Although many models provide confidence scores, and simple heuristics can go a long way towards indicating answer correctness, such measures are heavily dataset-dependent and are unlikely to generalise. In this work, we begin by investigating the hidden representations of questions, answers, and contexts in transformer-based QA architectures. We observe a consistent pattern in the answer representations, which we show can be used to automatically evaluate whether or not a predicted answer span is correct. Our method does not require any labelled data and outperforms strong heuristic baselines, across 2 datasets and 7 domains. We are able to predict whether or not a model’s answer is correct with 91.37% accuracy on SQuAD, and 80.7% accuracy on SubjQA. We expect that this method will have broad applications, e.g., in semi-automatic development of QA datasets.

U2 - 10.18653/v1/2020.blackboxnlp-1.8

DO - 10.18653/v1/2020.blackboxnlp-1.8

M3 - Article in proceedings

SP - 83

EP - 90

BT - Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

PB - Association for Computational Linguistics

T2 - The 2020 Conference on Empirical Methods in Natural Language Processing

Y2 - 16 November 2020 through 20 November 2020

ER -
