Validity of the large language model ChatGPT (GPT4) as a patient information source in otolaryngology by a variety of doctors in a tertiary otorhinolaryngology department
Publication: Journal contribution › Editorial › Research › peer-reviewed
Background
A high number of patients seek health information online, and large language models (LLMs) may produce a rising amount of it.
Aim
This study evaluates the quality of health information provided by ChatGPT, an LLM developed by OpenAI, focusing on its utility as a source of otolaryngology-related patient information.
Material and method
A variety of doctors from a tertiary otorhinolaryngology department used a Likert scale to assess the chatbot’s responses in terms of accuracy, relevance, and depth. The responses were also evaluated by ChatGPT.
Results
The composite mean of the three categories was 3.41, with the highest performance noted in the relevance category (mean = 3.71) when evaluated by the respondents. The accuracy and depth categories yielded mean scores of 3.51 and 3.00, respectively. All the categories were rated as 5 when evaluated by ChatGPT.
Conclusion and significance
Despite its potential for providing relevant and accurate medical information, the chatbot's responses lacked depth and may perpetuate biases owing to its training on publicly available text. In conclusion, while LLMs show promise in healthcare, further refinement is necessary to enhance response depth and mitigate potential biases.
Original language | English |
---|---|
Journal | Acta Oto-Laryngologica |
Volume | 143 |
Issue number | 9 |
Pages (from-to) | 779-782 |
Number of pages | 4 |
ISSN | 0001-6489 |
DOI | |
Status | Published - 2023 |
Bibliographic note
Publisher Copyright:
© 2023 Acta Oto-Laryngologica AB (Ltd).