Doctors are human, all too human (to borrow from Friedrich Nietzsche)!
"ChatGPT outperformed human physicians in assessing a series of medical case histories, a new, small study published in JAMA Network Open found—demonstrating the power of A.I. systems to be “doctor extenders,” providing niche insights or second opinions.
The study: 50 doctors and ChatGPT—and some doctors equipped with ChatGPT—were all fed the same medical case details and asked to provide a diagnosis. Each was graded on their ability to diagnose correctly, and on their ability to explain why they landed on potential diagnoses.
The results: The doctors operating alone had an average score of 74%. ChatGPT scored an average of 90%. Doctors using the chatbot got an average score of 76%—underscoring how doctors are often wedded to their own conclusions, despite the chatbot’s suggestions."
From the study's Key Points and Abstract:
"Key Points
- Question: Does the use of a large language model (LLM) improve diagnostic reasoning performance among physicians in family medicine, internal medicine, or emergency medicine compared with conventional resources?
- Findings: In a randomized clinical trial including 50 physicians, the use of an LLM did not significantly enhance diagnostic reasoning performance compared with the availability of only conventional resources.
- Meaning: In this study, the use of an LLM did not necessarily enhance diagnostic reasoning of physicians beyond conventional resources; further development is needed to effectively integrate LLMs into clinical practice.
Abstract
Importance
Large language models (LLMs) have shown promise in their performance on both multiple-choice and open-ended medical reasoning examinations, but it remains unknown whether the use of such tools improves physician diagnostic reasoning.
Objective
To assess the effect of an LLM on physicians’ diagnostic reasoning compared with conventional resources.
Design, Setting, and Participants
A single-blind randomized clinical trial was conducted from November 29 to December 29, 2023. Using remote video conferencing and in-person participation across multiple academic medical institutions, physicians with training in family medicine, internal medicine, or emergency medicine were recruited.
Intervention
Participants were randomized to either access the LLM in addition to conventional diagnostic resources or conventional resources only, stratified by career stage. Participants were allocated 60 minutes to review up to 6 clinical vignettes.
Main Outcomes and Measures
The primary outcome was performance on a standardized rubric of diagnostic performance based on differential diagnosis accuracy, appropriateness of supporting and opposing factors, and next diagnostic evaluation steps, validated and graded via blinded expert consensus.
Secondary outcomes included time spent per case (in seconds) and final diagnosis accuracy. All analyses followed the intention-to-treat principle. A secondary exploratory analysis evaluated the standalone performance of the LLM by comparing the primary outcomes between the LLM alone group and the conventional resource group.
Results
Fifty physicians (26 attendings, 24 residents; median years in practice, 3 [IQR, 2-8]) participated virtually as well as at 1 in-person site. The median diagnostic reasoning score per case was 76% (IQR, 66%-87%) for the LLM group and 74% (IQR, 63%-84%) for the conventional resources-only group, with an adjusted difference of 2 percentage points (95% CI, −4 to 8 percentage points; P = .60). The median time spent per case for the LLM group was 519 (IQR, 371-668) seconds, compared with 565 (IQR, 456-788) seconds for the conventional resources group, with a time difference of −82 (95% CI, −195 to 31; P = .20) seconds. The LLM alone scored 16 percentage points (95% CI, 2-30 percentage points; P = .03) higher than the conventional resources group.
Conclusions and Relevance
In this trial, the availability of an LLM to physicians as a diagnostic aid did not significantly improve clinical reasoning compared with conventional resources. The LLM alone demonstrated higher performance than both physician groups, indicating the need for technology and workforce development to realize the potential of physician-artificial intelligence collaboration in clinical practice."
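A quick way to read those primary numbers: the adjusted difference of 2 percentage points comes with a 95% CI of −4 to 8 percentage points. Because that interval spans zero, the trial is consistent with anything from a small decrement to a modest improvement when physicians had the LLM available, which is why the comparison is not statistically significant (P = .60). The LLM-alone difference of 16 percentage points, with a CI of 2 to 30, does exclude zero.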
"... Secondary Outcomes
...
Accuracy of the final diagnosis (eTable 3 in Supplement 2) using the ordinal scale showed the LLM intervention group had 1.4 times higher odds (95% CI, 0.7-2.8; P = .39) of a correct diagnosis than the control group. ..."
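To put that odds ratio in rough perspective (an illustration, not a figure from the paper): if the control group's odds of a correct final diagnosis were even, i.e., 1:1 or 50%, then 1.4 times higher odds would correspond to roughly 1.4 / (1 + 1.4) ≈ 58%. And because the 95% CI of 0.7 to 2.8 spans 1, the trial cannot distinguish that apparent advantage from no effect at all, consistent with the reported P = .39.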
Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial