LLMs Cannot Be Relied Upon for Autonomous Medical Diagnostics and Treatment Decisions
Current foundational large language models are not ready for independent clinical decision-making due to unacceptably low and variable accuracy, a significant risk of hallucinations, and failure to adhere to clinical guidelines; they require physician oversight and enhancement strategies before safe deployment. 1, 2
Current Performance Limitations
Accuracy Variability
- Foundational LLM accuracy for medical diagnostics ranges widely from 6.4% to 91.4%, with ChatGPT-3.5 performing at 6.4-45.4% and ChatGPT-4 at 40-91.4% across digestive disease questions. 1
- This variability stems from lack of standardized reporting methodologies, absence of universally accepted accuracy definitions, and evaluation without reference to clinical guidelines. 1
- LLMs perform significantly worse than physicians in real-world clinical decision-making scenarios involving actual patient cases. 2
Critical Safety Concerns
- LLMs produce inaccurate results and hallucinations, generating fabricated information that poses serious patient safety risks. 1
- Models fail to follow diagnostic and treatment guidelines, cannot accurately interpret laboratory results, and do not integrate properly into clinical workflows. 2
- Even sophisticated models like ChatGPT provide incorrect companion diagnostic information (e.g., identifying CD19 expression as a prerequisite for axicabtagene ciloleucel therapy when it is not). 1
- LLMs are sensitive to both the quantity and order of information presented, failing to consistently follow instructions. 2
Strategies to Improve Accuracy Before Clinical Use
Retrieval Augmented Generation (RAG)
- Implementing RAG improves performance over baseline LLMs with an odds ratio of 1.35 (i.e., roughly 35% higher odds of a correct response) by integrating up-to-date, relevant information from trusted sources. 1
- RAG addresses the limitation that LLMs are trained on fixed datasets with knowledge cutoffs (e.g., GPT-4o only includes data through October 2023). 1
- However, RAG faces challenges with context-window limits and with retrieving the right information from the provided sources; a minimal sketch of the retrieve-then-prompt pattern follows this list. 1
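To make the pattern concrete, here is a minimal sketch of retrieve-then-prompt grounding. The keyword-overlap retriever and the `call_llm` callable are placeholders for a production embedding index and chat-completion API; this is an illustration under those assumptions, not the implementation used in the cited studies.

```python
from typing import Callable

def retrieve(question: str, guideline_snippets: list[str], k: int = 3) -> list[str]:
    """Rank guideline snippets by naive keyword overlap with the question."""
    q_terms = set(question.lower().split())
    scored = sorted(
        guideline_snippets,
        key=lambda s: len(q_terms & set(s.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question: str, snippets: list[str]) -> str:
    """Ground the model in retrieved guideline text rather than parametric memory alone."""
    context = "\n\n".join(f"[Source {i + 1}] {s}" for i, s in enumerate(snippets))
    return (
        "Answer using ONLY the guideline excerpts below. "
        "If they do not cover the question, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

def answer_with_rag(question: str, guideline_snippets: list[str],
                    call_llm: Callable[[str], str]) -> str:
    # Retrieving only the top-k snippets keeps the prompt within context-window limits.
    snippets = retrieve(question, guideline_snippets)
    return call_llm(build_prompt(question, snippets))
```

Restricting the answer to the supplied excerpts is what lets RAG compensate for knowledge cutoffs, but it also makes the quality of the retrieval step the limiting factor, as noted above.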
Supervised Fine-Tuning with Human Feedback
- Supervised Fine-Tuning (SFT) with Reinforcement Learning from Human Feedback (RLHF) represents a deeper adaptation method for infusing domain knowledge. 1
- This approach is computationally demanding and requires specialized medical expertise to implement effectively; a toy illustration of the SFT stage appears after this list. 1
- Domain-specific models like MedFound (176 billion parameters, trained on medical text and clinical records) demonstrate superior performance across common-disease, rare-disease, and external-validation scenarios. 3
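The toy loop below shows what the SFT stage amounts to: continued training of a causal language model on instruction/answer pairs with the standard next-token objective. The gpt2 base model, the single hard-coded example, and the hyperparameters are placeholders for illustration; this is not how MedFound or the systems in the cited studies were trained, and RLHF would follow as a separate stage.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; real clinical SFT starts from a much larger base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# In practice these pairs come from large, curated clinical corpora, not a hard-coded list.
examples = [
    "Question: What is first-line maintenance therapy for mild ulcerative colitis?\n"
    "Answer: Oral and/or topical 5-ASA, per current guidelines.",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for text in examples:
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    loss = model(**batch, labels=batch["input_ids"]).loss  # standard causal-LM objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# RLHF is a separate, later stage: a reward model trained on clinician preference rankings
# scores candidate answers, and a policy-optimization step (e.g., PPO) updates the SFT model.
```

The compute cost noted above comes from doing this at scale across billions of parameters and millions of examples, and the need for medical expertise comes from curating and ranking the training data.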
Clinical Application Framework
Current Appropriate Uses
- LLMs may assist with administrative tasks: reducing clinical documentation burden, extracting information from electronic health records, and evaluating clinical trial eligibility (illustrated in the sketch after this list). 1
- Potential for automated follow-ups in chronic disease management (e.g., diuretics in cirrhosis, immunosuppressants in inflammatory bowel disease). 1
- Support for determining follow-up intervals based on specific findings (e.g., pancreatic cyst imaging, colonoscopy intervals). 1
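A sketch of the kind of administrative assist described above: extracting structured fields from a free-text note for eligibility screening. The field names and the `call_llm` callable are hypothetical, and any extracted values would still be verified by study staff rather than acted on directly.

```python
import json
from typing import Callable

# Hypothetical field list for a trial-eligibility screen; real criteria are protocol-specific.
ELIGIBILITY_FIELDS = ["age", "primary_diagnosis", "performance_status", "prior_therapies"]

def extract_fields(note_text: str, call_llm: Callable[[str], str]) -> dict:
    """Ask the model for structured JSON and route anything malformed to manual review."""
    prompt = (
        "From the clinical note below, extract the following fields as a JSON object: "
        f"{', '.join(ELIGIBILITY_FIELDS)}. Use null for any field that is not stated.\n\n"
        f"Note:\n{note_text}"
    )
    raw = call_llm(prompt)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Do not guess: malformed output goes back to a human abstractor.
        return {"needs_manual_review": True, "raw_output": raw}
```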
Mandatory Safeguards
- Physician oversight is essential—LLMs should function as assistants within clinical workflows, not autonomous decision-makers. 1, 2
- Conversational interactions that include verification steps are needed rather than single-turn question-and-answer exchanges, since models may rationalize incorrect answers when challenged. 1
- Integration must align with clinical guidelines and incorporate human feedback mechanisms for quality assurance; a sketch of such a physician-review gate follows this list. 1
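One way to encode the "assistant, not decision-maker" pattern is to make every draft recommendation carry the guideline text it relied on and block its release until a physician signs off. The sketch below is a minimal illustration of that gate; `call_llm` and the data structure are assumptions, not an interface from the cited studies.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DraftRecommendation:
    question: str
    answer: str
    supporting_sources: list[str]
    physician_approved: bool = False

def draft(question: str, sources: list[str],
          call_llm: Callable[[str], str]) -> DraftRecommendation:
    """Generate a draft that keeps its supporting guideline excerpts attached."""
    prompt = "Guideline excerpts:\n" + "\n".join(sources) + f"\n\nQuestion: {question}"
    return DraftRecommendation(question, call_llm(prompt), sources)

def release(rec: DraftRecommendation) -> str:
    """Nothing reaches the patient record until a physician has approved the draft."""
    if not rec.physician_approved:
        raise PermissionError("Draft has not been reviewed and approved by a physician.")
    return rec.answer
```

Keeping the sources attached to the draft is what makes the review step practical: the reviewing physician can check the answer against the cited guideline text rather than against memory.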
Critical Pitfalls to Avoid
Do Not Trust Foundational Models Alone
- Baseline LLMs without enhancement strategies pose life-or-death risks in clinical settings where accuracy is paramount for patient safety. 1
- Models trained on broad datasets lack the specificity required for biomedical applications and may include unreliable sources. 1
Verify All Outputs
- Always cross-reference LLM recommendations against current clinical guidelines and evidence-based medicine. 1, 2
- Be aware that models may provide plausible-sounding but factually incorrect information with apparent confidence. 1
- References provided by LLMs may be fabricated placeholders rather than actual citations; one coarse automated check is sketched below. 1
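One cheap, automatable screen for fabricated citations is to check whether a cited DOI even resolves at doi.org. This only filters obvious placeholders; a resolving DOI says nothing about whether the paper actually supports the claim, so human verification is still required. The example DOI below is made up.

```python
import requests

def doi_resolves(doi: str) -> bool:
    """Return True if the DOI resolves at doi.org; resolution alone does not validate the claim."""
    try:
        resp = requests.head(f"https://doi.org/{doi}", allow_redirects=True, timeout=10)
        return resp.status_code < 400
    except requests.RequestException:
        return False

# A hallucinated placeholder DOI will typically fail to resolve.
print(doi_resolves("10.1234/placeholder-citation"))
```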