Can Large Language Models (LLMs) be relied upon for accurate medical diagnostics and treatment decisions?

Last updated: December 19, 2025


LLMs Cannot Be Relied Upon for Autonomous Medical Diagnostics and Treatment Decisions

Current foundational large language models are not ready for independent clinical decision-making: their diagnostic accuracy is inconsistent, they are prone to hallucinations, and they fail to adhere to clinical guidelines. They require physician oversight and enhancement strategies before safe deployment. 1, 2

Current Performance Limitations

Accuracy Variability

  • Foundational LLM accuracy for medical diagnostics ranges widely from 6.4% to 91.4%, with ChatGPT-3.5 performing at 6.4-45.4% and ChatGPT-4 at 40-91.4% across digestive disease questions. 1
  • This variability stems from lack of standardized reporting methodologies, absence of universally accepted accuracy definitions, and evaluation without reference to clinical guidelines. 1
  • LLMs perform significantly worse than physicians in real-world clinical decision-making scenarios involving actual patient cases. 2

Critical Safety Concerns

  • LLMs produce inaccurate results and hallucinations, generating fabricated information that poses serious patient safety risks. 1
  • Models fail to follow diagnostic and treatment guidelines, cannot accurately interpret laboratory results, and do not integrate properly into clinical workflows. 2
  • Even sophisticated models like ChatGPT provide incorrect companion diagnostic information (e.g., stating that confirmed CD19 expression is a prerequisite for axicabtagene ciloleucel therapy when it is not). 1
  • LLMs are sensitive to both the quantity and the order of the information presented and do not consistently follow instructions. 2

Strategies to Improve Accuracy Before Clinical Use

Retrieval Augmented Generation (RAG)

  • RAG implementation shows an odds ratio of 1.35 for improved performance compared with baseline LLMs, achieved by integrating up-to-date, relevant information from trusted sources. 1
  • RAG addresses the limitation that LLMs are trained on fixed datasets with knowledge cutoffs (e.g., GPT-4o only includes data through October 2023). 1
  • However, RAG faces challenges with context-window limits and accurate information retrieval from provided sources. 1 A minimal sketch of the retrieval-and-prompting pattern follows this list.
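
The following is a minimal, illustrative sketch of the retrieval-and-prompting pattern behind RAG, not an implementation from the cited studies: a small, hypothetical guideline corpus is indexed with TF-IDF, the passages most similar to the clinical question are retrieved, and the prompt instructs the model to answer only from those passages.

```python
# Minimal retrieval-augmented prompting sketch (illustrative only, not from the cited studies).
# A small "guideline corpus" is indexed with TF-IDF; the top-matching passages are prepended
# to the clinical question before it is sent to an LLM, so answers are grounded in supplied text
# rather than the model's fixed training data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical guideline snippets; in practice these would come from vetted, up-to-date sources.
guideline_corpus = [
    "Pancreatic cysts 1-2 cm without worrisome features: repeat MRI in 12 months.",
    "Average-risk screening colonoscopy with no polyps: repeat in 10 years.",
    "Cirrhosis with ascites: start spironolactone, add furosemide if response is inadequate.",
]

vectorizer = TfidfVectorizer().fit(guideline_corpus)
corpus_matrix = vectorizer.transform(guideline_corpus)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k guideline passages most similar to the question."""
    scores = cosine_similarity(vectorizer.transform([question]), corpus_matrix)[0]
    top = scores.argsort()[::-1][:k]
    return [guideline_corpus[i] for i in top]

def build_prompt(question: str) -> str:
    """Compose an LLM prompt that treats the retrieved passages as the only allowed evidence."""
    context = "\n".join(f"- {p}" for p in retrieve(question))
    return (
        "Answer using ONLY the guideline excerpts below; say 'insufficient evidence' otherwise.\n"
        f"Guideline excerpts:\n{context}\n\nQuestion: {question}"
    )

print(build_prompt("How often should a 15 mm pancreatic cyst be re-imaged?"))
```

In practice the corpus would be a curated, regularly maintained guideline repository and the retriever would typically use dense embeddings rather than TF-IDF, but the grounding step is the same.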

Supervised Fine-Tuning with Human Feedback

  • Supervised Fine-Tuning (SFT) with Reinforcement Learning from Human Feedback (RLHF) represents a deeper adaptation method for infusing domain knowledge (a schematic SFT example follows this list). 1
  • This approach is computationally demanding and requires specialized medical expertise to implement effectively. 1
  • Domain-specific models like MedFound (176 billion parameters trained on medical text and clinical records) demonstrate superior performance across common, rare, and external validation scenarios. 3
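
As a rough illustration of what the supervised stage involves, the sketch below fine-tunes a small causal language model on two placeholder question-answer pairs with Hugging Face transformers. "gpt2" stands in for a clinically oriented base model, the data are invented for illustration only, and the RLHF stage described above is not shown.

```python
# Minimal supervised fine-tuning (SFT) sketch using Hugging Face transformers.
# "gpt2" is a small stand-in model and the Q&A pairs are illustrative placeholders;
# the RLHF stage that would follow in the text's described pipeline is omitted.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

pairs = [
    {"text": "Q: First-line diuretic for cirrhotic ascites?\nA: Spironolactone, per guidelines."},
    {"text": "Q: Surveillance interval after a normal colonoscopy?\nA: Ten years for average risk."},
]

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

dataset = Dataset.from_list(pairs).map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-demo", num_train_epochs=1,
                           per_device_train_batch_size=2, logging_steps=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()   # clinician review of outputs would follow before any deployment
```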

Clinical Application Framework

Current Appropriate Uses

  • LLMs may assist with administrative tasks: reducing clinical documentation burden, extracting information from electronic health records (sketched after this list), and evaluating clinical trial eligibility. 1
  • Potential for automated follow-ups in chronic disease management (e.g., diuretics in cirrhosis, immunosuppressants in inflammatory bowel disease). 1
  • Support for determining follow-up intervals based on specific findings (e.g., pancreatic cyst imaging, colonoscopy intervals). 1
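
One way such administrative assistance could look in code is sketched below: an LLM is asked to pull structured fields out of a free-text note, and the result is explicitly flagged for clinician review before it is used. The OpenAI client, model name, and note are illustrative placeholders; any comparable chat API could be substituted.

```python
# Sketch of LLM-assisted extraction of structured fields from a free-text clinical note.
# The OpenAI client and model name are placeholders; any output is a draft only and must
# be verified by the responsible clinician before it touches the record.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

note = "72F with decompensated cirrhosis, on spironolactone 100 mg daily, MELD 18."

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    response_format={"type": "json_object"},
    messages=[
        {"role": "system",
         "content": "Extract JSON with keys: age, sex, diagnosis, medications. "
                    "Use null for anything not explicitly stated; do not guess."},
        {"role": "user", "content": note},
    ],
)

draft = json.loads(response.choices[0].message.content)
draft["status"] = "PENDING_CLINICIAN_REVIEW"   # never auto-commit to the EHR
print(draft)
```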

Mandatory Safeguards

  • Physician oversight is essential—LLMs should function as assistants within clinical workflows, not autonomous decision-makers. 1, 2
  • Conversational interactions with verification steps are necessary rather than simple question-and-answer formats, as models can justify incorrect answers when challenged. 1 One such verification pattern is sketched after this list.
  • Integration must align with clinical guidelines and incorporate human feedback mechanisms for quality assurance. 1
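
A hedged sketch of such a conversational verification step follows. The `ask` callable is a placeholder for whatever chat-model endpoint is in use; the point is the pattern, not the provider: draft an answer, force a self-check against a supplied guideline excerpt, and route everything through a physician, with inconsistent answers flagged for closer review.

```python
# Sketch of a conversational verification step rather than one-shot Q&A.
# `ask` is a placeholder for any chat-model call (e.g., a hospital-hosted LLM endpoint).
from typing import Callable

def answer_with_verification(ask: Callable[[list[dict]], str],
                             question: str, guideline_excerpt: str) -> dict:
    history = [{"role": "user", "content": question}]
    draft = ask(history)

    # Second turn: make the model re-check its own draft against the supplied guideline text.
    history += [
        {"role": "assistant", "content": draft},
        {"role": "user",
         "content": "Re-check your answer strictly against this guideline excerpt:\n"
                    f"{guideline_excerpt}\n"
                    "Reply 'CONSISTENT' or 'INCONSISTENT' followed by one sentence of reasoning."},
    ]
    verdict = ask(history)

    # Anything short of an explicit pass is escalated; even a pass still requires sign-off.
    needs_review = not verdict.strip().upper().startswith("CONSISTENT")
    return {"draft": draft, "verdict": verdict,
            "route": "physician_review" if needs_review else "physician_signoff"}
```

Note that both routes end with a physician, consistent with the requirement that LLMs act as assistants rather than autonomous decision-makers.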

Critical Pitfalls to Avoid

Do Not Trust Foundational Models Alone

  • Baseline LLMs without enhancement strategies pose life-or-death risks in clinical settings where accuracy is paramount for patient safety. 1
  • Models trained on broad datasets lack the specificity required for biomedical applications and may include unreliable sources. 1

Verify All Outputs

  • Always cross-reference LLM recommendations against current clinical guidelines and evidence-based medicine. 1, 2
  • Be aware that models may provide plausible-sounding but factually incorrect information with apparent confidence. 1
  • References provided by LLMs may be fabricated placeholders rather than actual citations. 1 A simple existence check against PubMed is sketched below.
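
As one simple safeguard against fabricated references, the sketch below checks whether a citation's title returns any hits from PubMed via the public NCBI E-utilities ESearch endpoint. The title shown is a made-up example, and a zero count only means the reference needs manual verification, not that the underlying claim is false.

```python
# Sketch: check whether an LLM-supplied citation resolves to any PubMed record.
# Uses the public NCBI E-utilities ESearch endpoint; a zero hit count suggests the
# reference may be fabricated and the underlying claim needs manual verification.
import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_hits(citation_title: str) -> int:
    """Return the number of PubMed records whose title matches the cited title."""
    params = {"db": "pubmed", "term": f'"{citation_title}"[Title]', "retmode": "json"}
    data = requests.get(ESEARCH, params=params, timeout=10).json()
    return int(data["esearchresult"]["count"])

title = "Large language models in digestive disease diagnosis"  # example LLM-provided citation
if pubmed_hits(title) == 0:
    print("No PubMed match: treat this reference as unverified and re-check the claim manually.")
```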

Recognize Knowledge Limitations

  • LLMs cannot access information beyond their training cutoff dates, making them unreliable for recent medical advances. 1
  • Performance degrades significantly when dealing with rare diseases or complex clinical scenarios requiring nuanced guideline interpretation. 2

Professional Medical Disclaimer

This information is intended for healthcare professionals. Any medical decision-making should rely on clinical judgment and independently verified information. The content provided herein does not replace professional discretion and should be considered supplementary to established clinical guidelines. Healthcare providers should verify all information against primary literature and current practice standards before application in patient care. Dr.Oracle assumes no liability for clinical decisions based on this content.
