Document: Evaluating Retrieval Augmented Generation-enhanced Large Language Models for Question Answering On German Neurovascular Guidelines

Bookmark URL: https://docserv.uni-duesseldorf.de/servlets/DocumentServlet?id=73023
URN (NBN): urn:nbn:de:hbz:061-20260422-111938-8
Collection: Publications
Language: English
Document type: Scientific texts » Article, essay
Media type: Text
Authors: Vach, Marius [Author]
Gliem, Michael [Author]
Weiss, Daniel [Author]
Ivan, Vivien Lorena [Author]
Hauke, Frederik [Author]
Boschenriedter, Christian [Author]
Rubbert, Christian [Author]
Caspers, Julian [Author]
Files: Adobe PDF, 827.5 KB in one file
Files dated 22.04.2026 / modified 22.04.2026
Keywords: Artificial intelligence, Evidence-based medicine, Large language model, Carotid stenosis, Neurovascular disease, Medical guidelines, Ischemic stroke
Description:

Purpose

To investigate the feasibility of Retrieval-augmented Generation (RAG)-enhanced Large Language Models (LLMs) in answering questions about two German neurovascular guidelines.
Methods

Four LLMs (GPT-4o-mini, Llama 3.1 405B Instruct Turbo, Mixtral 8 × 22B Instruct, and Claude 3.5 Sonnet) with RAG as well as GPT-4o-mini without RAG were evaluated for generating answers about two German neurovascular guidelines (“S3 Guideline for Diagnosis, Treatment, and Follow-up of Extracranial Carotid Stenosis” and “S2e Guideline for Acute Therapy of Ischemic Stroke”). The answers were classified as “correct”, “inaccurate”, or “incorrect” by two neurovascular experts in consensus. Additionally, retrieval performance of five retrieval strategies was analyzed on a synthetic dataset of 384 questions.
Results

Claude 3.5 Sonnet achieved the highest answer correctness (70.6% correct, 10.6% incorrect), followed by Llama 3.1 (64.7% correct, 15.3% incorrect), GPT-4o-mini with RAG (57.6% correct, 15.3% incorrect), and Mixtral (56.6% correct, 17.6% incorrect). GPT-4o-mini without RAG performed significantly worse (20.0% correct, 32.9% incorrect). Retrieval errors were the primary cause of incorrect answers (80%). For retrieval, BM25 achieved the highest accuracy (82.0%), outperforming vector-based methods such as “BAAI/bge-m3” (78.4%).
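The BM25 result above refers to classical lexical ranking. As a minimal illustration of how BM25 scores candidate passages for a query, here is a self-contained sketch in pure Python; the toy corpus, tokenization, and parameter values (k1 = 1.5, b = 0.75) are generic assumptions and not the retrieval pipeline used in the study:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with BM25."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # document frequency: in how many documents each term occurs
    df = Counter()
    for d in docs:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        dl = len(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # term-frequency saturation and document-length normalization
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * dl / avgdl)
            )
        scores.append(score)
    return scores

# Hypothetical guideline-style snippets, whitespace-tokenized for brevity
docs = [
    "carotid stenosis should be evaluated with duplex ultrasound".split(),
    "intravenous thrombolysis within 4.5 hours of ischemic stroke onset".split(),
    "follow up imaging after carotid endarterectomy".split(),
]
query = "carotid stenosis ultrasound".split()
scores = bm25_scores(query, docs)
best = scores.index(max(scores))  # the first snippet matches all query terms
```

In a RAG setup, the top-scoring passages would then be inserted into the LLM prompt as context before answer generation.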
Conclusion

RAG significantly improves LLM accuracy for medical guideline question answering compared to the inherent knowledge of pretrained LLMs alone while still showing significant error rates. Improved accuracy and confidence metrics are needed for safer implementation in clinical routine. Additionally, our results demonstrate the strong performance of general LLMs in medical question answering for non-English languages, such as German, even without specific training.
Legal notices: Original publication:
Vach, M., Gliem, M., Weiß, D. A., Ivan, V. L., Hauke, F., Boschenriedter, C., Rubbert, C., & Caspers, J. (2025). Evaluating Retrieval Augmented Generation-enhanced Large Language Models for Question Answering On German Neurovascular Guidelines. Clinical Neuroradiology, 36(1), 119–127. https://doi.org/10.1007/s00062-025-01562-z
License: Creative Commons license agreement
This work is licensed under a Creative Commons Attribution 4.0 International License
Faculty / Institution: Medizinische Fakultät
Document created on: 22.04.2026
Files modified on: 22.04.2026