Document: Evaluating Retrieval Augmented Generation-enhanced Large Language Models for Question Answering On German Neurovascular Guidelines
| Title: | Evaluating Retrieval Augmented Generation-enhanced Large Language Models for Question Answering On German Neurovascular Guidelines |
| Bookmark URL: | https://docserv.uni-duesseldorf.de/servlets/DocumentServlet?id=73023 |
| URN (NBN): | urn:nbn:de:hbz:061-20260422-111938-8 |
| Collection: | Publications |
| Language: | English |
| Document type: | Scientific texts » Article, essay |
| Media type: | Text |
| Authors: | Vach, Marius [Author]; Gliem, Michael [Author]; Weiss, Daniel [Author]; Ivan, Vivien Lorena [Author]; Hauke, Frederik [Author]; Boschenriedter, Christian [Author]; Rubbert, Christian [Author]; Caspers, Julian [Author] |
| Keywords: | Artificial intelligence, Evidence-based medicine, Large language model, Carotid stenosis, Neurovascular disease, Medical guidelines, Ischemic stroke |
| Description: | Purpose: To investigate the feasibility of Retrieval-Augmented Generation (RAG)-enhanced Large Language Models (LLMs) in answering questions about two German neurovascular guidelines. Methods: Four LLMs (GPT-4o-mini, Llama 3.1 405B Instruct Turbo, Mixtral 8 × 22B Instruct, and Claude 3.5 Sonnet) with RAG, as well as GPT-4o-mini without RAG, were evaluated on generating answers about two German neurovascular guidelines ("S3 Guideline for Diagnosis, Treatment, and Follow-up of Extracranial Carotid Stenosis" and "S2e Guideline for Acute Therapy of Ischemic Stroke"). The answers were classified as "correct", "inaccurate", or "incorrect" by two neurovascular experts in consensus. Additionally, the performance of five retrieval strategies was analyzed on a synthetic dataset of 384 questions. Results: Claude 3.5 Sonnet achieved the highest answer correctness (70.6% correct, 10.6% wrong), followed by Llama 3.1 (64.7% correct, 15.3% wrong), GPT-4o-mini with RAG (57.6% correct, 15.3% wrong), and Mixtral (56.6% correct, 17.6% wrong). GPT-4o-mini without RAG performed significantly worse (20.0% correct, 32.9% wrong). Retrieval errors were the primary cause of incorrect answers (80%). For retrieval, BM25 achieved the highest accuracy (82.0%), outperforming vector-based methods such as "BAAI/bge-m3" (78.4%). Conclusion: RAG significantly improves LLM accuracy for medical guideline question answering compared to the inherent knowledge of pretrained LLMs alone, while still showing significant error rates. Improved accuracy and confidence metrics are needed for safer implementation in clinical routine. Additionally, our results demonstrate the strong performance of general LLMs in medical question answering for non-English languages, such as German, even without specific training. |
| Legal notices: | Original publication: Vach, M., Gliem, M., Weiß, D. A., Ivan, V. L., Hauke, F., Boschenriedter, C., Rubbert, C., & Caspers, J. (2025). Evaluating Retrieval Augmented Generation-enhanced Large Language Models for Question Answering On German Neurovascular Guidelines. Clinical Neuroradiology, 36(1), 119–127. https://doi.org/10.1007/s00062-025-01562-z |
| License: | This work is licensed under a Creative Commons Attribution 4.0 International License |
| Faculty / Institution: | Faculty of Medicine |
| Document created on: | 22.04.2026 |
| Files modified on: | 22.04.2026 |
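
The abstract reports that lexical BM25 retrieval (82.0% accuracy) outperformed vector-based embedding retrieval on the guideline chunks. The following is a minimal, stdlib-only sketch of Okapi BM25 scoring to illustrate the mechanism being compared; the example chunks and parameter values (k1 = 1.5, b = 0.75, standard defaults) are illustrative assumptions, not the authors' actual implementation or data.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of every document against a query (pure stdlib)."""
    n_docs = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n_docs
    df = Counter()                      # document frequency per term
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)                 # term frequency in this document
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            # idf variant that stays non-negative for common terms
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1)
            # term-frequency saturation and document-length normalization
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# Hypothetical guideline chunks, not text from the actual guidelines
chunks = [
    "carotid stenosis diagnosis and treatment recommendations",
    "acute therapy of ischemic stroke with thrombolysis",
    "follow up imaging after endovascular intervention",
]
docs = [c.lower().split() for c in chunks]
query = "carotid stenosis treatment".split()
scores = bm25_scores(query, docs)
best = scores.index(max(scores))        # index of the retrieved chunk
```

In a RAG pipeline of the kind the paper evaluates, the top-scoring chunk(s) would then be prepended to the LLM prompt as context; vector-based methods such as BAAI/bge-m3 replace this lexical scoring with cosine similarity over dense embeddings.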

