LEARNING TO REASON ABOUT RARE DISEASES
THROUGH RETRIEVAL-AUGMENTED AGENTS
Ha Young Kim1, 2, Jun Li1, 3, Ana Beatriz Solana2, Carolin M. Pirkl2, Benedikt Wiestler1, 3, 4,
Julia A. Schnabel1, 3, 5, 6, *, Cosmin I. Bercea1, 3, *, On behalf of the PREDICTOM consortium
1Technical University of Munich, Munich, Germany
2GE HealthCare, Munich, Germany
3Munich Center for Machine Learning, Munich, Germany
4Klinikum Rechts der Isar, Munich, Germany
5Helmholtz Munich, Munich, Germany
6King’s College London, London, United Kingdom
ABSTRACT
Rare diseases represent the long tail of medical imaging,
where AI models often fail due to the scarcity of represen-
tative training data. In clinical workflows, radiologists fre-
quently consult case reports and literature when confronted
with unfamiliar findings. Following this line of reasoning,
we introduce RADAR (Retrieval-Augmented Diagnostic Reasoning Agents), an agentic system for rare disease detection
in brain MRI. Our approach uses AI agents with access to
external medical knowledge by embedding both case re-
ports and literature using sentence transformers and index-
ing them with FAISS to enable efficient similarity search.
The agent retrieves clinically relevant evidence to guide di-
agnostic decision-making on unseen diseases, without the
need for additional training. Designed as a model-agnostic
reasoning module, RADAR can be seamlessly integrated
with diverse large language models, consistently improving
their rare pathology recognition and interpretability. On the
NOVA dataset comprising 280 distinct rare diseases, RADAR
achieves up to a 10.2% performance gain, with the strongest
improvements observed for open-source models such as
DeepSeek. Beyond accuracy, the retrieved examples provide
interpretable, literature-grounded explanations, highlighting
retrieval-augmented reasoning as a powerful paradigm for
low-prevalence conditions in medical imaging. Code and
Details: https://anonymous.4open.science/r/RADAR-3232
Index Terms—medical imaging, brain disorders, disease
diagnosis, agentic AI, retrieval-augmented generation
1. INTRODUCTION
Rare brain disorders collectively affect millions worldwide, yet their diagnosis remains notoriously difficult due to the scarcity of expert knowledge, the heterogeneity of clinical presentations, and the extremely low prevalence of individual conditions. These challenges often result in misdiagnosis, delayed interventions, and inappropriate treatments, underscoring the need for diagnostic systems that are not only accurate but also interpretable and evidence-grounded [1].

* Co-Senior authors

Fig. 1: Overview of the proposed RADAR (Retrieval-Augmented Diagnostic Reasoning Agents) framework. The system employs coordinated agents that retrieve and integrate medical knowledge from external text-based databases to support diagnostic reasoning on rare diseases, achieving up to a 10.2% accuracy gain over non-agentic baselines.
Recent advances in artificial intelligence (AI), especially
large language models (LLMs) and agentic AI systems, have
demonstrated high potential in complex reasoning and clin-
ical decision support across medical domains [2, 3]. How-
ever, the current high-performing LLM-based systems oper-
ate as closed models and lack domain-specific medical train-
ing. Consequently, this often leads to misdiagnoses or hallu-
cinated recommendations in clinical settings [4].
Retrieval-augmented generation (RAG) addresses these limitations by combining the generative reasoning capabilities of LLMs with real-time access to sources of domain-specific knowledge [5, 6]. This paradigm has a notable resemblance to clinical workflows: when radiologists encounter unfamiliar findings, they consult case reports and literature to guide their reasoning. Inspired by this clinical
workflow, we propose RADAR (Retrieval-Augmented Di-
agnostic Reasoning Agents), an agentic system for retrieval-
augmented diagnostic reasoning that can dynamically retrieve
relevant medical evidence during inference, without requir-
ing additional fine-tuning or domain-specific retraining. This
approach enhances diagnostic accuracy while mitigating hallucinations: by explicitly linking model decisions to supporting evidence, RADAR ensures transparent, literature-backed, evidence-grounded outputs.
RADAR employs a set of coordinated agents that (i) re-
trieve semantically relevant case reports and literature from
external sources, (ii) ground diagnostic reasoning on re-
trieved evidence, and (iii) synthesize interpretable diagnostic
hypotheses. Our main contributions are:
• We introduce RADAR, a retrieval-augmented, model-
agnostic reasoning framework that integrates radiological
understanding from brain MRI with external medical
knowledge, improving diagnostic accuracy and inter-
pretability without additional training.
• We conduct a comprehensive evaluation on the NOVA
dataset covering 280 rare brain diseases, demonstrating
that retrieval-based reasoning consistently improves both
diagnostic accuracy and interpretability across diverse vi-
sion–language models, establishing a scalable path toward
trustworthy AI in data-scarce medical imaging.
2. METHOD
We propose RADAR (Retrieval-Augmented Diagnostic Reasoning Agents), a system designed to improve diagnostic reasoning for rare diseases by integrating multi-agent collaboration with retrieval-augmented generation, as illustrated in Figure 2 (d). RADAR iteratively generates diagnostic hypotheses, retrieves external medical knowledge, and refines diagnostic conclusions. The framework comprises three specialized agents: an initial doctor agent, a retrieval agent, and a final doctor agent, which collaborate to produce an interpretable, evidence-grounded diagnosis.
2.1. Initial Doctor Agent
The initial doctor agent is implemented as a large language
model (LLM) prompted to act as a diagnostic expert. It takes
the MRI image findings (image caption) and patient clinical
history data as input and produces a list of ten candidate diag-
noses:
f_init: C = (Image caption, Clinical data) ↦ {d_i}_{i=1}^{10}.
To promote diagnostic diversity, the underlying LLM is configured with a high temperature and top-p sampling. This stage provides a broad yet plausible hypothesis set to guide further reasoning.
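This stage can be sketched in Python as follows; the prompt template, sampling values, and numbered-list parser are illustrative assumptions, not the authors' exact implementation:

```python
import re

INIT_PROMPT = (
    "You are an expert neuroradiologist. Given the MRI findings and clinical "
    "history below, list the 10 most plausible diagnoses, one per line, "
    "numbered 1-10.\n\nFindings: {caption}\nClinical history: {history}"
)

# High temperature and top-p encourage diverse candidate hypotheses (Sec. 2.1).
SAMPLING = {"temperature": 1.0, "top_p": 0.95}

def parse_candidates(llm_output: str, n: int = 10) -> list[str]:
    """Extract up to n diagnoses {d_i} from a numbered-list LLM response."""
    candidates = []
    for line in llm_output.splitlines():
        match = re.match(r"\s*\d+[\.\)]\s*(.+)", line)
        if match:
            candidates.append(match.group(1).strip())
    return candidates[:n]

# Example with a mocked LLM response:
mock_response = "1. Glioblastoma\n2) Brain abscess\n3. Tumefactive demyelination"
print(parse_candidates(mock_response))
# → ['Glioblastoma', 'Brain abscess', 'Tumefactive demyelination']
```

In practice the prompt would be sent to the backbone LLM with the sampling parameters above, and the parser applied to its text response.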
2.2. RAG agent
The RAG agent assists the diagnostic reasoning process by
generating targeted queries, retrieving external evidence, and
synthesizing contextual answers. It operates in three stages:
1. Query generation. The query generation system takes the image caption and patient clinical data as input and uses an LLM to generate a set of question-keyword pairs:

f_LLM: C = (Image caption, Clinical data) ↦ {(q_i, k_i)}_{i=1}^{n},

where q_i is a question designed to enhance diagnostic reasoning, and k_i is the corresponding keyword extracted from q_i. To generate a wide variety of search queries helpful for diagnosis, the LLM is set with a high temperature and top-p.
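One simple way to obtain the {(q_i, k_i)} pairs is to request JSON from the LLM and validate it; the schema and helper below are illustrative assumptions, not the authors' exact prompt contract:

```python
import json

def parse_query_pairs(llm_output: str) -> list[tuple[str, str]]:
    """Parse (question, keyword) pairs from a JSON-formatted LLM response."""
    items = json.loads(llm_output)
    return [(item["question"], item["keyword"]) for item in items]

# Mocked LLM response following the assumed JSON schema:
mock = '[{"question": "What are the MRI features of CLIPPERS?", "keyword": "CLIPPERS"}]'
print(parse_query_pairs(mock))
```

A structured format like this makes the downstream retrieval step robust to free-form LLM phrasing.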
2. Knowledge retrieval and indexing. The knowledge retrieval and indexing system retrieves relevant information for each keyword k_i as follows:

• Internal check: If information about the keyword k_i exists in the internal knowledge base, the system retrieves the relevant documents. Otherwise, it updates its knowledge base dynamically by constructing a new index from data retrieved from an external source.

• External retrieval: The agent queries Radiopaedia [7] for each keyword k_i, retrieving 10 relevant documents: 5 from the articles section and 5 from the cases section. Each retrieved document is segmented into overlapping chunks {c_{i,j}}_{j=1}^{M_i}. These are embedded into dense vectors using the all-MiniLM-L6-v2 sentence transformer model (g_embed) [8]:

v_{i,j} = g_embed(c_{i,j}), v_{i,j} ∈ R^d.

The embeddings are then stored in the internal knowledge base as a FAISS index [9] to enable efficient similarity search (cosine similarity):

I ← I ∪ {v_{i,j}}_{j=1}^{M_i}.
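The chunking and indexing step can be sketched as follows. The chunk size and overlap are assumptions (the paper does not specify them), a toy embedding function stands in for all-MiniLM-L6-v2, and a NumPy matrix stands in for the FAISS index:

```python
import numpy as np

def chunk_text(text: str, size: int = 400, overlap: int = 100) -> list[str]:
    """Split a document into overlapping character chunks (sizes are assumed)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def toy_embed(text: str, d: int = 8) -> np.ndarray:
    """Stand-in for g_embed (all-MiniLM-L6-v2 in the paper)."""
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    v = rng.normal(size=d)
    return v / np.linalg.norm(v)  # L2-normalize: inner product = cosine similarity

class CosineIndex:
    """Minimal stand-in for a FAISS flat inner-product index over normalized vectors."""

    def __init__(self, d: int):
        self.vectors = np.empty((0, d))  # the knowledge base I
        self.chunks: list[str] = []
        self.d = d

    def add(self, chunks: list[str], embed=toy_embed):
        # I ← I ∪ {v_{i,j}}: embed each chunk and append to the index.
        vecs = np.stack([embed(c, self.d) for c in chunks])
        self.vectors = np.vstack([self.vectors, vecs])
        self.chunks.extend(chunks)

index = CosineIndex(d=8)
index.add(chunk_text("Radiopaedia article text " * 40))  # 1000-char toy document
```

With real embeddings, FAISS's flat inner-product index over L2-normalized vectors implements exactly this cosine-similarity search at scale.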
3. Answer generation. For each question q_i, the agent retrieves the top-k most relevant chunks based on cosine similarity:

R_i = Top-k(v_{q_i}, I, k = 5),

where v_{q_i} = g_embed(q_i). An LLM then analyzes the retrieved content and generates a concise answer to the question q_i based on the retrieved evidence. We configure this LLM with a low temperature to generate an answer based solely on the retrieved content.
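The Top-k operator above amounts to a nearest-neighbor search; a brute-force NumPy sketch (a stand-in for the FAISS search the paper uses) looks like this:

```python
import numpy as np

def top_k(query_vec: np.ndarray, index_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k index rows most similar to the query (cosine similarity).

    Assumes the rows of index_vecs and query_vec are L2-normalized, so the
    inner product equals cosine similarity.
    """
    sims = index_vecs @ query_vec   # cosine similarity per stored chunk
    k = min(k, len(index_vecs))
    return np.argsort(-sims)[:k]    # highest-similarity indices first

# Toy example with pre-normalized 2-D vectors:
chunks = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
query = np.array([0.8, 0.6])
print(top_k(query, chunks, k=2))  # → [2 0]
```

The returned indices select the chunks R_i that are passed, together with q_i, to the low-temperature answering LLM.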
Fig. 2: Comparison of multi-agent diagnostic reasoning setups. (a) A single-agent system: a single doctor agent generates
a diagnosis. (b) Collaborative system: agents exchange independent diagnoses and reach a consensus through discussion
rounds. (c) Challenger system: one agent introduces adversarial information to test the robustness of others, and (d) RADAR
(ours): retrieval-augmented framework where agents access external medical knowledge via Radiopaedia to refine and ground
diagnostic reasoning.
2.3. Final doctor agent
The final doctor agent integrates all available data, including the image caption, clinical information, retrieved knowledge, and the candidate diagnoses, to produce one primary diagnosis and four differential diagnoses:

f_final: (C, {d_i}, R) ↦ D_final = {d_primary, d_diff^(1-4)}.
This agent operates at a mid-range temperature setting to bal-
ance reasoning flexibility with factual consistency. By ex-
plicitly conditioning on retrieved evidence, it produces inter-
pretable, literature-grounded diagnostic outputs.
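The structure of D_final can be sketched as a simple packaging step; the field names and the example diagnoses below are assumptions for illustration only:

```python
def to_final_diagnosis(ranked: list[str]) -> dict:
    """Split a ranked five-item diagnosis list into primary + four differentials."""
    if len(ranked) < 5:
        raise ValueError("expected one primary and four differential diagnoses")
    return {"primary": ranked[0], "differentials": ranked[1:5]}

# Hypothetical ranked output from the final doctor agent:
ranked = ["CLIPPERS", "neurosarcoidosis", "CNS lymphoma", "glioma", "vasculitis"]
print(to_final_diagnosis(ranked)["primary"])  # → CLIPPERS
```

Keeping the output in a fixed schema like this is what makes the Top-1/Top-5 evaluation in Section 3 straightforward to compute.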
3. EXPERIMENTS
Our experiments aim to assess three key aspects of RADAR:
(1) whether retrieval-augmented reasoning improves diagnos-
tic accuracy for rare brain diseases, (2) whether the system
generalizes across diverse large language models (LLMs),
and (3) whether the retrieved evidence enhances interpretabil-
ity and trustworthiness.
3.1. Dataset and Metrics
We evaluate RADAR on the publicly available NOVA dataset [10], which includes around 900 brain MRI scans spanning 281 rare pathologies and multiple acquisition protocols. Each case provides patient clinical information and an expert-written caption describing the imaging findings. To mitigate potential bias toward a single phrasing, we paraphrased each caption into four alternative formulations using GPT-4o.

We measure diagnostic performance using Top-1 and Top-5 accuracy, following the evaluation protocol described in the original NOVA paper [10]. Top-1 indicates exact agreement with the ground-truth diagnosis, while Top-5 considers
whether the correct diagnosis appears among the five most
likely predictions. Because medical terminology may differ
across sources, each prediction is normalized via GPT-4o to
align synonyms and variant expressions.
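The two metrics reduce to a simple membership check over ranked predictions; in this sketch, exact string matching stands in for the GPT-4o synonym normalization described above:

```python
def topk_accuracy(predictions: list[list[str]], labels: list[str], k: int) -> float:
    """Fraction of cases whose (normalized) label appears in the top-k predictions.

    predictions[i] is the ranked diagnosis list for case i.
    """
    hits = sum(label in preds[:k] for preds, label in zip(predictions, labels))
    return hits / len(labels)

# Toy example with three cases (diagnosis names are illustrative):
preds = [["glioma", "abscess"], ["stroke", "ms"], ["ms", "stroke"]]
labels = ["glioma", "ms", "adem"]
print(round(topk_accuracy(preds, labels, k=1), 4))  # → 0.3333
print(round(topk_accuracy(preds, labels, k=2), 4))  # → 0.6667
```

Top-1 counts only the first-ranked prediction; Top-5 applies the same check with k = 5.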
3.2. Baselines and Model Configurations
To ensure a fair and comprehensive comparison, we evaluate
multiple reasoning paradigms across both closed- and open-
source LLMs. Specifically, we test two proprietary models
(GPT-4o [11] and Gemini-2.0-Flash [12]) and three open-
source models (Qwen3-32B [13], DeepSeek-R1-70B [14],
and MedGemma-27B [15]).
Each model is evaluated under three reasoning setups: a
single-agent system (Figure 2a), a collaborative multi-agent
system (Figure 2b), and a challenger multi-agent system (Figure 2c). In the single-agent system, an LLM generates a diagnostic output directly from the patient information. The collaborative system involves three independent doctor agents that provide initial diagnoses and subsequently engage in iterative discussions to reach consensus. In contrast, the challenger system introduces an adversarial agent that challenges the reasoning of the doctor agents. RADAR extends these setups by integrating retrieval-augmented reasoning, allowing
agents to access external medical knowledge to ground and
refine their diagnostic decisions as shown in Figure 2d. Im-
plementation details and prompts are available in the reposi-
tory.
Fig. 3: Examples of the results generated by our RADAR system. The ground-truth diagnosis is marked in bold.
4. RESULTS AND DISCUSSION
4.1. Results
We evaluate RADAR against the single-agent, collaborative,
and challenger multi-agent configurations. Tables 1 and 2
summarize Top-1 and Top-5 diagnostic accuracy across the
five LLM backbones. RADAR consistently outperforms all
baselines across models and metrics. For instance, it achieves
up to a +7.97 Top-1 improvement with Qwen3-32B and a +10.19
Top-5 improvement with DeepSeek-R1-70B over the single-
agent baseline. Performance gains are especially pronounced
for open-source models, suggesting that retrieval-augmented
reasoning compensates for smaller model capacity and lim-
ited medical pretraining. The collaborative and challenger
setups provide limited or inconsistent gains over the single-
agent baseline, and in some cases, even degrade performance.
This suggests that multi-agent interaction can amplify diagnostic uncertainty, particularly when the agents lack domain-specific knowledge. Our best performance was achieved with GPT-4o under RADAR, reaching a Top-1 accuracy of 54.40% and a Top-5 accuracy of 75.05%. For context, the NOVA paper [10] reports resident neuroradiologist performance of 48–52% Top-1 and 68–76% Top-5 accuracy, measured on a 25-case subset. RADAR achieves comparable accuracy evaluated across the full dataset, underscoring the potential of agentic diagnostic reasoning systems in clinical settings. Nevertheless, RADAR still relies on radiologist-provided captions; bridging the gap from textual reasoning to direct visual understanding remains an open challenge.
Figure 3 shows illustrative outputs from RADAR. It
demonstrates how the RAG agent formulates targeted diag-
nostic questions, retrieves relevant content from Radiopaedia,
and generates concise evidence-based answers strictly from
the retrieved text. The final doctor agent integrates this infor-
mation and gives a ranked list of differential diagnoses with
confidence estimates. In the second example, retrieved evidence causes RADAR to update its prediction, recovering the correct diagnosis that was initially ranked lower. This shows the system's capacity to adjust its reasoning by integrating new clinical information.

Table 1: Top-1 accuracy comparison

Model           | Single-agent | Collaboration | Challenge  | RADAR
Gemini-2.0      | 45.55±1.58   | 45.56±1.09    | 41.48±1.09 | 48.54±0.29
GPT-4o          | 49.94±0.86   | 48.61±1.11    | 47.68±2.26 | 54.40±1.02
Qwen3-32B       | 35.19±0.51   | 36.85±2.11    | 35.07±0.84 | 43.16±3.16
DeepSeek-R1-70B | 35.47±2.72   | 38.38±2.36    | 37.89±0.37 | 41.85±1.89
MedGemma-27B    | 31.00±1.20   | 33.64±1.04    | 36.06±1.05 | 38.23±1.43

Table 2: Top-5 accuracy comparison

Model           | Single-agent | Collaboration | Challenge  | RADAR
Gemini-2.0      | 66.21±1.40   | 65.09±0.56    | 68.98±1.22 | 72.10±2.34
GPT-4o          | 68.10±1.65   | 68.74±0.55    | 69.29±1.17 | 75.05±2.19
Qwen3-32B       | 52.54±0.39   | 54.35±1.23    | 55.01±2.45 | 62.51±2.71
DeepSeek-R1-70B | 54.62±2.53   | 60.18±2.02    | 59.11±1.85 | 64.81±1.52
MedGemma-27B    | 56.21±0.54   | 57.65±1.36    | 61.53±0.78 | 64.40±2.11
5. CONCLUSION
We introduced RADAR, a retrieval-augmented, agentic
framework for rare disease diagnosis in brain MRI. By cou-
pling large language models with external medical knowl-
edge, RADAR enhances diagnostic reasoning and provides
interpretable, evidence-grounded outputs. Our results demon-
strate that integrating retrieval mechanisms consistently im-
proves diagnostic accuracy—particularly for open-source
models—showing that explicit knowledge injection can com-
plement model size. Although the current system relies on
radiologist-provided captions rather than direct image inter-
pretation, bridging this gap between text-based reasoning
and image-based understanding represents a key direction for
future research.
6. ACKNOWLEDGMENTS
This project is supported by the Innovative Health Initia-
tive Joint Undertaking (IHI JU) under grant agreement No
101132356 as part of the project PREDICTOM.
PREDICTOM is supported by the Innovative Health Initia-
tive Joint Undertaking (IHI JU), under Grant Agreement No
101132356. JU receives support from the European Union’s
Horizon Europe research and innovation programme, CO-
CIR, EFPIA, EuropaBio, MedTechEurope and Vaccines Eu-
rope. The UK participants are supported by UKRI Grant
No 10083467 (National Institute for Health and Care Excel-
lence), Grant No 10083181 (King’s College London), and
Grant No 10091560 (University of Exeter). University of
Geneva is supported by the Swiss State Secretariat for Ed-
ucation, Research and Innovation Ref No 113152304. See
www.ihi.europa.eu for more details.
This work is supported by the DAAD programme under Konrad Zuse Schools of Excellence for Reliable AI (RelAI).
C.I.B. is funded via the EVUK program (“Next-generation AI for Integrated Diagnostics”) of the Free State of Bavaria.
7. REFERENCES
[1] Arrigo Schieppati, Jan-Inge Henter, Erica Daina, and
Anita Aperia, “Why rare diseases are an important medical and social issue,” The Lancet, vol. 371, no. 9629,
pp. 2039–2041, June 2008.
[2] Taeyoon Kwon, Kai Tzu-iunn Ong, Dongjin Kang, Se-
ungjun Moon, Jeong Ryong Lee, et al., “Large lan-
guage models are clinical reasoners: reasoning-aware
diagnosis framework with prompt-generated rationales,”
in Proceedings of the Thirty-Eighth AAAI Conference
on Artificial Intelligence and Thirty-Sixth Conference
on Innovative Applications of Artificial Intelligence and
Fourteenth Symposium on Educational Advances in Ar-
tificial Intelligence. 2024, AAAI’24/IAAI’24/EAAI’24,
AAAI Press.
[3] Thomas Savage, Ashwin Nayak, Robert Gallo, Ekanath
Rangan, and Jonathan H. Chen, “Diagnostic reasoning
prompts reveal the potential for large language model
interpretability in medicine,” npj Digital Medicine, vol.
7, no. 1, pp. 20, Jan. 2024.
[4] Zhihong Zhu, Yunyan Zhang, Xianwei Zhuang, Fan
Zhang, et al., “Can we trust AI doctors? a survey of
medical hallucination in large language and large vision-
language models,” in Findings of the Association for
Computational Linguistics: ACL 2025, Wanxiang Che
et al., Eds., Vienna, Austria, July 2025, pp. 6748–6769,
Association for Computational Linguistics.
[5] Fnu Neha, Deepshikha Bhati, and Deepak Kumar
Shukla, “Retrieval-augmented generation (RAG) in healthcare: A comprehensive review,” AI, vol. 6, no. 9, 2025.
[6] Omid Kohandel Gargari and Gholamreza Habibi, “En-
hancing medical ai with retrieval-augmented generation:
A mini narrative review,” Digital Health, vol. 11, pp.
20552076251337177, Apr. 2025, eCollection 2025 Jan-
Dec.
[7] Radiopaedia.org contributors, “Radiopaedia.org: The
peer-reviewed collaborative radiology resource,” 2005,
Accessed on 2025-09-30.
[8] Sentence Transformers, “all-minilm-l6-v2,” 2021,
Model card on Hugging Face.
[9] Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff
Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou, “The Faiss library,” 2025.
[10] Cosmin I. Bercea, Jun Li, Philipp Raffler, Evamaria O.
Riedel, Lena Schmitzer, Angela Kurz, Felix Bitzer,
Paula Roßmüller, Julian Canisius, Mirjam L. Beyrle, Che Liu, Wenjia Bai, Bernhard Kainz, Julia A. Schnabel, and Benedikt Wiestler, “NOVA: A benchmark for anomaly localization and clinical reasoning in brain MRI,” arXiv preprint arXiv:2505.14064, 2025.
[11] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama
Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo
Almeida, Janko Altenschmidt, Sam Altman, Shyamal
Anadkat, et al., “GPT-4 technical report,” arXiv preprint
arXiv:2303.08774, 2023.
[12] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Bur-
nell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien
Vincent, Zhufeng Pan, Shibo Wang, et al., “Gemini 1.5:
Unlocking multimodal understanding across millions of
tokens of context,” arXiv preprint arXiv:2403.05530,
2024.
[13] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang,
et al., “Qwen3 technical report,” 2025.
[14] DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang,
et al., “Deepseek-r1: Incentivizing reasoning capability
in llms via reinforcement learning,” 2025.
[15] Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroen-
sri, Atilla Kiraly, et al., “Medgemma technical report,”
2025.