Medical diagnosis using Large Multimodal Models (LMMs) has gained increasing attention due to the capability of these models to provide precise diagnoses. These models generally combine medical questions with visual inputs to generate diagnoses or treatments. However, they are often overly general and ill-suited to the wide range of medical conditions encountered in real-world healthcare. In clinical practice, diagnosis is performed by multiple specialists, each contributing domain-specific expertise. To emulate this process, a potential solution is to deploy a dynamic multi-agent LMM framework in which each agent functions as a medical specialist. Current approaches in this emerging area typically rely on a static or predefined selection of specialists and therefore cannot adapt to changing practical scenarios. In this paper, we propose MedRoute, a flexible and dynamic multi-agent framework that comprises a collaborative system of specialist LMM agents. We further add a General Practitioner with an RL-trained router for dynamic specialist selection, and a Moderator that produces the final decision. In this way, our framework closely mirrors real clinical workflows. Extensive evaluations on text-based and image-based medical datasets demonstrate improved diagnostic accuracy, outperforming state-of-the-art baselines. Our work lays a strong foundation for future research.
The question, along with the image, is input to the General Practitioner (GP) Agent, which acts as a router for specialist allocation. The GP takes as input all potential agents from the specialist pool and allocates the first specialist based on the question (a Neurologist in this case). The Neurologist agent gives its diagnosis, which is passed back to the GP as history and used to route to the next specialist (a Radiologist here). The Radiologist agent's diagnosis in turn serves as history for the GP to consult the third specialist (a Neurosurgeon). The process repeats until the GP deems further consultation unnecessary. Finally, all diagnoses from the selected specialists are passed to the Moderator agent, which summarizes them and outputs the final decision. The working mechanism of the GP routing process is outlined below.
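For concreteness, the overall consultation loop can be sketched in a few lines of Python. This is a minimal sketch, assuming hypothetical agent interfaces (gp.route, specialist.diagnose, moderator.summarize) that are not defined in the paper; only the control flow (iterative GP routing over a specialist pool, with previous diagnoses fed back as history, followed by Moderator summarization) follows the description above.

```python
# Minimal sketch of the GP-driven consultation loop described above.
# The agent objects and their method names (route, diagnose, summarize)
# are hypothetical and used only for illustration.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Consultation:
    specialist: str   # e.g. "Neurologist"
    diagnosis: str    # free-text diagnosis from that specialist


def run_medroute(question: str,
                 image: Optional[bytes],
                 gp,                 # GP agent: names the next specialist or stops
                 specialist_pool,    # dict: specialist name -> specialist LMM agent
                 moderator,          # Moderator agent: summarizes all diagnoses
                 max_rounds: int = 5) -> str:
    history: List[Consultation] = []

    for _ in range(max_rounds):
        # The GP sees the question, image, and all previous diagnoses (history),
        # and either selects the next specialist or signals termination.
        next_specialist = gp.route(question, image, history, list(specialist_pool))
        if next_specialist is None:   # GP deems further consultation unnecessary
            break
        diagnosis = specialist_pool[next_specialist].diagnose(question, image, history)
        history.append(Consultation(next_specialist, diagnosis))

    # The Moderator aggregates all specialist opinions into the final decision.
    return moderator.summarize(question, image, history)
```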
The image (if the question is multimodal) is input to the image-captioner agent to capture the detail needed for specialist allocation. This caption, along with the question, is passed to TE(.) to generate a combined task embedding, which is simultaneously used to generate the Specialist Vector and the Specialist History Vector, representing the next specialist and the previously selected specialists respectively. In parallel, all k candidate specialists from the pool and the history (previous diagnoses) are embedded by SRE(.) and HE(.) respectively. Finally, these five embeddings (task, Specialist Vector, Specialist History Vector, candidate specialists, history) are concatenated and passed to a Routing Transformer. Its output is fed through an MLP that produces a k-dimensional vector used to determine the final route.
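The sketch below illustrates this routing network in PyTorch. The layer sizes, the mean-pooling step, and the way the Specialist Vector and Specialist History Vector are derived from the task embedding are assumptions for illustration; only the overall flow (embed, concatenate the five groups of embeddings, pass through a Routing Transformer, then an MLP producing k routing logits) follows the description above.

```python
# Illustrative sketch of the routing network; dimensions and pooling are assumptions.
import torch
import torch.nn as nn


class RoutingNetwork(nn.Module):
    def __init__(self, d_model: int = 256, num_specialists: int = 8):
        super().__init__()
        self.k = num_specialists
        # Projections standing in for the Specialist / Specialist History vectors.
        self.specialist_proj = nn.Linear(d_model, d_model)
        self.specialist_hist_proj = nn.Linear(d_model, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                                   batch_first=True)
        self.routing_transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, num_specialists))

    def forward(self,
                task_emb: torch.Tensor,        # (B, d): TE(question + caption)
                candidate_embs: torch.Tensor,  # (B, k, d): SRE(specialist pool)
                history_emb: torch.Tensor      # (B, d): HE(previous diagnoses)
                ) -> torch.Tensor:
        # Specialist Vector and Specialist History Vector from the task embedding.
        spec_vec = self.specialist_proj(task_emb)
        spec_hist_vec = self.specialist_hist_proj(task_emb)

        # Concatenate the five groups of embeddings as a token sequence.
        tokens = torch.cat([task_emb.unsqueeze(1),
                            spec_vec.unsqueeze(1),
                            spec_hist_vec.unsqueeze(1),
                            candidate_embs,
                            history_emb.unsqueeze(1)], dim=1)   # (B, k+4, d)

        encoded = self.routing_transformer(tokens)               # (B, k+4, d)
        pooled = encoded.mean(dim=1)                             # pooling strategy assumed
        return self.mlp(pooled)                                  # (B, k) routing logits


# Usage: next specialist index = RoutingNetwork()(task, candidates, hist).argmax(dim=-1)
```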
The input image shows axial CT scans of the heart and an ECG, which the question asks about. From the variety of specialists in the pool, the GP agent first selects a Cardiologist, which analyzes the ECG and gives its diagnosis. Based on this diagnosis, the GP consults a Thoracic Surgeon, which gives a diagnosis consistent with that of the previous agent. Finally, the GP routes to a Hematologist agent, which also gives its own diagnosis. All three diagnoses are passed to the Moderator agent, which interprets and summarizes them to give the final diagnosis: “Prominent enlargement of the right atrium and right ventricle”.
Our model can perform diagnosis on both text-based and image-based medical questions. We compare its performance against other Large Language/Multimodal Models and multimodal agentic frameworks on two text-only datasets and three image-text datasets.
The table shows results on the text-only datasets (MedQA and PubMedQA), which are of comparable size with 1273 and 1000 samples respectively. Our model outperforms the state-of-the-art model on both datasets, by ∼6% and ∼2% respectively. Among the baselines, GPT-4.1-mini performs best on MedQA with 85.86% accuracy, while MAM performs best on PubMedQA. In general, the medical models fare much better than general-purpose LLMs (with the exception of GPT), which is the expected trend.
We also evaluate on image-text multimodal medical datasets (PMC-VQA, DeepLesion, and PathVQA), as shown in the table. All image-text datasets are larger than their text-only counterparts: PMC-VQA is the closest in size with 2000 samples, while DeepLesion and PathVQA are much larger with 4927 and 6719 samples respectively. PMC-VQA consists of general medical questions, DeepLesion has QA pairs constructed from coarse lesion class labels, and PathVQA consists of open-ended questions. Our model again shows substantial improvement over the state-of-the-art agentic framework on all three datasets, especially on DeepLesion (∼5.5%). Among the vision models we observe a mixed trend: medical VLMs perform better than general-purpose VLMs on some datasets and worse on others.