Achiam, J. et al. GPT-4 technical report. Preprint at arXiv (2023).
Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. Preprint at arXiv (2023).
Anil, R. et al. PaLM 2 technical report. Preprint at arXiv (2023).
Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th International Conference on Machine Learning (eds Meila, M. & Zhang, T.) 8748–8763 (PMLR, 2021).
Tung, C., Lin, Y., Yin, J., Ye, Q. & Chen, H. Exploring vision language pretraining with knowledge enhancement via large language model. In International Workshop on Trustworthy Artificial Intelligence for Healthcare 81–91 (Springer, 2024).
Xu, Y. et al. A multimodal knowledge-enhanced whole-slide pathology foundation model. Nat. Commun. 16, 11406 (2025).
Zhu, D., Chen, J., Shen, X., Li, X. & Elhoseiny, M. MiniGPT-4: enhancing vision-language understanding with advanced large language models. In Proc. 12th International Conference on Learning Representations (OpenReview.net, 2024).
Liu, H., Li, C., Wu, Q. & Lee, Y. J. Visual instruction tuning. Adv. Neural Inf. Process. Syst. 36, 34892–34916 (2023).
Chen, Z. et al. InternVL: scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 24185–24198 (IEEE, 2024).
Antol, S. et al. VQA: visual question answering. In Proc. IEEE International Conference on Computer Vision 2425–2433 (IEEE, 2015).
Lin, T.-Y. et al. Microsoft COCO: common objects in context. In Proc. 13th European Conference on Computer Vision (eds Fleet, D., Pajdla, T., Schiele, B. & Tuytelaars, T.) 740–755 (Springer, 2014).
Li, C. et al. LLaVA-Med: training a large language-and-vision assistant for biomedicine in one day. Adv. Neural Inf. Process. Syst. 36, 28541–28564 (2024).
Moor, M. et al. Med-Flamingo: a multimodal medical few-shot learner. In Proc. 3rd Machine Learning for Health Symposium (eds Hegselmann, S. et al.) 353–367 (PMLR, 2023).
Wu, C. et al. Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data. Nat. Commun. 16, 7866 (2025).
Tu, T. et al. Towards generalist biomedical AI. NEJM AI 1, AIoa2300138 (2024).
Moor, M. et al. Foundation models for generalist medical artificial intelligence. Nature 616, 259–265 (2023).
Zhou, H.-Y., Acosta, J. N., Adithan, S., Datta, S., Topol, E. J. & Rajpurkar, P. MedVersa: a generalist foundation model for diverse medical imaging tasks. NEJM AI 3, e2500595 (2026).
He, X. et al. Towards visual question answering on pathology images. In Proc. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) (eds Zong, C., Xia, F., Li, W. & Navigli, R.) 708–718 (Association for Computational Linguistics, 2021).
Ben Abacha, A. et al. VQA-Med: overview of the medical visual question answering task at ImageCLEF 2019. In Working Notes of CLEF 2019 (eds Cappellato, L., Ferro, N., Losada, D. E. & Müller, H.) (CEUR-WS.org, 2019).
Johnson, A. E. W. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6, 317 (2019).
Demner-Fushman, D. et al. Preparing a collection of radiology examinations for distribution and retrieval. J. Am. Med. Inform. Assoc. 23, 304–310 (2016).
Jin, H., Che, H., Lin, Y. & Chen, H. PromptMRG: diagnosis-driven prompts for medical report generation. In Proc. 38th AAAI Conference on Artificial Intelligence 2607–2615 (AAAI Press, 2024).
Chen, Z., Luo, L., Bie, Y. & Chen, H. Dia-LLaMA: towards large language model-driven CT report generation. In Proc. 28th International Conference on Medical Image Computing and Computer-Assisted Intervention (eds Gee, J. C. et al.) 141–151 (Springer, 2025).
Nguyen, H. T. et al. VinDr-SpineXR: a deep learning framework for spinal lesions detection and classification from radiographs. In Proc. 24th International Conference on Medical Image Computing and Computer-Assisted Intervention (eds de Bruijne, M. et al.) 291–301 (Springer, 2021).
Yang, J. et al. MedMNIST v2 – a large-scale lightweight benchmark for 2D and 3D biomedical image classification. Sci. Data 10, 41 (2023).
Wang, D. et al. A real-world dataset and benchmark for foundation model adaptation in medical image classification. Sci. Data 10, 574 (2023).
Hu, Y. et al. OmniMedVQA: a new large-scale comprehensive evaluation benchmark for medical LVLM. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 22170–22183 (IEEE, 2024).
Zong, Z. et al. MoVA: adapting mixture of vision experts to multimodal context. Adv. Neural Inf. Process. Syst. 37, 103305–103333 (2024).
Nath, V. et al. VILA-M3: enhancing vision-language models with medical expert knowledge. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 14788–14798 (IEEE, 2025).
Xiong, C. et al. MoME: mixture of multimodal experts for cancer survival prediction. In Proc. 27th International Conference on Medical Image Computing and Computer-Assisted Intervention (eds Linguraru, M. G. et al.) 318–328 (Springer, 2024).
Li, B. et al. MMedAgent: learning to use medical tools with multi-modal agent. In Findings of the Association for Computational Linguistics: EMNLP 2024 (eds Al-Onaizan, Y., Bansal, M. & Chen, Y.-N.) 8745–8760 (Association for Computational Linguistics, 2024).
Pham, H. H., Tran, T. T. & Nguyen, H. Q. VinDr-PCXR: an open, large-scale pediatric chest X-ray dataset for interpretation of common thoracic diseases. PhysioNet (version 1.0.0) (2022).
Tschandl, P., Rosendahl, C. & Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data 5, 1–9 (2018).
Panchal, S. et al. Retinal Fundus Multi-disease Image Dataset (RFMID) 2.0: a dataset of frequently and rarely identified diseases. Data 8, 29 (2023).
Wang, X. et al. ChestX-ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 2097–2106 (IEEE, 2017).
Pacheco, A. G. C. et al. PAD-UFES-20: a skin lesion dataset composed of patient data and clinical images collected from smartphones. Data Br. 32, 106221 (2020).
Smedsrud, P. H. et al. Kvasir-Capsule, a video capsule endoscopy dataset. Sci. Data 8, 142 (2021).
Demner-Fushman, D., Antani, S., Simpson, M. & Thoma, G. R. Design and development of a multimodal biomedical information retrieval system. J. Comput. Sci. Eng. 6, 168–177 (2012).
Lau, J. J., Gayen, S., Ben Abacha, A. & Demner-Fushman, D. A dataset of clinically generated visual questions and answers about radiology images. Sci. Data 5, 1–10 (2018).
Jacobs, R. A., Jordan, M. I., Nowlan, S. J. & Hinton, G. E. Adaptive mixtures of local experts. Neural Comput. 3, 79–87 (1991).
Fedus, W., Zoph, B. & Shazeer, N. Switch Transformers: scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res. 23, 1–39 (2022).
Luo, L. et al. A large model for non-invasive and personalized management of breast cancer from multiparametric MRI. Nat. Commun. 16, 3647 (2025).
Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process. Syst. 33, 9459–9474 (2020).
Subramanian, M., Shanmugavadivel, K., Naren, O. S., Premkumar, K. & Rankish, K. Classification of retinal OCT images using deep learning. In 2022 International Conference on Computer Communication and Informatics 1–7 (IEEE, 2022).
Liu, B. et al. SLAKE: a semantically-labeled knowledge-enhanced dataset for medical visual question answering. In IEEE 18th International Symposium on Biomedical Imaging 1650–1654 (IEEE, 2021).
Zhang, X. et al. Development of a large-scale medical visual question-answering dataset. Commun. Med. 4, 277 (2024).
Doerrich, S., Di Salvo, F., Brockmann, J. & Ledig, C. Rethinking model prototyping through the MedMNIST+ dataset collection. Sci. Rep. 15, 7669 (2025).
Wang, Z., Liu, L., Wang, L. & Zhou, L. R2GenGPT: radiology report generation with frozen LLMs. Meta-Radiol. 1, 100033 (2023).
Yang, A. et al. Qwen2.5 technical report. Preprint at arXiv (2024).
Tang, X. et al. MedAgents: large language models as collaborators for zero-shot medical reasoning. In Findings of the Association for Computational Linguistics: ACL 2024 (eds Ku, L.-W., Martins, A. & Srikumar, V.) 599–621 (Association for Computational Linguistics, 2024).
Zhou, J., Liu, Z., Xiao, S., Zhao, B. & Xiong, Y. VISTA: visualized text embedding for universal multi-modal retrieval. In Proc. 62nd Annual Meeting of the Association for Computational Linguistics (eds Ku, L.-W., Martins, A. & Srikumar, V.) 3185–3200 (Association for Computational Linguistics, 2024).
Yang, P. et al. LMKG: a large-scale and multi-source medical knowledge graph for intelligent medicine applications. Knowl. Based Syst. 284, 111323 (2024).
Ni, P., Okhrati, R., Guan, S. & Chang, V. Knowledge graph and deep learning-based text-to-graphQL model for intelligent medical consultation chatbot. Inform. Syst. Front. 26, 137–156 (2024).
Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 35, 24824–24837 (2022).
Snell, C., Lee, J., Xu, K. & Kumar, A. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In Proc. 13th International Conference on Learning Representations (OpenReview.net, 2025).
Hu, E. J. et al. LoRA: low-rank adaptation of large language models. In Proc. 10th International Conference on Learning Representations (OpenReview.net, 2022).
Rajbhandari, S., Ruwase, O., Rasley, J., Smith, S. & He, Y. ZeRO-Infinity: breaking the GPU memory wall for extreme scale deep learning. In Proc. International Conference for High Performance Computing, Networking, Storage and Analysis Article 59 (ACM, 2021).
Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proc. 3rd International Conference on Learning Representations (eds Bengio, Y. & LeCun, Y.) (2015).
Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012).
Kermany, D. S. et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172, 1122–1131 (2018).
Stein, A. et al. RSNA Pneumonia Detection Challenge (Kaggle, 2018).
Al-Dhabyani, W., Gomaa, M., Khaled, H. & Fahmy, A. Dataset of breast ultrasound images. Data Br. 28, 104863 (2020).
Kather, J. N. et al. Predicting survival from colorectal cancer histology slides using deep learning: a retrospective multicenter study. PLoS Med. 16, e1002730 (2019).
Sawyer-Lee, R., Gimenez, F., Hoogi, A. & Rubin, D. Curated Breast Imaging Subset of Digital Database for Screening Mammography (CBIS-DDSM) (The Cancer Imaging Archive, 2016).
Kawahara, J., Daneshvar, S., Argenziano, G. & Hamarneh, G. Seven-point checklist and skin lesion classification using multitask multimodal neural nets. IEEE J. Biomed. Health Inform. 23, 538–546 (2018).
Nakayama, L. F. et al. A Brazilian multilabel ophthalmological dataset (BRSET). PhysioNet (2023).