Jaech, A. et al. OpenAI o1 system card. Preprint available online (2024).
Guo, D. et al. DeepSeek-R1 promotes reasoning in large language models via reinforcement learning. Nature 645, 633–638 (2025).
Trinh, T. H., Wu, Y., Le, Q. V., He, H. & Luong, T. Solving Olympic-level geometry problems without human examples. Nature 625, 476–482 (2024).
Rajpurkar, P., Chen, E., Banerjee, O. & Topol, E. J. Artificial intelligence in healthcare and medicine. Nat. Med. 28, 31–38 (2022).
Mamede, S. et al. Protecting physicians from availability bias in diagnostic reasoning: a randomized controlled trial. BMJ Qual. Saf. 29, 550–559 (2020).
Mamede, S. et al. How can students’ diagnostic skills benefit most from clinical case practice? The impact of structured reflection on diagnosing both familiar and new diseases. Acad. Med. 89, 121–127 (2014).
Mamede, S., Schmidt, H. G. & Penaforte, J. C. The influence of reflective practice on the accuracy of medical diagnoses. Med. Educ. 42, 468–475 (2008).
Norman, G. R. et al. Sources of errors in clinical reasoning: cognitive biases, knowledge gaps, and dual-process thinking. Acad. Med. 92, 23–30 (2017).
Shao, Z. et al. DeepSeekMath: advancing the boundaries of mathematical reasoning in open language models. Preprint available online (2024).
Singhal, K. et al. Large language models capture clinical knowledge. Nature 620, 172–180 (2023).
Yao, S. et al. ReAct: combining reasoning and action in language models. In International Conference on Learning Representations (ICLR, 2023).
Bakken, S. AI in healthcare: maintaining human oversight. J. Am. Med. Inform. Assoc. 30, 1225–1226 (2023).
To ensure trustworthy AI, keep humans involved. Nat. Med. 31, 3207 (2025).
Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K. & Yao, S. Reflexion: language agents that learn through verbal feedback. Adv. Neural Inf. Process. Syst. 36, 8634–8652 (2023).
Rodman, A. & Topol, E. J. Can generative artificial intelligence perform clinical reasoning? Lancet 405, 689 (2025).
Zou, J. & Topol, E. J. The emergence of agentic AI collaborators in medicine. Lancet 405, 457 (2025).
Tordjman, M. et al. Benchmarking the DeepSeek large language model on medical tasks and clinical reasoning. Nat. Med. 31, 2550–2555 (2025).
Sandmann, S. et al. Evaluating DeepSeek large language models in clinical decision-making through benchmarking. Nat. Med. 31, 2546–2549 (2025).
McDuff, D. et al. Advancing accurate differential diagnosis with large language models. Nature 642, 451–457 (2025).
Tu, T. et al. Advancing artificial intelligence for conversational medical diagnosis. Nature 642, 442–450 (2025).
Johri, S. et al. A framework for evaluating large language models in clinical patient interactions. Nat. Med. 31, 77–86 (2025).
Johnson, A. E. W. et al. MIMIC-IV: an openly available electronic health record dataset. Sci. Data 10, 1 (2023).
Tomczak, K., Czerwińska, P. & Wiznerowicz, M. The Cancer Genome Atlas (TCGA): a vast resource of biological knowledge. Contemp. Oncol. 2015, 68–77 (2015).
Roberts, R. J. PubMed Central: serving as the GenBank equivalent for published literature. Proc. Natl Acad. Sci. USA 98, 381–382 (2001).
Chakradhar, S. Reliable outcomes: using artificial intelligence to identify optimal drugs and dosages. Nat. Med. 23, 1244–1247 (2017).
Lek, M. et al. Examining protein-coding genetic variants across 60,706 individuals. Nature 536, 285–291 (2016).
Jiang, L. Y. et al. Large-scale health system language models function as versatile prediction tools. Nature 619, 357–362 (2023).
Asai, A., Wu, Z., Wang, Y., Sil, A. & Hajishirzi, H. Self-RAG: training models to retrieve, generate, and critique via self-reflection. In The Twelfth International Conference on Learning Representations (ICLR, 2024).
Kim, T. et al. MindfulDiary: leveraging large language models to assist psychiatric patients with journaling. In Proc. 2024 CHI Conference on Human Factors in Computing Systems 1–20 (ACM, 2024).
Ni, Y., Chen, Y., Ding, R. & Ni, S. Beatrice: a chatbot designed to gather psychoecological data and answer questions. In Proc. 16th International Conference on Pervasive Technologies Related to Assistive Environments 429–435 (ACM, 2023).
Holderried, F. et al. A GPT-based chatbot acting as a simulated patient for practicing medical history taking: a prospective, mixed methods study. JMIR Med. Educ. 10, e53961 (2024).
Rodin, G. et al. Communication between clinicians and patients: a comprehensive systematic review. Support. Care Cancer 17, 627–644 (2009).
Liu, S., McCoy, A. B. & Wright, A. Enhancing large language model use in biomedicine through retrieval-augmented generation: a systematic review, meta-analysis, and clinical development guidelines. J. Am. Med. Inform. Assoc. 32, 605–615 (2025).
Qiu, J. et al. Agentic systems powered by large language models in medicine and healthcare. Nat. Mach. Intell. 6, 1418–1420 (2024).
Ting, D. S. W. et al. Creating and validating a deep learning system to detect diabetic retinopathy and related eye conditions using retinal images from multiethnic diabetic populations. JAMA 318, 2211–2223 (2017).
Liu, Y. et al. A deep learning platform for differential diagnosis of skin conditions. Nat. Med. 26, 900–908 (2020).
Groh, M. et al. Deep learning–based decision support for diagnosing skin diseases across diverse skin tones. Nat. Med. 30, 573–583 (2024).
Lu, M. Y. et al. A multimodal generative AI assistant designed for human pathology. Nature 634, 466–473 (2024).
Tiu, E. et al. Expert-level identification of pathologies from unlabeled chest X-ray images through self-supervised learning. Nat. Biomed. Eng. 6, 1399–1406 (2022).
Zhou, H.-Y. et al. A transformer-based representation-learning approach that unifies multimodal inputs for clinical diagnostics. Nat. Biomed. Eng. 7, 743–755 (2023).
DeGrave, A. J., Cai, Z. R., Janizek, J. D., Daneshjou, R. & Lee, S.-I. Auditing the decision-making processes of medical-image classifiers using generative AI combined with physician expertise. Nat. Biomed. Eng. 9, 294–306 (2025).
Brodeur, P. G. et al. A large language model achieves superhuman performance on physician-level reasoning tasks. Preprint at (2024).
Loh, H. W. et al. A systematic review of explainable artificial intelligence applications in healthcare over the past decade (2011–2022). Comput. Methods Programs Biomed. 226, 107161 (2022).
Saraswat, D. et al. Explainable AI for healthcare 5.0: opportunities and challenges. IEEE Access 10, 84486–84517 (2022).
Schork, N. J. Artificial intelligence and the advancement of personalized medicine. Precis. Med. Cancer Therapy 178, 265–283 (2019).
Parekh, A.-D. E., Shaikh, O. A., Simran, S., Manan, S. & Hasibuzzaman, M. A. Artificial intelligence in personalized medicine: AI-generated treatment plans tailored to genetic profiles and medical histories. Ann. Med. Surg. 85, 5831–5833 (2023).
Guk, K. et al. The evolution of wearable devices enabling real-time disease monitoring for personalized healthcare. Nanomaterials 9, 813 (2019).
Gao, S. et al. TxAgent: an AI agent for therapeutic reasoning across a vast array of tools. Preprint at (2025).
Ji, C., Jiang, T., Liu, L., Zhang, J. & You, L. Continuous glucose monitoring integrated with artificial intelligence: reshaping the approach to prediabetes management. Front. Endocrinol. 16, 1571362 (2025).
Subbiah, V. The next era of evidence-based medicine. Nat. Med. 29, 49–58 (2023).
Wang, H. et al. Scientific discovery in the era of artificial intelligence. Nature 620, 47–60 (2023).
Gao, S. et al. Empowering biomedical discovery through AI agents. Cell 187, 6125–6151 (2024).
Jumper, J. et al. Highly accurate protein structure prediction achieved with AlphaFold. Nature 596, 583–589 (2021).
Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).
Watson, J. L. et al. De novo design of protein structure and function using RFdiffusion. Nature 620, 1089–1100 (2023).
Kortemme, T. Designing proteins from scratch—moving from novel structures to engineered functions. Cell 187, 526–544 (2024).
Swanson, K., Wu, W., Bulaong, N. L., Pak, J. E. & Zou, J. An AI-powered virtual lab creates new nanobodies targeting SARS-CoV-2. Nature 646, 716–723 (2025).
Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 6000–6010 (2017).
Yu, Q. et al. DAPO: a large-scale open-source system for LLM reinforcement learning. Adv. Neural Inf. Process. Syst. 38, 113222–113244 (2026).
Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. Proximal policy optimization algorithms. Preprint at (2017).
Muennighoff, N. et al. s1: a straightforward approach to test-time scaling. In Proc. 2025 Conference on Empirical Methods in Natural Language Processing 20286–20332 (ACL, 2025).
Huang, X., Wu, J., Liu, H., Tang, X. & Zhou, Y. m1: unlocking the power of test-time scaling for medical reasoning in large language models. In Proc. Machine Learning for Health Vol. 297, 369–383 (PMLR, 2025).
Liévin, V., Hother, C. E., Motzfeldt, A. G. & Winther, O. Are large language models capable of reasoning about medical questions? Patterns 5, 100943 (2024).
Nori, H. et al. Can general-purpose foundation models outperform task-specific fine-tuning? A medical case study. Preprint at (2023).
Sonoda, Y. et al. A structured clinical reasoning prompt boosts LLM diagnostic performance on diagnosis please quiz cases. Jpn J. Radiol. 43, 586–592 (2025).
Savage, T., Nayak, A., Gallo, R., Rangan, E. & Chen, J. H. Diagnostic reasoning prompts unlock interpretability potential in medical large language models. npj Digit. Med. 7, 20 (2024).
Yuksekgonul, M. et al. Improving generative AI by backpropagating feedback from language models. Nature 639, 609–616 (2025).
Aali, A. et al. Enhancing the robustness of language model benchmarks for medical tasks through prompt optimization. In Machine Learning for Health (2025).
Bogireddy, S. P. T. R. et al. Neural at ArchEHR-QA 2025: using agentic prompt optimization for evidence-based clinical question answering. In Proc. 24th Workshop on Biomedical Language Processing (Shared Tasks) 104–109 (ACL, 2025).
Khattab, O. et al. DSPy: turning declarative language model calls into top-performing pipelines. In The Twelfth International Conference on Learning Representations (ICLR, 2024).
Kim, Y. et al. MDAgents: adaptive collaboration among LLMs for medical decision-making. Adv. Neural Inf. Process. Syst. 37, 79410–79452 (2024).
Li, X., Zou, H. & Liu, P. ToRL: advancing tool-integrated reinforcement learning. Preprint at (2025).
Jin, B. et al. Search-R1: teaching LLMs to reason and use search engines through reinforcement learning. In Second Conference on Language Modeling (2025).
Chen, M. et al. Teaching LLMs to reason with search via reinforcement learning. Adv. Neural Inf. Process. Syst. 38, 85287–85307 (2026).
Wang, H. et al. OTC: optimizing tool calls through reinforcement learning. Preprint at (2025).
Zheng, Q. et al. Training an end-to-end agentic RAG system for traceable diagnostic reasoning. Preprint at (2025).
Gulshan, V. et al.
Gilson, A. ChatGPT’s performance on USMLE: what it means for medical education and assessment. 9, e45312 (2023).
Liu, N., Zhang, Z., Ho, A. F. W. & Ong, M. E. H. The role of AI in emergency medicine. J. Emerg. Crit. Care Med. 2, 82 (2018).
De Novo Classification Request. FDA (2020).
McNamara, S. L., Yi, P. H. & Lotter, W. Intended use and explainability in FDA-cleared AI medical imaging devices. npj Digit. Med. 7, 80 (2024).
Feng, J. et al. Continuous monitoring and updating of clinical AI algorithms. npj Digit. Med. 5, 66 (2022).
Tai-Seale, M. et al. AI-drafted replies in electronic health records: a study. JAMA Netw. Open 7, e246565 (2024).
Yin, J., Ngiam, K. Y., Tan, S. S.-L. & Teo, H. H. Timing AI advice in healthcare workflows: impact on diagnostic decisions. Manag. Sci. 71, 8995–9868 (2025).
Vaccaro, M., Almaatouq, A. & Malone, T. Human-AI collaboration: when is it truly useful? A systematic review and meta-analysis. Nat. Hum. Behav. 8, 2293–2303 (2024).
Turpin, M., Michael, J., Perez, E. & Bowman, S. Chain-of-thought prompts and unfaithful explanations from language models. Adv. Neural Inf. Process. Syst. 36, 74952–74965 (2023).
Perrier, E. Typed chain-of-thought: a Curry–Howard framework for verifying LLM reasoning. Preprint at (2025).
Lee, J. & Hockenmaier, J. Evaluating step-by-step reasoning traces: a survey. In Findings of the Association for Computational Linguistics: EMNLP 1789–1814 (ACL, 2025).
Ling, Z. et al. Formal verification of chain-of-thought reasoning. Adv. Neural Inf. Process. Syst. 36, 36407–36433 (2023).
Sag, M. Copyright considerations for generative AI. Houst. Law Rev. 61, 295 (2023).
Giuffrè, M. & Shung, D. L. Synthetic data in healthcare: uses, benefits, and privacy concerns. npj Digit. Med. 6, 186 (2023).
Seyyed-Kalantari, L., Zhang, H., McDermott, M. B., Chen, I. Y. & Ghassemi, M. AI underdiagnosis bias in chest X-rays for underserved populations. Nat. Med. 27, 2176–2182 (2021).
Wongvibulsin, S. et al. Current trends in dermatology mobile apps with AI features. JAMA Dermatol. 160, 646–650 (2024).
Tanno, R. et al. Clinician–vision-language model partnerships in radiology reporting. Nat. Med. 31, 599–608 (2025).
Bharadwaj, P. et al. Measuring the ROI of hospital AI initiatives. J. Am. Coll. Radiol. 21, 1677–1685 (2024).
Reardon, S. The emergence of robot radiologists. Nature 576, S54–S58 (2019).
Robert, D. et al. AI as a second reader on chest X-rays: a multicenter study of lung nodule detection accuracy. Acad. Radiol. 32, 1706–1717 (2025).



