
Clinical NLP · AI Evaluation · Healthcare AI · Post-training

I am Shan Chen, an NLP researcher at R37 Lab working to make language models safer and more useful, so that care sites can actually adopt them.

My research develops evaluations, post-training methods, and agentic systems for high-stakes healthcare workflows, with a focus on reliability, bias, hallucinations, and clinical safety.

I earned my Ph.D. cum laude through Maastricht University and Harvard-MGB AIM, advised by Hugo Aerts and Danielle S. Bitterman. My work was supported by the 2024 Google PhD Fellowship in Natural Language Processing. I collaborate extensively with Guergana Savova and Tim Miller at the Boston Children's Hospital Computational Health Informatics Program.

My research has been featured by Bloomberg, The New York Times, NBC, and New Scientist; highlighted by the FDA, NCI, and NIH; and cited in U.S. congressional hearings.

Previously, I studied computational linguistics at Brandeis University with Nianwen Xue, and math, Japanese, and linguistics at St. Olaf College. Outside research, I enjoy basketball, dragon boat, and kyudo.

Selected Publications

(* indicates equal contribution)

An agentic AI system enhances clinical detection of immunotherapy toxicities: a multi-phase validation study
Jack Gallifant, Shan Chen, Kee-Young Shin, Katherine C. Kellogg, Patrick F. Doyle, Joyce Guo, ... Danielle S. Bitterman
medRxiv 2026
Monitorability as a Free Gift: How RLVR Spontaneously Aligns Reasoning
Zidi Xiong, Shan Chen, Himabindu Lakkaraju
ICML 2026
Proof of Time: A Benchmark for Evaluating Scientific Idea Judgments
*Bingyang Ye, *Shan Chen, Jingxuan Tu, Chen Liu, Zidi Xiong, Samuel Schmidgall, Danielle S. Bitterman
arXiv 2026
When helpfulness backfires: LLMs and the risk of false medical information due to sycophantic behavior
Shan Chen, Mingye Gao, Kuleen Sasse, Thomas Hartvigsen, Brian Anthony, ... Hugo Aerts, Jack Gallifant, Danielle S Bitterman
npj Digital Medicine | AMIA 2025 Oral
When Models Reason in Your Language: Controlling Thinking Trace Language Comes at the Cost of Accuracy
*Jirui Qi, *Shan Chen, Zidi Xiong, Raquel Fernández, Danielle S. Bitterman, Arianna Bisazza
Findings of EMNLP 2025 | ICLR Blog 2026
MedBrowseComp: Benchmarking Medical Deep Research and Computer Use
Shan Chen, Pedro Moreira, Yuxin Xiao, Sam Schmidgall, Jeremy Warner, Hugo Aerts, Thomas Hartvigsen, Jack Gallifant, Danielle S. Bitterman
arXiv 2025
Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models
Zidi Xiong, Shan Chen, Zhenting Qi, Himabindu Lakkaraju
NeurIPS 2025
KScope: A Framework for Characterizing the Knowledge Status of Language Models
Yuxin Xiao, Shan Chen, Jack Gallifant, Danielle Bitterman, Thomas Hartvigsen, Marzyeh Ghassemi
NeurIPS 2025
Sparse autoencoder features for classifications and transferability
*Jack Gallifant, *Shan Chen, Kuleen Sasse, Hugo Aerts, Thomas Hartvigsen, Danielle S Bitterman
EMNLP 2025
🌏 WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation
*Joao Matos, *Shan Chen, ... A. Ian Wong, Danielle S. Bitterman, and Jack Gallifant
NAACL 2025
🐰 RABBITS: Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks
*Jack Gallifant, *Shan Chen, Pedro Moreira, ... Thomas Hartvigsen, and Danielle S. Bitterman
EMNLP 2024
Cross-Care: Assessing the Healthcare Implications of Pre-training Data on Language Model Bias
*Shan Chen, *Jack Gallifant, Mingye Gao, Pedro Moreira, ... William G. La Cava, and Danielle S. Bitterman
NeurIPS 2024
LCD Benchmark: Long Clinical Document Benchmark on Mortality Prediction for Language Models
Wonjin Yoon, Shan Chen, ... Danielle S. Bitterman, Majid Afshar, and Timothy Miller
JAMIA
OncQA: The impact of using an AI chatbot to respond to patient questions
Shan Chen, Marco Guevara ... Hugo Aerts, Timothy Miller, Guergana Savova, Raymond Mak, Majid Afshar, and Danielle S. Bitterman
Lancet Digital Health
Measuring Pointwise V-Usable Information In-Context-ly
Sheng Lu, Shan Chen, Yingya Li, Danielle S. Bitterman, Guergana Savova, and Iryna Gurevych
EMNLP 2023
Large Language Models to Identify Social Determinants of Health in Electronic Health Records
*Marco Guevara, *Shan Chen, Spencer Thomas ... Hugo Aerts, Guergana Savova, Raymond Mak, and Danielle S. Bitterman
npj Digital Medicine
Featured and most cited journal-wide in 2024
Use of Artificial Intelligence Chatbots for Cancer Treatment Information
Shan Chen, Benjamin Kann, Michael Foote, Hugo Aerts, Guergana Savova, Raymond Mak and Danielle S. Bitterman
JAMA Oncology
Evaluation of ChatGPT Family of Models for Biomedical Reasoning and Classification
*Shan Chen, *Yingya Li, Sheng Lu, ... Hugo Aerts, Guergana Savova, and Danielle S. Bitterman
JAMIA
Natural language processing to automatically extract the presence and severity of esophagitis in notes of patients undergoing radiotherapy
Shan Chen, Marco Guevara, Nicolas Ramirez ... Hugo Aerts, Tim Miller, Guergana Savova, Raymond Mak, and Danielle S. Bitterman
JCO CCI
Oral Presentation @ ASTRO 2023 < 9%
Medications detection in tweets using transformer networks and multi-task learning
Dongfang Xu, Shan Chen, and Tim Miller
Proceedings of the BioCreative VII Challenge 2021
🏆 First Place

Mentoring

Students and projects

  • Pedro Moreira - MIT Master's thesis student, now MLE at Google
  • Kuleen Sasse - Undergraduate student, now Ph.D. student at JHU
  • Shayan Chowdhury - Undergraduate student at Columbia
  • Javier Mora, MD - Harvard Medical School residency research year, 2025
  • Kraig Tou - Master's student, now MLE at AWS Annapurna Labs
  • Yanan (Lance) Lu - Harvard DBMI Master's thesis student, now MLE at TikTok
  • Nikolaj Munch - Master's thesis student, now Chief AI Advisor at the Ministry of Foreign Affairs of Denmark
  • Vikram Goddla - High school student, now at Harvard College

Honors and Service

Honors

  • Google PhD Fellowship in Natural Language Processing, 2024
  • CHIL Doctoral Consortium, 2024, 2025 (Oral)
  • Brandeis Merit Scholarship, 2020
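  • JASSO Scholarship, Ministry of Education, Culture, Sports, Science and Technology of Japan (文部科学省), 2019
  • National Japanese Exam Silver Prize, American Association of Teachers of Japanese (AATJ, 全米日本語教育学会), 2019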
  • Henry Luce Research Grant, Henry Luce Foundation, 2018
  • Pi Mu Epsilon, national mathematics honor society, 2018

Service

  • Scientific Advisory Board: Mass General Brigham HPC Scientific Advisory, 2025-present
  • Peer Reviewer: ACL, EMNLP, NAACL, EACL, ICLR, NeurIPS, COLM
  • Workshop Organizer: DAIH @ COLM 2026
  • Program Committee: Clinical NLP Workshop 2023, 2024
  • Journal Reviewer: JAMIA, JBI, JMIR, JNCI, Nature Communications, npj Digital Medicine, Nature Medicine

Invited Talks

  • Shan Chen; Joint Statistical Meetings (JSM) - Panel Speaker on AI in Healthcare; 2026
  • Shan Chen; UMich NLP Seminar - How Far Are We from Reliable LLM Applications in Clinical Settings; 2025
  • Shan Chen; CHIP Journal Club - LLM Applications in Clinical Settings; 2025
  • Shan Chen; UAB Annual Methods Symposium - Tutorial - LLM Applications in Clinical Settings; 2025
  • Shan Chen; MIT HST 953 - Towards More Robust Large Language Model Applications in Clinical Settings; 2024
  • Shan Chen; City of Hope - The Role and Risks of Large Language Models in Clinical Settings; 2024
  • Shan Chen; Harvard - Beacon Hill Lecture Seminars: Current Progress of AI4Healthcare; 2024