Shan Chen

Hi, I am a Ph.D. student at Harvard-MGB AIM, jointly with Maastricht University, under the guidance of Hugo Aerts, Ph.D. and Danielle S. Bitterman, M.D. My work is supported by the 2024 Google PhD Fellowship in Natural Language Processing. I am also affiliated with the Boston Children's Hospital Computational Health Informatics Program (CHIP), where we have the privilege of collaborating closely with Guergana Savova, Ph.D. and Tim Miller, Ph.D.

On one hand, I am deeply interested in the knowledge and feature representations of large language models, aiming to develop more interpretable AI systems for critical domains such as healthcare. On the other hand, I am passionate about enhancing patient communication and establishing robust safety evaluation methods for high-stakes tasks. It is crucial to assess the impact of AI on all healthcare stakeholders, including patients, providers, and others.

My research has been featured in major media outlets such as Bloomberg, The New York Times, NBC, and New Scientist, among others. It has also been highlighted by government agencies including the FDA, NCI, and NIH, and has been cited in U.S. congressional hearings.

During COVID-19, I completed my M.S. in Computational Linguistics at Brandeis University, where I was fortunate to be advised by Professor Nianwen Xue, Ph.D., fully explored my interests, and met many wonderful people and friends. Before Brandeis, I spent four years as an undergraduate in Math, Japanese, and Linguistics at St. Olaf College, enjoying the snow!

In my free time, I enjoy basketball, dragon boat, and kyudo 🏹.

If you want to work with me or my group, please email bittermanlab@gmail.com instead!

News

[01/04/2025] Our paper on using LLMs to identify social determinants of health in electronic health records was the most cited journal-wide (Nature/NPJ Digital Medicine) in 2024! This paper was also selected for the AI and Data Science Year in Review 2024 at AMIA!
[11/11/2024] Heading to EMNLP! I also wrote a blog post on what we learned this year about AI4healthcare.
[10/10/2024] Our paper "WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation" is now available on arXiv.
[09/15/2024] Honored to receive the 2024 Google PhD Fellowship in Natural Language Processing!
[06/19/2024] RABBITS is out! We examined current biomedical benchmarks and found that language models are more familiar with generic terms!
[05/09/2024] Cross-Care is out! The first grounded bias benchmark that analyzes how pre-training data impacts model misalignment with real-world medical concepts.
[04/02/2024] LCD Benchmark is out! Try this long clinical documents benchmark that LLMs are bad at!
[11/07/2023] Our SDoH paper was accepted at Nature Digital Medicine and was a front-page featured article from Jan-May 2024. Highlighted research at NCI!
[08/24/2023] Check out our work and editorial highlights @ JAMA Onc. Also used during a U.S. congressional hearing!

Selected Publications

(* indicates equal contribution)

When Models Reason in Your Language: Controlling Thinking Trace Language Comes at the Cost of Accuracy

*Jirui Qi, *Shan Chen, Zidi Xiong, Raquel Fernández, Danielle S. Bitterman, Arianna Bisazza

arXiv 2025

arXiv PDF 🤗 Dataset

Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models

Zidi Xiong, Shan Chen, Zhenting Qi, Himabindu Lakkaraju

arXiv 2025

arXiv PDF

MedBrowseComp: Benchmarking Medical Deep Research and Computer Use

Shan Chen, Pedro Moreira, Yuxin Xiao, Sam Schmidgall, Jeremy Warner, Hugo Aerts, Thomas Hartvigsen, Jack Gallifant, Danielle S. Bitterman

arXiv 2025

arXiv PDF Project Page

🌏 WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation

*Joao Matos, *Shan Chen, ... A. Ian Wong, Danielle S. Bitterman, and Jack Gallifant

NAACL 2025

🤗 Dataset Tweet arXiv Code

🐰 RABBITS: Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks

*Jack Gallifant, *Shan Chen, Pedro Moreira, ... Thomas Hartvigsen, and Danielle S. Bitterman

EMNLP 2024

🤗 Tweet arXiv Industrial adaptation Code

Cross-Care: Assessing the Healthcare Implications of Pre-training Data on Language Model Bias

*Shan Chen, *Jack Gallifant, Mingye Gao, Pedro Moreira, ... William G. La Cava, and Danielle S. Bitterman

NeurIPS 2024

Website Code Data

LCD Benchmark: Long Clinical Document Benchmark on Mortality Prediction for Language Models

Wonjin Yoon, Shan Chen, ... Danielle S. Bitterman, Majid Afshar, and Timothy Miller

JAMIA

medRxiv Code Coda Bench

OncQA: The impact of using an AI chatbot to respond to patient questions

Shan Chen, Marco Guevara ... Hugo Aerts, Timothy Miller, Guergana Savova, Raymond Mak, Majid Afshar, and Danielle S. Bitterman

Lancet Digital Health

arXiv Code 🤗 Article NYTimes

Measuring Pointwise V-Usable Information In-Context-ly

Sheng Lu, Shan Chen, Yingya Li, Danielle S. Bitterman, Guergana Savova, and Iryna Gurevych

EMNLP 2023

EMNLP Code Tweet Tutorial

Large Language Models to Identify Social Determinants of Health in Electronic Health Records

*Marco Guevara, *Shan Chen, Spencer Thomas ... Hugo Aerts, Guergana Savova, Raymond Mak, and Danielle S. Bitterman

Nature Digital Medicine

Featured & most cited journal wide in 2024

Paper Dataset arXiv Code 🤗

Use of Artificial Intelligence Chatbots for Cancer Treatment Information

Shan Chen, Benjamin Kann, Michael Foote, Hugo Aerts, Guergana Savova, Raymond Mak and Danielle S. Bitterman

JAMA ONC

arXiv Code Data Article News

Evaluation of ChatGPT Family of Models for Biomedical Reasoning and Classification

*Shan Chen, *Yingya Li, Sheng Lu, ... Hugo Aerts, Guergana Savova, and Danielle S. Bitterman

JAMIA

JAMIA arXiv Code

Natural language processing to automatically extract the presence and severity of esophagitis in notes of patients undergoing radiotherapy

Shan Chen, Marco Guevara, Nicolas Ramirez ... Hugo Aerts, Tim Miller, Guergana Savova, Raymond Mak, and Danielle S. Bitterman

JCO CCI

Oral Presentation @ ASTRO 2023 < 9%

ASTRO Paper arXiv Code

Medications detection in tweets using transformer networks and multi-task learning

Dongfang Xu, Shan Chen, and Tim Miller

Proceedings of the BioCreative VII Challenge 2021

🏆 First Place

Paper arXiv Code

Mentoring

Students and projects

SYNPO: Synthetic Data Augmentation Through Iterative Preference Optimization on LLMs for Clinical Problem Summarization, Shayan Chowdhury | 2024 Summer
Mapping Data Bias in MLLMs: Signposts, Pitfalls, and the Road Ahead, Kuleen Sasse and Jackson Pond | 2024 Summer
Nikolaj Munch - Thesis student, Now at Office of Denmark's Tech Ambassador
Javier Mora, MD - Residency research year 2025
Yanan (Lance) Lu (Harvard DBMI, 2022-2023) - Thesis student, Now MLE at Apple
Vikram Goddla (High School, 2022-2023) - Now at Harvard College
Nick Ramirez - Thesis student, Now DS at Genentech

Honors and Service

Honors

Google PhD Fellowship in Natural Language Processing, 2024
CHIL Doctoral Consortium, 2024, 25 (Oral)
Brandeis Merit Scholarship, 2020
JASSO Scholarship, 日本文部科学省, 2019
National Japanese Exam Silver Prize, AATJ 全米日本語教育学会, 2019
Henry Luce Research Grant, Henry Luce Foundation, 2018
Pi Mu Epsilon Society, National Math honor society, 2018

Service

Peer Reviewer: ACL, EMNLP, NAACL, EACL, ICLR, NeurIPS
Program Committee: Clinical NLP Workshop 2023, 24
Journal Reviewer: JAMIA, JBI, JMIR, JNCI, Nature Communications, npj Digital Medicine, Nature Medicine

Invited Talks

Shan Chen; UAB Annual Methods Symposium - Tutorial - LLMs Applications in clinical settings; 2025
Shan Chen; MIT HST 953 - Towards More Robust Large Language Models Applications in Clinical Settings; 2024
Shan Chen; City of Hope - The Role and Risks of Large Language Models in Clinical Settings; 2024
Shan Chen; Harvard - Beacon Hill Lecture Seminars: Current Progress of AI4Healthcare; 2024
