Timothy Ossowski, Sheng Zhang, Qianchu Liu, Guanghui Qin, Reuben Tan, Tristan Naumann, Junjie Hu, Hoifung Poon

🤗 HuggingFace Model

📄 Read the Paper

Overview

text_vs_multimodal.svg

High-quality medical data is essential for building reliable medical AI systems. In this work, we explore how careful data curation and supervised fine-tuning can significantly improve multimodal medical reasoning.

We introduce a scalable data recipe that distills structured reasoning traces, resulting in the largest multimodal medical reasoning dataset to date, with over 8 million reasoning traces and 6.8 billion response tokens. We then train Qwen2.5-VL-7B-Instruct on this data to produce OctoMed, a state-of-the-art open-source model that performs robustly across a wide range of out-of-distribution benchmarks.
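As a rough illustration of what one distillation step in such a recipe can look like, the sketch below prompts a teacher model to produce a step-by-step reasoning trace for a sourced question and packages the result as an SFT record. The teacher model name, prompt wording, and output schema are illustrative assumptions, not the exact OctoMed pipeline.

```python
# Hypothetical sketch of one distillation step: query a teacher model for a
# structured reasoning trace, then store it as a supervised fine-tuning example.
import base64
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any OpenAI-compatible endpoint works


def distill_trace(question: str, options: list[str], image_path: str | None = None,
                  teacher: str = "gpt-4o") -> dict:
    """Ask the teacher to reason step by step, then emit one SFT record."""
    content = [{
        "type": "text",
        "text": (f"{question}\nOptions: {', '.join(options)}\n"
                 "Think step by step inside <think>...</think>, "
                 "then give the final answer on its own line."),
    }]
    if image_path is not None:
        # Attach the medical image (if any) as a base64 data URL.
        with open(image_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})

    response = client.chat.completions.create(
        model=teacher,
        messages=[{"role": "user", "content": content}],
    )
    trace = response.choices[0].message.content

    # One SFT example: the sourced question as the prompt, the distilled trace as the target.
    return {"prompt": question, "options": options, "image": image_path, "response": trace}


if __name__ == "__main__":
    record = distill_trace(
        "Which finding is most consistent with the chest X-ray?",
        ["Pneumothorax", "Pleural effusion", "Cardiomegaly", "Normal"],
        image_path=None,  # text-only questions are distilled the same way
    )
    print(json.dumps(record, indent=2)[:500])
```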

Below we share the core ideas behind our data strategy, what we learned from scaling it up, and future directions for developing medical vision-language reasoning systems.

Left: Distribution of imaging modalities and anatomical regions represented in our SFT mixture. A small fraction of samples from other less common modalities is omitted for visual clarity.
Right: Breakdown of task types and source datasets used for distillation.


Data Recipe

Question Sourcing

When distilling reasoning traces for SFT, where the training questions come from matters greatly for downstream medical performance. We grouped both our training data and evaluation benchmarks into three knowledge sources:

Text-Only Reasoning: USMLE-style reasoning questions such as MedQA

Multimodal Reasoning: Multiple-Choice VQA questions about medical images such as PMC-VQA

Multimodal Classification: Perceptual medical diagnostic tasks such as diabetic retinopathy grading

We trained the same Qwen base model on different combinations of these sources. This let us test two things: how well models generalize when trained on a single source, and whether mixing sources hurts performance. We found that models perform best on tasks that match the knowledge source they were trained on, and cross-source generalization remains difficult. However, combining multiple knowledge sources does not cause interference; instead, the model effectively leverages each source without any drop in overall performance.
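A minimal sketch of how this ablation can be organized is shown below: every non-empty combination of the three knowledge sources defines one training run on the same base model, so single-source generalization and mixing effects can be compared directly. The source labels and run-config fields are illustrative, not the exact experiment configuration.

```python
# Enumerate one SFT run per non-empty combination of knowledge sources,
# all starting from the same base model.
from itertools import combinations

SOURCES = ("text_only", "multimodal_reasoning", "multimodal_classification")


def mixture_runs():
    """Yield one run config per non-empty combination of knowledge sources."""
    for k in range(1, len(SOURCES) + 1):
        for combo in combinations(SOURCES, k):
            yield {
                "base_model": "Qwen2.5-VL-7B-Instruct",
                "sources": list(combo),
                "run_name": "sft_" + "+".join(combo),
            }


if __name__ == "__main__":
    for run in mixture_runs():
        print(run["run_name"], "->", run["sources"])
```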

question_sourcing.svg

Takeaway: Text-only questions are the strongest individual question source. Combining sources boosts generalization without affecting in-domain performance.

Question Filtering