A visual-language foundation model for computational pathology

Abstract

The accelerated adoption of digital pathology and advances in deep learning have enabled the development of robust models for various pathology tasks across a diverse array of diseases and patient cohorts. However, model training is often difficult due to label scarcity in the medical domain, and a model’s usage is limited by the specific task and disease for which it is trained. Additionally, most models in histopathology leverage only image data, a stark contrast to how humans teach each other and reason about histopathologic entities. We introduce CONtrastive learning from Captions for Histopathology (CONCH), a visual-language foundation model developed using diverse sources of histopathology images, biomedical text and, notably, over 1.17 million image–caption pairs through task-agnostic pretraining. Evaluated on a suite of 14 diverse benchmarks, CONCH can be transferred to a wide range of downstream tasks involving histopathology images and/or text, achieving state-of-the-art performance on histology image classification, segmentation, captioning, and text-to-image and image-to-text retrieval. CONCH represents a substantial leap over concurrent visual-language pretrained systems for histopathology, with the potential to directly facilitate a wide array of machine learning-based workflows requiring minimal or no further supervised fine-tuning.
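
The contrastive image–caption pretraining described in the abstract can be illustrated with a minimal sketch of a symmetric image–text contrastive loss. This is not the authors' exact objective (CONCH combines a contrastive loss with a captioning objective in a CoCa-style framework, ref. 32); the batch size, embedding dimension and temperature below are illustrative assumptions.

```python
# Minimal sketch of a symmetric image-text contrastive (InfoNCE-style) loss,
# illustrating the general principle behind caption-based pretraining.
# NOT the authors' exact objective; shapes and temperature are assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """img_emb, txt_emb: (batch, dim) embeddings of paired images and captions."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature        # pairwise similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: match each image to its caption and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random embeddings:
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```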

Fig. 1: Data curation and model schematic.
Fig. 2: Zero-shot and supervised classification.
Fig. 3: Slide-level few-shot classification experiments.
Fig. 4: Zero-shot cross-modal retrieval.
Fig. 5: Zero-shot segmentation.

Data availability

TCGA whole-slide data and labels are available from the NIH genomic data commons (http://portal.gdc.cancer.gov). DHMC LUAD whole-slide data and labels can be accessed through the Dartmouth Biomedical Informatics Research and Data Science website (http://bmirds.github.io/LungCancer/). SICAP whole-slide and tile data with corresponding labels can be accessed through the data portal at http://data.mendeley.com/datasets/9xxm58dvs3/1. CRC100k tile data and labels can be found at http://zenodo.org/record/1214456. WSSS4LUAD image tiles and labels can be found at http://wsss4luad.grand-challenge.org/. Pretraining data were curated from image–caption pairs in educational resources and PubMed. EBRAINS WSIs can be found at http://search.kg.ebrains.eu/instances/Dataset/8fc108ab-e2b4-406-8999-60269dc1f994. AGGC and PANDA WSIs can be accessed through their respective Grand Challenge portals (http://aggc22.grand-challenge.org/data/ and http://panda.grand-challenge.org/data/). The unprocessed PubMed Central Open Access dataset is available from the NIH PubMed Central website (http://ncbi.nlm.nih.gov/pmc/tools/openftlist/). Restrictions apply to the availability of anonymized patient data that were used retrospectively for this project with institutional permission and are, thus, not publicly available. All requests for processed or raw data collected or curated in house should be made to the corresponding author and will be evaluated according to institutional and departmental policies to determine whether the data requested are subject to intellectual property or patient privacy obligations.

Code availability

Model weights for CONCH can be accessed for academic research purposes at http://huggingface.co/MahmoodLab/conch. Code for using the pretrained model is provided at http://github.com/mahmoodlab/CONCH. We have documented all technical deep learning methods and software libraries used in the study while ensuring the paper is accessible to the broader clinical and scientific audience.
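
As an illustration of how such a pretrained model can be used with no further fine-tuning, the following is a minimal, hypothetical sketch of zero-shot tile classification with a CLIP-style pathology encoder. The `encode_image` and `encode_text` methods are placeholders rather than the documented CONCH API; refer to the linked repository for actual usage.

```python
# Hypothetical sketch of zero-shot tile classification with a CLIP-style
# pathology model. `model.encode_image` / `model.encode_text` are stand-ins
# for whichever embedding functions the released package exposes.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, image_batch, class_prompts, device="cpu"):
    # One text prompt per class, e.g. "an H&E image of lung adenocarcinoma."
    txt = F.normalize(model.encode_text(class_prompts), dim=-1)             # (C, D)
    img = F.normalize(model.encode_image(image_batch.to(device)), dim=-1)   # (N, D)
    scores = img @ txt.t()                # cosine similarity, (N, C)
    return scores.argmax(dim=-1)          # predicted class index per image
```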

References

  1. Song, A. H. et al. Artificial intelligence for digital and computational pathology. Nat. Rev. Bioeng. 1, 930–949 (2023).

  2. Bera, K., Schalper, K. A., Rimm, D. L., Velcheti, V. & Madabhushi, A. Artificial intelligence in digital pathology—new tools for diagnosis and precision oncology. Nat. Rev. Clin. Oncol. 16, 703–715 (2019).

  3. Shmatko, A., Ghaffari Laleh, N., Gerstung, M. & Kather, J. N. Artificial intelligence in histopathology: enhancing cancer research and clinical oncology. Nat. Cancer 3, 1026–1038 (2022).

  4. Lipkova, J. et al. Artificial intelligence for multimodal data integration in oncology. Cancer Cell 40, 1095–1110 (2022).

  5. Bejnordi, B. E. et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA 318, 2199–2210 (2017).

  6. Coudray, N. et al. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat. Med. 24, 1559–1567 (2018).

  7. Lu, M. Y. et al. Data-efficient and weakly supervised computational pathology on whole-slide images. Nat. Biomed. Eng. 5, 555–570 (2021).

  8. Skrede, O.-J. et al. Deep learning for prediction of colorectal cancer outcome: a discovery and validation study. Lancet 395, 350–360 (2020).

  9. Chen, R. J. et al. Pan-cancer integrative histology–genomic analysis via multimodal deep learning. Cancer Cell 40, 865–878 (2022).

  10. Courtiol, P. et al. Deep learning-based classification of mesothelioma improves prediction of patient outcome. Nat. Med. 25, 1519–1525 (2019).

  11. Lu, M. Y. et al. AI-based pathology predicts origins for cancers of unknown primary. Nature 594, 106–110 (2021).

  12. Zhu, L. et al. An accurate prediction of the origin for bone metastatic cancer using deep learning on digital pathological images. EBioMedicine 87, 104426 (2023).

  13. Kalra, S. et al. Yottixel—an image search engine for large archives of histopathology whole slide images. Med. Image Anal. 65, 101757 (2020).

  14. Hegde, N. et al. Similar image search for histopathology: SMILY. NPJ Digit. Med. 2, 56 (2019).

  15. Wang, X. et al. RetCCL: clustering-guided contrastive learning for whole-slide image retrieval. Med. Image Anal. 83, 102645 (2023).

  16. Chen, C. et al. Fast and scalable search of whole-slide images via self-supervised deep learning. Nat. Biomed. Eng. 6, 1420–1434 (2022).

  17. Kather, J. N. et al. Pan-cancer image-based detection of clinically actionable genetic alterations. Nat. Cancer 1, 789–799 (2020).

  18. Saldanha, O. L. et al. Self-supervised attention-based deep learning for pan-cancer mutation prediction from histopathology. NPJ Precis. Oncol. 7, 35 (2023).

  19. Graham, S. et al. Hover-Net: simultaneous segmentation and classification of nuclei in multi-tissue histology images. Med. Image Anal. 58, 101563 (2019).

  20. Campanella, G. et al. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat. Med. 25, 1301–1309 (2019).

  21. Bulten, W. et al. Automated deep-learning system for Gleason grading of prostate cancer using biopsies: a diagnostic study. Lancet Oncol. 21, 233–241 (2020).

  22. Nagpal, K. et al. Development and validation of a deep learning algorithm for improving Gleason scoring of prostate cancer. NPJ Digit. Med. 2, 48 (2019).

  23. Mobadersany, P. et al. Predicting cancer outcomes from histology and genomics using convolutional networks. Proc. Natl Acad. Sci. USA 115, E2970–E2979 (2018).

  24. Chen, R. J. et al. Multimodal co-attention transformer for survival prediction in gigapixel whole slide images. In Proc. IEEE/CVF International Conference on Computer Vision 4015–4025 (IEEE, 2021).

  25. Fu, Y. et al. Pan-cancer computational histopathology reveals mutations, tumor composition and prognosis. Nat. Cancer 1, 800–810 (2020).

  26. Sammut, S.-J. et al. Multi-omic machine learning predictor of breast cancer therapy response. Nature 601, 623–629 (2022).

  27. Huang, Z. et al. Artificial intelligence reveals features associated with breast cancer neoadjuvant chemotherapy responses from multi-stain histopathologic images. NPJ Precis. Oncol. 7, 14 (2023).

  28. Foersch, S. et al. Multistain deep learning for prediction of prognosis and therapy response in colorectal cancer. Nat. Med. 29, 430–439 (2023).

  29. Vanguri, R. S. et al. Multimodal integration of radiology, pathology and genomics for prediction of response to PD-(L)1 blockade in patients with non-small cell lung cancer. Nat. Cancer 3, 1151–1164 (2022).

  30. Radford, A. et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (eds Meila, M. & Zhang, T.) 8748–8763 (PMLR, 2021).

  31. Jia, C. et al. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning (eds Meila, M. & Zhang, T.) 4904–4916 (PMLR, 2021).

  32. Yu, J. et al. CoCa: contrastive captioners are image–text foundation models. Trans. Mach. Learn. Res. https://openreview.net/forum?id=Ee277P3AYC (2022).

  33. Li, J., Li, D., Xiong, C. & Hoi, S. BLIP: bootstrapping language–image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning (eds Chaudhur, K. et al.) 12888–12900 (PMLR, 2022).

  34. Singh, A. et al. FLAVA: a foundational language and vision alignment model. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 15638–15650 (IEEE, 2022).

  35. Li, H. et al. Uni-Perceiver v2: a generalist model for large-scale vision and vision-language tasks. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 2691–2700 (IEEE, 2023).

  36. Alayrac, J.-B. et al. Flamingo: a visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 35, 23716–23736 (2022).

  37. Li, Y., Fan, H., Hu, R., Feichtenhofer, C. & He, K. Scaling language–image pre-training via masking. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 23390–23400 (IEEE, 2023).

  38. Wang, W. et al. Image as a foreign language: BEiT pretraining for vision and vision-language tasks. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 19175–19186 (IEEE, 2023).

  39. Schuhmann, C. et al. LAION-5B: an open large-scale dataset for training next generation image-text models. Adv. Neural Inf. Process. Syst. 35, 25278–25294 (2022).

  40. Chen, Z., Song, Y., Chang, T.-H. & Wan, X. Generating radiology reports via memory-driven transformer. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Webber, B. et al.) 1439–1449 (Association for Computational Linguistics, 2020); https://aclanthology.org/2020.emnlp-main.112

  41. Liu, G. et al. Clinically accurate chest X-ray report generation. In Proc. 4th Machine Learning for Healthcare Conference (eds Doshi-Velez, F. et al.), Vol. 106, 249–269 (PMLR, 2019).

  42. Tiu, E. et al. Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nat. Biomed. Eng. 6, 1399–1406 (2022).

  43. Huang, S.-C., Shen, L., Lungren, M. P. & Yeung, S. GLoRIA: a multimodal global–local representation learning framework for label-efficient medical image recognition. In Proc. IEEE/CVF International Conference on Computer Vision 3942–3951 (IEEE, 2021).

  44. Zhang, S. et al. BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image–text pairs. Preprint at https://doi.org/10.48550/arXiv.2303.00915 (2023).

  45. Wang, Z., Wu, Z., Agarwal, D. & Sun, J. MedCLIP: contrastive learning from unpaired medical images and text. In Proc. 2022 Conference on Empirical Methods in Natural Language Processing (eds Che, W. & Shutova, E.) 3876–3887 (Association for Computational Linguistics, 2022).

  46. Schaumberg, A. J. et al. Interpretable multimodal deep learning for real-time pan-tissue pan-disease pathology search on social media. Mod. Pathol. 33, 2169–2185 (2020).

  47. Maleki, D. & Tizhoosh, H. R. LILE: look in-depth before looking elsewhere—a dual attention network using transformers for cross-modal information retrieval in histopathology archives. In International Conference on Medical Imaging with Deep Learning (eds Konukoglu, E. et al.) 879–894 (PMLR, 2022).

  48. Zhang, Y., Jiang, H., Miura, Y., Manning, C. D. & Langlotz, C. P. Contrastive learning of medical visual representations from paired images and text. In Machine Learning for Healthcare Conference (eds Lipton, Z. et al.) 2–25 (PMLR, 2022).

  49. Zhang, H. et al. PathNarratives: data annotation for pathological human–AI collaborative diagnosis. Front. Med. 9, 1070072 (2023).

  50. Tsuneki, M. & Kanavati, F. Inference of captions from histopathological patches. In International Conference on Medical Imaging with Deep Learning (eds Konukoglu, E. et al.) 1235–1250 (PMLR, 2022).

  51. Zhang, R., Weber, C., Grossman, R. & Khan, A. A. Evaluating and interpreting caption prediction for histopathology images. In Machine Learning for Healthcare Conference (eds Doshi-Velez, F. et al.) 418–435 (PMLR, 2020).

  52. Naseem, U., Khushi, M. & Kim, J. Vision-language transformer for interpretable pathology visual question answering. IEEE J. Biomed. Health Inform. 27, 1681–1690 (2022).

  53. He, X. Towards visual question answering on pathology images. In Proc. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) (eds Zong, C. et al.) 708–718 (Association for Computational Linguistics, 2021).

  54. Huang, Z., Bianchi, F., Yuksekgonul, M., Montine, T. J. & Zou, J. A visual-language foundation model for pathology image analysis using medical Twitter. Nat. Med. 29, 2307–2316 (2023).

  55. Gamper, J. & Rajpoot, N. Multiple instance captioning: learning representations from histopathology textbooks and articles. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 16549–16559 (IEEE, 2021).

  56. Lu, M. Y. et al. Visual language pretrained multiple instance zero-shot transfer for histopathology images. In Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 19764–19775 (IEEE, 2023).

  57. Lin, W. et al. PMC-CLIP: contrastive language–image pre-training using biomedical documents. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2023 (eds Greenspan, H. et al.) 525–536 (Springer Nature, 2023).

  58. Ikezogwo, W. O. et al. Quilt-1M: one million image–text pairs for histopathology. In Advances in Neural Information Processing Systems (eds Oh, A. et al.) 37995–38017 (Curran Associates, Inc., 2023).

  59. Ilse, M., Tomczak, J. & Welling, M. Attention-based deep multiple instance learning. In International Conference on Machine Learning (eds Dy, J. & Krause, A.) 2127–2136 (PMLR, 2018).

  60. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).

  61. Deng, J. et al. ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition 248–255 (IEEE, 2009).

  62. Wang, X. et al. Transformer-based unsupervised contrastive learning for histopathological image classification. Med. Image Anal. 81, 102559 (2022).

  63. Gatta, G. et al. Burden and centralised treatment in Europe of rare tumours: results of RARECAREnet—a population-based study. Lancet Oncol. 18, 1022–1039 (2017).

  64. Riasatian, A. et al. Fine-tuning and training of densenet for histopathology image representation using TCGA diagnostic slides. Med. Image Anal. 70, 102032 (2021).

  65. Kundra, R. et al. OncoTree: a cancer classification system for precision oncology. JCO Clin. Cancer Inform. 5, 221–230 (2021).

  66. Alfasly, S. et al. When is a foundation model a foundation model. Preprint at https://doi.org/10.48550/arXiv.2309.11510 (2023).

  67. Zhou, K., Yang, J., Loy, C. C. & Liu, Z. Learning to prompt for vision-language models. Int. J. Comput. Vis. 130, 2337–2348 (2022).

  68. Gao, P. et al. CLIP-Adapter: better vision-language models with feature adapters. Int. J. Comput. Vis. 132, 581–595 (2024).

  69. Perez, E., Kiela, D. & Cho, K. True few-shot learning with language models. Adv. Neural Inf. Process. Syst. 34, 11054–11070 (2021).

  70. Sanh, V. et al. Multitask prompted training enables zero-shot task generalization. In 10th International Conference on Learning Representations https://openreview.net/forum?id=9Vrb9D0WI4 (OpenReview.net, 2021).

  71. Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You Only Look Once: unified, real-time object detection. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 779–788 (IEEE, 2016).

  72. Luo, R. et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief. Bioinform. 23, bbac409 (2022).

  73. Dosovitskiy, A. et al. An image is worth 16×16 words: transformers for image recognition at scale. In 9th International Conference on Learning Representations https://openreview.net/forum?id=YicbFdNTTy (OpenReview.net, 2021).

  74. Zhou, J. et al. Image BERT pre-training with online tokenizer. In 10th International Conference on Learning Representations https://openreview.net/forum?id=ydopy-e6Dg (OpenReview.net, 2022).

  75. Silva-Rodriguez, J., Colomer, A., Dolz, J. & Naranjo, V. Self-learning for weakly supervised Gleason grading of local patterns. IEEE J. Biomed. Health Inform. 25, 3094–3104 (2021).

  76. Dice, L. R. Measures of the amount of ecologic association between species. Ecology 26, 297–302 (1945).

  77. Kolesnikov, A., Zhai, X. & Beyer, L. Revisiting self-supervised visual representation learning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 1920–1929 (IEEE, 2019).

  78. Wang, J. et al. GIT: a generative image-to-text transformer for vision and language. Trans. Mach. Learn. Res. https://openreview.net/forum?id=b4tMhpN0JC (2022).

  79. Li, J., Li, D., Savarese, S. & Hoi, S. BLIP-2: bootstrapping language–image pre-training with frozen image encoders and large language models. In Proc. 40th International Conference on Machine Learning (eds Krause, A. et al.) 19730–19742 (PMLR, 2023).

  80. Banerjee, S. & Lavie, A. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proc. ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization 65–72 (Association for Computational Linguistics, 2005).

  81. Lin, C.-Y. ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out 74–81 (Association for Computational Linguistics, 2004).

  82. Lewis, M., Dauphin, Y. & Fan, A. Hierarchical neural story generation. In Proc. 56th Annual Meeting of the Association for Computational Linguistics (eds Gurevych, I. & Miyao, Y.) 889–898 (Association for Computational Linguistics, 2018).

  83. Wei, J. W. et al. Pathologist-level classification of histologic patterns on resected lung adenocarcinoma slides with deep neural networks. Sci. Rep. 9, 3358 (2019).

  84. Kather, J. N. et al. Predicting survival from colorectal cancer histology slides using deep learning: a retrospective multicenter study. PLoS Med. 16, e1002730 (2019).

  85. Han, C. et al. WSSS4LUAD: Grand Challenge on weakly-supervised tissue semantic segmentation for lung adenocarcinoma. Preprint at https://doi.org/10.48550/arXiv.2204.06455 (2022).

  86. Da, Q. et al. DigestPath: a benchmark dataset with challenge review for the pathological detection and segmentation of digestive-system. Med. Image Anal. 80, 102485 (2022).

  87. Roetzer-Pejrimovsky, T. et al. The Digital Brain Tumour Atlas, an open histopathology resource. Sci. Data 9, 55 (2022).

  88. Roetzer-Pejrimovsky, T. et al. The Digital Brain Tumour Atlas, an open histopathology resource [Data set]. EBRAINS https://doi.org/10.25493/WQ48-ZGX (2022).

  89. Huo, X. et al. Comprehensive AI model development for Gleason grading: from scanning, cloud-based annotation to pathologist–AI interaction. Preprint at SSRN https://doi.org/10.2139/ssrn.4172090 (2022).

  90. Bulten, W. et al. Artificial intelligence for diagnosis and Gleason grading of prostate cancer: the PANDA challenge. Nat. Med. 28, 154–163 (2022).

Acknowledgements

We thank Jinghao Zhou for providing insights into the training dynamics for iBOT. We thank A. Song for his feedback. This work was supported in part by the BWH president’s fund, BWH and MGH Pathology, and NIH NIGMS R35GM138216 (F.M.). M.Y.L. was also supported by the Siebel Scholars program. D.F.K.W. was also funded by the NIH NCI Ruth L. Kirschstein National Service Award, T32CA251062. R.J.C. was also supported by the NSF Graduate Fellowship. T.D. was also supported by the Harvard SEAS Fellowship. G.G. was supported by the BWH president’s scholar award, NIGMS R35GM149270, NIDDK P30DK034854 and the Massachusetts Life Sciences Center. We thank T. Janicki, R. Kenny and the system administration staff at the MGB Enterprise Research Infrastructure and Services (ERIS) research computing core for maintaining the GPU computing resources that were instrumental in this study. We also thank T. Mages and T. Ramsey for their administrative support. The content is solely the responsibility of the authors and does not reflect the official views of the National Institutes of Health or the National Science Foundation.

Author information

Authors and Affiliations

Authors

Contributions

F.M., M.Y.L., B.C. and D.F.K.W. conceptualized the study and designed the experiments. M.Y.L., B.C., R.J.C., T.D., I.L., D.F.K.W., I.O. and L.P.L. performed collection and cleaning of data used for unimodal and visual-language pretraining. M.Y.L., B.C. and R.J.C. performed model development. M.Y.L., B.C., D.F.K.W. and G.J. performed experimental analysis. M.Y.L., B.C., D.F.K.W., A.Z., R.J.C., I.L., T.D., G.J., F.M., G.G., L.P.L. and A.V.P. interpreted experimental results and provided feedback on the study. M.Y.L., B.C., D.F.K.W. and F.M. prepared the paper with input from all co-authors. F.M. supervised the research.

Corresponding author

Correspondence to Faisal Mahmood.

Ethics declarations

Competing interests

M.Y.L., B.C., R.J.C. and F.M. are inventors on a provisional US patent (application number 63/610,645) covering the methodological aspects of this work.

Peer review

Peer review information

Nature Medicine thanks Andrew Beck, Lee Cooper and Geert Litjens for their contribution to the peer review of this work. Primary Handling Editor: Lorenzo Righetto, in collaboration with the Nature Medicine team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Caption content of pre-training dataset.

Wordclouds of captions to qualitatively visualize the caption content of each category in the pre-training dataset. Larger words are more represented in the captions. Common articles, nouns, and verbs are ignored.

Extended Data Fig. 2 Zero-shot classification: single prompt vs. ensembling.

a–d, Slide-level tasks. e, ROI-level tasks. We compare using a single text prompt per class vs. ensembling over multiple class names and templates. Since the zero-shot performance of a visual-language pretrained model can be sensitive to the prompts used (ref. 52), when using a single prompt per class we independently and randomly sample a prompt for each class from the pool of candidate templates and class names (see Supplementary Data Tables 34–38 for the prompt pools). We randomly sample 50 sets of prompts for each task and plot the resulting distribution of zero-shot performance for each model using boxplots. Each dot corresponds to a single set of prompts (n = 50 for each box). Boxes indicate quartile values, and whiskers extend to data points within 1.5 × the interquartile range. Triangles indicate the performance of prompt ensembling. For slide-level tasks, we show performance for all Ks used in top-K pooling. We observe that prompt ensembling can substantially boost performance (relative to the median performance of randomly sampled single prompts) for most models on most tasks, except when the median performance is near random chance, such as for OpenAICLIP on most tasks and PLIP on TCGA BRCA. The poor median performance in these scenarios indicates that the model fails under the majority of sampled prompts, so it is unsurprising that the ensembled prompt performs equally poorly or worse. See Supplementary Data Tables 1–14 for more results.
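
The prompt-ensembling idea described above can be sketched as follows, assuming a generic `encode_text` function (a placeholder, not the authors' implementation): for each class, every combination of template and class-name synonym is embedded, and the L2-normalized text embeddings are averaged into a single zero-shot classifier weight.

```python
# Sketch of prompt ensembling: average normalized text embeddings over all
# (template, class-name synonym) combinations to build one weight per class.
# `encode_text` is a placeholder for the model's text-embedding function.
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_ensembled_classifier(encode_text, templates, classnames_per_class):
    weights = []
    for synonyms in classnames_per_class:      # e.g. ["invasive ductal carcinoma", "IDC"]
        prompts = [t.format(name) for t in templates for name in synonyms]
        emb = F.normalize(encode_text(prompts), dim=-1)        # (num_prompts, D)
        weights.append(F.normalize(emb.mean(dim=0), dim=-1))   # ensembled class weight
    return torch.stack(weights)                # (num_classes, D); compare to image embeddings
```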

Extended Data Fig. 3 CONCH heatmaps, renal cell carcinomas.

Pathologist-annotated H&E images and corresponding cosine-similarity heatmaps of, from top to bottom, papillary, chromophobe and clear cell renal cell carcinomas. Tiles of high similarity (red border) and low similarity (black border) with the predicted class label are randomly sampled and displayed next to each heatmap. We find excellent agreement between the annotated image and the regions of the slide with high similarity, with the tiles demonstrating stereotypical morphology of the tumors within the high-similarity regions and stroma or other normal constituents of the kidney in the low-similarity regions.
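
Cosine-similarity heatmaps of this kind (here and in Extended Data Figs. 4 and 5) can, in principle, be assembled as sketched below, assuming tile embeddings, a class text embedding and tile grid coordinates have already been computed; this is an illustrative reconstruction, not the authors' code.

```python
# Sketch: score each tile embedding against the predicted class's text
# embedding and place the scores back on the slide's tile grid.
import numpy as np

def similarity_heatmap(tile_embs, class_emb, coords, grid_shape):
    """tile_embs: (N, D); class_emb: (D,); coords: (N, 2) integer (row, col) grid positions."""
    tile_embs = tile_embs / np.linalg.norm(tile_embs, axis=1, keepdims=True)
    class_emb = class_emb / np.linalg.norm(class_emb)
    sims = tile_embs @ class_emb                    # cosine similarity per tile
    heatmap = np.full(grid_shape, np.nan)           # NaN where no tissue tile exists
    heatmap[coords[:, 0], coords[:, 1]] = sims
    return heatmap                                  # visualize e.g. with matplotlib
```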

Extended Data Fig. 4 CONCH heatmaps, non-small cell lung carcinomas.

Pathologist-annotated H&E images and corresponding cosine-similarity heatmaps of adenocarcinoma (top) and squamous cell carcinoma (bottom) of the lung. Tiles of high similarity (red border) and low similarity (black border) with the predicted class label are randomly sampled and displayed next to each heatmap. We find excellent agreement between the annotated image and the regions of the slide with high similarity, with the tiles demonstrating stereotypical morphology of the tumors within the high-similarity regions and stroma or other normal constituents of the lung in the low-similarity regions.

Extended Data Fig. 5 CONCH heatmap, lobular carcinoma of the breast.

Pathologist-annotated H&E image and corresponding cosine-similarity heatmap of lobular carcinoma of the breast. Tiles of high similarity (red border) and low similarity (black border) with the predicted class label are randomly sampled and displayed next to the heatmap. As with the ductal carcinoma heatmap in Fig. 2e, we find excellent agreement between the annotated image and the regions of the slide with high similarity, with the tiles demonstrating stereotypical morphology of lobular carcinoma within the high-similarity regions and stroma or other normal constituents of the breast in the low-similarity regions.

Extended Data Fig. 6 ROI-level few-shot classification experiments.

a, b. We investigate the label efficiency of different visual-language pretrained encoders in the few-shot setting, where we vary the number of training labels per class (nc) for nc = 1, 2, 4, 8, 16, … up to 512. For each nc, we sample 5 different sets of training examples and perform linear probing on each training set using the associated image labels (see Supervised classification experiments for details). We show the individual model performance via boxplots (n = 5 for each box) to study the variance in model performance when performing supervised learning with very few training examples. Boxes indicate quartile values and whiskers extend to data points within 1.5 × the interquartile range. For reference, the zero-shot performance of each model is shown as a dotted line on the same plot. In few-shot supervised learning, CONCH achieves better performance (in terms of the median accuracy over 5 runs) than the other encoders for all training set sizes and all tasks. Additionally, on SICAP, we find the zero-shot performance of CONCH to be competitive with few-shot PLIP and BiomedCLIP using up to 64 labels per class.
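
The linear-probing protocol described above can be sketched as follows, assuming frozen image embeddings have already been extracted; the sampling helper and hyperparameters are illustrative assumptions rather than the study's exact settings.

```python
# Sketch of few-shot linear probing: sample n_c labelled examples per class
# from frozen embeddings and fit a logistic-regression classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def few_shot_linear_probe(train_feats, train_labels, test_feats, n_c, seed=0):
    rng = np.random.default_rng(seed)
    idx = np.concatenate([
        rng.choice(np.where(train_labels == c)[0], size=n_c, replace=False)
        for c in np.unique(train_labels)          # n_c examples from every class
    ])
    clf = LogisticRegression(max_iter=1000, C=1.0)   # hyperparameters are assumptions
    clf.fit(train_feats[idx], train_labels[idx])
    return clf.predict(test_feats)
```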

Extended Data Fig. 7 Rare disease classification results on EBRAINS.

a. Weakly supervised ABMIL performance for CONCH and other pretrained encoder models on the EBRAINS 30-class brain tumor subtyping task (n = 573). Error bars represent 95% confidence intervals; the center is the computed value of balanced accuracy. b. We investigate the label efficiency of different pretrained encoders in the few-shot setting, where we vary the number of training labels per class (nc) for nc ∈ {1, 2, 4, 8, 16}. For each nc, we sample 5 different sets of training examples and follow the experimental protocol in a to train an ABMIL model on each training set using the associated slide labels (see Supervised classification experiments for details). We show the individual model performance via boxplots (n = 5 for each box) to study the variance in model performance when performing supervised learning with very few training examples. Boxes indicate quartile values and whiskers extend to data points within 1.5 × the interquartile range. For reference, the zero-shot performance of each model is shown as a dotted line on the same plot. Additional metrics are reported in Supplementary Data Tables 20 and 21. We find that CONCH consistently outperforms all other visual-language pretrained models in zero-shot classification and all pretrained encoders in weakly supervised learning, in terms of both performance and label efficiency.
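
ABMIL here refers to attention-based multiple instance learning (ref. 59). A minimal gated-attention pooling module in the spirit of that method is sketched below; the hidden sizes are assumptions and this is not the authors' exact architecture.

```python
# Sketch of gated attention-based MIL pooling for weakly supervised
# slide-level classification from a bag of tile embeddings.
import torch
import torch.nn as nn

class ABMIL(nn.Module):
    def __init__(self, in_dim=512, hid_dim=256, n_classes=30):
        super().__init__()
        self.attn_V = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Tanh())
        self.attn_U = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Sigmoid())
        self.attn_w = nn.Linear(hid_dim, 1)
        self.classifier = nn.Linear(in_dim, n_classes)

    def forward(self, bag):                                    # bag: (num_tiles, in_dim)
        a = self.attn_w(self.attn_V(bag) * self.attn_U(bag))   # gated attention scores
        a = torch.softmax(a, dim=0)                            # normalize over tiles
        slide_emb = (a * bag).sum(dim=0)                       # attention-weighted average
        return self.classifier(slide_emb), a                   # slide logits, attention weights
```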

Extended Data Fig. 8 Additional Retrieval Examples.

Retrieved examples (among the top 10) using complex prompts with detailed morphological information. Images are from an in-house dataset of tiles sampled from 1,620 cases held out during pretraining, spanning 108 OncoTree codes (5 for each code). Similarity scores between each image and the prompt are shown in the top-right corner of each image.
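
Text-to-image retrieval of this kind reduces to ranking candidate image embeddings by cosine similarity to the prompt embedding; a minimal sketch follows, with `encode_text` again standing in for the model's text-embedding function rather than the documented API.

```python
# Sketch of text-to-image retrieval: rank precomputed tile embeddings by
# cosine similarity to a free-text prompt and return the top-k matches.
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve_top_k(encode_text, prompt, image_embs, k=10):
    q = F.normalize(encode_text([prompt]), dim=-1)      # (1, D) prompt embedding
    db = F.normalize(image_embs, dim=-1)                # (N, D) candidate images
    sims = (db @ q.t()).squeeze(1)                      # cosine similarity per image
    scores, idx = sims.topk(k)
    return idx, scores       # indices and similarity scores of the retrieved images
```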

Extended Data Fig. 9 Image captioning results.

a. Captioning performance of CONCH and baselines fine-tuned on Source A (train n = 558, validation n = 77, test n = 162). The METEOR and ROUGE metrics are both calculated to evaluate the quality of generated captions. Captions were generated using top-K sampling with K = 50 as the decoding strategy. Error bars represent 95% confidence intervals; the center is the computed value of each metric indicated by the x-axis label. CONCH outperforms both GIT baselines with p < 0.01. Although our absolute performance on these metrics is not ideal, image captioning is a considerably more difficult task than classification and retrieval, and we show that our pretraining data and approach can significantly improve performance over general visual-language models. b. Examples of captions generated by CONCH considered by a pathologist to be of high quality. The green text boxes show generated captions and the gray text boxes show captions corrected by a pathologist. c. Examples of partially correct captions generated by CONCH. Reasonably correct portions of the generated caption are highlighted in blue. In general, we noticed that some of the generated captions are regurgitated verbatim from the training dataset, likely due to the limited scale of fine-tuning (training split: n = 558). Given that our current pretraining scale is still relatively small compared to works in the general visual-language domain, we expect that the fine-tuned captioning performance could improve substantially with more high-quality training data.
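
Top-K sampling, the decoding strategy mentioned above, can be sketched generically as follows; this is a standard implementation of the technique, not the authors' captioning code.

```python
# Sketch of one step of top-K sampling (K = 50): keep the K most probable
# next tokens, renormalize their probabilities and sample one of them.
import torch

def top_k_sample(logits: torch.Tensor, k: int = 50) -> torch.Tensor:
    """logits: (vocab_size,) unnormalized scores for the next token."""
    topk_vals, topk_idx = torch.topk(logits, k)
    probs = torch.softmax(topk_vals, dim=-1)          # renormalize over the top K
    choice = torch.multinomial(probs, num_samples=1)  # sample within the top K
    return topk_idx[choice]                           # sampled token id

# In a full captioning loop this is called once per generated token until an
# end-of-sequence token is produced.
```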

Extended Data Fig. 10 CONCH pretraining ablations.

In a, b, error bars represent 95% confidence intervals and the centres correspond to computed values of each metric as specified by the legend (left) or the y-axis label (middle, right). a. Comparison between CONCH pretrained on human-only data (n = 1,170,647) using CoCa vs. human-only data using CLIP vs. H&E-only data (n = 457,372) vs. the full unfiltered dataset (n = 1,786,362). Left. Zero-shot performance on downstream subtyping (TCGA BRCA, n = 150; TCGA RCC, n = 225; TCGA NSCLC, n = 150; DHMC LUAD, n = 143; CRC100k, n = 7,180; WSSS4LUAD, n = 4,693) and grading (SICAP, n = 2,122) tasks. Following pre-established conventions, quadratically weighted Cohen's κ is reported for SICAP and Cohen's κ is reported for DHMC LUAD, while balanced accuracy is reported for all other tasks. CONCH performs the best on average. Middle and right. Model performance in cross-modal retrieval on 3 datasets of image–text pairs (Source A, n = 797; Source B, n = 1,755; TCGA LUAD, n = 165). CONCH (CLIP) performs the best on average. b. Comparison between CONCH and no domain-specific unimodal pretraining. CONCH (No vision pretraining) replaces the image encoder pretrained on histopathology image patches with an analogous encoder pretrained on ImageNet. CONCH (No language pretraining) initializes the text encoder randomly instead of pretraining on pathology-related text. Left. Zero-shot performance on subtyping and grading tasks. Middle and right. Cross-modal retrieval performance.
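
The metrics referred to above (balanced accuracy, Cohen's κ and quadratically weighted Cohen's κ) can be computed with scikit-learn as sketched below, using toy labels for illustration.

```python
# Sketch of the evaluation metrics: balanced accuracy for most tasks,
# Cohen's kappa for DHMC LUAD and quadratically weighted Cohen's kappa
# for SICAP grading. Labels below are toy values.
from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

bal_acc = balanced_accuracy_score(y_true, y_pred)
kappa = cohen_kappa_score(y_true, y_pred)
qw_kappa = cohen_kappa_score(y_true, y_pred, weights="quadratic")
print(bal_acc, kappa, qw_kappa)
```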

Supplementary information

Supplementary Information

Supplementary Tables 1–44.

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Lu, M.Y., Chen, B., Williamson, D.F.K. et al. A visual-language foundation model for computational pathology. Nat Med 30, 863–874 (2024). https://doi.org/10.1038/s41591-024-02856-4
