Mammo-CLIP: A Vision Language Foundation Model to Enhance Data Efficiency and Robustness in Mammography

1 ECE, Boston University; 2 Boston University Chobanian & Avedisian School of Medicine; 3 ISP, University of Pittsburgh
27th INTERNATIONAL CONFERENCE ON MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION (MICCAI) 2024
(Early Accept, top 11%)

Revolutionizing Breast Cancer Detection with AI

Breast cancer is a leading cause of death among women worldwide, with early detection being crucial for effective treatment. However, the development of robust computer-aided diagnosis (CAD) systems is often limited by the availability of large, diverse annotated datasets. Mammo-CLIP addresses these challenges by introducing a Vision Language Model specifically trained on paired mammogram images and reports, significantly enhancing the robustness and data efficiency of CAD systems.


Mammo-CLIP employs multi-view supervision, data augmentation, and a novel feature attribution method, Mammo-FActOR, to deliver interpretable and highly accurate results. By leveraging high-resolution images and diverse training data, Mammo-CLIP excels at detecting and localizing critical mammographic attributes such as masses and calcifications.


Explore our results and see how Mammo-CLIP sets a new standard in mammography AI.


Features

  • Mammo-CLIP is designed for downstream tasks including zero-shot classification of breast findings and disease, ensuring versatility across diverse datasets.
  • The model supports linear probe classification and localization, allowing for efficient use of varying amounts of labeled data to achieve accurate results.
  • Mammo-CLIP can be fine-tuned on specific datasets to further enhance classification and localization performance, particularly in identifying breast cancer and other related findings.
  • With Mammo-FActOR, the Mammo-CLIP vision encoder excels at localization, identifying findings such as masses and calcifications from descriptive sentences alone, without relying on ground-truth bounding boxes.
  • Although Mammo-CLIP is pre-trained primarily on screening mammograms, it performs strongly on cancer classification, demonstrating its ability to generalize to unseen domains.

Mammo-CLIP Schematic Overview

The schematic below illustrates the core components of Mammo-CLIP, including image-text augmentation, dataset augmentation, pretraining strategy, and feature attribution using Mammo-FActOR. These components are crucial for enhancing the model's ability to classify and localize mammographic features with high accuracy.


Mammo-CLIP Schematic
Fig. 1. Schematic view of our method. (a) Image-text augmentation for Multi-View Supervision (MVS). (b) Dataset augmentation by synthesizing reports using image-label datasets. (c) Mammo-CLIP pretraining strategy. (d) Feature attribution using Mammo-FActOR.

Mammo-CLIP Optimization

Mammo-CLIP is a Vision Language Model tailored for mammography. Its primary goal is to align visual features from mammogram images with the corresponding textual descriptions in radiology reports. This alignment uses separate encoders for images and text, trained so that matching images and texts lie close together in a shared feature space. This improves the model's ability to accurately interpret and classify mammographic findings, yielding more reliable and interpretable outcomes. Additionally, Mammo-CLIP leverages both image+text and image+label datasets to learn stronger representations through a multi-view supervision (MVS) loss. In practice, we use an in-house image+report dataset from UPMC as the image+text dataset, alongside the publicly available VinDr dataset as the image+label dataset.
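
Below is a minimal sketch of the CLIP-style alignment objective that underlies this training, assuming hypothetical projected embeddings from the image and text encoders; the actual MVS loss in Mammo-CLIP extends this idea to multiple image views and augmented report sentences, so treat the code as an illustration rather than the released implementation.

    # Symmetric contrastive alignment over a batch of paired image/text embeddings,
    # the building block that multi-view supervision (MVS) generalizes.
    import torch
    import torch.nn.functional as F

    def clip_alignment_loss(image_emb: torch.Tensor,
                            text_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
        image_emb = F.normalize(image_emb, dim=-1)       # (B, D)
        text_emb = F.normalize(text_emb, dim=-1)         # (B, D)
        logits = image_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_i2t = F.cross_entropy(logits, targets)      # match each image to its report
        loss_t2i = F.cross_entropy(logits.t(), targets)  # and each report to its image
        return 0.5 * (loss_i2t + loss_t2i)

    # Random embeddings stand in for encoder outputs (e.g., 512-d projections).
    img, txt = torch.randn(8, 512), torch.randn(8, 512)
    print(clip_alignment_loss(img, txt).item())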

Instance and Dataset Augmentation

To enhance the robustness and generalizability of Mammo-CLIP, both instance-level and dataset-level augmentation are used. Instance augmentation creates multiple modified versions of each mammogram image and its corresponding report: images undergo transformations such as flipping, rotation, and cropping, while the report text is lightly rephrased. Dataset augmentation, on the other hand, expands the pool of image-text pairs by synthesizing report sentences from image-label datasets such as VinDr (see Fig. 1b). Together, these strategies expose the model to more diverse patterns, making it better suited to variations in real-world data.
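
The following sketch illustrates both augmentation levels under stated assumptions: the image transforms, crop size, and label-to-sentence templates are illustrative stand-ins, not the exact pipeline used to train Mammo-CLIP.

    import random
    from torchvision import transforms

    # Instance-level image augmentation: flips, small rotations, and crops.
    image_aug = transforms.Compose([
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomRotation(degrees=10),
        transforms.RandomResizedCrop(size=(1520, 912), scale=(0.9, 1.0)),
    ])

    # Dataset-level augmentation: turn image+label annotations (e.g., VinDr)
    # into synthetic report sentences so they can join the image+text pool.
    TEMPLATES = {
        "Mass": ["A mass is seen in the {} breast.",
                 "There is a mass in the {} breast."],
        "Suspicious Calcification": ["Suspicious calcifications are noted in the {} breast."],
        "No Finding": ["No suspicious findings in the {} breast."],
    }

    def synthesize_report(labels, laterality="left"):
        sentences = [random.choice(TEMPLATES[l]).format(laterality)
                     for l in labels if l in TEMPLATES]
        return " ".join(sentences)

    print(synthesize_report(["Mass", "Suspicious Calcification"], laterality="right"))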

Mammo-FActOR: Interpretable AI for Mammography

Mammo-FActOR is an interpretability module integrated within Mammo-CLIP. This module maps the visual features extracted from mammograms to specific textual attributes found in radiology reports. By doing so, Mammo-FActOR provides a clearer understanding of how the model interprets different regions of the image in relation to the findings described in the text. This enhanced interpretability allows radiologists to better understand the model’s decision-making process, increasing the trustworthiness and transparency of the system.
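
To make the idea concrete, here is a minimal sketch of sentence-conditioned attribution in the spirit of Mammo-FActOR: each spatial location of the vision encoder's feature map is scored against a sentence embedding, and the scores are upsampled into a heatmap. The projection layer, tensor shapes, and output resolution are assumptions for illustration, not the released implementation.

    import torch
    import torch.nn.functional as F

    def sentence_heatmap(feature_map: torch.Tensor,   # (C, H, W) vision features
                         sentence_emb: torch.Tensor,  # (D,) text embedding
                         proj: torch.nn.Linear,       # learned map from C to D
                         out_size=(1520, 912)) -> torch.Tensor:
        C, H, W = feature_map.shape
        feats = F.normalize(proj(feature_map.flatten(1).t()), dim=-1)  # (H*W, D)
        sent = F.normalize(sentence_emb, dim=-1)                       # (D,)
        scores = (feats @ sent).view(1, 1, H, W)                       # per-location similarity
        return F.interpolate(scores, size=out_size, mode="bilinear",
                             align_corners=False).squeeze()            # full-resolution heatmap

    # Random tensors stand in for the vision and text encoder outputs.
    proj = torch.nn.Linear(2048, 512)
    print(sentence_heatmap(torch.randn(2048, 48, 29), torch.randn(512), proj).shape)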

Results and Impact

Mammo-CLIP demonstrates strong performance on two public datasets, RSNA and VinDr, significantly outperforming the baselines in both classification and localization tasks. The model's robustness and data efficiency make it a valuable asset for the early detection of breast cancer, enabling faster and more accurate diagnoses.

Detailed Task Descriptions

  • Zero-shot classification of cancer on the RSNA dataset, and of mass, calcification, and density on the VinDr dataset, without any fine-tuning (see the sketch after this list).
  • Classification of mass, calcification, and density on the VinDr dataset using the Mammo-CLIP vision encoder, trained with linear probes on 10%, 50%, and 100% of the training data.
  • Classification of mass, calcification, and density on the VinDr dataset by fine-tuning the Mammo-CLIP vision encoder with 100% of the training data.
  • Classification of cancer on the RSNA dataset by fine-tuning the Mammo-CLIP vision encoder with 10%, 50%, and 100% of the training data.
  • Classification of cancer on the RSNA dataset using the Mammo-CLIP vision encoder, trained with linear probes on 100% of the training data.
  • Localization of mass and calcification on the VinDr dataset by fine-tuning the Mammo-CLIP vision encoder with 10%, 50%, and 100% of the training data.
  • Localization of mass and calcification on the VinDr dataset using the Mammo-CLIP vision encoder with a frozen encoder, trained on 100% of the training data.
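
As a concrete illustration of the zero-shot protocol referenced in the first task, the sketch below embeds class prompts with the text encoder, embeds the image, and picks the closest prompt. The prompt wording and the 512-d embedding size are assumptions; the real evaluation uses the paper's prompt set and the Mammo-CLIP encoders.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def zero_shot_scores(image_emb: torch.Tensor, prompt_embs: torch.Tensor) -> torch.Tensor:
        """Softmax scores over class prompts for a single image embedding."""
        image_emb = F.normalize(image_emb, dim=-1)        # (D,)
        prompt_embs = F.normalize(prompt_embs, dim=-1)    # (K, D)
        return (prompt_embs @ image_emb).softmax(dim=-1)  # (K,)

    # Hypothetical malignancy prompts; embeddings are random stand-ins.
    prompts = ["no malignancy", "malignancy is present"]
    image_emb, prompt_embs = torch.randn(512), torch.randn(2, 512)
    print(dict(zip(prompts, zero_shot_scores(image_emb, prompt_embs).tolist())))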

Baselines

Using the UPMC (image+text) dataset and the CLIP objective, we construct two baselines: 1) an image encoder with a ResNet-50 (RN-50) backbone initialized with CLIP weights and fine-tuned on 224×224 images, and 2) an EfficientNet-B5 (EN-B5) encoder fine-tuned on the same pre-processed images as Mammo-CLIP. Both baselines are pre-trained only on the UPMC dataset, since the CLIP objective uses an image-text dataset and cannot exploit an image-label dataset.
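
A hedged sketch of how these two image encoders could be instantiated is shown below; the library choices (OpenAI's clip package and torchvision) and the feature-extractor setup are assumptions, and the actual training code may differ.

    import torch
    import clip                                      # pip install git+https://github.com/openai/CLIP.git
    from torchvision.models import efficientnet_b5   # torchvision >= 0.13

    # Baseline 1: ResNet-50 image encoder initialized from CLIP weights,
    # later fine-tuned on 224x224 mammograms with the CLIP objective on UPMC.
    clip_model, _ = clip.load("RN50", device="cpu")
    rn50_encoder = clip_model.visual

    # Baseline 2: EfficientNet-B5 encoder fine-tuned on the same
    # high-resolution pre-processed images as Mammo-CLIP.
    en_b5_encoder = efficientnet_b5(weights=None)
    en_b5_encoder.classifier = torch.nn.Identity()   # keep it as a feature extractor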

Classification Performance on RSNA Dataset

The plot below shows the classification performance of various models on malignancy classification using the RSNA dataset. Models are evaluated with AUC across several training settings: zero-shot, fine-tuning with 10%, 50%, and 100% of the data, and linear probing with 100% of the data.


Classification Performance on RSNA Dataset

Classification Performance for Calcification, Mass, and Density on VinDr dataset

This plot compares the AUC of various models on the calcification, mass, and density classification tasks. Performance is reported across zero-shot, linear-probe (10%, 50%, and 100% of the data), and fine-tuning (100% of the data) settings, providing insight into the model's robustness across conditions.


AUC Performance for Calcification, Mass, and Density

Supervised Localization Performance on VinDr Dataset

The following plot presents the localization performance (mAP) on the VinDr dataset. The evaluation compares models under different training conditions: a frozen encoder, and fine-tuning with 10%, 50%, and 100% of the data.


Localization Performance on VinDr Dataset

Mammo-FActOR Interpretability

This figure showcases the interpretability of Mammo-FActOR. The ground-truth regions for mass and calcification are compared against the model’s predictions, visualized through heatmaps that highlight the model’s focus areas.


Mammo-FActOR Interpretability
Ground-truth and Mammo-FActOR prediction visualizations for mass and calcification.

Weakly-Supervised Localization Results

The bar plot below compares the Intersection over Union (IoU) performance of two Mammo-CLIP variants for mass and calcification detection on the VinDr dataset. The IoU is reported at thresholds of 0.25 and 0.50, showcasing the models' effectiveness in weakly-supervised localization tasks using Mammo-FActOR.


Weakly-Supervised Localization Results
IoU comparison for weakly-supervised localization of mass and calcification.
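
For reference, the small helper below shows the IoU computation behind these thresholds: a predicted region counts as a hit when its overlap with a ground-truth box reaches 0.25 or 0.50. The (x1, y1, x2, y2) box format and the example coordinates are assumptions for illustration.

    def box_iou(a, b):
        """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    pred, gt = (100, 120, 300, 340), (110, 130, 310, 350)
    iou = box_iou(pred, gt)
    print(round(iou, 3), iou >= 0.25, iou >= 0.50)   # a hit at both thresholds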

Future Directions

We are excited to continue developing Mammo-CLIP by exploring the integration of vision transformers and cross-attention mechanisms, which have the potential to further enhance the model's performance. We also aim to expand the dataset to include more diverse mammogram images and reports, ensuring that Mammo-CLIP remains at the forefront of AI in healthcare.

Acknowledgments

This work was partially supported by the Pennsylvania Department of Health, NIH Award Number 1R01HL141813-01, and funding from the Hariri Institute for Computing at Boston University. We are grateful for computational resources from the Pittsburgh Supercomputing Center, grant number TG-ASC170024.

BibTeX


    @article{ghosh2024mammo,
      title={Mammo-CLIP: A Vision Language Foundation Model to Enhance Data Efficiency and Robustness in Mammography},
      author={Ghosh, Shantanu and Poynton, Clare B and Visweswaran, Shyam and Batmanghelich, Kayhan},
      journal={arXiv preprint arXiv:2405.12255},
      year={2024}
    }