Mammo-CLIP: A Vision Language Foundation Model to Enhance Data Efficiency and Robustness in Mammography

Shantanu Ghosh1, Clare B. Poynton2, Shyam Visweswaran3, Kayhan Batmanghelich1
1Boston University, 2BUMC, 3Pitt DBMI
27th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2024
(Early Accept, top 11%)

Mammo-CLIP Schematic


Fig. 1. Schematic view of our method. (a) Image-text augmentation for Multi-View Supervision (MVS). (b) Dataset augmentation by synthesizing reports using image-label datasets. (c) Mammo-CLIP pretraining strategy. (d) Feature attribution using Mammo-FActOR.

TL;DR: Mammo‑CLIP is the first vision‑language model trained on screening mammogram-report pairs, combining multi‑view supervision and report‑synthesized augmentation to achieve robust, data‑efficient classification and localization of key breast imaging features.

🔥🔥🔥 [ACL2025] New Update: Bias Discovery with LADDER + Mammo-CLIP


LADDER (Language-Driven Slice Discovery and Error Rectification) is a new framework from our lab that uses Mammo-CLIP to detect and correct biases in vision classifiers, including those trained on mammograms.

Instead of relying on manually defined subgroups or attributes, LADDER discovers failure modes and bias slices using natural language and interpretable reasoning from Large Language Models (LLMs).

  • 🧠 Automatically identify performance disparities across latent subgroups
  • 🩻 Evaluate alignment of radiology reports with model predictions
  • ⚙️ Use pseudo-labels and debiasing to correct classifier errors — no extra annotation needed

Example: If your model underperforms on younger patients or dense breast cases, LADDER helps surface those slices using textual probes (e.g., "dense breast with asymmetry") and suggests retraining paths to reduce bias — all without needing protected attribute labels.

This work was presented at ACL 2025.


Revolutionizing Breast Cancer Detection with AI


Breast cancer is a leading cause of death among women worldwide, and early detection is crucial for effective treatment. However, the development of robust computer-aided diagnosis (CAD) systems is often limited by the availability of large, diverse annotated datasets. Mammo-CLIP addresses these challenges by introducing a Vision Language Model trained on paired mammogram images and reports, significantly enhancing the robustness and data efficiency of CAD systems. Mammo-CLIP employs multi-view supervision, data augmentation, and a novel feature attribution method called Mammo-FActOR to provide interpretable and accurate results. By leveraging high-resolution images and diverse training data, Mammo-CLIP excels at detecting and localizing critical mammographic attributes such as masses and calcifications. Explore our results and see how Mammo-CLIP sets a new standard in mammography AI.

Features

Code, Data, and Models

Code, data, and pretrained models are available on GitHub.

Mammo-CLIP Optimization


Mammo-CLIP is a Vision Language Model tailored for mammography. Its primary goal is to align visual features from mammogram images with the corresponding textual descriptions found in radiology reports. This alignment is accomplished by using separate encoders for images and text, so that matching images and texts lie close together in a shared feature space. This design enhances the model's ability to accurately interpret and classify mammographic findings, resulting in more reliable and interpretable outcomes. Additionally, Mammo-CLIP leverages both image+text and image+label datasets to learn superior representations through a multi-view supervision (MVS) loss, sketched below. In practice, we use an in-house image+report dataset from UPMC alongside the publicly available VinDr dataset as our image+label dataset.
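For intuition, here is a minimal sketch of the symmetric CLIP-style contrastive objective that underlies this alignment. It is not the exact MVS loss from the paper (which additionally supervises multiple views of the same study), and the encoder stand-ins, embedding dimension, and temperature are illustrative assumptions.

import torch
import torch.nn.functional as F

def clip_style_loss(image_emb: torch.Tensor,
                    text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)                # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)            # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

# Stand-ins for encoder outputs, e.g. an EfficientNet-B5 image encoder and a
# BERT-style text encoder, each projected to a shared 512-d space (assumed size).
img_feat, txt_feat = torch.randn(8, 512), torch.randn(8, 512)
print(clip_style_loss(img_feat, txt_feat))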

Instance and Dataset Augmentation


To enhance the robustness and generalizability of Mammo-CLIP, both instance-level and dataset-level augmentation are used. Instance augmentation creates multiple modified versions of each mammogram image and its corresponding report: images are flipped, rotated, and cropped, while the text is lightly rephrased. Dataset augmentation, in contrast, synthesizes new image-text pairs by generating report-style sentences for image+label datasets such as VinDr, expanding the pool of paired training data. Together, these strategies expose the model to more diverse patterns, making it better suited to variations in real-world data (see the sketch below).
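The snippet below illustrates both ideas under stated assumptions: standard torchvision transforms stand in for the image perturbations, and the rephrasing rules and report templates are hypothetical examples rather than the exact ones used in the paper.

import random
import torchvision.transforms as T

# Instance augmentation: perturb the image and lightly rephrase the report text.
image_aug = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomAffine(degrees=10, translate=(0.05, 0.05)),
    T.RandomResizedCrop(size=(512, 512), scale=(0.8, 1.0)),
])
# image_aug(img_tensor) would return a randomly perturbed view of the mammogram.

def rephrase(report: str) -> str:
    """Toy text augmentation: swap in a synonymous phrase (hypothetical rules)."""
    swaps = {"is seen": "is noted", "no evidence of": "without evidence of"}
    for old, new in swaps.items():
        if old in report and random.random() < 0.5:
            report = report.replace(old, new)
    return report

# Dataset augmentation: synthesize a report-style sentence from an image+label pair
# (e.g., a VinDr study), turning a label-only dataset into image-text pairs.
TEMPLATES = {
    "mass": "A mass is seen in the {laterality} breast.",
    "calcification": "Calcifications are noted in the {laterality} breast.",
}

def synthesize_report(findings, laterality):
    if not findings:
        return "No significant findings."
    return " ".join(TEMPLATES[f].format(laterality=laterality) for f in findings)

print(rephrase("A mass is seen in the left breast with no evidence of calcification."))
print(synthesize_report(["mass", "calcification"], laterality="left"))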

Mammo-FActOR: Interpretable AI for Mammography


Mammo-FActOR is an interpretability module integrated within Mammo-CLIP. This module maps the visual features extracted from mammograms to specific textual attributes found in radiology reports. By doing so, Mammo-FActOR provides a clearer understanding of how the model interprets different regions of the image in relation to the findings described in the text. This enhanced interpretability allows radiologists to better understand the model’s decision-making process, increasing the trustworthiness and transparency of the system.
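As a rough illustration of the idea (not the paper's exact formulation), the sketch below scores each spatial location of an image feature map against a text embedding for a finding such as "mass", producing a heatmap. The projection matrix, feature sizes, and embedding dimension are placeholder assumptions.

import torch
import torch.nn.functional as F

def attribute_finding(feature_map: torch.Tensor,   # (C, H, W) from the image encoder
                      text_emb: torch.Tensor,      # (D,) embedding of a finding phrase
                      proj: torch.Tensor           # (D, C) learned projection to text space
                      ) -> torch.Tensor:
    """Return an (H, W) heatmap of similarity between image regions and a finding."""
    c, h, w = feature_map.shape
    spatial = feature_map.reshape(c, h * w).t()            # (H*W, C) per-location features
    spatial = F.normalize(spatial @ proj.t(), dim=-1)      # project into the text space
    text_emb = F.normalize(text_emb, dim=-1)
    return (spatial @ text_emb).reshape(h, w)              # cosine similarity per location

feat = torch.randn(2048, 16, 16)   # stand-in for a ResNet/EfficientNet feature map
txt = torch.randn(512)             # stand-in for the "mass" sentence embedding
W = torch.randn(512, 2048)         # stand-in for the learned projection
print(attribute_finding(feat, txt, W).shape)   # torch.Size([16, 16])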

Results and Impact


Mammo-CLIP demonstrates strong performance on two public datasets, RSNA and VinDr, significantly outperforming the baselines in both classification and localization tasks. Its robustness and data efficiency make it a valuable asset for early detection of breast cancer, potentially saving lives by enabling faster and more accurate diagnoses.

Detailed Task Descriptions

Baselines


Using the UPMC (image+text) dataset and the CLIP objective, we construct two baselines: 1) a ResNet-50 (RN-50) image encoder initialized with CLIP weights and fine-tuned on 224×224 images, and 2) an EfficientNet-B5 (EN-B5) image encoder fine-tuned on the same pre-processed images as Mammo-CLIP. Both baselines are pre-trained only on the UPMC dataset, since CLIP uses an image-text dataset but not an image-label dataset.

Classification Performance on RSNA Dataset


The plot below shows the classification performance of various models on the RSNA dataset for malignancy classification. The models were evaluated using AUC scores across different training settings: zero-shot; fine-tuning with 10%, 50%, and 100% of the data; and a linear probe with 100% of the data.

Classification Performance on RSNA Dataset
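To make the zero-shot setting concrete, here is a hedged sketch of how a CLIP-style model can classify malignancy without task-specific training: the image embedding is compared to embeddings of a positive and a negative text prompt. The encoder calls and prompt wording are assumptions for illustration, not the exact Mammo-CLIP API.

import torch
import torch.nn.functional as F

def zero_shot_malignancy(image_emb: torch.Tensor,
                         pos_prompt_emb: torch.Tensor,
                         neg_prompt_emb: torch.Tensor) -> float:
    """Return P(malignant) from similarities to the two prompt embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    prompts = F.normalize(torch.stack([neg_prompt_emb, pos_prompt_emb]), dim=-1)
    probs = (prompts @ image_emb).softmax(dim=0)
    return probs[1].item()

# img_emb = image_encoder(mammogram)                        # hypothetical call
# pos_emb = text_encoder("mammogram showing malignancy")    # hypothetical prompts
# neg_emb = text_encoder("normal mammogram, no malignancy")
img_emb, pos_emb, neg_emb = torch.randn(512), torch.randn(512), torch.randn(512)
print(zero_shot_malignancy(img_emb, pos_emb, neg_emb))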

Classification Performance for Calcification, Mass, and Density on VinDr dataset


This plot compares the AUC performance of various models on calcification, mass, and density classification tasks. The performance is reported across zero-shot, linear probe (10%, 50%, 100% data), and fine-tune (100% data) settings, providing insights into the model's robustness across different conditions.

AUC Performance for Calcification, Mass, and Density
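For reference, the linear-probe setting mentioned above can be sketched as follows: the Mammo-CLIP image encoder stays frozen and only a single linear layer is trained on its pooled features. The feature dimension, findings, and training step here are illustrative assumptions.

import torch
import torch.nn as nn

feature_dim, num_findings = 2048, 2          # e.g., binary labels for calcification and mass
probe = nn.Linear(feature_dim, num_findings)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()           # multi-label findings

# features = frozen_image_encoder(images).detach()   # hypothetical frozen-encoder call
features = torch.randn(4, feature_dim)               # stand-in: pre-extracted features
labels = torch.randint(0, 2, (4, num_findings)).float()

loss = criterion(probe(features), labels)    # only the linear probe receives gradients
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())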

Supervised Localization Performance on VinDr Dataset


The following plot presents the localization performance (mAP) on the VinDr dataset. The evaluation compares different models under various training conditions, including a frozen encoder and fine-tuning with 10%, 50%, and 100% of the data.

Localization Performance on VinDr Dataset

Mammo-FActOR Interpretability


This figure showcases the interpretability of Mammo-FActOR. The ground-truth regions for mass and calcification are compared against the model’s predictions, visualized through heatmaps that highlight the model’s focus areas.

Mammo-FActOR Interpretability
Ground-truth and Mammo-FActOR prediction visualizations for mass and calcification.

Weakly-Supervised Localization Results


The bar plot below compares the Intersection over Union (IoU) performance of two Mammo-CLIP variants for mass and calcification detection on the VinDr dataset. The IoU is reported at thresholds of 0.25 and 0.50, showcasing the models' effectiveness in weakly-supervised localization tasks using Mammo-FActOR.

Weakly-Supervised Localization Results
IoU comparison for weakly-supervised localization of mass and calcification.
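For clarity on the metric, the snippet below shows how IoU at a threshold is computed for a single predicted box (for example, one derived from a Mammo-FActOR heatmap) against a ground-truth box; the box coordinates are made up for illustration.

def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

pred, gt = (30, 40, 120, 160), (50, 60, 130, 170)   # hypothetical boxes
score = iou(pred, gt)
print(score, score >= 0.25, score >= 0.50)           # counted as a hit at each threshold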

Citation



@InProceedings{10.1007/978-3-031-72390-2_59,
  author    = {Ghosh, Shantanu and Poynton, Clare B. and Visweswaran, Shyam and Batmanghelich, Kayhan},
  editor    = {Linguraru, Marius George and Dou, Qi and Feragen, Aasa and Giannarou, Stamatia and Glocker, Ben and Lekadir, Karim and Schnabel, Julia A.},
  title     = {Mammo-CLIP: A Vision Language Foundation Model to Enhance Data Efficiency and Robustness in Mammography},
  booktitle = {Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
  year      = {2024},
  publisher = {Springer Nature Switzerland},
  address   = {Cham},
  pages     = {632--642},
  isbn      = {978-3-031-72390-2}
}