Mammo-CLIP: A Vision Language Foundation Model to Enhance Data Efficiency and Robustness in Mammography

1 ECE, Boston University; 2 Boston University Chobanian & Avedisian School of Medicine; 3 ISP, University of Pittsburgh
27th INTERNATIONAL CONFERENCE ON MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION (MICCAI) 2024
(Early Accept, top 11%)

Revolutionizing Breast Cancer Detection with AI

Breast cancer is a leading cause of death among women worldwide, with early detection being crucial for effective treatment. However, the development of robust computer-aided diagnosis (CAD) systems is often limited by the availability of large, diverse annotated datasets. Mammo-CLIP addresses these challenges by introducing a Vision Language Model specifically trained on paired mammogram images and reports, significantly enhancing the robustness and data efficiency of CAD systems.


Mammo-CLIP employs multi-view supervision, data augmentation, and a novel feature attribution method, Mammo-FActOR, to deliver interpretable and highly accurate results. By leveraging high-resolution images and diverse training data, Mammo-CLIP excels at detecting and localizing critical mammographic attributes such as masses and calcifications.


Explore our results and see how Mammo-CLIP sets a new standard in mammography AI.


Features

  • Mammo-CLIP is designed for downstream tasks including zero-shot classification of breast findings and disease, ensuring versatility across diverse datasets.
  • The model supports linear probe classification and localization, allowing for efficient use of varying amounts of labeled data to achieve accurate results.
  • Mammo-CLIP can be fine-tuned on specific datasets to further enhance classification and localization performance, particularly in identifying breast cancer and other related findings.
  • With Mammo-FActOR, the Mammo-CLIP vision encoder excels at localization, identifying findings such as masses and calcifications from descriptive sentences alone, without relying on ground-truth bounding boxes.
  • Although Mammo-CLIP is pre-trained primarily on screening mammograms, it performs strongly on cancer classification, demonstrating its ability to generalize to unseen domains.

Mammo-CLIP Schematic Overview

The schematic below illustrates the core components of Mammo-CLIP, including image-text augmentation, dataset augmentation, pretraining strategy, and feature attribution using Mammo-FActOR. These components are crucial for enhancing the model's ability to classify and localize mammographic features with high accuracy.


Mammo-CLIP Schematic
Fig. 1. Schematic view of our method. (a) Image-text augmentation for Multi-View Supervision (MVS). (b) Dataset augmentation by synthesizing reports using image-label datasets. (c) Mammo-CLIP pretraining strategy. (d) Feature attribution using Mammo-FActOR.

Mammo-CLIP Optimization

Mammo-CLIP is a Vision Language Model tailored for mammography. Its primary goal is to align visual features from mammogram images with the corresponding textual descriptions in radiology reports. This alignment uses separate encoders for images and text, trained so that matching images and texts lie close together in a shared feature space. This improves the model's ability to accurately interpret and classify mammographic findings, yielding more reliable and interpretable outcomes. Additionally, Mammo-CLIP leverages both image+text and image+label datasets to learn stronger representations through a multi-view supervision (MVS) loss. In practice, we use an in-house image+report dataset from UPMC as the image+text dataset, alongside the publicly available VinDr dataset as the image+label dataset.
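
Below is a minimal sketch of the CLIP-style alignment objective that underlies this training, assuming hypothetical projected embeddings from the image and text encoders; the actual MVS loss in Mammo-CLIP extends this idea to multiple image views and augmented report sentences, so treat the code as an illustration rather than the released implementation.

    # Symmetric contrastive alignment over a batch of paired image/text embeddings,
    # the building block that multi-view supervision (MVS) generalizes.
    import torch
    import torch.nn.functional as F

    def clip_alignment_loss(image_emb: torch.Tensor,
                            text_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
        image_emb = F.normalize(image_emb, dim=-1)       # (B, D)
        text_emb = F.normalize(text_emb, dim=-1)         # (B, D)
        logits = image_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_i2t = F.cross_entropy(logits, targets)      # match each image to its report
        loss_t2i = F.cross_entropy(logits.t(), targets)  # and each report to its image
        return 0.5 * (loss_i2t + loss_t2i)

    # Random embeddings stand in for encoder outputs (e.g., 512-d projections).
    img, txt = torch.randn(8, 512), torch.randn(8, 512)
    print(clip_alignment_loss(img, txt).item())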

Instance and Dataset Augmentation

To enhance the robustness and generalizability of Mammo-CLIP, both instance-level and dataset-level augmentation are used. Instance augmentation creates multiple modified versions of each mammogram image and its corresponding report: images undergo transformations such as flipping, rotation, and cropping, while the report text is lightly rephrased. Dataset augmentation, on the other hand, expands the pool of image-text pairs by synthesizing report sentences from image-label datasets such as VinDr (see Fig. 1b). Together, these strategies expose the model to more diverse patterns, making it better suited to variations in real-world data.
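
The following sketch illustrates both augmentation levels under stated assumptions: the image transforms, crop size, and label-to-sentence templates are illustrative stand-ins, not the exact pipeline used to train Mammo-CLIP.

    import random
    from torchvision import transforms

    # Instance-level image augmentation: flips, small rotations, and crops.
    image_aug = transforms.Compose([
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomRotation(degrees=10),
        transforms.RandomResizedCrop(size=(1520, 912), scale=(0.9, 1.0)),
    ])

    # Dataset-level augmentation: turn image+label annotations (e.g., VinDr)
    # into synthetic report sentences so they can join the image+text pool.
    TEMPLATES = {
        "Mass": ["A mass is seen in the {} breast.",
                 "There is a mass in the {} breast."],
        "Suspicious Calcification": ["Suspicious calcifications are noted in the {} breast."],
        "No Finding": ["No suspicious findings in the {} breast."],
    }

    def synthesize_report(labels, laterality="left"):
        sentences = [random.choice(TEMPLATES[l]).format(laterality)
                     for l in labels if l in TEMPLATES]
        return " ".join(sentences)

    print(synthesize_report(["Mass", "Suspicious Calcification"], laterality="right"))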

Mammo-FActOR: Interpretable AI for Mammography

Mammo-FActOR is an interpretability module integrated within Mammo-CLIP. This module maps the visual features extracted from mammograms to specific textual attributes found in radiology reports. By doing so, Mammo-FActOR provides a clearer understanding of how the model interprets different regions of the image in relation to the findings described in the text. This enhanced interpretability allows radiologists to better understand the model’s decision-making process, increasing the trustworthiness and transparency of the system.
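
To make the idea concrete, here is a minimal sketch of sentence-conditioned attribution in the spirit of Mammo-FActOR: each spatial location of the vision encoder's feature map is scored against a sentence embedding, and the scores are upsampled into a heatmap. The projection layer, tensor shapes, and output resolution are assumptions for illustration, not the released implementation.

    import torch
    import torch.nn.functional as F

    def sentence_heatmap(feature_map: torch.Tensor,   # (C, H, W) vision features
                         sentence_emb: torch.Tensor,  # (D,) text embedding
                         proj: torch.nn.Linear,       # learned map from C to D
                         out_size=(1520, 912)) -> torch.Tensor:
        C, H, W = feature_map.shape
        feats = F.normalize(proj(feature_map.flatten(1).t()), dim=-1)  # (H*W, D)
        sent = F.normalize(sentence_emb, dim=-1)                       # (D,)
        scores = (feats @ sent).view(1, 1, H, W)                       # per-location similarity
        return F.interpolate(scores, size=out_size, mode="bilinear",
                             align_corners=False).squeeze()            # full-resolution heatmap

    # Random tensors stand in for the vision and text encoder outputs.
    proj = torch.nn.Linear(2048, 512)
    print(sentence_heatmap(torch.randn(2048, 48, 29), torch.randn(512), proj).shape)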

Results and Impact

Mammo-CLIP demonstrates strong performance on two public datasets, RSNA and VinDr, significantly outperforming the baselines in both classification and localization tasks. The model's robustness and data efficiency make it a valuable asset for the early detection of breast cancer, enabling faster and more accurate diagnoses.

Detailed Task Descriptions

  • Zero-shot classification of cancer on the RSNA dataset, and of mass, calcification, and density on the VinDr dataset, without any fine-tuning (see the sketch after this list).
  • Classification of mass, calcification, and density on the VinDr dataset using the Mammo-CLIP vision encoder, trained with linear probes on 10%, 50%, and 100% of the training data.
  • Classification of mass, calcification, and density on the VinDr dataset by fine-tuning the Mammo-CLIP vision encoder with 100% of the training data.
  • Classification of cancer on the RSNA dataset by fine-tuning the Mammo-CLIP vision encoder with 10%, 50%, and 100% of the training data.
  • Classification of cancer on the RSNA dataset using the Mammo-CLIP vision encoder, trained with linear probes on 100% of the training data.
  • Localization of mass and calcification on the VinDr dataset by fine-tuning the Mammo-CLIP vision encoder with 10%, 50%, and 100% of the training data.
  • Localization of mass and calcification on the VinDr dataset using the Mammo-CLIP vision encoder with a frozen encoder, trained on 100% of the training data.
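
As a concrete illustration of the zero-shot protocol referenced in the first task, the sketch below embeds class prompts with the text encoder, embeds the image, and picks the closest prompt. The prompt wording and the 512-d embedding size are assumptions; the real evaluation uses the paper's prompt set and the Mammo-CLIP encoders.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def zero_shot_scores(image_emb: torch.Tensor, prompt_embs: torch.Tensor) -> torch.Tensor:
        """Softmax scores over class prompts for a single image embedding."""
        image_emb = F.normalize(image_emb, dim=-1)        # (D,)
        prompt_embs = F.normalize(prompt_embs, dim=-1)    # (K, D)
        return (prompt_embs @ image_emb).softmax(dim=-1)  # (K,)

    # Hypothetical malignancy prompts; embeddings are random stand-ins.
    prompts = ["no malignancy", "malignancy is present"]
    image_emb, prompt_embs = torch.randn(512), torch.randn(2, 512)
    print(dict(zip(prompts, zero_shot_scores(image_emb, prompt_embs).tolist())))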

Baselines

Using the UPMC (image+text) dataset and the CLIP objective, we construct two baselines: 1) an image encoder with a ResNet-50 (RN-50) backbone initialized with CLIP weights and fine-tuned on 224×224 images, and 2) an EfficientNet-B5 (EN-B5) encoder fine-tuned on the same pre-processed images as Mammo-CLIP. Both baselines are pre-trained only on the UPMC dataset, since the CLIP objective uses an image-text dataset and cannot exploit an image-label dataset.
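
A hedged sketch of how these two image encoders could be instantiated is shown below; the library choices (OpenAI's clip package and torchvision) and the feature-extractor setup are assumptions, and the actual training code may differ.

    import torch
    import clip                                      # pip install git+https://github.com/openai/CLIP.git
    from torchvision.models import efficientnet_b5   # torchvision >= 0.13

    # Baseline 1: ResNet-50 image encoder initialized from CLIP weights,
    # later fine-tuned on 224x224 mammograms with the CLIP objective on UPMC.
    clip_model, _ = clip.load("RN50", device="cpu")
    rn50_encoder = clip_model.visual

    # Baseline 2: EfficientNet-B5 encoder fine-tuned on the same
    # high-resolution pre-processed images as Mammo-CLIP.
    en_b5_encoder = efficientnet_b5(weights=None)
    en_b5_encoder.classifier = torch.nn.Identity()   # keep it as a feature extractor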

Classification Performance on RSNA Dataset

The plot below shows the classification performance of various models on malignancy classification using the RSNA dataset. Models are evaluated with AUC across several training settings: zero-shot, fine-tuning with 10%, 50%, and 100% of the data, and linear probing with 100% of the data.


Classification Performance on RSNA Dataset

Classification Performance for Calcification, Mass, and Density on VinDr dataset

This plot compares the AUC of various models on the calcification, mass, and density classification tasks. Performance is reported across zero-shot, linear-probe (10%, 50%, and 100% of the data), and fine-tuning (100% of the data) settings, providing insight into the model's robustness across conditions.


AUC Performance for Calcification, Mass, and Density

Supervised Localization Performance on VinDr Dataset

The following plot presents the localization performance (mAP) on the VinDr dataset. The evaluation compares models under different training conditions: a frozen encoder, and fine-tuning with 10%, 50%, and 100% of the data.


Localization Performance on VinDr Dataset

Mammo-FActOR Interpretability

This figure showcases the interpretability of Mammo-FActOR. The ground-truth regions for mass and calcification are compared against the model’s predictions, visualized through heatmaps that highlight the model’s focus areas.


Mammo-FActOR Interpretability
Ground-truth and Mammo-FActOR prediction visualizations for mass and calcification.

Weakly-Supervised Localization Results

The bar plot below compares the Intersection over Union (IoU) performance of two Mammo-CLIP variants for mass and calcification detection on the VinDr dataset. The IoU is reported at thresholds of 0.25 and 0.50, showcasing the models' effectiveness in weakly-supervised localization tasks using Mammo-FActOR.


Weakly-Supervised Localization Results
IoU comparison for weakly-supervised localization of mass and calcification.
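
For reference, the small helper below shows the IoU computation behind these thresholds: a predicted region counts as a hit when its overlap with a ground-truth box reaches 0.25 or 0.50. The (x1, y1, x2, y2) box format and the example coordinates are assumptions for illustration.

    def box_iou(a, b):
        """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    pred, gt = (100, 120, 300, 340), (110, 130, 310, 350)
    iou = box_iou(pred, gt)
    print(round(iou, 3), iou >= 0.25, iou >= 0.50)   # a hit at both thresholds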

Future Directions

We are excited to continue developing Mammo-CLIP by exploring the integration of vision transformers and cross-attention mechanisms, which have the potential to further enhance the model's performance. We also aim to expand the dataset to include more diverse mammogram images and reports, ensuring that Mammo-CLIP remains at the forefront of AI in healthcare.

Acknowledgments

This work was partially supported by the Pennsylvania Department of Health, NIH Award Number 1R01HL141813-01, and funding from the Hariri Institute for Computing at Boston University. We are grateful for computational resources from the Pittsburgh Supercomputing Center, grant number TG-ASC170024.

BibTeX


    @article{ghosh2024mammo,
      title={Mammo-CLIP: A Vision Language Foundation Model to Enhance Data Efficiency and Robustness in Mammography},
      author={Ghosh, Shantanu and Poynton, Clare B and Visweswaran, Shyam and Batmanghelich, Kayhan},
      journal={arXiv preprint arXiv:2405.12255},
      year={2024}
    }