Mammo-CLIP Optimization
Mammo-CLIP is a vision-language model tailored for mammography. Its primary goal is to align visual features from mammogram images with the corresponding textual descriptions in radiology reports. This alignment is achieved with separate encoders for images and text, trained so that matching image-text pairs are embedded close together in a shared feature space. This improves the model's ability to interpret and classify mammographic findings, yielding more reliable and interpretable outcomes.

Additionally, Mammo-CLIP leverages both image+text and image+label datasets to learn stronger representations through a multiview supervision (MVS) loss. In practice, we use an in-house image+report dataset from UPMC as the image+text dataset, alongside the publicly available VinDr dataset as the image+label dataset.
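To make the alignment objective concrete, the following is a minimal NumPy sketch of the symmetric image-text contrastive (InfoNCE) loss that CLIP-style models optimize: matched image-report pairs sit on the diagonal of a similarity matrix, and cross-entropy is applied in both directions. The function names, the temperature value, and the toy embeddings are illustrative assumptions, not Mammo-CLIP's actual implementation, and the MVS loss adds further supervision terms beyond this basic objective.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (batch, dim) arrays; row i of each is a matched pair.
    Hypothetical sketch -- not the actual Mammo-CLIP loss.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature       # (batch, batch) similarity matrix
    labels = np.arange(len(logits))          # matched pairs lie on the diagonal

    def cross_entropy(logits, labels):
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 16))
loss_random = clip_contrastive_loss(img, rng.normal(size=(4, 16)))
loss_aligned = clip_contrastive_loss(img, img)  # perfectly aligned pairs
```

Minimizing this loss pulls each mammogram's embedding toward its own report and pushes it away from the other reports in the batch, which is what places similar images and texts close together in the shared space.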