LADDER: Language-Driven Slice Discovery and Error Rectification in Vision Classifiers

Shantanu Ghosh1, Rayan Syed1, Chenyu Wang1, Vaibhav Choudhary1, Binxu Li2, Clare B. Poynton3, Shyam Visweswaran4, Kayhan Batmanghelich1
1Boston University, 2Stanford University, 3BUMC, 4Pitt DBMI
Accepted at ACL 2025 Findings

LADDER Pipeline Illustration

TL;DR: LADDER is a modular framework that uses large language models (LLMs) to discover, explain, and mitigate hidden biases in vision classifiers—without requiring prior knowledge of bias attributes.

Abstract


Slice discovery refers to identifying systematic biases in the mistakes of pre-trained vision models. Current slice discovery methods in computer vision rely on converting input images into sets of attributes and then testing hypotheses about configurations of these pre-computed attributes associated with elevated error patterns. However, such methods face several limitations: 1) they are restricted by the predefined attribute bank; 2) they lack the common-sense reasoning and domain-specific knowledge often required for specialized fields, e.g., radiology; 3) at best, they can only identify biases in image attributes while overlooking those introduced during preprocessing or data preparation. We hypothesize that bias-inducing variables leave traces in the form of language (e.g., logs), which can be captured as unstructured text. Thus, we introduce LADDER, which leverages the reasoning capabilities and latent domain knowledge of Large Language Models (LLMs) to generate hypotheses about these mistakes. Specifically, we project the internal activations of a pre-trained model into text using a retrieval approach and prompt the LLM to propose potential bias hypotheses. To detect biases from preprocessing pipelines, we convert the preprocessing data into text and prompt the LLM. Finally, LADDER generates pseudo-labels for each identified bias, thereby mitigating all biases without requiring expensive attribute annotations. Rigorous evaluations on 3 natural and 3 medical imaging datasets, 200+ classifiers with varied architectures and pretraining strategies, and 4 LLMs demonstrate that LADDER consistently outperforms current methods.
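The retrieval-and-pseudo-labeling loop described above can be sketched in a few lines. This is a minimal illustration, assuming a shared CLIP-style image/text embedding space; `retrieve_captions` and `pseudo_label` are hypothetical helper names, and the simple keyword check stands in for the LLM's hypothesis test:

```python
import numpy as np

def retrieve_captions(error_embeddings, caption_embeddings, captions, k=2):
    # Project the activations of misclassified images into language by
    # retrieving the nearest captions in a shared vision-language space.
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    sims = normalize(error_embeddings) @ normalize(caption_embeddings).T
    topk = np.argsort(-sims, axis=1)[:, :k]  # top-k captions per error image
    return [[captions[j] for j in row] for row in topk]

def pseudo_label(captions_batch, bias_keyword):
    # Pseudo-label each sample by whether a hypothesized bias attribute
    # appears in its retrieved text (stand-in for the LLM-verified test).
    return [int(any(bias_keyword in c for c in caps)) for caps in captions_batch]
```

The pseudo-labels can then drive any group-aware mitigation method, with no manual attribute annotation.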

Code, Data, and Models


GitHub
Code, Data, and Models

Research questions addressed by LADDER


RQ1. How does LADDER perform in discovering error slices compared to baselines?
RQ2. How does LADDER leverage reasoning and latent domain knowledge of LLMs for slice discovery?
RQ3. How does LADDER discover biased attributes with different architectures and pre-training methods?
RQ4. How does LADDER mitigate biases using the discovered attributes?
RQ5. Can LADDER operate without captions?
RQ6. Can LADDER detect biases beyond captions/reports?

Experimental setup


Model Architectures & Pretraining Methods
  • ResNet-50 on ImageNet-1K using supervised pretraining (resnet_sup_in1k)
  • ResNet-50 on ImageNet-21K using supervised pretraining (resnet_sup_in21k)
  • ResNet-50 on ImageNet-1K using SimCLR (resnet_simclr_in1k)
  • ResNet-50 on ImageNet-1K using Barlow Twins (resnet_barlow_in1k)
  • ResNet-50 on ImageNet-1K using DINO (resnet_dino_in1k)
  • EfficientNet-B5 on ImageNet-1K using supervised pretraining (EF-B5) - For 2D mammograms
  • ViT-B on ImageNet-1K using supervised pretraining (vit_sup_in1k)
  • ViT-B on ImageNet-21K using supervised pretraining (vit_sup_in21k)
  • ViT-B from OpenAI CLIP (vit_clip_oai)
  • ViT-B pretrained using CLIP on LAION-2B (vit_clip_laion)
  • ViT-B on SWAG using weakly supervised pretraining (vit_sup_swag)
  • ViT-B on ImageNet-1K using DINO (vit_dino_in1k)
Slice discovery Baselines
Mitigation Baselines & Datasets
Available Algorithms (~20)
  • Empirical Risk Minimization (ERM)
  • Invariant Risk Minimization (IRM)
  • Group Distributionally Robust Optimization (GroupDRO)
  • Conditional Value-at-Risk DRO (CVaRDRO)
  • Mixup
  • Just Train Twice (JTT)
  • Learning from Failure (LfF)
  • Learning Invariant Predictors with Selective Augmentation (LISA)
  • Deep Feature Reweighting (DFR)
  • Maximum Mean Discrepancy (MMD)
  • Class-Balanced Loss (CBLoss)
  • Label-Distribution-Aware Margin Loss (LDAM)
  • Classifier Re-Training (CRT)
  • Reweight-Classifier Re-Training (ReWeightCRT)
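Many of the reweighting-style baselines above (and mitigation driven by LADDER's pseudo-labels) share a simple core: upweight minority, bias-conflicting groups in the loss. A minimal sketch, assuming integer group labels from pseudo-labeling (the function name and normalization scheme are illustrative, not the repository's API):

```python
import numpy as np

def group_weights(pseudo_groups):
    # Inverse-frequency weight per pseudo-labeled bias group, normalized so
    # the weights average to 1. Minority (bias-conflicting) samples get
    # larger weights and thus contribute more to the training loss.
    groups, counts = np.unique(pseudo_groups, return_counts=True)
    inv = {g: len(pseudo_groups) / (len(groups) * c) for g, c in zip(groups, counts)}
    return np.array([inv[g] for g in pseudo_groups])
```

With annotated group labels this reduces to standard reweighting; the point of LADDER is that the groups come from LLM-generated pseudo-labels instead.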
Available Datasets (13)

Vision-Language Representation space & LLMs


Vision-Language Models
Captioning & Hypothesis LLMs
  • Captions: BLIP and GPT-4o
  • LLMs for Hypothesis Generation: GPT-4o, Claude, Gemini, LLaMA
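The hypothesis-generation step prompts one of these LLMs with captions contrasting the error slice against correctly classified images. A rough template builder, purely illustrative (the exact prompts used by LADDER are in the repository):

```python
def build_hypothesis_prompt(class_name, error_captions, correct_captions):
    # Contrast captions of misclassified vs. correctly classified images
    # and ask the LLM for testable bias hypotheses.
    lines = [
        f"A classifier for '{class_name}' makes systematic mistakes.",
        "Captions of MISCLASSIFIED images:",
        *[f"- {c}" for c in error_captions],
        "Captions of CORRECTLY classified images:",
        *[f"- {c}" for c in correct_captions],
        "List visual attributes that appear mostly in the misclassified set",
        "and could be spurious correlations, phrased as testable hypotheses.",
    ]
    return "\n".join(lines)
```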

Results


RQ1: Performance of LADDER compared with other slice discovery algorithms
RQ1 Plot
RQ2: Leveraging LLM’s reasoning and domain knowledge for bias discovery
RQ2 Plot
RQ3: Biased attributes discovery across architectures/pre-training methods
RQ3 Plot RQ3 Plot
RQ4: Mitigating biases using LADDER
RQ4 Plot
RQ5: Relaxing the dependency on captions
RQ5 Plot 1
Fig. Biased attributes detected by LADDER w/ captions and w/ instruction-tuned models (w/o captions). Bright/light colors show presence/absence of attributes.
RQ5 Plot 2
Fig. (a) Precision@10 for slice discovery and (b) WGA for bias mitigation using LADDER w/ captions vs. instruction-tuned models.
RQ6: Detecting biases beyond captions/reports
RQ6 Plot
Fig. LADDER detects biases beyond reports, identifying biases from metadata (age, view and implant) and DICOM headers (Photometric interpretation).

Ablations


Ablation 1: Impact of captioners on LADDER
Ablation 1 Plot
Ablation 2: Impact of the choice of LLM on detecting biases using LADDER
Ablation 2 Plot
Ablation 3: Impact of the choice of LLM on the mitigation strategy of LADDER
Ablation 3 Plot
Ablation 4: Impact of the choice of vision-language models on LADDER for the CXR domain
Ablation 4 Plot
Ablation 5: Cost of using various LLMs
Ablation 5 Plot

Extension to tabular data


Check out my paper, where LADDER's mitigation approach was applied to tabular data using self-supervised learning. This work was completed during my internship with the Amazon AWS SAAR team in NYC in the summer of 2024.

Citation


@article{ghosh2024ladder,
  title={LADDER: Language Driven Slice Discovery and Error Rectification},
  author={Ghosh, Shantanu and Syed, Rayan and Wang, Chenyu and Poynton, Clare B and Visweswaran, Shyam and Batmanghelich, Kayhan},
  journal={arXiv preprint arXiv:2408.07832},
  year={2024}
}