LADDER: Language-Driven Slice Discovery and Error Rectification in Vision Classifiers

Shantanu Ghosh1, Rayan Syed1, Chenyu Wang1, Vaibhav Choudhary1, Binxu Li2, Clare B. Poynton3, Shyam Visweswaran4, Kayhan Batmanghelich1
1Boston University, 2Stanford University, 3BUMC, 4Pitt DBMI
Accepted at ACL 2025 Findings

LADDER Pipeline Illustration

TL;DR: LADDER is a modular framework that uses large language models (LLMs) to discover, explain, and mitigate hidden biases in vision classifiers—without requiring prior knowledge of bias attributes.

Abstract


Slice discovery refers to identifying systematic biases in the mistakes of pre-trained vision models. Current slice discovery methods in computer vision rely on converting input images into sets of attributes and then testing hypotheses about configurations of these pre-computed attributes associated with elevated error patterns. However, such methods face several limitations: 1) they are restricted by the predefined attribute bank; 2) they lack the common-sense reasoning and domain-specific knowledge often required for specialized fields, e.g., radiology; 3) at best, they can only identify biases in image attributes while overlooking those introduced during preprocessing or data preparation. We hypothesize that bias-inducing variables leave traces in the form of language (e.g., logs), which can be captured as unstructured text. Thus, we introduce LADDER, which leverages the reasoning capabilities and latent domain knowledge of Large Language Models (LLMs) to generate hypotheses about these mistakes. Specifically, we project the internal activations of a pre-trained model into text using a retrieval approach and prompt the LLM to propose potential bias hypotheses. To detect biases from preprocessing pipelines, we convert the preprocessing data into text and prompt the LLM. Finally, LADDER generates pseudo-labels for each identified bias, thereby mitigating all biases without requiring expensive attribute annotations. Rigorous evaluations on 3 natural and 3 medical imaging datasets, 200+ classifiers with varied architectures and pretraining strategies, and 4 LLMs demonstrate that LADDER consistently outperforms current methods.
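The retrieval-and-pseudo-labeling loop described above can be sketched in a few lines. This is a minimal illustration, assuming a shared CLIP-style image/text embedding space; `retrieve_captions` and `pseudo_label` are hypothetical helper names, and the simple keyword check stands in for the LLM's hypothesis test:

```python
import numpy as np

def retrieve_captions(error_embeddings, caption_embeddings, captions, k=2):
    # Project the activations of misclassified images into language by
    # retrieving the nearest captions in a shared vision-language space.
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    sims = normalize(error_embeddings) @ normalize(caption_embeddings).T
    topk = np.argsort(-sims, axis=1)[:, :k]  # top-k captions per error image
    return [[captions[j] for j in row] for row in topk]

def pseudo_label(captions_batch, bias_keyword):
    # Pseudo-label each sample by whether a hypothesized bias attribute
    # appears in its retrieved text (stand-in for the LLM-verified test).
    return [int(any(bias_keyword in c for c in caps)) for caps in captions_batch]
```

The pseudo-labels can then drive any group-aware mitigation method, with no manual attribute annotation.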

Code, Data, and Models


GitHub
Code, Data, and Models

Research questions addressed by LADDER


RQ1. How does LADDER perform in discovering error slices compared to baselines?
RQ2. How does LADDER leverage reasoning and latent domain knowledge of LLMs for slice discovery?
RQ3. How does LADDER discover biased attributes with different architectures and pre-training methods?
RQ4. How does LADDER mitigate biases using the discovered attributes?
RQ5. Can LADDER operate without captions?
RQ6. Can LADDER detect biases beyond captions/reports?

Experimental setup


Model Architectures & Pretraining Methods
  • ResNet-50 on ImageNet-1K using supervised pretraining (resnet_sup_in1k)
  • ResNet-50 on ImageNet-21K using supervised pretraining (resnet_sup_in21k)
  • ResNet-50 on ImageNet-1K using SimCLR (resnet_simclr_in1k)
  • ResNet-50 on ImageNet-1K using Barlow Twins (resnet_barlow_in1k)
  • ResNet-50 on ImageNet-1K using DINO (resnet_dino_in1k)
  • EfficientNet-B5 on ImageNet-1K using supervised pretraining (EF-B5) - For 2D mammograms
  • ViT-B on ImageNet-1K using supervised pretraining (vit_sup_in1k)
  • ViT-B on ImageNet-21K using supervised pretraining (vit_sup_in21k)
  • ViT-B from OpenAI CLIP (vit_clip_oai)
  • ViT-B pretrained using CLIP on LAION-2B (vit_clip_laion)
  • ViT-B on SWAG using weakly supervised pretraining (vit_sup_swag)
  • ViT-B on ImageNet-1K using DINO (vit_dino_in1k)
Slice discovery Baselines
Mitigation Baselines & Datasets
Available Algorithms (~20)
  • Empirical Risk Minimization (ERM)
  • Invariant Risk Minimization (IRM)
  • Group Distributionally Robust Optimization (GroupDRO)
  • Conditional Value-at-Risk DRO (CVaRDRO)
  • Mixup
  • Just Train Twice (JTT)
  • Learning from Failure (LfF)
  • Learning Invariant Predictors with Selective Augmentation (LISA)
  • Deep Feature Reweighting (DFR)
  • Maximum Mean Discrepancy (MMD)
  • Class-Balanced Loss (CBLoss)
  • Label-Distribution-Aware Margin Loss (LDAM)
  • Classifier Re-Training (CRT)
  • Reweight-Classifier Re-Training (ReWeightCRT)
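Many of the reweighting-style baselines above (and mitigation driven by LADDER's pseudo-labels) share a simple core: upweight minority, bias-conflicting groups in the loss. A minimal sketch, assuming integer group labels from pseudo-labeling (the function name and normalization scheme are illustrative, not the repository's API):

```python
import numpy as np

def group_weights(pseudo_groups):
    # Inverse-frequency weight per pseudo-labeled bias group, normalized so
    # the weights average to 1. Minority (bias-conflicting) samples get
    # larger weights and thus contribute more to the training loss.
    groups, counts = np.unique(pseudo_groups, return_counts=True)
    inv = {g: len(pseudo_groups) / (len(groups) * c) for g, c in zip(groups, counts)}
    return np.array([inv[g] for g in pseudo_groups])
```

With annotated group labels this reduces to standard reweighting; the point of LADDER is that the groups come from LLM-generated pseudo-labels instead.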
Available Datasets (13)

Vision-Language Representation space & LLMs


Vision-Language Models
Captioning & Hypothesis LLMs
  • Captions: BLIP and GPT-4o
  • LLMs for Hypothesis Generation: GPT-4o, Claude, Gemini, LLaMA
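The hypothesis-generation step prompts one of these LLMs with captions contrasting the error slice against correctly classified images. A rough template builder, purely illustrative (the exact prompts used by LADDER are in the repository):

```python
def build_hypothesis_prompt(class_name, error_captions, correct_captions):
    # Contrast captions of misclassified vs. correctly classified images
    # and ask the LLM for testable bias hypotheses.
    lines = [
        f"A classifier for '{class_name}' makes systematic mistakes.",
        "Captions of MISCLASSIFIED images:",
        *[f"- {c}" for c in error_captions],
        "Captions of CORRECTLY classified images:",
        *[f"- {c}" for c in correct_captions],
        "List visual attributes that appear mostly in the misclassified set",
        "and could be spurious correlations, phrased as testable hypotheses.",
    ]
    return "\n".join(lines)
```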

Results


RQ1: Performance of LADDER compared with other slice discovery algorithms
RQ1 Plot
RQ2: Leveraging LLM’s reasoning and domain knowledge for bias discovery
RQ2 Plot
RQ3: Biased attributes discovery across architectures/pre-training methods
RQ3 Plot RQ3 Plot
RQ4: Mitigating biases using LADDER
RQ4 Plot
RQ5: Relaxing the dependency on captions
RQ5 Plot 1
Fig. Biased attributes detected by LADDER w/ captions and w/ instruction-tuned models (w/o captions). Bright/light colors show presence/absence of attributes.
RQ5 Plot 2
Fig. (a) Precision@10 for slice discovery and (b) WGA for bias mitigation using LADDER w/ captions vs. instruction-tuned models.
RQ6: Detecting biases beyond captions/reports
RQ6 Plot
Fig. LADDER detects biases beyond reports, identifying biases from metadata (age, view and implant) and DICOM headers (Photometric interpretation).

Ablations


Ablation 1: Impact of captioners on LADDER
Ablation 1 Plot
Ablation 2: Impact of the choice of LLM on detecting biases using LADDER
Ablation 2 Plot
Ablation 3: Impact of the choice of LLM on the mitigation strategy of LADDER
Ablation 3 Plot
Ablation 4: Impact of the choice of vision-language models on LADDER for the CXR domain
Ablation 4 Plot
Ablation 5: Cost of using various LLMs
Ablation 5 Plot

Extension to tabular data


Check out my paper, where LADDER's mitigation approach was applied to tabular data using self-supervised learning. This work was completed during my internship with the Amazon AWS SAAR team in NYC in the summer of 2024.

Citation


@article{ghosh2024ladder,
  title={LADDER: Language Driven Slice Discovery and Error Rectification},
  author={Ghosh, Shantanu and Syed, Rayan and Wang, Chenyu and Poynton, Clare B and Visweswaran, Shyam and Batmanghelich, Kayhan},
  journal={arXiv preprint arXiv:2408.07832},
  year={2024}
}