Dividing and Conquering a BlackBox to a Mixture of Interpretable Models: Route, Interpret, Repeat

1Boston University, 2University of Pittsburgh, 3Meta AI

Fortieth International Conference on Machine Learning (ICML 2023)

Also, in the 2nd Workshop on Spurious Correlations, Invariance and Stability (SCIS), ICML 2023

Update! See our MICCAI 2023 paper on applying the interpretable models to efficient transfer learning.


We aim to extract multiple interpretable models from a BlackBox, each specializing in a different subset of the data, to provide instance-specific explanations using human-understandable concepts. In this work, we restrict ourselves to first-order logic (FOL) based explanations.


Problem Statement. We aim to explain the predictions of a deep neural network post hoc using high-level, human-interpretable concepts. In this work, we blur the distinction between post hoc explanation and the design of interpretable models.

Why post hoc, not interpretable by design? Most early interpretable-by-design methods focus on tabular data. They also tend to be less flexible than Blackbox models, demand substantial expertise to design, and mostly underperform their Blackbox counterparts. Post hoc methods preserve the flexibility and performance of the Blackbox.

Why a concept-based model, not saliency maps? Post hoc saliency maps identify the input features that contribute most to the network’s output. They suffer from a lack of fidelity and do not give a mechanistic explanation of the network output. Without a mechanistic explanation, recourse for a model’s undesirable behavior is unclear. Concept-based models can identify the important concepts responsible for the model's output, and we can intervene on these concepts to rectify the model's prediction.

What is a concept-based model? Concept-based models, or more technically Concept Bottleneck Models, are a family of models in which human-understandable concepts are first predicted from the given input (images) and the class labels are then predicted from the concepts. In this work, we assume the ground-truth concepts are either available in the dataset (CUB200 or Awa2) or discovered from another dataset (HAM10000, SIIM-ISIC or MIMIC-CXR). We predict the concepts from the pre-trained embedding of the Blackbox, as in Posthoc Concept Bottleneck Models.

What is a human-understandable concept? Human-understandable concepts are high-level features that constitute the class label. For example, stripes can be a human-understandable concept responsible for predicting zebra. In chest X-rays, an anatomical feature such as the lower left lobe of the lung can be another human-understandable concept. For more details, refer to the TCAV paper or Concept Bottleneck Models.

What is the research gap? Most interpretable models (interpretable by design or post hoc) fit a single interpretable model to the whole data. If a portion of the data does not fit the template design of that model, it offers no flexibility, compromising performance. Thus, a single interpretable model may be insufficient to explain all samples and may offer only generic explanations.

Our contribution. We propose an interpretable method that aims to achieve the best of both worlds: like post hoc explainability, it does not sacrifice Blackbox performance, yet it still provides actionable interpretation. We hypothesize that a Blackbox encodes several interpretable models, each applicable to a different portion of the data. We construct a hybrid neuro-symbolic model by progressively carving out a mixture of interpretable models and a residual network from the given Blackbox. We call each interpretable model an expert, as it specializes in a subset of the data, and we call the full set of interpretable models a Mixture of Interpretable Experts (MoIE). Our design identifies a subset of samples and routes them through the interpretable models, which explain the samples with first-order logic (FOL), providing basic reasoning on concepts from the Blackbox. The remaining samples are routed through a flexible residual network. We repeat the method on the residual network until MoIE explains the desired proportion of the data. Using FOL for the interpretable models offers recourse when undesirable behavior is detected in the model. Our method is a divide-and-conquer approach in which the instances covered by the residual network need progressively more complicated interpretable models. This insight can be used to inspect the data and the model further. Finally, our model allows for an unexplainable category of data, which current interpretable models do not.

What is FOL? An FOL explanation is a logical function that accepts predicates (concept present/absent) as input and returns a True/False output; it is a logical expression of the predicates. The logical expression, built from AND, OR, NOT, and parentheses, can be written in Disjunctive Normal Form (DNF). DNF is a logical formula composed of a disjunction (OR) of conjunctions (AND), known as a sum of products.
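
To make the DNF structure concrete, here is a minimal Python sketch of how such a formula could be evaluated on the concepts of one sample. The representation (a list of clauses, each a list of literals), the function name `evaluate_dnf`, and the zebra rule are all illustrative assumptions, not taken from the paper or its code.

```python
from typing import Dict, List, Tuple

# A literal is (concept_name, expected_value); expected_value=False encodes NOT concept.
Literal = Tuple[str, bool]
# A DNF formula is an OR over clauses; each clause is an AND over its literals.
DNF = List[List[Literal]]


def evaluate_dnf(formula: DNF, concepts: Dict[str, bool]) -> bool:
    """Return True if at least one clause has all of its literals satisfied."""
    return any(
        all(concepts[name] == expected for name, expected in clause)
        for clause in formula
    )


# Hypothetical rule: (stripes AND four_legs) OR (stripes AND NOT spots)
zebra_rule: DNF = [
    [("stripes", True), ("four_legs", True)],
    [("stripes", True), ("spots", False)],
]

sample_a = {"stripes": True, "four_legs": False, "spots": False}
sample_b = {"stripes": False, "four_legs": True, "spots": False}

result_a = evaluate_dnf(zebra_rule, sample_a)  # second clause is satisfied
result_b = evaluate_dnf(zebra_rule, sample_b)  # no clause is satisfied
```

Each clause reads as one "product" and the outer `any` as the "sum", which is why DNF is called a sum of products.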


Assume we have a dataset {X, Y, C}, where X, Y, and C are the input images, class labels, and human-interpretable attributes, respectively. Assume f0 = h0(Φ(.)) is the trained Blackbox, where Φ is the representation and h0 is the classifier. We denote by t the learnable function that projects the image embeddings to the concept space, the space spanned by the attributes C. Thus, for each input image, t outputs a scalar value representing each concept.
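
As a rough illustration of the projection t, the sketch below maps one embedding to one score per concept. It assumes t is a linear map followed by a sigmoid; the text above only says that t is learnable, so this form, the function name `project_to_concepts`, and all numbers are assumptions for illustration.

```python
import math
from typing import List, Sequence


def project_to_concepts(
    embedding: Sequence[float],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
) -> List[float]:
    """Hypothetical t: one sigmoid score in (0, 1) per concept.

    weights has one row per concept; each row has the embedding's length.
    """
    scores = []
    for w_row, b in zip(weights, biases):
        logit = sum(w * x for w, x in zip(w_row, embedding)) + b
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return scores


# Toy example: a 2-dimensional embedding projected onto 2 concepts.
phi_x = [1.0, -1.0]                 # stands in for the embedding of one image
W = [[2.0, 0.0], [0.0, 2.0]]        # made-up weights, one row per concept
b = [0.0, 0.0]
concept_scores = project_to_concepts(phi_x, W, b)
```

Thresholding these scores (for example at 0.5) would give the present/absent predicates that a logic-based explanation consumes.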

We iteratively carve out an interpretable model from the given Blackbox. Each iteration yields an interpretable model (the downward grey paths in Figure 1) and a residual (the straight black paths in Figure 1). We start with the initial Blackbox f0. At iteration k, we distill the Blackbox from the previous iteration, fk−1, into a neuro-symbolic interpretable model gk that predicts the class labels Y from the concepts C. The residual rk = fk−1 − gk emphasizes the portion of fk−1 that gk cannot explain. We then approximate rk with fk = hk(Φ(.)); fk becomes the Blackbox for the subsequent iteration and is explained by that iteration's interpretable model. A learnable gating mechanism, denoted by Πk: C → {0, 1} (shown as the selector in Figure 1), routes an input sample towards either gk or rk. Each interpretable model learns to focus on a specific subset of the data, defined by its coverage. The thickness of the lines in Figure 1 represents the samples covered by the interpretable models (grey lines) and the residuals (black lines). With every iteration, the cumulative coverage of the interpretable models increases while that of the residual decreases. We name our method route, interpret and repeat.
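
The routing bookkeeping described above can be sketched as follows. This is a simplified illustration only: the selectors are fixed, hand-written functions rather than learned gates, no distillation or residual fitting takes place, and the names `route_interpret_repeat`, `selectors`, and `threshold` are hypothetical. It shows only how samples move from the residual to successive experts and how cumulative coverage grows.

```python
from typing import Callable, List, Sequence, Tuple

# A selector maps a concept vector to a score in [0, 1]; here it stands in for the gate.
Selector = Callable[[Sequence[float]], float]


def route_interpret_repeat(
    concept_vectors: Sequence[Sequence[float]],
    selectors: Sequence[Selector],
    threshold: float = 0.5,
) -> Tuple[List[int], List[float]]:
    """Assign each sample to the first expert whose selector accepts it.

    Returns (assignment, cumulative_coverage), where assignment[i] is the
    1-based iteration of the covering expert, or 0 if sample i stays in the
    final residual.
    """
    n = len(concept_vectors)
    if n == 0:
        return [], []
    assignment = [0] * n
    remaining = list(range(n))          # samples currently in the residual
    cumulative_coverage = []
    covered = 0
    for k, selector in enumerate(selectors, start=1):
        still_residual = []
        for i in remaining:
            if selector(concept_vectors[i]) >= threshold:
                assignment[i] = k       # routed to expert k
                covered += 1
            else:
                still_residual.append(i)  # passed on to the next iteration
        remaining = still_residual
        cumulative_coverage.append(covered / n)
    return assignment, cumulative_coverage


# Toy data: 4 samples with 2 concepts each, and 2 made-up selectors.
toy_concepts = [[1, 0], [0, 1], [0, 0], [1, 1]]
toy_selectors = [lambda c: c[0], lambda c: c[1]]
assignment, coverage = route_interpret_repeat(toy_concepts, toy_selectors)
```

In this toy run the third sample is accepted by neither selector, so it stays in the final residual, mirroring the samples the method leaves unexplained.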

We refer to the interpretable models of all the iterations, taken together after training, as a Mixture of Interpretable Experts (MoIE). Furthermore, we use E-LEN, i.e., a Logic Explainable Network implemented with an Entropy Layer as its first layer, as the interpretable symbolic model g to construct first-order logic (FOL) explanations of a given prediction.


We perform experiments on a variety of vision and medical imaging datasets to show that 1) MoIE captures a diverse set of concepts, 2) the performance of the residuals degrades over successive iterations as they cover harder instances, 3) MoIE does not compromise the performance of the Blackbox, 4) MoIE achieves superior performance during test-time interventions, and 5) MoIE can fix shortcuts, as shown on the Waterbirds dataset. We evaluate our method on the CUB200, Awa2, HAM10000, SIIM-ISIC (real-world transfer learning setting) and MIMIC-CXR (effusion classification) datasets.

Baselines. We compare our method to two families of concept-based baselines: 1) interpretable-by-design and 2) post hoc. The end-to-end CEMs and sequential CBMs serve as interpretable-by-design baselines, while PCBM and PCBM-h serve as post hoc baselines. The standard CBM and PCBM models do not show how the concepts are composed to make the label prediction. So, we create CBM + E-LEN, PCBM + E-LEN and PCBM-h + E-LEN by replacing the standard classifiers of CBM and PCBM with the same g used in MoIE.


Heterogeneity of Explanations

To view the FOL explanation for each sample, per expert and per dataset, go to the explanations directory in our official repo. The explanations are stored in separate CSV files, one for each expert and dataset.

MoIE identifies diverse concepts for specific subsets of a class, unlike the generic ones identified by the baselines. We construct the FOL explanations of the samples of Bay breasted warbler in the CUB-200 dataset for ViT-based experts in MoIE at inference. We highlight the unique concepts for experts 1, 2, and 3 in red, blue, and magenta, respectively.

Construction of logical explanations of the samples of Effusion in the MIMIC-CXR dataset for various experts in MoIE at inference. The final residual covers the unexplained sample, which is harder to explain (indicated in red).

Construction of logical explanations of the samples of a category, Harris Sparrow, in the CUB-200 dataset for (a) ViT-based sequential CBM + E-LEN as an interpretable-by-design baseline, (b) ViT-based PCBM + E-LEN as a post hoc baseline, (c) various experts in MoIE at inference.

Construction of logical explanations of the samples of a category, Anna hummingbird, in the CUB-200 dataset for (a) ViT-based sequential CBM + E-LEN as an interpretable-by-design baseline, (b) ViT-based PCBM + E-LEN as a post hoc baseline, (c) various experts in MoIE at inference.

Comparison of FOL explanations by MoIE with the PCBM + E-LEN baselines for HAM10000 (top) and ISIC (bottom) to classify Malignant lesion. We highlight unique concepts for experts 3, 5, and 6 in red, blue, and violet, respectively. For brevity, we combine the FOLs of each expert over the samples it covers.

Flexibility of FOL explanations by ViT-derived MoIE and the CBM + E-LEN and PCBM + E-LEN baselines for the Awa2 dataset to classify Otter at inference.

Flexibility of FOL explanations by ViT-derived MoIE and the CBM + E-LEN and PCBM + E-LEN baselines for the Awa2 dataset to classify Horse at inference.

MoIE identifies more meaningful instance-specific concepts

Quantitative validation of the extracted concepts using completeness scores of the models for a varying number of top concepts, and the drop in accuracy compared to the original model after iteratively zeroing out the top significant concepts. The highest drop for MoIE indicates that MoIE selects more instance-specific concepts than the generic ones selected by the baselines.
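
The zeroing-out evaluation can be sketched as below. This is a hedged illustration, not the paper's evaluation code: it assumes a single global ranking of concepts (the paper's selection is instance-specific), a toy classifier, and made-up data, and the names `accuracy` and `accuracy_drop_after_zeroing` are hypothetical.

```python
from typing import Callable, List, Sequence


def accuracy(
    predict: Callable[[Sequence[float]], int],
    concept_vectors: Sequence[Sequence[float]],
    labels: Sequence[int],
) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(predict(c) == y for c, y in zip(concept_vectors, labels))
    return correct / len(labels)


def accuracy_drop_after_zeroing(
    predict: Callable[[Sequence[float]], int],
    concept_vectors: Sequence[Sequence[float]],
    labels: Sequence[int],
    ranked_concepts: Sequence[int],
    max_k: int,
) -> List[float]:
    """For k = 1..max_k, zero out the top-k ranked concepts and report the accuracy drop."""
    base = accuracy(predict, concept_vectors, labels)
    drops = []
    for k in range(1, max_k + 1):
        masked_idx = set(ranked_concepts[:k])
        masked = [
            [0.0 if j in masked_idx else v for j, v in enumerate(c)]
            for c in concept_vectors
        ]
        drops.append(base - accuracy(predict, masked, labels))
    return drops


# Toy setup: the made-up classifier predicts 1 if concept 0 or concept 1 is active.
toy_predict = lambda c: int(c[0] > 0.5 or c[1] > 0.5)
toy_concepts = [[1, 0, 1], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
toy_labels = [1, 1, 0, 1]
drops = accuracy_drop_after_zeroing(
    toy_predict, toy_concepts, toy_labels, ranked_concepts=[0, 1, 2], max_k=3
)
```

A larger drop for small k suggests the removed concepts were the ones the model actually relied on for those samples.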

Identification of Harder samples by successive residuals

The performance of experts and residuals across iterations. (a-c) Coverage and proportional accuracy of the experts and residuals. (d-f) We route the samples covered by the residuals across iterations to the initial Blackbox f0 and compare the accuracy of f0 (red bar) with that of the residual (blue bar). Panels d-f show the progressive decline in the residuals' performance across iterations as they cover samples in increasing order of hardness. We observe similarly poor performance from the initial Blackbox f0 on these samples.

Quantitative analysis of MoIE with the Blackbox and baselines

MoIE does not hurt the performance of the original Blackbox on a held-out test set. We report the mean and standard error of AUROC for the medical imaging datasets (HAM10000, ISIC, and Effusion) and of accuracy for the vision datasets (CUB-200 and Awa2), over 5 random seeds.

Test time interventions

Test-time interventions on concepts across architectures, on all the samples and on the hard samples covered by only the last two experts of MoIE.

Applying MoIE to remove shortcuts

MoIE fixes shortcuts. (a) Performance of the biased Blackbox. (b) Performance of final MoIE extracted from the robust Blackbox after removing the shortcuts using Metadata normalization (MDN). (c) Examples of samples (top-row) and their explanations by the biased (middle-row) and robust Blackboxes (bottom-row). (d) Comparison of accuracies of the spurious concepts extracted from the biased vs. the robust Blackbox.


We would like to thank Mert Yuksekgonul of Stanford University for providing the code to construct the concept bank of Derm7pt for conducting the skin experiments. Also, he provided the code for PCBM and PCBM-h when it was not publicly available. This work was partially supported by NIH Award Number 1R01HL141813-01 and the Pennsylvania Department of Health. We are grateful for the computational resources provided by Pittsburgh Super Computing grant number TGASC170024.


  title = 	 {Dividing and Conquering a {B}lack{B}ox to a Mixture of Interpretable Models: Route, Interpret, Repeat},
  author =       {Ghosh, Shantanu and Yu, Ke and Arabshahi, Forough and Batmanghelich, Kayhan},
  booktitle = 	 {Proceedings of the 40th International Conference on Machine Learning},
  pages = 	 {11360--11397},
  year = 	 {2023},
  editor = 	 {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume = 	 {202},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {23--29 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v202/ghosh23c/ghosh23c.pdf},
  url = 	 {https://proceedings.mlr.press/v202/ghosh23c.html},
  abstract = 	 {ML model design either starts with an interpretable model or a Blackbox and explains it post hoc. Blackbox models are flexible but difficult to explain, while interpretable models are inherently explainable. Yet, interpretable models require extensive ML knowledge and tend to be less flexible, potentially underperforming than their Blackbox equivalents. This paper aims to blur the distinction between a post hoc explanation of a Blackbox and constructing interpretable models. Beginning with a Blackbox, we iteratively carve out a mixture of interpretable models and a residual network. The interpretable models identify a subset of samples and explain them using First Order Logic (FOL), providing basic reasoning on concepts from the Blackbox. We route the remaining samples through a flexible residual. We repeat the method on the residual network until all the interpretable models explain the desired proportion of data. Our extensive experiments show that our route, interpret, and repeat approach (1) identifies a richer diverse set of instance-specific concepts with high concept completeness via interpretable models by specializing in various subsets of data without compromising in performance, (2) identifies the relatively “harder” samples to explain via residuals, (3) outperforms the interpretable by-design models by significant margins during test-time interventions, (4) can be used to fix the shortcut learned by the original Blackbox.}
    title={Tackling Shortcut Learning in Deep Neural Networks: An Iterative Approach with Interpretable Models},
    author={Ghosh, Shantanu and Yu, Ke and Arabshahi, Forough and Batmanghelich, Kayhan},
    booktitle={ICML 2023: Workshop on Spurious Correlations, Invariance and Stability},
    title={Bridging the Gap: From Post Hoc Explanations to Inherently Interpretable Models for Medical Imaging},
    author={Ghosh, Shantanu and Yu, Ke and Arabshahi, Forough and Batmanghelich, Kayhan},
    booktitle={ICML 2023: Workshop on Interpretable Machine Learning in Healthcare},