
Class activation mapping

From Wikipedia, the free encyclopedia

Class activation mapping methods are explainable AI (XAI) techniques used to visualize the regions of an input image that are the most relevant for a particular task, especially image classification, in convolutional neural networks (CNNs). These methods generate heatmaps by weighting the feature maps from a convolutional layer according to their relevance to the target class[1].

In the field of artificial intelligence (AI), broadly defined as "the effort to automate intellectual tasks normally performed by humans"[2], machine learning (ML) and deep learning (DL) emerged. Both use statistical and computational methods to learn patterns from data, reducing the need for manually coded rules[2].
ML models are trained on input data together with the corresponding known answers, learning the underlying patterns or structures present in the data. Traditional ML algorithms rely on manually designed feature sets, creating a direct link between ML designers and the features a model uses[3].
DL is a subfield of ML based on the concept of successive layers of representation, in which the data is progressively transformed in different ways to extract relevant and informative patterns. DL algorithms are feature learning algorithms: they automatically learn hierarchical feature representations from raw data, extracting increasingly abstract features through multiple layers[2][3].
CNNs are a specific DL architecture designed to process spatially structured data, such as images, exploiting a series of convolution, non-linear activation and pooling operations to extract relevant features, contained in so-called feature maps, from the input data[3]. CNNs have proven highly effective in a variety of computer vision (CV) and image processing tasks[1].
CNNs (and deep learning models more broadly) are described as black boxes[4] due to their complex and non-transparent internal layers of representation. The need for clearer insight into their internal workings and decision-making process gave rise to XAI techniques[1].
Among the XAI techniques proposed for CV tasks, class activation mapping methods can show which pixels in an input image are important to the predicted logit for a class of interest in a classification task[1].
Class activation mapping methods were originally developed for class-discriminative scenarios, to visualize which parts of the input image influenced the classification decision, that is, to visually highlight the regions of the feature maps that contribute most strongly to the prediction of a given class. More advanced versions of these methods are not limited to image classification, but have been extended to several other vision-related tasks, such as object detection[5], image captioning, visual question answering (VQA) and image segmentation[1].

Background


The following methods laid the groundwork for class activation mapping approaches, forming the conceptual basis of using gradients to highlight class-discriminative regions.

Class model visualization and saliency maps for convolutional neural networks


The class model visualization and image-specific saliency map approaches were presented in the foundational work "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps" by Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman[6], which generalizes the deconvnet method by Zeiler and Fergus[7].

  • Class model visualization synthesizes an artificial input image that strongly activates the output neurons associated with a target class. Given a trained, fixed model, this method starts from a zero-initialized image, backpropagates the gradient of the class score to the image pixels, updates the pixels so as to increase that class score, and repeats the update process, eventually showing an encoded, idealized prototype of the class of interest[6].
An image-specific saliency map: a visual explanation highlighting the most relevant pixels in an image portraying Leonardo DiCaprio.
  • Image-specific class saliency visualization provides a visual explanation by highlighting the pixels of an image that are most relevant for predicting a given class c of interest. This is done by computing the gradient of the class score $S_c$ with respect to the input image $I$,

    approximating the model locally (around a given image $I_0$) as linear, using a first-order Taylor expansion:
    $S_c(I) \approx w^\top I + b$, with $w = \left.\dfrac{\partial S_c}{\partial I}\right|_{I_0}$.
    The magnitude of the gradient $w$ indicates the importance of the pixels: larger gradients suggest greater influence on the prediction. Once the gradient is known, the saliency map is defined as the maximum absolute gradient across the colour channels:

    $M_{ij} = \max_{c} \left| w_{h(i,j,c)} \right|$

    where $h(i,j,c)$ is the index of the element of $w$ corresponding to pixel $(i,j)$ and colour channel $c$, resulting in an $m \times n$ saliency map (i.e. heatmap)[6][8]. A minimal code sketch of this procedure follows the list.
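
The gradient-based saliency computation above can be sketched in a few lines. The following is a minimal, illustrative example assuming PyTorch; model, image and target_class are placeholder names for any differentiable classifier, a preprocessed input tensor of shape (1, 3, H, W), and the index of the class of interest.

```python
# Minimal sketch (assumed PyTorch): image-specific saliency map via input gradients.
import torch

def saliency_map(model, image, target_class):
    model.eval()
    image = image.clone().requires_grad_(True)   # track gradients w.r.t. the input pixels
    scores = model(image)                        # (1, num_classes) class scores / logits
    scores[0, target_class].backward()           # d(class score) / d(image)
    # Saliency: maximum absolute gradient across the colour channels, one value per pixel.
    return image.grad.abs().max(dim=1)[0].squeeze(0)   # (H, W) heatmap
```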

Guided backpropagation


The concept of guided backpropagation was first introduced in the paper by Springenberg et al., "Striving for Simplicity: The All Convolutional Net"[9]; this method, too, builds upon the work by Zeiler and Fergus, "Visualizing and Understanding Convolutional Networks"[7].

Guided backpropagation visualization on an image of Leonardo DiCaprio.

The core idea of guided backpropagation is to understand what a CNN is learning by visualizing the patterns that most strongly activate individual neurons (or filters), in architectures that do not rely on max-pooling layers.

When propagating gradients back through a rectified linear unit (ReLU), guided backpropagation passes the gradient if and only if the input to the ReLU was positive (forward pass) and the incoming gradient is positive (backward signal), thereby masking out inactive neurons and negative gradients and suppressing noise. The result is sharper, higher-resolution visualizations of what each neuron is responding to[9].
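
The ReLU gradient rule described above can be implemented with backward hooks. The following is an illustrative sketch assuming PyTorch; model, image and target_class are placeholders, and it assumes the model's ReLU modules are not in-place (in-place ReLUs may need to be replaced before registering hooks).

```python
# Minimal sketch (assumed PyTorch): guided backpropagation via backward hooks on ReLU.
import torch
import torch.nn as nn

def guided_backprop(model, image, target_class):
    model.eval()
    handles = []

    def guided_relu_hook(module, grad_input, grad_output):
        # Standard ReLU backward already zeroes gradients where the forward input was
        # negative; clamping additionally discards negative incoming gradients.
        return (torch.clamp(grad_input[0], min=0.0),)

    for m in model.modules():
        if isinstance(m, nn.ReLU):
            handles.append(m.register_full_backward_hook(guided_relu_hook))

    image = image.clone().requires_grad_(True)
    model(image)[0, target_class].backward()
    for h in handles:
        h.remove()
    return image.grad.squeeze(0)   # (3, H, W) guided saliency signal
```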

Guided backpropagation is a simple and practical method for model interpretability, helping to understand how and where neural networks detect semantic concepts across layers. Moreover, owing to its working principle, it can be applied to any network architecture[10].

Base versions

Key architectural network differences between CAM and Grad-CAM techniques, with visual example.

Class activation mapping and gradient-weighted class activation mapping are the original and most widely used methods for visual explanations in convolutional neural networks. These methods serve as the foundation for many later developments in explainable AI[11].

Notation: In this article, the symbols i and j represent integer indices that disappear inside sums or averages, while x and y are the continuous (or up-sampled integer) coordinates of the final heat-map that is plotted.

Class activation mapping (CAM)

Leonardo DiCaprio's suit CAM visual localization. Using a modified ResNet-50 backbone, the CAM-based localization model has been tasked with identifying Leonardo DiCaprio's suit, assigning that class a confidence score of 59.53%.

Class activation mapping (CAM) was the original version of CAM methods, and it gave its name to the whole category. The approach was first introduced by Zhou et al. in their seminal work "Learning Deep Features for Discriminative Localization"[12].
This approach achieves class-specific heatmaps by modifying image classification CNN architectures, replacing fully-connected layers with convolutional layers and a final global average pooling layer.
Its main purpose is to localize and highlight the discriminative regions of an input image that a CNN uses to identify a particular class, without needing explicit bounding box annotations[11][12][13][14].

Global average pooling (GAP)


Global average pooling (GAP) is the key element of the original CAM approach.
It is a dimensionality reduction technique and, like other pooling layers, it downsamples the feature maps, computing a representative value for a specific region of the feature map. The particularity of GAP is that it computes a single value for an entire feature map, significantly reducing the model dimensions[11][12][13][14].
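
As an illustration of this dimensionality reduction, the following minimal snippet (assuming PyTorch; the tensor shape is an arbitrary example) shows how GAP collapses each of the k feature maps to a single scalar.

```python
# Minimal sketch (assumed PyTorch): global average pooling over the spatial dimensions.
import torch

feature_maps = torch.randn(1, 512, 7, 7)    # (batch, k, m, n), e.g. a last conv layer output
gap_values = feature_maps.mean(dim=(2, 3))  # (batch, k): one average value per feature map
print(gap_values.shape)                     # torch.Size([1, 512])
```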

Mathematical description


The key element of the mathematical description is the combination of the convolutional and GAP layers.
In CAM, the GAP layer must be placed after the last convolutional layer and before the final linear classifier layer. This last element of the architecture connects the output logits (the network predictions) $y^c$ to the GAP values $F_k$ through its fine-tuned weights $w_k^c$.

Considering $A^k$ as the feature maps of the last convolutional layer, GAP produces one value for each feature map by averaging all the matrix elements (i, j) of the feature map:

$F_k = \dfrac{1}{Z} \sum_{i} \sum_{j} A^k(i, j)$,  with  $Z = m \cdot n$

Namely, in the GAP layer each feature map is reduced to a single scalar, producing k values and hence reducing the dimensionality of the network. A $k \times m \times n$ tensor is reduced to k scalars, shrinking the parameter count of the linear classifier head.

The final output logits are calculated as the linear sum of the GAP values, weights and bias:

$y^c = \sum_{k} w_k^c F_k + b^c$

The localization map is computed as follows:

$L^c_{\text{CAM}}(x, y) = \sum_{k} w_k^c A^k(x, y)$

namely, $A^k(x, y)$ is the activation of node k in the target layer of the model at position (x, y), and $w_k^c$ is the class-specific weight, for channel k, in the linear classifier layer[11][12][13][14].
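
The formulas above translate directly into code. The following is a minimal, illustrative sketch assuming PyTorch; feature_maps (the activations A^k of the last convolutional layer for one image, shape (k, m, n)), classifier_weights (the final linear layer's weight matrix, shape (num_classes, k)) and output_size are placeholder names.

```python
# Minimal sketch (assumed PyTorch): CAM heatmap from GAP-classifier weights and feature maps.
import torch
import torch.nn.functional as F

def cam(feature_maps, classifier_weights, target_class, output_size):
    w = classifier_weights[target_class]                  # (k,) class-specific weights w_k^c
    heatmap = torch.einsum("k,kmn->mn", w, feature_maps)  # sum_k w_k^c * A^k(x, y)
    heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)  # for display
    # Upsample the coarse (m, n) map to the input image resolution.
    return F.interpolate(heatmap[None, None], size=output_size,
                         mode="bilinear", align_corners=False)[0, 0]
```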

Advantages and drawbacks


The use of the GAP layer represents an example of an interpretability by design (IBD) approach. IBD refers to a technique which uses the model's own architecture to help explain its predictions.

The main drawback of CAM is that it is highly model-specific, being applicable only to CNN architectures in which the layer preceding the softmax is a GAP.

Since the approach relies on the post-GAP weights for the overall evaluation, the method cannot be applied to intermediate layers.

The choice of an IBD approach restricts the possibility of generalizing the model architecture. Moreover, IBD methods often require re-training of the model[11][12][13][14].

Gradient-weighted class activation mapping (Grad-CAM)


Gradient-weighted class activation mapping (Grad-CAM) is a generalized version of CAM and it tackles its architectural limitations. Grad-CAM computes the gradient of a target class score, the pre-softmax logit, with respect to the feature maps of a convolutional neural network. The gradients are global-average-pooled to obtain importance weights, which are used to compute a class-specific localization map by linearly weighting the feature maps. The result is a heatmap that highlights the regions in the input image that are the most influential for predicting the target class.

The main advantage of Grad-CAM over standard CAM is that it is model agnostic (provided the network is differentiable): it generates visual explanations for any CNN-based network without architectural changes or re-training, making it broadly applicable to pre-trained models[11][13][15][16].

Mathematical description


Considering:

  • $y^c$, the logit (i.e. the pre-softmax activation of the output neuron responsible for the prediction of class c) of interest;
  • $A^k$, the k-th feature map of a specific convolutional layer;
  • $L^c_{\text{Grad-CAM}}$, the class-discriminative localization map, of width u and height v, for any class c;

Grad-CAM, employing backpropagation, computes the gradient of the logit with respect to the feature maps A,

$\dfrac{\partial y^c}{\partial A^k}$

highlighting how important each feature map is for the class-discrimination decision encoded in the logit.
These gradients are global-average-pooled over each element of the feature map (hence quantifying the "importance" of feature map k for a target class c):

$\alpha_k^c = \dfrac{1}{Z} \sum_{i} \sum_{j} \dfrac{\partial y^c}{\partial A^k(i, j)}$

Then, to account for all the feature maps, each of them is multiplied by its weight and the weighted maps are summed element-wise:

$\sum_{k} \alpha_k^c A^k$

Due to the intrinsic nature of the gradient operation, some elements of the weighted sum will have negative values; since only the elements that have increased the logit of the predicted class are of interest, a ReLU activation function is applied:

$L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left(\sum_{k} \alpha_k^c A^k\right)$

Lastly, the output heatmap is upsampled to the original image size to match the input dimensions[11][13][15][16].
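
The procedure can be sketched as follows, assuming PyTorch; model, target_layer (e.g. the last convolutional block of a pretrained network), image and target_class are placeholder names, and the hooks used here are one possible way of capturing the activations and gradients.

```python
# Minimal sketch (assumed PyTorch): Grad-CAM via forward/backward hooks on a conv layer.
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, target_class):
    model.eval()
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))            # A^k
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))  # dy^c/dA^k
    model(image)[0, target_class].backward()
    h1.remove()
    h2.remove()
    alpha = grads["g"].mean(dim=(2, 3), keepdim=True)        # GAP of the gradients -> alpha_k^c
    cam = torch.relu((alpha * acts["a"]).sum(dim=1))          # ReLU(sum_k alpha_k^c A^k), (1, m, n)
    cam = F.interpolate(cam[None], size=image.shape[-2:],     # upsample to the input size
                        mode="bilinear", align_corners=False)[0, 0]
    return cam / (cam.max() + 1e-8)                           # rescale for visualization
```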

Advantages and drawbacks


Grad-CAM addresses the most important CAM limitations. It removes the need for a GAP layer, generalizing the approach and enabling visual explanations at intermediate layers.

However, Grad-CAM focuses on the most discriminative region contributing to the classification. If multiple similar objects are present, Grad-CAM often highlights only one of them, or part of one, and it also produces coarser maps with lower localization accuracy.

Moreover, Grad-CAM retrieves information backwards (the gradients), without taking into consideration how the activations flowed forward during prediction (unless combined with the guided backpropagation technique), so it may miss patterns highlighted in the forward signal. On top of that, Grad-CAM heatmaps are low-resolution when a very deep layer is chosen.

Lastly, the heatmap may contain false emphasis when large gradients are computed for low activation values: Grad-CAM assumes that gradient implies importance, ignoring the value of the activations themselves[11][13][15][16].

Grad-CAM and CAM comparison

CAM and Grad-CAM differences[11]
Feature | CAM | Grad-CAM
Architecture | Requires a CNN whose last convolutional layer is followed by GAP | Works with any CNN architecture that can be backpropagated
Flexibility | Only works with GAP-based networks | Generalized, gradient-based version; works with any pre-trained CNN at any layer
Working principle | Applies the weights of the class-specific linear classifier learned after GAP to the feature maps | Computes importance by calculating the gradients of the class score with respect to the feature maps
Mathematical description of the working principle | $L^c_{\text{CAM}}(x, y) = \sum_{k} w_k^c A^k(x, y)$ | $L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left(\sum_{k} \alpha_k^c A^k\right)$

Fine-tuned versions


Several methods have refined Grad-CAM to improve clarity and flexibility. Guided Grad-CAM, Grad-CAM++, Score-CAM, and Layer-CAM enhance aspects such as localization accuracy, gradient independence, and multi-layer visualization. These techniques build directly on the principles of CAM and Grad-CAM.

Guided Grad-CAM


Guided Grad-CAM fuses the coarse, class-discriminative localization of Grad-CAM with the high-resolution details of guided backpropagation. The Grad-CAM heatmap is first computed for the target class and upsampled to the input size. Then, a guided backpropagation saliency map for the same class is computed. The final element-wise product of the two gives the Guided Grad-CAM visualization map.

The result is a high-resolution, class-specific saliency map that highlights exactly which pixels contribute most to the network’s decision[15].
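
Reusing the hypothetical guided_backprop and grad_cam helpers sketched in the previous sections (placeholder names, not an established API), the fusion reduces to a single element-wise product:

```python
# Minimal sketch: Guided Grad-CAM = guided backpropagation map x (upsampled) Grad-CAM map.
gb_map = guided_backprop(model, image, target_class)            # (3, H, W) high-resolution detail
cam_map = grad_cam(model, target_layer, image, target_class)    # (H, W) coarse class localization
guided_grad_cam = gb_map * cam_map                              # broadcast over the colour channels
```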

Grad-CAM++


Grad-CAM++ introduces a more refined way of computing the weights of each feature map, replacing the plain global average of the gradients used by Grad-CAM. This approach aims to improve the visual result when multiple instances of the target are present in a single image.

Specifically, Grad-CAM++ employs pixel-wise gradients (via higher-order derivatives) to compute the importance of a specific pixel for a prediction, lighting up multiple object instances in the same image.
These improvements allow for a more sensitive and detailed output heatmap.

The associated mathematical framework is defined by the following localization map:

$L^c_{\text{Grad-CAM++}}(x, y) = \mathrm{ReLU}\!\left(\sum_{k} w_k^c A^k(x, y)\right)$

in which the weight $w_k^c$ is defined as:

$w_k^c = \sum_{x} \sum_{y} \alpha_{xy}^{kc} \cdot \mathrm{ReLU}\!\left(\dfrac{\partial Y^c}{\partial A^k(x, y)}\right)$

with $A^k(x, y)$ the activation of node k in the target layer of the model at position (x, y), $Y^c$ the logit score for class c, and the pixel-wise coefficient $\alpha_{xy}^{kc}$ being:

$\alpha_{xy}^{kc} = \dfrac{\dfrac{\partial^2 Y^c}{\left(\partial A^k(x, y)\right)^2}}{2\,\dfrac{\partial^2 Y^c}{\left(\partial A^k(x, y)\right)^2} + \sum_{a} \sum_{b} A^k(a, b)\,\dfrac{\partial^3 Y^c}{\left(\partial A^k(x, y)\right)^3}}$

While addressing some Grad-CAM problems, the Grad-CAM++ method still relies on gradients, and it only improves the underlying mathematics. It is, however, still based on the idea of assigning a direct and valid relationship between gradient and importance[13][17].

Notation: (a,b) indexes all pixel positions in the feature‐map, exactly like (i,j) does, but for the summation in the denominator.
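
As an illustration, a common way to compute the Grad-CAM++ weights in practice (assuming PyTorch) uses the approximation, valid when the class score is passed through an exponential, that reduces the second- and third-order derivatives to powers of the first-order gradient; activations and gradients are assumed to be (1, k, m, n) tensors captured with hooks as in the Grad-CAM sketch above.

```python
# Minimal sketch (assumed PyTorch): Grad-CAM++ map with higher-order terms approximated
# by powers of the first-order gradient.
import torch

def grad_cam_pp_map(activations, gradients):
    g2, g3 = gradients ** 2, gradients ** 3
    # alpha_{xy}^{kc} = g^2 / (2 g^2 + sum_{a,b} A^k(a, b) * g^3), guarded against division by zero
    denom = 2 * g2 + (activations * g3).sum(dim=(2, 3), keepdim=True)
    alpha = g2 / torch.where(denom != 0, denom, torch.ones_like(denom))
    weights = (alpha * torch.relu(gradients)).sum(dim=(2, 3), keepdim=True)   # w_k^c
    return torch.relu((weights * activations).sum(dim=1))[0]                  # (m, n) localization map
```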

Score-CAM


Score-CAM is a gradient-free CAM technique, thus redefining the original Grad-CAM and Grad-CAM++ working principles. It uses the model confidence scores instead of gradients.

Score-CAM performs the following operations:

  1. Extracts the feature maps $A^k$ of the final convolutional layer, as in the original Grad-CAM;
  2. Upsamples each activation map to the input image dimensions and normalizes it, defining a mask $H^k$;
  3. Multiplies the original input image X by the mask, defining a masked image $X \circ H^k$ ($\circ$ is the element-wise multiplication);
  4. Obtains a confidence score for each masked image $X \circ H^k$ by feeding it into the CNN (either the softmax probability, i.e. the output value after the softmax operation, or the raw logit, i.e. the value before the softmax, can be used; both yield similar results in practice);
  5. Uses that confidence score as the weight $\alpha_k^c$ of the corresponding feature map $A^k$.

These operations replace the gradient calculations with the actual model outputs, building more accurate heatmaps. A minimal code sketch of the procedure is given after the formulas below.

Mathematically, the localization map is defined as:

$L^c_{\text{Score-CAM}} = \mathrm{ReLU}\!\left(\sum_{k} \alpha_k^c A^k\right)$

and the coefficient $\alpha_k^c$ as:

$\alpha_k^c = f^c\!\left(X \circ H^k\right)$

where $A^k(x, y)$ is the activation of channel k at location (x, y), $f^c(X)$ is the logit for class c for an input X, and $H^k$ is the mask, defined as:

$H^k = s\!\left(\mathrm{Up}\!\left(A^k\right)\right)$

with $\mathrm{Up}(\cdot)$ as the upsampling operation and $s(\cdot)$ as the normalization of the upsampled map.
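
The steps listed above can be sketched as follows, assuming PyTorch; model, image and activations (the (1, k, m, n) feature maps of the chosen layer) are placeholder names, and the raw logit is used here as the confidence score.

```python
# Minimal sketch (assumed PyTorch): Score-CAM with one masked forward pass per channel.
import torch
import torch.nn.functional as F

@torch.no_grad()
def score_cam(model, image, activations, target_class):
    model.eval()
    k = activations.shape[1]
    # Step 2: upsample every activation map to the input size and rescale it to [0, 1].
    masks = F.interpolate(activations, size=image.shape[-2:],
                          mode="bilinear", align_corners=False)
    lo = masks.amin(dim=(2, 3), keepdim=True)
    hi = masks.amax(dim=(2, 3), keepdim=True)
    masks = (masks - lo) / (hi - lo + 1e-8)
    # Steps 3-5: mask the image channel by channel and use the class score as the weight.
    scores = torch.stack([model(image * masks[:, c:c + 1])[0, target_class]
                          for c in range(k)])
    cam = torch.relu((scores.view(1, k, 1, 1) * activations).sum(dim=1))[0]   # (m, n)
    return cam / (cam.max() + 1e-8)
```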

Since the score calculation is repeated for every channel, Score-CAM is slow compared to gradient-based methods. Moreover, it focuses on the regions highlighted by individual feature maps, ignoring the context of the full image, which reduces interpretability in complex scenes[13][18].

LayerCAM


LayerCAM enhances backwards class-specific gradients using both intermediate and final convolutional layers. Combining information across layers makes it possible to achieve higher resolution and more fine-grained detail, improving localization.

Specifically, for each position (i, j) in a feature map, LayerCAM evaluates the gradient of the class score with respect to that activation. The positive gradients are employed as position-specific weights $w_{ij}^{kc}$. Namely:

$w_{ij}^{kc} = \mathrm{ReLU}\!\left(\dfrac{\partial y^c}{\partial A^k(i, j)}\right)$

The activations are then weighted element-wise, and the final class activation map is retrieved by summing over the channels:

$L^c_{\text{LayerCAM}}(i, j) = \mathrm{ReLU}\!\left(\sum_{k} w_{ij}^{kc} \cdot A^k(i, j)\right)$

This technique offers high-resolution heatmaps, flexible localization and per-location precision, employing positive gradients[13][19].
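
For illustration, given activations and gradients captured with hooks at any convolutional layer (as in the Grad-CAM sketch above; both are (1, k, m, n) tensors with placeholder names), the LayerCAM map reduces to a few lines, assuming PyTorch.

```python
# Minimal sketch (assumed PyTorch): LayerCAM from per-location positive gradients.
import torch

def layer_cam_map(activations, gradients):
    weights = torch.relu(gradients)                            # w_{ij}^{kc}: positive gradients only
    cam = torch.relu((weights * activations).sum(dim=1))[0]    # sum over channels k -> (m, n)
    return cam / (cam.max() + 1e-8)                            # rescale for visualization
```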

References

  1. ^ a b c d e Somani, Ayush; Horsch, Alexander; Prasad, Dilip K. (2023). Interpretability in Deep Learning. Springer International Publishing. pp. 1–18, 125–133, 297–302. doi:10.1007/978-3-031-20639-9. ISBN 978-3-031-20638-2.
  2. ^ a b c Chollet, François (2021). Deep Learning with Python, Second Edition. New York: Manning Publications Co. LLC. pp. 1–25. ISBN 978-1-61729-686-4.
  3. ^ a b c LeCun, Yann; Bengio, Yoshua; Hinton, Geoffrey (2015). "Deep learning". Nature. 521 (7553): 436–444. Bibcode:2015Natur.521..436L. doi:10.1038/nature14539. PMID 26017442.
  4. ^ Iqbal, Saeed; Qureshi, Adnan N.; Alhussein, Musaed; Aurangzeb, Khursheed; Anwar, Muhammad Shahid (2024). "AD-CAM: Enhancing Interpretability of Convolutional Neural Networks With a Lightweight Framework - From Black Box to Glass Box". IEEE Journal of Biomedical and Health Informatics. 28: 514–525. doi:10.1109/jbhi.2023.3329231. PMID 37910403.
  5. ^ Ishrak, Md Fatin; Nahar, Jannatun; Faiza, Fairooz Afnad; Siddiqua, Lamia Mahzabin (2024). "Enhancing Object Detection Interpretability for Class-Discriminative Visualizations with Grad-CAM". IEEE. 2024 27th International Conference on Computer and Information Technology (ICCIT): 1493–1498. doi:10.1109/iccit64611.2024.11022089. ISBN 979-8-3315-1909-4.
  6. ^ a b c Simonyan, Karen; Vedaldi, Andrea; Zisserman, Andrew (2014-04-19). "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps". ICLR 2014 Workshop Submission. arXiv:1312.6034.
  7. ^ a b Zeiler, Matthew D.; Fergus, Rob (2014). "Visualizing and Understanding Convolutional Networks". Computer Vision – ECCV 2014. Lecture Notes in Computer Science. Vol. 8689. pp. 818–833. doi:10.1007/978-3-319-10590-1_53. ISBN 978-3-319-10589-5.
  8. ^ Yang, Shengping; Berdine, Gilbert (2023). "Interpretable artificial intelligence (AI) – saliency maps". The Southwest Respiratory and Critical Care Chronicles. 11 (48): 31–37. doi:10.12746/swrccc.v11i48.1209.
  9. ^ a b Springenberg, Jost Tobias; Dosovitskiy, Alexey; Brox, Thomas; Riedmiller, Martin (2015-04-13). "Striving for Simplicity: The All Convolutional Net". ICLR (Workshop Track). doi:10.48550/arXiv.1412.6806.
  10. ^ Chen, Xiongren; Li, Jiuyong; Liu, Jixue; Peters, Stefan; Liu, Lin; Le, Thuc Duy; Walsh, Walsh (2023). "Improve interpretability of Information Bottlenecks for Attribution with Layer-wise Relevance Propagation". 2023 IEEE International Conference on Big Data (BigData). pp. 1064–1069. doi:10.1109/bigdata59044.2023.10386271. ISBN 979-8-3503-2445-7.
  11. ^ a b c d e f g h i Guo, Xianpeng; Hou, Biao; Wu, Zitong; Ren, Bo; Wang, Shuang; Jiao, Licheng (2022). "Prob-POS: A Framework for Improving Visual Explanations from Convolutional Neural Networks for Remote Sensing Image Classification". Remote Sensing. 14 (13): 3042. Bibcode:2022RemS...14.3042G. doi:10.3390/rs14133042.
  12. ^ a b c d e Zhou, Bolei; Khosla, Aditya; Lapedriza, Agata; Oliva, Aude; Torralba, Antonio (2016). "Learning Deep Features for Discriminative Localization". 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2921–2929. doi:10.1109/cvpr.2016.319. hdl:1721.1/112986. ISBN 978-1-4673-8851-1.
  13. ^ a b c d e f g h i j Wang, Yue; Liu, Yuyang; Zou, Jiaqi; Huo, Mengyao (2023). Signal and Information Processing, Networking and Computers: Proceedings of the 10th International Conference on Signal and Information Processing, Networking and Computers (ICSINC). Singapore: Springer Nature Singapore. pp. 399–407. doi:10.1007/978-981-19-9968-0. ISBN 978-981-19996-7-3.
  14. ^ a b c d Jung, Hyungsik; Oh, Youngrock (2021). "Towards Better Explanations of Class Activation Mapping". 2021 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 1316–1324. arXiv:2102.05228. doi:10.1109/iccv48922.2021.00137. ISBN 978-1-6654-2812-5.
  15. ^ a b c d Selvaraju, Ramprasaath R.; Cogswell, Michael; Das, Abhishek; Vedantam, Ramakrishna; Parikh, Devi; Batra, Dhruv (2017). "Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization". 2017 IEEE International Conference on Computer Vision (ICCV). pp. 618–626. doi:10.1109/iccv.2017.74. ISBN 978-1-5386-1032-9.
  16. ^ a b c Zhang, Yubo; Zhu, Yong; Liu, Junli; Yu, Wei; Jiang, Chuang (2025). "An Interpretability Optimization Method for Deep Learning Networks Based on Grad-CAM". IEEE Internet of Things Journal. 12 (4): 3961–3970. doi:10.1109/jiot.2024.3485765.
  17. ^ Chattopadhyay, Aditya; Sarkar, Anirban; Howlader, Prantik; Balasubramanian, Vineeth N. (2018). "Grad-CAM++: Generalized Gradient-Based Visual Explanations for Deep Convolutional Networks". 2018 IEEE Winter Conference on Applications of Computer Vision (WACV): 839–847. arXiv:1710.11063. doi:10.1109/wacv.2018.00097. ISBN 978-1-5386-4886-5.
  18. ^ Wang, Haofan; Wang, Zifan; Du, Mengnan; Yang, Fan; Zhang, Zijian; Ding, Sirui; Mardziel, Piotr; Hu, Xia (2020). "Score-CAM: score-weighted visual explanations for convolutional neural networks". 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW): 111–119. doi:10.1109/cvprw50498.2020.00020. ISBN 978-1-7281-9360-1.
  19. ^ Jiang, Peng-Tao; Zhang, Chang-Bin; Hou, Qibin; Cheng, Ming-Ming; Wei, Yunchao (2021). "LayerCAM: Exploring Hierarchical Class Activation Maps for Localization". IEEE Transactions on Image Processing. 30: 5875–5888. Bibcode:2021ITIP...30.5875J. doi:10.1109/tip.2021.3089943.