Pooling layer
In neural networks, a pooling layer is a kind of network layer that downsamples and aggregates information dispersed among many vectors into fewer vectors.[1] It has several uses: it removes redundant information, reducing the amount of computation and memory required; it makes the model more robust to small variations in the input; and it increases the receptive field of neurons in later layers of the network.
Convolutional neural network pooling
Pooling is most commonly used in convolutional neural networks (CNN). We describe pooling in 2-dimensional CNNs. The generalization to n-dimensions is immediate.
As notation, we consider a tensor $x \in \mathbb{R}^{H \times W \times C}$, where $H$ is the height, $W$ is the width, and $C$ is the number of channels. A pooling layer outputs a tensor $y \in \mathbb{R}^{H' \times W' \times C}$.
We define two variables $f$ and $s$, called the "filter size" and "stride". Sometimes, it is necessary to use a different filter size and stride for the horizontal and vertical directions. In such cases, we define the 4 variables $f_H, f_W, s_H, s_W$.
The receptive field of an entry in the output tensor $y$ is the set of all entries in $x$ that can affect that entry.
Max pooling

Max Pooling (MaxPool) is commonly used in CNNs to reduce the spatial dimensions of feature maps. It was introduced in 1990.[2]
Define
$$y_{0,0,c} = \max(x_{0:f,\;0:f,\;c})$$
where $0\!:\!f$ means the range $0, 1, \dots, f-1$. Note that we need to avoid the off-by-one error. The next input is
$$y_{0,1,c} = \max(x_{0:f,\;s:s+f,\;c})$$
and so on. The receptive field of $y_{0,1,c}$ is $x_{0:f,\;s:s+f,\;c}$, so in general,
$$y_{m,n,c} = \max(x_{ms:ms+f,\;ns:ns+f,\;c})$$
If the horizontal and vertical filter sizes and strides differ, then in general,
$$y_{m,n,c} = \max(x_{m s_H : m s_H + f_H,\;n s_W : n s_W + f_W,\;c})$$
More succinctly, we can write $y = \mathrm{MaxPool}_{f,s}(x)$.

If $H$ is not expressible as $f + s k$ where $k$ is an integer, then for computing the entries of the output tensor on the boundaries, max pooling would attempt to take as inputs variables off the tensor. In this case, how those non-existent variables are handled depends on the padding conditions, illustrated on the right.
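For concreteness, the following NumPy sketch implements the general formula above for a square filter of size $f$ and stride $s$, keeping only the windows that fit entirely inside the tensor (one possible padding convention among several):

```python
import numpy as np

def max_pool_2d(x, f, s):
    """Max pooling with square filter size f and stride s ("valid" padding:
    windows that would extend past the border are dropped).
    x has shape (H, W, C); the output has shape (H', W', C) with
    H' = (H - f) // s + 1 and W' = (W - f) // s + 1."""
    H, W, C = x.shape
    H_out = (H - f) // s + 1
    W_out = (W - f) // s + 1
    y = np.empty((H_out, W_out, C), dtype=x.dtype)
    for m in range(H_out):
        for n in range(W_out):
            # y[m, n, c] = max over x[m*s : m*s + f, n*s : n*s + f, c]
            y[m, n] = x[m * s:m * s + f, n * s:n * s + f].max(axis=(0, 1))
    return y

x = np.arange(6 * 6 * 3, dtype=float).reshape(6, 6, 3)
print(max_pool_2d(x, f=2, s=2).shape)  # (3, 3, 3)
```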
Global Max Pooling (GMP) is a specific kind of max pooling where the output tensor has shape $1 \times 1 \times C$ and the receptive field of $y_{0,0,c}$ is all of $x_{:,:,c}$. That is, it takes the maximum over each entire channel. It is often used just before the final fully connected layers in a CNN classification head.
Average pooling
Average pooling (AvgPool) is similarly defined:
$$y_{m,n,c} = \frac{1}{f^2} \sum_{ms \le i < ms+f,\; ns \le j < ns+f} x_{i,j,c}$$
Global Average Pooling (GAP) is defined similarly to GMP. It was first proposed in Network-in-Network.[3] Similarly to GMP, it is often used just before the final fully connected layers in a CNN classification head.
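A corresponding NumPy sketch of average pooling and of the two global variants (GMP and GAP), under the same "valid"-padding assumption as the max-pooling sketch above:

```python
import numpy as np

def avg_pool_2d(x, f, s):
    """Average pooling over f x f windows with stride s ("valid" padding)."""
    H, W, C = x.shape
    H_out, W_out = (H - f) // s + 1, (W - f) // s + 1
    y = np.empty((H_out, W_out, C), dtype=float)
    for m in range(H_out):
        for n in range(W_out):
            y[m, n] = x[m * s:m * s + f, n * s:n * s + f].mean(axis=(0, 1))
    return y

def global_max_pool(x):
    """Global max pooling (GMP): one value per channel, shape (1, 1, C)."""
    return x.max(axis=(0, 1), keepdims=True)

def global_avg_pool(x):
    """Global average pooling (GAP): one value per channel, shape (1, 1, C)."""
    return x.mean(axis=(0, 1), keepdims=True)

x = np.random.rand(8, 8, 16)
print(avg_pool_2d(x, f=2, s=2).shape)  # (4, 4, 16)
print(global_avg_pool(x).shape)        # (1, 1, 16)
```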
Interpolations
There are several pooling operators that interpolate between max pooling and average pooling.
Mixed Pooling is a linear combination of max pooling and average pooling.[4] That is,
$$y_{m,n,c} = \lambda\, \mathrm{MaxPool}(x)_{m,n,c} + (1-\lambda)\, \mathrm{AvgPool}(x)_{m,n,c}$$
where $\lambda \in [0, 1]$ is either a hyperparameter, a learnable parameter, or randomly sampled anew every time.
Lp Pooling is like average pooling, but uses the Lp norm average instead of the arithmetic average:
$$y_{m,n,c} = \left( \frac{1}{N} \sum_{(i,j) \in R_{m,n}} |x_{i,j,c}|^p \right)^{1/p}$$
where $N$ is the size of the receptive field $R_{m,n}$, and $p \ge 1$ is a hyperparameter. If all activations are non-negative, then average pooling is the case of $p = 1$, and max pooling is the case of $p \to \infty$. Square-root pooling is the case of $p = 2$.[5]
Stochastic pooling samples a random activation $x_{i,j,c}$ from the receptive field with probability $\frac{x_{i,j,c}}{\sum_{(i',j') \in R_{m,n}} x_{i',j',c}}$, i.e. proportional to the activation. In expectation, its output is the activation-weighted average of the receptive field.[6]
Softmax pooling is like max pooling, but uses a softmax-weighted average, i.e.
$$y_{m,n,c} = \frac{\sum_{(i,j) \in R_{m,n}} e^{\beta x_{i,j,c}}\, x_{i,j,c}}{\sum_{(i,j) \in R_{m,n}} e^{\beta x_{i,j,c}}}$$
where $\beta \ge 0$. Average pooling is the case of $\beta \to 0$, and max pooling is the case of $\beta \to \infty$.[5]
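The following NumPy sketch applies each of these variants to a single flattened receptive field; applying them window by window, as in the max-pooling sketch above, yields the corresponding pooling layers. The symbols $\lambda$, $p$ and $\beta$ are the parameters described above.

```python
import numpy as np

def mixed_pool(r, lam):
    # Linear combination of max pooling and average pooling.
    return lam * r.max() + (1 - lam) * r.mean()

def lp_pool(r, p):
    # Lp pooling: for non-negative activations, p = 1 gives average pooling
    # and p -> infinity approaches max pooling.
    return (np.mean(np.abs(r) ** p)) ** (1 / p)

def stochastic_pool(r, rng):
    # Sample one activation with probability proportional to its value
    # (assumes non-negative activations, e.g. after a ReLU).
    probs = r / r.sum()
    return rng.choice(r, p=probs)

def softmax_pool(r, beta):
    # Softmax-weighted average; beta -> 0 gives average pooling,
    # beta -> infinity approaches max pooling.
    w = np.exp(beta * (r - r.max()))  # subtract max for numerical stability
    return (w * r).sum() / w.sum()

rng = np.random.default_rng(0)
r = rng.random(9)  # one 3x3 receptive field, flattened
print(mixed_pool(r, 0.5), lp_pool(r, 2), stochastic_pool(r, rng), softmax_pool(r, 4.0))
```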
RoI pooling

Region of Interest Pooling (also known as RoI pooling) is a variant of max pooling in which the output size is fixed and the input rectangle is a parameter. It is used in R-CNNs for object detection.[7]
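A hedged NumPy sketch of the idea (the exact bin-boundary and rounding conventions vary between implementations): the RoI rectangle is split into a fixed grid of bins, and each bin is max-pooled.

```python
import numpy as np

def roi_max_pool(x, roi, out_size):
    """RoI max pooling sketch. `roi` = (r0, c0, r1, c1) is a rectangle in
    feature-map coordinates (end-exclusive); `out_size` = (h, w) is the fixed
    output grid. Each output cell is the max over its bin of the rectangle."""
    r0, c0, r1, c1 = roi
    h, w = out_size
    # Bin boundaries: split the RoI into an h x w grid of roughly equal cells.
    row_edges = np.linspace(r0, r1, h + 1).astype(int)
    col_edges = np.linspace(c0, c1, w + 1).astype(int)
    C = x.shape[2]
    y = np.empty((h, w, C), dtype=x.dtype)
    for i in range(h):
        for j in range(w):
            # Guarantee at least one row/column per bin, even for tiny RoIs.
            cell = x[row_edges[i]:max(row_edges[i + 1], row_edges[i] + 1),
                     col_edges[j]:max(col_edges[j + 1], col_edges[j] + 1)]
            y[i, j] = cell.max(axis=(0, 1))
    return y

x = np.random.rand(32, 32, 8)
print(roi_max_pool(x, roi=(5, 10, 20, 30), out_size=(7, 7)).shape)  # (7, 7, 8)
```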
Other poolings
Spatial pyramid pooling applies max pooling (or any other form of pooling) in a pyramid structure. That is, it applies global max pooling, then applies max pooling to the image divided into 4 equal parts, then 16, etc. The results are then concatenated. It is a hierarchical form of global pooling, and similar to global pooling, it is often used just before a classification head.[8]
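A NumPy sketch of the pyramid, assuming max pooling at levels 1×1, 2×2 and 4×4 (the choice of levels is a design parameter):

```python
import numpy as np

def spatial_pyramid_pool(x, levels=(1, 2, 4)):
    """Spatial pyramid pooling sketch: at level k the feature map is split into
    a k x k grid, each cell is max-pooled per channel, and all results are
    concatenated into one fixed-length vector (independent of H and W)."""
    H, W, C = x.shape
    feats = []
    for k in levels:
        row_edges = np.linspace(0, H, k + 1).astype(int)
        col_edges = np.linspace(0, W, k + 1).astype(int)
        for i in range(k):
            for j in range(k):
                cell = x[row_edges[i]:row_edges[i + 1],
                         col_edges[j]:col_edges[j + 1]]
                feats.append(cell.max(axis=(0, 1)))
    return np.concatenate(feats)  # length C * sum(k*k for k in levels)

x = np.random.rand(17, 23, 8)           # arbitrary spatial size
print(spatial_pyramid_pool(x).shape)    # (8 * (1 + 4 + 16),) = (168,)
```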
Multihead attention pooling (MAP) applies a multiheaded attention block to pooling. Specifically, it takes as input a list of vectors $x_1, \dots, x_n$, applies a feedforward layer to each vector to obtain a matrix $V$, then sends that matrix through a multiheaded attention block $\mathrm{MultiheadedAttention}(Q, V, V)$, where $Q$ is a matrix of trainable parameters (the learned queries). It is used in vision transformers.[9]
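As a simplified illustration, the following single-head NumPy sketch pools a set of token vectors with a trainable query matrix; the feedforward layer from the description is omitted and the projection matrices are illustrative placeholders (MAP as used in [9] employs a full multiheaded attention block).

```python
import numpy as np

def attention_pool(X, Q, Wk, Wv):
    """Single-head attention pooling sketch.
    X: (n, d) token vectors; Q: (m, d) trainable query matrix;
    Wk, Wv: (d, d) trainable key/value projections. The multihead version
    splits the d dimensions across several heads."""
    K, V = X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])          # (m, n) attention logits
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ V                              # (m, d) pooled output

rng = np.random.default_rng(0)
n, d, m = 50, 64, 1          # 50 tokens, width 64, a single pooled vector
X = rng.standard_normal((n, d))
Q = rng.standard_normal((m, d))
Wk, Wv = rng.standard_normal((d, d)), rng.standard_normal((d, d))
print(attention_pool(X, Q, Wk, Wv).shape)  # (1, 64)
```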
See [10][11] for reviews of pooling methods.
Vision Transformer pooling
In Vision Transformers (ViT), the following kinds of pooling are commonly used to turn the sequence of output token vectors into a single representation:
- Class token pooling: a special learnable [CLS] token is prepended to the input sequence, and its output representation is used as the summary of the whole image.
- Global average pooling (GAP): the output token vectors are averaged.
- Multihead attention pooling (MAP): a multiheaded attention block with trainable queries aggregates the output token vectors, as described above.[9]
Graph neural network pooling
In graph neural networks (GNN), there are also two forms of pooling: global and local. Global pooling can be seen as a special case of local pooling where the receptive field is the entire input graph.
- Local pooling: a local pooling layer coarsens the graph via downsampling. Local pooling is used to increase the receptive field of a GNN, in a similar fashion to pooling layers in convolutional neural networks. Examples include k-nearest neighbours pooling, top-k pooling,[12] and self-attention pooling.[13]
- Global pooling: a global pooling layer, also known as a readout layer, provides a fixed-size representation of the whole graph. The global pooling layer must be permutation invariant, such that permutations in the ordering of graph nodes and edges do not alter the final output.[14] Examples include the element-wise sum, mean or maximum.
Local pooling layers coarsen the graph via downsampling. We present here several learnable local pooling strategies that have been proposed.[14] In each case, the input graph is represented by a matrix $\mathbf{X}$ of node features and the graph adjacency matrix $\mathbf{A}$. The output is the new matrix $\mathbf{X}'$ of node features and the new graph adjacency matrix $\mathbf{A}'$.
Top-k pooling
We first set
$$\mathbf{y} = \frac{\mathbf{X}\mathbf{p}}{\|\mathbf{p}\|}$$
where $\mathbf{p}$ is a learnable projection vector. The projection vector $\mathbf{p}$ computes a scalar projection value for each graph node.
The top-k pooling layer[12] can then be formalised as follows:
$$\mathbf{X}' = (\mathbf{X} \odot \operatorname{sigmoid}(\mathbf{y}))_{\mathbf{i}}, \qquad \mathbf{A}' = \mathbf{A}_{\mathbf{i},\mathbf{i}}$$
where $\mathbf{i} = \operatorname{top}_k(\mathbf{y})$ is the subset of nodes with the top-k highest projection scores, $\odot$ denotes element-wise matrix multiplication, and $\operatorname{sigmoid}(\cdot)$ is the sigmoid function. In other words, the nodes with the top-k highest projection scores are retained in the new adjacency matrix $\mathbf{A}'$. The $\operatorname{sigmoid}(\cdot)$ operation makes the projection vector $\mathbf{p}$ trainable by backpropagation, which otherwise would produce discrete outputs.[12]
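A NumPy sketch of the top-k pooling equations above (in practice the projection vector $\mathbf{p}$ is trained by backpropagation; here it is fixed for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def top_k_pool(X, A, p, k):
    """Top-k pooling sketch.
    X: (n, d) node features, A: (n, n) adjacency matrix,
    p: (d,) learnable projection vector, k: number of nodes to keep."""
    y = X @ p / np.linalg.norm(p)               # scalar projection score per node
    idx = np.argsort(y)[-k:]                    # indices of the top-k scores
    X_new = X[idx] * sigmoid(y[idx])[:, None]   # gate kept features by sigmoid(y)
    A_new = A[np.ix_(idx, idx)]                 # adjacency restricted to kept nodes
    return X_new, A_new

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 4))
A = (rng.random((10, 10)) < 0.3).astype(float)
X_new, A_new = top_k_pool(X, A, p=rng.standard_normal(4), k=5)
print(X_new.shape, A_new.shape)  # (5, 4) (5, 5)
```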
Self-attention pooling
We first set
$$\mathbf{y} = \operatorname{GNN}(\mathbf{X}, \mathbf{A})$$
where $\operatorname{GNN}$ is a generic permutation equivariant GNN layer (e.g., GCN, GAT, MPNN).
The self-attention pooling layer[13] can then be formalised as follows:
$$\mathbf{X}' = (\mathbf{X} \odot \mathbf{y})_{\mathbf{i}}, \qquad \mathbf{A}' = \mathbf{A}_{\mathbf{i},\mathbf{i}}$$
where $\mathbf{i} = \operatorname{top}_k(\mathbf{y})$ is the subset of nodes with the top-k highest projection scores, and $\odot$ denotes element-wise matrix multiplication.
The self-attention pooling layer can be seen as an extension of the top-k pooling layer. Unlike top-k pooling, the self-attention scores computed in self-attention pooling account for both the graph features and the graph topology.
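A minimal NumPy sketch of self-attention pooling, in which a simple degree-normalized neighbourhood average with a weight vector stands in for the generic GNN layer; this placeholder is an assumption for illustration (the published SAGPool uses a graph convolutional layer to score nodes and applies a nonlinearity to the scores).

```python
import numpy as np

def self_attention_pool(X, A, W, k):
    """Self-attention pooling sketch. The scores y come from a placeholder
    GNN layer: mean over neighbours (with self-loops) times a weight vector
    W of shape (d, 1), standing in for a generic permutation-equivariant layer."""
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)
    y = ((A_hat / deg) @ X @ W).ravel()        # one attention score per node
    idx = np.argsort(y)[-k:]                   # keep the top-k scoring nodes
    X_new = X[idx] * y[idx][:, None]           # gate kept features by their scores
    A_new = A[np.ix_(idx, idx)]                # adjacency restricted to kept nodes
    return X_new, A_new

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 4))
A = (rng.random((10, 10)) < 0.3).astype(float)
X_new, A_new = self_attention_pool(X, A, W=rng.standard_normal((4, 1)), k=5)
print(X_new.shape, A_new.shape)  # (5, 4) (5, 5)
```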
See also
- Convolutional neural network
- Subsampling
- Image scaling
- Feature extraction
- Region of interest
- Graph neural network
References
- ^ Zhang, Aston; Lipton, Zachary; Li, Mu; Smola, Alexander J. (2024). "7.5. Pooling". Dive into deep learning. Cambridge New York Port Melbourne New Delhi Singapore: Cambridge University Press. ISBN 978-1-009-38943-3.
- ^ Yamaguchi, Kouichi; Sakamoto, Kenji; Akabane, Toshio; Fujimoto, Yoshiji (November 1990). A Neural Network for Speaker-Independent Isolated Word Recognition. First International Conference on Spoken Language Processing (ICSLP 90). Kobe, Japan. Archived from the original on 2021-03-07. Retrieved 2019-09-04.
- ^ Lin, Min; Chen, Qiang; Yan, Shuicheng (2013). "Network In Network". arXiv:1312.4400 [cs.NE].
- ^ Yu, Dingjun; Wang, Hanli; Chen, Peiqiu; Wei, Zhihua (2014). "Mixed Pooling for Convolutional Neural Networks". In Miao, Duoqian; Pedrycz, Witold; Ślȩzak, Dominik; Peters, Georg; Hu, Qinghua; Wang, Ruizhi (eds.). Rough Sets and Knowledge Technology. Lecture Notes in Computer Science. Vol. 8818. Cham: Springer International Publishing. pp. 364–375. doi:10.1007/978-3-319-11740-9_34. ISBN 978-3-319-11740-9.
- ^ a b Boureau, Y-Lan; Ponce, Jean; LeCun, Yann (2010-06-21). "A theoretical analysis of feature pooling in visual recognition". Proceedings of the 27th International Conference on International Conference on Machine Learning. ICML'10. Madison, WI, USA: Omnipress: 111–118. ISBN 978-1-60558-907-7.
- ^ Zeiler, Matthew D.; Fergus, Rob (2013-01-15), Stochastic Pooling for Regularization of Deep Convolutional Neural Networks, arXiv:1301.3557
- ^ Zhang, Aston; Lipton, Zachary; Li, Mu; Smola, Alexander J. (2024). "14.8. Region-based CNNs (R-CNNs)". Dive into deep learning. Cambridge New York Port Melbourne New Delhi Singapore: Cambridge University Press. ISBN 978-1-009-38943-3.
- ^ He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2015-09-01). "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition". IEEE Transactions on Pattern Analysis and Machine Intelligence. 37 (9): 1904–1916. arXiv:1406.4729. doi:10.1109/TPAMI.2015.2389824. ISSN 0162-8828. PMID 26353135.
- ^ Zhai, Xiaohua; Kolesnikov, Alexander; Houlsby, Neil; Beyer, Lucas (June 2022). "Scaling Vision Transformers". 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. pp. 1204–1213. arXiv:2106.04560. doi:10.1109/CVPR52688.2022.01179. ISBN 978-1-6654-6946-3.
- ^ Zafar, Afia; Aamir, Muhammad; Mohd Nawi, Nazri; Arshad, Ali; Riaz, Saman; Alruban, Abdulrahman; Dutta, Ashit Kumar; Almotairi, Sultan (2022-08-29). "A Comparison of Pooling Methods for Convolutional Neural Networks". Applied Sciences. 12 (17): 8643. doi:10.3390/app12178643. ISSN 2076-3417.
- ^ Gholamalinezhad, Hossein; Khosravi, Hossein (2020-09-16), Pooling Methods in Deep Neural Networks, a Review, arXiv:2009.07485, retrieved 2024-09-09
- ^ a b c Gao, Hongyang; Ji, Shuiwang (2019). "Graph U-Nets". arXiv:1905.05178 [cs.LG].
- ^ a b Lee, Junhyun; Lee, Inyeop; Kang, Jaewoo (2019). "Self-Attention Graph Pooling". arXiv:1904.08082 [cs.LG].
- ^ a b Liu, Chuang; Zhan, Yibing; Li, Chang; Du, Bo; Wu, Jia; Hu, Wenbin; Liu, Tongliang; Tao, Dacheng (2022). "Graph Pooling for Graph Neural Networks: Progress, Challenges, and Opportunities". arXiv:2204.07321 [cs.LG].