Normalization (machine learning)


In machine learning, normalization is a statistical technique with various applications. There are two main forms of normalization: data normalization and activation normalization. Data normalization (or feature scaling) is a general technique in statistics; it includes methods that rescale input data so that they have a well-behaved range, mean, variance, and other statistical properties. Activation normalization is specific to deep learning; it includes methods that rescale the activations of hidden neurons inside a neural network.
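For illustration, the two most common data-normalization methods, min-max scaling and standardization, can each be sketched in a line of NumPy. This is a minimal sketch; the toy data and variable names are illustrative, not drawn from any particular source:

    import numpy as np

    # Toy design matrix: rows are samples, columns are features.
    X = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])

    # Min-max scaling: rescale each feature to the range [0, 1].
    X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

    # Standardization: rescale each feature to zero mean, unit variance.
    X_standard = (X - X.mean(axis=0)) / X.std(axis=0)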

Normalization is often used for faster training convergence, less sensitivity to variations in input data, less overfitting, and better generalization to unseen data. Normalization techniques are often theoretically justified as reducing covariate shift, smoothing the optimization landscape, and increasing regularization, though they are mainly justified by empirical success.[1]

Activation normalization

In a deep neural network, the neural activations are arrays of numbers, and the output array of each layer is the input to the next layer. While this means that feature scaling techniques can be applied to normalize neural activations, specialized techniques have been developed that were empirically found to work better.[citation needed]

Normalization can also equalize the influence of different features on the learning process, counteracting data imbalance and making it possible for a single learning rate to work for all the weights.[citation needed]

The benefits of activation normalization are typically reported as: more stable gradients during training, reduced sensitivity to weight initialization, faster convergence, the ability to use larger learning rates and to train deeper networks, model regularization, less overfitting, and better generalization.[citation needed]

Batch normalization

Batch normalization (BatchNorm)[2] operates on the activations of a layer for each mini-batch rather than on the input features across the entire dataset.

It is typically applied after the linear transformation (and before the activation function) within a neural network layer.[citation needed]

Instead of applying a fixed formula like min-max scaling or standardization, BatchNorm learns the optimal scaling and shifting parameters for each layer during training. For a given mini-batch and feature map, it calculates the mean ($\mu_B$) and variance ($\sigma_B^2$) and uses them to normalize the activations. To maintain the expressive power of the network, learnable parameters $\gamma$ (scale) and $\beta$ (shift) are introduced:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$

where:
  • $x_i$ represents an individual activation in the mini-batch.
  • $\epsilon$ is a small constant added for numerical stability.
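As a concrete illustration, the training-time BatchNorm computation can be sketched in a few lines of NumPy. This is a minimal sketch, assuming 2-D activations of shape (batch, features); the function name is illustrative, and the running statistics that real implementations track for use at inference time are omitted:

    import numpy as np

    def batch_norm(x, gamma, beta, eps=1e-5):
        # x has shape (batch, features); statistics are computed per
        # feature, across the mini-batch dimension.
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mu) / np.sqrt(var + eps)
        return gamma * x_hat + beta  # learnable scale and shift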

It is claimed in the original publication that BatchNorm works by reducing "internal covariate shift", though the claim has both supporters[3][4] and detractors.[5][6]

Layer normalization

Layer normalization (LayerNorm)[7] is a common competitor to BatchNorm. Unlike BatchNorm, which normalizes activations across the batch dimension for a given feature, LayerNorm normalizes across all the features within a single data sample. Compared to BatchNorm, LayerNorm's performance is not affected by batch size, making it more stable when using smaller batch sizes or working with recurrent neural networks (RNNs).

It is a key component of Transformers, particularly for natural language processing tasks.

For a given data sample and layer, LayerNorm computes the mean ($\mu$) and variance ($\sigma^2$) over all the features. Similar to BatchNorm, learnable parameters $\gamma$ (scale) and $\beta$ (shift) are applied:

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$

where:
  • $x_i$ represents a feature (activation) of a single data sample.
  • $\epsilon$ is a small constant added for numerical stability.
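Under the same assumptions as the BatchNorm sketch above (2-D activations of shape (batch, features), illustrative names), the only change for LayerNorm is the axis over which the statistics are computed: per sample, across the features.

    import numpy as np

    def layer_norm(x, gamma, beta, eps=1e-5):
        # x has shape (batch, features); statistics are computed per
        # sample, across the feature dimension.
        mu = x.mean(axis=1, keepdims=True)
        var = x.var(axis=1, keepdims=True)
        x_hat = (x - mu) / np.sqrt(var + eps)
        return gamma * x_hat + beta  # learnable scale and shift

Because the statistics depend only on a single sample, the computation is unchanged at batch size one, which is why LayerNorm's behavior does not depend on batch size.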

Further reading

  • "Normalization Layers". labml.ai Deep Learning Paper Implementations. Retrieved 2024-08-07.

References

  1. ^ Huang, Lei (2022). Normalization Techniques in Deep Learning. Synthesis Lectures on Computer Vision. Cham: Springer International Publishing. doi:10.1007/978-3-031-14595-7. ISBN 978-3-031-14594-0.
  2. ^ Ioffe, Sergey; Szegedy, Christian (2015-06-01). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". Proceedings of the 32nd International Conference on Machine Learning. PMLR: 448–456.
  3. ^ Xu, Jingjing; Sun, Xu; Zhang, Zhiyuan; Zhao, Guangxiang; Lin, Junyang (2019). "Understanding and Improving Layer Normalization". Advances in Neural Information Processing Systems. 32. Curran Associates, Inc.
  4. ^ Awais, Muhammad; Bin Iqbal, Md. Tauhid; Bae, Sung-Ho (November 2021). "Revisiting Internal Covariate Shift for Batch Normalization". IEEE Transactions on Neural Networks and Learning Systems. 32 (11): 5082–5092. doi:10.1109/TNNLS.2020.3026784. ISSN 2162-237X.
  5. ^ Bjorck, Nils; Gomes, Carla P; Selman, Bart; Weinberger, Kilian Q (2018). "Understanding Batch Normalization". Advances in Neural Information Processing Systems. 31. Curran Associates, Inc.
  6. ^ Santurkar, Shibani; Tsipras, Dimitris; Ilyas, Andrew; Madry, Aleksander (2018). "How Does Batch Normalization Help Optimization?". Advances in Neural Information Processing Systems. 31. Curran Associates, Inc.
  7. ^ Ba, Jimmy Lei; Kiros, Jamie Ryan; Hinton, Geoffrey E. (2016). "Layer Normalization". arXiv:1607.06450. doi:10.48550/ARXIV.1607.06450.