Normalization (machine learning)
In machine learning, normalization is a statistical technique with various applications. There are two main forms of normalization: data normalization and activation normalization. Data normalization is a general technique in statistics, and it includes methods that rescale input data so that they have a well-behaved range, mean, variance, and other statistical properties. Activation normalization is specific to deep learning, and it includes methods that rescale the activations of hidden neurons inside a neural network.
Normalization is often used to obtain faster training convergence, lower sensitivity to variations in input data, less overfitting, and better generalization to unseen data. Normalization techniques are often theoretically justified as reducing covariate shift, smoothing the optimization landscape, or adding implicit regularization, though they are mainly justified by empirical success.[1]
Data normalization
Normalization is often used in applications involving distances and similarities between data points, such as clustering and similarity search. As an example, the K-means clustering algorithm is sensitive to feature scales, because it relies on Euclidean distances: a feature with a much larger numeric range dominates the distance computation unless the data are rescaled.
Notation
- $x^{(1)}, \dots, x^{(n)}$ is a dataset of $n$ data points.
- $x$ is an individual data point in the dataset.
- $x_{\min}$ is the minimum value in the dataset.
- $x_{\max}$ is the maximum value in the dataset.
- $\mu$ is the mean value of the dataset.
- $\sigma$ is the standard deviation of the dataset.
Min-max scaling
Min-max scaling, also known as min-max normalization, is a linear transformation that rescales data into a specific range, typically between 0 and 1. Given a dataset with a feature $x$, the min-max scaled value is calculated as:
$x' = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}$
For example, with the dataset [1, 5, 10], $x_{\min} = 1$ and $x_{\max} = 10$. Applying min-max scaling, we get [0, 0.44, 1].
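As an illustration, here is a minimal NumPy sketch of min-max scaling; the helper name min_max_scale is chosen for this example and is not from any particular library.

import numpy as np

def min_max_scale(x):
    # Linearly rescale values into [0, 1] using the dataset minimum and maximum.
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

print(min_max_scale([1, 5, 10]))  # -> approximately [0.0, 0.444, 1.0]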
Standardization

Standardization, often referred to as z-score normalization, transforms data to have a mean ($\mu$) of 0 and a standard deviation ($\sigma$) of 1. This process centers the data around zero and scales it based on the spread of the original data:
$x' = \dfrac{x - \mu}{\sigma}$
In other words, it normalizes each data point to be its z-score relative to the normal distribution $\mathcal{N}(\mu, \sigma^2)$.
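A corresponding NumPy sketch; the function name standardize is illustrative.

import numpy as np

def standardize(x):
    # Map each value to its z-score: zero mean, unit standard deviation.
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

z = standardize([1, 5, 10])
print(z.mean(), z.std())  # approximately 0.0 and 1.0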
Mean normalization
Mean normalization changes data to have a mean of 0 and a range contained within $[-1, 1]$:
$x' = \dfrac{x - \mu}{x_{\max} - x_{\min}}$
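A minimal NumPy sketch, with the illustrative helper name mean_normalize:

import numpy as np

def mean_normalize(x):
    # Center on the mean and divide by the range; results lie within [-1, 1].
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.max() - x.min())

print(mean_normalize([1, 5, 10]))  # the output has mean 0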
Robust Scaling
Robust scaling, also known as standardization using median and interquartile range (IQR), is designed to be robust to outliers. It scales features using the median and IQR as reference points instead of the mean and standard deviation:
$x' = \dfrac{x - Q_2}{Q_3 - Q_1}$
where $Q_1, Q_2, Q_3$ are the three quartiles (25th, 50th, and 75th percentiles) of the feature.
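A minimal NumPy sketch using np.percentile; the helper name robust_scale is illustrative (libraries such as scikit-learn provide an equivalent RobustScaler).

import numpy as np

def robust_scale(x):
    # Center on the median (Q2) and divide by the interquartile range (Q3 - Q1),
    # so a few extreme outliers have little effect on the scaling.
    x = np.asarray(x, dtype=float)
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    return (x - q2) / (q3 - q1)

print(robust_scale([1, 2, 3, 4, 100]))  # the outlier 100 does not distort the scale of the rest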
Unit vector normalization
Unit vector normalization regards each individual data point as a vector $x$, and divides it by its vector norm to obtain $x / \lVert x \rVert$. Any vector norm can be used, but the most common ones are the L1 norm and the L2 norm.
For example, if $x = (v_1, v_2)$, then its Lp-normalized version is:
$\dfrac{x}{\lVert x \rVert_p} = \left( \dfrac{v_1}{(|v_1|^p + |v_2|^p)^{1/p}},\ \dfrac{v_2}{(|v_1|^p + |v_2|^p)^{1/p}} \right)$
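A minimal NumPy sketch; the helper name unit_normalize is illustrative.

import numpy as np

def unit_normalize(x, p=2):
    # Divide the vector by its Lp norm so that the result has unit norm.
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, ord=p)

print(unit_normalize([3.0, 4.0]))        # L2: [0.6, 0.8]
print(unit_normalize([3.0, 4.0], p=1))   # L1: [0.428..., 0.571...]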
Activation normalization
In a deep neural network, the neural activations are arrays of numbers, and the output array of each layer is the input data to the next layer. Although this means that data normalization techniques could be applied to neural activations, specialized techniques have been developed that were empirically found to work better.
Normalization can also equalize the contribution of different features to the learning process, preventing imbalance between features and allowing a single learning rate to work for all the weights.
Typically, the reported benefits of activation normalization include: more stable gradients during training, reduced sensitivity to weight initialization, faster convergence, the ability to use larger learning rates, the ability to train deeper networks, implicit regularization, less overfitting, and better generalization.
Batch normalization
Batch normalization (BatchNorm)[2] operates on the activations of a layer for each mini-batch rather than on the input features across the entire dataset.
It is typically applied after the linear transformation (and before the activation function) within a neural network layer.
Instead of applying a fixed formula like Min-Max or Standardization, BatchNorm learns the optimal scaling and shifting parameters for each layer during training. For a given mini-batch $B$ and feature map, it calculates the mean ($\mu_B$) and variance ($\sigma_B^2$) and uses them to normalize the activations. To maintain the expressive power of the network, learnable parameters $\gamma$ (scale) and $\beta$ (shift) are introduced:
$\hat{x}_i = \dfrac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$
- $x_i$ represents individual activations in the mini-batch.
- $\epsilon$ is a small constant added for numerical stability.
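A simplified NumPy sketch of the forward computation for a fully connected layer, assuming inputs of shape (batch_size, num_features); running statistics for inference and the backward pass are omitted, and the function name batch_norm is illustrative.

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x has shape (batch_size, num_features); the mean and variance are computed
    # per feature across the mini-batch dimension (axis 0).
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalized activations
    return gamma * x_hat + beta             # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 8))   # a mini-batch of 32 samples, 8 features
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(3), y.var(axis=0).round(3))  # per-feature mean ~0, variance ~1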
It is claimed in the original publication that BatchNorm works by reducing "internal covariate shift", though the claim has both supporters[3][4] and detractors.[5][6]
Layer normalization
Layer normalization (LayerNorm)[7] is a common competitor to BatchNorm. Unlike BatchNorm, which normalizes activations across the batch dimension for a given feature, LayerNorm normalizes across all the features within a single data sample. Compared to BatchNorm, LayerNorm's performance is not affected by batch size, making it more stable when using smaller batch sizes or working with recurrent neural networks (RNNs).
It is a key component of Transformers, particularly for natural language processing tasks.
For a given data sample and layer, LayerNorm computes the mean ($\mu$) and variance ($\sigma^2$) over all the features. Similar to BatchNorm, learnable parameters $\gamma$ (scale) and $\beta$ (shift) are applied:
$\hat{x}_i = \dfrac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$
- $x_i$ represents the features (activations) for a single data sample.
- $\epsilon$ is a small constant added for numerical stability.
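A simplified NumPy sketch of the forward computation, analogous to the BatchNorm sketch above but with statistics computed per sample over the feature axis; the function name layer_norm is illustrative.

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # x has shape (batch_size, num_features); the mean and variance are computed
    # per sample across the feature dimension, independently of the batch size.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalized features for each sample
    return gamma * x_hat + beta             # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))                # 4 samples, 16 features each
y = layer_norm(x, gamma=np.ones(16), beta=np.zeros(16))
print(y.mean(axis=-1).round(3))             # per-sample mean ~0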
References
- ^ Huang, Lei (2022). Normalization Techniques in Deep Learning. Synthesis Lectures on Computer Vision. Cham: Springer International Publishing. doi:10.1007/978-3-031-14595-7. ISBN 978-3-031-14594-0.
- ^ Ioffe, Sergey; Szegedy, Christian (2015-06-01). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". Proceedings of the 32nd International Conference on Machine Learning. PMLR: 448–456.
- ^ Xu, Jingjing; Sun, Xu; Zhang, Zhiyuan; Zhao, Guangxiang; Lin, Junyang (2019). "Understanding and Improving Layer Normalization". Advances in Neural Information Processing Systems. 32. Curran Associates, Inc.
- ^ Awais, Muhammad; Bin Iqbal, Md. Tauhid; Bae, Sung-Ho (November 2021). "Revisiting Internal Covariate Shift for Batch Normalization". IEEE Transactions on Neural Networks and Learning Systems. 32 (11): 5082–5092. doi:10.1109/TNNLS.2020.3026784. ISSN 2162-237X.
- ^ Bjorck, Nils; Gomes, Carla P; Selman, Bart; Weinberger, Kilian Q (2018). "Understanding Batch Normalization". Advances in Neural Information Processing Systems. 31. Curran Associates, Inc.
- ^ Santurkar, Shibani; Tsipras, Dimitris; Ilyas, Andrew; Madry, Aleksander (2018). "How Does Batch Normalization Help Optimization?". Advances in Neural Information Processing Systems. 31. Curran Associates, Inc.
- ^ Ba, Jimmy Lei; Kiros, Jamie Ryan; Hinton, Geoffrey E. (2016). "Layer Normalization". arXiv:1607.06450. doi:10.48550/ARXIV.1607.06450.