
Weight initialization

From Wikipedia, the free encyclopedia

In machine learning and deep learning, weight initialization describes the initial step in creating a neural network. A neural network contains trainable parameters that are modified during training. Before training, these need to be assigned initial values. This assignment step is weight initialization.

The choice of weight initialization method affects the speed of convergence, the scale of neural activations within the network, the scale of gradient signals during backpropagation, and the quality of the final model. Proper initialization is necessary for avoiding issues such as vanishing and exploding gradients and saturation of the activation functions.

Initialization schemes

We discuss the main methods of initialization in the context of a multilayer perceptron (MLP). Specific strategies for initializing other network architectures are discussed in later sections.

For an MLP, there are only two kinds of trainable parameters, called weights and biases. Each layer l contains a weight matrix W^{(l)} ∈ ℝ^{n_{l−1} × n_l} and a bias vector b^{(l)} ∈ ℝ^{n_l}, where n_l is the number of neurons in that layer. A weight initialization method is an algorithm for setting the initial values of W^{(l)} and b^{(l)} for each layer l.
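The following is a minimal NumPy sketch of this parameter structure; the layer sizes and the helper name make_params are illustrative assumptions, not part of any standard library.

    import numpy as np

    def make_params(layer_sizes):
        # Layer l maps n_{l-1} inputs to n_l outputs, so W^{(l)} has shape
        # (n_{l-1}, n_l) and b^{(l)} has shape (n_l,). Values are left
        # uninitialized here; a weight initialization method fills them in.
        params = []
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
            W = np.empty((n_in, n_out))
            b = np.empty(n_out)
            params.append((W, b))
        return params

    params = make_params([784, 256, 10])  # example architecture (assumed)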

Zero initialization

The simplest form is zero initialization: W^{(l)} = 0 and b^{(l)} = 0 for every layer l. Zero initialization is sometimes used for initializing biases, but it is not used for initializing weights, as it leads to symmetry in the network, causing all neurons to learn the same features.
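As a sketch of how this is typically applied (zeros for the biases only), assuming the NumPy conventions above with example layer sizes:

    import numpy as np

    n_in, n_out = 256, 10          # example layer sizes (assumed)
    b = np.zeros(n_out)            # zero initialization of the bias vector
    # W = np.zeros((n_in, n_out))  # avoided for weights: all neurons would
    #                              # receive identical gradients and stay identical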

Random initialization

Random initialization involves setting the weights to small random values, typically drawn from a normal distribution or a uniform distribution.

Uniform random initialization typically samples each entry of W^{(l)} from the uniform distribution U[−1/√n_{l−1}, 1/√n_{l−1}].
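A minimal sketch of this sampling in NumPy follows; the function name uniform_init and the use of numpy.random.default_rng are assumptions made for illustration.

    import numpy as np

    def uniform_init(n_in, n_out, rng=None):
        # Sample each entry of W from U[-1/sqrt(n_in), 1/sqrt(n_in)],
        # where n_in is the number of inputs (fan-in) of the layer.
        rng = np.random.default_rng() if rng is None else rng
        bound = 1.0 / np.sqrt(n_in)
        return rng.uniform(-bound, bound, size=(n_in, n_out))

    W = uniform_init(256, 10)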

Glorot initialization

Glorot initialization (or Xavier initialization) was proposed by Xavier Glorot and Yoshua Bengio.[1] It was designed to keep the scale of gradients roughly the same in all layers.

For uniform initialization, it samples each entry of W^{(l)} from U[−√(6/(n_{l−1} + n_l)), √(6/(n_{l−1} + n_l))]. In this context, n_{l−1} is also called the "fan-in", and n_l the "fan-out".
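A sketch of Glorot uniform initialization under the same conventions; the function name glorot_uniform is an assumption for this example, though common libraries provide comparable routines (e.g. glorot_uniform in Keras, xavier_uniform_ in PyTorch).

    import numpy as np

    def glorot_uniform(fan_in, fan_out, rng=None):
        # Sample each entry from U[-b, b] with b = sqrt(6 / (fan_in + fan_out)).
        rng = np.random.default_rng() if rng is None else rng
        bound = np.sqrt(6.0 / (fan_in + fan_out))
        return rng.uniform(-bound, bound, size=(fan_in, fan_out))

    W = glorot_uniform(256, 128)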

He initialization

He initialization (or Kaiming initialization) was proposed by Kaiming He et al.[2] It was designed for networks with ReLU activation.

It samples each entry of W^{(l)} from the normal distribution N(0, 2/n_{l−1}), that is, a zero-mean Gaussian with variance 2/n_{l−1}.
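A sketch of He normal initialization under the same conventions; the function name he_normal is an assumption (comparable routines exist in common libraries, e.g. kaiming_normal_ in PyTorch).

    import numpy as np

    def he_normal(fan_in, fan_out, rng=None):
        # Sample each entry from N(0, 2/fan_in): zero mean, standard deviation
        # sqrt(2/fan_in), compensating for ReLU zeroing out half of its inputs.
        rng = np.random.default_rng() if rng is None else rng
        std = np.sqrt(2.0 / fan_in)
        return rng.normal(0.0, std, size=(fan_in, fan_out))

    W = he_normal(256, 128)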

Other methods

For the hyperbolic tangent activation function, a particular scaling is sometimes used: f(x) = 1.7159 tanh(2x/3). This was sometimes called "LeCun's tanh". It was designed so that if the input has variance roughly 1, then the output has variance roughly 1.[3][4]
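A sketch of this scaled activation, assuming the formula above:

    import numpy as np

    def lecun_tanh(x):
        # Scaled hyperbolic tangent 1.7159 * tanh(2x/3); for input with
        # variance roughly 1, the output variance is also roughly 1.
        return 1.7159 * np.tanh(2.0 * x / 3.0)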

Further analyses of how initialization and architecture affect the early stages of training, as well as data-dependent initialization schemes, have also been proposed.[5][6]

References

  1. ^ Glorot, Xavier; Bengio, Yoshua (2010-03-31). "Understanding the difficulty of training deep feedforward neural networks". Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings: 249–256.
  2. ^ He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2015). "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification". doi:10.48550/ARXIV.1502.01852.
  3. ^ Y. LeCun. Generalization and network design strategies. In R. Pfeifer, Z. Schreter, F. Fogelman, and L. Steels, editors, Connectionism in Perspective, Amsterdam, 1989. Elsevier. Proceedings of the International Conference Connectionism in Perspective, University of Zurich, 10–13 October 1988.
  4. ^ LeCun, Yann; Bottou, Léon; Orr, Genevieve B.; Müller, Klaus-Robert (1998), Orr, Genevieve B.; Müller, Klaus-Robert (eds.), "Efficient BackProp", Neural Networks: Tricks of the Trade, Berlin, Heidelberg: Springer, pp. 9–50, doi:10.1007/3-540-49430-8_2, ISBN 978-3-540-49430-0, retrieved 2024-10-05
  5. ^ Hanin, Boris; Rolnick, David (2018). "How to Start Training: The Effect of Initialization and Architecture". Advances in Neural Information Processing Systems. 31. Curran Associates, Inc.
  6. ^ Mishkin, Dmytro; Matas, Jiri (2016-02-19), All you need is a good init, doi:10.48550/arXiv.1511.06422, retrieved 2024-10-05

Further reading

  • Narkhede, Meenal V.; Bartakke, Prashant P.; Sutaone, Mukul S. (June 28, 2021). "A review on weight initialization strategies for neural networks". Artificial Intelligence Review. 55 (1). Springer Science and Business Media LLC: 291–322. doi:10.1007/s10462-021-10033-z. ISSN 0269-2821.