Neural network Gaussian process
As Bayesian artificial neural networks are made wider, the distribution over functions they compute converges to a Gaussian process, with a particular compositional kernel that depends on the neural network architecture and the prior distribution over model parameters. This Neural Network Gaussian Process (NNGP) can be evaluated to generate predictions that would come from an infinitely wide Bayesian neural network, without ever instantiating a neural network. The NNGP additionally describes the distribution over functions realized by non-Bayesian neural networks at random initialization.
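A minimal sketch of how such an NNGP can be evaluated is given below, assuming a network with a single infinitely wide ReLU hidden layer and independent Gaussian priors on weights and biases; the function names and parameter values (sigma_w, sigma_b, noise) are illustrative assumptions, not part of the article. The hidden-layer covariance uses the ReLU (arc-cosine) expectation, and predictions come from the standard Gaussian process regression posterior mean, so no network is ever instantiated.

```python
# Sketch: NNGP kernel of a one-hidden-layer ReLU network, used for GP regression.
# Assumed hyperparameters (sigma_w, sigma_b, noise) are illustrative only.
import numpy as np

def nngp_kernel(X1, X2, sigma_w=1.0, sigma_b=0.1):
    """NNGP kernel for a fully connected net with one infinite ReLU layer."""
    d = X1.shape[1]
    # Covariance of the first-layer pre-activations under the parameter prior.
    k12 = sigma_b**2 + sigma_w**2 * (X1 @ X2.T) / d
    k11 = sigma_b**2 + sigma_w**2 * np.sum(X1**2, axis=1) / d
    k22 = sigma_b**2 + sigma_w**2 * np.sum(X2**2, axis=1) / d
    # ReLU expectation (arc-cosine kernel form); clip guards against round-off.
    norm = np.sqrt(np.outer(k11, k22))
    theta = np.arccos(np.clip(k12 / norm, -1.0, 1.0))
    k_relu = norm * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)
    # Output-layer covariance: the NNGP kernel.
    return sigma_b**2 + sigma_w**2 * k_relu

def nngp_predict(X_train, y_train, X_test, noise=1e-3):
    """Posterior mean of the infinitely wide Bayesian network at X_test."""
    K = nngp_kernel(X_train, X_train)
    K_star = nngp_kernel(X_test, X_train)
    alpha = np.linalg.solve(K + noise * np.eye(len(X_train)), y_train)
    return K_star @ alpha

# Toy usage: regress y = sin(x) without ever instantiating a neural network.
X = np.linspace(-3, 3, 20).reshape(-1, 1)
y = np.sin(X).ravel()
X_new = np.array([[0.5], [2.0]])
print(nngp_predict(X, y, X_new))
```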
This equivalence between wide neural networks and NNGPs has been shown to hold for single-hidden-layer and deep neural networks as the number of units per layer is taken to infinity, for convolutional neural networks as the number of channels is taken to infinity, and for transformer networks as the number of attention heads is taken to infinity.
This limit is of particular practical relevance, since finite-width neural networks are often found to perform strictly better as their width increases.