Reparameterization trick

The reparameterization trick (aka "reparameterization gradient estimator" or "pathwise derivative") is a technique used in machine learning, particularly in variational inference and stochastic optimization. It allows for the efficient computation of gradients through random variables, enabling the optimization of parametric probability models using stochastic gradient descent, and the variance reduction of estimators.

This trick has been used in various machine learning applications, most notably in variational autoencoders (VAEs).

Mathematics

Let $z$ be a random variable with distribution $q_\phi(z)$, where $\phi$ is a vector containing the parameters of the distribution. The reparameterization trick expresses $z$ as:
$$z = g_\phi(\epsilon)$$
Here, $g_\phi$ is a deterministic function parameterized by $\phi$, and $\epsilon$ is a noise variable drawn from a fixed distribution $p(\epsilon)$.

REINFORCE estimator

Consider an objective function of the form:
$$L(\phi) = \mathbb{E}_{z \sim q_\phi(z)}[f(z)]$$
Without the reparameterization trick, estimating the gradient $\nabla_\phi L(\phi)$ can be challenging, because the parameter appears in the random variable itself. In more detail, we have
$$\nabla_\phi L(\phi) = \nabla_\phi \int q_\phi(z)\, f(z)\, dz$$
The REINFORCE estimator, widely used in reinforcement learning,[1] estimates the gradient by
$$\nabla_\phi L(\phi) = \int \nabla_\phi q_\phi(z)\, f(z)\, dz = \int q_\phi(z)\, \nabla_\phi \ln q_\phi(z)\, f(z)\, dz = \mathbb{E}_{z \sim q_\phi(z)}\!\left[f(z)\, \nabla_\phi \ln q_\phi(z)\right]$$
This allows the gradient to be estimated:
$$\nabla_\phi L(\phi) \approx \frac{1}{N} \sum_{i=1}^{N} f(z_i)\, \nabla_\phi \ln q_\phi(z_i), \qquad z_i \sim q_\phi(z)$$

The REINFORCE estimator has high variance, and many methods were developed to reduce its variance.[2]
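A minimal sketch of this estimator in PyTorch (not part of the original presentation): $q_\phi$ is taken to be a Gaussian $\mathcal{N}(\mu, \sigma^2)$ and $f(z) = z^2$ is an illustrative choice, so the true gradients are $2\mu$ and $2\sigma^2$.

    import torch

    mu = torch.tensor(1.0, requires_grad=True)
    log_sigma = torch.tensor(0.0, requires_grad=True)   # sigma = exp(log_sigma) > 0

    def f(z):
        return z ** 2                                    # illustrative integrand

    N = 10_000
    q = torch.distributions.Normal(mu, log_sigma.exp())
    z = q.sample((N,))                         # .sample() does not track gradients through z
    surrogate = (f(z) * q.log_prob(z)).mean()  # differentiating this gives the REINFORCE estimate
    surrogate.backward()
    print(mu.grad, log_sigma.grad)             # noisy estimates near 2.0 and 2.0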

Reparameterization estimator

With the reparameterization trick, we rewrite the expectation as:
$$L(\phi) = \mathbb{E}_{\epsilon \sim p(\epsilon)}\!\left[f(g_\phi(\epsilon))\right]$$
Now, the gradient can be estimated as:
$$\nabla_\phi L(\phi) = \mathbb{E}_{\epsilon \sim p(\epsilon)}\!\left[\nabla_\phi f(g_\phi(\epsilon))\right] \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\phi f(g_\phi(\epsilon_i)), \qquad \epsilon_i \sim p(\epsilon)
$$
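For comparison, the same illustrative objective as above, estimated with the pathwise (reparameterization) estimator; a sketch, not a prescribed implementation:

    import torch

    mu = torch.tensor(1.0, requires_grad=True)
    log_sigma = torch.tensor(0.0, requires_grad=True)

    def f(z):
        return z ** 2

    N = 10_000
    eps = torch.randn(N)                  # eps ~ N(0, 1), a fixed distribution independent of phi
    z = mu + log_sigma.exp() * eps        # z = g_phi(eps), deterministic and differentiable
    loss = f(z).mean()                    # Monte Carlo estimate of E[f(g_phi(eps))]
    loss.backward()
    print(mu.grad, log_sigma.grad)        # typically far lower variance than REINFORCE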

Examples

For some common distributions, the reparameterization trick takes specific forms:

Normal distribution: For $z \sim \mathcal{N}(\mu, \sigma^2)$, we can use:
$$z = \mu + \sigma \epsilon, \qquad \epsilon \sim \mathcal{N}(0, 1)$$

Exponential distribution: For $z \sim \mathrm{Exp}(\lambda)$, we can use:
$$z = -\frac{1}{\lambda} \ln(\epsilon), \qquad \epsilon \sim \mathrm{Uniform}(0, 1)$$
Discrete distributions can be reparametrized by the Gumbel distribution ("Gumbel-max trick" or "concrete distribution").[3]
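A short sketch of the normal and exponential transforms (PyTorch; the parameter names are illustrative). Gradients reach the distribution parameters because the noise is drawn from fixed distributions:

    import torch

    mu = torch.tensor(0.5, requires_grad=True)
    log_sigma = torch.tensor(0.0, requires_grad=True)   # sigma = exp(log_sigma)
    log_rate = torch.tensor(0.0, requires_grad=True)    # lambda = exp(log_rate)

    eps = torch.randn(1000)                       # eps ~ N(0, 1)
    z_normal = mu + log_sigma.exp() * eps         # z ~ N(mu, sigma^2)

    u = 1.0 - torch.rand(1000)                    # u ~ Uniform(0, 1], avoids log(0)
    z_exp = -torch.log(u) / log_rate.exp()        # z ~ Exp(lambda), by inverting the CDF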

The Gamma, Beta, Dirichlet, and von Mises distributions can be reparametrized by the implicit reparameterization method proposed by Figurnov et al.[4]
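Such implicitly reparameterized samplers are available in common libraries; for example, PyTorch's torch.distributions.Gamma exposes a differentiable rsample. A usage sketch with arbitrary parameter values:

    import torch

    concentration = torch.tensor(2.0, requires_grad=True)
    rate = torch.tensor(1.0, requires_grad=True)
    z = torch.distributions.Gamma(concentration, rate).rsample((1000,))  # reparameterized samples
    z.mean().backward()                    # gradients flow back to concentration and rate
    print(concentration.grad, rate.grad)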

Applications

Variational autoencoder

In Variational Autoencoders (VAEs), the VAE objective function, known as the Evidence Lower Bound (ELBO), is given by:
$$\mathrm{ELBO}(\phi, \theta) = \mathbb{E}_{z \sim q_\phi(z|x)}\!\left[\ln p_\theta(x|z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z|x) \,\|\, p(z)\right)$$

where $q_\phi(z|x)$ is the encoder (recognition model), $p_\theta(x|z)$ is the decoder (generative model), and $p(z)$ is the prior distribution over latent variables. The gradient of ELBO with respect to $\theta$ is simply
$$\nabla_\theta \mathrm{ELBO}(\phi, \theta) = \mathbb{E}_{z \sim q_\phi(z|x)}\!\left[\nabla_\theta \ln p_\theta(x|z)\right]$$
but the gradient with respect to $\phi$ requires the trick. Express the sampling operation $z \sim q_\phi(z|x)$ as:
$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
where $\mu_\phi(x)$ and $\sigma_\phi(x)$ are the outputs of the encoder network, and $\odot$ denotes element-wise multiplication. Then we have
$$\nabla_\phi \mathrm{ELBO}(\phi, \theta) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\!\left[\nabla_\phi \ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right]$$
where $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$. This allows us to estimate the gradient using Monte Carlo sampling:
$$\nabla_\phi \mathrm{ELBO}(\phi, \theta) \approx \frac{1}{L} \sum_{l=1}^{L} \nabla_\phi \ln \frac{p_\theta(x, z_l)}{q_\phi(z_l|x)}$$
where $z_l = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon_l$ and $\epsilon_l \sim \mathcal{N}(0, I)$ for $l = 1, \ldots, L$.

This formulation enables backpropagation through the sampling process, allowing for end-to-end training of the VAE model using stochastic gradient descent or its variants.
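A condensed sketch of this training setup in PyTorch; the network sizes, the Bernoulli decoder, and the closed-form Gaussian KL term are illustrative assumptions rather than part of the article:

    import torch
    import torch.nn as nn

    class VAE(nn.Module):
        def __init__(self, x_dim=784, h_dim=256, z_dim=20):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
            self.enc_mu = nn.Linear(h_dim, z_dim)        # mu_phi(x)
            self.enc_logvar = nn.Linear(h_dim, z_dim)    # ln sigma_phi(x)^2
            self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                     nn.Linear(h_dim, x_dim))

        def forward(self, x):
            h = self.enc(x)
            mu, logvar = self.enc_mu(h), self.enc_logvar(h)
            eps = torch.randn_like(mu)                   # eps ~ N(0, I)
            z = mu + (0.5 * logvar).exp() * eps          # reparameterized sample from q_phi(z|x)
            logits = self.dec(z)
            # Reconstruction term, -ln p_theta(x|z), for a Bernoulli decoder
            rec = nn.functional.binary_cross_entropy_with_logits(logits, x, reduction='sum')
            # KL(q_phi(z|x) || N(0, I)) in closed form for a diagonal Gaussian encoder
            kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
            return rec + kl                              # negative ELBO, to be minimized

    model = VAE()
    x = torch.rand(32, 784)                              # dummy batch with values in [0, 1]
    loss = model(x)
    loss.backward()                                      # gradients flow through z into the encoder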

Variational inference

More generally, the trick allows using stochastic gradient descent for generic variational inference. Let the variational objective (ELBO) be of the form:
$$\mathrm{ELBO}(\phi) = \mathbb{E}_{z \sim q_\phi(z)}\!\left[\ln p(x, z) - \ln q_\phi(z)\right]$$
Using the reparameterization trick, we can estimate the gradient of this objective with respect to $\phi$ using Monte Carlo sampling:
$$\nabla_\phi \mathrm{ELBO}(\phi) \approx \frac{1}{L} \sum_{l=1}^{L} \nabla_\phi \left[\ln p(x, g_\phi(\epsilon_l)) - \ln q_\phi(g_\phi(\epsilon_l))\right], \qquad \epsilon_l \sim p(\epsilon)$$
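A minimal sketch under illustrative assumptions: the log joint $\ln p(x, z)$ is represented by a stand-in function log_joint, and $q_\phi$ is a diagonal Gaussian with parameters mu and log_sigma.

    import torch

    def log_joint(z):                      # stand-in for ln p(x, z); any differentiable function works
        return -0.5 * (z ** 2).sum(-1)

    mu = torch.zeros(2, requires_grad=True)
    log_sigma = torch.zeros(2, requires_grad=True)

    eps = torch.randn(100, 2)                          # eps_l ~ N(0, I)
    z = mu + log_sigma.exp() * eps                     # z_l = g_phi(eps_l)
    log_q = torch.distributions.Normal(mu, log_sigma.exp()).log_prob(z).sum(-1)
    elbo = (log_joint(z) - log_q).mean()               # Monte Carlo estimate of the ELBO
    (-elbo).backward()                                 # ascend the ELBO with any SGD variant
    print(mu.grad, log_sigma.grad)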

Dropout

The reparameterization trick has been applied to reduce the variance in dropout, a regularization technique in neural networks. In variational dropout,[5] the dropout operation is reparameterized as:
$$y = (W \odot \epsilon)\, x, \qquad \epsilon_{ji} \sim \mathrm{Bernoulli}(1 - p_i)$$
where $W$ is the weight matrix, $x$ is the input, and $p_i$ are the dropout rates.

The local reparameterization trick reduces its variance by pushing the noise to the activations:
$$y_j = \mu_j + \sigma_j \epsilon_j, \qquad \epsilon_j \sim \mathcal{N}(0, 1)$$
where $\mu_j = \sum_i (1 - p_i)\, W_{ji}\, x_i$ and $\sigma_j^2 = \sum_i p_i (1 - p_i)\, W_{ji}^2\, x_i^2$, with $\mu_j$ and $\sigma_j^2$ being the mean and variance of the $j$-th output neuron.
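A sketch of the idea in PyTorch; the Gaussian approximation of the activation noise and the tensor shapes are illustrative assumptions:

    import torch

    def dropout_local_reparam(x, W, p):
        # Instead of sampling a Bernoulli mask over W, sample the activations
        # y_j ~ N(mu_j, sigma_j^2) directly, with one fresh draw per example.
        keep = 1.0 - p                                   # keep probabilities, one per input unit
        mean = (x * keep) @ W.t()                        # mu_j = sum_i (1 - p_i) W_ji x_i
        var = (x ** 2 * p * keep) @ (W ** 2).t()         # sigma_j^2 = sum_i p_i (1 - p_i) W_ji^2 x_i^2
        return mean + var.sqrt() * torch.randn_like(mean)

    x = torch.randn(32, 100)                             # batch of inputs
    W = torch.randn(50, 100, requires_grad=True)         # weight matrix (outputs x inputs)
    p = torch.full((100,), 0.5)                          # per-input dropout rates
    y = dropout_local_reparam(x, W, p)                   # differentiable with respect to W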

References

  1. ^ Williams, Ronald J. (1992-05-01). "Simple statistical gradient-following algorithms for connectionist reinforcement learning". Machine Learning. 8 (3): 229–256. doi:10.1007/BF00992696. ISSN 1573-0565.
  2. ^ Greensmith, Evan; Bartlett, Peter L.; Baxter, Jonathan (2004). "Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning". Journal of Machine Learning Research. 5 (Nov): 1471–1530. ISSN 1533-7928.
  3. ^ Maddison, Chris J.; Mnih, Andriy; Teh, Yee Whye (2017-03-05), The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables, doi:10.48550/arXiv.1611.00712, retrieved 2024-09-23
  4. ^ Figurnov, Mikhail; Mohamed, Shakir; Mnih, Andriy (2018). "Implicit Reparameterization Gradients". Advances in Neural Information Processing Systems. 31. Curran Associates, Inc.
  5. ^ Kingma, Durk P; Salimans, Tim; Welling, Max (2015). "Variational Dropout and the Local Reparameterization Trick". Advances in Neural Information Processing Systems. 28.