The backfitting algorithm is a simple iterative procedure used to fit a generalized additive model. It was introduced in 1985 by Leo Breiman and Jerome Friedman along with generalized additive models. In most cases, the backfitting algorithm is equivalent to the Gauss–Seidel method for solving a certain linear system of equations.
Algorithm
Generalized additive models are a class of non-parametric regression models of the form:

$Y = \alpha + \sum_{j=1}^p f_j(X_j) + \epsilon$

where each $X_j$ is a variable in our $p$-dimensional predictor $X$, and $Y$ is our outcome variable. $\epsilon$ represents our inherent error, which is assumed to have mean zero. The $f_j$ represent unspecified smooth functions of a single $X_j$. Given the flexibility in the $f_j$, we typically do not have a unique solution: $\alpha$ is left unidentifiable. It is common to rectify this by constraining

$\mathbb{E}[f_j(X_j)] = 0 \text{ for all } j,$

leaving

$\alpha = \mathbb{E}[Y]$

necessarily.
The backfitting algorithm is then:

Initialize $\hat{\alpha} = \frac{1}{N}\sum_{i=1}^N y_i$, $\hat{f_j} \equiv 0$ for all $j$

Do until the $\hat{f_j}$ converge:

For each predictor $j$:

(a) $\hat{f_j} \leftarrow \text{Smooth}\left[\left\{y_i - \hat{\alpha} - \sum_{k \neq j} \hat{f_k}(x_{ik})\right\}_{i=1}^N\right]$ (backfitting step)

(b) $\hat{f_j} \leftarrow \hat{f_j} - \frac{1}{N}\sum_{i=1}^N \hat{f_j}(x_{ij})$ (mean centering of the estimated function)

where $\text{Smooth}$ is our smoothing operator. This is typically chosen to be a cubic spline smoother but can be any other appropriate fitting operation, such as one of the following (a code sketch of the full loop is given after the list):
local polynomial regression
kernel smoothing methods
more complex operators, such as surface smoothers for second and higher-order interactions
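As an illustration of the loop above, here is a minimal sketch in Python. It assumes a Nadaraya–Watson kernel smoother with a Gaussian kernel in place of a cubic spline; the names `kernel_smooth` and `backfit` and the bandwidth value are illustrative choices, not part of any standard library.

```python
import numpy as np

def kernel_smooth(x, r, bandwidth=0.3):
    # Nadaraya-Watson kernel smoother: fit the response r against the single
    # predictor x with a Gaussian kernel, returning fitted values at the
    # observed x. (Illustrative choice; any smoother could be substituted.)
    d = (x[:, None] - x[None, :]) / bandwidth
    w = np.exp(-0.5 * d ** 2)
    return (w @ r) / w.sum(axis=1)

def backfit(X, y, n_iter=50, tol=1e-6):
    # Backfitting for the additive model y = alpha + sum_j f_j(X_j) + eps.
    n, p = X.shape
    alpha = y.mean()          # initialize alpha-hat as the sample mean
    f = np.zeros((n, p))      # each f_j-hat, evaluated at the data, starts at 0
    for _ in range(n_iter):
        f_old = f.copy()
        for j in range(p):
            # (a) smooth the partial residuals against X_j
            resid = y - alpha - f.sum(axis=1) + f[:, j]
            f[:, j] = kernel_smooth(X[:, j], resid)
            # (b) mean-centre so that each fitted f_j averages to zero
            f[:, j] -= f[:, j].mean()
        if np.max(np.abs(f - f_old)) < tol:
            break
    return alpha, f
```

Calling `backfit(X, y)` on an $n \times p$ design matrix returns the estimated constant $\hat{\alpha}$ and an $n \times p$ array holding each $\hat{f_j}$ evaluated at the observations.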
Motivation
If we consider the problem of minimizing the expected squared error

$\min \; \mathbb{E}\left[Y - \left(\alpha + \sum_{j=1}^p f_j(X_j)\right)\right]^2,$

then there exists a unique solution, by the theory of projections, given by

$f_i(X_i) = \mathbb{E}\left[Y - \left(\alpha + \sum_{j \neq i} f_j(X_j)\right) \,\Big|\, X_i\right]$

for all $i = 1, 2, \ldots, p$.
This gives the matrix interpretation:

$\begin{pmatrix} I & P_1 & \cdots & P_1 \\ P_2 & I & \cdots & P_2 \\ \vdots & & \ddots & \vdots \\ P_p & P_p & \cdots & I \end{pmatrix} \begin{pmatrix} f_1(X_1) \\ f_2(X_2) \\ \vdots \\ f_p(X_p) \end{pmatrix} = \begin{pmatrix} P_1 Y \\ P_2 Y \\ \vdots \\ P_p Y \end{pmatrix}$

where $P_i(\cdot) = \mathbb{E}(\cdot \mid X_i)$. In this context we can imagine a smoother matrix, $S_i$, which approximates our $P_i$ and gives an estimate, $S_i Y$, of $\mathbb{E}(Y \mid X_i)$:

$\begin{pmatrix} I & S_1 & \cdots & S_1 \\ S_2 & I & \cdots & S_2 \\ \vdots & & \ddots & \vdots \\ S_p & S_p & \cdots & I \end{pmatrix} \begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_p \end{pmatrix} = \begin{pmatrix} S_1 Y \\ S_2 Y \\ \vdots \\ S_p Y \end{pmatrix}$

or in abbreviated form

$\hat{S} f = Q Y.$
An exact solution of this system is infeasible to calculate for large $np$, so the iterative technique of backfitting is used. We take initial guesses $\hat{f_i}^{(0)}$ and update each $\hat{f_i}$ in turn to be the smoothed fit of the residuals of all the others:

$\hat{f_i} \leftarrow S_i\left[Y - \sum_{j \neq i} \hat{f_j}\right]$

Looking at the abbreviated form, it is easy to see the backfitting algorithm as equivalent to the Gauss–Seidel method for linear smoothing operators $S_i$.
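To make the Gauss–Seidel connection concrete, the following sketch compares the backfitting fixed point against a direct solve of the block system $\hat{S}f = QY$ for $p = 2$. The random symmetric "smoothers" with operator norm below one are a purely illustrative construction, not real spline or kernel smoothers.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40

def random_smoother(n, rng):
    # Illustrative symmetric matrix with eigenvalues in [0, 0.9],
    # so that ||S1 S2|| < 1 and the sweeps contract.
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q @ np.diag(rng.uniform(0.0, 0.9, n)) @ Q.T

S1, S2 = random_smoother(n, rng), random_smoother(n, rng)
y = rng.standard_normal(n)

# Backfitting sweeps: f1 <- S1(y - f2), f2 <- S2(y - f1)
f1, f2 = np.zeros(n), np.zeros(n)
for _ in range(200):
    f1 = S1 @ (y - f2)
    f2 = S2 @ (y - f1)

# The same fixed point solves the block system S-hat f = Q y:
#   [ I   S1 ] [f1]   [S1 y]
#   [ S2  I  ] [f2] = [S2 y]
Shat = np.block([[np.eye(n), S1], [S2, np.eye(n)]])
rhs = np.concatenate([S1 @ y, S2 @ y])
f_direct = np.linalg.solve(Shat, rhs)
print(np.allclose(np.concatenate([f1, f2]), f_direct))  # True
```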
Explicit Derivation for Two Dimensions
For the two dimensional case, we can formulate the backfitting algorithm explicitly. We have:

$\hat{f_1} = S_1\left(Y - \hat{f_2}\right), \qquad \hat{f_2} = S_2\left(Y - \hat{f_1}\right)$

If we denote $\hat{f}_1^{(i)}$ as the estimate of $\hat{f_1}$ in the $i$-th updating step, the backfitting steps are

$\hat{f}_1^{(i)} = S_1\left[Y - \hat{f}_2^{(i-1)}\right], \qquad \hat{f}_2^{(i)} = S_2\left[Y - \hat{f}_1^{(i)}\right]$

By induction we get

$\hat{f}_1^{(i)} = Y - \sum_{a=0}^{i-1} (S_1 S_2)^a (I - S_1) Y - (S_1 S_2)^{i-1} S_1 \hat{f}_2^{(0)}$

and

$\hat{f}_2^{(i)} = S_2 \sum_{a=0}^{i-1} (S_1 S_2)^a (I - S_1) Y + S_2 (S_1 S_2)^{i-1} S_1 \hat{f}_2^{(0)}$

If we assume our constant $\alpha$ is zero and we set $\hat{f}_2^{(0)} = 0$, then we get

$\hat{f}_1^{(i)} = \left[I - \sum_{a=0}^{i-1} (S_1 S_2)^a (I - S_1)\right] Y,$

$\hat{f}_2^{(i)} = \left[S_2 \sum_{a=0}^{i-1} (S_1 S_2)^a (I - S_1)\right] Y.$

This converges if $\|S_1 S_2\| < 1$.
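These closed forms can be checked numerically. The sketch below (again with illustrative random symmetric smoothers whose norms keep $\|S_1 S_2\| < 1$) reproduces the iterative estimates from the induction formulas with $\hat{f}_2^{(0)} = 0$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
I = np.eye(n)

def shrinking_smoother(n, rng):
    # Illustrative symmetric matrix with eigenvalues in [0, 0.8],
    # so that ||S1 S2|| < 1 and the geometric expansion is valid.
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q @ np.diag(rng.uniform(0.0, 0.8, n)) @ Q.T

S1, S2 = shrinking_smoother(n, rng), shrinking_smoother(n, rng)
Y = rng.standard_normal(n)

i = 6  # number of completed updating steps
f1, f2 = np.zeros(n), np.zeros(n)  # iterate with f2^(0) = 0
for _ in range(i):
    f1 = S1 @ (Y - f2)
    f2 = S2 @ (Y - f1)

# Closed forms from the induction (with f2^(0) = 0):
#   f1^(i) = [I - M] Y   and   f2^(i) = S2 M Y,
# where M = sum_{a=0}^{i-1} (S1 S2)^a (I - S1).
M = sum(np.linalg.matrix_power(S1 @ S2, a) for a in range(i)) @ (I - S1)
print(np.allclose(f1, (I - M) @ Y), np.allclose(f2, S2 @ M @ Y))  # True True
```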
Issues
The choice of when to stop the algorithm is arbitrary and it is hard to know a priori how long reaching a specific convergence threshold will take. Also, the final model depends on the order in which the predictor variables are fit.
As well, the solution found by the backfitting procedure is non-unique. If $b$ is a vector such that $\hat{S}b = 0$ from above, and $\hat{f}$ is a solution, then $\hat{f} + \alpha b$ is also a solution for any $\alpha \in \mathbb{R}$. A modification of the backfitting algorithm involving projections onto the eigenspace of $S$ can remedy this problem.
Modified Algorithm
We can modify the backfitting algorithm to make it easier to provide a unique solution. Let $\mathcal{V}_1(S_i)$ be the space spanned by all the eigenvectors of $S_i$ that correspond to eigenvalue 1. Then any $b$ satisfying $\hat{S}b = 0$ has $b_i \in \mathcal{V}_1(S_i)$ for all $i = 1, \ldots, p$ and $b_1 + b_2 + \cdots + b_p = 0$. Now if we take $A$ to be a matrix that projects orthogonally onto $\mathcal{V}_1(S_1) + \cdots + \mathcal{V}_1(S_p)$, we get the following modified backfitting algorithm:
Initialize $\hat{\alpha} = \frac{1}{N}\sum_{i=1}^N y_i$, $\hat{f_j} \equiv 0$ for all $j$, $\hat{f_+} = \hat{\alpha} + \hat{f_1} + \cdots + \hat{f_p}$

Do until the $\hat{f_j}$ converge:

Regress $y - \hat{f_+}$ onto the space $\mathcal{V}_1(S_1) + \cdots + \mathcal{V}_1(S_p)$, setting $a = A(y - \hat{f_+})$

For each predictor $j$:

Apply the backfitting update to $(y - a)$ using the smoothing operator $(I - A)S_j$, yielding new estimates for $\hat{f_j}$
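One way the projector $A$ might be constructed is sketched below, assuming symmetric smoother matrices so that an eigendecomposition applies; the helper name `projector_V1` and the tolerance are illustrative.

```python
import numpy as np

def projector_V1(smoothers, tol=1e-8):
    # Orthogonal projector A onto V_1(S_1) + ... + V_1(S_p), the span of
    # all eigenvectors with eigenvalue 1 (assumes each S_i is symmetric).
    cols = []
    for S in smoothers:
        w, V = np.linalg.eigh(S)                  # eigendecomposition of S_i
        cols.append(V[:, np.abs(w - 1.0) < tol])  # eigenvalue-1 eigenvectors
    B = np.hstack(cols)
    U, s, _ = np.linalg.svd(B, full_matrices=False)
    U = U[:, s > tol]                             # orthonormal basis for the sum
    return U @ U.T                                # projector onto that space
```

For a cubic smoothing spline, for example, $\mathcal{V}_1(S_i)$ is spanned by the functions constant and linear in $X_i$, since the roughness penalty leaves them unshrunk.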
References
Breiman, L. & Friedman, J. H. (1985). "Estimating optimal transformations for multiple regression and correlations (with discussion)". Journal of the American Statistical Association. 80 (391): 580–619.
Hastie, T. J. & Tibshirani, R. J. (1990). "Generalized Additive Models". Monographs on Statistics and Applied Probability. 43.
Härdle, Wolfgang; et al. (June 9, 2004). "Backfitting". Retrieved 2009-11-15.