Linear predictor function


In statistics and in machine learning, a linear predictor function is a linear function (linear combination) of a set of coefficients and explanatory variables (independent variables), whose value is used to predict the outcome of a dependent variable. Functions of this sort are standard in linear regression, where the coefficients are termed regression coefficients. However, they also occur in various types of linear classifiers (e.g. perceptrons, support vector machines, and linear discriminant analysis), as well as in various other models, such as principal component analysis and factor analysis. In many of these models, the coefficients are referred to as "weights".

Basic form

The basic form of a linear predictor function f(i) for data point i (consisting of p explanatory variables), for i = 1, ..., n, is

f(i) = β0 + β1 xi1 + β2 xi2 + ... + βp xip

where xi1, ..., xip are the values of the explanatory variables for data point i, and β0, β1, ..., βp are the coefficients (regression coefficients, weights, etc.) indicating the relative effect of a particular explanatory variable on the outcome.

It is common to write the predictor function in a more compact form as follows:

  • The coefficients β0, β1, ..., βp are grouped into a single vector β of size p+1.
  • For each data point i, an additional explanatory pseudo-variable xi0 is added, with a fixed value of 1, corresponding to the intercept coefficient β0.
  • The resulting explanatory variables xi0, xi1, ..., xip are then grouped into a single vector xi of size p+1.

This makes it possible to write the linear predictor function as follows:

f(i) = β · xi

using the notation β · xi for a dot product between two vectors.

An equivalent notation is as follows:

f(i) = β'xi

where β'xi indicates matrix multiplication between the 1-by-(p+1) row vector β' and the (p+1)-by-1 column vector xi, producing a 1-by-1 matrix that is taken to be a scalar. (The apostrophe indicates a matrix transpose, where vectors are assumed by default to be column vectors.)
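As a minimal illustrative sketch (assuming NumPy, with arbitrary example values for the coefficients and variables), the dot-product form can be computed directly, with the pseudo-variable xi0 = 1 placed first so that the first coefficient acts as the intercept:

  import numpy as np

  # Coefficient vector: beta_0 (intercept), beta_1, beta_2.
  beta = np.array([0.5, 2.0, -1.0])

  # Explanatory variables for one data point, with x_i0 = 1 prepended.
  x_i = np.array([1.0, 3.0, 4.0])

  # f(i) = beta . x_i = beta_0 + beta_1*x_i1 + beta_2*x_i2
  f_i = np.dot(beta, x_i)   # 0.5 + 2.0*3.0 + (-1.0)*4.0 = 2.5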

An example of the usage of such a linear predictor function is in linear regression, where each data point is associated with a continuous outcome yi, and the relationship is written

yi = β · xi + εi

where εi is a disturbance term or error variable, an unobserved random variable that adds noise to the linear relationship between the dependent variable and the predictor function.

Stacking

In some models (linear regression in particular), the equations for each of the data points i = 1 ... n are stacked together and written in vector form as

y = Xβ + ε

where

  • y = (y1, ..., yn)' is the vector of outcomes,
  • X is the n-by-(p+1) matrix of explanatory variables, whose i-th row is the row vector xi',
  • β = (β0, β1, ..., βp)' is the vector of coefficients, and
  • ε = (ε1, ..., εn)' is the vector of disturbance terms.
This makes it possible to find optimal coefficients through the method of least squares using simple matrix operations.
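In matrix terms, the least-squares coefficients solve the stacked system y = Xβ + ε. As a minimal sketch (assuming NumPy, with arbitrary toy values for the data), this can be done as follows:

  import numpy as np

  # Toy data: n = 5 data points, p = 2 explanatory variables.
  raw_X = np.array([[1.0, 2.0],
                    [2.0, 0.5],
                    [3.0, 1.5],
                    [4.0, 3.0],
                    [5.0, 2.5]])
  y = np.array([3.1, 4.2, 6.8, 9.9, 11.5])

  # Add the pseudo-variable x_i0 = 1 so that the first coefficient is the intercept.
  X = np.column_stack([np.ones(len(raw_X)), raw_X])

  # Least-squares solution of the stacked system y = X beta + epsilon.
  beta, *_ = np.linalg.lstsq(X, y, rcond=None)

  # Fitted values: the linear predictor beta . x_i for every data point at once.
  fitted = X @ beta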

The explanatory variables

Although the outcomes (dependent variables) to be predicted are often assumed to be random variables, the explanatory variables themselves usually are not; instead, they are treated as fixed values, and any random variables (such as the outcome) are modeled conditional on them. As a result, the model user is free to transform the explanatory variables in arbitrary ways, including creating multiple copies of a given explanatory variable, each transformed using a different function. Another common technique is to create new explanatory variables in the form of interaction variables, by taking products of two (or sometimes more) existing explanatory variables.
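A minimal sketch of such transformations (assuming NumPy, with hypothetical variable names and values) might create a transformed copy of one variable and an interaction variable from the product of two:

  import numpy as np

  # Two original explanatory variables for four data points.
  x1 = np.array([1.0, 2.0, 3.0, 4.0])
  x2 = np.array([0.5, 1.0, 1.5, 2.0])

  # Transformed copies and an interaction variable; all are legitimate
  # explanatory variables because the predictor only needs to be linear
  # in the coefficients, not in the original variables.
  x1_squared = x1 ** 2
  log_x2 = np.log(x2)
  interaction = x1 * x2

  # Design matrix: intercept column plus original and derived variables.
  X = np.column_stack([np.ones(len(x1)), x1, x2, x1_squared, log_x2, interaction])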

As an example, polynomial regression uses a linear predictor function to fit a polynomial relationship of a given degree between two paired sets of data points (i.e. a single real-valued explanatory variable and a related real-valued dependent variable), by adding multiple explanatory variables corresponding to successive powers of the existing explanatory variable. Mathematically, the form looks like this:

yi = β0 + β1 xi + β2 xi^2 + ... + βp xi^p + εi

In this case, for each data point i, a set of explanatory variables is created as follows:

xi1 = xi,   xi2 = xi^2,   ...,   xip = xi^p

and then standard linear regression is run. This example shows that a linear predictor function can actually be much more powerful than it first appears: it only really needs to be linear in the coefficients. (It is even possible to fit some functions that appear non-linear in the coefficients by transforming the coefficients into new coefficients that do appear linear. For example, a function of the form ln(a) + b xi1 for coefficients a and b is non-linear in a, but applying the substitution β0 = ln(a) leads to β0 + b xi1, which is linear. Linear regression and similar techniques could then be applied and will often still find the optimal coefficients, but their error estimates and such will be wrong.)
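A minimal sketch of polynomial regression along these lines (assuming NumPy, with arbitrary example data) builds the powers of the single explanatory variable as separate columns and then runs ordinary least squares on them:

  import numpy as np

  # A single real-valued explanatory variable and a related outcome.
  x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
  y = np.array([1.1, 2.9, 9.2, 19.1, 33.0, 51.2])

  degree = 2  # fit a quadratic: beta_0 + beta_1*x_i + beta_2*x_i^2

  # Column k holds x_i^k, so column 0 is the pseudo-variable x_i0 = 1.
  X = np.column_stack([x ** k for k in range(degree + 1)])

  # Standard linear regression on the derived variables: the model is linear
  # in the coefficients even though the fitted curve is a polynomial in x.
  beta, *_ = np.linalg.lstsq(X, y, rcond=None)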

The explanatory variables may be of any type: real-valued, binary, categorical, etc. The main distinction is between continuous variables (e.g. income, age, blood pressure, etc.) and discrete variables (e.g. sex, race, political party, etc.). Discrete variables referring to more than two possible choices are typically coded using dummy variables (or indicator variables), i.e. separate explanatory variables taking the value 0 or 1 are created for each possible value of the discrete variable, with a 1 meaning "variable does have the given value" and a 0 meaning "variable does not have the given value". For example, a four-way discrete variable of blood type with the possible values "A, B, AB, O" would be converted to four separate two-way dummy variables, "is-A, is-B, is-AB, is-O", where only one of them has the value 1 and all the rest have the value 0. This allows a separate regression coefficient to be fitted for each possible value of the discrete variable.

Note that, for K categories, not all K dummy variables are independent of each other. For example, in the above blood-type example, only three of the four dummy variables are independent, in the sense that once the values of three of the variables are known, the fourth is automatically determined. Thus it is really only necessary to encode three of the four possibilities as dummy variables; if all four possibilities are encoded, the overall model becomes non-identifiable. This causes problems for a number of methods, such as the simple closed-form solution used in linear regression. The solution is either to avoid such cases by eliminating one of the dummy variables, or to introduce a regularization constraint (which necessitates a more powerful, typically iterative, method for finding the optimal coefficients).
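A minimal sketch of this coding (assuming NumPy, with hypothetical data) creates one dummy variable per blood type and then drops one of them as the reference category so that the model remains identifiable:

  import numpy as np

  blood_type = np.array(["A", "B", "AB", "O", "A", "O"])
  categories = ["A", "B", "AB", "O"]

  # One 0/1 dummy variable per category ("is-A", "is-B", "is-AB", "is-O").
  dummies = np.column_stack([(blood_type == c).astype(float) for c in categories])

  # Only K - 1 dummies are needed: drop "is-O" as the reference category so
  # that, together with the intercept column, the design matrix has full rank.
  design = np.column_stack([np.ones(len(blood_type)), dummies[:, :-1]])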