
Learnable function class

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Syanuma1 (talk | contribs) at 00:07, 16 December 2015 (Background). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

In statistical learning theory, a learnable function class is a set of functions for which an algorithm can be devised to asymptotically minimize the expected risk, uniformly over all probability distributions. The concept of learnable classes is closely linked to regularization in machine learning.

Definition

Background

Let $X = Y \times Z$ be the sample space, where $Y$ are the labels and $Z$ are the covariates (predictors). $\mathcal{F} = \{f : Z \to Y\}$ is a collection of mappings (functions) under consideration to link $Z$ to $Y$. $L : Y \times Y \to \mathbb{R}$ is a pre-given loss function (usually non-negative). Given a probability distribution $P$ on $X$, define the expected risk $I_P(f)$ to be:

$$ I_P(f) = \int L(f(z), y) \, dP(z, y) $$
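As an illustration, the expected risk of a fixed predictor can be approximated by Monte Carlo averaging of the loss over draws from the distribution. The data-generating process below (a linear model with Gaussian noise) and the squared-error loss are assumptions chosen only for this sketch:

```python
import random

def expected_risk(f, sample, loss):
    """Approximate I_P(f) = E[L(f(z), y)] by averaging the loss over samples from P."""
    return sum(loss(f(z), y) for z, y in sample) / len(sample)

# Hypothetical data-generating process: y = 2z + Gaussian noise with sigma = 0.1.
random.seed(0)
sample = [(z, 2 * z + random.gauss(0, 0.1))
          for z in (random.uniform(-1, 1) for _ in range(10000))]

sq_loss = lambda y_hat, y: (y_hat - y) ** 2

# For the true regression function f(z) = 2z, the expected squared-error risk
# equals the noise variance, 0.01, so the estimate should land near that value.
risk = expected_risk(lambda z: 2 * z, sample, sq_loss)
```

With 10,000 draws the Monte Carlo error is small, so `risk` sits close to the irreducible noise variance of the assumed model.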

The general goal in statistical learning is to find the function in $\mathcal{F}$ that minimizes the expected risk. That is, to find solutions to the following problem:

$$ f^* = \operatorname{arg\,min}_{f \in \mathcal{F}} I_P(f) $$

But in practice the distribution $P$ is unknown, and any learning task can only be based on finite samples. Thus we seek instead to find an algorithm that asymptotically minimizes the expected risk, i.e., to find a sequence of functions $\{\hat{f}_n\}$, with $\hat{f}_n$ depending on the first $n$ samples, that satisfies

$$ \lim_{n \to \infty} \mathbb{P}\left[ I_P(\hat{f}_n) - \inf_{f \in \mathcal{F}} I_P(f) > \epsilon \right] = 0 \quad \text{for every } \epsilon > 0 $$

One usual approach to find such a sequence is through empirical risk minimization.
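A minimal sketch of empirical risk minimization, for the one-parameter linear class $\{f_w(z) = wz\}$ under squared loss, where the empirical minimizer has a closed form (ordinary least squares). The data model and all names here are illustrative assumptions, not part of the article:

```python
import random

def erm_linear(sample):
    """Empirical risk minimization over f_w(z) = w*z with squared loss:
    argmin_w (1/n) * sum (y - w*z)^2 has the closed form w = sum(z*y) / sum(z*z)."""
    num = sum(z * y for z, y in sample)
    den = sum(z * z for z, _ in sample)
    return num / den

# Hypothetical data: y = 3z + Gaussian noise, so the risk minimizer is w = 3.
random.seed(1)
sample = [(z, 3 * z + random.gauss(0, 0.5))
          for z in (random.uniform(-1, 1) for _ in range(2000))]

w_hat = erm_linear(sample)  # should be close to the true slope 3
```

As the sample grows, the empirical risk converges to the expected risk, so `w_hat` approaches the expected-risk minimizer.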

Learnable function class

We can make the condition given in the above equation stronger by requiring that the convergence is uniform over all probability distributions. That is:

$$ \lim_{n \to \infty} \sup_{P} \mathbb{P}\left[ I_P(\hat{f}_n) - \inf_{f \in \mathcal{F}} I_P(f) > \epsilon \right] = 0 \quad \text{for every } \epsilon > 0 $$

The intuition behind the stricter requirement is as follows: since the probability distribution $P$ is unknown, it might not be enough that $\hat{f}_n$ asymptotically minimizes the expected risk for each $P$ individually, because the rate of convergence could be very different for different $P$. The uniform convergence requirement guarantees a lower bound on the rate at which $I_P(\hat{f}_n)$ approaches the true overall minimum, regardless of $P$.

$\mathcal{F}$ is known as the hypothesis space; it is the collection of possible relationships that we are assuming between the covariates $z$ and the labels $y$. A larger $\mathcal{F}$ gives better model flexibility, but increases the risk of overfitting. Two examples of this:

  • $Y = \mathbb{R}$, $Z = \mathbb{R}^p$, $L(f(z), y) = (y - f(z))^2$, and $\mathcal{F} = \{f : f(z) = w^T z, \, w \in \mathbb{R}^p\}$ is the set of linear functions on $Z$. This is the linear least squares regression problem.
  • $Y = \{-1, 1\}$, $Z = \mathbb{R}^p$, $L(f(z), y) = \max(0, 1 - y f(z))$ (the hinge loss), and $\mathcal{F}$ is the set of linear functions on $Z$. This is the linear support-vector machine problem.
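The two losses behind these examples can be written out directly. The helper functions below are illustrative only, assuming the standard squared-error and hinge loss forms:

```python
def squared_loss(y, fz):
    """Loss for linear least squares regression: L(f(z), y) = (y - f(z))^2."""
    return (y - fz) ** 2

def hinge_loss(y, fz):
    """Loss for the linear SVM: L(f(z), y) = max(0, 1 - y*f(z)), labels y in {-1, +1}."""
    return max(0.0, 1.0 - y * fz)

# A point classified correctly with margin at least 1 incurs zero hinge loss,
# while a misclassified point is penalized linearly in its margin violation.
```

Note the hinge loss is zero on a whole region of predictions, unlike the squared loss, which penalizes every deviation from the label.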

A learning algorithm $\hat{f}_n$ is said to minimize the expected risk, or to be consistent, if, as the sample size $n \to \infty$, it gives solutions that satisfy $I_P(\hat{f}_n) \to \inf_{f \in \mathcal{F}} I_P(f)$ in probability, for every fixed $P$.
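Consistency can be checked numerically in a toy setting: over the class of constant predictors under squared loss, the expected-risk minimizer is the true mean, and the excess risk of the sample-mean estimator is exactly its squared estimation error. The Gaussian setup below is an assumed example:

```python
import random

def excess_risk(n, seed):
    """Excess expected risk I_P(f_n) - inf_f I_P(f) for the sample-mean estimator.
    For a constant predictor c under squared loss, I_P(c) = Var(y) + (c - mean)^2,
    so the excess risk is exactly (c_hat - mean)^2."""
    rng = random.Random(seed)
    mean = 5.0
    c_hat = sum(rng.gauss(mean, 1.0) for _ in range(n)) / n
    return (c_hat - mean) ** 2

# Average over several runs: excess risk scales like 1/n, so the larger
# sample size should show a much smaller excess risk.
small = sum(excess_risk(10, s) for s in range(20)) / 20
large = sum(excess_risk(10000, s) for s in range(20)) / 20
```

Here `large` is orders of magnitude below `small`, matching the expected $O(1/n)$ decay of the excess risk for this estimator.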

This extension is important, because in the real world we never know what the underlying distribution of $(z, y)$ is. We can strengthen consistency by requiring that the convergence in probability is uniform over all probability distributions:

$$ \lim_{n \to \infty} \sup_{P} \mathbb{P}\left[ I_P(\hat{f}_n) - \inf_{f \in \mathcal{F}} I_P(f) > \epsilon \right] = 0 \quad \text{for every } \epsilon > 0 $$