
Learnable function class

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Syanuma1 (talk | contribs) at 00:07, 16 December 2015 (Background). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

In statistical learning theory, a learnable function class is a set of functions for which an algorithm can be devised to asymptotically minimize the expected risk, uniformly over all probability distributions. The concept of learnable classes is closely linked to regularization in machine learning.

Definition

Background

Let $X = Y \times Z$ be the sample space, where $Y$ are the labels and $Z$ are the covariates (predictors). $\mathcal{F} = \{f : Z \to Y\}$ is a collection of mappings (functions) under consideration to link $Z$ to $Y$. $L : Y \times Y \to \mathbb{R}$ is a pre-given loss function (usually non-negative). Given a probability distribution $P$ on $X$, define the expected risk $I_P(f)$ to be:

$$ I_P(f) = \int L(f(z), y) \, dP(z, y) $$
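As an illustration, the expected risk of a fixed predictor can be approximated by Monte Carlo averaging of the loss over draws from the distribution. The data-generating process below (a linear model with Gaussian noise) and the squared-error loss are assumptions chosen only for this sketch:

```python
import random

def expected_risk(f, sample, loss):
    """Approximate I_P(f) = E[L(f(z), y)] by averaging the loss over samples from P."""
    return sum(loss(f(z), y) for z, y in sample) / len(sample)

# Hypothetical data-generating process: y = 2z + Gaussian noise with sigma = 0.1.
random.seed(0)
sample = [(z, 2 * z + random.gauss(0, 0.1))
          for z in (random.uniform(-1, 1) for _ in range(10000))]

sq_loss = lambda y_hat, y: (y_hat - y) ** 2

# For the true regression function f(z) = 2z, the expected squared-error risk
# equals the noise variance, 0.01, so the estimate should land near that value.
risk = expected_risk(lambda z: 2 * z, sample, sq_loss)
```

With 10,000 draws the Monte Carlo error is small, so `risk` sits close to the irreducible noise variance of the assumed model.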

The general goal in statistical learning is to find the function in $\mathcal{F}$ that minimizes the expected risk. That is, to find solutions to the following problem:

$$ f^* = \operatorname{arg\,min}_{f \in \mathcal{F}} I_P(f) $$

But in practice the distribution $P$ is unknown, and any learning task can only be based on finite samples. Thus we seek instead to find an algorithm that asymptotically minimizes the expected risk, i.e., to find a sequence of functions $\{\hat{f}_n\}$, with $\hat{f}_n$ depending on the first $n$ samples, that satisfies

$$ \lim_{n \to \infty} \mathbb{P}\left[ I_P(\hat{f}_n) - \inf_{f \in \mathcal{F}} I_P(f) > \epsilon \right] = 0 \quad \text{for every } \epsilon > 0 $$

One usual approach to find such a sequence is through empirical risk minimization.
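A minimal sketch of empirical risk minimization, for the one-parameter linear class $\{f_w(z) = wz\}$ under squared loss, where the empirical minimizer has a closed form (ordinary least squares). The data model and all names here are illustrative assumptions, not part of the article:

```python
import random

def erm_linear(sample):
    """Empirical risk minimization over f_w(z) = w*z with squared loss:
    argmin_w (1/n) * sum (y - w*z)^2 has the closed form w = sum(z*y) / sum(z*z)."""
    num = sum(z * y for z, y in sample)
    den = sum(z * z for z, _ in sample)
    return num / den

# Hypothetical data: y = 3z + Gaussian noise, so the risk minimizer is w = 3.
random.seed(1)
sample = [(z, 3 * z + random.gauss(0, 0.5))
          for z in (random.uniform(-1, 1) for _ in range(2000))]

w_hat = erm_linear(sample)  # should be close to the true slope 3
```

As the sample grows, the empirical risk converges to the expected risk, so `w_hat` approaches the expected-risk minimizer.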

Learnable function class

We can make the condition given in the above equation stronger by requiring that the convergence is uniform over all probability distributions. That is:

$$ \lim_{n \to \infty} \sup_{P} \mathbb{P}\left[ I_P(\hat{f}_n) - \inf_{f \in \mathcal{F}} I_P(f) > \epsilon \right] = 0 \quad \text{for every } \epsilon > 0 $$

The intuition behind the stricter requirement is as follows: since the probability distribution $P$ is unknown, it might not be enough that $\hat{f}_n$ asymptotically minimizes the expected risk for each $P$ individually, because the rate of convergence could be very different for different $P$. The uniform convergence requirement guarantees a lower bound on the rate at which $I_P(\hat{f}_n)$ approaches the true overall minimum, regardless of $P$.

$\mathcal{F}$ is known as the hypothesis space; it is the collection of possible relationships that we are assuming between the covariates $z$ and the labels $y$. A larger $\mathcal{F}$ gives better model flexibility, but increases the risk of overfitting. Two examples of this:

  • $Y = \mathbb{R}$, $Z = \mathbb{R}^p$, $L(f(z), y) = (y - f(z))^2$, and $\mathcal{F} = \{f : f(z) = w^T z, \, w \in \mathbb{R}^p\}$ is the set of linear functions on $Z$. This is the linear least squares regression problem.
  • $Y = \{-1, 1\}$, $Z = \mathbb{R}^p$, $L(f(z), y) = \max(0, 1 - y f(z))$ (the hinge loss), and $\mathcal{F}$ is the set of linear functions on $Z$. This is the linear support-vector machine problem.
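The two losses behind these examples can be written out directly. The helper functions below are illustrative only, assuming the standard squared-error and hinge loss forms:

```python
def squared_loss(y, fz):
    """Loss for linear least squares regression: L(f(z), y) = (y - f(z))^2."""
    return (y - fz) ** 2

def hinge_loss(y, fz):
    """Loss for the linear SVM: L(f(z), y) = max(0, 1 - y*f(z)), labels y in {-1, +1}."""
    return max(0.0, 1.0 - y * fz)

# A point classified correctly with margin at least 1 incurs zero hinge loss,
# while a misclassified point is penalized linearly in its margin violation.
```

Note the hinge loss is zero on a whole region of predictions, unlike the squared loss, which penalizes every deviation from the label.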

A learning algorithm $\hat{f}_n$ is said to minimize the expected risk, or to be consistent, if, as the sample size $n \to \infty$, it gives solutions that satisfy $I_P(\hat{f}_n) \to \inf_{f \in \mathcal{F}} I_P(f)$ in probability, for every fixed $P$.
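Consistency can be checked numerically in a toy setting: over the class of constant predictors under squared loss, the expected-risk minimizer is the true mean, and the excess risk of the sample-mean estimator is exactly its squared estimation error. The Gaussian setup below is an assumed example:

```python
import random

def excess_risk(n, seed):
    """Excess expected risk I_P(f_n) - inf_f I_P(f) for the sample-mean estimator.
    For a constant predictor c under squared loss, I_P(c) = Var(y) + (c - mean)^2,
    so the excess risk is exactly (c_hat - mean)^2."""
    rng = random.Random(seed)
    mean = 5.0
    c_hat = sum(rng.gauss(mean, 1.0) for _ in range(n)) / n
    return (c_hat - mean) ** 2

# Average over several runs: excess risk scales like 1/n, so the larger
# sample size should show a much smaller excess risk.
small = sum(excess_risk(10, s) for s in range(20)) / 20
large = sum(excess_risk(10000, s) for s in range(20)) / 20
```

Here `large` is orders of magnitude below `small`, matching the expected $O(1/n)$ decay of the excess risk for this estimator.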

This extension is important, because in the real world we never know what the underlying distribution of $(z, y)$ is. We can strengthen consistency by requiring that the convergence in probability is uniform over all probability distributions:

$$ \lim_{n \to \infty} \sup_{P} \mathbb{P}\left[ I_P(\hat{f}_n) - \inf_{f \in \mathcal{F}} I_P(f) > \epsilon \right] = 0 \quad \text{for every } \epsilon > 0 $$