For a class of predicates $H$ defined on a set $X$ and a set of samples $x = (x_1, x_2, \dots, x_m)$, where $x_i \in X$, the empirical frequency of $h \in H$ on $x$ is

$\widehat{Q}_x(h) = \frac{1}{m}\left|\{ i : 1 \le i \le m,\ h(x_i) = 1 \}\right|.$

The Uniform Convergence Theorem states, roughly, that if $H$ is "simple" and we draw samples independently (with replacement) from $X$ according to a distribution $P$, then with high probability every empirical frequency will be close to its expectation, where the expectation is given by $Q_P(h) = P\{ y \in X : h(y) = 1 \}$. Here "simple" means that the Vapnik–Chervonenkis dimension of the class $H$ is small relative to the size of the sample.
In other words, a sufficiently simple collection of functions behaves roughly the same on a small random sample as it does on the distribution as a whole.
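To make the definition concrete, here is a minimal sketch (not from the referenced texts) that computes empirical frequencies for a hypothetical class of threshold predicates $h_t(x) = \mathbf{1}[x \le t]$ on $X = [0, 1]$ under the uniform distribution, for which the expectation $Q_P(h_t)$ is simply $t$:

```python
# A minimal sketch, not from the referenced texts: empirical frequency vs. expectation for a
# hypothetical class of threshold predicates h_t(x) = 1[x <= t] on X = [0, 1] under the
# uniform distribution P, for which Q_P(h_t) = t.
import random

def empirical_frequency(h, sample):
    """Fraction of sample points on which the predicate h evaluates to 1."""
    return sum(h(x) for x in sample) / len(sample)

random.seed(0)
m = 1000
sample = [random.random() for _ in range(m)]

for t in (0.1, 0.5, 0.9):
    h = lambda x, t=t: 1 if x <= t else 0
    print(f"t={t}: empirical frequency = {empirical_frequency(h, sample):.3f}, expectation = {t}")
```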
Theorem: If $H$ is a set of $\{0,1\}$-valued functions defined on a set $X$ and $P$ is a probability distribution on $X$, then for $\varepsilon > 0$ and $m$ a positive integer we have

$P^m\{\, |Q_P(h) - \widehat{Q}_x(h)| \ge \varepsilon \text{ for some } h \in H \,\} \le 4\,\Pi_H(2m)\, e^{-\varepsilon^2 m / 8},$

where, for any $x \in X^m$,

$Q_P(h) = P\{ y \in X : h(y) = 1 \}$ and $\widehat{Q}_x(h) = \frac{1}{m}\left|\{ i : 1 \le i \le m,\ h(x_i) = 1 \}\right|.$

$P^m$ indicates that the probability is taken over $x$ consisting of $m$ i.i.d. draws from the distribution $P$.

$\Pi_H$ is defined as follows: for any $\{0,1\}$-valued functions $H$ over $X$ and any $D \subseteq X$,

$\Pi_H(D) = \{ h|_D : h \in H \},$

the set of distinct restrictions of functions in $H$ to $D$. And for any natural number $m$, the shattering number $\Pi_H(m)$ is defined as

$\Pi_H(m) = \max\{\, |\Pi_H(D)| : D \subseteq X,\ |D| = m \,\}.$
From the point of view of learning theory one can consider $H$ to be the concept/hypothesis class defined over the instance set $X$. Before getting into the details of the proof of the theorem we state Sauer's Lemma, which we will need in the proof.
Sauer's Lemma relates the shattering number $\Pi_H(m)$ to the VC dimension.

Lemma: $\Pi_H(m) \le \sum_{i=0}^{d} \binom{m}{i}$, where $d$ is the VC dimension of the concept class $H$.

Corollary: If $m \ge d$, then $\Pi_H(m) \le \left(\frac{em}{d}\right)^d$.
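As an illustration, the following sketch (an assumption-laden example, not part of the source) brute-forces the shattering number of one-dimensional threshold functions, whose VC dimension is $d = 1$, and compares it with the bound $\sum_{i=0}^{d} \binom{m}{i}$ from Sauer's Lemma:

```python
# A minimal sketch, not from the source: brute-force check of Sauer's lemma for the hypothetical
# class of one-dimensional thresholds h_t(x) = 1[x >= t], whose VC dimension is d = 1.
from itertools import combinations
from math import comb

def restrictions(hypotheses, points):
    """Distinct 0/1 patterns that the class realizes on the given points (i.e. Pi_H(points))."""
    return {tuple(h(x) for x in points) for h in hypotheses}

domain = list(range(12))
thresholds = [lambda x, t=t: 1 if x >= t else 0 for t in range(13)]
d = 1  # VC dimension of thresholds

for m in (2, 4, 8):
    shatter = max(len(restrictions(thresholds, D)) for D in combinations(domain, m))
    sauer = sum(comb(m, i) for i in range(d + 1))
    print(f"m={m}: Pi_H(m) = {shatter}, Sauer bound = {sauer}")
```

For thresholds the two quantities coincide (both equal $m + 1$), so Sauer's Lemma is tight for this class.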
Before we get into the details of the proof of the Uniform Convergence Theorem, we present a high-level overview of the proof.
- Symmetrization: We transform the problem of analyzing $|Q_P(h) - \widehat{Q}_r(h)| \ge \varepsilon$ into the problem of analyzing $|\widehat{Q}_r(h) - \widehat{Q}_s(h)| \ge \varepsilon/2$, where $r$ and $s$ are i.i.d. samples of size $m$ drawn according to the distribution $P$. One can view $r$ as the original randomly drawn sample of length $m$, while $s$ may be thought of as a testing sample which is used to estimate $Q_P(h)$. (A small simulation sketch of this step appears after this list.)
- Permutation: Since $r$ and $s$ are picked identically and independently, swapping elements between them does not change the probability distribution on $r$ and $s$. So, we will try to bound the probability of $|\widehat{Q}_r(h) - \widehat{Q}_s(h)| \ge \varepsilon/2$ for some $h \in H$ by considering the effect of a specific collection of permutations of the joint sample $x = r \| s$. Specifically, we consider permutations $\sigma(x)$ which swap $x_i$ and $x_{m+i}$ in some subset of $\{1, 2, \dots, m\}$. The symbol $r \| s$ means the concatenation of $r$ and $s$.
- Reduction to a finite class: We can now restrict the function class $H$ to a fixed joint sample, and hence, if $H$ has finite VC dimension, the problem reduces to one involving a finite function class.
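The following Monte Carlo sketch (illustrative only; the class, distribution, and parameters are assumptions, not from the source) estimates the probability of the original deviation event and of the two-sample event from the symmetrization step, for thresholds on $[0, 1]$ under the uniform distribution:

```python
# A minimal sketch, not from the source: Monte Carlo illustration of symmetrization for a
# hypothetical class of thresholds h_t(x) = 1[x <= t] under the uniform distribution on [0, 1].
# V is the event that some h deviates from its expectation by eps on the sample r;
# R is the event that some h differs by eps/2 between r and an independent ghost sample s.
import random

random.seed(1)
m, eps, trials = 50, 0.2, 2000            # m >= 2 / eps^2, as the symmetrization lemma requires
thresholds = [i / 20 for i in range(21)]  # a finite grid standing in for the class

def emp(t, sample):
    return sum(1 for x in sample if x <= t) / len(sample)

count_V = count_R = 0
for _ in range(trials):
    r = [random.random() for _ in range(m)]
    s = [random.random() for _ in range(m)]
    if any(abs(t - emp(t, r)) >= eps for t in thresholds):              # event V (Q_P(h_t) = t)
        count_V += 1
    if any(abs(emp(t, r) - emp(t, s)) >= eps / 2 for t in thresholds):  # event R
        count_R += 1

print(f"P^m(V) ~ {count_V / trials:.3f}   2 * P^2m(R) ~ {2 * count_R / trials:.3f}")
```

In runs like this the estimated $P^m(V)$ stays below $2\,P^{2m}(R)$, as the symmetrization lemma below asserts.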
We present the technical details of the proof.
Symmetrization
Lemma: Let $V = \{\, x \in X^m : |Q_P(h) - \widehat{Q}_x(h)| \ge \varepsilon \text{ for some } h \in H \,\}$ and

$R = \{\, (r, s) \in X^m \times X^m : |\widehat{Q}_r(h) - \widehat{Q}_s(h)| \ge \varepsilon/2 \text{ for some } h \in H \,\}.$

Then for $m \ge \frac{2}{\varepsilon^2}$,

$P^m(V) \le 2\, P^{2m}(R).$
Proof:
By the triangle inequality, if $|Q_P(h) - \widehat{Q}_r(h)| \ge \varepsilon$ and $|Q_P(h) - \widehat{Q}_s(h)| \le \varepsilon/2$, then $|\widehat{Q}_r(h) - \widehat{Q}_s(h)| \ge \varepsilon/2$. Therefore,

$P^{2m}(R) \ge P^{2m}\{\, \exists h \in H : |Q_P(h) - \widehat{Q}_r(h)| \ge \varepsilon \text{ and } |Q_P(h) - \widehat{Q}_s(h)| \le \varepsilon/2 \,\}$

$= \int_V P^m\{\, s : |Q_P(h) - \widehat{Q}_s(h)| \le \varepsilon/2 \,\}\, dP^m(r)$

[since $r$ and $s$ are independent], where $h$ denotes a function witnessing $r \in V$.
Now for $r \in V$, fix an $h \in H$ such that $|Q_P(h) - \widehat{Q}_r(h)| \ge \varepsilon$. For this $h$, we shall show that

$P^m\left\{\, |Q_P(h) - \widehat{Q}_s(h)| \le \frac{\varepsilon}{2} \,\right\} \ge \frac{1}{2}.$

Thus for any $r \in V$, $P^m\{\, s : (r, s) \in R \,\} \ge \frac{1}{2}$, and hence $P^{2m}(R) \ge \frac{1}{2}\, P^m(V)$. This completes the first step of our high-level idea.
Notice that $m \cdot \widehat{Q}_s(h)$ is a binomial random variable with expectation $m \cdot Q_P(h)$ and variance $m \cdot Q_P(h)(1 - Q_P(h))$. By Chebyshev's inequality we get

$P^m\left\{\, |Q_P(h) - \widehat{Q}_s(h)| > \frac{\varepsilon}{2} \,\right\} \le \frac{m\, Q_P(h)(1 - Q_P(h))}{(\varepsilon m / 2)^2} \le \frac{1}{\varepsilon^2 m} \le \frac{1}{2}$

for the mentioned bound on $m$. Here we use the fact that $x(1 - x) \le \frac{1}{4}$ for $x \in [0, 1]$.
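As a worked instance of this Chebyshev step (the numbers are illustrative, not from the source), take $\varepsilon = 0.1$ and $m = 200 \ge 2/\varepsilon^2$:

$P^m\left\{\, |Q_P(h) - \widehat{Q}_s(h)| > \frac{\varepsilon}{2} \,\right\} \le \frac{1}{\varepsilon^2 m} = \frac{1}{0.01 \cdot 200} = \frac{1}{2}.$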
Permutations
Let $\Gamma_m$ be the set of all permutations of $\{1, 2, \dots, 2m\}$ that swap $i$ and $m + i$, for all $i$ in some subset of $\{1, 2, \dots, m\}$.
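A minimal sketch (not from the source) of how such a permutation can be drawn uniformly from $\Gamma_m$: for each $i \in \{1, \dots, m\}$ an independent fair coin decides whether positions $i$ and $m + i$ are swapped.

```python
# A minimal sketch, not from the source: drawing a permutation sigma uniformly from Gamma_m.
# For each i in {1, ..., m}, an independent fair coin decides whether positions i and m+i swap.
import random

def random_swap_permutation(m):
    """Return sigma as a list of indices: position k of sigma(x) holds x[sigma[k]] (0-based)."""
    sigma = list(range(2 * m))
    for i in range(m):
        if random.random() < 0.5:
            sigma[i], sigma[m + i] = sigma[m + i], sigma[i]
    return sigma

random.seed(2)
m = 4
x = ['r1', 'r2', 'r3', 'r4', 's1', 's2', 's3', 's4']  # the joint sample x = r || s
sigma = random_swap_permutation(m)
print([x[k] for k in sigma])  # each pair (r_i, s_i) is either kept in place or swapped
```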
Lemma: Let $R$ be any subset of $X^{2m}$ and $P$ any probability distribution on $X$. Then

$P^{2m}(R) = E[\Pr[\sigma(x) \in R]] \le \max_{x \in X^{2m}}\left( \Pr[\sigma(x) \in R] \right),$

where the expectation is over $x$ chosen according to $P^{2m}$, and the probability is over $\sigma$ chosen uniformly from $\Gamma_m$.
Proof:
For any $\sigma \in \Gamma_m$,

$P^{2m}(R) = P^{2m}\{\, x : \sigma(x) \in R \,\}$

[since coordinate permutations preserve the product distribution $P^{2m}$]. Therefore,

$P^{2m}(R) = \frac{1}{|\Gamma_m|} \sum_{\sigma \in \Gamma_m} P^{2m}\{\, x : \sigma(x) \in R \,\}$

$= \frac{1}{|\Gamma_m|} \sum_{\sigma \in \Gamma_m} \int_{X^{2m}} \mathbf{1}_R(\sigma(x)) \, dP^{2m}(x)$ [because $\Gamma_m$ is finite]

$= \int_{X^{2m}} \frac{1}{|\Gamma_m|} \sum_{\sigma \in \Gamma_m} \mathbf{1}_R(\sigma(x)) \, dP^{2m}(x)$ [the expectation]

$\le \max_{x \in X^{2m}}\left( \frac{1}{|\Gamma_m|} \sum_{\sigma \in \Gamma_m} \mathbf{1}_R(\sigma(x)) \right) = \max_{x \in X^{2m}}\left( \Pr[\sigma(x) \in R] \right).$

The maximum is guaranteed to exist since there is only a finite set of values that the probability under a random permutation can take.
Reduction to a Finite Class
Lemma: Based on the previous lemma,

$\max_{x \in X^{2m}}\left( \Pr[\sigma(x) \in R] \right) \le 2\, \Pi_H(2m)\, e^{-\varepsilon^2 m / 8}.$
Proof:
Let us define $x = (x_1, x_2, \dots, x_{2m})$ and $t = |\Pi_H(x)|$, which is at most $\Pi_H(2m)$. This means there are functions $h_1, h_2, \dots, h_t \in H$ such that for any $h \in H$ there is a $j$ between $1$ and $t$ with $h_j(x_i) = h(x_i)$ for $1 \le i \le 2m$.
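The following sketch (a hypothetical threshold class is used purely for illustration, not from the source) shows this reduction on a fixed joint sample: the class collapses to $t = |\Pi_H(x)|$ distinct behavior patterns, each with one representative $h_j$:

```python
# A minimal sketch, not from the source: restricting a hypothetical threshold class to a fixed
# joint sample of 2m points. The class collapses to t = |Pi_H(x)| distinct patterns, and we keep
# one representative hypothesis h_j per pattern.
def restrict_to_sample(hypotheses, points):
    """Map each distinct 0/1 pattern on the points to the index of one representative hypothesis."""
    representatives = {}
    for j, h in enumerate(hypotheses):
        pattern = tuple(h(p) for p in points)
        representatives.setdefault(pattern, j)
    return representatives

thresholds = [lambda x, t=t / 10: 1 if x <= t else 0 for t in range(11)]
joint_sample = [0.05, 0.12, 0.33, 0.48, 0.51, 0.67, 0.74, 0.99]  # 2m = 8 fixed points
reps = restrict_to_sample(thresholds, joint_sample)
print(f"t = |Pi_H(x)| = {len(reps)} distinct patterns from {len(thresholds)} hypotheses")
```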
We see that $\sigma(x) \in R$ iff some $h$ in $H$ satisfies

$\left| \frac{1}{m} \left( \left|\{ i : 1 \le i \le m,\ h(x_{\sigma(i)}) = 1 \}\right| - \left|\{ i : m + 1 \le i \le 2m,\ h(x_{\sigma(i)}) = 1 \}\right| \right) \right| \ge \frac{\varepsilon}{2}.$

Hence, define $w_i^j = 1$ if $h_j(x_i) = 1$ and $w_i^j = 0$ otherwise, for $1 \le i \le 2m$ and $1 \le j \le t$. Then $\sigma(x) \in R$ iff some $j$ in $\{1, \dots, t\}$ satisfies

$\left| \frac{1}{m} \left( \sum_i w_{\sigma(i)}^j - \sum_i w_{\sigma(m+i)}^j \right) \right| \ge \frac{\varepsilon}{2}.$

By the union bound we get

$\Pr[\sigma(x) \in R] \le t \cdot \max_j \left( \Pr\left[ \left| \frac{1}{m} \left( \sum_i w_{\sigma(i)}^j - \sum_i w_{\sigma(m+i)}^j \right) \right| \ge \frac{\varepsilon}{2} \right] \right)$

$\le \Pi_H(2m) \cdot \max_j \left( \Pr\left[ \left| \frac{1}{m} \left( \sum_i w_{\sigma(i)}^j - \sum_i w_{\sigma(m+i)}^j \right) \right| \ge \frac{\varepsilon}{2} \right] \right).$
Since the distribution over the permutations $\sigma$ is uniform, for each $j$, $w_{\sigma(i)}^j - w_{\sigma(m+i)}^j$ equals $\pm\left| w_i^j - w_{m+i}^j \right|$ with equal probability. Thus,

$\Pr\left[ \left| \frac{1}{m} \sum_i \left( w_{\sigma(i)}^j - w_{\sigma(m+i)}^j \right) \right| \ge \frac{\varepsilon}{2} \right] = \Pr\left[ \left| \frac{1}{m} \sum_i \left| w_i^j - w_{m+i}^j \right| \beta_i \right| \ge \frac{\varepsilon}{2} \right],$

where the probability on the right is over the $\beta_i$, which are independent and uniform over $\{-1, +1\}$, so both possibilities are equally likely. By Hoeffding's inequality, this is at most $2 e^{-m \varepsilon^2 / 8}$.
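A quick Monte Carlo sketch (illustrative parameters, not from the source) compares this random-sign deviation probability with the Hoeffding bound $2 e^{-m\varepsilon^2/8}$, in the worst case where every pair of coordinates differs ($|w_i^j - w_{m+i}^j| = 1$):

```python
# A minimal sketch, not from the source: empirical check of the random-sign deviation against the
# Hoeffding bound 2 * exp(-m * eps^2 / 8), in the worst case |w_i - w_{m+i}| = 1 for every i.
import math
import random

random.seed(3)
m, eps, trials = 100, 0.3, 20000
hits = 0
for _ in range(trials):
    signs = [random.choice((-1, 1)) for _ in range(m)]
    if abs(sum(signs)) / m >= eps / 2:
        hits += 1

print(f"empirical probability ~ {hits / trials:.4f}")
print(f"Hoeffding bound        = {2 * math.exp(-m * eps ** 2 / 8):.4f}")
```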
Finally, combining all three parts of the proof we get the Uniform Convergence Theorem:

$P^m(V) \le 2\, P^{2m}(R) \le 2 \cdot 2\, \Pi_H(2m)\, e^{-\varepsilon^2 m / 8} = 4\, \Pi_H(2m)\, e^{-\varepsilon^2 m / 8}.$
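To see what the final bound gives quantitatively, the sketch below (the VC dimension, $\varepsilon$, and sample sizes are assumed values for illustration) evaluates $4\,\Pi_H(2m)\,e^{-\varepsilon^2 m / 8}$ after bounding $\Pi_H(2m)$ via Sauer's Lemma:

```python
# A minimal sketch, not from the source: numeric evaluation of the final bound
# 4 * Pi_H(2m) * exp(-eps^2 * m / 8), with Pi_H(2m) bounded via Sauer's lemma for an assumed
# VC dimension d. All numbers are purely illustrative.
import math
from math import comb

def uniform_convergence_bound(d, m, eps):
    shatter_bound = sum(comb(2 * m, i) for i in range(d + 1))  # Sauer's lemma bound on Pi_H(2m)
    return 4 * shatter_bound * math.exp(-eps ** 2 * m / 8)

d, eps = 5, 0.1
for m in (10_000, 50_000, 100_000):
    print(f"m = {m}: bound = {uniform_convergence_bound(d, m, eps):.3e}")
```

The bound is vacuous for small $m$ and decays rapidly once $m$ is large relative to $d/\varepsilon^2$, which is the quantitative content of the theorem.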
References
- ^ Martin Anthony, Peter L. Bartlett. Neural Network Learning: Theoretical Foundations, pp. 46–50. First Edition, 1999. Cambridge University Press. ISBN 0-521-57353-X.
- ^ Sham Kakade and Ambuj Tewari, CMSC 35900 (Spring 2008) Learning Theory, Lecture 11.