For a class of predicates $H$ defined on a set $X$ and a set of samples $x = (x_1, x_2, \dots, x_m)$, where $x_i \in X$, the empirical frequency of $h \in H$ on $x$ is

$$\widehat{Q}_x(h) = \frac{1}{m}\left|\{i : 1 \le i \le m,\ h(x_i) = 1\}\right|.$$

The Uniform Convergence Theorem states, roughly, that if $H$ is "simple" and we draw samples independently (with replacement) from $X$ according to a distribution $P$, then with high probability every empirical frequency will be close to its expectation, where the expectation is given by

$$Q_P(h) = P\{y \in X : h(y) = 1\}.$$

Here "simple" means that the Vapnik–Chervonenkis dimension of the class $H$ is small relative to the size of the sample.
In other words, a sufficiently simple collection of functions behaves roughly the same on a small random sample as it does on the distribution as a whole.
If $H$ is a set of $\{0,1\}$-valued functions defined on a set $X$ and $P$ is a probability distribution on $X$, then for $\epsilon > 0$ and $m$ a positive integer, we have

$$P^m\left\{\left|Q_P(h) - \widehat{Q}_x(h)\right| \ge \epsilon \text{ for some } h \in H\right\} \le 4\Pi_H(2m)e^{-\epsilon^2 m/8},$$

where, for any $x \in X^m$,

$$Q_P(h) = P\{y \in X : h(y) = 1\} \quad \text{and} \quad \widehat{Q}_x(h) = \frac{1}{m}\left|\{i : 1 \le i \le m,\ h(x_i) = 1\}\right|.$$

$P^m$ indicates that the probability is taken over $x$ consisting of $m$ i.i.d. draws from the distribution $P$.

$\Pi_H$ is defined as follows: for any $\{0,1\}$-valued functions $H$ over $X$ and any $D \subseteq X$ (identifying each $h$ with the subset $\{y \in X : h(y) = 1\}$),

$$\Pi_H(D) = \{h \cap D : h \in H\}.$$

And for any natural number $m$, the shattering number $\Pi_H(m)$ is defined as

$$\Pi_H(m) = \max_{|D| = m}\left|\{h \cap D : h \in H\}\right|.$$
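As a concrete illustration, here is a minimal sketch that computes the empirical frequency $\widehat{Q}_x(h)$ and the shattering number $\Pi_H(m)$ by brute force. The threshold class, the sample, and the helper names are illustrative choices, not part of the original text:

```python
from itertools import combinations

def empirical_frequency(h, x):
    """Empirical frequency of predicate h on sample x: (1/m)|{i : h(x_i) = 1}|."""
    return sum(h(xi) for xi in x) / len(x)

def shattering_number(H, points, m):
    """Brute-force Pi_H(m): the largest number of distinct restrictions
    of H to any m-point subset of `points`."""
    best = 0
    for D in combinations(points, m):
        restrictions = {tuple(h(d) for d in D) for h in H}
        best = max(best, len(restrictions))
    return best

# Threshold predicates h_a(y) = 1 iff y >= a; this class has VC dimension 1.
H = [lambda y, a=a: int(y >= a) for a in (0.2, 0.4, 0.6, 0.8)]
x = [0.1, 0.3, 0.5, 0.7, 0.9]

print(empirical_frequency(H[1], x))  # 0.6: three of five points are >= 0.4
print(shattering_number(H, x, 2))    # 3: no pair realizes all 4 patterns
```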
From the point of view of learning theory, one can consider $H$ to be the concept/hypothesis class defined over the instance set $X$. Before getting into the details of the proof of the theorem, we state Sauer's lemma, which we will need in the proof.
Sauer–Shelah lemma
The Sauer–Shelah lemma[2] relates the shattering number $\Pi_H(m)$ to the VC dimension.

Lemma: $\Pi_H(m) \le \left(\frac{em}{d}\right)^d$, where $d$ is the VC dimension of the concept class $H$.

Corollary: $\Pi_H(m) = O(m^d)$.
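A quick numeric sanity check (a sketch, assuming the standard exact form $\Pi_H(m) \le \sum_{i=0}^{d}\binom{m}{i}$ of the lemma, from which the closed form above follows for $m \ge d \ge 1$):

```python
from math import comb, e

def sauer_exact(m, d):
    """Exact Sauer-Shelah bound: sum_{i=0}^{d} C(m, i)."""
    return sum(comb(m, i) for i in range(d + 1))

def sauer_closed(m, d):
    """Looser closed form (e*m/d)**d, valid for m >= d >= 1."""
    return (e * m / d) ** d

for m, d in [(10, 2), (100, 3), (1000, 5)]:
    assert sauer_exact(m, d) <= sauer_closed(m, d)
    print(m, d, sauer_exact(m, d), f"{sauer_closed(m, d):.3g}")
```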
Before getting into the technical details of the proof of the Uniform Convergence Theorem, we first present a high-level overview.
- Symmetrization: We transform the problem of analyzing $|Q_P(h) - \widehat{Q}_x(h)| \ge \epsilon$ into the problem of analyzing $|\widehat{Q}_r(h) - \widehat{Q}_s(h)| \ge \epsilon/2$, where $r$ and $s$ are i.i.d. samples of size $m$ drawn according to the distribution $P$. One can view $r$ as the original randomly drawn sample of length $m$, while $s$ may be thought of as the testing sample which is used to estimate $Q_P(h)$. (A small simulation sketch of this idea follows the list.)
- Permutation: Since $r$ and $s$ are picked identically and independently, swapping elements between them does not change the probability distribution on $r$ and $s$. So we will try to bound the probability of $|\widehat{Q}_r(h) - \widehat{Q}_s(h)| \ge \epsilon/2$ for some $h \in H$ by considering the effect of a specific collection of permutations of the joint sample $x = r \| s$. Specifically, we consider permutations $\sigma(x)$ which swap $x_i$ and $x_{m+i}$ for $i$ in some subset of $\{1, 2, \dots, m\}$. The symbol $r \| s$ means the concatenation of $r$ and $s$.
- Reduction to a finite class: We can now restrict the function class $H$ to a fixed joint sample, and hence, if $H$ has finite VC dimension, the problem reduces to one involving a finite function class.
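As referenced in the first bullet, the following small simulation sketch illustrates the symmetrization idea for a single fixed predicate: the two halves $r$ and $s$ of a $2m$-sample rarely disagree much in empirical frequency. The distribution (a biased coin) and all parameter values are hypothetical choices, not from the source:

```python
import random

def double_sample_disagreement(m=100, epsilon=0.2, trials=2000, seed=0):
    """Estimate the probability that the empirical frequencies of a fixed
    predicate on two i.i.d. m-samples r, s differ by at least epsilon/2."""
    rng = random.Random(seed)
    p = 0.3  # Pr[h(y) = 1] under the (hypothetical) distribution P
    bad = 0
    for _ in range(trials):
        q_r = sum(rng.random() < p for _ in range(m)) / m
        q_s = sum(rng.random() < p for _ in range(m)) / m
        bad += abs(q_r - q_s) >= epsilon / 2
    return bad / trials

print(double_sample_disagreement())  # small, e.g. around 0.1 or less
```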
We present the technical details of the proof.
Symmetrization
Lemma: Let $V = \{x \in X^m : |Q_P(h) - \widehat{Q}_x(h)| \ge \epsilon \text{ for some } h \in H\}$ and $R = \{(r, s) \in X^m \times X^m : |\widehat{Q}_r(h) - \widehat{Q}_s(h)| \ge \frac{\epsilon}{2} \text{ for some } h \in H\}$. Then for $m \ge \frac{2}{\epsilon^2}$,

$$P^m(V) \le 2P^{2m}(R).$$
Proof:
By the triangle inequality, if $|Q_P(h) - \widehat{Q}_r(h)| \ge \epsilon$ and $|Q_P(h) - \widehat{Q}_s(h)| \le \frac{\epsilon}{2}$, then $|\widehat{Q}_r(h) - \widehat{Q}_s(h)| \ge \frac{\epsilon}{2}$.

Therefore,

$$P^{2m}(R) \ge P^{2m}\left\{|Q_P(h) - \widehat{Q}_r(h)| \ge \epsilon \text{ and } |Q_P(h) - \widehat{Q}_s(h)| \le \frac{\epsilon}{2} \text{ for some } h \in H\right\}$$

$$= \int_V P^m\left\{s : |Q_P(h) - \widehat{Q}_s(h)| \le \frac{\epsilon}{2} \text{ for some } h \in H \text{ with } |Q_P(h) - \widehat{Q}_r(h)| \ge \epsilon\right\}\, dP^m(r)$$

[since $r$ and $s$ are independent].

Now for $r \in V$, fix an $h \in H$ such that $|Q_P(h) - \widehat{Q}_r(h)| \ge \epsilon$. For this $h$, we shall show that

$$P^m\left\{|Q_P(h) - \widehat{Q}_s(h)| \le \frac{\epsilon}{2}\right\} \ge \frac{1}{2}.$$

Thus for any $r \in V$, the inner probability is at least $\frac{1}{2}$, and hence $P^{2m}(R) \ge \frac{1}{2}P^m(V)$. This completes the first step of the high-level plan.
Notice that $m \cdot \widehat{Q}_s(h)$ is a binomial random variable with expectation $m \cdot Q_P(h)$ and variance $m \cdot Q_P(h)(1 - Q_P(h))$. By Chebyshev's inequality we get

$$P^m\left\{|Q_P(h) - \widehat{Q}_s(h)| > \frac{\epsilon}{2}\right\} \le \frac{m \cdot Q_P(h)(1 - Q_P(h))}{(\epsilon m/2)^2} \le \frac{1}{\epsilon^2 m} \le \frac{1}{2}$$

for the assumed bound on $m$, namely $m \ge \frac{2}{\epsilon^2}$. Here we use the fact that $x(1 - x) \le \frac{1}{4}$ for $x \in [0, 1]$.
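For a concrete instance (illustrative numbers, not from the source): with $\epsilon = 0.1$, the hypothesis requires $m \ge 2/\epsilon^2 = 200$, and at $m = 200$ the bound evaluates to $\frac{1}{\epsilon^2 m} = \frac{1}{0.01 \cdot 200} = \frac{1}{2}$, exactly the threshold used above; any larger $m$ only improves it.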
Permutations
Let $\Gamma_m$ be the set of all permutations of $\{1, 2, \dots, 2m\}$ that swap $i$ and $m + i$, for $i$ in some subset of $\{1, 2, \dots, m\}$.
Lemma: Let $R$ be any subset of $X^{2m}$ and $P$ any probability distribution on $X$. Then,

$$P^{2m}(R) = E[\Pr[\sigma(x) \in R]] \le \max_{x \in X^{2m}}\left(\Pr[\sigma(x) \in R]\right),$$

where the expectation is over $x$ chosen according to $P^{2m}$, and the probability is over $\sigma$ chosen uniformly from $\Gamma_m$.
Proof:
For any $\sigma \in \Gamma_m$,

$$P^{2m}(R) = P^{2m}\{x : \sigma(x) \in R\}$$

[since coordinate permutations preserve the product distribution $P^{2m}$]. Therefore,

$$P^{2m}(R) = \frac{1}{|\Gamma_m|}\sum_{\sigma \in \Gamma_m} P^{2m}\{x : \sigma(x) \in R\}$$

$$= \frac{1}{|\Gamma_m|}\sum_{\sigma \in \Gamma_m} \int_{X^{2m}} \mathbf{1}_{\sigma(x) \in R}\, dP^{2m}(x) \quad \text{[because } \Gamma_m \text{ is finite]}$$

$$= \int_{X^{2m}} \Pr[\sigma(x) \in R]\, dP^{2m}(x) \quad \text{[the expectation over } \sigma\text{]}$$

$$\le \max_{x \in X^{2m}}\left(\Pr[\sigma(x) \in R]\right).$$
The maximum is guaranteed to exist, since there is only a finite set of values that the probability under a random permutation can take.
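Concretely, drawing $\sigma$ uniformly from $\Gamma_m$ amounts to one independent fair coin flip per index deciding whether to swap $x_i$ with $x_{m+i}$. A minimal sketch (hypothetical helper name):

```python
import random

def random_gamma_m(x, rng=random):
    """Apply a uniformly random element of Gamma_m to the joint sample x
    of length 2m: for each i < m, swap x[i] and x[m+i] with probability 1/2."""
    m = len(x) // 2
    y = list(x)
    for i in range(m):
        if rng.random() < 0.5:
            y[i], y[m + i] = y[m + i], y[i]
    return y

print(random_gamma_m([1, 2, 3, 4, 5, 6]))  # e.g. [1, 5, 3, 4, 2, 6]
```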
Reduction to a finite class
Lemma: In the setting of the previous lemma,

$$\max_{x \in X^{2m}}\left(\Pr[\sigma(x) \in R]\right) \le 2\Pi_H(2m)e^{-\epsilon^2 m/8}.$$
Proof:
Let us define $x = (x_1, x_2, \dots, x_{2m})$ and let $t = |\Pi_H(x)|$, which is at most $\Pi_H(2m)$. This means there are functions $h_1, h_2, \dots, h_t \in H$ such that for any $h \in H$ there is an $i$ between $1$ and $t$ with $h_i(x_k) = h(x_k)$ for $1 \le k \le 2m$.

We see that $\sigma(x) \in R$ iff some $h$ in $H$ satisfies

$$\left|\frac{1}{m}\left(|\{1 \le i \le m : h(x_{\sigma_i}) = 1\}| - |\{m + 1 \le i \le 2m : h(x_{\sigma_i}) = 1\}|\right)\right| \ge \frac{\epsilon}{2}.$$

Hence, define $w_i^j = 1$ if $h_j(x_i) = 1$ and $w_i^j = 0$ otherwise, for $1 \le i \le 2m$ and $1 \le j \le t$. Then $\sigma(x) \in R$ iff some $j$ in $\{1, \dots, t\}$ satisfies

$$\left|\frac{1}{m}\left(\sum_i w_{\sigma_i}^j - \sum_i w_{\sigma_{m+i}}^j\right)\right| \ge \frac{\epsilon}{2}.$$

By the union bound we get

$$\Pr[\sigma(x) \in R] \le t \cdot \max_j\left(\Pr\left[\left|\frac{1}{m}\left(\sum_i w_{\sigma_i}^j - \sum_i w_{\sigma_{m+i}}^j\right)\right| \ge \frac{\epsilon}{2}\right]\right)$$

$$\le \Pi_H(2m) \cdot \max_j\left(\Pr\left[\left|\frac{1}{m}\left(\sum_i w_{\sigma_i}^j - \sum_i w_{\sigma_{m+i}}^j\right)\right| \ge \frac{\epsilon}{2}\right]\right).$$
Since the distribution over the permutations $\sigma$ is uniform, for each $i$ the difference $w_{\sigma_i}^j - w_{\sigma_{m+i}}^j$ equals $\pm|w_i^j - w_{m+i}^j|$ with equal probability. Thus,

$$\Pr\left[\left|\frac{1}{m}\sum_i\left(w_{\sigma_i}^j - w_{\sigma_{m+i}}^j\right)\right| \ge \frac{\epsilon}{2}\right] = \Pr\left[\left|\frac{1}{m}\sum_i |w_i^j - w_{m+i}^j|\,\beta_i\right| \ge \frac{\epsilon}{2}\right],$$

where the probability on the right is over the independent signs $\beta_i \in \{-1, +1\}$, both possibilities being equally likely. By Hoeffding's inequality, this is at most $2e^{-\epsilon^2 m/8}$.
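A quick simulation sketch of this last step (hypothetical parameter values; the signs $\beta_i$ are fair coin flips, and we take all $|w_i^j - w_{m+i}^j| = 1$, the worst case):

```python
import random
from math import exp

def rademacher_tail(m=200, epsilon=0.2, trials=20000, seed=1):
    """Estimate Pr[|(1/m) * sum_i beta_i| >= epsilon/2] for fair signs beta_i."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = sum(1 if rng.random() < 0.5 else -1 for _ in range(m))
        hits += abs(s / m) >= epsilon / 2
    return hits / trials

print(rademacher_tail(), 2 * exp(-0.2 ** 2 * 200 / 8))  # empirical <= Hoeffding
```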
Finally, combining all three parts of the proof, we get the Uniform Convergence Theorem: $P^m(V) \le 2P^{2m}(R) \le 2\max_{x \in X^{2m}}(\Pr[\sigma(x) \in R]) \le 2 \cdot 2\Pi_H(2m)e^{-\epsilon^2 m/8} = 4\Pi_H(2m)e^{-\epsilon^2 m/8}$.
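To get a feel for the rate, one can evaluate the right-hand side numerically, bounding $\Pi_H(2m)$ by the Sauer–Shelah estimate $\left(\frac{2em}{d}\right)^d$. The parameter values below are illustrative, not from the source; the point is that the exponential term eventually dominates the polynomial shattering number:

```python
from math import e, exp

def uniform_convergence_bound(m, epsilon, d):
    """RHS of the theorem, 4 * Pi_H(2m) * exp(-epsilon^2 * m / 8),
    with Pi_H(2m) bounded by (2em/d)^d via Sauer-Shelah."""
    return 4 * (2 * e * m / d) ** d * exp(-epsilon ** 2 * m / 8)

for m in (10_000, 50_000, 100_000):
    print(m, f"{uniform_convergence_bound(m, epsilon=0.1, d=3):.3g}")
# The bound is vacuous at m = 10000 but tiny by m = 50000.
```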
References