Uniform convergence in probability is a concept in probability theory with applications to statistical learning theory. It means that, under certain conditions, the empirical frequencies of all events in a certain event-family converge to their theoretical probabilities.
For a class of predicates $H$ defined on a set $X$ and a set of samples $x=(x_1,x_2,\dots,x_m)$, where $x_i\in X$, the empirical frequency of $h\in H$ on $x$ is

$$\widehat{Q}_x(h)=\frac{1}{m}\left|\{\,i:1\le i\le m,\ h(x_i)=1\,\}\right|.$$
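To make the definition concrete, here is a minimal sketch in Python (the predicate, distribution, and sample size are hypothetical choices for illustration, not part of the definition):

```python
import random

def empirical_frequency(h, xs):
    """Empirical frequency of predicate h on sample xs:
    the fraction of sample points on which h evaluates to 1 (True)."""
    return sum(1 for xi in xs if h(xi)) / len(xs)

# Hypothetical example: X = [0, 1), h(y) = 1 iff y < 0.3,
# with the sample drawn uniformly from X.
random.seed(0)
h = lambda y: y < 0.3
xs = [random.random() for _ in range(1000)]

print(empirical_frequency(h, xs))  # close to the theoretical probability 0.3
```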
The Uniform Convergence Theorem states, roughly, that if $H$ is "simple" and we draw samples independently (with replacement) from $X$ according to a distribution $P$, then with high probability the empirical frequency will be close to its expected value, which is the theoretical probability $Q_P(h)=P\{y\in X:h(y)=1\}$.
Here "simple" means that the Vapnik-Chervonenkis dimension of the class is small relative to the size of the sample.
In other words, a sufficiently simple collection of functions behaves roughly the same on a small random sample as it does on the distribution as a whole.
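As an illustration of this statement (a hypothetical setup chosen for simplicity, not part of the theorem): take $X=[0,1]$ with $P$ uniform and $H$ the class of threshold predicates $h_t(y)=\mathbf{1}[y\le t]$, whose VC dimension is 1 and for which $Q_P(h_t)=t$. The sketch below estimates the worst-case gap $\sup_t|Q_P(h_t)-\widehat{Q}_x(h_t)|$ and shows it shrinking as the sample grows:

```python
import random

def worst_case_deviation(xs):
    """sup over t of |Q_P(h_t) - Q-hat_x(h_t)| for h_t(y) = 1[y <= t],
    with P uniform on [0, 1] (so Q_P(h_t) = t).  It suffices to check the
    sorted sample points, where the empirical frequency jumps."""
    xs = sorted(xs)
    m = len(xs)
    dev = 0.0
    for i, xi in enumerate(xs):
        # Empirical frequency is i/m just below xi and (i + 1)/m at xi.
        dev = max(dev, abs(xi - i / m), abs(xi - (i + 1) / m))
    return dev

random.seed(0)
for m in (100, 1000, 10000):
    print(m, worst_case_deviation([random.random() for _ in range(m)]))
```

The printed gap decreases with $m$, as the theorem predicts for a class of small VC dimension.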
The Uniform Convergence Theorem was first proved by Vapnik and Chervonenkis[1] using the concept of growth function: $\Pi_H(m)$ denotes the maximum, over all choices of $m$ points from $X$, of the number of distinct labelings $(h(x_1),\dots,h(x_m))$ realized by functions $h\in H$.
From the point of view of learning theory, one can consider $H$ to be the concept/hypothesis class defined over the instance set $X$. Before getting into the details of the proof of the theorem, we state Sauer's Lemma, which we will need in the proof.
Lemma: $\Pi_H(m)\le\left(\frac{em}{d}\right)^d$ for $m\ge d$, where $d$ is the VC dimension of the concept class $H$.
Corollary: $\Pi_H(m)=O(m^d)$, i.e., the growth function is only polynomial in $m$ whenever the VC dimension of $H$ is finite.
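A quick numerical reading of the lemma (with a hypothetical VC dimension $d=3$): the bound $(em/d)^d$ grows polynomially in $m$, while the number of all possible $0/1$ labelings, $2^m$, grows exponentially:

```python
import math

def sauer_bound(m, d):
    """The (em/d)^d upper bound on the growth function from the lemma (m >= d)."""
    return (math.e * m / d) ** d

d = 3  # hypothetical VC dimension
for m in (10, 20, 40, 80):
    print(m, round(sauer_bound(m, d)), 2 ** m)
```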
Proof of uniform convergence theorem
The proof below follows [1] and [2]. Before getting into the details, we first present a high-level overview of the proof.
1. Symmetrization: We transform the problem of analyzing $|Q_P(h)-\widehat{Q}_x(h)|\ge\varepsilon$ into the problem of analyzing $|\widehat{Q}_r(h)-\widehat{Q}_s(h)|\ge\varepsilon/2$, where $r$ and $s$ are i.i.d. samples of size $m$ drawn according to the distribution $P$. One can view $r$ as the original randomly drawn sample of length $m$, while $s$ may be thought of as a testing sample used to estimate $Q_P(h)$.
2. Permutation: Since $r$ and $s$ are picked identically and independently, swapping elements between them does not change the probability distribution on $r$ and $s$. So we bound the probability of $|\widehat{Q}_r(h)-\widehat{Q}_s(h)|\ge\varepsilon/2$ for some $h\in H$ by considering the effect of a specific collection of permutations of the joint sample $x=r\|s$, namely the permutations $\sigma(x)$ which swap $x_i$ and $x_{m+i}$ for $i$ in some subset of $\{1,2,\dots,m\}$. The symbol $r\|s$ means the concatenation of $r$ and $s$.
3. Reduction to a finite class: We can now restrict the function class $H$ to a fixed joint sample, and hence, if $H$ has finite VC dimension, the problem reduces to one involving a finite function class.
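Before the formal argument, here is a Monte Carlo sanity check of the symmetrization step, under the same hypothetical threshold-class setup as above ($X=[0,1]$, $P$ uniform, $h_t(y)=\mathbf{1}[y\le t]$); the particular values of $m$, $\varepsilon$, and the trial count are arbitrary illustration choices:

```python
import random

def max_dev_true(xs):
    # sup over t of |Q_P(h_t) - Q-hat_xs(h_t)| with P uniform on [0, 1]
    xs = sorted(xs)
    m = len(xs)
    return max(max(abs(x - i / m), abs(x - (i + 1) / m))
               for i, x in enumerate(xs))

def max_dev_pair(r, s):
    # sup over t of |Q-hat_r(h_t) - Q-hat_s(h_t)|, checked at all sample points
    best = 0.0
    for t in set(r) | set(s):
        fr = sum(1 for v in r if v <= t) / len(r)
        fs = sum(1 for v in s if v <= t) / len(s)
        best = max(best, abs(fr - fs))
    return best

random.seed(1)
m, eps, trials = 25, 0.3, 2000   # m >= 2 / eps^2, as the lemma below requires
pV = sum(max_dev_true([random.random() for _ in range(m)]) >= eps
         for _ in range(trials)) / trials
pR = sum(max_dev_pair([random.random() for _ in range(m)],
                      [random.random() for _ in range(m)]) >= eps / 2
         for _ in range(trials)) / trials
print(pV, 2 * pR)  # the symmetrization lemma below asserts P^m(V) <= 2 P^2m(R)
```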
We present the technical details of the proof.
Symmetrization
Lemma: Let $V=\{x\in X^m:|Q_P(h)-\widehat{Q}_x(h)|\ge\varepsilon\ \text{for some}\ h\in H\}$ and

$$R=\{(r,s)\in X^m\times X^m:|\widehat{Q}_r(h)-\widehat{Q}_s(h)|\ge\varepsilon/2\ \text{for some}\ h\in H\}.$$

Then for $m\ge\frac{2}{\varepsilon^2}$, $P^m(V)\le 2P^{2m}(R)$.
Proof:
By the triangle inequality, if $|Q_P(h)-\widehat{Q}_r(h)|\ge\varepsilon$ and $|Q_P(h)-\widehat{Q}_s(h)|\le\varepsilon/2$, then $|\widehat{Q}_r(h)-\widehat{Q}_s(h)|\ge\varepsilon/2$.
Therefore,

$$P^{2m}(R)\ \ge\ P^{2m}\{(r,s):\exists h\in H,\ |Q_P(h)-\widehat{Q}_r(h)|\ge\varepsilon\ \text{and}\ |Q_P(h)-\widehat{Q}_s(h)|\le\varepsilon/2\}$$

and

$$P^{2m}\{(r,s):\exists h\in H,\ |Q_P(h)-\widehat{Q}_r(h)|\ge\varepsilon\ \text{and}\ |Q_P(h)-\widehat{Q}_s(h)|\le\varepsilon/2\}=\int_V P^m\{s:\exists h\in H,\ |Q_P(h)-\widehat{Q}_r(h)|\ge\varepsilon\ \text{and}\ |Q_P(h)-\widehat{Q}_s(h)|\le\varepsilon/2\}\,dP^m(r)=:A$$

[since $r$ and $s$ are independent].
Now for $r\in V$, fix an $h\in H$ such that $|Q_P(h)-\widehat{Q}_r(h)|\ge\varepsilon$. For this $h$, we shall show that

$$P^m\left\{s:|Q_P(h)-\widehat{Q}_s(h)|\le\frac{\varepsilon}{2}\right\}\ge\frac{1}{2}.$$

Thus for any $r\in V$ the inner probability is at least $\frac{1}{2}$, so $A\ge\frac{P^m(V)}{2}$ and hence $P^{2m}(R)\ge\frac{P^m(V)}{2}$. This establishes the first step of the high-level plan.
Notice that $m\cdot\widehat{Q}_s(h)$ is a binomial random variable with expectation $m\cdot Q_P(h)$ and variance $m\cdot Q_P(h)(1-Q_P(h))$. By Chebyshev's inequality we get

$$P^m\left\{s:|Q_P(h)-\widehat{Q}_s(h)|>\frac{\varepsilon}{2}\right\}\le\frac{m\cdot Q_P(h)(1-Q_P(h))}{(\varepsilon m/2)^2}\le\frac{1}{\varepsilon^2 m}\le\frac{1}{2}$$

for the stated bound on $m$. Here we use the fact that $x(1-x)\le\frac{1}{4}$ for $x\in[0,1]$.
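A small numeric check of this Chebyshev step, with hypothetical values $Q_P(h)=0.3$, $\varepsilon=0.2$, and the smallest allowed sample size $m=2/\varepsilon^2=50$:

```python
import random

q, eps = 0.3, 0.2            # hypothetical Q_P(h) and epsilon
m = int(2 / eps ** 2)        # 50, the smallest m the lemma allows

# Chebyshev bound m*q*(1-q) / (eps*m/2)^2, and its relaxation 1/(eps^2 * m).
print((m * q * (1 - q)) / (eps * m / 2) ** 2, 1 / (eps ** 2 * m))  # 0.42 <= 0.5

# Empirical estimate of P^m{ |Q_P(h) - Q-hat_s(h)| > eps/2 }.
random.seed(2)
trials = 20000
freq = sum(
    abs(q - sum(random.random() < q for _ in range(m)) / m) > eps / 2
    for _ in range(trials)
) / trials
print(freq)  # well below the Chebyshev bound above
```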
Permutations
Let $\Gamma_m$ be the set of all permutations of $\{1,2,3,\dots,2m\}$ that swap $i$ and $m+i$ for $i$ in some subset of $\{1,2,3,\dots,m\}$.
Lemma: Let $R$ be any subset of $X^{2m}$ and $P$ any probability distribution on $X$. Then,

$$P^{2m}(R)=E[\Pr[\sigma(x)\in R]]\le\max_{x\in X^{2m}}\Pr[\sigma(x)\in R],$$

where the expectation is over $x$ chosen according to $P^{2m}$, and the probability is over $\sigma$ chosen uniformly from $\Gamma_m$.
Proof:
For any $\sigma\in\Gamma_m$,

$$P^{2m}(R)=P^{2m}\{x:\sigma(x)\in R\}$$

[since coordinate permutations preserve the product distribution $P^{2m}$]. Therefore,

$$P^{2m}(R)=\frac{1}{|\Gamma_m|}\sum_{\sigma\in\Gamma_m}P^{2m}\{x:\sigma(x)\in R\}$$

$$=\frac{1}{|\Gamma_m|}\sum_{\sigma\in\Gamma_m}\int_{X^{2m}}\mathbf{1}_{\sigma(x)\in R}\,dP^{2m}(x)\qquad\text{[because $\Gamma_m$ is finite]}$$

$$=\int_{X^{2m}}\frac{1}{|\Gamma_m|}\sum_{\sigma\in\Gamma_m}\mathbf{1}_{\sigma(x)\in R}\,dP^{2m}(x)\qquad\text{[the inner average is $\Pr[\sigma(x)\in R]$]}$$

$$\le\max_{x\in X^{2m}}\left(\frac{1}{|\Gamma_m|}\sum_{\sigma\in\Gamma_m}\mathbf{1}_{\sigma(x)\in R}\right)=\max_{x\in X^{2m}}\Pr[\sigma(x)\in R].$$
The maximum is guaranteed to exist, since there is only a finite set of values that the probability under a random permutation can take.
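Note that $|\Gamma_m|=2^m$: choosing $\sigma$ uniformly from $\Gamma_m$ amounts to flipping an independent fair coin for each $i\le m$ to decide whether positions $i$ and $m+i$ are swapped. The following sketch (with a hypothetical fixed sample $x$ and a hypothetical event $R$) estimates $\Pr[\sigma(x)\in R]$ this way:

```python
import random

def random_swap_permutation(x, m):
    """Apply a uniformly random element of Gamma_m to x = r || s of length 2m:
    independently for each i, swap positions i and m + i with probability 1/2."""
    x = list(x)
    for i in range(m):
        if random.random() < 0.5:
            x[i], x[m + i] = x[m + i], x[i]
    return x

random.seed(3)
m = 10
x = [random.random() for _ in range(2 * m)]  # fixed joint sample (hypothetical)

def in_R(y):
    # Hypothetical event R: the two halves' means differ by at least 0.2.
    return abs(sum(y[:m]) / m - sum(y[m:]) / m) >= 0.2

trials = 100000
hits = sum(in_R(random_swap_permutation(x, m)) for _ in range(trials)) / trials
print(hits)  # Monte Carlo estimate of Pr[sigma(x) in R] for this fixed x
```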
Reduction to a finite class
Lemma: With $R$ and $\Gamma_m$ as in the previous lemma,

$$\max_{x\in X^{2m}}\Pr[\sigma(x)\in R]\le 2\Pi_H(2m)e^{-\varepsilon^2 m/8}.$$
Proof:
Let us fix some $x=(x_1,x_2,\dots,x_{2m})\in X^{2m}$ and let $t=|H|_x|$, the number of distinct restrictions of functions in $H$ to the points of $x$, which is at most $\Pi_H(2m)$. This means there are functions $h_1,h_2,\dots,h_t\in H$ such that for any $h\in H$ there is an $i$ between $1$ and $t$ with $h_i(x_k)=h(x_k)$ for $1\le k\le 2m$.
We see that $\sigma(x)\in R$ iff some $h$ in $H$ satisfies

$$\left|\frac{1}{m}\left|\{\,1\le i\le m:h(x_{\sigma(i)})=1\,\}\right|-\frac{1}{m}\left|\{\,m+1\le i\le 2m:h(x_{\sigma(i)})=1\,\}\right|\right|\ge\frac{\varepsilon}{2}.$$
Hence define $w_i^j=1$ if $h_j(x_i)=1$ and $w_i^j=0$ otherwise, for $1\le i\le 2m$ and $1\le j\le t$. Then $\sigma(x)\in R$ iff some $j$ in $\{1,\dots,t\}$ satisfies $\left|\frac{1}{m}\left(\sum_{i=1}^m w_{\sigma(i)}^j-\sum_{i=1}^m w_{\sigma(m+i)}^j\right)\right|\ge\frac{\varepsilon}{2}$. By the union bound we get

$$\Pr[\sigma(x)\in R]\le t\cdot\max_{1\le j\le t}\Pr\left[\left|\frac{1}{m}\left(\sum_{i=1}^m w_{\sigma(i)}^j-\sum_{i=1}^m w_{\sigma(m+i)}^j\right)\right|\ge\frac{\varepsilon}{2}\right]\le\Pi_H(2m)\cdot\max_{1\le j\le t}\Pr\left[\left|\frac{1}{m}\left(\sum_{i=1}^m w_{\sigma(i)}^j-\sum_{i=1}^m w_{\sigma(m+i)}^j\right)\right|\ge\frac{\varepsilon}{2}\right].$$
Since the distribution over the permutations $\sigma$ is uniform, for each $i$ the difference $w_{\sigma(i)}^j-w_{\sigma(m+i)}^j$ equals $\pm\left|w_i^j-w_{m+i}^j\right|$ with equal probability.
Thus,

$$\Pr\left[\left|\frac{1}{m}\sum_{i=1}^m\left(w_{\sigma(i)}^j-w_{\sigma(m+i)}^j\right)\right|\ge\frac{\varepsilon}{2}\right]=\Pr\left[\left|\frac{1}{m}\sum_{i=1}^m\left|w_i^j-w_{m+i}^j\right|\beta_i\right|\ge\frac{\varepsilon}{2}\right],$$

where the probability on the right is over the $\beta_i$, independent uniform random signs (both possibilities $\pm 1$ are equally likely). Since each summand lies in $[-\frac{1}{m},\frac{1}{m}]$ and has mean zero, by Hoeffding's inequality this is at most $2e^{-m\varepsilon^2/8}$.
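To see the Hoeffding step numerically, consider the worst case where every $|w_i^j-w_{m+i}^j|$ equals 1, so the sum is just an average of $m$ independent random signs; the sketch below (hypothetical $m$ and $\varepsilon$) compares the empirical tail probability with the bound $2e^{-m\varepsilon^2/8}$:

```python
import math
import random

random.seed(4)
m, eps, trials = 100, 0.5, 20000  # hypothetical parameters

def deviates_once():
    # Is |(1/m) * (sum of m independent +-1 signs)| >= eps / 2 ?
    return abs(sum(random.choice((-1, 1)) for _ in range(m))) / m >= eps / 2

freq = sum(deviates_once() for _ in range(trials)) / trials
print(freq, 2 * math.exp(-m * eps ** 2 / 8))  # empirical tail <= Hoeffding bound
```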
Finally, combining the three parts of the proof, we get the Uniform Convergence Theorem:

$$P^m\{x\in X^m:|Q_P(h)-\widehat{Q}_x(h)|\ge\varepsilon\ \text{for some}\ h\in H\}=P^m(V)\le 2P^{2m}(R)\le 2\max_{x\in X^{2m}}\Pr[\sigma(x)\in R]\le 4\Pi_H(2m)e^{-\varepsilon^2 m/8}.$$
References
Vapnik, V. N.; Chervonenkis, A. Ya. (1971). "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Theory of Probability & Its Applications. 16 (2): 264. doi:10.1137/1116025.
This is an English translation, by B. Seckler, of the Russian paper: "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Dokl. Akad. Nauk. 181 (4): 781. 1968.
The translation was reproduced as:
Vapnik, V. N.; Chervonenkis, A. Ya. (2015). "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Measures of Complexity. p. 11. doi:10.1007/978-3-319-21852-6_3. ISBN 978-3-319-21851-9.