Distribution learning theory

Distribution learning theory, or the learning of probability distributions, is a framework in computational learning theory. It was proposed by Michael Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, Robert Schapire and Linda Sellie in 1994 ^[1] and was inspired by the PAC framework introduced by Leslie Valiant ^[2].

In this framework the input is a number of samples drawn from a distribution that belongs to a specific class of distributions. The goal is to find an efficient algorithm that, based on these samples, determines with high probability the distribution from which the samples have been drawn. Because of its generality, this framework has been used in a large variety of fields such as machine learning, approximation algorithms, applied probability and statistics.

In this article we explain the basic definitions, tools and results of this framework from the point of view of the theory of computation.

Basic Definitions

Let $\textstyle X$ be the support of the distributions of interest. As in the original work of Kearns et al. ^[1], if $\textstyle X$ is finite we can assume without loss of generality that $\textstyle X=\{0,1\}^{n}$, where $\textstyle n$ is the number of bits needed to represent any $\textstyle y\in X$. We are interested in probability distributions over $\textstyle X$.

We now describe two possible representations of a probability distribution $\textstyle D$ over $\textstyle X$.

Probability distribution function (or evaluator): an evaluator $\textstyle E_{D}$ for $\textstyle D$ takes as input any $\textstyle y\in X$ and outputs a real number $\textstyle E_{D}[y]$, the probability of $\textstyle y$ according to $\textstyle D$, i.e. $\textstyle E_{D}[y]=\Pr[Y=y]$ if $\textstyle Y\sim D$.

Generator: a generator $\textstyle G_{D}$ for $\textstyle D$ takes as input a string $\textstyle y$ of truly random bits and outputs $\textstyle G_{D}[y]\in X$, distributed according to $\textstyle D$.

We say that $\textstyle D$ has a polynomial generator (respectively evaluator) if its generator (respectively evaluator) can be computed in polynomial time.
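As an illustration only, the two representations can be sketched in Python for a product distribution over $\textstyle \{0,1\}^{n}$. The choice of product distributions, the function names `make_evaluator` and `make_generator`, and the use of `random.Random` as the source of random bits are assumptions made for this example and do not come from the original definitions.

```python
import random

def make_evaluator(p):
    """Evaluator E_D for a product distribution on {0,1}^n with Pr[y_i = 1] = p[i]."""
    def evaluate(y):
        prob = 1.0
        for bit, p_i in zip(y, p):
            prob *= p_i if bit == 1 else 1.0 - p_i
        return prob
    return evaluate

def make_generator(p):
    """Generator G_D: consumes randomness from rng and returns a point of {0,1}^n."""
    def generate(rng):
        return tuple(1 if rng.random() < p_i else 0 for p_i in p)
    return generate

# Hypothetical example parameters with n = 3.
p = [0.5, 0.25, 0.9]
E = make_evaluator(p)
G = make_generator(p)
rng = random.Random(0)
sample = G(rng)          # one draw from D
prob_101 = E((1, 0, 1))  # Pr[Y = (1, 0, 1)] = 0.5 * 0.75 * 0.9
```

Both functions run in time linear in $\textstyle n$, so this particular distribution has a polynomial generator and a polynomial evaluator in the sense defined above.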

Next we define learnability of an arbitrary class of distributions for the two types of representation described above. Let $\textstyle C_{X}$ be a class of distributions over $\textstyle X$, that is, $\textstyle C_{X}$ is a set such that every $\textstyle D\in C_{X}$ is a probability distribution with support $\textstyle X$. From now on we refer to $\textstyle C_{X}$ as $\textstyle C$ for simplicity.

Before defining learnability we have to define what a good approximation of a distribution $\textstyle D$ is. For this purpose we need a way to measure the distance between two distributions. The three most common choices are

Kullback-Leibler divergence

Total variation distance

Kolmogorov distance

The strongest of these distances is the Kullback-Leibler divergence and the weakest is the Kolmogorov distance. Because the following definitions hold for all of these distances, we write $\textstyle d(D,D')$ to denote the distance between the distributions $\textstyle D$ and $\textstyle D'$ under any one of the distances described above. The learnability of a class of distributions can be defined using any of these distances, but each application refers to a specific distance.
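For distributions on a finite, ordered support the three distances have standard closed forms: the total variation distance is $\textstyle {\frac {1}{2}}\sum _{x}|P(x)-Q(x)|$, the Kullback-Leibler divergence is $\textstyle \sum _{x}P(x)\log(P(x)/Q(x))$, and the Kolmogorov distance is the largest gap between the two cumulative distribution functions. The following Python sketch computes them for distributions given as lists of probabilities; the list representation and the example vectors are assumptions made for illustration.

```python
import math

def total_variation(p, q):
    """Half of the L1 distance between two probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl_divergence(p, q):
    """KL(p || q) in nats; infinite if q misses mass that p has."""
    total = 0.0
    for a, b in zip(p, q):
        if a > 0:
            if b == 0:
                return math.inf
            total += a * math.log(a / b)
    return total

def kolmogorov(p, q):
    """Largest absolute difference between the two CDFs."""
    cdf_p = cdf_q = 0.0
    worst = 0.0
    for a, b in zip(p, q):
        cdf_p += a
        cdf_q += b
        worst = max(worst, abs(cdf_p - cdf_q))
    return worst

# Hypothetical example on the ordered support {0, 1, 2}.
p = [0.5, 0.0, 0.5]
q = [0.25, 0.5, 0.25]
tv = total_variation(p, q)   # 0.5
ks = kolmogorov(p, q)        # 0.25
kl = kl_divergence(p, q)     # log 2
```

On this example the values satisfy Kolmogorov distance ≤ total variation distance ≤ $\textstyle {\sqrt {KL/2}}$ (the last step is Pinsker's inequality), which is the sense in which the Kullback-Leibler divergence is the strongest and the Kolmogorov distance the weakest of the three.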

The basic input used to learn a distribution is a number of samples drawn from this distribution. From the computational point of view we assume that obtaining such a sample takes a constant amount of time. This is like having access to an oracle $\textstyle GEN(D)$ that returns a sample from the distribution $\textstyle D$. Sometimes we are interested, apart from the time complexity, in the number of samples used to learn a specific distribution $\textstyle D$. We call this quantity the sample complexity of the learning algorithm.
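A minimal sketch of such an oracle, assuming a finite support and an explicitly known probability vector (neither of which a real learner would see), is given below. The class name `GenOracle` and its `calls` counter are hypothetical; the counter simply makes the sample complexity of whatever algorithm uses the oracle observable.

```python
import random

class GenOracle:
    """Hypothetical stand-in for GEN(D): each call returns one sample and is counted."""

    def __init__(self, outcomes, probabilities, seed=0):
        self._outcomes = outcomes
        self._probabilities = probabilities
        self._rng = random.Random(seed)
        self.calls = 0  # number of samples requested so far

    def sample(self):
        self.calls += 1
        return self._rng.choices(self._outcomes, weights=self._probabilities, k=1)[0]

# Example distribution chosen only for illustration.
oracle = GenOracle(outcomes=[0, 1, 2], probabilities=[0.2, 0.5, 0.3])
draws = [oracle.sample() for _ in range(100)]
# oracle.calls now equals the number of samples used, here 100.
```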

Definition of learnability

We call a class of distributions $\textstyle C$ efficiently learnable if for every $\textstyle \epsilon >0$ and $\textstyle 0<\delta \leq 1$, given access to $\textstyle GEN(D)$ for an unknown distribution $\textstyle D\in C$, there exists a polynomial time algorithm $\textstyle A$, called a learning algorithm of $\textstyle C$, that outputs a generator or an evaluator of a distribution $\textstyle D'$ such that

$\Pr[d(D,D')\leq \epsilon ]\geq 1-\delta$

If we know that $\textstyle D'\in C$ then $\textstyle A$ is called a proper learning algorithm; otherwise it is called an improper learning algorithm.
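The definition can be explored empirically on a toy case. The sketch below assumes a distribution on a small finite support, uses the empirical frequencies as the learned evaluator, measures distance by total variation, and estimates $\textstyle \Pr[d(D,D')\leq \epsilon ]$ by repeating the experiment. The learner, the target distribution and all parameter values are assumptions chosen for illustration; they are not an algorithm from the cited works.

```python
import random

def learn_empirical(samples, support_size):
    """Return the empirical frequencies as an evaluator (list of probabilities)."""
    counts = [0] * support_size
    for s in samples:
        counts[s] += 1
    return [c / len(samples) for c in counts]

def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def estimate_success_probability(true_dist, m, epsilon, trials, seed=0):
    """Fraction of independent runs in which the learned D' is epsilon-close to D."""
    rng = random.Random(seed)
    support = list(range(len(true_dist)))
    successes = 0
    for _ in range(trials):
        samples = rng.choices(support, weights=true_dist, k=m)  # m calls to GEN(D)
        hypothesis = learn_empirical(samples, len(true_dist))
        if total_variation(true_dist, hypothesis) <= epsilon:
            successes += 1
    return successes / trials

# Hypothetical target distribution on {0, 1, 2, 3}.
true_dist = [0.1, 0.2, 0.3, 0.4]
rate = estimate_success_probability(true_dist, m=2000, epsilon=0.05, trials=200)
```

With these settings the estimated success rate should be close to 1; lowering `m` or `epsilon` shows how the achievable $\textstyle 1-\delta$ degrades when fewer samples are available.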

In some settings the class of distributions $\textstyle C$ consists of well known distributions that can be described by a set of parameters. For instance $\textstyle C$ could be the class of all Gaussian distributions $\textstyle N(\mu ,\sigma ^{2})$, in which case the algorithm $\textstyle A$ should be able to estimate the parameters $\textstyle \mu ,\sigma$. In this case $\textstyle A$ is called a parameter learning algorithm.
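For the Gaussian example, a parameter learning algorithm can be as simple as the sample mean and the sample standard deviation. The sketch below assumes a one-dimensional Gaussian and hypothetical true parameters $\textstyle \mu =2$, $\textstyle \sigma =3$, chosen only to generate test data.

```python
import math
import random

def estimate_gaussian(samples):
    """Estimate (mu, sigma) by the sample mean and the unbiased sample variance."""
    m = len(samples)
    mu_hat = sum(samples) / m
    var_hat = sum((x - mu_hat) ** 2 for x in samples) / (m - 1)
    return mu_hat, math.sqrt(var_hat)

# Draw samples from N(2, 3^2); random.gauss takes the standard deviation.
rng = random.Random(1)
samples = [rng.gauss(2.0, 3.0) for _ in range(20000)]
mu_hat, sigma_hat = estimate_gaussian(samples)
```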

Parameter learning for simple distributions is a very well studied field called statistical estimation, and there is an extensive literature on estimators for different kinds of simple known distributions. Therefore the rest of this article is concerned with learning classes of distributions that have a more complicated description.

First results

In their seminal work, Kearns et al. ^[1] deal with the case where $\textstyle A$ is described in terms of a finite polynomial-sized circuit, and they proved the following for some specific classes of distributions:

$\textstyle OR$ gate distributions: for this kind of distribution there is no polynomial-sized evaluator, unless $\textstyle \#P\subseteq P/{\text{poly}}$. On the other hand, this class is efficiently learnable with a generator.

Parity gate distributions: this class is efficiently learnable with both a generator and an evaluator.

Mixtures of Hamming Balls: this class is efficiently learnable with both a generator and an evaluator.

Probabilistic Finite Automata: this class is not efficiently learnable with an evaluator under the Noisy Parity Assumption, which is an impossibility assumption in the PAC learning framework.

$\textstyle \epsilon -$ Covers

One very common technique for finding a learning algorithm for a class of distributions $\textstyle C$ is to first find a small $\textstyle \epsilon -$ cover of $\textstyle C$.

Definition

A set $\textstyle C_{\epsilon }$ is called an $\textstyle \epsilon$ -cover of $\textstyle C$ if for every $\textstyle D\in C$ there is a $\textstyle D'\in C_{\epsilon }$ such that $\textstyle d(D,D')\leq \epsilon$.

An $\textstyle \epsilon -$ cover is small if it has polynomial size with respect to the parameters that describe $\textstyle D$ which may differ from one model to another.
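A very small worked example, assuming the class of all Bernoulli distributions and the total variation distance: since the total variation distance between Bernoulli($p$) and Bernoulli($q$) equals $|p-q|$, a grid of parameters with spacing $2\epsilon$ is an $\epsilon$-cover of size about $1/(2\epsilon )$. The function names below are hypothetical.

```python
import math

def bernoulli_cover(epsilon):
    """Parameters q_j such that every p in [0, 1] has some q_j with |p - q_j| <= epsilon."""
    size = math.ceil(1.0 / (2.0 * epsilon))
    # Grid points epsilon, 3*epsilon, 5*epsilon, ... clipped to [0, 1].
    return [min(1.0, (2 * j + 1) * epsilon) for j in range(size)]

def distance_to_cover(p, cover):
    """Total variation distance from Bernoulli(p) to the closest cover element."""
    return min(abs(p - q) for q in cover)

cover = bernoulli_cover(0.05)  # ten grid points: 0.05, 0.15, ..., 0.95
```

The cover has size $O(1/\epsilon )$, so it is small in the sense described above; covers of richer classes are usually much harder to construct.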

Once there is an efficient procedure that for every $\textstyle \epsilon >0$ finds a small $\textstyle \epsilon -$ cover $\textstyle C_{\epsilon }$ of $\textstyle C$, the only remaining task is to select from $\textstyle C_{\epsilon }$ the distribution $\textstyle D'\in C_{\epsilon }$ that is closest to the distribution $\textstyle D\in C$ that we want to learn.

The problem is that, given $\textstyle D',D''\in C_{\epsilon }$, it is not trivial to compare $\textstyle d(D,D')$ and $\textstyle d(D,D'')$ in order to decide which one is closer to $\textstyle D$, because $\textstyle D$ is unknown. Therefore the samples from $\textstyle D$ have to be used for these comparisons, and so the result of each comparison has a probability of error. The task is thus similar to finding the minimum element of a set using noisy comparisons. There are many classical algorithms for achieving this goal. The most recent one, which achieves the best guarantees, was proposed by Daskalakis and Kamath ^[3]. It sets up a fast tournament between the elements of $\textstyle C_{\epsilon }$, where the winner $\textstyle D^{*}$ of this tournament is $\textstyle \epsilon -$ close to $\textstyle D$ (i.e. $\textstyle d(D^{*},D)\leq \epsilon$ ) with probability at least $\textstyle 1-\delta$. To do so, their algorithm uses $\textstyle O(\log N/\epsilon ^{2})$ samples from $\textstyle D$ and runs in $\textstyle O(N\log N/\epsilon ^{2})$ time, where $\textstyle N=|C_{\epsilon }|$.
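To make the idea of noisy comparisons concrete, the sketch below implements a classical and much simpler scheme than the fast tournament of ^[3]: every pair of candidates is compared with a Scheffé-style test, and the candidate with the most wins is returned. This round-robin uses a number of comparisons quadratic in $\textstyle N$ and is shown only as an illustration; the candidate list, the target distribution and the function names are assumptions.

```python
import random

def scheffe_winner(h1, h2, samples):
    """Return 0 if h1 matches the samples better on the set {x : h1(x) > h2(x)}, else 1."""
    w = {x for x in range(len(h1)) if h1[x] > h2[x]}
    p1 = sum(h1[x] for x in w)  # mass that h1 assigns to the set
    p2 = sum(h2[x] for x in w)  # mass that h2 assigns to the set
    tau = sum(1 for s in samples if s in w) / len(samples)  # empirical mass
    return 0 if abs(p1 - tau) <= abs(p2 - tau) else 1

def round_robin_select(hypotheses, samples):
    """Compare every pair once and return the index with the most wins."""
    wins = [0] * len(hypotheses)
    for i in range(len(hypotheses)):
        for j in range(i + 1, len(hypotheses)):
            if scheffe_winner(hypotheses[i], hypotheses[j], samples) == 0:
                wins[i] += 1
            else:
                wins[j] += 1
    return max(range(len(hypotheses)), key=lambda i: wins[i])

# Hypothetical cover of four candidates; the unknown D equals candidate 1.
rng = random.Random(0)
true_dist = [0.5, 0.3, 0.2]
hypotheses = [[0.1, 0.1, 0.8], [0.5, 0.3, 0.2], [0.3, 0.4, 0.3], [0.8, 0.1, 0.1]]
samples = rng.choices(range(3), weights=true_dist, k=2000)
best = round_robin_select(hypotheses, samples)
```

Each comparison only needs the empirical mass of one set, which is why the same batch of samples can be reused across all pairs.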

Learning Sums of Random Variables

As mentioned in the basic definitions section, learning simple well known distributions is a well studied field with many available estimators. A more complicated class is that of distributions of sums of variables that follow simple distributions. These learning procedures are closely related to limit theorems such as the central limit theorem, because they tend to examine the same object as the number of summands tends to infinity. Two recent results are described here: learning Poisson binomial distributions and learning sums of independent integer random variables. All the results below use the total variation distance as the distance measure.

Learning Poisson Binomial Distributions ^[4]

We consider $\textstyle n$ independent Bernoulli random variables $\textstyle X_{1},\dots ,X_{n}$ with probabilities of success $\textstyle p_{1},\dots ,p_{n}$. A Poisson Binomial Distribution of order $\textstyle n$ is the distribution of the sum $\textstyle X=\sum _{i}X_{i}$. For learning the class $\textstyle PBD=\{D:D{\text{ is a Poisson binomial distribution}}\}$ we have the following two results. The first deals with improper learning of $\textstyle PBD$ and the second with proper learning of $\textstyle PBD$.

Theorem

Let $\textstyle D\in PBD$. Then there is an algorithm which, given $\textstyle n$, $\textstyle \epsilon >0$, $\textstyle 0<\delta \leq 1$ and access to $\textstyle GEN(D)$, finds a $\textstyle D'$ such that $\textstyle \Pr[d(D,D')\leq \epsilon ]\geq 1-\delta$. The sample complexity of this algorithm is $\textstyle {\tilde {O}}((1/\epsilon ^{3})\log(1/\delta ))$ and the running time is $\textstyle {\tilde {O}}((1/\epsilon ^{3})\log n\log ^{2}(1/\delta ))$.

Theorem

Let $\textstyle D\in PBD$. Then there is an algorithm which, given $\textstyle n$, $\textstyle \epsilon >0$, $\textstyle 0<\delta \leq 1$ and access to $\textstyle GEN(D)$, finds a $\textstyle D'\in PBD$ such that $\textstyle \Pr[d(D,D')\leq \epsilon ]\geq 1-\delta$. The sample complexity of this algorithm is $\textstyle {\tilde {O}}((1/\epsilon ^{2})\log(1/\delta ))$ and the running time is $\textstyle (1/\epsilon )^{O(\log ^{2}(1/\epsilon ))}{\tilde {O}}(\log n\log(1/\delta ))$.

One very interesting aspect of the above results is that the sample complexity of the learning algorithm does not depend on $\textstyle n$, although the description of $\textstyle D$ is linear in $\textstyle n$. The second result is also almost optimal with respect to the sample complexity, because there is a lower bound of $\textstyle \Omega (1/\epsilon ^{2})$.

The proof uses a small $\textstyle \epsilon -$ cover of $\textstyle PBD$, constructed by Daskalakis and Papadimitriou ^[5], to obtain this algorithm.
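For reference, the object being learned in this subsection can be computed and sampled directly when the parameters $\textstyle p_{1},\dots ,p_{n}$ are known. The sketch below computes the probability mass function of a Poisson binomial distribution by adding one Bernoulli variable at a time, and draws samples in the way a $\textstyle GEN(D)$ oracle would. It is not the learning algorithm of ^[4]; the function names and the example parameters are assumptions.

```python
import random

def pbd_pmf(probs):
    """PMF of X = X_1 + ... + X_n for independent Bernoulli(p_i), as a list of length n + 1."""
    pmf = [1.0]  # distribution of the empty sum
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for s, mass in enumerate(pmf):
            nxt[s] += mass * (1.0 - p)   # the new variable is 0
            nxt[s + 1] += mass * p       # the new variable is 1
        pmf = nxt
    return pmf

def pbd_sample(probs, rng):
    """One draw of the sum, as GEN(D) would return it."""
    return sum(1 for p in probs if rng.random() < p)

# Hypothetical parameters with n = 3.
probs = [0.1, 0.5, 0.9]
pmf = pbd_pmf(probs)
rng = random.Random(0)
draws = [pbd_sample(probs, rng) for _ in range(1000)]
```

The computation takes $O(n^{2})$ arithmetic operations, which illustrates that the description of $\textstyle D$ grows with $\textstyle n$ even though the sample complexity in the theorems above does not.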

Learning Sums of Independent Integer Random Variables ^[6]

We consider $\textstyle n$ independent random variables $\textstyle X_{1},\dots ,X_{n}$, each of which follows an arbitrary distribution with support $\textstyle \{0,1,\dots ,k-1\}$. A $\textstyle k-$ sum of independent integer random variables of order $\textstyle n$ is the distribution of the sum $\textstyle X=\sum _{i}X_{i}$. For learning the class

$\textstyle k-SIIRV=\{D:D{\text{ is a k-sum of independent integer random variables}}\}$

we have the following result.

Theorem

Let $\textstyle D\in k-SIIRV$. Then there is an algorithm which, given $\textstyle n$, $\textstyle \epsilon >0$ and access to $\textstyle GEN(D)$, finds a $\textstyle D'$ such that $\textstyle \Pr[d(D,D')\leq \epsilon ]\geq 1-\delta$. The sample complexity of this algorithm is $\textstyle {\text{poly}}(k/\epsilon )$ and the running time is also $\textstyle {\text{poly}}(k/\epsilon )$.

Again, one interesting aspect is that the sample complexity and the time complexity do not depend on $\textstyle n$. The same independence for the Poisson binomial case follows by setting $\textstyle k=2$.
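The relation between the two classes can be checked numerically. The sketch below computes the distribution of a $\textstyle k-$ sum by repeated convolution of the component distributions and verifies that, for $\textstyle k=2$ with equal success probabilities, the result is an ordinary binomial distribution. The function names and the example components are assumptions made for illustration.

```python
import math

def convolve(a, b):
    """PMF of the sum of two independent integer variables with PMFs a and b."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def siirv_pmf(component_pmfs):
    """PMF of X_1 + ... + X_n, where component_pmfs[i] is the PMF of X_i on {0, ..., k-1}."""
    pmf = [1.0]
    for component in component_pmfs:
        pmf = convolve(pmf, component)
    return pmf

# Hypothetical example with k = 3 and n = 3: the sum is supported on {0, ..., 6}.
components = [[0.2, 0.3, 0.5], [0.6, 0.4, 0.0], [0.1, 0.1, 0.8]]
pmf = siirv_pmf(components)

# Special case k = 2 with p_i = 1/2 for all i: the sum is Binomial(4, 1/2).
binom = siirv_pmf([[0.5, 0.5]] * 4)
expected = [math.comb(4, i) / 16 for i in range(5)]
```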

Learning Mixtures of Gaussians ^[7] ^[8]

Let the random variables $\textstyle X\sim N(\mu _{1},\Sigma _{1})$ and $\textstyle Y\sim N(\mu _{2},\Sigma _{2})$ be given. We define the random variable $\textstyle Z$, which takes the same value as $\textstyle X$ with probability $\textstyle w_{1}$ and the same value as $\textstyle Y$ with probability $\textstyle w_{2}=1-w_{1}$. If $\textstyle F_{1}$ is the density of $\textstyle X$ and $\textstyle F_{2}$ is the density of $\textstyle Y$, then the density of $\textstyle Z$ is $\textstyle F=w_{1}F_{1}+w_{2}F_{2}$. In this case we say that $\textstyle Z$ follows a mixture of Gaussians. Pearson ^[9] was the first to introduce mixtures of Gaussians, in his attempt to explain the probability distribution underlying some data that he wanted to analyze. After doing a lot of calculations by hand, he finally fitted his data to a mixture of Gaussians. The learning task in this case is to determine the parameters of the mixture $\textstyle w_{1},w_{2},\mu _{1},\mu _{2},\Sigma _{1},\Sigma _{2}$.
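The definition of $\textstyle Z$ translates directly into a sampling procedure and a density. The sketch below restricts to the one-dimensional case, so each covariance matrix $\textstyle \Sigma _{i}$ is replaced by a standard deviation $\textstyle \sigma _{i}$; this simplification, the function names and the example parameters are assumptions made for illustration.

```python
import math
import random

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, w1, mu1, sigma1, mu2, sigma2):
    """F = w1 * F1 + w2 * F2 with w2 = 1 - w1."""
    return w1 * gaussian_pdf(x, mu1, sigma1) + (1.0 - w1) * gaussian_pdf(x, mu2, sigma2)

def mixture_sample(rng, w1, mu1, sigma1, mu2, sigma2):
    """Z takes the value of X with probability w1 and the value of Y otherwise."""
    if rng.random() < w1:
        return rng.gauss(mu1, sigma1)
    return rng.gauss(mu2, sigma2)

# Hypothetical mixture parameters.
params = dict(w1=0.3, mu1=-2.0, sigma1=1.0, mu2=3.0, sigma2=0.5)
rng = random.Random(0)
samples = [mixture_sample(rng, **params) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)  # should be near 0.3 * (-2) + 0.7 * 3 = 1.5
```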

The first attempt to solve this problem was by Dasgupta ^[7]. In this work Dasgupta assumes that the two means of the Gaussians are far enough from each other, meaning that there is a lower bound on the distance $\textstyle ||\mu _{1}-\mu _{2}||$. Using this assumption, Dasgupta and many researchers after him were able to learn the parameters of the mixture. The learning procedure starts by clustering the samples into two clusters so as to minimize some metric. Using the assumption that the means of the Gaussians are far away from each other, we can conclude that with high probability the samples in the first cluster correspond to samples from the first Gaussian and the samples in the second cluster to samples from the second one. Once the samples have been partitioned, $\textstyle \mu _{i},\Sigma _{i}$ can be obtained from simple statistical estimators and $\textstyle w_{i}$ by comparing the sizes of the clusters.
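The cluster-then-estimate idea can be sketched in one dimension as follows. This is a simplified illustration using a plain 2-means clustering on well separated data, not the algorithm of ^[7]; the function names, the clustering rule and the test mixture are all assumptions.

```python
import math
import random

def two_means_1d(samples, iterations=50):
    """Split the samples into two clusters by Lloyd's iteration, started at the extremes."""
    c1, c2 = min(samples), max(samples)
    for _ in range(iterations):
        left = [x for x in samples if abs(x - c1) <= abs(x - c2)]
        right = [x for x in samples if abs(x - c1) > abs(x - c2)]
        c1, c2 = sum(left) / len(left), sum(right) / len(right)
    return left, right

def mean_and_std(xs):
    m = sum(xs) / len(xs)
    return m, math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

def learn_separated_mixture(samples):
    """Return (w, mu, sigma) for each cluster: weight from cluster size, then simple estimators."""
    left, right = two_means_1d(samples)
    w1 = len(left) / len(samples)
    return (w1, *mean_and_std(left)), (1.0 - w1, *mean_and_std(right))

# Hypothetical, well separated mixture: 0.3 * N(-5, 1) + 0.7 * N(5, 1).
rng = random.Random(0)
samples = [rng.gauss(-5.0, 1.0) if rng.random() < 0.3 else rng.gauss(5.0, 1.0)
           for _ in range(10000)]
(w1, mu1, s1), (w2, mu2, s2) = learn_separated_mixture(samples)
```

When the means are moved closer together the clusters start to mix samples from both components and the estimates become biased, which is why a separation assumption is needed for this approach.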

If $\textstyle GM$ is the set of all mixtures of two Gaussians, the above procedure yields theorems like the following.

Theorem ^[7]

Let $\textstyle D\in GM$ with $\textstyle ||\mu _{1}-\mu _{2}||\geq c{\sqrt {n\max(\lambda _{max}(\Sigma _{1}),\lambda _{max}(\Sigma _{2}))}}$, where $\textstyle c>1/2$ and $\textstyle \lambda _{max}(A)$ is the largest eigenvalue of $\textstyle A$. Then there is an algorithm which, given $\textstyle \epsilon >0$, $\textstyle 0<\delta \leq 1$ and access to $\textstyle GEN(D)$, finds an approximation $\textstyle w'_{i},\mu '_{i},\Sigma '_{i}$ of the parameters such that $\textstyle \Pr[||w_{i}-w'_{i}||\leq \epsilon ]\geq 1-\delta$ (and similarly for $\textstyle \mu _{i}$ and $\textstyle \Sigma _{i}$ ). The sample complexity of this algorithm is $\textstyle M=2^{O(\log ^{2}(1/(\epsilon \delta )))}$ and the running time is $\textstyle O(M^{2}d+Mdn)$.

The above result can also be generalized to mixtures of $\textstyle k$ Gaussians ^[7].

Interestingly, for mixtures of two Gaussians there are learning results that do not need the assumption on the distance between the means, such as the following one, which uses the total variation distance as the distance measure.

Theorem ^[8]

Let $\textstyle F\in GM$. Then there is an algorithm which, given $\textstyle \epsilon >0$, $\textstyle 0<\delta \leq 1$ and access to $\textstyle GEN(F)$, finds $\textstyle w'_{i},\mu '_{i},\Sigma '_{i}$ such that if $\textstyle F'=w'_{1}F'_{1}+w'_{2}F'_{2}$, where $\textstyle F'_{i}=N(\mu '_{i},\Sigma '_{i})$, then $\textstyle \Pr[d(F,F')\leq \epsilon ]\geq 1-\delta$. The sample complexity and the running time of this algorithm are $\textstyle {\text{poly}}(n,1/\epsilon ,1/\delta ,1/w_{1},1/w_{2},1/d(F_{1},F_{2}))$.

Notably, in the above result the distance between $\textstyle F_{1}$ and $\textstyle F_{2}$ does not affect the quality of the output of the algorithm, only the sample complexity and the running time.

References

[1] M. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, R. Schapire, L. Sellie. On the Learnability of Discrete Distributions. ACM Symposium on Theory of Computing, 1994.
[2] L. Valiant. A Theory of the Learnable. Communications of the ACM, 1984.
[3] C. Daskalakis, G. Kamath. Faster and Sample Near-Optimal Algorithms for Proper Learning Mixtures of Gaussians. Annual Conference on Learning Theory, 2014.
[4] C. Daskalakis, I. Diakonikolas, R. Servedio. Learning Poisson Binomial Distributions. ACM Symposium on Theory of Computing, 2012.
[5] C. Daskalakis, C. Papadimitriou. Sparse Covers for Sums of Indicators. Probability Theory and Related Fields, 2014.
[6] C. Daskalakis, I. Diakonikolas, R. O’Donnell, R. Servedio, L. Tan. Learning Sums of Independent Integer Random Variables. IEEE Symposium on Foundations of Computer Science, 2013.
[7] S. Dasgupta. Learning Mixtures of Gaussians. IEEE Symposium on Foundations of Computer Science, 1999.
[8] A. Kalai, A. Moitra, G. Valiant. Efficiently Learning Mixtures of Two Gaussians. ACM Symposium on Theory of Computing, 2010.
[9] K. Pearson. Contributions to the Mathematical Theory of Evolution. Philosophical Transactions of the Royal Society of London, 1894.
