Joint probability distribution

In the study of probability, given two random variables X and Y that are defined on the same probability space, the joint distribution for X and Y defines the probability of events defined in terms of both X and Y. In the case of only two random variables, this is called a bivariate distribution, but the concept generalizes to any number of random variables, giving a multivariate distribution.

Example

Consider the roll of a fair die and let $X=1$ if the number is even and $X=0$ otherwise. Furthermore, let $Y=1$ if the number is prime and $Y=0$ otherwise. Then the joint distribution of $X$ and $Y$ is

\mathrm {P} (X=0,Y=0)={\frac {1}{6}},\;\mathrm {P} (X=1,Y=0)={\frac {2}{6}},

\mathrm {P} (X=0,Y=1)={\frac {2}{6}},\;\mathrm {P} (X=1,Y=1)={\frac {1}{6}}.
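The four probabilities can be checked by enumerating the six faces. The following is a minimal sketch, assuming a fair die; the names `faces`, `primes` and `joint` are illustrative and not part of the original example.

```python
from collections import defaultdict
from fractions import Fraction

# Assumption: a fair six-sided die, each face with probability 1/6.
faces = range(1, 7)
primes = {2, 3, 5}

joint = defaultdict(Fraction)  # maps (x, y) -> P(X=x, Y=y), starting at 0
for n in faces:
    x = 1 if n % 2 == 0 else 0   # X indicates an even face
    y = 1 if n in primes else 0  # Y indicates a prime face
    joint[(x, y)] += Fraction(1, 6)

for (x, y), p in sorted(joint.items()):
    print(f"P(X={x}, Y={y}) = {p}")
```

Because `Fraction` reduces to lowest terms, the value 2/6 is printed as 1/3.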

Cumulative distribution

The cumulative distribution function for a pair of random variables is defined in terms of their joint probability distribution:

F(x,y)=P(X\leq x,Y\leq y).
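For a discrete pair, the definition amounts to summing the joint probabilities over all support points that lie below and to the left of $(x,y)$. A small sketch for the die example, where the dictionary `pmf` and the function `joint_cdf` are illustrative names:

```python
from fractions import Fraction

# Joint pmf of the die example (X: even indicator, Y: prime indicator).
pmf = {
    (0, 0): Fraction(1, 6), (1, 0): Fraction(2, 6),
    (0, 1): Fraction(2, 6), (1, 1): Fraction(1, 6),
}

def joint_cdf(x, y):
    """F(x, y) = P(X <= x, Y <= y), summed over the support of the pmf."""
    return sum(p for (a, b), p in pmf.items() if a <= x and b <= y)

print(joint_cdf(0, 1))  # P(X <= 0, Y <= 1) = P(X = 0)
```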

Discrete case

The joint probability mass function of two discrete random variables is equal to

{\begin{aligned}\mathrm {P} (X=x\ \mathrm {and} \ Y=y)&{}=\mathrm {P} (Y=y\mid X=x)\cdot \mathrm {P} (X=x)\\&{}=\mathrm {P} (X=x\mid Y=y)\cdot \mathrm {P} (Y=y).\end{aligned}}

In general, the joint probability distribution of $n$ discrete random variables $X_{1},\dots ,X_{n}$ is equal to

\mathrm {P} (X_{1}=x_{1},\dots ,X_{n}=x_{n})=\mathrm {P} (X_{1}=x_{1})\cdot \mathrm {P} (X_{2}=x_{2}\mid X_{1}=x_{1})\cdot \mathrm {P} (X_{3}=x_{3}\mid X_{1}=x_{1},X_{2}=x_{2})\cdot \dots \cdot \mathrm {P} (X_{n}=x_{n}\mid X_{1}=x_{1},\dots ,X_{n-1}=x_{n-1}).

This identity is known as the chain rule of probability.

Since these are probabilities, we have

\sum _{x}\sum _{y}\mathrm {P} (X=x\ \mathrm {and} \ Y=y)=1.\;
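Both the two-variable factorization and the normalization can be verified exactly on the die example. This is only an illustrative check; the helper names `marginal_x` and `cond_y_given_x` are assumptions introduced here.

```python
from fractions import Fraction

# Joint pmf of the die example (X: even indicator, Y: prime indicator).
pmf = {
    (0, 0): Fraction(1, 6), (1, 0): Fraction(2, 6),
    (0, 1): Fraction(2, 6), (1, 1): Fraction(1, 6),
}

def marginal_x(x):
    """P(X = x), obtained by summing the joint pmf over y."""
    return sum(p for (a, _), p in pmf.items() if a == x)

def cond_y_given_x(y, x):
    """P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)."""
    return pmf[(x, y)] / marginal_x(x)

# Chain rule: P(X = x, Y = y) = P(Y = y | X = x) * P(X = x).
for (x, y), p in pmf.items():
    assert cond_y_given_x(y, x) * marginal_x(x) == p

# Normalization: the joint probabilities sum to one.
total = sum(pmf.values())
```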

Continuous case

Similarly, for continuous random variables, the joint probability density function can be written as $f_{X,Y}(x,y)$, and it factorizes as

f_{X,Y}(x,y)=f_{Y|X}(y|x)f_{X}(x)=f_{X|Y}(x|y)f_{Y}(y)\;

where $f_{Y|X}(y|x)$ and $f_{X|Y}(x|y)$ give the conditional distributions of $Y$ given $X=x$ and of $X$ given $Y=y$ respectively, and $f_{X}(x)$ and $f_{Y}(y)$ give the marginal distributions for $X$ and $Y$ respectively.

Again, since these are probability distributions, one has

\int _{x}\int _{y}f_{X,Y}(x,y)\;dy\;dx=1.
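As a numerical illustration, consider the hypothetical density $f_{X,Y}(x,y)=x+y$ on the unit square (an assumption chosen for this sketch, not taken from the text above). Its marginal is $f_{X}(x)=x+{\tfrac {1}{2}}$, and a midpoint Riemann sum confirms that the density integrates to one.

```python
def f_xy(x, y):
    """Hypothetical joint density: x + y on the unit square, 0 elsewhere."""
    return x + y if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0 else 0.0

def f_x(x):
    """Marginal of X: the integral of (x + y) dy over [0, 1], computed by hand."""
    return x + 0.5 if 0.0 <= x <= 1.0 else 0.0

def f_y_given_x(y, x):
    """Conditional density of Y given X = x (for x in [0, 1])."""
    return f_xy(x, y) / f_x(x)

def midpoint_2d(g, n=200):
    """Midpoint-rule approximation of the integral of g over the unit square."""
    h = 1.0 / n
    return sum(g((i + 0.5) * h, (j + 0.5) * h)
               for i in range(n) for j in range(n)) * h * h

total = midpoint_2d(f_xy)
print(total)  # close to 1, up to floating-point rounding
```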

Mixed case

In some situations $X$ is continuous but $Y$ is discrete. For example, in a logistic regression, one may wish to predict the probability of a binary outcome $Y$ conditional on the value of a continuously distributed $X$. In this case, $(X,Y)$ has neither a probability density function nor a probability mass function in the sense of the terms given above. On the other hand, a "mixed joint density" can be defined in either of two ways:

{\begin{aligned}f_{X,Y}(x,y)&=f_{X|Y}(x|y)\mathrm {P} (Y=y)\\&=\mathrm {P} (Y=y\mid X=x)f_{X}(x)\end{aligned}}

Formally, $f_{X,Y}(x,y)$ is the probability density function of $(X,Y)$ with respect to the product measure on the respective supports of $X$ and $Y$. Either of these two decompositions can then be used to recover the joint cumulative distribution function:

{\begin{aligned}F_{X,Y}(x,y)&=\sum \limits _{t\leq y}\int _{s=-\infty }^{x}f_{X,Y}(s,t)\;ds\end{aligned}}

The definition generalizes to a mixture of arbitrary numbers of discrete and continuous random variables.
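A minimal sketch of the mixed case, under assumptions made only for illustration: $Y$ is binary with the probabilities in `P_Y`, and $X$ given $Y=y$ is normal with unit variance and the mean in `MU`. With these choices the inner integral over $s$ has a closed form through the normal distribution function.

```python
import math

P_Y = {0: 0.7, 1: 0.3}   # hypothetical P(Y = y)
MU = {0: -1.0, 1: 2.0}   # hypothetical mean of X given Y = y (unit variance)

def normal_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def normal_cdf(x, mu):
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2.0)))

def mixed_density(x, y):
    """First decomposition: f_{X,Y}(x, y) = f_{X|Y}(x | y) * P(Y = y)."""
    return normal_pdf(x, MU[y]) * P_Y[y]

def f_x(x):
    """Marginal density of X: the mixed density summed over y."""
    return sum(mixed_density(x, y) for y in P_Y)

def p_y_given_x(y, x):
    """P(Y = y | X = x), so that P(Y = y | X = x) * f_X(x) = f_{X,Y}(x, y)."""
    return mixed_density(x, y) / f_x(x)

def joint_cdf(x, y):
    """Sum over t <= y of the integral of f_{X,Y}(s, t) ds from -inf to x."""
    return sum(P_Y[t] * normal_cdf(x, MU[t]) for t in P_Y if t <= y)
```

In this particular model the log-odds of $Y=1$ given $X=x$ are linear in $x$, which is the functional form used in logistic regression.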

General multidimensional distributions

The cumulative distribution function for a vector of random variables is defined in terms of their joint probability distribution:

F(x_{1},\dots ,x_{n})=P(X_{1}\leq x_{1},\dots ,X_{n}\leq x_{n}).

The joint distribution for two random variables can be extended to many random variables $X_{1},\dots ,X_{n}$ by adding them sequentially with the identity

{\begin{aligned}f_{X_{1},\ldots X_{n}}(x_{1},\ldots x_{n})=&f_{X_{n}|X_{1},\ldots X_{n-1}}(x_{n}|x_{1},\ldots x_{n-1})f_{X_{1},\ldots X_{n-1}}(x_{1},\ldots x_{n-1})\\=&f_{X_{1}}(x_{1})\\&\cdot f_{X_{2}|X_{1}}(x_{2}|x_{1})\\&\cdot \dots \\&\cdot f_{X_{n-1}|X_{1}\ldots X_{n-2}}(x_{n-1}|x_{1},\ldots x_{n-2})\\&\cdot f_{X_{n}|X_{1},\ldots X_{n-1}}(x_{n}|x_{1},\ldots x_{n-1}),\end{aligned}}

where

{\begin{aligned}f_{X_{i}|X_{1},\ldots X_{i-1}}(x_{i}|x_{1},\ldots x_{i-1})=&{\frac {f_{X_{1},\dots X_{i}}(x_{1},\dots x_{i})}{\int f_{X_{1},\dots X_{i}}(x_{1},\dots x_{i-1},u_{i})\mathrm {d} u_{i}}}\\=&{\frac {\int \dots \int f_{X_{1},\dots X_{n}}(x_{1},\dots x_{i},u_{i+1},\dots u_{n})\mathrm {d} u_{i+1}\dots \mathrm {d} u_{n}}{\int \dots \int \int f_{X_{1},\dots X_{n}}(x_{1},\dots x_{i-1},u_{i},\dots u_{n})\mathrm {d} u_{i}\,\mathrm {d} u_{i+1}\dots \mathrm {d} u_{n}}}\end{aligned}}

and

f_{X_{1},\dots X_{i}}(x_{1},\dots x_{i})=\int \dots \int f_{X_{1},\dots X_{n}}(x_{1},\dots x_{i},x_{i+1},\dots x_{n})\mathrm {d} x_{i+1}\dots \mathrm {d} x_{n}

(note that these latter identities can be used to generate a random vector $(X_{1},\dots X_{n})$ with a given density function $f(x_{1},\dots x_{n})$); the density of the marginal distribution is

f_{X_{i}}(x_{i})=\int \dots \int \int \dots \int f_{X_{1},\dots X_{n}}(x_{1},\dots x_{i-1},x_{i},x_{i+1},\dots x_{n})\mathrm {d} x_{1}\dots \mathrm {d} x_{i-1}\,\mathrm {d} x_{i+1}\dots \mathrm {d} x_{n}.
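The sequential construction can be sketched for $n=2$: draw $X_{1}$ from its marginal, then draw $X_{2}$ from its conditional distribution given the value of $X_{1}$. The example below assumes the hypothetical density $f(x_{1},x_{2})=x_{1}+x_{2}$ on the unit square and uses inverse-transform sampling; both distribution functions were inverted by hand, and the function names are illustrative.

```python
import math
import random

def inv_cdf_x1(u):
    """Invert F_1(x) = (x^2 + x) / 2, the CDF of the marginal f_1(x) = x + 1/2."""
    return (-1.0 + math.sqrt(1.0 + 8.0 * u)) / 2.0

def inv_cdf_x2_given_x1(v, x1):
    """Invert F(y | x1) = (x1*y + y^2/2) / (x1 + 1/2), the conditional CDF."""
    return -x1 + math.sqrt(x1 * x1 + 2.0 * v * (x1 + 0.5))

def sample_pair(rng):
    """Draw X1 from its marginal, then X2 from its conditional given X1."""
    x1 = inv_cdf_x1(rng.random())
    x2 = inv_cdf_x2_given_x1(rng.random(), x1)
    return x1, x2

rng = random.Random(0)  # fixed seed so the run is reproducible
samples = [sample_pair(rng) for _ in range(20000)]
mean_x1 = sum(s[0] for s in samples) / len(samples)
print(mean_x1)  # should be near E[X1] = 7/12 for this density
```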

The joint cumulative distribution function is

F_{X_{1},\dots X_{n}}\left(x_{1},\dots x_{n}\right)=\int _{-\infty }^{x_{1}}\dots \int _{-\infty }^{x_{n}}f_{X_{1},\dots X_{n}}\left(u_{1},\dots u_{n}\right)\mathrm {d} u_{1}\dots \mathrm {d} u_{n},

and the conditional distribution function is accordingly

{\begin{aligned}F_{X_{i}|X_{1},\ldots X_{i-1}}(x_{i}|x_{1},\ldots x_{i-1})=&{\frac {\int _{-\infty }^{x_{i}}f_{X_{1},\dots X_{i}}(x_{1},\dots x_{i-1},u_{i})\mathrm {d} u_{i}}{\int _{-\infty }^{\infty }f_{X_{1},\dots X_{i}}(x_{1},\dots x_{i-1},u_{i})\mathrm {d} u_{i}}}\\=&{\frac {\int _{-\infty }^{\infty }\dots \int _{-\infty }^{\infty }\int _{-\infty }^{x_{i}}f_{X_{1},\dots X_{n}}(x_{1},\dots x_{i-1},u_{i},\dots u_{n})\mathrm {d} u_{i}\dots \mathrm {d} u_{n}}{\int _{-\infty }^{\infty }\dots \int _{-\infty }^{\infty }\int _{-\infty }^{\infty }f_{X_{1},\dots X_{n}}(x_{1},\dots x_{i-1},u_{i},\dots u_{n})\mathrm {d} u_{i}\dots \mathrm {d} u_{n}}}.\end{aligned}}

The expectation of a function $h$ of the random variables reads

\mathbb {E} \left[h(X_{1},\dots X_{n})\right]=\int _{-\infty }^{\infty }\dots \int _{-\infty }^{\infty }h(x_{1},\dots x_{n})f_{X_{1},\dots X_{n}}(x_{1},\dots x_{n})\mathrm {d} x_{1}\dots \mathrm {d} x_{n};

suppose that $h$ is smooth enough and $h(u_{1},\dots u_{n})=h(x_{1},\dots x_{n})$ for $u_{1}\geq x_{1},\dots u_{n}\geq x_{n}$; then, by iterated integration by parts,

{\begin{aligned}\mathbb {E} \left[h(X_{1},\dots X_{n})\right]=&h(x_{1},\dots x_{n})+\\&(-1)^{n}\int _{-\infty }^{x_{1}}\dots \int _{-\infty }^{x_{n}}F_{X_{1},\dots X_{n}}(u_{1},\dots u_{n}){\frac {\partial ^{n}}{\partial u_{1}\dots \partial u_{n}}}h(u_{1},\dots u_{n})\mathrm {d} u_{1}\dots \mathrm {d} u_{n}.\end{aligned}}
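For $n=1$ the identity reduces to $\mathbb {E} [h(X)]=h(x)-\int _{-\infty }^{x}F_{X}(u)h'(u)\,\mathrm {d} u$. The sketch below checks this numerically under assumptions chosen for illustration: $X$ is standard exponential, $x=1$, and $h(u)=-(x-u)^{2}$ for $u<x$ with $h(u)=0$ otherwise.

```python
import math

X0 = 1.0  # the point x beyond which h is constant

def cdf(u):
    """Distribution function of the standard exponential (assumed for X)."""
    return 1.0 - math.exp(-u) if u > 0.0 else 0.0

def pdf(u):
    return math.exp(-u) if u > 0.0 else 0.0

def h(u):
    """Test function: constant (zero) for u >= X0."""
    return -(X0 - u) ** 2 if u < X0 else 0.0

def dh(u):
    """Derivative of h."""
    return 2.0 * (X0 - u) if u < X0 else 0.0

def midpoint(g, a, b, n=20000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    step = (b - a) / n
    return sum(g(a + (k + 0.5) * step) for k in range(n)) * step

# Direct expectation: h vanishes for u >= X0 and the density vanishes for u <= 0.
direct = midpoint(lambda u: h(u) * pdf(u), 0.0, X0)

# Right-hand side for n = 1; F is zero below 0, so the integral starts there.
by_parts = h(X0) - midpoint(lambda u: cdf(u) * dh(u), 0.0, X0)

print(direct, by_parts)  # both approximate 2/e - 1
```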

Joint distribution for independent variables

If, for discrete random variables, $P(X=x\ {\mbox{and}}\ Y=y)=P(X=x)\cdot P(Y=y)$ for all $x$ and $y$, or, for absolutely continuous random variables, $f_{X,Y}(x,y)=f_{X}(x)\cdot f_{Y}(y)$ for all $x$ and $y$, then $X$ and $Y$ are said to be independent.
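The discrete criterion can be tested directly on a table of joint probabilities. In the sketch below, `pmf_xy` is the die example with the even and prime indicators, while `pmf_xz` pairs the even indicator with a hypothetical indicator $Z$ of the face being at most 2; the second pair is introduced here only for contrast.

```python
from fractions import Fraction

# Die example: X = even indicator, Y = prime indicator.
pmf_xy = {
    (0, 0): Fraction(1, 6), (1, 0): Fraction(2, 6),
    (0, 1): Fraction(2, 6), (1, 1): Fraction(1, 6),
}

# Hypothetical second pair: X = even indicator, Z = 1 if the face is 1 or 2.
pmf_xz = {
    (0, 0): Fraction(2, 6), (1, 0): Fraction(2, 6),
    (0, 1): Fraction(1, 6), (1, 1): Fraction(1, 6),
}

def marginal(pmf, index, value):
    """Marginal probability that coordinate `index` equals `value`."""
    return sum(p for key, p in pmf.items() if key[index] == value)

def is_independent(pmf):
    """True if the joint pmf equals the product of its marginals everywhere."""
    return all(p == marginal(pmf, 0, a) * marginal(pmf, 1, b)
               for (a, b), p in pmf.items())

print(is_independent(pmf_xy), is_independent(pmf_xz))
```

The first pair fails the test because $P(X=1,Y=1)=1/6$ while $P(X=1)\cdot P(Y=1)=1/4$.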

Joint distribution for conditionally independent variables

If a subset $A$ of the variables $X_{1},\dots ,X_{n}$ is conditionally independent given another subset $B$ of these variables, then the joint distribution $P(X_{1},\dots ,X_{n})$ is equal to $P(B)\cdot P(A|B)$. Therefore, it can be efficiently represented by the lower-dimensional probability distributions $P(B)$ and $P(A|B)$. Such conditional independence relations can be represented with a Bayesian network.
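A small sketch of such a representation, under assumptions made for illustration only: three binary variables with $B=\{X_{3}\}$ and $A=\{X_{1},X_{2}\}$, where $X_{1}$ and $X_{2}$ are taken to be conditionally independent of each other given $X_{3}$, so that $P(A|B)$ itself splits into two small tables. All numerical values are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Hypothetical tables. The full joint over three binary variables has
# 2**3 - 1 = 7 free parameters; the factorized form needs 1 + 2 + 2 = 5.
p_x3 = {0: Fraction(2, 5), 1: Fraction(3, 5)}
p_x1_given_x3 = {  # indexed as [x3][x1]
    0: {0: Fraction(1, 4), 1: Fraction(3, 4)},
    1: {0: Fraction(2, 3), 1: Fraction(1, 3)},
}
p_x2_given_x3 = {  # indexed as [x3][x2]
    0: {0: Fraction(1, 2), 1: Fraction(1, 2)},
    1: {0: Fraction(1, 5), 1: Fraction(4, 5)},
}

# Joint distribution assembled as P(B) * P(A | B).
joint = {
    (x1, x2, x3): p_x3[x3] * p_x1_given_x3[x3][x1] * p_x2_given_x3[x3][x2]
    for x1, x2, x3 in product((0, 1), repeat=3)
}

total = sum(joint.values())
print(total)
```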