CDF-based nonparametric confidence interval
In statistics, cumulative distribution function (CDF)-based nonparametric confidence intervals are a general class of confidence intervals around statistical functionals of a distribution. To calculate these confidence intervals, all that is required is a set of independently and identically distributed (iid) samples from the distribution and known bounds on the support of the distribution. The latter requirement simply means that all the nonzero probability mass of the distribution must be contained in some known interval $[a, b]$.
Intuition
The intuition behind the CDF-based approach is that bounds on the CDF of a distribution can be translated into bounds on statistical functionals of that distribution. Given an upper and lower bound on the CDF, the approach involves finding the CDFs within the bounds that maximize and minimize the statistical functional of interest.
Properties of the Bounds
Unlike approaches that make asymptotic assumptions, including bootstrap approaches and those that rely on the Central Limit Theorem, CDF-based bounds are valid for finite sample sizes. And unlike bounds based on inequalities such as Hoeffding's and McDiarmid's inequalities, CDF-based bounds use properties of the entire sample and thus often produce significantly tighter bounds.
CDF Bounds
CDF-based confidence intervals require a probabilistic bound on the CDF of the distribution from which the samples were generated. A variety of methods exist for generating confidence intervals for the CDF of a distribution, $F$, given iid samples drawn from the distribution. These methods are all based on the empirical distribution function (empirical CDF). Given $n$ iid samples, $x_1, \ldots, x_n$, the empirical CDF is defined to be
$$\hat{F}_n(t) = \frac{1}{n} \sum_{i=1}^{n} 1\{x_i \le t\},$$
where $1\{A\}$ is the indicator of event $A$. The Dvoretzky–Kiefer–Wolfowitz inequality,[1] whose tight constant was determined by Massart,[2] places a confidence interval around the Kolmogorov–Smirnov statistic between the CDF and the empirical CDF. Given $n$ iid samples from $F$, the bound states
$$P\left(\sup_{x} \left|F(x) - \hat{F}_n(x)\right| > \varepsilon\right) \le 2 e^{-2 n \varepsilon^2}.$$
This can be viewed as a confidence envelope that runs parallel to, and is equally above and below, the empirical CDF.
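As a minimal sketch of this envelope, the following Python code builds the empirical CDF and the band obtained by setting the right-hand side of the Dvoretzky–Kiefer–Wolfowitz bound equal to a chosen violation probability `alpha`, which gives a half-width of $\sqrt{\ln(2/\alpha)/(2n)}$. The function names `ecdf`, `dkw_epsilon` and `dkw_band` are illustrative and do not come from the cited papers.

```python
from bisect import bisect_right
from math import log, sqrt

def ecdf(samples):
    """Return the empirical CDF of the samples as a function of t."""
    xs = sorted(samples)
    n = len(xs)
    def F_hat(t):
        # Fraction of samples that are <= t
        return bisect_right(xs, t) / n
    return F_hat

def dkw_epsilon(n, alpha):
    """Half-width eps solving 2 * exp(-2 * n * eps**2) = alpha."""
    return sqrt(log(2.0 / alpha) / (2.0 * n))

def dkw_band(samples, alpha):
    """Return (lower, upper) envelope functions, clipped to [0, 1]."""
    F_hat = ecdf(samples)
    eps = dkw_epsilon(len(samples), alpha)
    def lower(t):
        return max(F_hat(t) - eps, 0.0)
    def upper(t):
        return min(F_hat(t) + eps, 1.0)
    return lower, upper

# Usage: a 95% band from four observations
lower, upper = dkw_band([3, 1, 2, 2], alpha=0.05)
print(lower(2), upper(2))
```

With only four observations the half-width exceeds 0.5, so the band is nearly uninformative; it narrows at rate $1/\sqrt{n}$ as the sample grows.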

The equally spaced confidence interval around the empirical CDF allows for different rates of violation across the support of the distribution. In particular, it is more common for a CDF to be outside of the CDF bound estimated using the Dvoretzky–Kiefer–Wolfowitz inequality near the median of the distribution than near the endpoints of the distribution. In contrast, the order statistics-based bound introduced by Learned-Miller and DeStefano[3] allows for an equal rate of violation across all of the order statistics. This in turn results in a bound that is tighter near the ends of the support of the distribution and looser in the middle of the support. Other types of bounds can be generated by varying the rate of violation for the order statistics. For example, if a tighter bound on the distribution is desired on the upper portion of the support, a higher rate of violation can be allowed at the upper portion of the support at the expense of having a lower rate of violation, and thus a looser bound, for the lower portion of the support.
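To illustrate why an equal violation rate per order statistic gives a band that is tighter at the ends, the following Python sketch computes pointwise intervals for $F(x_{(i)})$, the CDF evaluated at the $i$-th order statistic. It assumes $F$ is continuous, in which case $F(x_{(i)})$ follows a Beta$(i, n-i+1)$ distribution whose CDF equals a binomial tail probability. The function names are hypothetical, and the sketch does not calibrate the per-statistic rate for simultaneous coverage, so it is not the procedure of Learned-Miller and DeStefano.

```python
from math import comb

def order_stat_cdf(p, i, n):
    """P(F(X_(i)) <= p) for continuous F: at least i of n uniforms are <= p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(i, n + 1))

def order_stat_quantile(q, i, n, tol=1e-10):
    """Invert order_stat_cdf in p by bisection (it is increasing in p)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if order_stat_cdf(mid, i, n) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def pointwise_intervals(n, alpha):
    """Equal-tailed interval for F(X_(i)), i = 1..n, each violated at rate alpha."""
    return [
        (order_stat_quantile(alpha / 2, i, n),
         order_stat_quantile(1 - alpha / 2, i, n))
        for i in range(1, n + 1)
    ]

# Usage: compare interval widths at the smallest and the middle order statistic
intervals = pointwise_intervals(50, alpha=0.05)
widths = [hi - lo for lo, hi in intervals]
print(widths[0], widths[24])
```

The interval for the smallest order statistic is much narrower than the one near the median, in contrast to the constant width of the Dvoretzky–Kiefer–Wolfowitz band.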
A Nonparametric Bound on the Mean
Assume without loss of generality that the support of the distribution is contained in $[0, 1]$. Given a confidence envelope for the CDF of $F$, it is easy to derive a corresponding confidence interval for the mean of $F$. It can be shown[4] that the CDF that maximizes the mean is the one that runs along the lower confidence envelope, $L(x)$, and the CDF that minimizes the mean is the one that runs along the upper envelope, $U(x)$. Using the identity
$$E(X) = \int_0^1 \left(1 - F(x)\right) dx,$$
the confidence interval for the mean can be computed as
$$\left[\int_0^1 \left(1 - U(x)\right) dx,\ \int_0^1 \left(1 - L(x)\right) dx\right].$$
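A minimal Python sketch of this computation follows. It assumes the envelope is the Dvoretzky–Kiefer–Wolfowitz band clipped to $[0, 1]$ and that all samples lie in $[0, 1]$; other envelopes could be substituted. Because the envelopes are step functions that change only at the sample points, the integrals of one minus the upper and lower envelopes reduce to finite sums. The function name `mean_confidence_interval` is illustrative.

```python
from math import log, sqrt

def mean_confidence_interval(samples, alpha):
    """Interval for the mean of a distribution supported on [0, 1],
    obtained by integrating 1 - U(x) and 1 - L(x) over [0, 1]."""
    if any(x < 0.0 or x > 1.0 for x in samples):
        raise ValueError("samples must lie in [0, 1]")
    n = len(samples)
    eps = sqrt(log(2.0 / alpha) / (2.0 * n))  # DKW half-width
    xs = [0.0] + sorted(samples) + [1.0]
    lower_mean = upper_mean = 0.0
    for k in range(n + 1):
        width = xs[k + 1] - xs[k]
        f = k / n                      # empirical CDF on [xs[k], xs[k+1])
        U = min(f + eps, 1.0)          # upper envelope
        L = max(f - eps, 0.0)          # lower envelope
        lower_mean += width * (1.0 - U)  # upper envelope minimizes the mean
        upper_mean += width * (1.0 - L)  # lower envelope maximizes the mean
    return lower_mean, upper_mean

# Usage
print(mean_confidence_interval([0.2, 0.4, 0.4, 0.6, 0.9], alpha=0.05))
```

For a distribution supported on a general interval $[a, b]$, the samples can first be mapped to $[0, 1]$ by $x \mapsto (x - a)/(b - a)$ and the resulting endpoints mapped back.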
A Nonparametric Bound on the Variance
Assume without loss of generality that the support of the distribution of interest, $F$, is contained in $[0, 1]$. Given a confidence envelope for $F$, it can be shown[5] that the CDF within the envelope that minimizes the variance begins on the lower envelope, has a jump discontinuity to the upper envelope, and then continues along the upper envelope. Further, it can be shown that this variance-minimizing CDF, $F'$, must satisfy the constraint that the jump discontinuity occurs at $E[F']$, the mean of $F'$. The variance-maximizing CDF begins on the upper envelope, horizontally transitions to the lower envelope, then continues along the lower envelope. Explicit algorithms for calculating these variance-maximizing and variance-minimizing CDFs are given by Romano and Wolf.[5]
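The characterization of the variance-minimizing CDF suggests a simple numerical search, sketched below in Python. This is not the explicit algorithm of Romano and Wolf; it assumes a Dvoretzky–Kiefer–Wolfowitz envelope clipped to $[0, 1]$ and samples in $[0, 1]$, and it uses bisection to find a jump location $c$ at which the CDF that follows the lower envelope before $c$ and the upper envelope from $c$ onward has mean $c$. The moments use the identities $E(X) = \int_0^1 (1 - F(x))\,dx$ and $E(X^2) = \int_0^1 2x(1 - F(x))\,dx$. All function names are hypothetical.

```python
from bisect import bisect_right
from math import log, sqrt

def moments_with_jump(xs, eps, c):
    """Mean and second moment of the CDF equal to the lower envelope on
    [0, c) and the upper envelope on [c, 1]; xs are sorted samples in [0, 1]."""
    n = len(xs)
    pts = sorted(set([0.0, 1.0, c] + xs))
    mean = second = 0.0
    for a, b in zip(pts, pts[1:]):
        f = bisect_right(xs, a) / n          # empirical CDF on [a, b)
        F = max(f - eps, 0.0) if a < c else min(f + eps, 1.0)
        mean += (b - a) * (1.0 - F)          # integral of (1 - F)
        second += (b * b - a * a) * (1.0 - F)  # integral of 2x(1 - F)
    return mean, second

def min_variance(samples, alpha, iters=60):
    """Lower confidence limit for the variance under the stated assumptions."""
    xs = sorted(samples)
    eps = sqrt(log(2.0 / alpha) / (2.0 * len(xs)))
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        c = (lo + hi) / 2
        mean, _ = moments_with_jump(xs, eps, c)
        # mean(c) - c is non-increasing in c, so bisection finds the fixed point
        if mean > c:
            lo = c
        else:
            hi = c
    mean, second = moments_with_jump(xs, eps, (lo + hi) / 2)
    return max(second - mean * mean, 0.0)

# Usage: 200 evenly spaced points on [0, 1]
print(min_variance([i / 199 for i in range(200)], alpha=0.05))
```

Bisection applies because moving the jump to the right by a small amount changes the mean by the gap between the envelopes times that amount, and this gap never exceeds 1.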
Bounds on other Statistical Functionals
The CDF-based framework for generating confidence intervals is very general and can be applied to a variety of other statistical functionals, including differential entropy[3] and mutual information.[6]
References
- [1] Dvoretzky, A.; Kiefer, J.; Wolfowitz, J. (1956). "Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator". The Annals of Mathematical Statistics. 27 (3): 642–669.
- [2] Massart, P. (1990). "The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality". The Annals of Probability: 1269–1283.
- [3] Learned-Miller, E.; DeStefano, J. (2008). "A probabilistic upper bound on differential entropy". IEEE Transactions on Information Theory. 54 (11): 5223–5230.
- [4] Anderson, T.W. (1969). "Confidence limits for the value of an arbitrary bounded random variable with a continuous distribution function". Bulletin of the International Statistical Institute. 43: 249–251.
- [5] Romano, J.P.; Wolf, M. (2002). "Explicit nonparametric confidence intervals for the variance with guaranteed coverage". Communications in Statistics-Theory and Methods. 31 (8): 1231–1250.
- [6] VanderKraats, N.D. (2011). "A finite-sample, distribution-free, probabilistic lower bound on mutual information". Neural Computation. 23 (7): 1862–1898.
External links
- Confidence Interval - An explanation of confidence intervals.
- Bootstrap: A Statistical Method - An overview of bootstrap methods.