Wikipedia talk:WikiProject Statistics/Manual of Style

This page is within the scope of WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia.

Discussion

There are several things that could be discussed:

Random variables

Is it OK to discourage the use of capital letters to denote random variables? It certainly seems that statisticians don't use that convention anymore. When we have a sample x₁, …, x_n, each of these quantities is itself a random variable, an iid copy from some common distribution F.

quantities? Maybe that usage should be deprecated.

Oppose. A sequence of random variables should be X1, X2, ...; a sequence of quantities (variates?) x1, x2, ... --P64 (talk) 15:06, 14 May 2010 (UTC)

Parentheses / brackets

It seems to be common practice to use square brackets with the expectation, E[x], whereas for the variance usage is split about evenly between Var[x] and Var(x), and for the covariance it is mostly parentheses, Cov(x, y). So what should our recommendation be?
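For side-by-side comparison, the variants mentioned above could be typeset roughly as follows. This is only an illustrative sketch: the use of \operatorname (which assumes the amsmath package) is a markup assumption, not something proposed in the comment.

```latex
% Expectation: square brackets, reported above as the common form
\[ \operatorname{E}[x] \]

% Variance: both forms, reported above as about equally common
\[ \operatorname{Var}[x] \qquad \operatorname{Var}(x) \]

% Covariance: parentheses, reported above as the prevailing form
\[ \operatorname{Cov}(x, y) \]
```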

Small / capital correlation

The symbols for expectation, variance and covariance are all traditionally uppercase: E, Var, Cov; whereas correlation is almost always seen in lowercase: corr. Should we convert it to uppercase as well, or leave it be?

Transposition

The common symbol in statistics to denote transposition is the prime ′, as in x′. This contradicts the MOS:MATH recommendation, which is to use \top or x^T.
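To make the comparison concrete, here is how the competing forms might look in markup. The quadratic form with a matrix A is a hypothetical example chosen only for illustration; it does not appear in the comment above.

```latex
% Prime notation, common in statistics
\[ x' \qquad x' A x \]

% The forms attributed above to MOS:MATH
\[ x^{\top} \qquad x^{\top} A x \]
\[ x^{T} \qquad x^{T} A x \]
```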

Italic / straight distributions

Some common distributions, such as the normal N(μ, σ²), t-distribution t_k, F-distribution F_{k,ℓ}, chi-squared χ_k², and uniform U(a, b), are all traditionally written in italic. For other, less common distributions there is no tradition. Should we require them to be in italic as well (e.g. Poisson, Exponential, Binomial)? If so, should they be abbreviated where possible (e.g. Poi, Exp, B)?

Abbreviations recommended for general use should not be shorter than Uni, Poi, Expo (not Exp), Bin, Gam, Beta. (I prefer Unif, Pois, Gamma.) In other words the classical N, t, F, and chi should be exceptions.

Offhand I would deprecate italic face too. Use italics only for one-letter abbreviations.

What about the use of t, F, and chi for statistics, which may sometimes be interpreted as realizations of t, F, and chi random variables? --P64 (talk) 15:17, 14 May 2010 (UTC)

One-letter B should be a Brownian random variable, if anything. Brownian Bt, unlike binomial Bin(n,p), may be considered an extension of the classical family of one-letter exceptions {N, t, F, chi}. --P64 (talk) 15:59, 14 May 2010 (UTC)
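As a rough sketch of what the options in this thread would look like side by side: italic one-letter symbols for the classical family, and upright multi-letter abbreviations for the rest. The parameter names (λ, n, p, a, b, α, β) and the use of \operatorname (amsmath) are assumptions made for illustration only.

```latex
% Classical one-letter symbols, set in italic
\[ N(\mu, \sigma^2) \qquad t_k \qquad F_{k,\ell} \qquad \chi^2_k \]

% The italic uniform from the original question
\[ U(a, b) \]

% Upright multi-letter abbreviations, as suggested in the reply
\[ \operatorname{Unif}(a, b) \qquad \operatorname{Pois}(\lambda) \qquad \operatorname{Expo}(\lambda) \]
\[ \operatorname{Bin}(n, p) \qquad \operatorname{Gamma}(\alpha, \beta) \qquad \operatorname{Beta}(\alpha, \beta) \]

% One-letter B reserved for Brownian motion, indexed by time t
\[ B_t \]
```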

Distributions: parameters vs. degrees of freedom

I think there is a critical distinction between a parameter, such as λ of the exponential distribution, and the degrees of freedom, such as ν of the t-distribution. In practice the λ of the exponential is rarely known, so it has to be estimated. This is why it is a parameter, and it is, for example, meaningful to ask what the Hessian of the log-likelihood of the distribution with respect to this parameter is. On the other hand, the degrees of freedom “parameter” is not a true parameter, since it is always known beforehand in applications and never estimated. In particular the Fisher information with respect to this ν does not exist (although technically it could probably be calculated). The notational distinction between these “estimable” and “non-estimable” parameters is that the former are written in parentheses, as in N(μ, σ²), while the latter are written as a subscript, as in t_k. If we make this into a rule, then the notation for some distributions will have to be changed, for example the binomial to B_n(p).

Do not encourage stringing the symbols for rarely estimated parameters together as subscripts that precede parentheses, such as B_n(p). If some notational distinction is valuable, why not use a separator other than the comma inside the parentheses? For example, if the semicolon is adopted: Bin(n;p) or perhaps Bin(p;n) for the binomial family of distributions. --P64 (talk) 15:51, 14 May 2010 (UTC)
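To make the "Hessian of the log-likelihood" remark at the top of this thread concrete, here is a short worked derivation for the exponential case. It assumes the rate parametrization f(x; λ) = λ e^{−λx} and an iid sample x_1, …, x_n; neither assumption is stated in the thread, so treat this as a sketch.

```latex
% Assumption: x_1, ..., x_n iid with density f(x; \lambda) = \lambda e^{-\lambda x}, x \ge 0

% Log-likelihood of the sample
\[ \ell(\lambda) = \sum_{i=1}^{n} \log\bigl(\lambda e^{-\lambda x_i}\bigr)
                 = n \log \lambda - \lambda \sum_{i=1}^{n} x_i \]

% Score (first derivative); setting it to zero gives \hat\lambda = n / \sum_i x_i
\[ \frac{\partial \ell}{\partial \lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i \]

% Hessian (here a scalar second derivative)
\[ \frac{\partial^2 \ell}{\partial \lambda^2} = -\frac{n}{\lambda^2} \]

% Fisher information of the sample: minus the expected Hessian
\[ I_n(\lambda) = \frac{n}{\lambda^2} \]
```

The Hessian here is well defined because λ varies over a continuum. That is the sense in which λ is treated as estimable in the opening comment.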

Sample size

Both n and T are viable symbols to denote the sample size. T is more frequent in time series models, whereas n is more frequent in iid settings. However, it should be forbidden to use these symbols to denote anything other than the sample size (as is done, for example, in the Numerical methods for linear least squares article); otherwise it would cause too much confusion.

So you would not only deprecate but forbid the use of T for a random variable such as stopping time or hitting time or return time? Or is that the random size of a kind of "sample"?

Why not I for the index set, and thus commonly the sample size, where i is the index variable; J where j is the index variable; K where k is the index variable; T where t is the index variable? This is consistent with permission for I, J, K to be random sample sizes aka index sets where appropriate.

In company with I for the index set, indicator variables should use some fontface rendition of 1 (one) rather than some fontface rendition of I. --P64 (talk) 15:41, 14 May 2010 (UTC)