Talk:Statistical hypothesis test/Archive 1
This is an archive of past discussions about Statistical hypothesis test. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.
Archive 1 | Archive 2
Distaste not criticism
"... surely, God loves the .06 nearly as much as the .05." (Rosnell and Rosenthal 1989)
"How has the virtually barren technique of hypothesis testing come to assume such importance in the process by which we arrive at our conclusions from our data?" (Loftus 1991)
"Despite the stranglehold that hypothesis testing has on experimental psychology, I find it difficult to imagine a less insightful means of transiting from data to conclusions." (Loftus 1991)
The above are not criticisms of hypothesis testing; they are statements expressing one's distaste for hypothesis testing that offer nothing in the way of argument.
--Ivan 06:39, 8 September 2006 (UTC)
The criticism of hypothesis testing is principally that people have been doing hypothesis tests when it would be more useful to estimate the difference between effects and give a confidence interval. Blaise (talk) 09:19, 8 March 2009 (UTC)
Unpooled df formula
What's the source for the two-sample unpooled t-test formula? The formula for the degrees-of-freedom shown here is different from the Smith-Satterthwaite procedure, which is conventional, from what little I know. The S-S formula for df is
df ≈ (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]
Did someone just try to simplify this and make a few errors along the way?
--Drpritch 19:40, 4 October 2006 (UTC)
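For reference, a minimal sketch of the Smith-Satterthwaite (Welch) degrees-of-freedom calculation discussed above; the function name and sample values are invented for illustration:

```python
# Smith-Satterthwaite (Welch) approximate degrees of freedom for the
# two-sample unpooled t-test; s1_sq and s2_sq are the sample variances.
def welch_satterthwaite_df(s1_sq, n1, s2_sq, n2):
    v1, v2 = s1_sq / n1, s2_sq / n2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

print(welch_satterthwaite_df(4.0, 15, 9.0, 20))  # invented sample values
print(min(15, 20) - 1)  # the simpler conservative rule min(n1, n2) - 1
```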
References?
There are a number of references in the criticism section, e.g., Cohen 1990, but these references are not specified anywhere in the article. Tpellman 04:43, 10 November 2006 (UTC)
Can we clarify this article to make it more readable?
The article should be made more accessible to lay users by an explanation of some of the symbols (or at least a table of variables) used in the formulas. 69.140.173.15 03:19, 10 December 2006 (UTC)
- 24-Oct-2007: Good idea. I have added a "Definition of symbols" row to the bottom of the table. Forgetting to define symbols is a common wiki problem. Please flag undefined symbols in other articles, as well. Thanks. -Wikid77 10:16, 24 October 2007 (UTC)
Issues from 2007
Sufficient statistic
The article states that the statistic used for testing a hypothesis is called a sufficient statistic. This is false. In some cases the test statistic happens to be a sufficient statistic. For most distributions a sufficient statistic does not even exist. This is especially so if there are nuisance parameters. When a test can be based on a sufficient statistic it is advantageous to do so. 203.97.74.172 22:19, 21 January 2007 (UTC) Terry Moore
What's inappropriate about the link?
Re this edit: "11:42, 14 February 2007 SiobhanHansa (Talk | contribs) m (Revert inappropriate addition of exernal link)" Please tell me what is inappropriate about the addition of the link. It looks like a book, available in full online, about statistics. This could be a useful resource for readers of this article. --Coppertwig 13:47, 14 February 2007 (UTC)
- Sorry for the late response. The link has been spammed across multiple articles and European language Wikipedias by several IP addresses who have not made other edits. Standard practice is to revert mass additions that appear to be promoting a particular point of view or resource. If regular editors of this article think it is a good addition to the article then it should stay. -- Siobhan Hansa 15:02, 14 March 2007 (UTC)
question
Isn't the alternate degrees of freedom for a two sample unpooled t-test equal to min{n1,n2}-1? —The preceding unsigned comment was added by 68.57.50.210 (talk) 02:21, 19 March 2007 (UTC).
- 24-Oct-2007: Agreed. I am putting " - 1". -Wikid77 10:18, 24 October 2007 (UTC)
perm test
What is the application of the permutation test in the estimation of insurance claims? 12:12, 10 May 2007 (UTC) 41.204.52.52 bas
Oops.
I "broke" the article by an unclosed comment which effectively deleted the following sections. Sorry.
66.53.214.119 (talk) 00:36, 11 December 2007 (UTC)
Citations
Many of the remaining requests for citations can be answered by reference to other Wikipedia articles:
http://en.wikipedia.org/wiki/Karl_Popper for the philosophy of science
http://en.wikipedia.org/wiki/Mathematical_proof for "A proof is a logical argument, not an empirical one." (so statistics don't prove anything)
http://en.wikipedia.org/wiki/Statistical_significance for '"A statistically significant difference" simply means there is statistical evidence that there is a difference; it does not mean the difference is necessarily large, important or significant of the word.' (maybe - significant in the common meaning of the word?)
http://en.wikipedia.org/wiki/Effect_size for "In inferential statistics, an effect size helps to determine whether a statistically significant difference is a difference of practical concern." "The effect size helps us to know whether the difference observed is a difference that matters."
67.150.7.139 (talk) 03:32, 11 December 2007 (UTC)
Organization
Shouldn't the Contents be the first section?
66.53.213.51 (talk) 04:47, 17 December 2007 (UTC)
Issues from 2008
The Definition of Terms Section
Few of the terms have utility in Fisher's original conception of null hypothesis significance testing. They do have utility in the Neyman-Pearson formulation. Should this article discuss both? If not, the section should be heavily edited, because the null hypothesis is never accepted in Fisher's formulation, which makes the "Region of acceptance" very difficult to understand.
See the fourth paragraph under Pedagogic criticism.
67.150.5.216 (talk) 01:59, 14 January 2008 (UTC)
One-Proportion Z-test not well-defined
The One-Proportion Z-test is not well defined because the symbol p-hat does not appear in the legend. Perhaps it should be made clearer that p-hat is the hypothesis proportion, and p is the sample proportion. —Preceding unsigned comment added by 140.247.250.91 (talk) 14:58, 5 December 2008 (UTC)
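For reference, the common textbook convention is actually the reverse: p-hat denotes the sample proportion and p0 the hypothesized proportion. A minimal sketch of the statistic under that convention, with invented counts:

```python
from math import sqrt

# One-proportion z-test: p_hat is the sample proportion, p0 the
# hypothesized proportion; the counts below are invented for illustration.
n, successes, p0 = 200, 120, 0.5
p_hat = successes / n
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
print(z)  # compare with a standard normal critical value, e.g. 1.96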
Removed section
I've removed a newly placed section and am putting it here for reference. The new section read as follows:
- == Test statistic ==
- Central to statistical hypothesis testing is a test statistic, whose sampling distribution in general, and specific value on the observed data, drives the analysis.
- For a given distribution with known parameters, a test statistic is easily computed, or one can compute the p-value directly – for example in Fisher's "Lady drinking tea", where the null hypothesis was a sequence of 8 coin flips, and the Lady correctly guessed the tea method 8 times, the (one-tailed) p-value was 1/2^8 = 1/256 ≈ 0.0039. Similarly, if one has a normal distribution with given mean and variance as null hypothesis, one can apply a z-test.
- In general, however, a null hypothesis will have an unknown parameter, in which case one cannot proceed as easily. Test statistics are thus generally based on pivotal quantities – functions of the observations that do not depend on the parameters. A basic example is Student's t-statistic, which functions as a test statistic for a normal distribution with unknown variance (and mean): the t-test is based on the sample mean and sample variance, while the z-test requires the parameters to be known.
The Fisher example is already given in the section titled "Example". TBH, the proposed section doesn't explain much in an already difficult article. Perhaps some of the additional text can be integrated into the Example section? Phrases like "drives the analysis", "in general, however a null hypothesis will have an unknown parameter, in which case one cannot proceed as easily", etc., further confound this extra section in an already disorganized article. ... Kenosis (talk) 04:04, 24 April 2009 (UTC)
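As a numerical aside (not proposed article text), the two cases mentioned in the removed passage can be sketched with invented data:

```python
from scipy import stats

# Known-parameter case: one-tailed p-value for 8 correct guesses in
# 8 fair-coin trials, as in the removed text's coin-flip framing.
print(0.5 ** 8)  # 1/256, about 0.0039

# Unknown-variance case: Student's t is a pivotal quantity, so the
# t-test needs no known variance. The sample values are invented.
sample = [5.1, 4.9, 5.3, 5.0, 4.8, 5.2]
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print(t_stat, p_value)
```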
Tone and jargon
I've just moved the 'tone' tag, from its previous position at the top of a particular section, to the head of the whole article.
It looks as if one of the problems is a problem with jargon, in that some of the passages seem to say what the authors are unlikely to have meant in plain language.
Example: the article says "in other words, we can reject the null when it's virtually certain to be true", but this seems to read like nonsense, and I'd hazard a guess that what the author actually meant was something else that, when said plainly, might be along the lines of "in other words, the standard test procedure may encourage us to reject the null (hypothesis) when it's virtually certain to be true" -- a very different thing.
There are also other passages that seem to need attention from an expert who can suss out what was really meant, and then make the article say plainly what it means, and mean what it says. Terry0051 (talk) 19:25, 13 May 2009 (UTC)
- The tone tag is very important. This article is ridden with jargon. The material needs to be laid out with greater simplicity.
Wikipedia needs to be written clearly, to explain matters to non-experts. Jargon is for experts. Experts do not need these articles; they already know the matter. We need this article to be improved, to explain the material at hand at a lay-person's level of knowledge. Dogru144 (talk) 00:26, 23 July 2009 (UTC)
- Actually, though there are some statements that are arguably erroneous in the article, it's not merely "jargon". Statistical hypothesis testing is a technical art the expression of which is heavily dependent upon a set of terms that are widely accepted "terms of art" which are often difficult to reduce to plain English in a way that can be readily agreed among knowledgeable persons in this complex, highly technical topic area. Given the formula-laden approach of most statistics texts, and the relative rarity of books that explain statistics in laypersons' terms, it might be a very tall order to explain this topic in "plain English" and still meet WP:V in a way that has a chance of becoming a consensused, stable article.
..... The example given by Terry0051 above is one of several instances where students of statistics (and/or those already intimately familiar) attempt to reduce statistical terms of art to plain English: The statement "in other words, we can reject the null when it's virtually certain to be true" is correct only when one already knows what the word "it's" is intended to refer to-- the words "it's virtually certain to be true" refer to any hypothesized correlation between two factors under specified conditions, which can reasonably be said to be "true" only when the null hypothesis is rejected due to evidence of a correlation to within a very high degree of confidence-- (to within a very tight "confidence interval"). Of course something like this could, and should, be written in a way that doesn't look false on its face to those unfamiliar with the process. Perhaps something like "In other words, one can reject the 'null' when the hypothesized relationship is virtually certain to be true" might be clearer. But it isn't at all an easy task in a topic that is dominated by so much formal mathematics and so many specialized terms.
.....So, one approach might be to split the topic into another introductory article. There are many examples of this approach across the wiki, some of which can be seen at User_talk:Kenosis/Research. Having noted this possibility, the same issue seems to remain-- which is actually writing it in a way that will ring true to those who are knowledgeable in statistical testing as well as being helpful to laypersons. ... Kenosis (talk) 05:26, 23 July 2009 (UTC)
- Thank you for your reply. A more ideal manner of writing this article would involve breaking these ideas down into smaller chunks, as math or science teachers do with elementary school students.
This is an example of simpler language that would be more helpful to a broader audience: "In other words, one can reject the 'null' when the hypothesized relationship is virtually certain to be true". Yes, the ideas are packed with specialized terms. It will take much time and patience to break it down. I have faith that the task can be accomplished. Dogru144 (talk) 08:21, 23 July 2009 (UTC)
- I briefly attempted to address this issue by inserting the proposed alternative sentence, but quickly realized that the source is making an argument about excessively lax rejections of null hypotheses which lead to excessive Type I errors. So I reverted back to the earlier text in the "Straw man" criticism section here, and removed the "citation needed" templates because the basic assertion seems to properly reflect the already given source for the assertion. Whether the criticism is valid is a whole separate discussion-- a rather complex one to say the least. ... Kenosis (talk) 15:19, 27 July 2009 (UTC)
Howto tag discussed
A 'howto' tag was recently added to the section on 'test procedure'. I've added an explanatory sentence without removing the 'howto' tag, but I propose here that the tag should be removed.
No disagreement is raised about the principle that "The purpose of Wikipedia is to present facts, not to train". But here, a summary of the test procedure can serve to reveal important characteristics of the subject-matter not otherwise effectively communicated. In that way, the role here of the section on test procedure is not to train, but to present facts about the nature of the testing which is subject-matter of this article. (I leave open the possibility that the current description of the procedure could be improved to present the facts more effectively.)
In this case, procedural presentation of facts can be seen as especially important, because an important part of the current state of this article is the multi-part section describing criticisms of frequentist hypothesis-testing. Some of these criticisms are connected with, or dependent upon, features of the test procedure. It can also be controversial, or at least argued, whether some of the effectively criticised features of the procedure represent its misapplication, rather than its proper application -- for example, various practices about what inferences to make from specific numerical levels of probability.
So when (and if) the article reaches a reasonably mature encyclopedic state, I would hope that it will show two characteristics:
- (1) a consensus description of the true nature and content of the testing with enough detail to enable the criticisms to be properly understood -- and this would include an adequate note of any competing views of the nature and content of the testing, perhaps among them how the testing was proposed by its originators to be used, and how it is actually/often used in current practice, and
- (2) in respect of each type of criticism, how it relates to features of the nature and content of the testing.
What I'd suggest is that if, at that point, a procedure section clearly duplicates facts about the nature of the testing accurately represented elsewhere, then it would look redundant. But otherwise, I suggest it has a genuinely encyclopedic role here.
Terry0051 (talk) 12:19, 7 July 2009 (UTC)
- I think the section is a valuable part of the article. It might be renamed to "summary of hypothesis testing" or some such, and it might appear slightly better with the numbered points being formatted as such. Possibly the last 3 points might be re-phrased away from being "instructions", but it doesn't seem to qualify as a how-to list. Melcombe (talk) 14:24, 7 July 2009 (UTC)
- I agree wholeheartedly. I think the tag police should step down on this one. When discussing a technique, how can you not delve a little bit into how the technique is performed? These rules need to be applied intelligently, not in the knee-jerk way that was applied here. It's completely appropriate to discuss how hypothesis testing is done, because that defines what it is. For example, it would be inappropriate to discuss how to shift in a manual transmission car in an article on transmissions. However, it's entirely appropriate in the heel-and-toe article. Can we just put this to rest and remove the tag? Birge (talk) 21:53, 4 September 2009 (UTC)
Issues from 2009
Bayesian criticism section
I am a little confused about this section. It talks about a situation in which P(Null) is close to 1, but we got some extremely unlikely data, so P(Data|Null) is close to 0 and therefore we reject Null, even though P(Null | Data) is close to 1 too. Every claim it makes is true, but what it fails to do is actually produce a valid criticism of hypothesis testing. If P(Data|Null) is close to 0, then this particular situation will happen very rarely, and therefore we will only make the mistake of rejecting an almost certain null hypothesis very rarely. If your goal is policy making, then that's a pretty good promise.
Don't get me wrong, I am not arguing that there's nothing wrong with hypothesis testing - just that in its current form the summary of criticism from Bayesian statisticians is rather weak and not very convincing.
Ivancho.was.here (talk) 15:31, 17 September 2009 (UTC)
- That could be because all criticism from Bayesian statisticians is rather weak and not very convincing. Melcombe (talk) 15:44, 17 September 2009 (UTC)
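A toy calculation of the situation described above, with all numbers invented for illustration:

```python
# The data are unlikely under the null (small P(Data|Null)), yet the
# posterior P(Null|Data) stays near 1, because the prior favours the
# null and the data are also unlikely under the alternative.
p_null = 0.99             # prior P(Null)
p_data_given_null = 0.001
p_data_given_alt = 0.002

p_data = p_null * p_data_given_null + (1 - p_null) * p_data_given_alt
p_null_given_data = p_null * p_data_given_null / p_data
print(p_null_given_data)  # ~0.98: Null is rejected, yet remains probable
```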
True vs. un-rejectable
It seems to me to be adequately clear that a proper focus should be not on whether or not the null hypothesis is true or false but, rather, on whether it is rejectable or not based on the results of the experiment conducted. That being said, I take issue with the following definition provided in the article for <s>"p-value"</s> "unbiased test":
- or equal to the significance level when the null hypothesis is true.
In terms of the definitions given for α and β, the phrase incorrectly rejecting the null hypothesis appears consistent with the aforementioned focus, in that such rejection is improperly done not based on the hypothesis actually being true, but rather as a result of overemphasis of the evidence in favor of rejecting it. In other words, in appreciating a false-negative result, we are not concerned that the hypothesis is true nearly as much as we are concerned with the results of the experiment insufficiently contesting the null hypothesis (yet similarly appreciating that it may be false, yet undetectable, or at least undetectable by the study methods performed). My problem lies in the above quoted (indented) copy/paste in reference to the <s>p-value</s> unbiased test, in which an inordinate stress seems to be placed on the hypothesis being true. If I am not mistaken, we can never really know if the null hypothesis is true -- I'm therefore disturbed by this statement, which I believe should read:
- or equal to the significance level when the null hypothesis is unrejectable as a function of the experimental results.
If what I wrote is not completely incoherent, I'd welcome comments to either bolster my argument or explain to me why my premise (i.e. my understanding of this definition of p-value unbiased test in terms of the "trueness" of the null hypothesis) is incorrect. DRosenbach (Talk | Contribs) 12:34, 25 September 2009 (UTC)
- I don't see the phrase you are complaining about. The "definition" I see is "p-value: The probability, assuming the null hypothesis is true, of observing a result at least as extreme as the test statistic" which seems correct. I did look for your phrase specifically ... if it is still there, could you be more specific? However, the article is correct in working with probabilities calculated assuming the null hypothesis is true, while you are correct that "we can never really know if the null hypothesis is true". But that is the way significance tests work. There is a logical leap from the probabilities calculated to making a "decision" which is the subject of the so-called controversy about significance tests. Melcombe (talk) 13:19, 25 September 2009 (UTC)
- Sorry -- I meant unbiased test. Your point is well taken, Melcombe, but if there is such a focus on pedanticism, I think it should be applied across the board. DRosenbach (Talk | Contribs) 13:27, 25 September 2009 (UTC)
- OK found it. But I think it is correct. As before, the probabilities are calculated assuming that one does know the true state of affairs and the condition stated is a property that one would want a "good" test to have, given that one can work out the probabilities assuming a particular version of the truth. That is, there should be a balance between the probabilities of reaching the conclusion "reject the null hypothesis" between the cases where the null hypothesis is true and the cases where the null hypothesis is false. Melcombe (talk) 14:32, 25 September 2009 (UTC)
Popper or not Popper
A suggestion, since there has been considerable discourse on the phrasing of the standard hypothesis test result.
A p-value is an estimate of the probability of getting the observed result given that the null hypothesis is true.
This phrasing emphasizes that a hypothesis test cannot prove the null hypothesis false, nor can it prove the alternative hypothesis true. Given the outcome of the test, it is up to the researcher to make a judgement call.
This phrasing also leaves open the possibility that one could use the outcome to justify claiming that the null hypothesis is true. If a p-value of 0.05 or less is accepted as a criterion for rejecting the null hypothesis, then by extension a p-value of 0.95 or greater should be sufficient to allow for a claim that two samples are the same.
for an example see: Ebert, T.A., W.S. Fargo, B. Cartwright, F.R. Hall. 1998. Randomization tests: an example using morphological differences in Aphis gossypii (Hemiptera: Aphididae). Annals of the Entomological Society of America. 91(6):761-770.
Eumenes89 (talk) 01:52, 20 December 2009 (UTC)
- To a large extent I would agree except for one thing: As for a claim that the "two samples are the same", I think not. One can only say that in view of the result they are not distinguishable by this test. (Also, it's not apparent how they are any more indistinguishable when the p-value is close to 1 than when the result fails by a narrower margin to reach significance.) Terry0051 (talk) 10:58, 20 December 2009 (UTC)
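A simulation sketch of the point in the reply above, with all settings invented: under a true null hypothesis, p-values from a continuous test are uniformly distributed, so a p-value near 1 is no more informative than one near 0.5.

```python
import numpy as np
from scipy import stats

# Two samples drawn from the same distribution: the null is true,
# and the resulting p-values are uniform on [0, 1].
rng = np.random.default_rng(0)
p_values = np.array([
    stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
    for _ in range(10_000)
])
print((p_values > 0.95).mean())  # ~0.05, as often as p < 0.05 occurs
```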
Acceptance region
Don't redirect "Acceptance region" to this page if this phrase doesn't even show up ONCE! You're giving people the wrong impression, ie. that info on the topic already exists, when it doesn't. —Preceding unsigned comment added by 67.159.64.197 (talk) 03:25, 13 February 2010 (UTC)
Issues from 2010
Section: An introductory example, null hypothesis
Presently the null hypothesis in the example with the clairvoyant is the composite hypothesis
H0: p ≤ 1/4,
but should be the simple hypothesis
H0: p = 1/4.
I was told this was a valid example of a composite hypothesis by Melcombe in a history revert. A null hypothesis is always a simple hypothesis, and only the alternative hypothesis can be composite. You cannot construct rejection regions from a null hypothesis which is composite.
The union of the null and alternative hypotheses does not need to contain everything in a hypothesis test. Even in later equations in this section it is actually implied that they are equal:
P(reject H0 | H0 is true) = α,
which under a composite null hypothesis should be:
sup over p in H0 of P(reject H0 | p) ≤ α,
but then you have only given a bound to the probabilities of rejection of the null hypothesis given it is true. Tank (talk) 09:44, 19 May 2010 (UTC)
- It is standard that null hypotheses can be composite. The requirement for a definition of "unbiased test" in the "definition of terms" section is based on the need to deal with composite tests. See Null hypothesis#Directionality for more discussion. For an explicit example, consider any of the tests in Category:Normality tests, where the null hypothesis is always composite (since the mean and variance are not specified under the null hypothesis). As discussed at Null hypothesis#Directionality, it may sometimes be possible to logically reduce a test of a composite null hypothesis to a test of a simple one, but this is not always so, and it is not so for normality tests. Other examples include likelihood ratio tests. (A sketch of a composite-null normality test appears at the end of this section.) Melcombe (talk) 10:05, 19 May 2010 (UTC)
- I restored the null hypothesis into its original simple form. In this introductory example it is the simplest form of the given problem. Although a null hypothesis may be composite, there is no need in this example. Nijdam (talk) 10:08, 20 May 2010 (UTC)
It is worse than this, because this particular example demands the simpler, mutually-exclusive competing hypotheses. The null is p = 0.25, and the alternative is simply that the proportion is NOT EQUAL to 0.25; because if the subject were to guess NONE of the cards correctly, that itself is a rare event and suggests intentional misses, which is contrary to the assumption of no clairvoyance. [See Signal Detection theory, which is quite happy to acknowledge and deal with this situation.] You have to remember you are dealing with the highly-compressed proportion scale (0 to 1). To assume that the entire range from 0 to 0.25 is unimportant is ignorance. The fact that proportion data often impose floor and ceiling effects is no reason to goof up with this example. As it stands now, it is flat wrong in both principle and practice. — Preceding unsigned comment added by 61.31.150.138 (talk) 11:13, 9 November 2011 (UTC)
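As an aside on the normality-test point in Melcombe's reply above, a minimal sketch of a test whose null hypothesis is genuinely composite, with invented data:

```python
import numpy as np
from scipy import stats

# Shapiro-Wilk tests the composite null "the sample comes from some
# normal distribution" -- the mean and variance are left unspecified.
rng = np.random.default_rng(1)
sample = rng.normal(loc=10.0, scale=3.0, size=50)  # invented data
stat, p_value = stats.shapiro(sample)
print(stat, p_value)
```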
Section: Introduction/Examples
Hello Nijdam,
I would like to discuss your changes back on Sept 19, 2010. I felt the need to restore the tables for the following reasons.
- The tables were not redundant. The tables added new information for each example. They clarified what was written in paragraph form in a more structured form.
- Who is your audience? I suspect you are writing for fairly mature or expert-level statisticians. I feel the audience should be freshman-level undergraduates taking their first and possibly only statistics course. In the Wikipedia guidelines, the audience is not supposed to be scientists/experts but laypeople. I feel the tables added value. Maybe not to you personally, but I think yes to the target audience I identified.
- My attempt was to make the process methodical. Steps 1,2,3... My attempt was to make the tables a sort of "tool" for newcomers.
- I borrowed the table from another page (Type I and type II errors) in an attempt to standardize it. You swapped rows and columns in your revised table for example 1 so that they were different from example 4. That inconsistency is usually quite confusing to newcomers.
- In Wikipedia guidelines it is generally discouraged to delete information -- unless of course it was just plain wrong or offensive which the tables were not.
- Finally, I put a lot of my own personal time which is scarce into writing those tables.
If you still disagree, after all these points then maybe we should engage others to bring this to a consensus.
Thank you for listening. Joe The-tenth-zdog (talk) 02:28, 22 September 2010 (UTC)
- Frankly, yes, I disagree with you. I replaced one of your tables by a more conventional one. The other tables would be no more than a repetition of this one. The specific wording you use in the tables is, IMHO, not suited for an encyclopedia. Nijdam (talk) 22:59, 22 September 2010 (UTC)
- I too disagree. Some of the principles for what Wikipedia should contain are at WP:NOTMANUAL : it is not a tutorial or textbook. All these examples are far too long, and things like "Send him home to Mama!" should not be appearing. Melcombe (talk) 14:38, 23 September 2010 (UTC)
The criticism § is an embarrassment
I hope it was not done by Americans, but I am not sanguine that it wasn't. Suggest that all that content be moved to a misuse of statistics article, creating it if necessary. It makes the wiki editorship as a whole look mathematically and scientifically illiterate/simple minded. 72.228.177.92 (talk) 23:20, 14 November 2010 (UTC)
- Also looks like there may be a failure to distinguish between "criticism" and basing on different fundamental models of probability theory which is something that generally should be done for best fit with the problem at hand. So maybe a section on fundamental models and their effect should absorb the current content in the criticism section on Bayesian theory. 72.228.177.92 (talk) 02:09, 15 November 2010 (UTC)
- I don't see why it's necessary to call Americans "simple minded" or "scientifically illiterate" to convey the point that you don't like a section. No more necessary than mentioning any other bigoted opinions, which are completely irrelevant to the matter at hand. Napkin65 (talk) 15:20, 17 May 2011 (UTC)
- I don't think he's calling Americans such things; I think he meant to say "native English-speaker," since this section is so poorly written... As a statistically-minded American I agree. For example, "The test is a flawed application of probability theory" is vague almost to the point of being meaningless (how is it 'flawed'? I can think of 2 interpretations (and counter-arguments) off the top of my head), while "the test result is uninformative" can't possibly be true except in a few artificially pathological cases! Also, the "misuses and abuses" section seems not to discriminate between the legitimate ("rigorously control the experimental design") and the dishonest ("Publish the successful tests; Hide the unsuccessful tests."), while repeating points from the preceding "Selected Criticisms" section... On top of this, the grammar is just uniformly atrocious. It needs to be re-written. I would take a crack at it, except that I don't know what this section is even trying to say, and I don't want to step on toes. 209.2.238.169 (talk) 23:34, 5 July 2011 (UTC)
- In fact, the original criticism above related to an earlier version that was mostly replaced as part of the contribution discussed below under "Major edit discussion invitation". However the points above about the present version seem valid. Melcombe (talk) 08:23, 6 July 2011 (UTC)
Issues from 2011
Suggesting: Null Hypothesis Statistical Significance Controversy
I read this article. Then I scanned Introductory Statistics, 5th Ed, Weiss, 1999; Statistical Methods for Psychology, 5th Ed, Howell, 2002; Statistics Unplugged, 2nd Ed, Caldwell, 2007. This article contains more criticism of the null hypothesis statistical significance test than all three books combined. Finally, I selectively read 11 papers from Research Methods in Psychology, Badia, Haber & Runyon, 1970. The majority of the criticism contained in this article was known 40 years ago.
On the basis of this modest literature survey, I suggest that the bulk of the criticism be moved to a new article entitled "Null Hypothesis Statistical Significance Test Controversy". It should receive the attention of a subject matter expert. Critical quotations have been selected for emotional impact because "one can hardly avoid polemics when butchering sacred cows". The result does not have the appearance of impartiality. Significance tests are heavily used - half of an introductory statistics text may be dedicated to their description. While they may eventually be regarded as a misuse of statistics, that is not the consensus opinion _today_. This article should not imply otherwise. This article should mention the controversy, summarize it and link to the more detailed article. The new article can contain the graduate-level seminar while this one contains the undergraduate introduction to statistics.
Criticism of the null hypothesis statistical significance test originates from a long list of reputable researchers - many from the arena of experimental psychology. Some of the criticism may result from concerns unique to the discipline. The best available summary of the issues seems to be Null Hypothesis Significance Testing: A Review of an Old and Continuing Controversy, Nickerson, Psychological Methods, 2000. The article is 61 pages in length and contains a list of references more than 10 pages long (~300 references). For more detail, there are entire books on the controversy: What if there were no significance tests?, Harlow; The Significance Test Controversy, Morrison; The Cult of Statistical Significance, Ziliak; Statistical Significance: Rationale, Validity and Utility, Chow;... There is lots of material for a new article.
159.83.196.1 (talk) 21:06, 20 January 2011 (UTC)
Suggestion to Merge with p-value page
The wikipedia p-value page says someone has proposed merging that page with the hypothesis testing page. Please, please don't merge these pages. As explained by SN Goodman (Ann Intern Med. 1999;130:995-1004.), p-values & hypothesis tests spring from different, even opposing, heritages & purposes. RA Fisher proposed the p-value as a measure of (short-run) evidence, whereas Jerzy Neyman & Egon Pearson proposed hypothesis testing as a procedure for controlling long-run error rates. Neither Fisher, nor Neyman, nor Pearson would have agreed with combining the procedures into a single one. Both p-values & hypothesis tests are already confusing enough (given that they don't constitute the inference most frequentists claim they do); let us not further confuse the reader by merging these pages. — Preceding unsigned comment added by Khahstats (talk • contribs) 18:57, 24 December 2011 (UTC)
A few missing citations
A few missing citations:
Significance and practical importance:
References 16 & 17 are duplicates. Ziliak is the first author.
Meta-criticism:
Wilkinson, L., and the Task Force on Statistical Inference (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.
Uniform Requirements for Manuscripts Submitted to Biomedical Journals: Publishing and Editorial Issues Related to Publication in Biomedical Journals: Obligation to Publish Negative Studies
"Editors should seriously consider for publication any carefully done study of an important question, relevant to their readers, whether the results for the primary or any additional outcome are statistically significant. Failure to submit or publish findings because of lack of statistical significance is an important cause of publication bias."
Practical criticism:
The "missing" citations are [26][27], found after the following sentence.
Straw man:
Most of the content of this section is repeated in the following section. Delete the duplication to silence the owl (Who? Who?).
Bayesian criticism:
This issue is discussed at length in: Null Hypothesis Significance Testing: A Review of an Old and Continuing Controversy RS Nickerson - Psychological Methods, 2000.
While Nickerson has lots of citations, I doubt that he has the best for the Bayesian approach(es).
He cites Cohen (our [29]) who has a very readable example - An accurate test for schizophrenia, which is rare, provides very different results from frequentist vs Bayesian analysis. The accuracy implies that a patient with a positive result is likely to be schizophrenic. The rarity of the disease implies that any positive result is likely to be false. The two implications are in conflict. (A numerical sketch follows at the end of this section.)
Something is wrong with the Lehmann citations. [6] is incomplete. In Further Reading, Lehmann's 5th edition is mentioned (1970), but his 3rd edition was published in 2005 [8]. —Preceding unsigned comment added by 159.83.196.1 (talk) 23:46, 26 January 2011 (UTC)
159.83.196.228 (talk) 20:30, 22 January 2011 (UTC)
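Cohen's schizophrenia example mentioned above can be sketched numerically; all rates are invented for illustration:

```python
# An accurate test for a rare condition still yields mostly false
# positives: the frequentist accuracy figures and the Bayesian
# posterior point in different directions.
prevalence = 0.01    # the condition is rare
sensitivity = 0.95   # P(positive | condition)
specificity = 0.97   # P(negative | no condition)

p_positive = prevalence * sensitivity + (1 - prevalence) * (1 - specificity)
ppv = prevalence * sensitivity / p_positive
print(ppv)  # ~0.24: most positive results are false positives
```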
Error in Example 2 - Clairvoyant card game
Hi. The probability P(X>=10 | p=1/4) is given as 9.5 x 10^-7. However, this number seems incorrect. Shouldn't it be ~0.0071 (see http://stattrek.com/Tables/Binomial.aspx)? Just wanted to check on the talk page first in case I am misinterpreting something. - Akamad (talk) 22:36, 24 January 2011 (UTC)
- Thanks, you're right. Value has been changed. I have restored it. Nijdam (talk) 10:48, 25 January 2011 (UTC)
Disconnect between The testing process & Interpretation
Disconnect between The testing process & Interpretation:
Using an inconsistent formulation causes difficulties.
While the Interpretation says, "The direct interpretation is that if the p-value is less than the required significance level, then we say the null hypothesis is rejected at the given level of significance.", The testing process says nothing about the p-value or the required significance level. Both are in the Definition of terms section. While it is possible to formulate in terms of regions, p-values or confidence intervals, there is merit in consistency. The Criticism sections and many introductory statistics books use p-values.
159.83.196.130 (talk) 20:15, 3 February 2011 (UTC)
Major edit discussion invitation
The proposed major edit of March 1 addressed the embedded editorial comments in the existing article, sections Potential misuse through Publication bias. Comments? —Preceding unsigned comment added by 159.83.196.1 (talk) 21:02, 8 March 2011 (UTC)
- I think you are going to have to divide up your proposed changes so that they can be discussed individually, rather than everyone having to try to see what the point of a whole group of changes was. However, you might start by noting the header of the article. Of course the promised content about the Bayesian version of what hypothesis testing tries to do is not really there or anywhere else as far as I know, but that is not the fault of this article. There have previously been suggestions for a separate article for a Bayes-versus-classical comparison on an equal basis, and that would have been far better than adding confusion to an article that tries to explain what the classical approach is. Melcombe (talk) 10:30, 9 March 2011 (UTC)
Alas, advertising or rabble-rousing for a Wikipedia edit proposal.
Call it a summary. :-)
Organization Before: ...
- Potential misuse
- Criticism
- Significance and practical importance
- Meta-criticism
- Philosophical criticism (recently deleted)
- Pedagogic criticism
- Practical criticism
- Straw man
- Bayesian criticism
- Publication bias
...
Organization After: ...
- Radioactive suitcase example revisited
- Types of Hypothesis Tests
- Academic Status
- Usage
- Weakness
- Controversy
- Selected Criticisms
- Misuses and abuses
- Results of the controversy
- Alternatives to significance testing
- Future of the controversy
...
All of the listed original sections contained criticism. The volume of criticism was a substantial fraction of the article length. These sections were filled with quotations chosen for emotional impact rather than for factual content, and with embedded Wikipedia editorial comments. Earlier discussion characterized these sections as containing expressions of distaste or as an embarrassment.
The replacement sections are more focused on the topic of the article. Several sections illuminate hypothesis testing by comparison with statistical decision theory. There is mention of historical terminology (significance testing) which should make it easier to understand the titles of some of the references. The criticism is tersely summarized in a list format with a reference to a lengthy (60 page) discussion of the issues. Six other sources of criticism are cited - 4 books and 2 journal articles.
My goals:
- Focus on hypothesis testing.
- Reduce the volume and the emotional impact of the criticism sections.
- Summarize the criticism and provide references to it.
I managed to summarize Bayesian criticism in totally non-mathematical terms which answers one editorial comment requesting clarification.
The result is shorter than the original.
The new content generally has adequate citations to justify the claim of factual content from existing sources.
Several links to the criticism section are broken by the edit.
There is more reference to things Bayesian than desirable in an article on frequentist statistics. This is very bad since I regard the opening disambiguation as flawed. Most veterans of Statistics 101 haven't heard of Bayes Theorem, Conditional Probability or Frequentist vs. Bayesian Statistics. They will not understand the first sentence. I do not understand the nuances. I took Probability rather than Statistics and avoid Philosophy. —Preceding unsigned comment added by 159.83.196.133 (talk) 17:32, 12 March 2011 (UTC)
Melcombe, I found a reference that explains the differences between frequentist and Bayesian approaches more clearly. I will clean up the two affected sections appropriately. —Preceding unsigned comment added by 159.83.196.1 (talk) 23:33, 15 March 2011 (UTC)
- I am certainly not an arbiter here, and it wasn't me who reverted your(?) attempt at a major revision in one go. I guess I was responsible for moving to this article most of the similar supposed-criticism text from other articles and, although I did do some changes to merge the stuff together, what is left certainly merits a major revamp. And a reduction in size if it is to be left in this article. There needs to be care in choosing references that claim to understand the so-called "frequentist" approach, as the "frequency probability" interpretation of probability is not needed to justify the classical approach to hypothesis testing. I do suggest that you get a wikipedia id, rather than using an IP address, as such contributions are taken more seriously, and that you learn to sign your contributions on these Talk pages. Melcombe (talk) 09:56, 16 March 2011 (UTC)
One-sample chi-squared test
I think this row in the table is messed up: it appears to be a standard test on variance but then the Assumptions or notes column looks like the assumptions for a test on the distribution of a categorical variable. 130.127.186.244 (talk) 12:32, 8 November 2011 (UTC)
- Well spotted. I've deleted that row completely for now (from the table in the Statistical hypothesis testing#Common test statistics section). The formula was for the Chi-squared test#Chi-squared test for variance in a normal population which isn't commonly used, in my experience at least. I'm tempted to remove "Two-proportion z-test, unpooled for | d0 | > 0" as well for the same reason. Comments / objections? Qwfp (talk) 17:31, 8 November 2011 (UTC)
Issues from 2012
Duplicate references
References 20 & 25 appear to be duplicates. Is one citation better than the other? 159.83.196.1 (talk) 23:41, 4 February 2012 (UTC)
- Now fixed. Melcombe (talk) 01:45, 5 February 2012 (UTC)
Sundry gripes - maybe opportunities for improvement
On January 30, 2012 the opinions of this article were:
Trustworthy: 3.9/5; Objective: 4.3/5; Complete: 4.0/5; Well-Written: 2.6/5.
In summary, the article has a lot of merit, but is poorly written?! The article suffers from several deficiencies:
- Reader expectations are inconsistent.
- The Examples are suitable for the novice.
- The Definition of terms and The testing process are taken from texts for PhD candidates.
- The Controversy section is most suitable for graduate school seminars, not Stat 101.
- The sections are unbalanced.
- The Interpretation and Importance sections are weak.
- The Definition of terms and Controversy sections are too long.
- The sections are not integrated.
- The examples and test statistics are not further discussed in the text.
- The defined terminology is little used.
- The testing process uses regions, while p-values are used elsewhere.
- The article does not have the proper supporting figures.
- Examples & Definition of terms would benefit.
159.83.196.1 (talk) 22:06, 7 February 2012 (UTC)
- I am inclined to delete the following definitions: Similar test, Most powerful test, Uniformly most powerful test (UMP), Consistent test, Unbiased test, Conservative test, Uniformly most powerful unbiased (UMPU). They are Stat 801 rather than Stat 101 material. Only one of the definitions leads to another article. Only one has a reference. I cannot utilize the definitions in other sections, provide examples, etc. This article is better with less. Comments?159.83.196.1 (talk) 01:52, 16 March 2012 (UTC)
- I agree with deleting those with the exception of "Conservative test", as I seem to come across that concept reasonably often in my experience as an applied statistician, while the others seem of more theoretical interest. Qwfp (talk) 11:38, 16 March 2012 (UTC)
- You should keep "most powerful" and "uniformly most powerful" tests, which appear in calculus-based statistics texts, such as Hogg & Ellis, or the Florida book (Schaeffer, Mendenhall, and a 3rd), along with the Neyman--Pearson lemma. Kiefer.Wolfowitz 21:42, 17 March 2012 (UTC)
- Done.159.83.196.1 (talk) 22:43, 24 March 2012 (UTC)
Asterisks
What do the asterisks sprinkled throughout the Name column of the Common test statistics table mean? Examples: many, but not all, of the two-sample tests.159.83.196.1 (talk) 00:36, 15 February 2012 (UTC)
- These originate with this version, http://en.wikipedia.org/w/index.php?title=Statistical_hypothesis_testing&oldid=293990353 (June 2009), of the article. Neither that nor the immediately following edits gave a meaning for * anywhere close to the table, so far as I can see. Melcombe (talk) 03:07, 15 February 2012 (UTC)
- Deleted.159.83.196.1 (talk) 21:54, 21 February 2012 (UTC)
Clairvoyant example
Clearly, the clairvoyant example is badly formulated: the null hypothesis should be that P=0.25 and not that P<0.25. Indeed, a common argument in ESP studies is that getting all of the answers consistently wrong (far outside of chance) is just as sure a sign of clairvoyance as getting them all right. I don't know how to fix this example myself, though.linas (talk) 19:03, 11 April 2012 (UTC)
- The example specifically addresses your issue, "But what if the subject did not guess any cards at all? ... While the subject can't guess the cards correctly, dismissing H0 in favour of H1 would be an error. In fact, the result would suggest a trait on the subject's part of avoiding calling the correct card. A test of this could be formulated: for a selected 1% error rate the subject would have to answer correctly at least twice, for us to believe that card calling is based purely on guessing." You clearly disagree with the editor of the example.159.83.196.1 (talk) 21:30, 17 April 2012 (UTC)
All wrong
I have commented out the part in the Clairvoyant example about the situation where all cards are identified wrong. What goal does it serve adding this? Nijdam (talk) 12:36, 29 April 2012 (UTC)
- See section "Clairvoyant example" almost immediately above on this talk page.Melcombe (talk) 23:21, 30 April 2012 (UTC)
- There would be some advantage to providing both one-tailed and two-tailed solutions to answer the question raised above. All current examples are of one-tailed tests and the sole remaining statement implies that the use of one-tailed tests is misleading.159.83.196.1 (talk) 22:31, 1 May 2012 (UTC)
- The cumulative probability for c=12: 0.0107 (by spreadsheet). Requirement: less than 1%. By the definition provided, c=13. Region of acceptance: 0-12. Region of rejection: 13-25. Resulting significance level: 0.0034. The cumulative probability for c=12 was confirmed by http://stattrek.com/online-calculator/binomial.aspx 159.83.196.1 (talk) 19:24, 5 May 2012 (UTC)
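A quick sketch reproducing those tail probabilities with SciPy (same n = 25 and p = 1/4 as above):

```python
from scipy.stats import binom

# binom.sf(k, n, p) returns P(X > k), so P(X >= c) is sf(c - 1).
n, p = 25, 0.25
print(binom.sf(11, n, p))  # P(X >= 12) ~= 0.0107, just above the 1% level
print(binom.sf(12, n, p))  # P(X >= 13) ~= 0.0034, the resulting significance level
```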
Split Controversy?
The controversy section has grown again. While the main topic remains of tolerable length, the controversy section probably belongs in a separate article.
Hypothesis testing and its controversies are typically covered in different texts intended for different audiences. Hypothesis testing is not controversial in all fields, but has been intensely controversial in some.
It has proven impossible to limit the size of the section.159.83.196.1 (talk) 20:57, 7 December 2012 (UTC)
- On the one hand, the controversy is an entire field of study unto itself and deserves its own page. On the other, an entire field of study exists criticizing NHST (not necessarily hypothesis testing per se), and this is important information that should be emphasized to any new statistics student who may only visit the main page and fail to follow the controversy link.
- NHST continues to be presented in textbooks as a universally accepted (and objective) method on which all experiments should be based, and it is of utmost importance that people coming to this page are made aware that this is not true. The textbooks are not giving a balanced view, while Wikipedia has the opportunity to do so, which may be most effective by not creating a separate page. Personally I think it should have its own page. 207.229.179.97 (talk) 08:11, 8 December 2012 (UTC)