Misuse of p-values
The p-value fallacy is the binary classification of experimental results as true or false based on whether or not they are statistically significant. It derives from the assumption that a p-value summarizes an experiment's results, when it is in fact a heuristic that is not always useful.[1][2]
Dividing results into significant and nonsignificant effects can be highly misleading, and is generally inferior to the use of Bayes factors (also called likelihood ratios).[1][2] For instance, analyses of nearly identical datasets can produce p-values that fall on opposite sides of the significance threshold, as in the sketch below.[1] In medical research, p-values were a considerable improvement over earlier approaches, but misunderstandings of them have become a greater problem as the statistical complexity of published research has increased.[2] It has been suggested that in fields such as psychology, where studies typically have low statistical power, significance testing increases error rates.[1][3]
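The sensitivity of significance verdicts to small changes in the data can be illustrated with a minimal simulation (an illustration of the general point, not an example from the cited papers; the datasets are invented):

```python
import numpy as np
from scipy import stats

# Two nearly identical datasets: dataset_b differs from dataset_a in a
# single observation (-0.3 changed to 0.0).
dataset_a = np.array([0.5, -0.2, 0.4, 0.1, 0.6, -0.1, 0.3, 0.2, -0.3, 0.5])
dataset_b = np.array([0.5, -0.2, 0.4, 0.1, 0.6, -0.1, 0.3, 0.2,  0.0, 0.5])

# One-sample t-test of the null hypothesis that the true mean is zero.
for name, data in [("dataset A", dataset_a), ("dataset B", dataset_b)]:
    t, p = stats.ttest_1samp(data, popmean=0.0)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{name}: t = {t:.2f}, p = {p:.3f} ({verdict} at alpha = 0.05)")

# dataset A gives p ~ 0.08 (not significant) while dataset B gives
# p ~ 0.03 (significant), despite differing in one value out of ten.
```

A binary significant/nonsignificant reading treats these two results as opposites, even though the two datasets carry nearly the same evidence.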
In the p-value fallacy, a single number is used to represent both the long-run false positive rate under the null hypothesis H0 and the strength of the evidence against H0 in a particular experiment. However, there is a trade-off between these two roles, and no single number can serve both at once.[2] Neyman and Pearson described the trade-off as between controlling error rates over the long run and evaluating the conclusions of a specific experiment, but a common misinterpretation of p-values is that this trade-off can be avoided.[2] Another way to view the error is that medical studies are often designed with a Neyman–Pearson approach but analyzed with a Fisherian one.[4] This is not a contradiction between frequentist and Bayesian reasoning, however, but a basic property of p-values that applies in both frameworks.[5]
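The long-run reading can be checked directly: under a true null hypothesis, p-values are uniformly distributed, so the rule "reject when p < α" is wrong in exactly a fraction α of experiments over the long run, while saying nothing about the evidence in any single experiment. A sketch of that check (my own illustration, not taken from the cited sources):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
alpha, n_experiments, n_obs = 0.05, 10_000, 20

# Simulate many experiments in which H0 is true: every sample is drawn
# from a normal distribution whose mean really is zero.
data = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, n_obs))
pvals = stats.ttest_1samp(data, popmean=0.0, axis=1).pvalue

# Neyman-Pearson guarantee: rejecting when p < alpha produces false
# positives at rate alpha over the long run...
print(f"false positive rate: {np.mean(pvals < alpha):.3f} (alpha = {alpha})")

# ...because p is uniform under H0. The guarantee is a property of the
# procedure, not a measure of evidence in any one experiment: under H0,
# p-values near 0.05 are no rarer than p-values near 0.95.
print(f"fraction with 0.04 < p < 0.06: {np.mean((pvals > 0.04) & (pvals < 0.06)):.3f}")
print(f"fraction with 0.94 < p < 0.96: {np.mean((pvals > 0.94) & (pvals < 0.96)):.3f}")
```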
This fallacy is contrary to the intent of the statisticians who originally supported the use of p-values in research.[2][6] As described by Sterne and Smith, "An arbitrary division of results, into 'significant' or 'non-significant' according to the P value, was not the intention of the founders of statistical inference."[6] By contrast, common interpretations of p-values blur the distinction between statistical results and scientific conclusions, and discourage consideration of background knowledge such as previous experimental results.[2] The correct use of p-values is to guide behavior, not to classify results;[1] that is, to inform a researcher's choice of which hypothesis to accept, not to provide an inference about which hypothesis is true.[2]
The use of significance testing as the basis for decisions has also been criticized as a whole, because of the p-value fallacy and other widespread misunderstandings of the process.[2][6][7] For example, p-values say nothing about the probability that the null hypothesis is true or false, which can be addressed only with the Bayes factor,[8] and the significance threshold should not be chosen arbitrarily but should instead reflect the consequences of a false positive.[1] Bayes factors can be used to calibrate p-values, which preserves their use while reducing the impact of the p-value fallacy, although such calibrations introduce other biases as well.[5]
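The calibration proposed in the Sellke, Bayarri, and Berger paper cited above[5] bounds the Bayes factor in favor of H0 at B(p) ≥ -e·p·ln(p) for p < 1/e. A short sketch of that bound (the function name is mine):

```python
import math

def bayes_factor_bound(p):
    """Sellke-Bayarri-Berger lower bound on the Bayes factor in favor of
    the null hypothesis: B(p) >= -e * p * ln(p), valid for 0 < p < 1/e."""
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("the calibration applies only for 0 < p < 1/e")
    return -math.e * p * math.log(p)

for p in (0.05, 0.01, 0.001):
    b = bayes_factor_bound(p)
    # With 50:50 prior odds on H0 and H1, the posterior probability of H0
    # is at least b / (1 + b).
    print(f"p = {p}: Bayes factor for H0 >= {b:.3f}, P(H0 | data) >= {b / (1 + b):.3f}")
```

At p = 0.05 the bound works out to roughly 0.41, so even a just-significant result leaves the null hypothesis with a posterior probability of at least about 0.29 under equal prior odds, far weaker evidence than a "1 in 20" reading suggests.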
References
- ^ Dixon P (2003). "The p-value fallacy and how to avoid it". Can J Exp Psychol. 57 (3): 189–202. PMID 14596477.
- ^ Goodman SN (1999). "Toward evidence-based medical statistics. 1: The P value fallacy". Ann Intern Med. 130 (12): 995–1004. PMID 10383371.
- ^ Hunter, John E. (1997). "Needed: A Ban on the Significance Test". Psychological Science. 8 (1): 3–7. doi:10.1111/j.1467-9280.1997.tb00534.x.
- ^ de Moraes AC, Cassenote AJ, Moreno LA, Carvalho HB (2014). "Potential biases in the classification, analysis and interpretations in cross-sectional study: commentaries - surrounding the article "resting heart rate: its correlations and potential for screening metabolic dysfunctions in adolescents"". BMC Pediatr. 14: 117. doi:10.1186/1471-2431-14-117. PMC 4012522. PMID 24885992.
- ^ Sellke, Thomas; Bayarri, M. J.; Berger, James O. (2001). "Calibration of p values for testing precise null hypotheses". The American Statistician. 55 (1): 62–71. doi:10.1198/000313001300339950.
- ^ Sterne, J. A. C.; Smith, G. Davey (2001). "Sifting the evidence–what's wrong with significance tests?". BMJ (Clinical research ed.). 322 (7280): 226–231. doi:10.1136/bmj.322.7280.226. PMC 1119478. PMID 11159626.
- ^ Schervish, M. J. (1996). "P Values: What They Are and What They Are Not". The American Statistician. 50 (3): 203. doi:10.2307/2684655. JSTOR 2684655.
- ^ Emmanuel Lesaffre; Andrew B. Lawson (18 June 2012). "P-value, Bayes factor, and posterior probability". Bayesian Biostatistics. John Wiley & Sons. ISBN 978-1-118-31457-9.