This is the talk page for discussing improvements to the Misuse of p-values article. This is not a forum for general discussion of the article's subject.
This article is rated Start-class on Wikipedia's content assessment scale.
An excellent start! The article could use some section headings and reorganization to improve the flow, however. Perhaps the following outline would work:
Cheers. Boghog ( talk) 07:56, 21 February 2016 (UTC)
The jargon, and the wiki-linked phrases that many readers will need to follow, may make this a little abstract for the lay reader. I don't know if there's any place for it, but I've found a good teaching aid in this article:
It does point up the pitfalls of small sample size and the inevitability of false positives when you look at enough measurements. Even if it's not used, it's worth a read. -- RexxS ( talk) 00:51, 22 February 2016 (UTC)
I'm having a hard time understanding what this article is trying to communicate. Let's start at the beginning. The first sentence is, "The p-value fallacy is the binary classification of experimental results as true or false based on whether or not they are statistically significant." This is excellent. I understand what it's saying and agree with it entirely. But moving ahead to the second paragraph, we are told that "Dividing data into significant and nonsignificant effects ... is generally inferior to the use of Bayes factors". Is it? From a Bayesian perspective, certainly. But a Bayesian would tell you that you shouldn't be doing hypothesis testing in the first place. Whereas a frequentist would tell you that a Bayes factor doesn't answer the question you set out to ask and that you need a p-value to do that.
In a pure Bayesian approach, you begin with a prior probability distribution (ideally one which is either weakly informative or made from good expert judgment) and use your experimental results to create a posterior distribution. The experimenter draws conclusions from the posterior, but the manner in which he or she draws conclusions is unspecified. Bayesian statistics does not do hypothesis testing, so it cannot reject a null hypothesis, ever. It does not even have a null hypothesis. At most you might produce confidence intervals, but these confidence intervals are not guaranteed to have good frequentist coverage properties; a 95% Bayesian confidence region says only that, under the prior distribution and given the observed data, there is a 95% chance that the true value of the parameter lies in the given region. It says nothing about the false positive rate under the null hypothesis because in Bayesian statistics there is no such thing as a null hypothesis or a false positive rate.
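The posterior-and-interval workflow described above can be made concrete with a small sketch. All numbers here (a uniform prior, 60/40 coin-flip data) are illustrative assumptions, not anything from the article under discussion:

```python
# Beta-binomial sketch: a uniform Beta(1, 1) prior updated with 60 heads
# and 40 tails gives a Beta(61, 41) posterior by conjugacy; the central
# 95% region is a Bayesian confidence (credible) interval, estimated
# here by sorting posterior samples.
import random

random.seed(0)
prior_a, prior_b = 1, 1                    # weakly informative uniform prior
heads, tails = 60, 40                      # illustrative data
a, b = prior_a + heads, prior_b + tails    # conjugate posterior Beta(61, 41)

draws = sorted(random.betavariate(a, b) for _ in range(100_000))
mean = sum(draws) / len(draws)
lo, hi = draws[2_500], draws[97_500]
print(f"posterior mean ~= {mean:.3f}")     # ~0.60
print(f"95% credible interval ~= ({lo:.3f}, {hi:.3f})")
```

As the comment above notes, nothing in this computation rejects or accepts a null hypothesis; it only summarizes the posterior.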
Let's move on to the "Misinterpretation" section. It begins, "In the p-value fallacy, a single number is used to represent both the false positive rate under the null hypothesis H0 and also the strength of the evidence against H0." I'm not sure what the latter half of this sentence means. The p-value is, by definition, the probability, under the null hypothesis, of observing a result at least as extreme as the test statistic, that is, the probability of a false positive. That's one part of what the article defines as the p-value fallacy. But what about "the strength of the evidence against H0"? What's that? Perhaps it's intended to be a Bayes factor; if so, then you need a prior probability distribution; but hypothesis testing can easily be carried out without any prior distribution whatsoever. Given that I don't understand the first sentence, I guess it's no surprise that I don't understand the rest of the paragraph, either. What trade-off is being discussed, exactly?
The paragraph concludes that something "is not a contradiction between frequentist and Bayesian reasoning, but a basic property of p-values that applies in both cases." This is not true. There is no such thing as a Bayesian p-value. The next paragraph says, "The correct use of p-values is to guide behavior, not to classify results; that is, to inform a researcher's choice of which hypothesis to accept, not provide an inference about which hypothesis is true." Again, this is not true. A p-value is simply a definition made in the theory of hypothesis testing. There are conventions about null hypotheses and statistical significance, but the article makes a judgmental claim about the rightness of certain uses of p-values which is not supported by their statistical meaning. Moreover, p-values are used in frequentist statistical inference.
The last paragraph claims, "p-values do not address the probability of the null hypothesis being true or false, which can only be done with the Bayes factor". The first part of the sentence is correct. Yes, p-values do not address the probability of the null hypothesis being true or false. But the latter part is not. Bayes factors do not tell you whether the null hypothesis is true or false either. They tell you about odds. Your experiment can still defy the odds, and if you run enough experiments, eventually you will defy the odds. This is true regardless of how you analyze your data; the only solution is to collect (carefully) more data.
It sounds to me like some of the references in the article are opposed to hypothesis testing. (I have not attempted to look at them, though.) I am generally skeptical of most attempts to overthrow hypothesis testing, since it is mathematically sound even if it's frequently misused and misunderstood. I think it would be much more effective for everyone to agree that 95% confidence is too low and that a minimum of 99.5% is necessary for statistical significance. Ozob ( talk) 03:59, 22 February 2016 (UTC)
I am not an expert in statistics, but I am familiar with parts of it, and I can't understand this article. Introductory textbooks teach that a p-value expresses the probability of seeing the sample, or a sample more "extreme" than it, given the null hypothesis. Is the p-value fallacy a particular misinterpretation of this conditional probability, such as the probability of the null hypothesis given the data? Try to explain where exactly the fallacious reasoning becomes fallacious. Mgnbar ( talk) 04:20, 22 February 2016 (UTC)
"...the mistaken idea that a single number can capture both the long-run outcomes of an experiment and the evidential meaning of a single result."

and

"the question is whether we can use a single number, a probability, to represent both the strength of the evidence against the null hypothesis and the frequency of false-positive error under the null hypothesis... it is not logically possible."

The trade-off is that you can't have both of these at the same time.
"we are not discussing a conflict between frequentist and Bayesian reasoning, but are exhibiting a fundamental property of p values that is apparent from any perspective."

I think my wording may have been unclear since it could be interpreted as referring to the preceding sentence and not the fallacy in general, or perhaps I shouldn't have converted "any perspective" to "these specific perspectives" (even though the latter is logically contained in the former). Part of the point of including this statement was to reinforce the point that the fallacy is not about frequentism vs Bayesianism. It's definitely not intended to claim that there are Bayesian p-values, and proposed alternative wordings would be appreciated. :-)
"in some cases the classical P-value comes close to bringing such a message," which I left out since it didn't seem directly relevant, but I'd be fine with adding that.
Continuing on from the above, I've had another read through of this article and, the more I read, the more flawed it appears to me. It is inconsistent about precisely what the "p-value fallacy" is and, more broadly, reads as simply a Bayesian critique of p-values. There is useful material here on challenges with p-values, common mistakes, and that Bayesian critique, but all of that would be better integrated into the statistical significance article, not separated off here ( WP:CFORK). I question notability for an article under the present name, and propose a merger into statistical significance. Bondegezou ( talk) 15:09, 22 February 2016 (UTC)
"analysis of nearly identical datasets can result in p-values that differ greatly in significance" - he presents two datasets, one of which has a significant main effect and a nonsignificant interaction effect, while the other has a significant interaction effect but a nonsignificant main effect.
Headbomb has added a section on the odds of a false positive statistically significant result. Plenty of good material there, but this is a separate issue to the purported p-value fallacy that this article is about. Why not move that section to p-value or Type I and type II errors? This looks like classic content forkery to me. This article should not be a dumping ground for all flaws with or misunderstandings of the p-value. Bondegezou ( talk) 16:39, 22 February 2016 (UTC)
I note there are 4 citations given in this section. I have looked through them all (7, 8, 9, 10): none of them use the term "p-value fallacy". Indeed, I note that other citations given in the article (2, 6, 11) do not use the term "p-value fallacy". If no citations in this section make any reference to "p-value fallacy", then I suggest again that this section is not about the p-value fallacy and constitutes WP:OR in this context. There is good material here that I am happy to see moved elsewhere, but this is not relevant here. Would others than me and Headbomb care to chime in? Can anyone firmly link this section to the p-value fallacy? Bondegezou ( talk) 10:27, 23 February 2016 (UTC)
I think it would be possible to explain the fallacy better. Here is my attempt: When an experiment is conducted, the question of interest is whether a certain hypothesis is true. However, p values don't actually answer that question. A p value actually measures the probability of the data, assuming that the hypothesis is correct. The fallacy consists in getting this backward, by believing that the p value measures the probability of the hypothesis, given the data. Or to put it a bit differently, the fallacy consists in believing that a p value answers the question one is interested in, rather than answering a related but different question. Looie496 ( talk) 18:22, 22 February 2016 (UTC)
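The distinction Looie496 draws between P(data | hypothesis) and P(hypothesis | data) can be demonstrated with a quick simulation. The 50% prior, effect size, and sample size below are assumptions chosen purely for illustration:

```python
# Simulation of the "backward" reading of a p-value. Half the simulated
# experiments have a real effect and half do not; among those reaching
# p < 0.05, the share in which H0 was actually true is NOT 5%.
import math
import random

random.seed(1)
N, n = 20_000, 16          # number of experiments, observations per experiment
alpha, effect = 0.05, 0.5  # significance cutoff; mean shift when H1 is true

sig = sig_and_null = 0
for _ in range(N):
    null_true = random.random() < 0.5          # assumed 50% prior on H0
    mu = 0.0 if null_true else effect
    xbar = random.gauss(mu, 1 / math.sqrt(n))  # sample mean, known sd = 1
    z = xbar * math.sqrt(n)
    p = math.erfc(abs(z) / math.sqrt(2))       # two-sided p-value
    if p < alpha:
        sig += 1
        sig_and_null += null_true
print(f"P(H0 true | p < 0.05) ~= {sig_and_null / sig:.2f}")
```

Even with these fairly generous assumptions, the fraction of significant results for which the null is actually true comes out near 9%, not 5%; with lower power or a lower prior it diverges much further, which is exactly why the backward reading is a fallacy.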
Why don't you guys read the references? Sellke et al. make it quite clear in their paper. They indicate three misinterpretations of the p-value, the first of which corresponds to the p-value fallacy sensu stricto as formulated by Goodman in 1999. The second one is what Looie496 explained above. All three misinterpretations are interrelated to some extent and merit being explained in the article, although the first one should be the one more extensively described per most of the literature.
"the focus of this article is on what could be termed the "p value fallacy," by which we mean the misinterpretation of a p value as either a direct frequentist error rate, the probability that the hypothesis is true in light of the data, or a measure of odds of H0 to H1. [The term "p value fallacy" was used, in the first of these senses, in the excellent articles Goodman (1999a,b).]" (Sellke et al., 2001)
Given the large number of publications on the topic and the fact that it has a consistently used name, I find it a terrible idea to merge into the p-value article. Neodop ( talk) 01:42, 23 February 2016 (UTC)
So let me take a stab at explaining this, based on the Colquhoun paper.
I'm not confident that this explanation is correct or even sensible. I'm just trying to convey the kind of mathematical detail that would help me understand how this particular misinterpretation of probabilities differs from another. Also, Wikipedia's audience is diverse. The article should offer an explanation in plain English. But it should also offer an explanation in mathematical notation. Mgnbar ( talk) 16:10, 23 February 2016 (UTC)
"we could not both control long-term error rates and judge whether conclusions from individual experiments were true."

So it's just a more complicated description that incorporates a rationale for why the reasoning fails. The only case where I see alternative definitions is Sellke, and in that case the options still seem to refer to the same essential error, which is the use of p-values as direct evidence about the truth of the hypothesis. As noted above, that paper was also published soon after the term was coined, so it's also possible the other versions just didn't catch on. (If I'm misunderstanding this then we could provide multiple definitions, or merge it somewhere, though I think it would be WP:UNDUE to have all of this at the main p-value article.) Sunrise ( talk) 08:27, 25 February 2016 (UTC)
The defense of Bayesian analysis should not be as prominent in the article, and definitely not included in the lead (it is highly misleading, no pun intended :)). Most of the authors exposing the misuse of p-values do defend Fisherian hypothesis testing and are highly critical of "subjective" Bayesian factors, etc. particularly Colquhoun. Neodop ( talk) 14:57, 23 February 2016 (UTC)
I just put some {{ dubious}} tags on the false discovery rate section. My objection is as follows: Suppose that we run many hypothesis tests; we get some number of rejections of the null hypothesis ("discoveries"). The FDR is defined as the expected proportion of those discoveries which are erroneous, i.e., false positives. There are statistical procedures that effectively control the FDR (great!). But if we get 100 rejections using a procedure that produces an FDR of 5%, that means that we expect 5 false positives, not that we have 5 false positives. We might have 1 false positive. Or 10. Because of this, the probability that a given rejection is a false positive is not 5%, even though the FDR is 5%! All we know is that if we were to repeat our experiment over and over, we would expect 5 false positives on average, i.e., the probability of a discovery being a false positive is 5% on average (but perhaps not in the particular experiment we ran). The article does not seem to make this distinction. Ozob ( talk) 16:29, 23 February 2016 (UTC)
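A toy simulation of the distinction Ozob draws, treating each discovery as an independent 5% false-positive Bernoulli trial (an assumption made for illustration; real FDR-controlling procedures induce dependence between tests):

```python
# Sketch: an FDR of 5% fixes the *expected* share of false discoveries,
# but the realized count in any one batch of 100 discoveries varies.
import random

random.seed(2)
# Each of 10 batches yields 100 discoveries; each discovery is
# independently a false positive with probability 0.05.
batches = [sum(random.random() < 0.05 for _ in range(100))
           for _ in range(10)]
print("false positives per batch of 100 discoveries:", batches)
print("average:", sum(batches) / len(batches))
```

The per-batch counts scatter around 5 rather than equaling 5, which is the distinction the {{ dubious}} tags are pointing at: controlling the FDR constrains the average, not any particular experiment.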
I have a rudimentary grasp of statistics - this article gets into the weeds way too fast. Can the folks working on this please be sure that the first section describes this more plainly, per WP:TECHNICAL? Thanks. Jytdog ( talk) 18:45, 23 February 2016 (UTC)
I'm stimulated to write since someone on twitter said that after reading this article he was more confused than ever.
I have a couple of suggestions which might make it clearer.
(1) There needs to be a clearer distinction between false discoveries that result from multiple comparisons and false discoveries in single tests. I may be partially responsible for this confusion because I used the term false discovery rate for the latter, though that term had already been used for the former. I now prefer the term "false positive rate" when referring to the interpretation of a single test. Thus, I claim that if you observe P = 0.047 in a single test and claim that you've made a discovery, there is at least a 30% chance that you're wrong, i.e. it's a false positive. See http://rsos.royalsocietypublishing.org/content/1/3/140216
(2) It would be useful to include a discussion of the extent to which the argument used to reach that conclusion is Bayesian, in any contentious sense of that term. I maintain that my conclusion can be reached without the need to invoke subjective probabilities. The only assumption that's needed is that it's not legitimate to assume any prior probability greater than 0.5 (to do so would be tantamount to claiming you'd made a discovery and that your evidence was based on the assumption that you are probably right). Of course if a lower prior probability were appropriate, then the false positive rate would be much higher than 30%.
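For concreteness, the "screening" version of the calculation in the linked paper can be sketched as simple arithmetic. The prevalence, power, and alpha below are the paper's illustrative values; this sketch does not reproduce the separate "p-equals" argument behind the at-least-30% figure for P = 0.047:

```python
# Screening calculation: if only 10% of tested hypotheses are real
# effects, a test with 80% power run at alpha = 0.05 yields roughly a
# 36% false positive rate among the "discoveries".
prior_real, power, alpha = 0.10, 0.80, 0.05

true_pos = prior_real * power           # real effects correctly detected
false_pos = (1 - prior_real) * alpha    # true nulls wrongly rejected
fdr = false_pos / (true_pos + false_pos)
print(f"false positive rate among discoveries = {fdr:.0%}")  # 36%
```

Nothing in this arithmetic requires subjective probabilities; only the assumed prevalence of real effects among the hypotheses tested.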
(3) I don't like the title of the page at all. "The P value fallacy" has no defined meaning - there are lots of fallacies. There is nothing fallacious about a P value. It does what's claimed for it. The problem arises because what the P value does is not what experimenters want to know, namely the false positive rate (though only too often the two are confused).
David Colquhoun ( talk) 19:49, 23 February 2016 (UTC)
I notice that the entry on False Positive Rate discusses only the multiple comparison problem, so this also needs enlargement. David Colquhoun ( talk) 11:34, 24 February 2016 (UTC)
From an average non-sciences reader/editor who was recently and not out of personal interest self-introduced to the p-value concept via Bonferroni correction, during a content discussion/debate, I find this explanation clear and straightforward (so hopefully it is accurate as well): the Skeptic's Dictionary: From Abracadabra to Zombies: p-value fallacy.
Also, this article's title seems practical, as "p-value fallacy" appears to be common enough - as in, for example (chosen at random), "The fallacy underlying the widespread use of the P value as a tool for inference in critical appraisal is well known, still little is done in the education system to correct it" - therefore, it is the search term I'd hope to find an explanation filed under, as a broad and notable enough subject that merits standalone coverage. -- Tsavage ( talk) 21:17, 23 February 2016 (UTC)
Maybe I'm missing something, but there does not yet seem to be consensus that there is a single, identifiable concept called "the p-value fallacy". For example:
Because there are multiple fallacies around p-values, Google searches for "p-value fallacy" might generate many hits, even if there is no single, identifiable fallacy.
So is this article supposed to be about all fallacies around p-values (like p-value#Misunderstandings)? If not, then is this particular p-value fallacy especially notable? Does it deserve more treatment in Wikipedia than other fallacies about the p-value? Or is the long-term plan to have a detailed article about each fallacy? Mgnbar ( talk) 15:25, 3 March 2016 (UTC)
Based on the discussion above, I'd like to propose turning the present article into a redirect to p-value#Misunderstandings. There is no one "p-value fallacy" in the literature, so the present article title will always be inappropriate. Moreover, p-value#Misunderstandings is quite short, especially given the breadth of its content. The content currently in this article would be better placed there where it's easier to find and will receive more attention. Ozob ( talk) 00:12, 4 March 2016 (UTC)
Comment: As I understand it, largely from reading in and around this discussion, what is usually considered the p-value fallacy is, in simple English, misusing p-value testing as a way to determine whether any one experimental result actually proves something.
There are a number of different mistaken beliefs about p-values that can each lead to misuse that results in the same fallacious application, i.e. to establish proof. So there appears to be one distinct overall p-value fallacy that is referred to in multiple sources, regardless of any other p-value misuses that may fall entirely outside of this definition.
This would mean that we have a notable topic (per WP:GNG), and perhaps some clarification to do within that topic. Have I oversimplified or plain got it wrong? (I'm participating as a guinea pig non-technical reader/editor - this article should be clear to me! :)-- Tsavage ( talk) 20:34, 4 March 2016 (UTC)
The main sources used for the substantive content of this article are the papers by Goodman and Dixon. They coined and promulgated the term "p-value fallacy", so they can be considered primary sources. At least, that's how something like WP:MEDMOS would describe them. Bondegezou ( talk) 08:21, 6 March 2016 (UTC)
Sources for future reference:
Manul ~ talk 20:02, 8 March 2016 (UTC)
Quoting the ASA's principles may help give the Wikipedia article some focus:
Manul ~ talk 20:11, 8 March 2016 (UTC)
My assumption was that this article would eventually become the expansion of p-value#Misunderstandings, with that section of the p-value article being a summary of this one. I haven't yet delved into the recent discussions, but from an outsider's perspective I can say that that's what readers expect. It would seem super confusing for this article to cover some (or one) misunderstanding but not others. Manul ~ talk 12:45, 10 March 2016 (UTC)
I've moved the article, since it seems like we have enough support for it. If the fork becomes unnecessary, we can just merge this into the main article. Since Bondegezou still prefers that option, I left the merge tags open and pointed it to this page instead. In the meantime, help with building additional content is appreciated! Sunrise ( talk) 00:56, 11 March 2016 (UTC)
Okay, I've now made some changes aimed at the different scope, and imported a lot of content from other articles. Explanations for specific edits are in the edit summaries. I used the ASA statement a couple of times, but there's a lot more information that could be added. One important thing is to finish dividing everything into appropriate sections, especially the list of misunderstandings from p-value#Misunderstandings, which is currently in the lead. That will probably need careful reading to figure out which parts are supported by which sources. Once that's done it will be a lot easier to work on expanding the individual sections. Sunrise ( talk) 01:56, 11 March 2016 (UTC)
I finally downloaded the Goodman (1999) paper that apparently coins the term "p-value fallacy" in its narrow sense. It's very different from the statistics that I usually read, so I'm hoping for some clarification here. For starters, Goodman criticizes another source for including this passage:
My interpretation of this passage, which is admittedly out of context, is:
So my questions are:
This paper seems very useful here. Bondegezou ( talk) 09:40, 27 May 2016 (UTC)
The p-value is not the probability that the null hypothesis is true or the probability that the alternative hypothesis is false. It is not connected to either. - The first sentence is unequivocally accurate. It essentially states that P(A|B) is not generally the same as P(B|A), where A is the event that the null is true, B is the observed data, and the p-value is P(B|A). However, the second sentence seems unnecessary and overly strong in saying the p-value, P(B|A), is "not connected" to the posterior probability of the null given the data, P(A|B). In fact, the two probabilities are, at least in some sense, "connected" by Bayes' rule: P(A|B) = P(B|A)P(A)/P(B).
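A numeric sketch of that "connection", treating the p-value as if it were P(B|A) for the sake of the point (itself a simplification, since a p-value is a tail probability rather than the probability of the exact data; all numbers below are assumed for illustration):

```python
# Bayes' rule links P(B|A) to P(A|B), but the two can be very different:
# here the data have probability 0.04 under the null yet the posterior
# probability of the null is about 0.09, not 0.04.
p_A = 0.5              # assumed prior probability that the null (A) is true
p_B_given_A = 0.04     # probability of the data (B) under the null
p_B_given_notA = 0.40  # probability of the data under the alternative

p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)  # total probability
p_A_given_B = p_B_given_A * p_A / p_B                 # Bayes' rule
print(f"P(A|B) = {p_A_given_B:.3f}")                  # ~0.091
```

So the quantities are "connected" in the sense that one enters the formula for the other, while remaining numerically distinct, which seems to be the crux of the disagreement above.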
The p-value is not the probability that a finding is "merely a fluke." - I couldn't find the word "fluke" in any of the 3 sources cited for the section, so it is not clear (1) that this misunderstanding is indeed "common," and (2) what the word "fluke" means exactly in this context. If "merely a fluke" means that the null is true (i.e. the observed effect is spurious), then there seems to be no distinction between this allegedly common misunderstanding and the previous misunderstanding. That is, both misunderstandings are the confusion of P(A|B) with P(B|A), where A is the event that the null is true, B is the observed data, and the p-value is P(B|A).
The p-value is not the probability of falsely rejecting the null hypothesis. That error is a version of the so-called prosecutor's fallacy. - Here again, it is not clear exactly what this means, where in the cited sources this allegedly common misunderstanding comes from, and whether or not this "prosecutor's fallacy" is a distinct misunderstanding from the first one. The wiki article on prosecutor's fallacy suggests that there is no distinction--i.e. both misunderstandings confuse P(A|B) with P(B|A), where A is the event that the null is true, B is the observed data, and the p-value is P(B|A).
The significance level, such as 0.05, is not determined by the p-value. - Here again, is this really a common misunderstanding? Where is this allegedly common misunderstanding listed in the cited sources?
It should also be noted that the next section ("Representing probabilities of hypotheses") AGAIN seems to restate the first "common misunderstanding." The section also contains the following weirdly vague statement: "it does not apply to the hypothesis" (referring to the p-value). What is "the hypothesis?" — Preceding unsigned comment added by 50.185.206.130 ( talk) 08:46, 6 July 2016 (UTC)
Regarding your response to the first point, sure, the null is either true or false. But if someone doesn't know whether it's true or false, I don't see a problem with that person speaking in terms of probabilities based on the limited information they have access to. By analogy, if you don't know what card I've randomly drawn from a deck, you could speak of the probability of it being a red suit or a face card or the Ace of Spades, even though from an omniscient perspective there is no actual uncertainty--the card simply is what it is. I'm aware that there are different philosophical perspectives on this issue, but they are just that--perspectives. And if you're uncomfortable using the term "probability" for events in a frequentist framework, you can simply substitute "long-term frequency." In any case, I don't see how including the vague and potentially controversial statement that "it is not connected to either" is at all necessary or adds anything useful to the article section; the immediately preceding sentence is sufficient and straightforward.
Your response to the second point isn't clear to me. So "a fluke" means "unlucky?" And what is the "finding" in the phrase "finding is merely a fluke?" The data? So there is a common misunderstanding that the p-value is the probability of the data being unlucky? It's hard to see how that is even a coherent concept. Perhaps the misunderstanding just needs to be explained more clearly and with different vocabulary. Indeed, as I noted previously, the word "fluke" does not appear in any of the cited sources.
You didn't really respond to the third point, except to say that your response to the second point should apply. It seems we agree that the first misunderstanding is p = P(A|B) (even though you've noted that it's debatable whether P(A|B) is coherent in a frequentist framework). Isn't the "prosecutor's fallacy" also p = P(A|B)? In fact, the wiki article on prosecutor's fallacy appears to describe it precisely that way (except using I and E instead of A and B). Maybe part of the problem is the seemingly contradictory way the alleged misunderstanding is phrased: first it's described as thinking the p-value is the probability of falsely rejecting the null hypothesis (which appears to mean confusing the p-value with the alpha level), and then it's described as "a version of prosecutor's fallacy" (which appears to be something else entirely).
Your response to the fourth point seems to be POV. The functional difference between p=.04 and p=.035 may be relatively trivial in most cases, but p=.0000001 need not be treated as equivalent to p=.049999999 just because both are below some arbitrarily selected alpha level. Here again, there may be different perspectives on the issue, but we are supposedly talking about definitive misunderstandings, not potential controversies.
You didn't respond to my last point, regarding the "Representing probabilities of hypotheses" section. — Preceding unsigned comment added by 23.242.207.48 ( talk) 18:30, 10 July 2016 (UTC)
Your entire response to the first point is a red herring. The American Statistical Association's official statement on p-values (which is cited in this article; http://www.tandfonline.com/doi/full/10.1080/00031305.2016.1154108) notes that p-values can be used for "providing evidence against the null hypothesis"--directly contradicting the claim that p-values are "not connected" to the probability of the null hypothesis. If you insist that someone has to be called "Bayesian" to make that connection, fine--it is a connection nonetheless (and it is the connection that p-values' usefulness depends on). Furthermore, none of your response substantively speaks to the issue at hand: whether the statement "it is not connected to either" should be included in the article. Even if we accept your view that P(A|B) is meaningless, the disputed statement in the article does not communicate that premise. The article does not say, "The p-value is not the probability that the null hypothesis is true or the probability that the alternative hypothesis is false. Those probabilities are not conceptually valid in the frequentist framework." Instead, the article says, "The p-value is not the probability that the null hypothesis is true or the probability that the alternative hypothesis is false. It is not connected to either." Thus, even if we accept your premise, the statement is not helpful and should be removed. In fact, saying P(A|B) is "not connected" to P(B|A) might be taken to imply that the two probabilities orthogonally coexist--which would directly contradict your view. Given that there is no apparent reason for you to be attached to the disputed sentence even if all your premises are granted, I hope you will not object that I have removed it.
Regarding the second point, you defined "fluke" as "unlucky." I responded that "the probability that the finding was unlucky" (1) is an unclear concept and (2) does not obviously relate to any passages in the cited sources (neither "fluke" nor "unlucky" appear therein). Hence, with regard to your ad hominem, I do understand English--that doesn't make all combinations of English words intelligible or sensible. I repeat my suggestion that if there is an important point to be made, better vocabulary should be used to make it. Perhaps the business about "flukes" comes from the ASA's statement that "P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone" ( http://www.tandfonline.com/doi/full/10.1080/00031305.2016.1154108). Note that the statement combines the fallacy regarding the null being true and the fallacy regarding the data being produced by random chance alone into a single point. Why not use a similar approach, and similar language, in the wiki article? I hope you will not object that I have made such an adjustment.
Regarding the third point (regarding "prosecutor's fallacy"), you don't have a response adequately demonstrating that (1) the proposed misunderstanding is consistent with how prosecutor's fallacy is defined (note that the article equates prosecutor's fallacy with thinking the p-value is the probability of false rejection), (2) the proposed misunderstanding is non-redundant (i.e. prosecutors's fallacy should be distinct from the first misunderstanding), and (3) the proposed misunderstanding is listed in the cited sources (note that "prosecutor's fallacy" is not contained therein). In fact, your description of prosecutor's fallacy is EXACTLY misunderstanding #1--whether you're a "quasi-Bayesian" or a "Bayesian," the fallacy is exactly the same: P(A|B)=P(B|A). What framework is used to derive or refute that fallacy doesn't change the fallacy itself.
Regarding the fourth issue, if the point is that the alpha level must be designated a priori rather than as convenient for the obtained p-value, then we are certainly in agreement. I have not removed the item. But if this point is indeed commonly misunderstood, how about providing a citation?
Regarding the final issue, I see the point you are making. I hope you will not object that I have slightly adjusted the language to more closely match the ASA's statement that the p-value is "a statement about data in relation to a specified hypothetical explanation, and is not a statement about the explanation itself." — Preceding unsigned comment added by 23.242.207.48 ( talk) 14:36, 11 July 2016 (UTC) 23.242.207.48 ( talk) 14:45, 11 July 2016 (UTC)
I am pleased that we are finding common ground. You describe prosecutor's fallacy as "asserting that p-values measure the probability that the studied hypothesis is true" (direct quote). In the article, misconception #1 is described as thinking the p-value is "the probability that the null hypothesis is true" (direct quote). That makes misconception #1 essentially a word-for-word match for your definition of prosecutor's fallacy. It's hard to see how one could justify saying those are two different misconceptions when they are identically defined. It seems that you are making a distinction between two versions of an objection to the fallacy rather than between two different fallacies; perhaps the reference to prosecutor's fallacy should be moved to misconception #1.
Note also that your definition of prosecutor's fallacy doesn't match the way it's described in the bold text of misconception #3. Indeed, "the probability of falsely rejecting the null hypothesis" (article's words) is certainly not the same thing as "the probability that the null hypothesis is true" (your words). Thus, there is another reason the reference to prosecutor's fallacy does not seem to belong where it appears. — Preceding unsigned comment added by 23.242.207.48 ( talk) 10:22, 12 July 2016 (UTC)
I can't say I'm convinced. I'm also wary that expanding on the ASA's descriptions would violate wikipedia standards on original research, unless there are other reputable sources that explicitly identify two distinct common misunderstandings as you propose.
I also find misunderstanding #4 a bit peculiar: "The p-value is not the probability that replicating the experiment would yield the same conclusion." Are there really people who think a very low p-value means the results aren't likely to be replicable? It's hard to imagine someone saying, "p =.0001, so we almost certainly won't get significance if we repeat the experiment." I doubt many people think super-low p-values indicate less reliable conclusions. I also couldn't find this misunderstanding listed in any of the cited sources. Which paper and passage did it come from? 23.242.207.48 ( talk) 00:07, 14 July 2016 (UTC)
That rewriting still seems weird to me. So, many people think that a very high p-value (e.g. p=.8) means they will probably get significance if they repeat the experiment? I've never heard that. I'm removing misunderstanding #4 pending a sourced explanation. 23.242.207.48 ( talk) 11:08, 14 July 2016 (UTC)
Based on the discussion so far, it seems like the quality of the list of misunderstandings is doubtful. I feel like we need to clean up the list: Each misunderstanding should come with an inline citation, and the language should be carefully checked to ensure that it is correct and reflects what is in the source. Would anyone like to volunteer? Or propose a different solution? Ozob ( talk) 23:16, 14 July 2016 (UTC)
This is what I have so far:
Proposal
The following list addresses several common misconceptions regarding the interpretation of p-values:
Improvements would be appreciated. Have I interpreted everything correctly? Did I miss anything? Can we find citations for the unsourced parts? (at least a couple of them should be easy) There's also a comment in Sterne that directly addresses prevalence of a misconception, specifically that the most common one is (quote) "that the P value is the probability that the null hypothesis is true, so that a significant result means that the null hypothesis is very unlikely to be true," but I wasn't sure about how to best include that. Perhaps that (or other parts of the section) could be useful for the main p-value article. Sunrise ( talk) 08:11, 17 July 2016 (UTC)
Reasons:
1. The FDR section appears to refer to a misinterpretation of alpha levels--not a misinterpretation of p-values (note that p0 in the formula is the alpha level, not the p-value). Thus, the section is irrelevant to the article.
2. The statement that FDR increases when the number of tests increases is false. In fact, the FDR can either increase, decrease, or stay the same when the number of tests increases.
3. The given definition of the FDR appears to be incorrect ("the odds of incorrectly rejecting the null hypothesis"). The FDR is conventionally defined as the expected proportion of rejections that are incorrect (and defined as 0 when there are no rejections). — Preceding unsigned comment added by 2601:644:100:74B7:494A:EDB1:8541:281E ( talk) 06:59, 7 July 2016 (UTC)
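To make the distinction in points 1-3 concrete, here is a small Python sketch contrasting the two error rates. Every number in it (the counts of true nulls and real effects, the crude model for the p-values of real effects, the alpha level) is an assumption chosen purely for illustration, not something taken from the article or its sources:

```python
import random

# Toy simulation contrasting the FDR (expected proportion of rejections
# that are false) with the familywise error rate (probability of at
# least one false rejection).  All numbers here are illustrative.
random.seed(0)

ALPHA = 0.05
N_NULL, N_REAL = 90, 10   # 90 true null hypotheses, 10 real effects
N_RUNS = 20000

fdp_sum = 0.0             # running sum of per-run false-discovery proportions
any_false = 0             # number of runs with at least one false rejection

for _ in range(N_RUNS):
    # Under the null, p-values are uniform on [0, 1]; for real effects we
    # use a crude stand-in that pushes p-values toward 0.
    false_rej = sum(random.random() < ALPHA for _ in range(N_NULL))
    true_rej = sum(random.random() ** 4 < ALPHA for _ in range(N_REAL))
    total = false_rej + true_rej
    fdp_sum += false_rej / total if total else 0.0  # FDP is 0 if nothing is rejected
    any_false += 1 if false_rej else 0

print("estimated FDR (expected false-discovery proportion):", fdp_sum / N_RUNS)
print("familywise error rate (>= 1 false rejection):", any_false / N_RUNS)
```

With 90 true nulls tested at alpha = 0.05, the familywise rate comes out near 1 while the estimated FDR is a proportion well below it; the two printed quantities differ markedly, which is the point of the objections above.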
As I noted in my previous comment, the section does not even correctly define the FDR itself--let alone the "relation between the FDR and the p-value." 23.242.207.48 ( talk) 01:08, 9 July 2016 (UTC)
- That is simply incorrect. The FDR is not, as you have claimed, the probability of obtaining a false positive. The FDR is the expected proportion of significant tests (i.e., "positives") that are false positives. You are perhaps confusing the FDR with the familywise Type I error rate--just as the FDR section of this article does. Look at the formula given in the section--it is the formula for the familywise error rate, NOT FOR THE FDR! I again encourage you to actually read the wiki article on the FDR or, better yet, read the original Benjamini & Hochberg article that introduced the quantity in the first place. 23.242.207.48 ( talk) 03:42, 9 July 2016 (UTC)
As clever as the comic strip may be, it doesn't seem very encyclopedic to spend a paragraph summarizing it in this article. Similarly, it wouldn't make sense to dedicate a paragraph to summarizing the film Jaws in an article about great white sharks (though the film might be briefly mentioned in such an article).
The paragraph is also somewhat confusingly written (e.g. what does "to p > .05" mean?, what does "threshold that the results are due to statistical effects" mean?, and shouldn't "criteria of p > 0.05" be "criteria of p < 0.05?").
Another concern is that the punchline "Only 5% chance of coincidence!" is potentially confusing, because "5% chance of coincidence" is not an accurate framing of p < .05 even when there is only a single comparison.
If the jellybean example is informative enough to merit inclusion, I suggest either rewriting the summary more clearly and concisely (and without verbatim transcriptions such as "5% chance of coincidence"), or simply removing the summary and linking to the comic strip in the further reading section. 23.242.207.48 ( talk) 17:51, 12 July 2016 (UTC)
I've cleaned up the example so it references the comic without doing a frame-by-frame summary and without the confusing language. Perhaps this is a reasonable compromise? I'm still of the mind that the reference to the comic should probably be removed altogether, but it should at the very least be grammatically and scientifically correct in the meantime. 23.242.207.48 ( talk) 10:19, 13 July 2016 (UTC)
I'm inclined to agree with Bondegezou that detailed repetition of information available in other articles is unnecessary. By the same token, this whole article is arguably unnecessary and would be better as a short section in the p-value article than as a sprawling article all to itself, without much unique material (but it seems that issue has been previously discussed and consensus is to keep it). — Preceding unsigned comment added by 23.242.207.48 ( talk) 11:00, 14 July 2016 (UTC)
Please keep it! This makes the article more understandable than just a bunch of math, when we can see just how ridiculous these situations are when you ignore the implications! I urge editors to keep this example and add more to other sections, because right now it seems to be in danger of becoming like all the other math pages: useless unless you already know the topic or are a mathematician. You talk about null or alternative hypotheses, but never give any example! Who exactly do you think can understand this? You think someone who sees a health claim in a nutrition blog that checks a paper with a conclusion that prune juice cures cancer p < 0.05 knows that the null hypothesis means prune juice doesn't cure cancer? Or that an alternative hypothesis is that strawberries cure cancer? EXPLAIN THINGS IN WAYS PEOPLE WHO DON'T HAVE A PHD IN MATH CAN UNDERSTAND!
I am an educator at École Léandre LeGresley in Grande-Anse, NB, Canada and I agree to release my contributions under CC-BY-SA and GFDL. — Preceding unsigned comment added by 2607:FEA8:CC60:1FA:9863:1984:B360:4013 ( talk) 12:30, 21 July 2016 (UTC)
It's not clear what the statement "p-values do not account for the effects of confounding and bias" is supposed to mean. For example, what kind of "bias" is being referenced? Publication bias? Poor randomization? The experimenter's confirmation bias? Even the cited source (an opinion piece in a non-statistical journal) doesn't make this clear, which is probably why the statement in this article's misunderstandings list is the only one not accompanied by an explanation. Furthermore, the cited source doesn't even explicitly suggest that there's a common misunderstanding about the issue. So are people really under the impression that p-values account for "confounding and bias?" Those are general problems in research, not some failing of p-values in particular. I'm removing the statement pending an explanation and a better source. 23.242.207.48 ( talk) 02:07, 23 July 2016 (UTC)
This edit request to Misunderstandings of p-values has been answered. Set the |answered= or |ans= parameter to no to reactivate your request.
References #4 and #5 are identical. Please edit the citation to reference #5 that follows the sentence "The p-value fallacy is a common misinterpretation of the p-value whereby a binary classification of hypotheses as true or false is made, based on whether or not the corresponding p-values are statistically significant." to refer to reference #4 instead. Amoriarty21 ( talk) 23:11, 4 September 2018 (UTC)
This issue was raised years ago, and it appears that the conclusion in this talk page was that "p value fallacy" is not a standard, consistently defined term. Apparently, a single user has been fighting for its inclusion in this article, but it seems to me that is not enough. Certainly giving "p value fallacy" an entire section in the article amounts to undue weight for a term that is hardly ever actually used in science or statistics--making the term's inclusion here misleading regarding what terminology is commonly used. Moreover, as was pointed out in an earlier discussion on this talk page, in the rare cases when the term "p value fallacy" is actually used, it isn't used consistently. Thus, including a section on the "p value fallacy" is not only unnecessary for understanding the topic of the article, but is also potentially confusing. 164.67.15.175 ( talk) 21:12, 24 September 2018 (UTC)
The "top hit on Google scholar" that you're referring to (which is actually an opinion piece) defines the p-value fallacy as "the mistaken idea that a single number can capture both the long-run outcomes of an experiment and the evidential meaning of a single result." That is NOT the definition given in this wiki article: "The p-value fallacy is a common misinterpretation of the p-value whereby a binary classification of hypotheses as true or false is made." Thus, the "top hit on Google scholar" actually illustrates the point that "p value fallacy" is an inconsistently defined and potentially confusing term. Furthermore, the "p value fallacy" (as defined in the "top hit on Google scholar") isn't even demonstrably a fallacy, though the authors may consider it so. Thus, including it in this wiki article amounts to POV, which is inappropriate. This is supposed to be an article about objective MISUNDERSTANDINGS, not about controversial opinions. 23.242.198.189 ( talk) 01:57, 26 September 2018 (UTC)
That seems rather backward to me. It doesn't make sense to include a section just because we like the name of the section, without consideration for whether the content of the section is actually relevant to the topic of the article. Note also that the fact that a term or phrase has "some currency" is not enough to make that term merit a section in the article. People have come up with all sorts of terms, many of which have "some currency." That doesn't mean they all belong in an article on misunderstanding p-values. 164.67.15.175 ( talk) 00:04, 29 September 2018 (UTC)
Just checking in, I see that still no one has provided any counterargument in favor of keeping the content of the "p value fallacy" section. Please remove it. 23.242.198.189 ( talk) 01:23, 12 October 2018 (UTC)
"See the refs" is not a legitimate argument--especially given that the refs were obviously already "seen" because they were addressed in this discussion. 23.242.198.189 ( talk) 07:22, 16 October 2018 (UTC)
Let's settle this once and for all, now that the inappropriately applied "semi-protected status" has been lifted. We can go through the section sentence-by-sentence and see that it is not valid.
Sentence 1: The p-value fallacy is a common misinterpretation of the p-value whereby a binary classification of hypotheses as true or false is made, based on whether or not the corresponding p-values are statistically significant.
The cited source for that sentence defining the p-value fallacy is A PAPER THAT DOES NOT EVEN CONTAIN THE TERM "P-VALUE FALLACY." So right off the bat, we can see there is something very wrong here.
Sentence 2: The term 'p-value fallacy' was coined in 1999 by Steven N. Goodman.
The "p-value fallacy" defined by Goodman in the cited article is NOT what is described in the preceding sentence (the "binary classification of hypotheses as true or false"). Instead, Goodman defines "p-value fallacy" as "the mistaken idea that a single number can capture both the long-run outcomes of an experiment and the evidential meaning of a single result." In other words, Goodman is making a Bayesian critique of p-values. In fact, Goodman's paper is an OPINION PIECE that criticizes the use of "frequentist statistics" altogether! Goodman's opinion that using p-values in conventional frequentist null hypothesis testing is based on "fallacy" is just that--an opinion. It would be relevant in an article on controversies or debates about p-values, but this wiki article is supposed to be about MISUSES of p-values, so including POV here directly contradicts wiki policy.
Sentence 3: This fallacy is contrary to the intent of the statisticians who originally supported the use of p-values in research.
This is more POV that cites the same Goodman article. Curiously, this sentence also cites a Sterne and Smith article (another opinion piece), which DOES NOT EVEN CONTAIN THE TERM "P-VALUE FALLACY."
Sentence 4: As described by Sterne and Smith, "An arbitrary division of results, into 'significant' or 'non-significant' according to the P value, was not the intention of the founders of statistical inference."
That may or may not be true. It doesn't actually matter, because again, that Sterne and Smith opinion piece DOES NOT EVEN CONTAIN THE TERM "P-VALUE FALLACY," and what Sterne and Smith are describing here does not appear to even be equivalent to what Goodman defined as the p-value fallacy.
Sentence 5: In contrast, common interpretations of p-values discourage the ability to distinguish statistical results from scientific conclusions, and discourage the consideration of background knowledge such as previous experimental results.
This is POV again, that again cites the opinion piece by Goodman.
Sentence 6: It has been argued that the correct use of p-values is to guide behavior, not to classify results, that is, to inform a researcher's choice of which hypothesis to accept, not to provide an inference about which hypothesis is true.
This is POV yet again, that yet again cites the opinion piece by Goodman. At least here, the wording includes the phrase "It has been argued that..." to acknowledge the POV. It should be noted that in addition to citing the Goodman piece, the sentence also cites another article (one by Dixon). Dixon's article, in contrast to Goodman's, does in fact define the p-value fallacy similarly to how it is defined in Sentence 1. However, the fact is that the term SIMPLY HAS NOT CAUGHT ON. A Google scholar search shows that even the handful of articles that have cited some aspect or another of the Dixon paper have rarely (if ever) used the term "p-value fallacy." The same goes for articles that have cited the Goodman paper. In fact, if you search Google scholar for articles containing the phrase "p-value fallacy," in nearly every hit the phrase only appears in the reference section of the article (as part of a citation of the Goodman paper).
In summary, the "p-value fallacy" is: (a) not a term that is in common enough use to merit mention, (b) is a term that, even when it is used, is not used consistently, as this very wiki article illustrates, and (c) when used as the person who originally "coined" the term intended, is not even really a definitive fallacy and thus does not belong in this wiki article because it constitutes partisan Bayesian POV. It should also be noted that the problems with "p-value fallacy" section have been mentioned numerous times before in the past, going back years (search this talk page to see). It's time to put this silliness to bed once and for all. The section is unnecessary (because the term is fairly obscure), inappropriate (because it contains POV), and confusing (because it can't even agree with itself about the definition of the term it's talking about).
A final note: The main advocate for keeping the section has been the editor Headbomb, who showed similar resistance to removing the COMPLETELY INCORRECT section on the false discovery rate a while back (as shown in this talk page). When challenged to present an argument for keeping the "p-value fallacy" section (scroll up a few paragraphs), Headbomb said simply the following: "The section belongs here. See refs in the section." I hope that I have sufficiently demonstrated here that, after "seeing the refs," it is clearer than ever that the section does NOT belong here. — Preceding unsigned comment added by 23.242.198.189 ( talk) 04:47, 8 September 2019 (UTC)
Shouldn’t there be at least one? — Preceding unsigned comment added by 194.5.225.252 ( talk) 16:01, 2 December 2019 (UTC)
@ 23.242.198.189: You reverted my addition with the comment "Reverted good faith edit. It isn't clear what "in conflict" means. This seems like a subjective thing, not an objective misconception." In this case "in conflict" means to disagree about the underlying reality or to contradict each other. As far as I understand it, it is not subjective at all, but maybe we can find a wording that is better. Do you (or anyone else) have an idea to more clearly express the misconception? Nuretok ( talk) 14:24, 3 March 2021 (UTC)
The redirect P-hunting has been listed at redirects for discussion to determine whether its use and function meets the redirect guidelines. Readers of this page are welcome to comment on this redirect at Wikipedia:Redirects for discussion/Log/2024 April 21 § P-hunting until a consensus is reached. Utopes ( talk / cont) 17:35, 21 April 2024 (UTC)
Let's move on to the "Misinterpretation" section. It begins, "In the p-value fallacy, a single number is used to represent both the false positive rate under the null hypothesis H0 and also the strength of the evidence against H0." I'm not sure what the latter half of this sentence means. The p-value is, by definition, the probability, under the null hypothesis, of observing a result at least as extreme as the test statistic, that is, the probability of a false positive. That's one part of what the article defines as the p-value fallacy. But what about "the strength of the evidence against H0"? What's that? Perhaps it's intended to be a Bayes factor; if so, then you need a prior probability distribution; but hypothesis testing can easily be carried out without any prior distribution whatsoever. Given that I don't understand the first sentence, I guess it's no surprise that I don't understand the rest of the paragraph, either. What trade-off is being discussed, exactly?
The paragraph concludes that something "is not a contradiction between frequentist and Bayesian reasoning, but a basic property of p-values that applies in both cases." This is not true. There is no such thing as a Bayesian p-value. The next paragraph says, "The correct use of p-values is to guide behavior, not to classify results; that is, to inform a researcher's choice of which hypothesis to accept, not provide an inference about which hypothesis is true." Again, this is not true. A p-value is simply a definition made in the theory of hypothesis testing. There are conventions about null hypotheses and statistical significance, but the article makes a judgmental claim about the rightness of certain uses of p-values which is not supported by their statistical meaning. Moreover, p-values are used in frequentist statistical inference.
The last paragraph claims, "p-values do not address the probability of the null hypothesis being true or false, which can only be done with the Bayes factor". The first part of the sentence is correct. Yes, p-values do not address the probability of the null hypothesis being true or false. But the latter part is not. Bayes factors do not tell you whether the null hypothesis is true or false either. They tell you about odds. Your experiment can still defy the odds, and if you run enough experiments, eventually you will defy the odds. This is true regardless of how you analyze your data; the only solution is to collect (carefully) more data.
It sounds to me like some of the references in the article are opposed to hypothesis testing. (I have not attempted to look at them, though.) I am generally skeptical of most attempts to overthrow hypothesis testing, since it is mathematically sound even if it's frequently misused and misunderstood. I think it would be much more effective for everyone to agree that 95% confidence is too low and that a minimum of 99.5% is necessary for statistical significance. Ozob ( talk) 03:59, 22 February 2016 (UTC)
I am not an expert in statistics, but I am familiar with parts of it, and I can't understand this article. Introductory textbooks teach that a p-value expresses the probability of seeing the sample, or a sample more "extreme" than it, given the null hypothesis. Is the p-value fallacy a particular misinterpretation of this conditional probability, such as the probability of the null hypothesis given the data? Try to explain where exactly the fallacious reasoning becomes fallacious. Mgnbar ( talk) 04:20, 22 February 2016 (UTC)
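For what it's worth, the textbook definition cited above ("probability of seeing the sample, or a sample more 'extreme' than it, given the null hypothesis") is easy to compute directly. A minimal sketch; the fair-coin scenario is hypothetical, chosen only to make the conditional direction explicit:

```python
from math import comb

# p-value for observing 60 heads in 100 tosses of a (hypothesised) fair coin:
# the probability, under the null, of a result at least this extreme.
n, k = 100, 60
p_upper = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n  # P(X >= 60 | fair)
p_value = 2 * p_upper  # two-sided, using the symmetry of the fair-coin binomial
print(p_value)  # about 0.057: the data's probability under H0, not P(H0 | data)
```

The fallacy the article discusses would be reading this 0.057 as the probability that the coin is fair, which is a different conditional probability altogether.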
"...the mistaken idea that a single number can capture both the long-run outcomes of an experiment and the evidential meaning of a single result." and
"the question is whether we can use a single number, a probability, to represent both the strength of the evidence against the null hypothesis and the frequency of false-positive error under the null hypothesis...it is not logically possible." The trade-off is that you can't have both of these at the same time.
"we are not discussing a conflict between frequentist and Bayesian reasoning, but are exhibiting a fundamental property of p values that is apparent from any perspective." I think my wording may have been unclear since it could be interpreted as referring to the preceding sentence and not the fallacy in general, or perhaps I shouldn't have converted "any perspective" to "these specific perspectives" (even though the latter is logically contained in the former). Part of the point of including this statement was to reinforce the point that the fallacy is not about frequentism vs Bayesianism. It's definitely not intended to claim that there are Bayesian p-values, and proposed alternative wordings would be appreciated. :-)
"in some cases the classical P-value comes close to bringing such a message," which I left out since it didn't seem directly relevant, but I'd be fine with adding that.
Continuing on from the above, I've had another read through of this article and, the more I read, the more flawed it appears to me. It appears inconsistent as to precisely what the "p-value fallacy" is and is, more broadly, simply a Bayesian critique of p-values. There is useful material here on challenges with p-values, common mistakes, and that Bayesian critique, but all that would be better integrated into the statistical significance article, not separated off here ( WP:CFORK). I question notability for an article under the present name, and propose a merger into statistical significance. Bondegezou ( talk) 15:09, 22 February 2016 (UTC)
"analysis of nearly identical datasets can result in p-values that differ greatly in significance" - he presents two datasets, one which has a significant main effect and nonsignificant interaction effect, while the other has a significant interaction effect but a nonsignificant main effect.
Headbomb has added a section on the odds of a false positive statistically significant result. Plenty of good material there, but this is a separate issue to the purported p-value fallacy that this article is about. Why not move that section to p-value or Type I and type II errors? This looks like classic content forkery to me. This article should not be a dumping ground for all flaws with or misunderstandings of the p-value. Bondegezou ( talk) 16:39, 22 February 2016 (UTC)
I note there are 4 citations given in this section. I have looked through them all (7, 8, 9, 10): none of them use the term "p-value fallacy". Indeed, I note that other citations given in the article (2, 6, 11) do not use the term "p-value fallacy". If no citations in this section make any reference to "p-value fallacy", then I suggest again that this section is not about the p-value fallacy and constitutes WP:OR in this context. There is good material here that I am happy to see moved elsewhere, but this is not relevant here. Would others than me and Headbomb care to chime in? Can anyone firmly link this section to the p-value fallacy? Bondegezou ( talk) 10:27, 23 February 2016 (UTC)
I think it would be possible to explain the fallacy better. Here is my attempt: When an experiment is conducted, the question of interest is whether a certain hypothesis is true. However, p values don't actually answer that question. A p value actually measures the probability of the data, assuming that the hypothesis is correct. The fallacy consists in getting this backward, by believing that the p value measures the probability of the hypothesis, given the data. Or to put it a bit differently, the fallacy consists in believing that a p value answers the question one is interested in, rather than answering a related but different question. Looie496 ( talk) 18:22, 22 February 2016 (UTC)
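Looie496's "backward" reading can be put in numbers with a quick simulation. The prior (80% of tested hypotheses null) and the power (80%) used below are assumptions chosen purely for illustration:

```python
import random

# Among results reaching p < .05, what fraction had a true null?  If the
# backward reading were correct, the answer would be about 5%.
random.seed(2)

sig_null = sig_real = 0
for _ in range(200_000):
    if random.random() < 0.8:                 # hypothesis is actually null
        sig_null += random.random() < 0.05    # false positive at alpha = .05
    else:                                     # real effect
        sig_real += random.random() < 0.8     # detected with 80% power

print(sig_null / (sig_null + sig_real))  # roughly 0.2: P(H0 | significant), not the p-value
```

Under these assumed rates, about 20% of "significant" results come from true nulls, even though each individual test used a 5% threshold; the p-value answered a different question than the one asked.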
Why don't you guys read the references? Sellke et al make it quite clear in their paper. They indicate three misinterpretations of the p-value, the first of which corresponds to the p-value fallacy sensu stricto as formulated by Goodman in 1999. The second one is what Looie496 explained above. All three misinterpretations are interrelated to some extent and merit being explained in the article, although the first one should be the one more extensively described per most of the literature.
"the focus of this article is on what could be termed the "p value fallacy," by which we mean the misinterpretation of a p value as either a direct frequentist error rate, the probability that the hypothesis is true in light of the data, or a measure of odds of H0 to H1. [The term "p value fallacy" was used, in the first of these senses, in the excellent articles Goodman (1999a,b).]" (Sellke et al., 2001)
Given the large number of publications on the topic and the fact that it has a consistently used name, I find it a terrible idea to merge into the p-value article. Neodop ( talk) 01:42, 23 February 2016 (UTC)
So let me take a stab at explaining this, based on the Colquhoun paper.
I'm not confident that this explanation is correct or even sensible. I'm just trying to convey the kind of mathematical detail that would help me understand how this particular misinterpretation of probabilities differs from another. Also, Wikipedia's audience is diverse. The article should offer an explanation in plain English. But it should also offer an explanation in mathematical notation. Mgnbar ( talk) 16:10, 23 February 2016 (UTC)
"we could not both control long-term error rates and judge whether conclusions from individual experiments were true." So it's just a more complicated description that incorporates a rationale for why the reasoning fails. The only case where I see alternative definitions is Sellke, and in that case the options still seem to refer to the same essential error, which is the use of p-values as direct evidence about the truth of the hypothesis. As noted above, that paper was also published soon after the term was coined, so it's also possible the other versions just didn't catch on. (If I'm misunderstanding this then we could provide multiple definitions, or merge it somewhere, though I think it would be WP:UNDUE to have all of this at the main p-value article.) Sunrise ( talk) 08:27, 25 February 2016 (UTC)
The defense of Bayesian analysis should not be as prominent in the article, and definitely not included in the lead (it is highly misleading, no pun intended :)). Most of the authors exposing the misuse of p-values do defend Fisherian hypothesis testing and are highly critical of "subjective" Bayesian factors, etc. particularly Colquhoun. Neodop ( talk) 14:57, 23 February 2016 (UTC)
I just put some {{ dubious}} tags on the false discovery rate section. My objection is as follows: Suppose that we run many hypothesis tests; we get some number of rejections of the null hypothesis ("discoveries"). The FDR is defined as the expected proportion of those discoveries which are erroneous, i.e., false positives. There are statistical procedures that effectively control the FDR (great!). But if we get 100 rejections using a procedure that produces an FDR of 5%, that means that we expect 5 false positives, not that we have 5 false positives. We might have 1 false positive. Or 10. Because of this, the probability that a given rejection is a false positive is not 5%, even though the FDR is 5%! All we know is that if we were to repeat our experiment over and over, we would expect 5 false positives on average, i.e., the probability of a discovery being a false positive is 5% on average (but perhaps not in the particular experiment we ran). The article does not seem to make this distinction. Ozob ( talk) 16:29, 23 February 2016 (UTC)
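The distinction between the expected and the realized number of false positives can be checked with a quick simulation (a sketch only; the 800/200 mix of true and false nulls and the crude "high power" stand-in are hypothetical assumptions, not taken from any cited source):

```python
import random

random.seed(42)
alpha = 0.05
m_null, m_real = 800, 200  # hypothetical mix: 800 true nulls, 200 real effects

fdps = []
for _ in range(10):  # repeat the whole experiment 10 times
    # Under a true null, p-values are Uniform(0, 1)
    V = sum(random.random() < alpha for _ in range(m_null))       # false rejections
    # Crude stand-in for a high-powered test of a real effect
    S = sum(random.random() ** 6 < alpha for _ in range(m_real))  # true rejections
    fdps.append(V / max(V + S, 1))  # false discovery proportion for this run

# The realized proportions scatter around their expectation (the FDR);
# no single run is pinned to it.
print([round(x, 3) for x in fdps])
```

Each run's false discovery proportion differs; only the average over many repetitions converges to the FDR, which is exactly the expectation-versus-realization point above.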
I have a rudimentary grasp of statistics - this article gets into the weeds way too fast. Can the folks working on this please be sure that the first section describes this more plainly, per WP:TECHNICAL? Thanks. Jytdog ( talk) 18:45, 23 February 2016 (UTC)
I'm stimulated to write since someone on twitter said that after reading this article he was more confused than ever.
I have a couple of suggestions which might make it clearer.
(1) There needs to be a clearer distinction between false discoveries that result from multiple comparisons and false discoveries in single tests. I may be partially responsible for this confusion because I used the term false discovery rate for the latter, though that term had already been used for the former. I now prefer the term "false positive rate" when referring to the interpretation of a single test. Thus, I claim that if you observe P = 0.047 in a single test and claim that you've made a discovery, there is at least a 30% chance that you're wrong, i.e. it's a false positive. See http://rsos.royalsocietypublishing.org/content/1/3/140216
(2) It would be useful to include a discussion of the extent to which the argument used to reach that conclusion is Bayesian, in any contentious sense of that term. I maintain that my conclusion can be reached without the need to invoke subjective probabilities. The only assumption that's needed is that it's not legitimate to assume any prior probability greater than 0.5 (to do so would be tantamount to claiming you'd made a discovery and that your evidence was based on the assumption that you are probably right). Of course, if a lower prior probability were appropriate, then the false positive rate would be much higher than 30%.
(3) I don't like the title of the page at all. "The P value fallacy" has no defined meaning - there are lots of fallacies. There is nothing fallacious about a P value. It does what's claimed for it. The problem arises because what the P value does is not what experimenters want to know, namely the false positive rate (though only too often these are confused).
David Colquhoun ( talk) 19:49, 23 February 2016 (UTC)
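Colquhoun's figure can be sanity-checked with a simulation in the spirit of his paper (a sketch, not his actual code: it assumes a two-sided z-test, a prior probability of a real effect of 0.5, and an effect size giving power of roughly 0.7, and it conditions on "just significant" results with p near 0.047; the exact fraction depends on these assumptions):

```python
import math
import random

def p_value(z):
    # Two-sided p-value for a z statistic (standard normal under the null)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(1)
n_sim = 200_000
hits = {"real": 0, "null": 0}
for _ in range(n_sim):
    real = random.random() < 0.5   # prior P(real effect) = 0.5
    mu = 2.5 if real else 0.0      # mean of the z statistic (power ~0.7 at alpha = 0.05)
    z = random.gauss(mu, 1.0)
    if 0.045 < p_value(z) < 0.05:  # "just significant" results, p ~ 0.047
        hits["real" if real else "null"] += 1

false_positive_fraction = hits["null"] / (hits["null"] + hits["real"])
print(round(false_positive_fraction, 2))
```

Under these assumptions the fraction of "just significant" results that come from true nulls lands in the 20-30% range, consistent with the order of magnitude Colquhoun describes; lower power or a lower prior pushes it higher.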
I notice that the entry on False Positive Rate discusses only the multiple comparison problem, so this also needs enlargement. David Colquhoun ( talk) 11:34, 24 February 2016 (UTC)
From an average non-sciences reader/editor who was recently and not out of personal interest self-introduced to the p-value concept via Bonferroni correction, during a content discussion/debate, I find this explanation clear and straightforward (so hopefully it is accurate as well): the Skeptic's Dictionary: From Abracadabra to Zombies: p-value fallacy.
Also, this article's title seems practical, as p-value fallacy appears to be common enough - as in, for example, randomly: "The fallacy underlying the widespread use of the P value as a tool for inference in critical appraisal is well known, still little is done in the education system to correct it"
- therefore, it is the search term I'd hope to find an explanation filed under, as a broad and notable enough subject that merits standalone coverage. -- Tsavage ( talk) 21:17, 23 February 2016 (UTC)
Maybe I'm missing something, but there does not yet seem to be consensus that there is a single, identifiable concept called "the p-value fallacy". For example:
Because there are multiple fallacies around p-values, Google searches for "p-value fallacy" might generate many hits, even if there is no single, identifiable fallacy.
So is this article supposed to be about all fallacies around p-values (like p-value#Misunderstandings)? If not, then is this particular p-value fallacy especially notable? Does it deserve more treatment in Wikipedia than other fallacies about the p-value? Or is the long-term plan to have a detailed article about each fallacy? Mgnbar ( talk) 15:25, 3 March 2016 (UTC)
Based on the discussion above, I'd like to propose turning the present article into a redirect to p-value#Misunderstandings. There is no one "p-value fallacy" in the literature, so the present article title will always be inappropriate. Moreover, p-value#Misunderstandings is quite short, especially given the breadth of its content. The content currently in this article would be better placed there where it's easier to find and will receive more attention. Ozob ( talk) 00:12, 4 March 2016 (UTC)
Comment: As I understand it, largely from reading in and around this discussion, what is usually considered the p-value fallacy is, in simple English, misusing p-value testing as way to determine whether any one experimental result actually proves something.
There are a number of different mistaken beliefs about p-values that can each lead to misuse that results in the same fallacious application, i.e. to establish proof. So there appears to be one distinct overall p-value fallacy that is referred to in multiple sources, regardless of any other p-value misuses that may fall entirely outside of this definition.
This would mean that we have a notable topic (per WP:GNG), and perhaps some clarification to do within that topic. Have I oversimplified or plain got it wrong? (I'm participating as a guinea pig non-technical reader/editor - this article should be clear to me! :)-- Tsavage ( talk) 20:34, 4 March 2016 (UTC)
The main sources used for the substantive content of this article are the papers by Goodman and Dixon. They coined and promulgated the term "p-value fallacy", so they can be considered primary sources. At least, that's how something like WP:MEDMOS would describe them. Bondegezou ( talk) 08:21, 6 March 2016 (UTC)
Sources for future reference:
Manul ~ talk 20:02, 8 March 2016 (UTC)
Quoting the ASA's principles may help give the Wikipedia article some focus:
Manul ~ talk 20:11, 8 March 2016 (UTC)
My assumption was that this article would eventually become the expansion of p-value#Misunderstandings, with that section of the p-value article being a summary of this one. I haven't yet delved into the recent discussions, but from an outsider's perspective I can say that that's what readers expect. It would seem super confusing for this article to cover some (or one) misunderstanding but not others. Manul ~ talk 12:45, 10 March 2016 (UTC)
I've moved the article, since it seems like we have enough support for it. If the fork becomes unnecessary, we can just merge this into the main article. Since Bondegezou still prefers that option, I left the merge tags open and pointed it to this page instead. In the meantime, help with building additional content is appreciated! Sunrise ( talk) 00:56, 11 March 2016 (UTC)
Okay, I've now made some changes aimed at the different scope, and imported a lot of content from other articles. Explanations for specific edits are in the edit summaries. I used the ASA statement a couple of times, but there's a lot more information that could be added. One important thing is to finish dividing everything into appropriate sections, especially the list of misunderstandings from p-value#Misunderstandings, which is currently in the lead. That will probably need careful reading to figure out which parts are supported by which sources. Once that's done it will be a lot easier to work on expanding the individual sections. Sunrise ( talk) 01:56, 11 March 2016 (UTC)
I finally downloaded the Goodman (1999) paper that apparently coins the term "p-value fallacy" in its narrow sense. It's very different from the statistics that I usually read, so I'm hoping for some clarification here. For starters, Goodman criticizes another source for including this passage:
My interpretation of this passage, which is admittedly out of context, is:
So my questions are:
This paper seems very useful here. Bondegezou ( talk) 09:40, 27 May 2016 (UTC)
The p-value is not the probability that the null hypothesis is true or the probability that the alternative hypothesis is false. It is not connected to either. - The first sentence is unequivocally accurate. It essentially states that P(A|B) is not generally the same as P(B|A), where A is the event that the null is true, B is the observed data, and the p-value is P(B|A). However, the second sentence seems unnecessary and overly strong in saying the p-value, P(B|A), is "not connected" to the posterior probability of the null given the data, P(A|B). In fact, the two probabilities are, at least in some sense, "connected" by Bayes rule: P(A|B)=P(B|A)P(A)/P(B)
The p-value is not the probability that a finding is "merely a fluke." - I couldn't find the word "fluke" in any of the 3 sources cited for the section, so it is not clear (1) that this misunderstanding is indeed "common," and (2) what the word "fluke" means exactly in this context. If "merely a fluke" means that the null is true (i.e. the observed effect is spurious), then there seems to be no distinction between this allegedly common misunderstanding and the previous misunderstanding. That is, both misunderstandings are the confusion of P(A|B) with P(B|A), where A is the event that the null is true, B is the observed data, and the p-value is P(B|A).
The p-value is not the probability of falsely rejecting the null hypothesis. That error is a version of the so-called prosecutor's fallacy. - Here again, it is not clear exactly what this means, where in the cited sources this allegedly common misunderstanding comes from, and whether or not this "prosecutor's fallacy" is a distinct misunderstanding from the first one. The wiki article on prosecutor's fallacy suggests that there is no distinction--i.e. both misunderstandings confuse P(A|B) with P(B|A), where A is the event that the null is true, B is the observed data, and the p-value is P(B|A).
The significance level, such as 0.05, is not determined by the p-value. - Here again, is this really a common misunderstanding? Where is this allegedly common misunderstanding listed in the cited sources?
It should also be noted that the next section ("Representing probabilities of hypotheses") AGAIN seems to restate the first "common misunderstanding." The section also contains the following weirdly vague statement: "it does not apply to the hypothesis" (referring to the p-value). What is "the hypothesis?" — Preceding unsigned comment added by 50.185.206.130 ( talk) 08:46, 6 July 2016 (UTC)
Regarding your response to the first point, sure, the null is either true or false. But if someone doesn't know whether it's true or false, I don't see a problem with that person speaking in terms of probabilities based on the limited information they have access to. By analogy, if you don't know what card I've randomly drawn from a deck, you could speak of the probability of it being a red suit or a face card or the Ace of Spades, even though from an omniscient perspective there is no actual uncertainty--the card simply is what it is. I'm aware that there are different philosophical perspectives on this issue, but they are just that--perspectives. And if you're uncomfortable using the term "probability" for events in a frequentist framework, you can simply substitute "long-term frequency." In any case, I don't see how including the vague and potentially controversial statement that "it is not connected to either" is at all necessary or adds anything useful to the article section; the immediately preceding sentence is sufficient and straightforward.
Your response to the second point isn't clear to me. So "a fluke" means "unlucky?" And what is the "finding" in the phrase "finding is merely a fluke?" The data? So there is a common misunderstanding that the p-value is the probability of the data being unlucky? It's hard to see how that is even a coherent concept. Perhaps the misunderstanding just needs to be explained more clearly and with different vocabulary. Indeed, as I noted previously, the word "fluke" does not appear in any of the cited sources.
You didn't really respond to the third point, except to say that your response to the second point should apply. It seems we agree that the first misunderstanding is p = P(A|B) (even though you've noted that it's debatable whether P(A|B) is coherent in a frequentist framework). Isn't the "prosecutor's fallacy" also p = P(A|B)? In fact, the wiki article on prosecutor's fallacy appears to describe it precisely that way (except using I and E instead of A and B). Maybe part of the problem is the seemingly contradictory way the alleged misunderstanding is phrased: first it's described as thinking the p-value is the probability of falsely rejecting the null hypothesis (which appears to mean confusing the p-value with the alpha level), and then it's described as "a version of prosecutor's fallacy" (which appears to be something else entirely).
Your response to the fourth point seems to be POV. The functional difference between p=.04 and p=.035 may be relatively trivial in most cases, but p=.0000001 need not be treated as equivalent to p=.049999999 just because both are below some arbitrarily selected alpha level. Here again, there may be different perspectives on the issue, but we are supposedly talking about definitive misunderstandings, not potential controversies.
You didn't respond to my last point, regarding the "Representing probabilities of hypotheses" section. — Preceding unsigned comment added by 23.242.207.48 ( talk) 18:30, 10 July 2016 (UTC)
Your entire response to the first point is a red herring. The American Statistical Association's official statement on p-values (which is cited in this article; http://www.tandfonline.com/doi/full/10.1080/00031305.2016.1154108) notes that p-values can be used for "providing evidence against the null hypothesis"--directly contradicting the claim that p-values are "not connected" to the probability of the null hypothesis. If you insist that someone has to be called "Bayesian" to make that connection, fine--it is a connection nonetheless (and it is the connection that p-values' usefulness depends on). Furthermore, none of your response substantively speaks to the issue at hand: whether the statement "it is not connected to either" should be included in the article. Even if we accept your view that P(A|B) is meaningless, the disputed statement in the article does not communicate that premise. The article does not say, "The p-value is not the probability that the null hypothesis is true or the probability that the alternative hypothesis is false. Those probabilities are not conceptually valid in the frequentist framework." Instead, the article says, "The p-value is not the probability that the null hypothesis is true or the probability that the alternative hypothesis is false. It is not connected to either." Thus, even if we accept your premise, the statement is not helpful and should be removed. In fact, saying P(A|B) is "not connected" to P(B|A) might be taken to imply that the two probabilities orthogonally coexist--which would directly contradict your view. Given that there is no apparent reason for you to be attached to the disputed sentence even if all your premises are granted, I hope you will not object that I have removed it.
Regarding the second point, you defined "fluke" as "unlucky." I responded that "the probability that the finding was unlucky" (1) is an unclear concept and (2) does not obviously relate to any passages in the cited sources (neither "fluke" nor "unlucky" appear therein). Hence, with regard to your ad hominem, I do understand English--that doesn't make all combinations of English words intelligible or sensible. I repeat my suggestion that if there is an important point to be made, better vocabulary should be used to make it. Perhaps the business about "flukes" comes from the ASA's statement that "P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone" ( http://www.tandfonline.com/doi/full/10.1080/00031305.2016.1154108). Note that the statement combines the fallacy regarding the null being true and the fallacy regarding the data being produced by random chance alone into a single point. Why not use a similar approach, and similar language, in the wiki article? I hope you will not object that I have made such an adjustment.
Regarding the third point (regarding "prosecutor's fallacy"), you don't have a response adequately demonstrating that (1) the proposed misunderstanding is consistent with how prosecutor's fallacy is defined (note that the article equates prosecutor's fallacy with thinking the p-value is the probability of false rejection), (2) the proposed misunderstanding is non-redundant (i.e. prosecutors's fallacy should be distinct from the first misunderstanding), and (3) the proposed misunderstanding is listed in the cited sources (note that "prosecutor's fallacy" is not contained therein). In fact, your description of prosecutor's fallacy is EXACTLY misunderstanding #1--whether you're a "quasi-Bayesian" or a "Bayesian," the fallacy is exactly the same: P(A|B)=P(B|A). What framework is used to derive or refute that fallacy doesn't change the fallacy itself.
Regarding the fourth issue, if the point is that the alpha level must be designated a priori rather than as convenient for the obtained p-value, then we are certainly in agreement. I have not removed the item. But if this point is indeed commonly misunderstood, how about providing a citation?
Regarding the final issue, I see the point you are making. I hope you will not object that I have slightly adjusted the language to more closely match the ASA's statement that the p-value is "a statement about data in relation to a specified hypothetical explanation, and is not a statement about the explanation itself." — Preceding unsigned comment added by 23.242.207.48 ( talk) 14:36, 11 July 2016 (UTC) 23.242.207.48 ( talk) 14:45, 11 July 2016 (UTC)
I am pleased that we are finding common ground. You describe prosecutor's fallacy as "asserting that p-values measure the probability that the studied hypothesis is true" (direct quote). In the article, misconception #1 is described as thinking the p-value is "the probability that the null hypothesis is true" (direct quote). That makes misconception #1 essentially a word-for-word match for your definition of prosecutor's fallacy. It's hard to see how one could justify saying those are two different misconceptions when they are identically defined. It seems that you are making a distinction between two versions of an objection to the fallacy rather than between two different fallacies; perhaps the reference to prosecutor's fallacy should be moved to misconception #1.
Note also that your definition of prosecutor's fallacy doesn't match the way it's described in the bold text of misconception #3. Indeed, "the probability of falsely rejecting the null hypothesis" (article's words) is certainly not the same thing as "the probability that the null hypothesis is true" (your words). Thus, there is another reason the reference to prosecutor's fallacy does not seem to belong where it appears. — Preceding unsigned comment added by 23.242.207.48 ( talk) 10:22, 12 July 2016 (UTC)
I can't say I'm convinced. I'm also wary that expanding on the ASA's descriptions would violate wikipedia standards on original research, unless there are other reputable sources that explicitly identify two distinct common misunderstandings as you propose.
I also find misunderstanding #4 a bit peculiar: "The p-value is not the probability that replicating the experiment would yield the same conclusion." Are there really people who think a very low p-value means the results aren't likely to be replicable? It's hard to imagine someone saying, "p = .0001, so we almost certainly won't get significance if we repeat the experiment." I doubt many people think super-low p-values indicate less reliable conclusions. I also couldn't find this misunderstanding listed in any of the cited sources. Which paper and passage did it come from? 23.242.207.48 ( talk) 00:07, 14 July 2016 (UTC)
That rewriting still seems weird to me. So, many people think that a very high p-value (e.g. p=.8) means they will probably get significance if they repeat the experiment? I've never heard that. I'm removing misunderstanding #4 pending a sourced explanation. 23.242.207.48 ( talk) 11:08, 14 July 2016 (UTC)
Based on the discussion so far, it seems like the quality of the list of misunderstandings is doubtful. I feel like we need to clean up the list: Each misunderstanding should come with an inline citation, and the language should be carefully checked to ensure that it is correct and reflects what is in the source. Would anyone like to volunteer? Or propose a different solution? Ozob ( talk) 23:16, 14 July 2016 (UTC)
This is what I have so far:
Proposal
The following list addresses several common misconceptions regarding the interpretation of p-values:
References:
Improvements would be appreciated. Have I interpreted everything correctly? Did I miss anything? Can we find citations for the unsourced parts? (at least a couple of them should be easy) There's also a comment in Sterne that directly addresses prevalence of a misconception, specifically that the most common one is (quote)"that the P value is the probability that the null hypothesis is true, so that a significant result means that the null hypothesis is very unlikely to be true," but I wasn't sure about how to best include that. Perhaps that (or other parts of the section) could be useful for the main p-value article. Sunrise ( talk) 08:11, 17 July 2016 (UTC)
Reasons:
1. The FDR section appears to refer to a misinterpretation of alpha levels--not a misinterpretation of p-values (note that p0 in the formula is the alpha level, not the p-value). Thus, the section is irrelevant to the article.
2. The statement that FDR increases when the number of tests increases is false. In fact, the FDR can either increase, decrease, or stay the same when the number of tests increases.
3. The given definition of the FDR appears to be incorrect ("the odds of incorrectly rejecting the null hypothesis"). The FDR is conventionally defined as the expected proportion of rejections that are incorrect (and defined as 0 when there are no rejections). — Preceding unsigned comment added by 2601:644:100:74B7:494A:EDB1:8541:281E ( talk) 06:59, 7 July 2016 (UTC)
As I noted in my previous comment, the section does not even correctly define the FDR itself--let alone the "relation between the FDR and the p-value." 23.242.207.48 ( talk) 01:08, 9 July 2016 (UTC)
- That is simply incorrect. The FDR is not, as you have claimed, the probability of obtaining a false positive. The FDR is the expected proportion of significant tests (i.e., "positives") that are false positives. You are perhaps confusing the FDR with the familywise Type I error rate--just as the FDR section of this article does. Look at the formula given in the section--it is the formula for the familywise error rate, NOT FOR THE FDR! I again encourage you to actually read the wiki article on the FDR or, better yet, read the original Benjamini & Hochberg article that introduced the quantity in the first place. 23.242.207.48 ( talk) 03:42, 9 July 2016 (UTC)
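The contrast between the two quantities can be made explicit (a sketch with hypothetical numbers; the V and R values are an assumed outcome of one experiment, not from any cited source):

```python
# Familywise error rate (FWER): probability of at least one false positive
# among m independent tests of true nulls -- this is what 1 - (1 - alpha)^m computes.
alpha, m = 0.05, 100
fwer = 1 - (1 - alpha) ** m
print(round(fwer, 3))  # 0.994

# False discovery rate (FDR, Benjamini & Hochberg): the *expectation* of the
# false discovery proportion V / max(R, 1), where V is the number of false
# rejections and R is the total number of rejections.
V, R = 5, 100  # hypothetical outcome of one experiment
fdp = V / max(R, 1)
print(fdp)  # 0.05
```

With 100 true-null tests the FWER is near certainty, while an experiment with 5 false rejections out of 100 total rejections has a false discovery proportion of 5% — two very different quantities, which is the point being argued above.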
As clever as the comic strip may be, it doesn't seem very encyclopedic to spend a paragraph summarizing it in this article. Similarly, it wouldn't make sense to dedicate a paragraph to summarizing the film Jaws in an article about great white sharks (though the film might be briefly mentioned in such an article).
The paragraph is also somewhat confusingly written (e.g. what does "to p > .05" mean?, what does "threshold that the results are due to statistical effects" mean?, and shouldn't "criteria of p > 0.05" be "criteria of p < 0.05?").
Another concern is that the punchline "Only 5% chance of coincidence!" is potentially confusing, because "5% chance of coincidence" is not an accurate framing of p < .05 even when there is only a single comparison.
If the jellybean example is informative enough to merit inclusion, I suggest either rewriting the summary more clearly and concisely (and without verbatim transcriptions such as "5% chance of coincidence"), or simply removing the summary and linking to the comic strip in the further reading section. 23.242.207.48 ( talk) 17:51, 12 July 2016 (UTC)
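The multiple-comparisons arithmetic behind the jellybean strip is easy to verify (a minimal sketch, assuming 20 independent tests with every null hypothesis true):

```python
# Probability of at least one p < 0.05 among 20 independent tests
# when every null hypothesis is true
alpha, n_tests = 0.05, 20
p_at_least_one = 1 - (1 - alpha) ** n_tests
print(round(p_at_least_one, 3))  # 0.642
```

So a "significant" jellybean color is more likely than not even with no real effect anywhere, which is the strip's point; it is not the same claim as "5% chance of coincidence" for the one reported comparison.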
I've cleaned up the example so it references the comic without doing a frame-by-frame summary and without the confusing language. Perhaps this is a reasonable compromise? I'm still of the mind that the reference to the comic should probably be removed altogether, but it should at the very least be grammatically and scientifically correct in the meantime. 23.242.207.48 ( talk) 10:19, 13 July 2016 (UTC)
I'm inclined to agree with Bondegezou that detailed repetition of information available in other articles is unnecessary. By the same token, this whole article is arguably unnecessary and would be better as a short section in the p-value article than as a sprawling article all to itself, without much unique material (but it seems that issue has been previously discussed and consensus is to keep it). — Preceding unsigned comment added by 23.242.207.48 ( talk) 11:00, 14 July 2016 (UTC)
Please keep it! This makes the article more understandable than just a bunch of math, because we can see just how ridiculous these situations are when you ignore the implications! I urge editors to keep this example and add more to other sections, because right now it seems to be in danger of becoming like all the other math pages: useless unless you already know the topic or are a mathematician. You talk about null or alternative hypotheses, but never give any example! Who exactly do you think can understand this? You think someone who sees a health claim in a nutrition blog that checks a paper with a conclusion that prune juice cures cancer (p < 0.05) knows that the null hypothesis means prune juice doesn't cure cancer? Or that an alternative hypothesis is that strawberries cure cancer? EXPLAIN THINGS IN WAYS PEOPLE WHO DON'T HAVE A PHD IN MATH CAN UNDERSTAND!
I am an educator at École Léandre LeGresley in Grande-Anse, NB, Canada and I agree to release my contributions under CC-BY-SA and GFDL. — Preceding unsigned comment added by 2607:FEA8:CC60:1FA:9863:1984:B360:4013 ( talk) 12:30, 21 July 2016 (UTC)
It's not clear what the statement "p-values do not account for the effects of confounding and bias" is supposed to mean. For example, what kind of "bias" is being referenced? Publication bias? Poor randomization? The experimenter's confirmation bias? Even the cited source (an opinion piece in a non-statistical journal) doesn't make this clear, which is probably why the statement in this article's misunderstandings list is the only one not accompanied by an explanation. Furthermore, the cited source doesn't even explicitly suggest that there's a common misunderstanding about the issue. So are people really under the impression that p-values account for "confounding and bias?" Those are general problems in research, not some failing of p-values in particular. I'm removing the statement pending an explanation and a better source. 23.242.207.48 ( talk) 02:07, 23 July 2016 (UTC)
This edit request to Misunderstandings of p-values has been answered. Set the |answered= or |ans= parameter to no to reactivate your request.
References #4 and #5 are identical. Please edit the citation to reference #5 that follows the sentence "The p-value fallacy is a common misinterpretation of the p-value whereby a binary classification of hypotheses as true or false is made, based on whether or not the corresponding p-values are statistically significant." to refer to reference #4 instead. Amoriarty21 ( talk) 23:11, 4 September 2018 (UTC)
This issue was raised years ago, and it appears that the conclusion in this talk page was that "p value fallacy" is not a standard, consistently defined term. Apparently, a single user has been fighting for its inclusion in this article, but it seems to me that is not enough. Certainly giving "p value fallacy" an entire section in the article amounts to undue weight for a term that is hardly ever actually used in science or statistics--making the term's inclusion here misleading regarding what terminology is commonly used. Moreover, as was pointed out in an earlier discussion on this talk page, in the rare cases when the term "p value fallacy" is actually used, it isn't used consistently. Thus, including a section on the "p value fallacy" is not only unnecessary for understanding the topic of the article, but is also potentially confusing. 164.67.15.175 ( talk) 21:12, 24 September 2018 (UTC)
The "top hit on Google scholar" that you're referring to (which is actually an opinion piece) defines the p-value fallacy as "the mistaken idea that a single number can capture both the long-run outcomes of an experiment and the evidential meaning of a single result." That is NOT the definition given in this wiki article: "The p-value fallacy is a common misinterpretation of the p-value whereby a binary classification of hypotheses as true or false is made." Thus, the "top hit on Google scholar" actually illustrates the point that "p value fallacy" is an inconsistently defined and potentially confusing term. Furthermore, the "p value fallacy" (as defined in the "top hit on Google scholar") isn't even demonstrably a fallacy, though the authors may consider it so. Thus, including it in this wiki article amounts to POV, which is inappropriate. This is supposed to be an article about objective MISUNDERSTANDINGS, not about controversial opinions. 23.242.198.189 ( talk) 01:57, 26 September 2018 (UTC)
That seems rather backward to me. It doesn't make sense to include a section just because we like the name of the section, without consideration for whether the content of the section is actually relevant to the topic of the article. Note also that the fact that a term or phrase has "some currency" is not enough to make that term merit a section in the article. People have come up with all sorts of terms, many of which have "some currency." That doesn't mean they all belong in an article on misunderstanding p-values. 164.67.15.175 ( talk) 00:04, 29 September 2018 (UTC)
Just checking in, I see that still no one has provided any counterargument in favor of keeping the content of the "p value fallacy" section. Please remove it. 23.242.198.189 ( talk) 01:23, 12 October 2018 (UTC)
"See the refs" is not a legitimate argument--especially given that the refs were obviously already "seen" because they were addressed in this discussion. 23.242.198.189 ( talk) 07:22, 16 October 2018 (UTC)
Let's settle this once and for all, now that the inappropriately applied "semi-protected status" has been lifted. We can go through the section sentence-by-sentence and see that it is not valid.
Sentence 1: The p-value fallacy is a common misinterpretation of the p-value whereby a binary classification of hypotheses as true or false is made, based on whether or not the corresponding p-values are statistically significant.
The cited source for that sentence defining the p-value fallacy is A PAPER THAT DOES NOT EVEN CONTAIN THE TERM "P-VALUE FALLACY." So right off the bat, we can see there is something very wrong here.
Sentence 2: The term 'p-value fallacy' was coined in 1999 by Steven N. Goodman.
The "p-value fallacy" defined by Goodman in the cited article is NOT what is described in the preceding sentence (the "binary classification of hypotheses as true or false"). Instead, Goodman defines "p-value fallacy" as "the mistaken idea that a single number can capture both the long-run outcomes of an experiment and the evidential meaning of a single result." In other words, Goodman is making a Bayesian critique of p-values. In fact, Goodman's paper is an OPINION PIECE that criticizes the use of "frequentist statistics" altogether! Goodman's opinion that using p-values in conventional frequentist null hypothesis testing is based on "fallacy" is just that--an opinion. It would be relevant in an article on controversies or debates about p-values, but this wiki article is supposed to be about MISUSES of p-values, so including POV here directly contradicts wiki policy.
Sentence 3: This fallacy is contrary to the intent of the statisticians who originally supported the use of p-values in research.
This is more POV that cites the same Goodman article. Curiously, this sentence also cites a Sterne and Smith article (another opinion piece), which DOES NOT EVEN CONTAIN THE TERM "P-VALUE FALLACY."
Sentence 4: As described by Sterne and Smith, "An arbitrary division of results, into 'significant' or 'non-significant' according to the P value, was not the intention of the founders of statistical inference."
That may or may not be true. It doesn't actually matter, because again, that Sterne and Smith opinion piece DOES NOT EVEN CONTAIN THE TERM "P-VALUE FALLACY," and what Sterne and Smith are describing here does not appear to even be equivalent to what Goodman defined as the p-value fallacy.
Sentence 5: In contrast, common interpretations of p-values discourage the ability to distinguish statistical results from scientific conclusions, and discourage the consideration of background knowledge such as previous experimental results.
This is POV again, again citing the opinion piece by Goodman.
Sentence 6: It has been argued that the correct use of p-values is to guide behavior, not to classify results, that is, to inform a researcher's choice of which hypothesis to accept, not to provide an inference about which hypothesis is true.
This is POV yet again, yet again citing the opinion piece by Goodman. At least here, the wording includes the phrase "It has been argued that..." to acknowledge the POV. It should be noted that in addition to citing the Goodman piece, the sentence also cites another article (one by Dixon). Dixon's article, in contrast to Goodman's, does in fact define the p-value fallacy similarly to how it is defined in Sentence 1. However, the fact is that the term SIMPLY HAS NOT CAUGHT ON. A Google scholar search shows that even the handful of articles that have cited some aspect or another of the Dixon paper have rarely (if ever) used the term "p-value fallacy." The same goes for articles that have cited the Goodman paper. In fact, if you search Google scholar for articles containing the phrase "p-value fallacy," in nearly every hit the phrase appears only in the reference section of the article (as part of a citation of the Goodman paper).
In summary, the "p-value fallacy" (a) is not a term in common enough use to merit mention, (b) is not used consistently even when it is used, as this very wiki article illustrates, and (c) when used as the person who originally "coined" the term intended, is not even really a definitive fallacy and thus does not belong in this wiki article because it constitutes partisan Bayesian POV. It should also be noted that the problems with the "p-value fallacy" section have been mentioned numerous times in the past, going back years (search this talk page to see). It's time to put this silliness to bed once and for all. The section is unnecessary (because the term is fairly obscure), inappropriate (because it contains POV), and confusing (because it can't even agree with itself about the definition of the term it's talking about).
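For editors following the sentence-by-sentence walkthrough above, here is a minimal sketch (not from the article or its sources; the threshold, function name, and p-values are hypothetical) of the "binary classification" reading of the fallacy as stated in Sentence 1: two nearly identical p-values fall on opposite sides of an arbitrary cutoff and get treated as categorically different findings.

```python
# Minimal illustration of dichotomizing results by p-value, the
# "binary classification of hypotheses as true or false" described
# in Sentence 1. All names and numbers here are hypothetical.

ALPHA = 0.05  # conventional, but arbitrary, significance threshold

def dichotomize(p_value):
    """The (mis)use under discussion: collapse a p-value to a yes/no verdict."""
    return "significant" if p_value < ALPHA else "non-significant"

# Hypothetical p-values from two replications of the same experiment:
p_study_a = 0.049
p_study_b = 0.051

print(dichotomize(p_study_a))  # significant
print(dichotomize(p_study_b))  # non-significant
# The evidence in the two studies is nearly identical, yet the binary
# classification reports them as opposite findings.
```

This is offered only to clarify what the disputed section's first sentence describes, not as an endorsement of either side of the sourcing dispute.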
A final note: The main advocate for keeping the section has been the editor Headbomb, who showed similar resistance to removing the COMPLETELY INCORRECT section on the false discovery rate a while back (as shown in this talk page). When challenged to present an argument for keeping the "p-value fallacy" section (scroll up a few paragraphs), Headbomb said simply the following: "The section belongs here. See refs in the section." I hope that I have sufficiently demonstrated here that, after "seeing the refs," it is clearer than ever that the section does NOT belong here. — Preceding unsigned comment added by 23.242.198.189 ( talk) 04:47, 8 September 2019 (UTC)
Shouldn’t there be at least one? — Preceding unsigned comment added by 194.5.225.252 ( talk) 16:01, 2 December 2019 (UTC)
@ 23.242.198.189: You reverted my addition
with the comment "Reverted good faith edit. It isn't clear what "in conflict" means. This seems like a subjective thing, not an objective misconception." In this case "in conflict" means to disagree about the underlying reality or to contradict each other. As far as I understand it, it is not subjective at all, but maybe we can find a wording that is better. Do you (or anyone else) have an idea to more clearly express the misconception? Nuretok ( talk) 14:24, 3 March 2021 (UTC)
References
The redirect P-hunting has been listed at redirects for discussion to determine whether its use and function meets the redirect guidelines. Readers of this page are welcome to comment on this redirect at Wikipedia:Redirects for discussion/Log/2024 April 21 § P-hunting until a consensus is reached. Utopes ( talk / cont) 17:35, 21 April 2024 (UTC)